Thanks for the feedback Robert. This approach sounds like a better path to
follow. I'll explore it. I agree that we should provide default behavior
that is overall best for our users, and not for one specific use-case such
as Amazon search :).

Mike- TermInSetQuery used to use seekExact, and now uses seekCeil. We
haven't used intersect... yet.

Thanks again for the feedback.

Cheers,
-Greg

On Tue, May 9, 2023 at 11:09 AM Michael McCandless <
luc...@mikemccandless.com> wrote:

> Besides not being able to use the bloom filter, seekCeil is also just more
> costly than seekExact since it is essentially both .seekExact and .next in
> a single operation.
>
> Are either of the two approaches using the intersect method of TermsEnum?
> It might be faster if the number of terms is over some threshold.
>
> It would require building an Automaton out of the set of terms, which is
> fast with DaciukMihovAutomatonBuilder.  Hmm, I think we should rename this
> class maybe.  I'll open an issue.  Naming is the hardest part!
>
> The Codec can implement this quite efficiently since it can do the
> ping-pong skipping Patrick is referring to on a byte-by-byte basis in each
> of the sources of Term iteration.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, May 5, 2023 at 9:34 PM Patrick Zhai <zhai7...@gmail.com> wrote:
>
>> Hi Greg
>> IMO I still think the seekCeil is a better solution for the default
>> posting format, as it could potentially save time on traversing the FST by
>> doing the ping-pong skipping.
>> I can see that in the case of using bloom filter the seekExact might be
>> better but I'm not sure whether there is a better way than overriding the
>> `getTermsEnum`...
>>
>> Patrick
>>
>> On Fri, May 5, 2023 at 4:45 PM Greg Miller <gsmil...@gmail.com> wrote:
>>
>>> Hi folks-
>>>
>>> Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we
>>> rewrote TermInSetQuery to extend MultiTermQuery. With this change,
>>> TermInSetQuery can now leverage the various "rewrite methods" available to
>>> MultiTermQuery, allowing users to customize the query evaluation strategy
>>> (e.g., postings vs. doc values, etc.), which was a nice win. In the
>>> benchmarks we ran, we didn't see any performance issues.
>>>
>>> In anticipation of 9.6 releasing, I've pulled this change into the
>>> Lucene snapshot we use for Amazon product search, and started running some
>>> additional benchmarks, which have surfaced an interesting issue. One
>>> use-case we have for TermInSetQuery creates a term disjunction over a field
>>> that's using bloom filtering (i.e., BloomFilterPostingsFormat). Because
>>> bloom filtering can only help with seekExact and not seekCeil, we're seeing
>>> a performance regression (primarily in red-line QPS).
>>>
>>> One way I can think to address this is to move back to a seekExact
>>> approach when creating the filtered TermsEnum used by MultiTermQuery (for
>>> the TermInSetQuery implementation). Because TermInSetQuery can provide all
>>> of its terms up-front, we can have a simpler term intersection
>>> implementation that relies on seekExact over seekCeil. Here's a quick take
>>> on what I'm thinking:
>>> https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5.
>>> I've tested this internally and confirmed it solves our QPS regression
>>> problem.
>>>
>>> I'm curious if anyone has an objection to moving back to a seekExact
>>> term intersection approach for TermInSetQuery, or has alternative ideas. I
>>> wonder if I'm overlooking some important factors and focusing too much on
>>> this specific case where the bloom filter interaction is hurting
>>> performance? It seems like seekCeil could provide benefits in some cases
>>> over seekExact by skipping over multiple query terms at a time, so that's a
>>> possible consideration. If we solve for the most common cases by default, I
>>> suppose advanced users could always override TermInSetQuery#getTermsEnum as
>>> necessary (we could take this approach internally for example to work with
>>> our bloom filtering if the best default is to leverage seekCeil). I can
>>> easily turn my quick solution into a PR, but before I do, I wanted to poll
>>> this group for thoughts on the approach or other alternatives I might be
>>> overlooking. Thanks in advance!
>>>
>>> Cheers,
>>> -Greg
>>>
>>

Reply via email to