Thanks Patrick. I tend to agree with you for the default behavior. Bloom
filter usage seems like a bit of a less-common case on the surface at least
(e.g., it's expected behavior for query terms to not be present in a given
segment with enough frequency to justify the additional codec layer). A
primary key-like field is sort of the exception here, where a
TermInSetQuery can be useful for allow/block-listing semantics—and where
bloom filtering can be helpful. As an aside, given that TermInSetQuery
already has some semi-special logic for recognizing primary key-like
fields, and with this additional consideration, it makes me wonder if a
special-purpose IDSet query or something might make sense at some point.

For now though, I like the idea of leaving TermInSetQuery as-is, since
users can extend it and change the behavior of getTerms if they really need
to.

Cheers,
-Greg

On Fri, May 5, 2023 at 6:33 PM Patrick Zhai <zhai7...@gmail.com> wrote:

> Hi Greg
> IMO I still think the seekCeil is a better solution for the default
> posting format, as it could potentially save time on traversing the FST by
> doing the ping-pong skipping.
> I can see that in the case of using bloom filter the seekExact might be
> better but I'm not sure whether there is a better way than overriding the
> `getTermsEnum`...
>
> Patrick
>
> On Fri, May 5, 2023 at 4:45 PM Greg Miller <gsmil...@gmail.com> wrote:
>
>> Hi folks-
>>
>> Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we
>> rewrote TermInSetQuery to extend MultiTermQuery. With this change,
>> TermInSetQuery can now leverage the various "rewrite methods" available to
>> MultiTermQuery, allowing users to customize the query evaluation strategy
>> (e.g., postings vs. doc values, etc.), which was a nice win. In the
>> benchmarks we ran, we didn't see any performance issues.
>>
>> In anticipation of 9.6 releasing, I've pulled this change into the Lucene
>> snapshot we use for Amazon product search, and started running some
>> additional benchmarks, which have surfaced an interesting issue. One
>> use-case we have for TermInSetQuery creates a term disjunction over a field
>> that's using bloom filtering (i.e., BloomFilterPostingsFormat). Because
>> bloom filtering can only help with seekExact and not seekCeil, we're seeing
>> a performance regression (primarily in red-line QPS).
>>
>> One way I can think to address this is to move back to a seekExact
>> approach when creating the filtered TermsEnum used by MultiTermQuery (for
>> the TermInSetQuery implementation). Because TermInSetQuery can provide all
>> of its terms up-front, we can have a simpler term intersection
>> implementation that relies on seekExact over seekCeil. Here's a quick take
>> on what I'm thinking:
>> https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5.
>> I've tested this internally and confirmed it solves our QPS regression
>> problem.
>>
>> I'm curious if anyone has an objection to moving back to a seekExact term
>> intersection approach for TermInSetQuery, or has alternative ideas. I
>> wonder if I'm overlooking some important factors and focusing too much on
>> this specific case where the bloom filter interaction is hurting
>> performance? It seems like seekCeil could provide benefits in some cases
>> over seekExact by skipping over multiple query terms at a time, so that's a
>> possible consideration. If we solve for the most common cases by default, I
>> suppose advanced users could always override TermInSetQuery#getTermsEnum as
>> necessary (we could take this approach internally for example to work with
>> our bloom filtering if the best default is to leverage seekCeil). I can
>> easily turn my quick solution into a PR, but before I do, I wanted to poll
>> this group for thoughts on the approach or other alternatives I might be
>> overlooking. Thanks in advance!
>>
>> Cheers,
>> -Greg
>>
>

Reply via email to