Thanks Patrick. I tend to agree with you for the default behavior. Bloom filter usage seems like a bit of a less-common case on the surface at least (e.g., it's expected behavior for query terms to not be present in a given segment with enough frequency to justify the additional codec layer). A primary key-like field is sort of the exception here, where a TermInSetQuery can be useful for allow/block-listing semantics—and where bloom filtering can be helpful. As an aside, given that TermInSetQuery already has some semi-special logic for recognizing primary key-like fields, and with this additional consideration, it makes me wonder if a special-purpose IDSet query or something might make sense at some point.
For now though, I like the idea of leaving TermInSetQuery as-is, since users can extend it and change the behavior of getTerms if they really need to. Cheers, -Greg On Fri, May 5, 2023 at 6:33 PM Patrick Zhai <zhai7...@gmail.com> wrote: > Hi Greg > IMO I still think the seekCeil is a better solution for the default > posting format, as it could potentially save time on traversing the FST by > doing the ping-pong skipping. > I can see that in the case of using bloom filter the seekExact might be > better but I'm not sure whether there is a better way than overriding the > `getTermsEnum`... > > Patrick > > On Fri, May 5, 2023 at 4:45 PM Greg Miller <gsmil...@gmail.com> wrote: > >> Hi folks- >> >> Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we >> rewrote TermInSetQuery to extend MultiTermQuery. With this change, >> TermInSetQuery can now leverage the various "rewrite methods" available to >> MultiTermQuery, allowing users to customize the query evaluation strategy >> (e.g., postings vs. doc values, etc.), which was a nice win. In the >> benchmarks we ran, we didn't see any performance issues. >> >> In anticipation of 9.6 releasing, I've pulled this change into the Lucene >> snapshot we use for Amazon product search, and started running some >> additional benchmarks, which have surfaced an interesting issue. One >> use-case we have for TermInSetQuery creates a term disjunction over a field >> that's using bloom filtering (i.e., BloomFilterPostingsFormat). Because >> bloom filtering can only help with seekExact and not seekCeil, we're seeing >> a performance regression (primarily in red-line QPS). >> >> One way I can think to address this is to move back to a seekExact >> approach when creating the filtered TermsEnum used by MultiTermQuery (for >> the TermInSetQuery implementation). Because TermInSetQuery can provide all >> of its terms up-front, we can have a simpler term intersection >> implementation that relies on seekExact over seekCeil. Here's a quick take >> on what I'm thinking: >> https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5. >> I've tested this internally and confirmed it solves our QPS regression >> problem. >> >> I'm curious if anyone has an objection to moving back to a seekExact term >> intersection approach for TermInSetQuery, or has alternative ideas. I >> wonder if I'm overlooking some important factors and focusing too much on >> this specific case where the bloom filter interaction is hurting >> performance? It seems like seekCeil could provide benefits in some cases >> over seekExact by skipping over multiple query terms at a time, so that's a >> possible consideration. If we solve for the most common cases by default, I >> suppose advanced users could always override TermInSetQuery#getTermsEnum as >> necessary (we could take this approach internally for example to work with >> our bloom filtering if the best default is to leverage seekCeil). I can >> easily turn my quick solution into a PR, but before I do, I wanted to poll >> this group for thoughts on the approach or other alternatives I might be >> overlooking. Thanks in advance! >> >> Cheers, >> -Greg >> >