Thanks for the feedback Robert. This approach sounds like a better path to follow. I'll explore it. I agree that we should provide default behavior that is overall best for our users, and not for one specific use-case such as Amazon search :).
Mike- TermInSetQuery used to use seekExact, and now uses seekCeil. We haven't used intersect... yet. Thanks again for the feedback. Cheers, -Greg On Tue, May 9, 2023 at 11:09 AM Michael McCandless < luc...@mikemccandless.com> wrote: > Besides not being able to use the bloom filter, seekCeil is also just more > costly than seekExact since it is essentially both .seekExact and .next in > a single operation. > > Are either of the two approaches using the intersect method of TermsEnum? > It might be faster if the number of terms is over some threshold. > > It would require building an Automaton out of the set of terms, which is > fast with DaciukMihovAutomatonBuilder. Hmm, I think we should rename this > class maybe. I'll open an issue. Naming is the hardest part! > > The Codec can implement this quite efficiently since it can do the > ping-pong skipping Patrick is referring to on a byte-by-byte basis in each > of the sources of Term iteration. > > Mike McCandless > > http://blog.mikemccandless.com > > > On Fri, May 5, 2023 at 9:34 PM Patrick Zhai <zhai7...@gmail.com> wrote: > >> Hi Greg >> IMO I still think the seekCeil is a better solution for the default >> posting format, as it could potentially save time on traversing the FST by >> doing the ping-pong skipping. >> I can see that in the case of using bloom filter the seekExact might be >> better but I'm not sure whether there is a better way than overriding the >> `getTermsEnum`... >> >> Patrick >> >> On Fri, May 5, 2023 at 4:45 PM Greg Miller <gsmil...@gmail.com> wrote: >> >>> Hi folks- >>> >>> Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we >>> rewrote TermInSetQuery to extend MultiTermQuery. With this change, >>> TermInSetQuery can now leverage the various "rewrite methods" available to >>> MultiTermQuery, allowing users to customize the query evaluation strategy >>> (e.g., postings vs. doc values, etc.), which was a nice win. In the >>> benchmarks we ran, we didn't see any performance issues. >>> >>> In anticipation of 9.6 releasing, I've pulled this change into the >>> Lucene snapshot we use for Amazon product search, and started running some >>> additional benchmarks, which have surfaced an interesting issue. One >>> use-case we have for TermInSetQuery creates a term disjunction over a field >>> that's using bloom filtering (i.e., BloomFilterPostingsFormat). Because >>> bloom filtering can only help with seekExact and not seekCeil, we're seeing >>> a performance regression (primarily in red-line QPS). >>> >>> One way I can think to address this is to move back to a seekExact >>> approach when creating the filtered TermsEnum used by MultiTermQuery (for >>> the TermInSetQuery implementation). Because TermInSetQuery can provide all >>> of its terms up-front, we can have a simpler term intersection >>> implementation that relies on seekExact over seekCeil. Here's a quick take >>> on what I'm thinking: >>> https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5. >>> I've tested this internally and confirmed it solves our QPS regression >>> problem. >>> >>> I'm curious if anyone has an objection to moving back to a seekExact >>> term intersection approach for TermInSetQuery, or has alternative ideas. I >>> wonder if I'm overlooking some important factors and focusing too much on >>> this specific case where the bloom filter interaction is hurting >>> performance? It seems like seekCeil could provide benefits in some cases >>> over seekExact by skipping over multiple query terms at a time, so that's a >>> possible consideration. If we solve for the most common cases by default, I >>> suppose advanced users could always override TermInSetQuery#getTermsEnum as >>> necessary (we could take this approach internally for example to work with >>> our bloom filtering if the best default is to leverage seekCeil). I can >>> easily turn my quick solution into a PR, but before I do, I wanted to poll >>> this group for thoughts on the approach or other alternatives I might be >>> overlooking. Thanks in advance! >>> >>> Cheers, >>> -Greg >>> >>