Besides not being able to use the bloom filter, seekCeil is also just more costly than seekExact since it is essentially both .seekExact and .next in a single operation.
Are either of the two approaches using the intersect method of TermsEnum? It might be faster if the number of terms is over some threshold. It would require building an Automaton out of the set of terms, which is fast with DaciukMihovAutomatonBuilder. Hmm, I think we should rename this class maybe. I'll open an issue. Naming is the hardest part! The Codec can implement this quite efficiently since it can do the ping-pong skipping Patrick is referring to on a byte-by-byte basis in each of the sources of Term iteration. Mike McCandless http://blog.mikemccandless.com On Fri, May 5, 2023 at 9:34 PM Patrick Zhai <zhai7...@gmail.com> wrote: > Hi Greg > IMO I still think the seekCeil is a better solution for the default > posting format, as it could potentially save time on traversing the FST by > doing the ping-pong skipping. > I can see that in the case of using bloom filter the seekExact might be > better but I'm not sure whether there is a better way than overriding the > `getTermsEnum`... > > Patrick > > On Fri, May 5, 2023 at 4:45 PM Greg Miller <gsmil...@gmail.com> wrote: > >> Hi folks- >> >> Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we >> rewrote TermInSetQuery to extend MultiTermQuery. With this change, >> TermInSetQuery can now leverage the various "rewrite methods" available to >> MultiTermQuery, allowing users to customize the query evaluation strategy >> (e.g., postings vs. doc values, etc.), which was a nice win. In the >> benchmarks we ran, we didn't see any performance issues. >> >> In anticipation of 9.6 releasing, I've pulled this change into the Lucene >> snapshot we use for Amazon product search, and started running some >> additional benchmarks, which have surfaced an interesting issue. One >> use-case we have for TermInSetQuery creates a term disjunction over a field >> that's using bloom filtering (i.e., BloomFilterPostingsFormat). Because >> bloom filtering can only help with seekExact and not seekCeil, we're seeing >> a performance regression (primarily in red-line QPS). >> >> One way I can think to address this is to move back to a seekExact >> approach when creating the filtered TermsEnum used by MultiTermQuery (for >> the TermInSetQuery implementation). Because TermInSetQuery can provide all >> of its terms up-front, we can have a simpler term intersection >> implementation that relies on seekExact over seekCeil. Here's a quick take >> on what I'm thinking: >> https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5. >> I've tested this internally and confirmed it solves our QPS regression >> problem. >> >> I'm curious if anyone has an objection to moving back to a seekExact term >> intersection approach for TermInSetQuery, or has alternative ideas. I >> wonder if I'm overlooking some important factors and focusing too much on >> this specific case where the bloom filter interaction is hurting >> performance? It seems like seekCeil could provide benefits in some cases >> over seekExact by skipping over multiple query terms at a time, so that's a >> possible consideration. If we solve for the most common cases by default, I >> suppose advanced users could always override TermInSetQuery#getTermsEnum as >> necessary (we could take this approach internally for example to work with >> our bloom filtering if the best default is to leverage seekCeil). I can >> easily turn my quick solution into a PR, but before I do, I wanted to poll >> this group for thoughts on the approach or other alternatives I might be >> overlooking. Thanks in advance! >> >> Cheers, >> -Greg >> >