Re: TermInSetQuery: seekExact vs. seekCeil

2023-05-09 Thread Robert Muir
I remember the benefits from Terms.intersect being pretty huge. Rather than simple ping-pong, the whole monster gets handed off directly to the codec's term dictionary implementation. For the default terms dictionary using blocktree, this saves time seeking to terms you don't care about (because

Re: TermInSetQuery: seekExact vs. seekCeil

2023-05-09 Thread Greg Miller
Thanks for the feedback Robert. This approach sounds like a better path to follow. I'll explore it. I agree that we should provide default behavior that is overall best for our users, and not for one specific use-case such as Amazon search :). Mike- TermInSetQuery used to use seekExact, and now

Re: TermInSetQuery: seekExact vs. seekCeil

2023-05-09 Thread Michael McCandless
Besides not being able to use the bloom filter, seekCeil is also just more costly than seekExact since it is essentially both .seekExact and .next in a single operation. Is either of the two approaches using the intersect method of Terms? It might be faster if the number of terms is over
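
[Editor's illustration, not part of the original message.] The difference between the two seek styles discussed in this thread can be sketched with a toy model. This is not the Lucene API: `ToyTermsEnum`, `match_with_seek_exact`, and `match_with_seek_ceil` are hypothetical names, the term dictionary is just a sorted Python list, and the status strings only mirror the names of Lucene's `TermsEnum.SeekStatus` values. The sketch counts seek calls only; it does not model the per-call cost difference or the bloom filter mentioned above.

```python
from bisect import bisect_left


class ToyTermsEnum:
    """Toy stand-in for a terms enum over one segment's sorted terms."""

    def __init__(self, terms):
        self.terms = sorted(set(terms))
        self.pos = -1   # index of the current term, -1 if unpositioned
        self.seeks = 0  # number of seek calls issued

    def seek_exact(self, target):
        """Return True and position on target only if it is present."""
        self.seeks += 1
        i = bisect_left(self.terms, target)
        if i < len(self.terms) and self.terms[i] == target:
            self.pos = i
            return True
        return False

    def seek_ceil(self, target):
        """Position on the smallest term >= target and report a status."""
        self.seeks += 1
        i = bisect_left(self.terms, target)
        if i == len(self.terms):
            return "END"
        self.pos = i
        return "FOUND" if self.terms[i] == target else "NOT_FOUND"

    def term(self):
        return self.terms[self.pos]


def match_with_seek_exact(segment_terms, query_terms):
    """One seek_exact per query term, regardless of what the segment holds."""
    te = ToyTermsEnum(segment_terms)
    hits = [q for q in sorted(query_terms) if te.seek_exact(q)]
    return hits, te.seeks


def match_with_seek_ceil(segment_terms, query_terms):
    """Ping-pong: after a miss, skip query terms below the enum's current term."""
    te = ToyTermsEnum(segment_terms)
    query = sorted(query_terms)
    hits, i = [], 0
    while i < len(query):
        status = te.seek_ceil(query[i])
        if status == "END":
            break  # no segment term is >= any remaining query term
        if status == "FOUND":
            hits.append(query[i])
            i += 1
        else:
            # Jump to the first remaining query term >= the enum's current term.
            i = bisect_left(query, te.term(), i + 1)
    return hits, te.seeks


segment = ["apple", "kiwi", "pear"]
query = ["banana", "cherry", "date", "kiwi", "zebra"]
print(match_with_seek_exact(segment, query))
print(match_with_seek_ceil(segment, query))
```

In this toy run the ceil-based loop issues fewer seek calls because a miss tells it which query terms can be skipped, which is the ping-pong effect. Whether that wins in practice depends on the relative cost of each call, which this model deliberately leaves out.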

Re: TermInSetQuery: seekExact vs. seekCeil

2023-05-09 Thread Greg Miller
Thanks Patrick. I tend to agree with you for the default behavior. Bloom filter usage seems like a bit of a less-common case on the surface at least (e.g., it's expected behavior for query terms to not be present in a given segment with enough frequency to justify the additional codec layer). A

Re: TermInSetQuery: seekExact vs. seekCeil

2023-05-09 Thread Robert Muir
The better solution is to use Terms.intersect. Then the postings format can do the right thing. But this query doesn't use Terms.intersect today, instead doing the ping-ponging itself. That's the problem. We must *not* tune our algorithms for Amazon's search but instead what is the best for users

Re: TermInSetQuery: seekExact vs. seekCeil

2023-05-05 Thread Patrick Zhai
Hi Greg, IMO seekCeil is still the better solution for the default postings format, as it could potentially save time traversing the FST by doing the ping-pong skipping. I can see that with a bloom filter seekExact might be better, but I'm not sure whether there is a

TermInSetQuery: seekExact vs. seekCeil

2023-05-05 Thread Greg Miller
Hi folks- Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we rewrote TermInSetQuery to extend MultiTermQuery. With this change, TermInSetQuery can now leverage the various "rewrite methods" available to MultiTermQuery, allowing users to customize the query evaluation strategy