I remember the benefits from Terms.intersect being pretty huge. Rather
than simple ping-pong, the whole monster gets handed off directly to
the codec's term dictionary implementation. For the default terms
dictionary using blocktree, this saves time seeking to terms you don't
care about (because the postingsformat is aware of the blocktree
structure). It is probably worth just prototyping on its own enough,
to see if we can get the benefits. It may turn out, you dont need a
bloom anymore after that.

On Tue, May 9, 2023 at 3:24 PM Greg Miller <gsmil...@gmail.com> wrote:
>
> Thanks for the feedback Robert. This approach sounds like a better path to 
> follow. I'll explore it. I agree that we should provide default behavior that 
> is overall best for our users, and not for one specific use-case such as 
> Amazon search :).
>
> Mike- TermInSetQuery used to use seekExact, and now uses seekCeil. We haven't 
> used intersect... yet.
>
> Thanks again for the feedback.
>
> Cheers,
> -Greg
>
> On Tue, May 9, 2023 at 11:09 AM Michael McCandless 
> <luc...@mikemccandless.com> wrote:
>>
>> Besides not being able to use the bloom filter, seekCeil is also just more 
>> costly than seekExact since it is essentially both .seekExact and .next in a 
>> single operation.
>>
>> Are either of the two approaches using the intersect method of TermsEnum?  
>> It might be faster if the number of terms is over some threshold.
>>
>> It would require building an Automaton out of the set of terms, which is 
>> fast with DaciukMihovAutomatonBuilder.  Hmm, I think we should rename this 
>> class maybe.  I'll open an issue.  Naming is the hardest part!
>>
>> The Codec can implement this quite efficiently since it can do the ping-pong 
>> skipping Patrick is referring to on a byte-by-byte basis in each of the 
>> sources of Term iteration.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, May 5, 2023 at 9:34 PM Patrick Zhai <zhai7...@gmail.com> wrote:
>>>
>>> Hi Greg
>>> IMO I still think the seekCeil is a better solution for the default posting 
>>> format, as it could potentially save time on traversing the FST by doing 
>>> the ping-pong skipping.
>>> I can see that in the case of using bloom filter the seekExact might be 
>>> better but I'm not sure whether there is a better way than overriding the 
>>> `getTermsEnum`...
>>>
>>> Patrick
>>>
>>> On Fri, May 5, 2023 at 4:45 PM Greg Miller <gsmil...@gmail.com> wrote:
>>>>
>>>> Hi folks-
>>>>
>>>> Back in GH#12156 (https://github.com/apache/lucene/pull/12156), we rewrote 
>>>> TermInSetQuery to extend MultiTermQuery. With this change, TermInSetQuery 
>>>> can now leverage the various "rewrite methods" available to 
>>>> MultiTermQuery, allowing users to customize the query evaluation strategy 
>>>> (e.g., postings vs. doc values, etc.), which was a nice win. In the 
>>>> benchmarks we ran, we didn't see any performance issues.
>>>>
>>>> In anticipation of 9.6 releasing, I've pulled this change into the Lucene 
>>>> snapshot we use for Amazon product search, and started running some 
>>>> additional benchmarks, which have surfaced an interesting issue. One 
>>>> use-case we have for TermInSetQuery creates a term disjunction over a 
>>>> field that's using bloom filtering (i.e., BloomFilterPostingsFormat). 
>>>> Because bloom filtering can only help with seekExact and not seekCeil, 
>>>> we're seeing a performance regression (primarily in red-line QPS).
>>>>
>>>> One way I can think to address this is to move back to a seekExact 
>>>> approach when creating the filtered TermsEnum used by MultiTermQuery (for 
>>>> the TermInSetQuery implementation). Because TermInSetQuery can provide all 
>>>> of its terms up-front, we can have a simpler term intersection 
>>>> implementation that relies on seekExact over seekCeil. Here's a quick take 
>>>> on what I'm thinking: 
>>>> https://github.com/gsmiller/lucene/commit/e527c5d9b26ee53826b56b270d7c96db18bfaee5.
>>>>  I've tested this internally and confirmed it solves our QPS regression 
>>>> problem.
>>>>
>>>> I'm curious if anyone has an objection to moving back to a seekExact term 
>>>> intersection approach for TermInSetQuery, or has alternative ideas. I 
>>>> wonder if I'm overlooking some important factors and focusing too much on 
>>>> this specific case where the bloom filter interaction is hurting 
>>>> performance? It seems like seekCeil could provide benefits in some cases 
>>>> over seekExact by skipping over multiple query terms at a time, so that's 
>>>> a possible consideration. If we solve for the most common cases by 
>>>> default, I suppose advanced users could always override 
>>>> TermInSetQuery#getTermsEnum as necessary (we could take this approach 
>>>> internally for example to work with our bloom filtering if the best 
>>>> default is to leverage seekCeil). I can easily turn my quick solution into 
>>>> a PR, but before I do, I wanted to poll this group for thoughts on the 
>>>> approach or other alternatives I might be overlooking. Thanks in advance!
>>>>
>>>> Cheers,
>>>> -Greg

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to