Re: Slow DV equivalent of TermInSetQuery

Robert Muir Tue, 26 Oct 2021 15:21:15 -0700

Well, if we use MultiTermQuery + DocValuesRewriteMethod to implement
this, as I suggest, then the choice is yours: just run it against a
"slow IndexReader" and go through the ordinal map if you choose.
There's nothing stopping you from doing that, and it will already do
what you want.


I just personally don't recommend it for this case. As the number of
documents increases, the ordinal map indirection probably costs more
than the construction cost is worth. It's a better tradeoff to simply
work per-segment with no indirection. The number of lookupOrds is
bounded in a simple way, unlike faceting, where I would recommend the
ordinal map.
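
A minimal, Lucene-free sketch of the per-segment approach described
above, under stated assumptions: the Segment class and matchSegment
method are hypothetical stand-ins, not Lucene APIs, and plain String
ordering replaces the byte ordering Lucene uses for terms. It only
illustrates that each segment resolves the query terms against its own
sorted dictionary (at most one lookup per query term) and then compares
segment-local ordinals, with no top-level ordinal map involved.

```java
import java.util.Arrays;
import java.util.BitSet;

final class PerSegmentInSet {

    /** Toy model of one segment (hypothetical, not a Lucene class). */
    static final class Segment {
        final String[] sortedTerms; // segment-local ordinal -> term, sorted ascending
        final int[] docToOrd;       // docID -> ordinal, or -1 if the doc has no value

        Segment(String[] sortedTerms, int[] docToOrd) {
            this.sortedTerms = sortedTerms;
            this.docToOrd = docToOrd;
        }
    }

    /** Returns the segment-local docIDs whose value is one of queryTerms. */
    static BitSet matchSegment(Segment seg, String[] queryTerms) {
        // Resolve each query term to a segment-local ordinal.
        // This costs at most queryTerms.length dictionary lookups per segment.
        BitSet acceptedOrds = new BitSet(seg.sortedTerms.length);
        for (String term : queryTerms) {
            int ord = Arrays.binarySearch(seg.sortedTerms, term);
            if (ord >= 0) {
                acceptedOrds.set(ord);
            }
        }
        // Scan the documents, comparing ordinals only (no term bytes touched).
        BitSet hits = new BitSet(seg.docToOrd.length);
        for (int doc = 0; doc < seg.docToOrd.length; doc++) {
            int ord = seg.docToOrd[doc];
            if (ord >= 0 && acceptedOrds.get(ord)) {
                hits.set(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Segment seg0 = new Segment(new String[] {"A1", "B2", "C3"}, new int[] {2, 0, -1, 1});
        Segment seg1 = new Segment(new String[] {"B2", "D4"}, new int[] {1, 0, 0});
        String[] query = {"B2", "C3", "Z9"};
        System.out.println(matchSegment(seg0, query)); // {0, 3}
        System.out.println(matchSegment(seg1, query)); // {1, 2}
    }
}
```

Note that "B2" has ordinal 1 in the first toy segment and ordinal 0 in
the second; each segment is matched independently, so that difference
never has to be reconciled.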


On Tue, Oct 26, 2021 at 6:10 PM Joel Bernstein <[email protected]> wrote:
>
> There are times, particularly in ecommerce and access control, where speed 
> really matters. So, you build stuff that's really fast at query time, with a 
> tradeoff at commit time.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Tue, Oct 26, 2021 at 5:31 PM Robert Muir <[email protected]> wrote:
>>
>> Sorry, I don't think there is a need to use any top-level ordinals.
>> None of these docvalues-based query implementations need them.
>>
>> As for a query intersecting an InputStream, that is a big no-go.
>> Lucene Queries need to have correct hashCode/equals/etc.
>>
>> That's why current stuff around this, such as TermInSetQuery, encodes
>> everything into a PrefixCodedTerms.
>>
>> On Tue, Oct 26, 2021 at 4:57 PM Joel Bernstein <[email protected]> wrote:
>> >
>> > One more wrinkle for extremely large lists is to pass the list in as an
>> > InputStream holding a presorted binary representation of the ASINs, then
>> > slide a BytesRef across the stream and merge it with the SortedDocValues.
>> > This saves all the object creation and String overhead for really long
>> > lists of IDs.
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> >
>> > On Tue, Oct 26, 2021 at 4:50 PM Joel Bernstein <[email protected]> wrote:
>> >>
>> >> If the list of ASINs is presorted, you can quickly merge it with the
>> >> SortedDocValues and produce a FixedBitSet of the top-level ordinals,
>> >> which can be used as the post filter. This is a nice approach for things
>> >> like passing in a long list of access control predicates.
>> >>
>> >>
>> >> Joel Bernstein
>> >> http://joelsolr.blogspot.com/
>> >>
>> >>
>> >> On Tue, Oct 26, 2021 at 3:52 PM Adrien Grand <[email protected]> wrote:
>> >>>
>> >>> I opened https://issues.apache.org/jira/browse/LUCENE-10207 about these 
>> >>> ideas.
>> >>>
>> >>> On Tue, Oct 26, 2021 at 7:52 PM Robert Muir <[email protected]> wrote:
>> >>>>
>> >>>> On Tue, Oct 26, 2021 at 1:37 PM Adrien Grand <[email protected]> wrote:
>> >>>> >
>> >>>> > > And then we could make an IndexOrDocValuesQuery with both the 
>> >>>> > > TermInSetQuery and this SDV.newSlowInSetQuery?
>> >>>> >
>> >>>> > Unfortunately IndexOrDocValuesQuery relies on the fact that the 
>> >>>> > "index" query can evaluate its cost (ScorerSupplier#cost) without 
>> >>>> > doing anything costly, which isn't the case for TermInSetQuery.
>> >>>> >
>> >>>> > So we'd need to make some changes. Estimating the cost of a 
>> >>>> > TermInSetQuery in general without seeking the terms is a hard 
>> >>>> > problem, but maybe we could specialize the unique key case to return 
>> >>>> > the number of terms as the cost?
>> >>>>
>> >>>> Yes, we know each term in the terms dict has only a single document
>> >>>> when terms.size() == terms.getSumDocFreq(): there's only one posting
>> >>>> for each term.
>> >>>> But we can probably generalize the cost estimation a bit more, just
>> >>>> based on these two stats?
>> >>>>
>> >>>> ---------------------------------------------------------------------
>> >>>> To unsubscribe, e-mail: [email protected]
>> >>>> For additional commands, e-mail: [email protected]
>> >>>>
>> >>>
>> >>>
>> >>> --
>> >>> Adrien
>>
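
The cost estimation idea from the quoted exchange (the unique-key check
terms.size() == terms.getSumDocFreq(), and generalizing from those two
stats) could look roughly like the sketch below. Everything here is an
assumption for illustration: the class and method names are
hypothetical, the plain long parameters stand in for the values Lucene's
Terms#size() and Terms#getSumDocFreq() would report, and the
"average postings per term" generalization is one possible reading of
the suggestion, not something the thread settles on.

```java
final class InSetCostEstimate {

    /**
     * Rough upper-bound estimate of how many postings a term-set query could visit.
     *
     * @param termCount      distinct terms in the field (may be -1 if unknown)
     * @param sumDocFreq     total number of postings in the field
     * @param queryTermCount number of terms in the query's set
     */
    static long estimateCost(long termCount, long sumDocFreq, long queryTermCount) {
        if (sumDocFreq <= 0 || queryTermCount <= 0) {
            return 0; // empty field or empty query: nothing to visit
        }
        if (termCount <= 0) {
            return sumDocFreq; // term count unknown: fall back to the whole field
        }
        if (termCount == sumDocFreq) {
            // Unique-key case: exactly one posting per term, so the cost is
            // the number of query terms (capped by the terms that exist).
            return Math.min(queryTermCount, termCount);
        }
        // Assumed generalization: charge each query term the average
        // number of postings per term, rounded up, capped by the field total.
        long avgPostingsPerTerm = (sumDocFreq + termCount - 1) / termCount;
        return Math.min(sumDocFreq, queryTermCount * avgPostingsPerTerm);
    }

    public static void main(String[] args) {
        System.out.println(estimateCost(1000, 1000, 50)); // unique key: 50
        System.out.println(estimateCost(100, 1000, 5));   // about 10 postings per term: 50
        System.out.println(estimateCost(100, 1000, 500)); // capped at sumDocFreq: 1000
    }
}
```

Both branches use only the two field-level statistics, so no terms need
to be seeked to produce the estimate. An average can badly underestimate
skewed fields, which is one reason to treat this as a starting point
only.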
