I have been thinking about a similar feature for conjunctions and
negations. When you have many low-cardinality fields, a good way to speed
up queries on them is to configure an index sort on these fields. Sorting
clusters matching documents together, which automatically creates large
gaps in postings between long runs of ones, and Lucene naturally takes
advantage of these gaps to skip over non-matching documents. At that
point, the bottleneck becomes linearly scanning runs of ones until the
first non-matching doc ID of one of the postings is found. We could do
much better if Lucene could give us the next non-matching doc ID directly.
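To make the bottleneck concrete, here is a toy leapfrog intersection over sorted doc ID arrays (a simplification for illustration, not Lucene's actual conjunction code). When both clauses share a long run of consecutive doc IDs, the loop falls into the "both match" branch once per document; that per-document step is the linear scan that a "next non-matching doc ID" API could skip over in one jump.

```java
import java.util.ArrayList;
import java.util.List;

class ToyConjunction {
    // Leapfrog intersection of two sorted doc ID lists. Inside a long
    // run of ones shared by both clauses, we take the "both match"
    // branch once per doc ID: that is the linear scan described above.
    static List<Integer> conjunction(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;            // advance the lagging clause
            } else if (b[j] < a[i]) {
                j++;
            } else {
                out.add(a[i]);  // both match: one step per doc in a run
                i++;
                j++;
            }
        }
        return out;
    }
}
```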

This would also help negations: after advancing the main clause,
Lucene needs to ask the prohibited clause "what is your next non-matching
doc ID on or after this doc ID?" in order to figure out 1. whether the doc
ID of the main clause is a match and 2. what the next target for the main
clause is. When the prohibited clause has a long run of ones, we currently
scan linearly until the next non-matching doc ID.
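As an illustration (again a toy model, not Lucene code), here is a negation loop where the prohibited clause is modeled as a boolean array. `nextNonMatching` answers exactly the question above, and its inner loop is where the linear scan through a run of ones happens today; the answer serves both purposes at once, confirming a match or providing the next advance target.

```java
import java.util.ArrayList;
import java.util.List;

class ToyNegation {
    // "What is your next non-matching doc ID on or after target?"
    // Today this is answered by a linear scan through the run of ones.
    static int nextNonMatching(boolean[] prohibited, int target) {
        int d = target;
        while (d < prohibited.length && prohibited[d]) {
            d++;
        }
        return d;
    }

    // Docs matching the main clause but not the prohibited clause.
    static List<Integer> negate(int[] main, boolean[] prohibited) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < main.length) {
            int doc = main[i];
            int nonMatch = nextNonMatching(prohibited, doc);
            if (nonMatch == doc) {
                out.add(doc);  // 1. doc is not prohibited, so it matches
                i++;
            } else {
                // 2. nonMatch is the next useful target: every main-clause
                // doc below it falls inside the prohibited run of ones.
                while (i < main.length && main[i] < nonMatch) {
                    i++;
                }
            }
        }
        return out;
    }
}
```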

I created a prototype for this by introducing a new
`DocIdSetIterator#peekNextNonMatchingDocID()` method that defaults to
returning `docID() + 1`, and patching postings to return `docID() + 128`
when a postings block consists entirely of ones. To fully take advantage
of this, we'd create a new postings format that specializes the encoding
of long runs of ones.
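A minimal sketch of what that contract could look like (hypothetical toy classes mirroring the names in this thread, not the actual prototype; a real postings implementation would read the run length off the block encoding rather than scanning ahead as this toy does):

```java
abstract class ToyDocIdSetIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    abstract int docID();

    abstract int nextDoc();

    // Conservative default: all we know is that the current doc
    // matches, so the next non-matching doc could be docID() + 1.
    int peekNextNonMatchingDocID() {
        return docID() + 1;
    }
}

class ToyPostings extends ToyDocIdSetIterator {
    private final int[] docs;  // sorted doc IDs
    private int idx = -1;

    ToyPostings(int[] docs) {
        this.docs = docs;
    }

    @Override
    int docID() {
        if (idx < 0) return -1;
        return idx >= docs.length ? NO_MORE_DOCS : docs[idx];
    }

    @Override
    int nextDoc() {
        idx++;
        return docID();
    }

    // Stand-in for the "block of ones" case: report the end of the
    // current run of consecutive doc IDs. A real postings format would
    // know this from the encoding instead of scanning ahead.
    @Override
    int peekNextNonMatchingDocID() {
        int i = idx;
        while (i + 1 < docs.length && docs[i + 1] == docs[i] + 1) {
            i++;
        }
        return docs[i] + 1;
    }
}
```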

In summary, I'm also +1 on exploring this idea. :)

On Fri, Nov 4, 2022 at 3:54 PM Greg Miller <gsmil...@gmail.com> wrote:

> +1 to exploring this idea. A couple additional thoughts...
>
> I can imagine real world use cases that would benefit from this beyond the
> super pathological N-1 case. With dense terms, I can believe that the
> disjunction would start to accumulate "blocks" of documents that all match
> as the bitset gets populated. As that starts to happen, it could be very
> beneficial to `advance` other term postings beyond the "block." For
> example, in an ecommerce search engine, a disjunction of "product types"
> would likely exhibit this behavior (where some product types are likely
> pretty dense).
>
> Also, it would be nice to try this same idea in TermInSetQuery. It's very
> similar, but still has its own implementation.
>
> Thanks for raising this idea Froh! I'd be excited to see what you come up
> with, and may have a use-case in mind to benchmark with if you end up with
> a patch to test.
>
> Cheers,
> -Greg
>
> On Fri, Nov 4, 2022 at 6:46 AM Michael Sokolov <msoko...@gmail.com> wrote:
>
>> It sounds like a lot of complexity to handle an unusual edge case, but
>> ... I guess this actually happened? Can you give any sense of the
>> end-user behavior that caused it?
>>
>> On Fri, Nov 4, 2022 at 2:26 AM Patrick Zhai <zhai7...@gmail.com> wrote:
>> >
>> > Hi Froh,
>> >
>> > The idea sounds reasonable to me, although I wonder whether using
>> CONSTANT_SCORE_BOOLEAN_REWRITE would help with your case, since I suppose
>> the dense union case should already be handled by the disjunction query?
>> > But that boolean rewrite is subject to the max clause limit, so it may
>> have other problems depending on the use case.
>> >
>> > Best
>> > Patrick
>> >
>> >
>> > On Thu, Nov 3, 2022 at 5:15 PM Michael Froh <msf...@gmail.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I was recently poking around in the createWeight implementation for
>> MultiTermQueryConstantScoreWrapper to get to the bottom of some slow
>> queries, and I realized that the worst-case performance could be pretty
>> bad, but (maybe) possible to optimize for.
>> >>
>> >> Imagine if we have a segment with N docs and our MultiTermQuery
>> expands to hit M terms, where each of the M terms matches N-1 docs. (If we
>> matched all N docs, then Greg Miller's recent optimization to replace the
>> MultiTermQuery with a TermQuery would kick in.) In this case, my
>> understanding is that we would iterate through all the terms and pass each
>> one's postings to a DocIdSetBuilder, which will iterate through the
>> postings to set bits. This whole thing would be O(MN), I think.
>> >>
>> >> I was thinking that it would be cool if the DocIdSetBuilder could
>> detect long runs of set bits and advance() each DISI to skip over them
>> (since they're guaranteed not to contribute anything new to the union). In
>> the worst case that I described above, I think it would make the whole
>> thing O(M log N) (assuming advance() takes log time).
>> >>
>> >> At the risk of overcomplicating things, maybe DocIdSetBuilder could
>> use a third ("dense") BulkAdder implementation that kicks in once enough
>> bits are set, to efficiently implement the "or" operation to skip over
>> known (sufficiently long) runs of set bits?
>> >>
>> >> Would something like that be useful? Is the "dense union of doc IDs"
>> case common enough to warrant it?
>> >>
>> >> Thanks,
>> >> Froh
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>

-- 
Adrien
