Re: Dense union of doc IDs

Greg Miller Fri, 04 Nov 2022 07:54:12 -0700

+1 to exploring this idea. A couple additional thoughts...

I can imagine real world use cases that would benefit from this beyond the
super pathological N-1 case. With dense terms, I can believe that the
disjunction would start to accumulate "blocks" of documents that all match
as the bitset gets populated. As that starts to happen, it could be very
beneficial to `advance` other term postings beyond the "block." For
example, in an ecommerce search engine, a disjunction of "product types"
would likely exhibit this behavior (where some product types are likely
pretty dense).


Also, it would be nice to try this same idea in TermInSetQuery. It's very
similar, but still has its own implementation.

Thanks for raising this idea Froh! I'd be excited to see what you come up
with, and may have a use-case in mind to benchmark with if you end up with
a patch to test.

Cheers,
-Greg

On Fri, Nov 4, 2022 at 6:46 AM Michael Sokolov <[email protected]> wrote:

> It sounds like a lot of complexity to handle an unusual edge case, but
> ... I guess this actually happened? Can you give any sense of the
> end-user behavior that caused it?
>
> On Fri, Nov 4, 2022 at 2:26 AM Patrick Zhai <[email protected]> wrote:
> >
> > Hi Froh,
> >
> > The idea sounds reasonable to me, altho I wonder whether using
> CONSTANT_SCORE_BOOLEAN_REWRITE would help with your case since that dense
> union case should be already handled by disjunction query I suppose?
> > But that boolean rewrite is subject to max clause limit so it may have
> some other problems depending on the use case I guess.
> >
> > Best
> > Patrick
> >
> >
> > On Thu, Nov 3, 2022 at 5:15 PM Michael Froh <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> I was recently poking around in the createWeight implementation for
> MultiTermQueryConstantScoreWrapper to get to the bottom of some slow
> queries, and I realized that the worst-case performance could be pretty
> bad, but (maybe) possible to optimize for.
> >>
> >> Imagine if we have a segment with N docs and our MultiTermQuery expands
> to hit M terms, where each of the M terms matches N-1 docs. (If we matched
> all N docs, then Greg Miller's recent optimization to replace the
> MultiTermQuery with a TermQuery would kick in.) In this case, my
> understanding is that we would iterate through all the terms and pass each
> one's postings to a DocIdSetBuilder, which will iterate through the
> postings to set bits. This whole thing would be O(MN), I think.
> >>
> >> I was thinking that it would be cool if the DocIdSetBuilder could
> detect long runs of set bits and advance() each DISI to skip over them
> (since they're guaranteed not to contribute anything new to the union). In
> the worst case that I described above, I think it would make the whole
> thing O(M log N) (assuming advance() takes log time).
> >>
> >> At the risk of overcomplicating things, maybe DocIdSetBuilder could use
> a third ("dense") BulkAdder implementation that kicks in once enough bits
> are set, to efficiently implement the "or" operation to skip over known
> (sufficiently long) runs of set bits?
> >>
> >> Would something like that be useful? Is the "dense union of doc IDs"
> case common enough to warrant it?
> >>
> >> Thanks,
> >> Froh
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Re: Dense union of doc IDs

Reply via email to