Hi,

I was recently poking around in the createWeight implementation for
MultiTermQueryConstantScoreWrapper to get to the bottom of some slow
queries, and I realized that the worst-case performance could be pretty
bad, but (maybe) possible to optimize for.

Imagine if we have a segment with N docs and our MultiTermQuery expands to
hit M terms, where each of the M terms matches N-1 docs. (If we matched all
N docs, then Greg Miller's recent optimization to replace the
MultiTermQuery with a TermQuery would kick in.) In this case, my
understanding is that we would iterate through all M terms and pass each
one's postings to a DocIdSetBuilder, which would iterate through the
postings to set bits. The whole thing would be O(MN), I think.
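
To make that concrete, here's roughly the loop shape I have in mind (a
simplified sketch of my mental model, not the actual wrapper code; the
TermsEnum here just stands in for the M matching terms):

import java.io.IOException;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.util.DocIdSetBuilder;

class RewriteSketch {
  // Simplified: each matching term's postings are handed to the
  // DocIdSetBuilder, which walks them doc by doc, so the dense case
  // costs roughly M * N individual adds.
  static DocIdSet unionOfPostings(TermsEnum termsEnum, int maxDoc) throws IOException {
    DocIdSetBuilder builder = new DocIdSetBuilder(maxDoc);
    PostingsEnum postings = null;
    while (termsEnum.next() != null) {              // M matching terms
      postings = termsEnum.postings(postings, PostingsEnum.NONE);
      builder.add(postings);                        // ~N docs per term
    }
    return builder.build();
  }
}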

I was thinking that it would be cool if the DocIdSetBuilder could detect
long runs of set bits and advance() each DISI to skip over them (since
they're guaranteed not to contribute anything new to the union). In the
worst case that I described above, I think it would make the whole thing
O(M log N) (assuming advance() takes log time).
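
As a very rough sketch of what I mean (assuming we've already upgraded to
a FixedBitSet; none of these names exist in Lucene, and the clear-bit scan
is just for illustration):

import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.FixedBitSet;

class DenseOrSketch {
  // OR `disi` into `bits`, but whenever we land on an already-set bit,
  // find the end of that run of set bits and advance() straight past it,
  // since those docs can't contribute anything new to the union.
  static void orWithRunSkipping(FixedBitSet bits, DocIdSetIterator disi) throws IOException {
    int doc = disi.nextDoc();
    while (doc != DocIdSetIterator.NO_MORE_DOCS) {
      if (bits.get(doc)) {
        int nextClear = nextClearBit(bits, doc);
        if (nextClear == -1) {
          break;                       // everything from here on is already set
        }
        doc = disi.advance(nextClear);
      } else {
        bits.set(doc);
        doc = disi.nextDoc();
      }
    }
  }

  // First clear bit at or after `from`, or -1 if the rest of the set is full.
  private static int nextClearBit(FixedBitSet bits, int from) {
    final long[] words = bits.getBits();
    int i = from >> 6;
    long word = ~words[i] >>> from;    // clear bits of word i at/after `from`
    if (word != 0) {
      int candidate = from + Long.numberOfTrailingZeros(word);
      return candidate < bits.length() ? candidate : -1;
    }
    for (i++; i < words.length; i++) {
      word = ~words[i];
      if (word != 0) {
        int candidate = (i << 6) + Long.numberOfTrailingZeros(word);
        return candidate < bits.length() ? candidate : -1;
      }
    }
    return -1;
  }
}

In the N-1 case above, once the first term's postings have been consumed,
each later term should boil down to a handful of advance() calls past one
huge run of set bits.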

At the risk of overcomplicating things, maybe DocIdSetBuilder could use a
third ("dense") BulkAdder implementation that kicks in once enough bits are
set and implements the "or" operation by skipping over known (sufficiently
long) runs of set bits?
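
Concretely (hypothetical names again, and glossing over the buffer-to-bitset
upgrade that the builder already does), the switch might look something like:

import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.FixedBitSet;

class DenseAdderSketch {
  private final FixedBitSet bits;
  private final int maxDoc;
  private boolean dense;               // flipped once "enough" bits are set

  DenseAdderSketch(int maxDoc) {
    this.maxDoc = maxDoc;
    this.bits = new FixedBitSet(maxDoc);
  }

  void add(DocIdSetIterator disi) throws IOException {
    if (dense) {
      // Dense mode: skip over known runs of set bits (see previous sketch).
      DenseOrSketch.orWithRunSkipping(bits, disi);
    } else {
      bits.or(disi);                   // plain doc-by-doc union
      // Arbitrary threshold for the sketch; a real heuristic would need tuning
      // (and would want to avoid recomputing cardinality() on every add).
      dense = bits.cardinality() > maxDoc / 2;
    }
  }
}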

Would something like that be useful? Is the "dense union of doc IDs" case
common enough to warrant it?

Thanks,
Froh
