Hi, I was recently poking around in the createWeight implementation for MultiTermQueryConstantScoreWrapper to get to the bottom of some slow queries, and I realized that the worst-case performance can be pretty bad, but is (maybe) possible to optimize.
Imagine we have a segment with N docs and our MultiTermQuery expands to hit M terms, where each of the M terms matches N-1 docs. (If we matched all N docs, then Greg Miller's recent optimization to replace the MultiTermQuery with a TermQuery would kick in.) In this case, my understanding is that we iterate through all M terms and pass each one's postings to a DocIdSetBuilder, which walks the postings to set bits. This whole thing would be O(MN), I think.

I was thinking it would be cool if the DocIdSetBuilder could detect long runs of set bits and advance() each DISI to skip over them, since docs in those runs are guaranteed not to contribute anything new to the union. In the worst case described above, I think that would make the whole thing O(M log N), assuming advance() takes log time.

At the risk of overcomplicating things, maybe DocIdSetBuilder could use a third ("dense") BulkAdder implementation that kicks in once enough bits are set, and efficiently implements the "or" operation by skipping over known (sufficiently long) runs of set bits?

Would something like that be useful? Is the "dense union of doc IDs" case common enough to warrant it?

Thanks,
Froh
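
P.S. To make the idea concrete, here's a rough, untested sketch of the skipping logic, assuming a plain FixedBitSet as the backing structure. The names (denseOr, nextClearBit) are made up for illustration and aren't part of the real DocIdSetBuilder API:

```java
import java.io.IOException;

import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.FixedBitSet;

final class DenseOrSketch {

  /**
   * Union disi's docs into bits, but when the current doc falls inside a run
   * of already-set bits, jump past the end of the run with a single advance()
   * instead of walking the postings doc by doc.
   */
  static void denseOr(FixedBitSet bits, DocIdSetIterator disi) throws IOException {
    int doc = disi.nextDoc();
    while (doc != DocIdSetIterator.NO_MORE_DOCS) {
      if (bits.get(doc)) {
        // doc is inside a run of set bits; none of these docs can add
        // anything new to the union, so skip straight past the run.
        int end = nextClearBit(bits, doc);
        if (end == DocIdSetIterator.NO_MORE_DOCS) {
          return; // every remaining doc ID is already set
        }
        doc = disi.advance(end); // one (hopefully log-time) jump per run
      } else {
        bits.set(doc);
        doc = disi.nextDoc();
      }
    }
  }

  // FixedBitSet has nextSetBit() but (as far as I know) no nextClearBit(),
  // so this is a naive stand-in; a real version would scan word-at-a-time.
  private static int nextClearBit(FixedBitSet bits, int from) {
    for (int i = from; i < bits.length(); i++) {
      if (!bits.get(i)) {
        return i;
      }
    }
    return DocIdSetIterator.NO_MORE_DOCS;
  }
}
```

One caveat: the naive nextClearBit above costs O(run length) per skip, so for the asymptotics to actually work out, a real implementation would want a word-wise scan (or to track run boundaries directly). The real win either way is avoiding the per-doc postings decoding inside long runs.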