+1. If we had a new BulkAdder that could detect long runs of set bits, could it 
also be used in LRUQueryCache to cache mostly-dense doc sets, instead of always 
building a huge BitSet sized by maxDoc?
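
Just to make the idea concrete, here is a rough, untested sketch of the kind of 
run-skipping I think is being proposed. The class and the nextClearBit helper are 
made-up names (FixedBitSet doesn't expose a nextClearBit today, so the helper 
scans the backing words directly):

    import java.io.IOException;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.FixedBitSet;

    public final class DenseOrSketch {

      /** OR 'disi' into 'acc', skipping over runs of bits that are already set. */
      static void denseOr(FixedBitSet acc, DocIdSetIterator disi) throws IOException {
        int doc = disi.nextDoc();
        while (doc != DocIdSetIterator.NO_MORE_DOCS) {
          if (acc.get(doc)) {
            // This doc (and possibly a long run after it) is already covered:
            // find the end of the run and advance() the iterator past it.
            int nextClear = nextClearBit(acc, doc);
            if (nextClear >= acc.length()) {
              break; // every remaining doc is already set
            }
            doc = disi.advance(nextClear);
          } else {
            acc.set(doc);
            doc = disi.nextDoc();
          }
        }
      }

      /**
       * Hypothetical helper: first clear bit at or after 'from'. FixedBitSet has
       * no nextClearBit method today, so this scans the backing long[] words.
       */
      static int nextClearBit(FixedBitSet bits, int from) {
        final long[] words = bits.getBits();
        int i = from >> 6;
        long word = ~words[i] & (-1L << (from & 63)); // ignore bits below 'from'
        while (true) {
          if (word != 0) {
            return Math.min((i << 6) + Long.numberOfTrailingZeros(word), bits.length());
          }
          if (++i == words.length) {
            return bits.length();
          }
          word = ~words[i];
        }
      }
    }

With M iterators that are each nearly full, most loop iterations would take the 
advance() branch, so each iterator gets visited only a handful of times instead 
of ~N times.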

Xugang

https://www.amazingkoala.com.cn




> On Nov 4, 2022, at 08:15, Michael Froh <msf...@gmail.com> wrote:
> 
> Hi,
> 
> I was recently poking around in the createWeight implementation for 
> MultiTermQueryConstantScoreWrapper to get to the bottom of some slow queries, 
> and I realized that the worst-case performance could be pretty bad, but 
> (maybe) possible to optimize for.
> 
> Imagine if we have a segment with N docs and our MultiTermQuery expands to 
> hit M terms, where each of the M terms matches N-1 docs. (If we matched all N 
> docs, then Greg Miller's recent optimization to replace the MultiTermQuery 
> with a TermQuery would kick in.) In this case, my understanding is that we 
> would iterate through all the terms and pass each one's postings to a 
> DocIdSetBuilder, which will iterate through the postings to set bits. This 
> whole thing would be O(MN), I think.
> 
> I was thinking that it would be cool if the DocIdSetBuilder could detect long 
> runs of set bits and advance() each DISI to skip over them (since they're 
> guaranteed not to contribute anything new to the union). In the worst case 
> that I described above, I think it would make the whole thing O(M log N) 
> (assuming advance() takes log time). 
> 
> At the risk of overcomplicating things, maybe DocIdSetBuilder could use a 
> third ("dense") BulkAdder implementation that kicks in once enough bits are 
> set, to efficiently implement the "or" operation to skip over known 
> (sufficiently long) runs of set bits?
> 
> Would something like that be useful? Is the "dense union of doc IDs" case 
> common enough to warrant it?
> 
> Thanks,
> Froh
