Computing weight.count() cheaply in the face of deletes?

Michael Froh Fri, 02 Feb 2024 14:50:35 -0800

Hi,

On OpenSearch, we've been taking advantage of the various O(1)
Weight#count() implementations to quickly compute various aggregations
without needing to iterate over all the matching documents (at least when
the top-level query is functionally a match-all at the segment level). Of
course, from what I've seen, every clever Weight#count()
implementation falls apart (returns -1) in the face of deletes.


I was thinking that we could still handle small numbers of deletes
efficiently if only we could get a DocIdSetIterator for deleted docs.

For example, suppose you're doing a date histogram aggregation. You could
get the counts for each bucket from the points tree (ignoring deletes), then
iterate through the deleted docs and subtract each one's contribution from
the relevant bucket (determined by a docvalues lookup). Assuming the
number of deleted docs is small, it should be cheap, right?
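
A minimal sketch of that correction step, to make the idea concrete. It is
self-contained and does not use Lucene classes: java.util.BitSet stands in
for the live-docs FixedBitSet (set bit = live doc), the IntUnaryOperator
stands in for the docvalues lookup that maps a doc to its bucket, and the
class and method names are hypothetical.

```java
import java.util.BitSet;
import java.util.function.IntUnaryOperator;

/** Hypothetical helper; not part of Lucene or OpenSearch. */
final class DeleteAdjustedCounts {

    /**
     * Returns a copy of the per-bucket counts with deleted docs subtracted.
     *
     * @param countsIgnoringDeletes counts computed as if nothing were deleted
     * @param liveDocs              stand-in for live docs: set bit = live
     * @param maxDoc                number of docs in the segment
     * @param bucketOfDoc           stand-in for the docvalues lookup; returns
     *                              the bucket index, or -1 if the doc has no value
     */
    static long[] subtractDeletes(long[] countsIgnoringDeletes,
                                  BitSet liveDocs,
                                  int maxDoc,
                                  IntUnaryOperator bucketOfDoc) {
        long[] adjusted = countsIgnoringDeletes.clone();
        // Visit only the deleted docs, i.e. the clear bits below maxDoc.
        for (int doc = liveDocs.nextClearBit(0);
             doc < maxDoc;
             doc = liveDocs.nextClearBit(doc + 1)) {
            int bucket = bucketOfDoc.applyAsInt(doc);
            if (bucket >= 0) {
                adjusted[bucket]--;
            }
        }
        return adjusted;
    }
}
```

Note that finding the clear bits here still scans the whole bitset, so the
cost grows with maxDoc rather than with the number of deletes.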

The current LiveDocs implementation is just a FixedBitSet, so AFAIK it's
not great for iteration. I'm imagining adding a supplementary "deleted docs
iterator" that could sit next to the FixedBitSet if and only if the number
of deletes is "small". Is there a better way that I should be thinking
about this?
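
A rough sketch of what such a supplementary iterator might look like, under
stated assumptions. Everything here is hypothetical and self-contained:
java.util.BitSet stands in for the FixedBitSet (set bit = live doc), the
class SparseDeletedDocs and the maxDeletes threshold are invented for
illustration, and NO_MORE_DOCS only mirrors the sentinel convention of
DocIdSetIterator. The iterator is built only when the number of deletes
does not exceed the threshold.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Optional;

/** Hypothetical companion to a live-docs bitset; not part of Lucene. */
final class SparseDeletedDocs {
    /** Sentinel returned once all deleted docs have been visited. */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] deleted; // deleted doc IDs, ascending
    private int pos = -1;

    private SparseDeletedDocs(int[] deleted) {
        this.deleted = deleted;
    }

    /**
     * Builds the iterator if the segment has at most maxDeletes deleted
     * docs; otherwise returns empty and callers fall back to the bitset.
     */
    static Optional<SparseDeletedDocs> tryBuild(BitSet liveDocs, int maxDoc, int maxDeletes) {
        int[] buf = new int[maxDeletes];
        int n = 0;
        for (int doc = liveDocs.nextClearBit(0);
             doc < maxDoc;
             doc = liveDocs.nextClearBit(doc + 1)) {
            if (n == maxDeletes) {
                return Optional.empty(); // too many deletes to be "small"
            }
            buf[n++] = doc;
        }
        return Optional.of(new SparseDeletedDocs(Arrays.copyOf(buf, n)));
    }

    /** Advances to the next deleted doc, or returns NO_MORE_DOCS. */
    int nextDoc() {
        if (pos < deleted.length) {
            pos++;
        }
        return pos < deleted.length ? deleted[pos] : NO_MORE_DOCS;
    }
}
```

Once built, iterating costs time proportional to the number of deletes
rather than to maxDoc. This sketch pays one full scan at build time.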

Thanks,
Froh
