Hey Michael,
You are right, iterating all deletes with nextClearBit() would
run in O(maxDoc). I am coming from the other direction, where I'm
expecting the number of deletes to be more on the order of 1%-5%
of the doc ID space, so a separate int[] would use lots of heap
and probably not help that much compared with nextClearBit(). My
mental model is that the two most common use-cases are
append-only workloads, where there are no deletes at all, and
update workloads, which would commonly have several percent of
deleted docs. It's not clear to me how common it is to have very
few deletes.
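To put rough numbers on the heap tradeoff (a back-of-the-envelope sketch with assumed sizes, not figures from this thread): a bit set costs one bit per doc in the segment, while an int[] of deleted docIDs costs four bytes per delete, so the int[] only saves memory when well under ~3% of docs are deleted.

```java
// Back-of-the-envelope heap comparison (illustrative only; ignores
// object headers and any compression Lucene might apply).
public class HeapEstimate {
    // A FixedBitSet-style structure: one bit per doc in the segment.
    static long bitSetBytes(int maxDoc) {
        return maxDoc / 8;
    }

    // A plain int[] of deleted docIDs: four bytes per delete.
    static long intArrayBytes(long numDeletes) {
        return 4 * numDeletes;
    }

    public static void main(String[] args) {
        int maxDoc = 32_000_000;
        long deletes = maxDoc / 20; // 5% deletes, the high end mentioned above
        System.out.println("bit set:  " + bitSetBytes(maxDoc) + " bytes");
        System.out.println("int[]:    " + intArrayBytes(deletes) + " bytes");
        // 4 * d > maxDoc / 8 whenever d > maxDoc / 32, i.e. above ~3.1%
        // deletes the int[] is already larger than the bit set.
    }
}
```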
On Tue, Feb 6, 2024 at 7:03 AM Michael Froh <msf...@gmail.com> wrote:
Thanks Adrien!
My thinking with a separate iterator was that nextClearBit()
is relatively expensive (O(maxDoc) to traverse everything, I
think). The solution I was imagining would involve an
index-time change to output, say, an int[] of deleted docIDs
if the number is sufficiently small (like maybe less than
1000). Then the livedocs interface could optionally return a
cheap deleted docs iterator (i.e. only if the number of
deleted docs is less than the threshold). Technically, the
cost would be O(1), since we set a constant bound on the
effort and fail otherwise. :)
I think 1000 doc value lookups would be cheap, but I don't
know if the guarantee is cheap enough to make it into
Weight#count.
That said, I'm going to see if iterating with nextClearBit()
is sufficiently cheap. Hmm... precomputing that int[] for
deleted docIDs on refresh could be an option too.
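A minimal sketch of that precompute-on-refresh idea, with the bounded-effort bailout described above. All names here are hypothetical, and java.util.BitSet stands in for Lucene's liveDocs bits (its nextClearBit() shows the iteration pattern under discussion):

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch: scan live docs once (e.g. on refresh), cache an
// int[] of deleted docIDs, and give up past a fixed threshold so the
// cached structure stays small and the scan effort stays bounded.
public class DeletedDocsCache {
    static final int THRESHOLD = 1000; // the "sufficiently small" bound

    // Returns the deleted docIDs (clear bits in liveDocs), or null if
    // there are more than THRESHOLD, signalling "fall back to the bit set".
    static int[] compute(BitSet liveDocs, int maxDoc) {
        int[] deleted = new int[THRESHOLD];
        int n = 0;
        for (int doc = liveDocs.nextClearBit(0);
             doc < maxDoc;
             doc = liveDocs.nextClearBit(doc + 1)) {
            if (n == THRESHOLD) {
                return null; // too many deletes for the cheap path
            }
            deleted[n++] = doc;
        }
        return Arrays.copyOf(deleted, n);
    }
}
```

With the array precomputed, a "deleted docs iterator" is just a walk over the int[], so each count/aggregation query pays nothing for the scan.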
Thanks again,
Froh
On Fri, Feb 2, 2024 at 11:38 PM Adrien Grand
<jpou...@gmail.com> wrote:
Hi Michael,
Indeed, only MatchAllDocsQuery knows how to produce a
count when there are deletes.
Your idea sounds good to me, do you actually need a side
car iterator for deletes, or could you use a
nextClearBit() operation on the bit set?
I don't think we can fold it into Weight#count since
there is an expectation that it is negligible compared
with the cost of a naive count, but we may be able to do
it in IndexSearcher#count or on the OpenSearch side.
On Fri, Feb 2, 2024, 23:50, Michael Froh
<msf...@gmail.com> wrote:
Hi,
On OpenSearch, we've been taking advantage of the
various O(1) Weight#count() implementations to
quickly compute aggregations without needing
to iterate over all the matching documents (at least
when the top-level query is functionally a match-all
at the segment level). Of course, from what I've
seen, every clever Weight#count()
implementation falls apart (returns -1) in the face
of deletes.
I was thinking that we could still handle small
numbers of deletes efficiently if only we could get a
DocIdSetIterator for deleted docs.
Suppose, for example, you're doing a date histogram
aggregation, you could get the counts for each bucket
from the points tree (ignoring deletes), then iterate
through the deleted docs and decrement their
contribution from the relevant bucket (determined
based on a docvalues lookup). Assuming the number of
deleted docs is small, it should be cheap, right?
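That correction step might look like the following sketch. The names are made up for illustration, and the doc-values lookup is modeled as a plain doc-to-bucket array:

```java
// Illustrative sketch of the decrement idea: start from per-bucket
// counts computed while ignoring deletes (e.g. from the points tree),
// then walk the (assumed small) list of deleted docs, look up each
// doc's bucket via doc values (modeled here as int[] docToBucket),
// and subtract its contribution.
public class HistogramCorrection {
    static long[] correct(long[] bucketCountsIgnoringDeletes,
                          int[] deletedDocs,
                          int[] docToBucket) {
        long[] counts = bucketCountsIgnoringDeletes.clone();
        for (int doc : deletedDocs) {
            counts[docToBucket[doc]]--; // remove the deleted doc's contribution
        }
        return counts;
    }
}
```

The total extra work is one doc-values lookup per deleted doc, which is why this only pays off when the delete count is small.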
The current LiveDocs implementation is just a
FixedBitSet, so AFAIK it's not great for iteration.
I'm imagining adding a supplementary "deleted docs
iterator" that could sit next to the FixedBitSet if
and only if the number of deletes is "small". Is
there a better way that I should be thinking about this?
Thanks,
Froh
--
Adrien