Hi,
A SparseBitset implementation for DELETES would work well if the model in Lucene encoded deleted docs (it did that in earlier times). Since deletes are sparse (in most cases <40% of docs), that encoding would make the iterator cheaper.
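Roughly the shape I mean, as a sketch: SparseFixedBitSet and BitSetIterator are existing Lucene utilities, while encoding deletes (rather than live docs) as the set bits is the hypothetical part.

import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BitSetIterator;
import org.apache.lucene.util.SparseFixedBitSet;

static void iterateDeletes(int maxDoc) throws IOException {
  // Hypothetical encoding: deleted docs as set bits in a sparse structure,
  // instead of live docs as set bits in a dense FixedBitSet.
  SparseFixedBitSet deletedDocs = new SparseFixedBitSet(maxDoc);
  deletedDocs.set(42);
  deletedDocs.set(4711);

  // BitSetIterator walks only the set bits, so the scan skips cheaply
  // over the long runs of live docs in between.
  DocIdSetIterator it = new BitSetIterator(deletedDocs, deletedDocs.approximateCardinality());
  for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
    // handle deleted doc
  }
}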
Uwe
On 06.02.2024 at 09:01, Adrien Grand wrote:
Hey Michael,
You are right, iterating all deletes with nextClearBit() would run in
O(maxDoc). I am coming from the other direction, where I'm expecting
the number of deletes to be more on the order of 1%-5% of the doc ID
space, so a separate int[] would use lots of heap and probably not
help that much compared with nextClearBit(). My mental model is that
the two most common use cases are append-only workloads, where there
are no deletes at all, and update workloads, which commonly have
several percent of deleted docs. It's not clear to me how common it
is to have very few deletes.
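For a rough sense of the heap trade-off (the segment size below is an assumed example, not a number from this thread):

public class DeletesHeapMath {
  public static void main(String[] args) {
    int maxDoc = 10_000_000;                       // assumed segment size
    long bitSetBytes = maxDoc / 8L;                // FixedBitSet: one bit per doc, ~1.2 MB
    long intArrayBytes = (long) (0.05 * maxDoc) * Integer.BYTES; // ~2 MB at 5% deletes
    System.out.printf("bitset=%d bytes, int[]=%d bytes%n", bitSetBytes, intArrayBytes);
    // An int costs 32 bits, so the int[] overtakes the bit set's footprint
    // once deletes exceed maxDoc / 32, i.e. roughly 3% of the doc ID space.
  }
}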
On Tue, Feb 6, 2024 at 7:03 AM Michael Froh <msf...@gmail.com> wrote:
Thanks Adrien!
My thinking with a separate iterator was that nextClearBit() is
relatively expensive (O(maxDoc) to traverse everything, I think).
The solution I was imagining would involve an index-time change to
output, say, an int[] of deleted docIDs if the number is
sufficiently small (like maybe less than 1000). Then the livedocs
interface could optionally return a cheap deleted docs iterator
(i.e. only if the number of deleted docs is less than the
threshold). Technically, the cost would be O(1), since we set a
constant bound on the effort and fail otherwise. :)
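A minimal sketch of what that sidecar iterator could look like (the class itself is hypothetical; DocIdSetIterator is the existing Lucene abstraction):

import org.apache.lucene.search.DocIdSetIterator;

// Hypothetical sidecar iterator over a small, sorted int[] of deleted
// doc IDs, built only when the delete count is under the threshold.
final class DeletedDocsIterator extends DocIdSetIterator {
  private final int[] deletedDocs; // sorted ascending
  private int i = -1;
  private int doc = -1;

  DeletedDocsIterator(int[] deletedDocs) {
    this.deletedDocs = deletedDocs;
  }

  @Override
  public int docID() {
    return doc;
  }

  @Override
  public int nextDoc() {
    return doc = (++i < deletedDocs.length) ? deletedDocs[i] : NO_MORE_DOCS;
  }

  @Override
  public int advance(int target) {
    // Linear scan is fine here: the array is capped at ~1000 entries.
    while (doc < target) {
      nextDoc();
    }
    return doc;
  }

  @Override
  public long cost() {
    return deletedDocs.length;
  }
}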
I think 1000 doc value lookups would be cheap, but I don't know if
the guarantee is cheap enough to make it into Weight#count.
That said, I'm going to see if iterating with nextClearBit() is
sufficiently cheap. Hmm... precomputing that int[] for
deleted docIDs on refresh could be an option too.
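Here's a sketch of that precompute-on-refresh idea (collectDeletes and the threshold are hypothetical; getLiveDocs/maxDoc/numDocs are the existing LeafReader API):

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.Bits;

// One O(maxDoc) pass, paid once per refresh: collect deleted doc IDs into
// an int[] if there are no more than `threshold` of them, else give up.
static int[] collectDeletes(LeafReader reader, int threshold) {
  Bits liveDocs = reader.getLiveDocs();
  if (liveDocs == null) {
    return new int[0]; // segment has no deletes
  }
  int numDeletes = reader.maxDoc() - reader.numDocs();
  if (numDeletes > threshold) {
    return null; // too many deletes; callers fall back to the slow path
  }
  int[] deleted = new int[numDeletes];
  int j = 0;
  for (int doc = 0; doc < reader.maxDoc(); doc++) {
    if (!liveDocs.get(doc)) {
      deleted[j++] = doc;
    }
  }
  return deleted;
}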
Thanks again,
Froh
On Fri, Feb 2, 2024 at 11:38 PM Adrien Grand <jpou...@gmail.com>
wrote:
Hi Michael,
Indeed, only MatchAllDocsQuery knows how to produce a count
when there are deletes.
Your idea sounds good to me. Do you actually need a sidecar
iterator for deletes, or could you use a nextClearBit()
operation on the bit set?
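For illustration, the loop shape with a nextClearBit() operation (shown on java.util.BitSet, which has that method; the same pattern would apply to live docs given an equivalent operation):

import java.util.BitSet;

// Scan for clear bits (deleted docs); each call resumes after the
// previous hit, so the whole loop is a single O(maxDoc) pass.
static void forEachDeleted(BitSet liveDocs, int maxDoc) {
  for (int doc = liveDocs.nextClearBit(0);
      doc < maxDoc;
      doc = liveDocs.nextClearBit(doc + 1)) {
    // handle deleted doc
  }
}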
I don't think we can fold it into Weight#count, since there is
an expectation that its cost is negligible compared with a
naive count, but we may be able to do it in
IndexSearcher#count or on the OpenSearch side.
On Fri, Feb 2, 2024 at 11:50 PM, Michael Froh <msf...@gmail.com>
wrote:
Hi,
On OpenSearch, we've been taking advantage of the O(1)
Weight#count() implementations to quickly compute
various aggregations without needing to iterate over all
the matching documents (at least when the top-level query
is functionally a match-all at the segment level). Of
course, from what I've seen, every clever Weight#count()
implementation falls apart (returns -1) in the face of
deletes.
I was thinking that we could still handle small numbers of
deletes efficiently if only we could get a
DocIdSetIterator for deleted docs.
Suppose, for example, you're doing a date histogram
aggregation: you could get the counts for each bucket from
the points tree (ignoring deletes), then iterate through the
deleted docs and decrement their contribution from the
relevant bucket (determined via a docvalues lookup). Assuming
the number of deleted docs is small, it should be cheap,
right?
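Concretely, the correction pass I'm picturing looks something like this (the "timestamp" field and bucketFor() are made up for illustration; the docvalues calls are the existing Lucene API):

import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

// Bucket counts come from the points tree, ignoring deletes; then walk
// the (small) set of deleted docs and subtract each from its bucket.
static void subtractDeletes(LeafReader reader, DocIdSetIterator deletedDocs,
    long[] bucketCounts) throws IOException {
  NumericDocValues timestamps = reader.getNumericDocValues("timestamp");
  if (timestamps == null) {
    return; // field has no doc values in this segment
  }
  for (int doc = deletedDocs.nextDoc();
      doc != DocIdSetIterator.NO_MORE_DOCS;
      doc = deletedDocs.nextDoc()) {
    if (timestamps.advanceExact(doc)) {
      bucketCounts[bucketFor(timestamps.longValue())]--;
    }
  }
}

static int bucketFor(long epochMillis) {
  return (int) (epochMillis / 86_400_000L); // e.g. daily buckets
}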
The current LiveDocs implementation is just a FixedBitSet,
so AFAIK it's not great for iteration. I'm imagining
adding a supplementary "deleted docs iterator" that could
sit next to the FixedBitSet if and only if the number of
deletes is "small". Is there a better way that I should be
thinking about this?
Thanks,
Froh
--
Adrien
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de