Good point, I opened an issue to discuss this:
https://github.com/apache/lucene/issues/13084.

Did we actually use a sparse bit set to encode deleted docs before? I don't
recall that.

On Tue, Feb 6, 2024 at 2:42 PM Uwe Schindler <u...@thetaphi.de> wrote:

> Hi,
>
> A SparseBitset impl for DELETES would be fine if the model in Lucene would
> encode deleted docs (it did that in earlier times). As deletes are sparse
> (deletes are in most cases <40%), this would help to make the iterator
> cheaper.
> Uwe
>
> Am 06.02.2024 um 09:01 schrieb Adrien Grand:
>
> Hey Michael,
>
> You are right, iterating all deletes with nextClearBit() would run in
> O(maxDoc). I am coming from the other direction, where I'm expecting the
> number of deletes to be more in the order of 1%-5% of the doc ID space, so
> a separate int[] would use lots of heap and probably not help that much
> compared with nextClearBit(). My mental model is that the two most common
> use-cases are append-only workloads, where there are no deletes at all, and
> update workloads, which would commonly have several percents of deleted
> docs. It's not clear to me how common it is to have very few deletes.
>
> On Tue, Feb 6, 2024 at 7:03 AM Michael Froh <msf...@gmail.com> wrote:
>
>> Thanks Adrien!
>>
>> My thinking with a separate iterator was that nextClearBit() is
>> relatively expensive (O(maxDoc) to traverse everything, I think). The
>> solution I was imagining would involve an index-time change to output, say,
>> an int[] of deleted docIDs if the number is sufficiently small (like maybe
>> less than 1000). Then the livedocs interface could optionally return a
>> cheap deleted docs iterator (i.e. only if the number of deleted docs is
>> less than the threshold). Technically, the cost would be O(1), since we set
>> a constant bound on the effort and fail otherwise. :)
>>
>> I think 1000 doc value lookups would be cheap, but I don't know if the
>> guarantee is cheap enough to make it into Weight#count.
>>
>> That said, I'm going to see if iterating with nextClearBit() is
>> sufficiently cheap. Hmm... precomputing that int[] for deleted docIDs on
>> refresh could be an option too.
>>
>> Thanks again,
>> Froh
>>
>> On Fri, Feb 2, 2024 at 11:38 PM Adrien Grand <jpou...@gmail.com> wrote:
>>
>>> Hi Michael,
>>>
>>> Indeed, only MatchAllDocsQuery knows how to produce a count when there
>>> are deletes.
>>>
>>> Your idea sounds good to me, do you actually need a side car iterator
>>> for deletes, or could you use a nextClearBit() operation on the bit set?
>>>
>>> I don't think we can fold it into Weight#count since there is an
>>> expectation that it is negligible compared with the cost of a naive count,
>>> but we may be able to do it in IndexSearcher#count or on the OpenSearch
>>> side.
>>>
>>> Le ven. 2 févr. 2024, 23:50, Michael Froh <msf...@gmail.com> a écrit :
>>>
>>>> Hi,
>>>>
>>>> On OpenSearch, we've been taking advantage of the various O(1)
>>>> Weight#count() implementations to quickly compute various aggregations
>>>> without needing to iterate over all the matching documents (at least when
>>>> the top-level query is functionally a match-all at the segment level). Of
>>>> course, from what I've seen, every clever Weight#count()
>>>> implementation falls apart (returns -1) in the face of deletes.
>>>>
>>>> I was thinking that we could still handle small numbers of deletes
>>>> efficiently if only we could get a DocIdSetIterator for deleted docs.
>>>>
>>>> Like suppose you're doing a date histogram aggregation, you could get
>>>> the counts for each bucket from the points tree (ignoring deletes), then
>>>> iterate through the deleted docs and decrement their contribution from the
>>>> relevant bucket (determined based on a docvalues lookup). Assuming the
>>>> number of deleted docs is small, it should be cheap, right?
>>>>
>>>> The current LiveDocs implementation is just a FixedBitSet, so AFAIK
>>>> it's not great for iteration. I'm imagining adding a supplementary "deleted
>>>> docs iterator" that could sit next to the FixedBitSet if and only if the
>>>> number of deletes is "small". Is there a better way that I should be
>>>> thinking about this?
>>>>
>>>> Thanks,
>>>> Froh
>>>>
>>>
>
> --
> Adrien
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremenhttps://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>

-- 
Adrien

Reply via email to