Hi,

A SparseBitset impl for DELETES would be fine if the model in Lucene would encode deleted docs (it did that in earlier times). As deletes are sparse (deletes are in most cases <40%), this would help to make the iterator cheaper.

Uwe

Am 06.02.2024 um 09:01 schrieb Adrien Grand:
Hey Michael,

You are right, iterating all deletes with nextClearBit() would run in O(maxDoc). I am coming from the other direction, where I'm expecting the number of deletes to be more in the order of 1%-5% of the doc ID space, so a separate int[] would use lots of heap and probably not help that much compared with nextClearBit(). My mental model is that the two most common use-cases are append-only workloads, where there are no deletes at all, and update workloads, which would commonly have several percents of deleted docs. It's not clear to me how common it is to have very few deletes.

On Tue, Feb 6, 2024 at 7:03 AM Michael Froh <msf...@gmail.com> wrote:

    Thanks Adrien!

    My thinking with a separate iterator was that nextClearBit() is
    relatively expensive (O(maxDoc) to traverse everything, I think).
    The solution I was imagining would involve an index-time change to
    output, say, an int[] of deleted docIDs if the number is
    sufficiently small (like maybe less than 1000). Then the livedocs
    interface could optionally return a cheap deleted docs iterator
    (i.e. only if the number of deleted docs is less than the
    threshold). Technically, the cost would be O(1), since we set a
    constant bound on the effort and fail otherwise. :)

    I think 1000 doc value lookups would be cheap, but I don't know if
    the guarantee is cheap enough to make it into Weight#count.

    That said, I'm going to see if iterating with nextClearBit() is
    sufficiently cheap. Hmm... precomputing that int[] for
    deleted docIDs on refresh could be an option too.

    Thanks again,
    Froh

    On Fri, Feb 2, 2024 at 11:38 PM Adrien Grand <jpou...@gmail.com>
    wrote:

        Hi Michael,

        Indeed, only MatchAllDocsQuery knows how to produce a count
        when there are deletes.

        Your idea sounds good to me, do you actually need a side car
        iterator for deletes, or could you use a nextClearBit()
        operation on the bit set?

        I don't think we can fold it into Weight#count since there is
        an expectation that it is negligible compared with the cost of
        a naive count, but we may be able to do it in
        IndexSearcher#count or on the OpenSearch side.

        Le ven. 2 févr. 2024, 23:50, Michael Froh <msf...@gmail.com> a
        écrit :

            Hi,

            On OpenSearch, we've been taking advantage of the various
            O(1) Weight#count() implementations to quickly compute
            various aggregations without needing to iterate over all
            the matching documents (at least when the top-level query
            is functionally a match-all at the segment level). Of
            course, from what I've seen, every clever Weight#count()
            implementation falls apart (returns -1) in the face of
            deletes.

            I was thinking that we could still handle small numbers of
            deletes efficiently if only we could get a
            DocIdSetIterator for deleted docs.

            Like suppose you're doing a date histogram aggregation,
            you could get the counts for each bucket from the points
            tree (ignoring deletes), then iterate through the deleted
            docs and decrement their contribution from the relevant
            bucket (determined based on a docvalues lookup). Assuming
            the number of deleted docs is small, it should be cheap,
            right?

            The current LiveDocs implementation is just a FixedBitSet,
            so AFAIK it's not great for iteration. I'm imagining
            adding a supplementary "deleted docs iterator" that could
            sit next to the FixedBitSet if and only if the number of
            deletes is "small". Is there a better way that I should be
            thinking about this?

            Thanks,
            Froh



--
Adrien

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de

Reply via email to