Good point, I opened an issue to discuss this: https://github.com/apache/lucene/issues/13084.
Did we actually use a sparse bit set to encode deleted docs before? I don't recall that. On Tue, Feb 6, 2024 at 2:42 PM Uwe Schindler <u...@thetaphi.de> wrote: > Hi, > > A SparseBitset impl for DELETES would be fine if the model in Lucene would > encode deleted docs (it did that in earlier times). As deletes are sparse > (deletes are in most cases <40%), this would help to make the iterator > cheaper. > Uwe > > Am 06.02.2024 um 09:01 schrieb Adrien Grand: > > Hey Michael, > > You are right, iterating all deletes with nextClearBit() would run in > O(maxDoc). I am coming from the other direction, where I'm expecting the > number of deletes to be more in the order of 1%-5% of the doc ID space, so > a separate int[] would use lots of heap and probably not help that much > compared with nextClearBit(). My mental model is that the two most common > use-cases are append-only workloads, where there are no deletes at all, and > update workloads, which would commonly have several percents of deleted > docs. It's not clear to me how common it is to have very few deletes. > > On Tue, Feb 6, 2024 at 7:03 AM Michael Froh <msf...@gmail.com> wrote: > >> Thanks Adrien! >> >> My thinking with a separate iterator was that nextClearBit() is >> relatively expensive (O(maxDoc) to traverse everything, I think). The >> solution I was imagining would involve an index-time change to output, say, >> an int[] of deleted docIDs if the number is sufficiently small (like maybe >> less than 1000). Then the livedocs interface could optionally return a >> cheap deleted docs iterator (i.e. only if the number of deleted docs is >> less than the threshold). Technically, the cost would be O(1), since we set >> a constant bound on the effort and fail otherwise. :) >> >> I think 1000 doc value lookups would be cheap, but I don't know if the >> guarantee is cheap enough to make it into Weight#count. >> >> That said, I'm going to see if iterating with nextClearBit() is >> sufficiently cheap. Hmm... precomputing that int[] for deleted docIDs on >> refresh could be an option too. >> >> Thanks again, >> Froh >> >> On Fri, Feb 2, 2024 at 11:38 PM Adrien Grand <jpou...@gmail.com> wrote: >> >>> Hi Michael, >>> >>> Indeed, only MatchAllDocsQuery knows how to produce a count when there >>> are deletes. >>> >>> Your idea sounds good to me, do you actually need a side car iterator >>> for deletes, or could you use a nextClearBit() operation on the bit set? >>> >>> I don't think we can fold it into Weight#count since there is an >>> expectation that it is negligible compared with the cost of a naive count, >>> but we may be able to do it in IndexSearcher#count or on the OpenSearch >>> side. >>> >>> Le ven. 2 févr. 2024, 23:50, Michael Froh <msf...@gmail.com> a écrit : >>> >>>> Hi, >>>> >>>> On OpenSearch, we've been taking advantage of the various O(1) >>>> Weight#count() implementations to quickly compute various aggregations >>>> without needing to iterate over all the matching documents (at least when >>>> the top-level query is functionally a match-all at the segment level). Of >>>> course, from what I've seen, every clever Weight#count() >>>> implementation falls apart (returns -1) in the face of deletes. >>>> >>>> I was thinking that we could still handle small numbers of deletes >>>> efficiently if only we could get a DocIdSetIterator for deleted docs. >>>> >>>> Like suppose you're doing a date histogram aggregation, you could get >>>> the counts for each bucket from the points tree (ignoring deletes), then >>>> iterate through the deleted docs and decrement their contribution from the >>>> relevant bucket (determined based on a docvalues lookup). Assuming the >>>> number of deleted docs is small, it should be cheap, right? >>>> >>>> The current LiveDocs implementation is just a FixedBitSet, so AFAIK >>>> it's not great for iteration. I'm imagining adding a supplementary "deleted >>>> docs iterator" that could sit next to the FixedBitSet if and only if the >>>> number of deletes is "small". Is there a better way that I should be >>>> thinking about this? >>>> >>>> Thanks, >>>> Froh >>>> >>> > > -- > Adrien > > -- > Uwe Schindler > Achterdiek 19, D-28357 Bremenhttps://www.thetaphi.de > eMail: u...@thetaphi.de > > -- Adrien