Re: Computing weight.count() cheaply in the face of deletes?

2024-02-06 Thread Uwe Schindler

Hi,

My response was a bit unclear. Before Lucene 4.0 we saved *deletions* in 
a bit set (1 = doc deleted), so you could use the provided 
DocIdSetIterator directly. At that point there was no sparse implementation.


My idea was more about this: "Because we marked *deleted* docs (not live 
docs) in the bit set, the cardinality of the bit set was small and a 
sparse implementation would work well." Of course, we can just invert on 
set/get to make use of a SparseFixedBitSet.
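
To make the inversion concrete, here is a rough sketch (the wrapper class
and its method names are made up for illustration; Bits, SparseFixedBitSet,
BitSetIterator and DocIdSetIterator are the existing Lucene utilities): we
store deletions sparsely and invert on get() to expose the usual live-docs
view, and the deleted-docs iterator comes for free because the deletions
are the set bits.

import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BitSetIterator;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.SparseFixedBitSet;

final class SparseDeletes {
  private final SparseFixedBitSet deleted; // set bit = doc is deleted

  SparseDeletes(int maxDoc) {
    this.deleted = new SparseFixedBitSet(maxDoc);
  }

  void delete(int docId) {
    deleted.set(docId);
  }

  /** Live-docs view: a doc is live iff its deletion bit is clear. */
  Bits asLiveDocs() {
    return new Bits() {
      @Override
      public boolean get(int index) {
        return deleted.get(index) == false; // invert on get
      }

      @Override
      public int length() {
        return deleted.length();
      }
    };
  }

  /** Iterator over deleted docs; cheap because deletions are the set bits. */
  DocIdSetIterator deletedDocsIterator() {
    return new BitSetIterator(deleted, deleted.approximateCardinality());
  }
}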


Uwe

On Tue, Feb 6, 2024 at 9:05 PM Adrien Grand wrote:
Good point, I opened an issue to discuss this: 
https://github.com/apache/lucene/issues/13084.


Did we actually use a sparse bit set to encode deleted docs before? I 
don't recall that.



Re: Computing weight.count() cheaply in the face of deletes?

2024-02-06 Thread Adrien Grand
Good point, I opened an issue to discuss this:
https://github.com/apache/lucene/issues/13084.

Did we actually use a sparse bit set to encode deleted docs before? I don't
recall that.


-- 
Adrien


Re: Computing weight.count() cheaply in the face of deletes?

2024-02-06 Thread Uwe Schindler

Hi,

A SparseBitset impl for DELETES would be fine if the model in Lucene 
encoded deleted docs (as it did in earlier times). As deletes are sparse 
(in most cases <40%), this would help make the iterator cheaper.


Uwe

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


Re: Computing weight.count() cheaply in the face of deletes?

2024-02-06 Thread Adrien Grand
Hey Michael,

You are right, iterating all deletes with nextClearBit() would run in
O(maxDoc). I am coming from the other direction, where I'm expecting the
number of deletes to be more on the order of 1%-5% of the doc ID space, so
a separate int[] would use lots of heap and probably not help that much
compared with nextClearBit(). My mental model is that the two most common
use cases are append-only workloads, where there are no deletes at all, and
update workloads, which would commonly have several percent of deleted
docs. It's not clear to me how common it is to have very few deletes.
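
To put rough numbers on that (a back-of-the-envelope illustration, not a
benchmark): with maxDoc = 100M and 5% deletes, an int[] of deleted doc IDs
holds 5M entries, i.e. about 20 MB of heap per segment, while the live-docs
FixedBitSet costs maxDoc / 8 ≈ 12.5 MB regardless of how many docs are
deleted. The int[] only wins on memory below roughly maxDoc / 32, i.e.
about 3% deletes.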


-- 
Adrien


Re: Computing weight.count() cheaply in the face of deletes?

2024-02-05 Thread Michael Froh
Thanks Adrien!

My thinking with a separate iterator was that nextClearBit() is relatively
expensive (O(maxDoc) to traverse everything, I think). The solution I was
imagining would involve an index-time change to output, say, an int[] of
deleted docIDs if the number is sufficiently small (like maybe less than
1000). Then the livedocs interface could optionally return a cheap deleted
docs iterator (i.e. only if the number of deleted docs is less than the
threshold). Technically, the cost would be O(1), since we set a constant
bound on the effort and fail otherwise. :)

I think 1000 doc value lookups would be cheap, but I don't know if the
guarantee is cheap enough to make it into Weight#count.

That said, I'm going to see if iterating with nextClearBit() is
sufficiently cheap. Hmm... precomputing that int[] for deleted docIDs on
refresh could be an option too.
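
As a rough sketch of that refresh-time precomputation (the helper class and
DELETED_DOCS_THRESHOLD are hypothetical names; LeafReader and Bits are the
real Lucene APIs), it could look something like this -- still O(maxDoc)
once per refresh, but amortized across queries:

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.Bits;

final class SmallDeletes {
  static final int DELETED_DOCS_THRESHOLD = 1000; // hypothetical cutoff

  /**
   * Returns the deleted doc IDs for this leaf if there are at most
   * DELETED_DOCS_THRESHOLD of them, or null otherwise (the "fail" case).
   */
  static int[] deletedDocIds(LeafReader reader) {
    int numDeleted = reader.numDeletedDocs();
    if (numDeleted == 0 || numDeleted > DELETED_DOCS_THRESHOLD) {
      return null;
    }
    Bits liveDocs = reader.getLiveDocs(); // non-null when numDeleted > 0
    int[] deleted = new int[numDeleted];
    int idx = 0;
    for (int doc = 0; doc < reader.maxDoc() && idx < numDeleted; doc++) {
      if (liveDocs.get(doc) == false) {
        deleted[idx++] = doc;
      }
    }
    return deleted;
  }
}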

Thanks again,
Froh



Re: Computing weight.count() cheaply in the face of deletes?

2024-02-02 Thread Adrien Grand
Hi Michael,

Indeed, only MatchAllDocsQuery knows how to produce a count when there are
deletes.

Your idea sounds good to me. Do you actually need a sidecar iterator for
deletes, or could you use a nextClearBit() operation on the bit set?

I don't think we can fold it into Weight#count, since there is an
expectation that its cost is negligible compared with that of a naive
count, but we may be able to do it in IndexSearcher#count or on the
OpenSearch side.
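
For context, per-leaf counting with a fallback looks roughly like the
sketch below (not the actual IndexSearcher#count code; countByIteration is
a made-up helper). A deletes-aware correction would have to hook in where
count(ctx) returns -1:

import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;
import org.apache.lucene.util.Bits;

final class CountSketch {
  static int count(IndexSearcher searcher, Weight weight) throws IOException {
    int total = 0;
    for (LeafReaderContext ctx : searcher.getIndexReader().leaves()) {
      int leafCount = weight.count(ctx); // -1 if this leaf can't count cheaply, e.g. with deletes
      if (leafCount == -1) {
        leafCount = countByIteration(weight, ctx); // naive fallback
      }
      total += leafCount;
    }
    return total;
  }

  /** Naive fallback: build a scorer and count matching live docs. */
  private static int countByIteration(Weight weight, LeafReaderContext ctx) throws IOException {
    Scorer scorer = weight.scorer(ctx);
    if (scorer == null) {
      return 0;
    }
    Bits liveDocs = ctx.reader().getLiveDocs();
    DocIdSetIterator it = scorer.iterator();
    int count = 0;
    for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
      if (liveDocs == null || liveDocs.get(doc)) {
        count++;
      }
    }
    return count;
  }
}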

On Fri, Feb 2, 2024 at 11:50 PM Michael Froh wrote:

> Hi,
>
> On OpenSearch, we've been taking advantage of the various O(1)
> Weight#count() implementations to quickly compute various aggregations
> without needing to iterate over all the matching documents (at least when
> the top-level query is functionally a match-all at the segment level). Of
> course, from what I've seen, every clever Weight#count()
> implementation falls apart (returns -1) in the face of deletes.
>
> I was thinking that we could still handle small numbers of deletes
> efficiently if only we could get a DocIdSetIterator for deleted docs.
>
> Like suppose you're doing a date histogram aggregation, you could get the
> counts for each bucket from the points tree (ignoring deletes), then
> iterate through the deleted docs and decrement their contribution from the
> relevant bucket (determined based on a docvalues lookup). Assuming the
> number of deleted docs is small, it should be cheap, right?
>
> The current LiveDocs implementation is just a FixedBitSet, so AFAIK it's
> not great for iteration. I'm imagining adding a supplementary "deleted docs
> iterator" that could sit next to the FixedBitSet if and only if the number
> of deletes is "small". Is there a better way that I should be thinking
> about this?
>
> Thanks,
> Froh
>
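
For concreteness, a rough sketch of the decrement approach described in the
message above (bucketOf and the bucketCounts array are made-up placeholders;
the live-docs and doc-values calls are standard Lucene APIs). Note that the
outer loop is exactly the O(maxDoc) scan over deleted docs that a dedicated
deleted-docs iterator would replace:

import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.util.Bits;

final class HistogramCorrection {
  /** Subtracts deleted docs' contributions from counts computed while ignoring deletes. */
  static void subtractDeleted(LeafReader reader, String dateField, long[] bucketCounts)
      throws IOException {
    Bits liveDocs = reader.getLiveDocs();
    if (liveDocs == null) {
      return; // no deletes in this segment, nothing to correct
    }
    NumericDocValues dates = DocValues.getNumeric(reader, dateField);
    for (int doc = 0; doc < reader.maxDoc(); doc++) {
      // O(maxDoc) scan to find deleted docs; a deleted-docs iterator would replace this
      if (liveDocs.get(doc) == false && dates.advanceExact(doc)) {
        bucketCounts[bucketOf(dates.longValue())]--;
      }
    }
  }

  /** Hypothetical: maps a timestamp to its histogram bucket index. */
  static int bucketOf(long timestamp) {
    return 0;
  }
}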