[jira] [Commented] (LUCENE-10311) Should DocIdSetBuilder have different implementations for point and terms?

ASF subversion and git services (Jira) Wed, 02 Mar 2022 23:50:06 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500563#comment-17500563
 ]


ASF subversion and git services commented on LUCENE-10311:
----------------------------------------------------------

Commit ca73ed1c2842b10c338f1d27ec54cead69ac090e in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ca73ed1 ]

LUCENE-10311: Make FixedBitSet#approximateCardinality faster (and actually 
approximate). (#710)

This computes a pop count on a sample of the longs that back the bitset.

Quick benchmarks suggest that this runs 5x-10x faster than
`FixedBitSet#cardinality` depending on the length of the bitset.
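
As a rough illustration of the sampling idea described above, here is a
minimal sketch. It is not the code from the commit: the class name, the
`interval` parameter, and the choice of an evenly spaced sample are all
assumptions made for the example. It pop-counts every `interval`-th long and
scales the result up to the full length of the backing array.

```java
final class ApproxCardinalitySketch {

  /** Exact count: pop count over every long backing the bitset. */
  static long exactCardinality(long[] bits) {
    long count = 0;
    for (long word : bits) {
      count += Long.bitCount(word);
    }
    return count;
  }

  /**
   * Approximate count: pop count every {@code interval}-th long, then scale
   * by the ratio of total longs to sampled longs. The sampling scheme is an
   * assumption for illustration only.
   */
  static long approximateCardinality(long[] bits, int interval) {
    // Small inputs are cheap to count exactly, so skip sampling.
    if (interval <= 1 || bits.length < interval) {
      return exactCardinality(bits);
    }
    long popCount = 0;
    int sampled = 0;
    for (int i = 0; i < bits.length; i += interval) {
      popCount += Long.bitCount(bits[i]);
      sampled++;
    }
    return popCount * bits.length / sampled;
  }

  public static void main(String[] args) {
    long[] bits = new long[1 << 16];
    java.util.Random random = new java.util.Random(42);
    for (int i = 0; i < bits.length; i++) {
      bits[i] = random.nextLong();
    }
    System.out.println("exact:  " + exactCardinality(bits));
    System.out.println("approx: " + approximateCardinality(bits, 16));
  }
}
```

The estimate is only as good as the sample: if set bits are clustered in
longs that the stride happens to skip, the result can be far from the exact
cardinality.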

> Should DocIdSetBuilder have different implementations for point and terms?
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-10311
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10311
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ignacio Vera
>            Priority: Major
>          Time Spent: 8h
>  Remaining Estimate: 0h
>
> DocIdSetBuilder has two API implementations, one for terms queries and one 
> for point values queries. In each case they are used in a totally different 
> way.
> For terms the API looks like:
>  
> {code:java}
> /**
>  * Add the content of the provided {@link DocIdSetIterator} to this builder.
>  * NOTE: if you need to build a {@link DocIdSet} out of a single
>  * {@link DocIdSetIterator}, you should rather use
>  * {@link RoaringDocIdSet.Builder}.
>  */
> void add(DocIdSetIterator iter) throws IOException;
> /** Build a {@link DocIdSet} from the accumulated doc IDs. */
> DocIdSet build();
> {code}
>  
> For Point Values it looks like:
>  
> {code:java}
> /**
>  * Utility class to efficiently add many docs in one go.
>  *
>  * @see DocIdSetBuilder#grow
>  */
> public abstract static class BulkAdder {
>   public abstract void add(int doc);
>   public void add(DocIdSetIterator iterator) throws IOException {
>     int docID;
>     while ((docID = iterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
>       add(docID);
>     }
>   }
> }
> /**
>  * Reserve space and return a {@link BulkAdder} object that can be used to
>  * add up to {@code numDocs} documents.
>  */
> public BulkAdder grow(int numDocs);
> /** Build a {@link DocIdSet} from the accumulated doc IDs. */
> DocIdSet build();
> {code}
>  
>  
> This is becoming trappy for new developments in the PointValues API.
> 1) When we call #grow() from the PointValues API, we are not telling the 
> builder how many docs we are going to add (as we don't really know it) but 
> the number of points we are about to visit. This number can be bigger than 
> Integer.MAX_VALUE. Until now, we have worked around this issue by making 
> sure we don't call this API when we need to add more than Integer.MAX_VALUE 
> points. In that case we navigate down the tree until the number of points 
> is small enough to fit in an int.
> This has worked well until now because we are calling grow from inside the 
> BKD reader, and the BKD writer/reader makes sure that the number of points 
> in a leaf fits in an int. In LUCENE-, we are moving to a cursor-like API 
> which does not enforce that the number of points in a leaf fits in an int. 
> This causes friction and inconsistency in the API.
>  
> 2) This is a secondary issue that I found while thinking about this one. In 
> Lucene- we added the possibility to add a `DocIdSetIterator` from the 
> PointValues API. Therefore there are two ways to add that kind of object 
> to a DocIdSetBuilder, which can end up with different results:
>  
> {code:java}
> {
>   // Terms API
>   docIdSetBuilder.add(docIdSetIterator); 
> }
> {
>   // Point values API
>   docIdSetBuilder.grow(doc).add(docIdSetIterator);
> }{code}
>  
> I wonder if we need to rethink this API: should we have different 
> implementations for terms and point values?
>  
>  
>  
>  
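
The workaround described in point 1 of the quoted description (descend the
tree until a subtree's point count fits in an int before reserving space) can
be sketched roughly as follows. Everything here is hypothetical: `Node`,
`reserve`, and the list that records sizes stand in for the real tree and for
calls to `grow(int)`, and none of it is taken from Lucene.

```java
import java.util.List;

final class GrowDescentSketch {

  /** Hypothetical stand-in for a node of a points tree; not a Lucene class. */
  static final class Node {
    final long pointCount;
    final List<Node> children;

    Node(long pointCount, List<Node> children) {
      this.pointCount = pointCount;
      this.children = children;
    }
  }

  /**
   * Walks down from {@code node} and records, in {@code growCalls}, the size
   * that would be passed to a hypothetical grow(int) for each subtree whose
   * point count fits in an int.
   */
  static void reserve(Node node, List<Integer> growCalls) {
    if (node.pointCount <= Integer.MAX_VALUE) {
      // The count fits in an int, so one grow call covers this subtree.
      growCalls.add((int) node.pointCount);
      return;
    }
    if (node.children.isEmpty()) {
      // Nothing left to descend into: the case an int-based grow cannot express.
      throw new IllegalStateException("leaf holds more points than fit in an int");
    }
    for (Node child : node.children) {
      reserve(child, growCalls);
    }
  }
}
```

The exception branch is where the friction shows up: if nothing guarantees
that a leaf's point count fits in an int, there is no smaller subtree to
descend into.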



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
