[ https://issues.apache.org/jira/browse/LUCENE-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500563#comment-17500563 ]
ASF subversion and git services commented on LUCENE-10311:
----------------------------------------------------------

Commit ca73ed1c2842b10c338f1d27ec54cead69ac090e in lucene's branch refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ca73ed1 ]

LUCENE-10311: Make FixedBitSet#approximateCardinality faster (and actually approximate). (#710)

This computes a pop count on a sample of the longs that back the bitset. Quick benchmarks suggest that this runs 5x-10x faster than `FixedBitSet#cardinality` depending on the length of the bitset.

> Should DocIdSetBuilder have different implementations for point and terms?
> --------------------------------------------------------------------------
>
> Key: LUCENE-10311
> URL: https://issues.apache.org/jira/browse/LUCENE-10311
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Ignacio Vera
> Priority: Major
> Time Spent: 8h
> Remaining Estimate: 0h
>
> DocIdSetBuilder has two API implementations, one for terms queries and one
> for point values queries. In each case they are used in totally different
> ways.
> For terms the API looks like:
>
> {code:java}
> /**
>  * Add the content of the provided {@link DocIdSetIterator} to this builder. NOTE: if you need to
>  * build a {@link DocIdSet} out of a single {@link DocIdSetIterator}, you should rather use {@link
>  * RoaringDocIdSet.Builder}.
>  */
> void add(DocIdSetIterator iter) throws IOException;
>
> /** Build a {@link DocIdSet} from the accumulated doc IDs. */
> DocIdSet build();
> {code}
>
> For Point Values it looks like:
>
> {code:java}
> /**
>  * Utility class to efficiently add many docs in one go.
>  *
>  * @see DocIdSetBuilder#grow
>  */
> public abstract static class BulkAdder {
>   public abstract void add(int doc);
>
>   public void add(DocIdSetIterator iterator) throws IOException {
>     int docID;
>     while ((docID = iterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
>       add(docID);
>     }
>   }
> }
>
> /**
>  * Reserve space and return a {@link BulkAdder} object that can be used to add up to {@code
>  * numDocs} documents.
>  */
> public BulkAdder grow(int numDocs)
>
> /** Build a {@link DocIdSet} from the accumulated doc IDs. */
> DocIdSet build()
> {code}
>
> This is becoming trappy for new developments in the PointValues API.
>
> 1) When we call #grow() from the PointValues API, we are not telling the
> builder how many docs we are going to add (as we don't really know that) but
> the number of points we are about to visit. This number can be bigger than
> Integer.MAX_VALUE. Until now, we have worked around this by making sure we
> don't call this API when we need to add more than Integer.MAX_VALUE points.
> In that case we navigate down the tree until the number of points is
> small enough to fit in an int.
> This has worked well until now because we are calling grow from inside the BKD
> reader, and the BKD writer/reader makes sure that the number of points in a
> leaf fits in an int. In LUCENE-, we are moving to a cursor-like API which
> does not enforce that the number of points in a leaf fits in an int.
> This causes friction and inconsistency in the API.
>
> 2) This is a secondary issue that I found while thinking about this one. In
> Lucene- we added the possibility to add a `DocIdSetIterator` from the
> PointValues API.
> Therefore there are two ways to add this kind of object
> to a DocIdSetBuilder, which can end up with different results:
>
> {code:java}
> {
>   // Terms API
>   docIdSetBuilder.add(docIdSetIterator);
> }
> {
>   // Point values API
>   docIdSetBuilder.grow(doc).add(docIdSetIterator);
> }
> {code}
>
> I wonder if we need to rethink this API. Should we have different
> implementations for terms and point values?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
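
The commit message at the top of this notification describes computing "a pop count on a sample of the longs that back the bitset". The block below is a minimal sketch of that idea only, not the code from the commit: the class name `SampledCardinality`, the fixed `stride` parameter, and the scaling step are all assumptions made for illustration, and the actual sampling strategy in `FixedBitSet#approximateCardinality` may differ.

```java
/** Illustrative sketch only; this is not Lucene's FixedBitSet implementation. */
final class SampledCardinality {

  private SampledCardinality() {}

  /** Exact cardinality: pop count over every backing long. */
  static long exactCardinality(long[] bits) {
    long count = 0;
    for (long word : bits) {
      count += Long.bitCount(word);
    }
    return count;
  }

  /**
   * Approximate cardinality: pop count over every {@code stride}-th long,
   * then scale the sampled count up to the full array length.
   * The stride is a hypothetical tuning knob, not a Lucene parameter.
   */
  static long approximateCardinality(long[] bits, int stride) {
    if (stride <= 0) {
      throw new IllegalArgumentException("stride must be positive");
    }
    long sampledCount = 0;
    int sampledWords = 0;
    for (int i = 0; i < bits.length; i += stride) {
      sampledCount += Long.bitCount(bits[i]);
      sampledWords++;
    }
    if (sampledWords == 0) {
      return 0; // empty bitset
    }
    // Scale in floating point to avoid overflowing a long on very large arrays.
    return Math.round((double) sampledCount * bits.length / sampledWords);
  }
}
```

With a stride of 1 the estimate equals the exact count. Larger strides read fewer words, which is where a speedup would come from, at the cost of accuracy when set bits are unevenly distributed across the backing array.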