[ https://issues.apache.org/jira/browse/LUCENE-10429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ignacio Vera updated LUCENE-10429: ---------------------------------- Summary: DocIdsetBuilder implementation is inconsistent with DocIdSetBuilder#grow contract (was: DocIdsetBuilder implementation is inconsistent with DocIdsetBuilder#grow contract ) > DocIdsetBuilder implementation is inconsistent with DocIdSetBuilder#grow > contract > ---------------------------------------------------------------------------------- > > Key: LUCENE-10429 > URL: https://issues.apache.org/jira/browse/LUCENE-10429 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Ignacio Vera > Priority: Major > > Currently the contract of DocIdSetBulder#grow says: > {noformat} > /** > * Reserve space and return a \{@link BulkAdder} object that can be used to > add up to \{@code * numDocs}documents. > */ > public BulkAdder grow(int numDocs) > {noformat} > > I would expect that from the PointValues API I could call this method by > using the following: > {noformat} > DocIdSetBulder#grow((int) Math.min(docCount, node.size())); > {noformat} > But it seems it is not true as the implementation expects that the method is > grow for all the call to growAdder, counting duplicated documents. Therefore > we have this other implementation of addAll instead, to make happy the > implementation: > {noformat} > public void addAll(PointValues.IntersectVisitor visitor, boolean grown) > throws IOException { > if (grown == false) { > final long size = size(); > if (size <= Integer.MAX_VALUE) { > visitor.grow((int) size); > grown = true; > } > } > if (isLeafNode()) { > // Leaf node > leafNodes.seek(getLeafBlockFP()); > // How many points are stored in this leaf cell: > int count = leafNodes.readVInt(); > // No need to call grow(), it has been called up-front > docIdsWriter.readInts(leafNodes, count, visitor); > } else { > pushLeft(); > addAll(visitor, grown); > pop(); > pushRight(); > addAll(visitor, grown); > pop(); > } > } > {noformat} > Therefore we have three options here: > 1) Modify the grow API to reflect that it can be called more than > Integer#MAX_VALUE, and therefore change the input parameter from int to long. > Note that this method is exposed due to the points API so we have tried this > implementation in LUCENE-10311 by creating a specific implementation of > DocIdSetBuilder for points. This has so far been rejected. > 2) Modify the implementation of DocIdSetBuilder. Currently the issue is that > we have a counter inside the implementation that it is used to compute the > cost for the dense case of the final iterator. Therefore we need to change > the way we compute the cost. > The proposal here is to change the way we compute cost from: > {noformat} > final long cost = Math.round(counter / numValuesPerDoc); > {noformat} > Which might underestimate the cost of the iterator to the following that > overestimate the cost: > {noformat} > final long cost = Math.min(counter, docCount)) > {noformat} > I lack of intuition of how this might affect performance down the line. One > thing I notice is that for the Terms API (that is when we add docs using a > DocIdSetIterator via DocIdSetBuilder#add), we ignore the counter in the dense > case, so we are already providing a totally wrong cost on that case! > 3) If none of this proposals is successful we should at least update the java > docs to reflect reality. -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org