[ 
https://issues.apache.org/jira/browse/LUCENE-10429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ignacio Vera updated LUCENE-10429:
----------------------------------
    Summary: DocIdsetBuilder implementation is inconsistent with 
DocIdSetBuilder#grow contract   (was: DocIdsetBuilder implementation is 
inconsistent with DocIdsetBuilder#grow contract )

> DocIdsetBuilder implementation is inconsistent with DocIdSetBuilder#grow 
> contract 
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-10429
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10429
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ignacio Vera
>            Priority: Major
>
> Currently the contract of DocIdSetBulder#grow says:
> {noformat}
> /**
>  * Reserve space and return a \{@link BulkAdder} object that can be used to 
> add up to \{@code * numDocs}documents.
> */
> public BulkAdder grow(int numDocs)
> {noformat}
>  
> I would expect that from the PointValues API I could call this method by 
> using the following:
> {noformat}
> DocIdSetBulder#grow((int) Math.min(docCount, node.size()));
> {noformat}
> But it seems it is not true as the implementation expects that the method is 
> grow for all the call to growAdder, counting duplicated documents. Therefore 
> we have this other implementation of addAll instead, to make happy the 
> implementation:
> {noformat}
>  public void addAll(PointValues.IntersectVisitor visitor, boolean grown) 
> throws IOException {
>       if (grown == false) {
>         final long size = size();
>         if (size <= Integer.MAX_VALUE) {
>           visitor.grow((int) size);
>           grown = true;
>         }
>       }
>       if (isLeafNode()) {
>         // Leaf node
>         leafNodes.seek(getLeafBlockFP());
>         // How many points are stored in this leaf cell:
>         int count = leafNodes.readVInt();
>         // No need to call grow(), it has been called up-front
>         docIdsWriter.readInts(leafNodes, count, visitor);
>       } else {
>         pushLeft();
>         addAll(visitor, grown);
>         pop();
>         pushRight();
>         addAll(visitor, grown);
>         pop();
>       }
>     }
> {noformat}
> Therefore we have three options here:
> 1) Modify the grow API to reflect that it can be called more than 
> Integer#MAX_VALUE, and therefore change the input parameter from int to long. 
> Note that this method is exposed due to the points API so we have tried this 
> implementation in LUCENE-10311 by creating a specific implementation of 
> DocIdSetBuilder for points. This has so far been rejected.
> 2) Modify the implementation of DocIdSetBuilder. Currently the issue is that 
> we have a counter inside the implementation that it is used to compute the 
> cost for the dense case of the final iterator. Therefore we need to change 
> the way we compute the cost.
> The proposal here is to change the way we compute cost from:
> {noformat}
>  final long cost = Math.round(counter / numValuesPerDoc);
> {noformat}
> Which might underestimate the cost of the iterator to the following that 
> overestimate the cost:
> {noformat}
>  final long cost = Math.min(counter, docCount))
> {noformat}
> I lack of intuition of how this might affect performance down the line. One 
> thing I notice is that for the Terms API (that is when we add docs using a 
> DocIdSetIterator via DocIdSetBuilder#add), we ignore the counter in the dense 
> case, so we are already providing a totally wrong cost on that case! 
> 3) If none of this proposals is successful we should at least update the java 
> docs to reflect reality.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to