Ignacio Vera created LUCENE-10429:
-------------------------------------

             Summary: DocIdsetBuilder implementation implementation is 
inconsistent with DocIdsetBuilder#grow contract 
                 Key: LUCENE-10429
                 URL: https://issues.apache.org/jira/browse/LUCENE-10429
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Ignacio Vera


Currently the contract of DocIdSetBulder#grow says:

{noformat}
/**
 * Reserve space and return a \{@link BulkAdder} object that can be used to add 
up to \{@code * numDocs}documents.
*/
public BulkAdder grow(int numDocs)
{noformat}

 
I would expect that from the PointValues API I could call this method by using 
the following:

{noformat}
DocIdSetBulder#grow((int) Math.min(docCount, node.size()));
{noformat}

But it seems it is not true as the implementation expects that the method is 
grow for all the call to growAdder, counting duplicated documents. Therefore we 
have this other implementation of addAll instead, to make happy the 
implementation:

{noformat}
 public void addAll(PointValues.IntersectVisitor visitor, boolean grown) throws 
IOException {
      if (grown == false) {
        final long size = size();
        if (size <= Integer.MAX_VALUE) {
          visitor.grow((int) size);
          grown = true;
        }
      }
      if (isLeafNode()) {
        // Leaf node
        leafNodes.seek(getLeafBlockFP());
        // How many points are stored in this leaf cell:
        int count = leafNodes.readVInt();
        // No need to call grow(), it has been called up-front
        docIdsWriter.readInts(leafNodes, count, visitor);
      } else {
        pushLeft();
        addAll(visitor, grown);
        pop();
        pushRight();
        addAll(visitor, grown);
        pop();
      }
    }

{noformat}

Therefore we have three options here:

1) Modify the grow API to reflect that it can be called more than 
Integer#MAX_VALUE, and therefore change the input parameter from int to long. 
Note that this method is exposed due to the points API so we have tried this 
implementation in LUCENE-10311 by creating a specific implementation of 
DocIdSetBuilder for points. This has so far been rejected.

2) Modify the implementation of DocIdSetBuilder. Currently the issue is that we 
have a counter inside the implementation that it is used to compute the cost 
for the dense case of the final iterator. Therefore we need to change the way 
we compute the cost.
The proposal here is to change the way we compute cost from:
{noformat}
 final long cost = Math.round(counter / numValuesPerDoc);
{noformat}
Which might underestimate the cost of the iterator to the following that 
overestimate the cost:
{noformat}
 final long cost = Math.min(counter, docCount))
{noformat}
I lack of intuition of how this might affect performance down the line. One 
thing I notice is that for the Terms API (that is when we add docs using a 
DocIdSetIterator via DocIdSetBuilder#add), we ignore the counter in the dense 
case, so we are already providing a totally wrong cost on that case! 

3) If none of this proposals is successful we should at least update the java 
docs to reflect reality.






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to