jpountz commented on a change in pull request #7:
URL: https://github.com/apache/lucene/pull/7#discussion_r745805971



##########
File path: lucene/core/src/java/org/apache/lucene/index/PointValues.java
##########
@@ -227,8 +228,59 @@ protected PointValues() {}
     CELL_CROSSES_QUERY
   };
 
+  /** Create a new {@link PointTree} to navigate the index */
+  public abstract PointTree getPointTree() throws IOException;
+
   /**
-   * We recurse the BKD tree, using a provided instance of this to guide the recursion.
+   * Basic operations to read the KD-tree.
+   *
+   * @lucene.experimental
+   */
+  public interface PointTree extends Cloneable {
+
+    /**
+     * Clone, the current node becomes the root of the new tree. The method should not be called
+     * after a successful call to {@link #moveToParent()}

Review comment:
       I still find it quite trappy that whether you can `clone()` doesn't 
depend on the node itself, but on how you got there. Sorry, I think I asked 
several times already, but would it be an option to lift that restriction?
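
For illustration, the kind of usage that makes it trappy (just a sketch against the `PointTree` API proposed in this PR; `values` is a hypothetical `PointValues` instance):

```java
void illustrate(PointValues values) throws IOException {
  PointValues.PointTree tree = values.getPointTree();
  if (tree.moveToChild()) {
    PointValues.PointTree sub = tree.clone(); // fine: no moveToParent() yet
    tree.moveToParent();                      // back at the root
    // Cloning `tree` here is forbidden by the javadoc, even though this is the
    // very node that getPointTree() returned and cloning it there was fine:
    // legality depends on the path taken, not on the node itself.
  }
}
```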

##########
File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java
##########
@@ -279,7 +437,91 @@ public int getNumLeaves() {
         numLeaves = rightMostLeafNode - leftMostLeafNode + 1 + leafNodeOffset;
       }
      assert numLeaves == getNumLeavesSlow(nodeID) : numLeaves + " " + getNumLeavesSlow(nodeID);
-      return numLeaves;
+      return rightMostLeafNode == this.rightMostLeafNode
+          ? (long) (numLeaves - 1) * config.maxPointsInLeafNode + lastLeafNodePointCount
+          : (long) numLeaves * config.maxPointsInLeafNode;
+    }
+
+    @Override
+    public void visitDocIDs(PointValues.IntersectVisitor visitor) throws IOException {
+      addAll(visitor, false);
+    }
+
+    public void addAll(PointValues.IntersectVisitor visitor, boolean grown) throws IOException {
+      if (grown == false) {
+        final long size = size();
+        if (size <= Integer.MAX_VALUE) {
+          visitor.grow((int) size);
+          grown = true;
+        }

Review comment:
       maybe leave a comment about the fact that if `size()` is greater than 
`Integer.MAX_VALUE` then we'll grow when recursing?
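
Something like this, for illustration (same code as the hunk above, with the suggested note added; the rest of `addAll` is elided from this hunk):

```java
if (grown == false) {
  final long size = size();
  if (size <= Integer.MAX_VALUE) {
    visitor.grow((int) size);
    grown = true;
  }
  // else: size() does not fit in an int, so we don't grow here; the recursive
  // calls on the children will grow once their size fits.
}
```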

##########
File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java
##########
@@ -328,31 +288,88 @@ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
           // Not deleted!
           docID = mappedDocID;
           System.arraycopy(
-              packedValues,
-              index * bkd.config.packedBytesLength,
-              state.scratchDataPackedValue,
+              mergeIntersectsVisitor.packedValues,
+              index * packedBytesLength,
+              packedValue,
               0,
-              bkd.config.packedBytesLength);
+              packedBytesLength);
+          return true;
+        }
+      }
+    }
+
+    private boolean collectNextLeaf() throws IOException {
+      assert pointTree.moveToChild() == false;
+      mergeIntersectsVisitor.reset();
+      do {
+        if (pointTree.moveToSibling()) {
+          // move to first child of this node and collect docs
+          while (pointTree.moveToChild()) {}
+          pointTree.visitDocValues(mergeIntersectsVisitor);
           return true;
         }
+      } while (pointTree.moveToParent());
+      return false;
+    }
+  }
+
+  private static class MergeIntersectsVisitor implements IntersectVisitor {
+
+    int docsInBlock = 0;
+    byte[] packedValues;
+    int[] docIDs;
+    private final int packedBytesLength;
+
+    MergeIntersectsVisitor(int packedBytesLength) {
+      this.docIDs = new int[0];
+      this.packedValues = new byte[0];
+      this.packedBytesLength = packedBytesLength;
+    }
+
+    void reset() {
+      docsInBlock = 0;
+    }
+
+    @Override
+    public void grow(int count) {
+      if (docIDs.length - docsInBlock < count) {
+        docIDs = ArrayUtil.grow(docIDs, count + docsInBlock);
+        packedValues = ArrayUtil.grow(packedValues, (count + docsInBlock) * packedBytesLength);

Review comment:
       I think you need to do this to make sure that you can never end up in a 
case where `docIDs` is large enough, but not `packedValues`.
   ```suggestion
        packedValues = ArrayUtil.growExact(packedValues, docIDs.length * packedBytesLength);
   ```
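
i.e. the whole `grow` would then read (sketch of the hunk above with the suggestion applied):

```java
@Override
public void grow(int count) {
  if (docIDs.length - docsInBlock < count) {
    docIDs = ArrayUtil.grow(docIDs, count + docsInBlock);
    // size packedValues from the (possibly over-allocated) docIDs.length so
    // the two arrays can never get out of sync
    packedValues = ArrayUtil.growExact(packedValues, docIDs.length * packedBytesLength);
  }
}
```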

##########
File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java
##########
@@ -328,31 +288,88 @@ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
           // Not deleted!
           docID = mappedDocID;
           System.arraycopy(
-              packedValues,
-              index * bkd.config.packedBytesLength,
-              state.scratchDataPackedValue,
+              mergeIntersectsVisitor.packedValues,
+              index * packedBytesLength,
+              packedValue,
               0,
-              bkd.config.packedBytesLength);
+              packedBytesLength);
+          return true;
+        }
+      }
+    }
+
+    private boolean collectNextLeaf() throws IOException {
+      assert pointTree.moveToChild() == false;
+      mergeIntersectsVisitor.reset();
+      do {
+        if (pointTree.moveToSibling()) {
+          // move to first child of this node and collect docs
+          while (pointTree.moveToChild()) {}
+          pointTree.visitDocValues(mergeIntersectsVisitor);
           return true;
         }
+      } while (pointTree.moveToParent());
+      return false;
+    }
+  }
+
+  private static class MergeIntersectsVisitor implements IntersectVisitor {
+
+    int docsInBlock = 0;
+    byte[] packedValues;
+    int[] docIDs;
+    private final int packedBytesLength;
+
+    MergeIntersectsVisitor(int packedBytesLength) {
+      this.docIDs = new int[0];
+      this.packedValues = new byte[0];
+      this.packedBytesLength = packedBytesLength;
+    }
+
+    void reset() {
+      docsInBlock = 0;
+    }
+
+    @Override
+    public void grow(int count) {
+      if (docIDs.length - docsInBlock < count) {
+        docIDs = ArrayUtil.grow(docIDs, count + docsInBlock);
+        packedValues = ArrayUtil.grow(packedValues, (count + docsInBlock) * packedBytesLength);

Review comment:
       Do we need to care about not overflowing the max array length on 
`packedValues`?
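
If so, one possible guard, as a sketch only (not part of the patch; uses `ArrayUtil.MAX_ARRAY_LENGTH` as the cap and assumes the `growExact` suggestion above):

```java
docIDs = ArrayUtil.grow(docIDs, count + docsInBlock);
final long neededBytes = (long) docIDs.length * packedBytesLength;
if (neededBytes > ArrayUtil.MAX_ARRAY_LENGTH) {
  // fail fast instead of overflowing the int cast below
  throw new IllegalArgumentException("packedValues would exceed the maximum array length");
}
packedValues = ArrayUtil.growExact(packedValues, (int) neededBytes);
```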

##########
File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java
##########
@@ -279,7 +437,91 @@ public int getNumLeaves() {
         numLeaves = rightMostLeafNode - leftMostLeafNode + 1 + leafNodeOffset;
       }
      assert numLeaves == getNumLeavesSlow(nodeID) : numLeaves + " " + getNumLeavesSlow(nodeID);
-      return numLeaves;
+      return rightMostLeafNode == this.rightMostLeafNode
+          ? (long) (numLeaves - 1) * config.maxPointsInLeafNode + lastLeafNodePointCount
+          : (long) numLeaves * config.maxPointsInLeafNode;
+    }
+
+    @Override
+    public void visitDocIDs(PointValues.IntersectVisitor visitor) throws IOException {
+      addAll(visitor, false);
+    }
+
+    public void addAll(PointValues.IntersectVisitor visitor, boolean grown) throws IOException {
+      if (grown == false) {
+        final long size = size();
+        if (size <= Integer.MAX_VALUE) {
+          visitor.grow((int) size);
+          grown = true;
+        }

Review comment:
       I wonder about the implications of trying to `grow` as early as possible 
while the current implementation only grows at the leaf level. It looks like it 
could lead to loading the entire dataset in memory with our current merging 
logic? Maybe we should keep only growing at the leaf level for now and look 
into growing earlier in a follow-up?
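
Roughly what keeping the existing leaf-level-only growing could look like (hypothetical sketch: the recursive part of `addAll` is elided from this hunk, and the leaf handling is assumed):

```java
public void addAll(PointValues.IntersectVisitor visitor) throws IOException {
  if (moveToChild()) {
    // inner node: don't grow up front, let each leaf grow for itself
    do {
      addAll(visitor);
    } while (moveToSibling());
    moveToParent();
  } else {
    // leaf node: its point count always fits in an int, so growing here is safe
    visitor.grow((int) size());
    // ... visit this leaf's docIDs ...
  }
}
```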



