[GitHub] [lucene-solr] msokolov commented on a change in pull request #2239: LUCENE-9695: don't merge deleted vectors

GitBox Mon, 25 Jan 2021 19:54:24 -0800


msokolov commented on a change in pull request #2239:
URL: https://github.com/apache/lucene-solr/pull/2239#discussion_r563309172




##########
File path: lucene/core/src/java/org/apache/lucene/codecs/VectorWriter.java
##########
@@ -153,13 +153,12 @@ public int nextDoc() throws IOException {
     private final DocIDMerger<VectorValuesSub> docIdMerger;
     private final int[] ordBase;
     private final int cost;
-    private final int size;
+    private int size;
 
     private int docId;
     private VectorValuesSub current;
-    // For each doc with a vector, record its ord in the segments being 
merged. This enables random
-    // access into the
-    // unmerged segments using the ords from the merged segment.
+    /* For each doc with a vector, record its ord in the segments being 
merged. This enables random access into the unmerged segments using the ords 
from the merged segment.
+     */

Review comment:
       Hmmm I thought spotless would wrap this line, but it doesn't seem to 
complain about it

##########
File path: lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java
##########
@@ -578,6 +572,8 @@ private int createIndex(Path docsPath, Path indexPath) 
throws IOException {
     IndexWriterConfig iwc = new 
IndexWriterConfig().setOpenMode(IndexWriterConfig.OpenMode.CREATE);
     // iwc.setMergePolicy(NoMergePolicy.INSTANCE);
     iwc.setRAMBufferSizeMB(1994d);
+    iwc.setMaxBufferedDocs(10000);

Review comment:
       Oh I did not mean to include this change here. Probably we will want to 
have some command line parameter to control this, but for now having the 
default be to index a large segment is probably better

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/VectorWriter.java
##########
@@ -194,6 +197,7 @@ public int nextDoc() throws IOException {
       current = docIdMerger.next();
       if (current == null) {
         docId = NO_MORE_DOCS;
+        size = ord;

Review comment:
       Thanks, yes I did.

##########
File path: lucene/core/src/test/org/apache/lucene/index/TestKnnGraph.java
##########
@@ -124,21 +126,84 @@ public void testMerge() throws Exception {
     }
   }
 
-  private void dumpGraph(KnnGraphValues values, int size) throws IOException {
+  /**
+   * Verify that we get the *same* graph by indexing one segment as we do by 
indexing two segments
+   * and merging.
+   */
+  public void testMergeProducesSameGraph() throws Exception {

Review comment:
       thanks, yes this flushed out the two problems I saw, so I'm pretty 
confident they are fixed now, after running this a few 100 times. I had also 
wanted to add a test asserting that KNN search precision remains above some 
threshold, but sadly with random vectors, it would always eventually fail, even 
though mostly it would succeed, so not a very useful unit test and I removed 
it. Probably we can add something to luceneutil

##########
File path: lucene/core/src/test/org/apache/lucene/index/TestVectorValues.java
##########
@@ -748,29 +751,107 @@ public void testRandom() throws Exception {
             assertEquals(dimension, v.length);
             String idString = 
ctx.reader().document(docId).getField("id").stringValue();
             int id = Integer.parseInt(idString);
-            assertArrayEquals(idString, values[id], v, 0);
-            ++valueCount;
+            if (ctx.reader().getLiveDocs() == null || 
ctx.reader().getLiveDocs().get(docId)) {
+              assertArrayEquals(idString, values[id], v, 0);
+              ++valueCount;
+            } else {
+              assertNull(values[id]);
+            }
           }
         }
         assertEquals(numValues, valueCount);
-        assertEquals(numValues, totalSize);
+        assertEquals(numValues, totalSize - numDeletes);
+      }
+    }
+  }
+
+  /**
+   * Index random vectors, sometimes skipping documents, sometimes deleting a 
document, sometimes
+   * merging, sometimes sorting the index, and verify that the expected values 
can be read back
+   * consistently.
+   */
+  public void testRandom2() throws Exception {

Review comment:
       I forgot why I had done this (there is already a testRandom), so I added 
some comments explaining how it differs.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene-solr] msokolov commented on a change in pull request #2239: LUCENE-9695: don't merge deleted vectors

Reply via email to