[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing

GitBox Thu, 14 Jul 2022 06:47:24 -0700


mayya-sharipova commented on code in PR #992:
URL: https://github.com/apache/lucene/pull/992#discussion_r921175386



##########
lucene/core/src/java/org/apache/lucene/index/VectorValuesWriter.java:
##########
@@ -26,233 +26,153 @@
 import org.apache.lucene.codecs.KnnVectorsWriter;
 import org.apache.lucene.search.DocIdSetIterator;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.ArrayUtil;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;
-import org.apache.lucene.util.Counter;
 import org.apache.lucene.util.RamUsageEstimator;
 
 /**
- * Buffers up pending vector value(s) per doc, then flushes when segment flushes.
+ * Buffers up pending vector value(s) per doc, then flushes when segment flushes. Used for {@code
+ * SimpleTextKnnVectorsWriter} and for vectors writers before v 9.3 .
  *
  * @lucene.experimental
  */
-class VectorValuesWriter {
-
-  private final FieldInfo fieldInfo;
-  private final Counter iwBytesUsed;
-  private final List<float[]> vectors = new ArrayList<>();
-  private final DocsWithFieldSet docsWithField;
-
-  private int lastDocID = -1;
-
-  private long bytesUsed;
-
-  VectorValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed) {
-    this.fieldInfo = fieldInfo;
-    this.iwBytesUsed = iwBytesUsed;
-    this.docsWithField = new DocsWithFieldSet();
-    this.bytesUsed = docsWithField.ramBytesUsed();
-    if (iwBytesUsed != null) {
-      iwBytesUsed.addAndGet(bytesUsed);
+public abstract class VectorValuesWriter extends KnnVectorsWriter {

Review Comment:
   @jtibshirani 
   
   > I was thinking we would rewrite SimpleTextKnnVectorsWriter to implement the new interface directly. For example it'd define its own KnnFieldVectorsWriter where addValue writes to the vectors data file directly.
   
   
   I studied `SimpleTextKnnVectorsWriter` a bit more and realized that we
   can't organize it this way. It needs to buffer the vectors from all documents
   for each field, and only then write each field's buffered vectors to the
   vectors data file, because that file is organized field by field, not doc by
   doc (as stored fields are, for example). If there were only a single vector
   field, we could potentially write to the vectors data file directly, as you
   suggested.
   
   Thus, we have to stick with `BufferingKnnVectorsWriter` for
   `SimpleTextKnnVectorsWriter`.
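
   To illustrate the constraint, here is a minimal sketch. It is not Lucene
   code: the class `FieldBufferingSketch`, its methods, and the text layout it
   emits are all hypothetical. It only shows why a field-by-field file forces
   buffering when documents arrive one at a time and may carry several vector
   fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Lucene's API: every name here is invented for illustration.
class FieldBufferingSketch {
  // Field name -> lines buffered for that field, in the order documents were added.
  private final Map<String, List<String>> pending = new LinkedHashMap<>();

  // Called doc by doc during indexing. Nothing can be written yet, because a
  // later document may still add a vector to any field seen so far.
  void addValue(String field, int docId, float[] vector) {
    pending
        .computeIfAbsent(field, f -> new ArrayList<>())
        .add("  doc " + docId + " " + Arrays.toString(vector));
  }

  // Called once at flush: each field's vectors are emitted as one contiguous block.
  String flush() {
    StringBuilder out = new StringBuilder();
    for (Map.Entry<String, List<String>> e : pending.entrySet()) {
      out.append("field ").append(e.getKey()).append('\n');
      for (String line : e.getValue()) {
        out.append(line).append('\n');
      }
    }
    pending.clear();
    return out.toString();
  }

  public static void main(String[] args) {
    FieldBufferingSketch w = new FieldBufferingSketch();
    // Values arrive interleaved across fields, one document at a time.
    w.addValue("title_vec", 0, new float[] {1f, 2f});
    w.addValue("body_vec", 0, new float[] {3f, 4f});
    w.addValue("title_vec", 1, new float[] {5f, 6f});
    // The output groups both title_vec entries together, then body_vec.
    System.out.print(w.flush());
  }
}
```

   With a single vector field the buffer would be unnecessary, since arrival
   order and file order would coincide.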



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

