kaivalnp commented on issue #14758:
URL: https://github.com/apache/lucene/issues/14758#issuecomment-3294554802

   > What if it only did so within one document, which would enable this "compile KNN prefilter to separate field's HNSW graph during indexing" efficiently? But not across documents
   
   Thanks @mikemccand, I like this a lot!
   
   It _may_ be possible -- but it deviates from how we merge vectors today: the [`KnnVectorsWriter`](https://github.com/apache/lucene/blob/ef9128ffe0588c2e1ebee8b200133f318f6f5ff9/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java#L44) [indexes](https://github.com/apache/lucene/blob/ef9128ffe0588c2e1ebee8b200133f318f6f5ff9/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java#L49-L50) and [merges](https://github.com/apache/lucene/blob/ef9128ffe0588c2e1ebee8b200133f318f6f5ff9/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java#L56-L57) vectors field-by-field, but it would need to go document-by-document to avoid buffering all vectors across multiple segments on-heap. This would, however, avoid the need to sort the vectors in the index! (the file could be structured document-wise instead, see below)
   
   Today, the `.vec` file looks something like:
   
   ```
   --- field 1 begin:
   (vector for field 1, document d1) // position x0
   (vector for field 1, document d2)
   (vector for field 1, document d3)
   --- field 1 end, field 2 begin:
   (vector for field 2, document d1) // position x1
   (vector for field 2, document d3)
   --- field 2 end, field 3 begin:
   (vector for field 3, document d1) // position x2
   (vector for field 3, document d2)
   --- field 3 end, and so on..
   ```
   
   The `.vem` file contains tuples to map fields -> vector "blocks", as `(position, length)`:
   - The "block" info for `field 1` looks like `(x0, x1 - x0)`
   - The "block" info for `field 2` looks like `(x1, x2 - x1)`
   - ..and so on
   
   - The `ordToDoc` mapping for `field 1` looks like `{0 -> d1, 1 -> d2, 2 -> d3}`
   - The `ordToDoc` mapping for `field 2` looks like `{0 -> d1, 1 -> d3}`
   - The `ordToDoc` mapping for `field 3` looks like `{0 -> d1, 1 -> d2}`
   - ..and so on
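
   As a rough illustration of the above (hypothetical class and field names, not the actual codec classes), locating a vector today only needs the field's block start plus the ordinal, and `ordToDoc` translates the ordinal back to a document:

   ```java
   // Illustration only -- not actual Lucene code: addressing a vector in the
   // current per-field layout of the .vec file
   class PerFieldVectorAddressing {
     long blockStartPosition; // e.g. x0 for field 1, from the .vem metadata
     int vectorByteSize;      // dimension * bytes-per-component
     int[] ordToDoc;          // e.g. {d1, d2, d3} for field 1

     long vectorFilePosition(int ord) {
       // ordinals are dense per field, so the position is implied by the ordinal
       return blockStartPosition + (long) ord * vectorByteSize;
     }

     int doc(int ord) {
       return ordToDoc[ord];
     }
   }
   ```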
   
   ---
   
   Now if we de-duplicate vectors _within_ a document (i.e. only store one 
unique copy of the vector, _across_ fields), then the boundary between "blocks" 
is broken. We could structure the file by documents instead:
   
   ```
   --- document d1 begin:
   (vector for field 1, document d1) // position x0
   (vector for field 2, document d1) // position x1
   (vector for field 3, document d1) // position x2
    --- document d1 end, document d2 begin:
   (vector for field 1, document d2) // position x3
   (vector for field 3, document d2) // position x4
    --- document d2 end, document d3 begin:
   (vector for field 1, document d3) // position x5
   (vector for field 2, document d3) // position x6
   --- document d3 end, and so on..
   ```
   
   ..so an additional `ordToVec` mapping is needed (the `ordToDoc` mapping will remain unchanged):
   - The `ordToVec` mapping for `field 1` looks like `{0 -> x0, 1 -> x3, 2 -> x5}`
   - The `ordToVec` mapping for `field 2` looks like `{0 -> x1, 1 -> x4}`
   - The `ordToVec` mapping for `field 3` looks like `{0 -> x2, 1 -> x6}`
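
   With the document-wise layout, the same lookup (again with made-up names) goes through the extra `ordToVec` indirection, since an ordinal no longer implies a position:

   ```java
   // Illustration only -- not actual Lucene code: addressing a vector in the
   // proposed document-wise layout of the .vec file
   class PerDocumentVectorAddressing {
     long[] ordToVec; // e.g. {x0, x3, x5} for field 1
     int[] ordToDoc;  // unchanged, e.g. {d1, d2, d3} for field 1

     long vectorFilePosition(int ord) {
       return ordToVec[ord]; // explicit indirection instead of blockStart + ord * size
     }

     int doc(int ord) {
       return ordToDoc[ord];
     }
   }
   ```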
   
   We'd need to merge two `.vec` files document-by-document instead (rather than per-field), so that duplicate vectors within each document can be identified, de-duped, and written to the new segment immediately, without buffering everything in memory (which we'd have to do if we still merged per-field).
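
   A rough sketch of what that document-by-document merge could look like (all interfaces and names below are stand-ins for illustration, not the actual codec APIs; deletions, sorting and graph building are glossed over):

   ```java
   import java.util.Arrays;
   import java.util.HashMap;
   import java.util.Map;

   // Sketch only: merge segments document-by-document, de-duping vectors within each document
   class DocByDocMergeSketch {
     interface SegmentVectors {
       int maxDoc();
       int[] fieldsWithVectors(int doc);   // fields of this doc that have a vector
       float[] vector(int field, int doc); // the vector indexed for (field, doc)
     }

     interface SegmentWriter {
       long writeVector(float[] vector); // appends to the new .vec file, returns its position
       void addOrdinal(int field, int newDoc, long vectorPosition); // grows ordToDoc / ordToVec
     }

     void merge(SegmentVectors[] segments, SegmentWriter out) {
       int newDoc = 0;
       for (SegmentVectors segment : segments) {
         for (int doc = 0; doc < segment.maxDoc(); doc++, newDoc++) {
           // de-dup only within this document: vector contents -> position of the single copy written
           Map<VectorKey, Long> seenInDoc = new HashMap<>();
           for (int field : segment.fieldsWithVectors(doc)) {
             float[] vector = segment.vector(field, doc);
             long position = seenInDoc.computeIfAbsent(new VectorKey(vector), k -> out.writeVector(vector));
             out.addOrdinal(field, newDoc, position);
           }
         }
       }
     }

     // wraps a float[] so it can be used as a hash key by value
     record VectorKey(float[] vector) {
       @Override public boolean equals(Object o) {
         return o instanceof VectorKey other && Arrays.equals(vector, other.vector);
       }
       @Override public int hashCode() {
         return Arrays.hashCode(vector);
       }
     }
   }
   ```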
   
   ---
   
   Pros (against segment-level de-duping):
   - Faster indexing / merge (avoids sorting)
   - Gain back some data locality benefits if a vector reorder policy is used
   - No need to break the current convention of a node's HNSW neighbor ordinals being stored in increasing order (with segment-level de-duping, these neighbor ordinals would instead need to be stored in increasing order of their underlying vector positions to regain _some_ data locality benefits)
   - `ordToVec` would compress well, as you mentioned
   
   Cons (against segment-level de-duping):
   - Larger index (cannot de-dup _across_ fields)
   
   Cons (in general):
   - We'll lose some data locality benefits (since vectors _within_ a field are 
now spread out across the file, instead of being present in contiguous chunks). 
This is true for both document-level and segment-level de-duping
   
   I'm really leaning towards the per-document de-duping now!
   
   ---
   
   > and then `fieldB = fieldA.newLinkedField("name")`, so `fieldB` knows its sharing from `fieldA`'s vector
   
   This would shift the responsibility of "identifying" duplicates to the user, and IMO Lucene should take care of this internally. It can be done by maintaining a document-level set of vectors during indexing and merge, and will (probably) not add much overhead in time, space, or code complexity..
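
   For example, a toy version of that document-level de-dup (again just a sketch, not the actual indexing chain -- a document has few vector fields, so even a linear scan is cheap):

   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;

   // Sketch only: track the unique vectors seen for the current document while indexing
   class PerDocumentVectorDedup {
     private final List<float[]> uniqueVectorsInCurrentDoc = new ArrayList<>();

     // returns the slot of the unique copy shared by all fields of this document with this vector
     int addVector(float[] vector) {
       for (int slot = 0; slot < uniqueVectorsInCurrentDoc.size(); slot++) {
         if (Arrays.equals(uniqueVectorsInCurrentDoc.get(slot), vector)) {
           return slot; // duplicate within this document -- share the existing copy
         }
       }
       uniqueVectorsInCurrentDoc.add(vector.clone());
       return uniqueVectorsInCurrentDoc.size() - 1;
     }

     void finishDocument() {
       uniqueVectorsInCurrentDoc.clear(); // never de-dup across documents
     }
   }
   ```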
   
   > Isn't this how the default Codec (currently `Lucene103Codec`) works? It's per-field HNSW format, but by default all fields use the same instance of `KnnVectorsWriter` when writing a segment?
   
   While this does use a single `KnnVectorsWriter` -- the [`PerFieldKnnVectorsFormat`](https://github.com/apache/lucene/blob/613c42855b5cbd5de7ce261bd289fbcd892e5a9f/lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java#L62) creates internal per-field writers [here](https://github.com/apache/lucene/blob/613c42855b5cbd5de7ce261bd289fbcd892e5a9f/lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java#L115) that may be different for different fields (even if they all use the same flat format internally) -- which we'd like to avoid..
   
   I'll first attempt to hack luceneutil to (explicitly) create new vector fields that "simulate" a filter the way this proposal would (internally) -- roughly as sketched below -- and we can judge from benchmarks..
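
   Something along these lines (the field names, similarity function and `matchesFilter` flag are made up for illustration):

   ```java
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.KnnFloatVectorField;
   import org.apache.lucene.index.VectorSimilarityFunction;

   // Sketch of the luceneutil hack: index the same vector into an extra field for documents
   // matching the filter, so a search over "vector_filtered" simulates a pre-filtered HNSW graph
   class SimulatedFilterFields {
     Document buildDocument(float[] vector, boolean matchesFilter) {
       Document doc = new Document();
       doc.add(new KnnFloatVectorField("vector", vector, VectorSimilarityFunction.DOT_PRODUCT));
       if (matchesFilter) {
         // a duplicate copy of the vector today -- exactly what this proposal would de-dup within the document
         doc.add(new KnnFloatVectorField("vector_filtered", vector, VectorSimilarityFunction.DOT_PRODUCT));
       }
       return doc;
     }
   }
   ```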

