kaivalnp commented on issue #14758: URL: https://github.com/apache/lucene/issues/14758#issuecomment-3294554802
> What if it only did so within one document, which would enable this "compile KNN prefilter to separate field's HNSW graph during indexing" efficiently? But not across documents

Thanks @mikemccand, I like this a lot! It _may_ be possible -- but it deviates from how we merge vectors: today, the [`KnnVectorsWriter`](https://github.com/apache/lucene/blob/ef9128ffe0588c2e1ebee8b200133f318f6f5ff9/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java#L44) [indexes](https://github.com/apache/lucene/blob/ef9128ffe0588c2e1ebee8b200133f318f6f5ff9/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java#L49-L50) and [merges](https://github.com/apache/lucene/blob/ef9128ffe0588c2e1ebee8b200133f318f6f5ff9/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java#L56-L57) vectors per-field, but it would need to work document-by-document to avoid buffering all vectors across multiple segments on-heap. This would, however, avoid the need to sort the vectors in the index! (the file could be structured document-wise instead, see below)

Today, the `.vec` file looks something like:

```
--- field 1 begin:
(vector for field 1, document d1) // position x0
(vector for field 1, document d2)
(vector for field 1, document d3)
--- field 1 end, field 2 begin:
(vector for field 2, document d1) // position x1
(vector for field 2, document d3)
--- field 2 end, field 3 begin:
(vector for field 3, document d1) // position x2
(vector for field 3, document d2)
--- field 3 end, and so on..
```

The `.vem` file contains tuples mapping fields -> vector "blocks", as `(position, length)`:
- The "block" info for `field 1` looks like `(x0, x1 - x0)`
- The "block" info for `field 2` looks like `(x1, x2 - x1)`
- ..and so on

The `ordToDoc` mapping for `field 1` looks like `{0 -> d1, 1 -> d2, 2 -> d3}`
The `ordToDoc` mapping for `field 2` looks like `{0 -> d1, 1 -> d3}`
The `ordToDoc` mapping for `field 3` looks like `{0 -> d1, 1 -> d2}`
..and so on

---

Now if we de-duplicate vectors _within_ a document (i.e. only store one unique copy of the vector, _across_ fields), then the boundary between "blocks" is broken. We could structure the file by documents instead:

```
--- document d1 begin:
(vector for field 1, document d1) // position x0
(vector for field 2, document d1) // position x1
(vector for field 3, document d1) // position x2
--- document d1 end, document d2 begin:
(vector for field 1, document d2) // position x3
(vector for field 3, document d2) // position x4
--- document d2 end, document d3 begin:
(vector for field 1, document d3) // position x5
(vector for field 2, document d3) // position x6
--- document d3 end, and so on..
```

..so an additional `ordToVec` mapping is needed (the `ordToDoc` mapping will remain unchanged):
- The `ordToVec` mapping for `field 1` looks like `{0 -> x0, 1 -> x3, 2 -> x5}`
- The `ordToVec` mapping for `field 2` looks like `{0 -> x1, 1 -> x4}`
- The `ordToVec` mapping for `field 3` looks like `{0 -> x2, 1 -> x6}`

We'd then need to merge two `.vec` files document-by-document instead of per-field, so that we can identify and de-dup vectors and immediately write them to the new segment, without buffering everything in memory

---

Pros (against segment-level de-duping):
- Faster indexing / merge (avoids sorting)
- Gains back some data locality benefits if a vector reorder policy is used
- Need not break the current convention of neighbor ordinals of HNSW nodes being in increasing order (if we did segment-level de-duping, these neighbor ordinals would need to be stored in increasing order of underlying vectors to regain _some_ data locality benefits)
- `ordToVec` would compress well, like you mentioned

Cons (against segment-level de-duping):
- Larger index (cannot de-dup _across_ fields)

Cons (in general):
- We'll lose some data locality benefits (since vectors _within_ a field are now spread out across the file, instead of being present in contiguous chunks). This is true for both document-level and segment-level de-duping

I'm really leaning towards the per-document de-duping now!

---

> and then `fieldB = fieldA.newLinkedField("name")`, so `fieldB` knows its sharing from `fieldA`'s vector

This would shift the responsibility of "identifying" duplicates to the user, and IMO Lucene should take care of it internally. It can be done by maintaining a document-level set of vectors during indexing and merge, and will (probably) not add much burden in terms of time, space, or code complexity..

> Isn't this how the default Codec (currently `Lucene103Codec`) works?
> It's per-field HNSW format, but by default all fields use the same instance of `KnnVectorsWriter` when writing a segment?

While this does use a single `KnnVectorsWriter` -- the [`PerFieldKnnVectorsFormat`](https://github.com/apache/lucene/blob/613c42855b5cbd5de7ce261bd289fbcd892e5a9f/lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java#L62) creates some internal writers [here](https://github.com/apache/lucene/blob/613c42855b5cbd5de7ce261bd289fbcd892e5a9f/lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java#L115) that may be different for different fields (even if they all use the same flat format internally) -- which we'd like to avoid..

I'll first attempt to hack luceneutil to (explicitly) create new vector fields to "simulate" a filter as this proposal would achieve (internally) -- and we can judge from benchmarks..
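For clarity, here's a rough sketch of the document-level de-duping plus `ordToVec` bookkeeping described above, as plain Java. All class and method names here are hypothetical (this is not actual Lucene API); real byte-level writing, the `.vem` metadata, and `ordToDoc` are omitted:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch: write a segment document-by-document, storing one unique
 * copy of each vector *within* a document (no de-duping across documents), and
 * record per-field ordToVec entries mapping ordinal -> position in the .vec file.
 */
public class DocLevelDedupSketch {

  // Per-field ordToVec: ordinal -> position of the vector in the .vec file
  private final Map<String, List<Long>> ordToVec = new HashMap<>();
  private long nextPosition = 0;

  /** Index one document's vector fields, de-duplicating identical vectors. */
  public void addDocument(Map<String, float[]> vectorsByField) {
    // Document-scoped set: identical vectors across fields of the SAME document
    // share one stored copy; the map is discarded after each document.
    Map<VectorKey, Long> seenInDoc = new HashMap<>();
    for (Map.Entry<String, float[]> e : vectorsByField.entrySet()) {
      VectorKey key = new VectorKey(e.getValue());
      Long position = seenInDoc.get(key);
      if (position == null) {
        position = appendVector(e.getValue()); // store one unique copy
        seenInDoc.put(key, position);
      }
      // The ordinal for this field is simply the current size of its ordToVec list
      ordToVec.computeIfAbsent(e.getKey(), f -> new ArrayList<>()).add(position);
    }
  }

  /** Advance the append position (actual byte writing omitted in this sketch). */
  private long appendVector(float[] vector) {
    long position = nextPosition;
    nextPosition += (long) vector.length * Float.BYTES;
    return position;
  }

  /** Resolve a (field, ordinal) pair to its vector's file position. */
  public long positionOf(String field, int ordinal) {
    return ordToVec.get(field).get(ordinal);
  }

  /** Wraps a float[] so it can be used as a hash key by content. */
  private record VectorKey(float[] vector) {
    @Override
    public boolean equals(Object o) {
      return o instanceof VectorKey other && Arrays.equals(vector, other.vector);
    }

    @Override
    public int hashCode() {
      return Arrays.hashCode(vector);
    }
  }
}
```

A merge could follow the same shape: iterate both segments' documents in docid order, call the equivalent of `addDocument` per merged document, and stream straight to the new segment without buffering whole fields on-heap.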