Re: [PR] Add a de-duplicating flat vector format [lucene]

via GitHub Wed, 29 Apr 2026 21:22:45 -0700


kaivalnp commented on PR #15979:
URL: https://github.com/apache/lucene/pull/15979#issuecomment-4349663487

Iterated with the AI to iron out some performance bottlenecks.

**Disclaimer**: There's generous use of AI involved in the code, so
functions may be more verbose than required! I have reviewed **most** of it
locally, and will do a refactor to try and make it cleaner / more human
friendly soon.

#### Current Approach

See `dedup-vectors-format-design.md` in this PR for the AI-generated
summary, but the flow is roughly:
1. During indexing, maintain a per-field `List` of vectors on-heap ([same as
the flat
format](https://github.com/apache/lucene/blob/23d400cabd8e63d26d594ec94e866ef5e22d2a12/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsWriter.java#L319)).
2. During flush (occurs field-wise):
1. Raw vectors across _all_ fields are stored in a single "vector
dictionary".
2. Iterate over the `List` of vectors in a field, compute hash for each
vector and store it in a map of `hash -> (field number, ordinal in field)`.
3. On hash collision, check for equality with the vector referenced in
the map:
1. IF vectors are equal, do not write a separate copy to the
"vector dictionary" and point to the existing entry.
2. ELSE use a hash probing strategy (move to `hash + 1`, then
`hash + 2`, and so on).
4. At the end, only unique vectors are stored in the dictionary.
5. Each field stores an additional `ord -> position in dictionary`
mapping (also see "Query Time Cost").
3. During merge (also occurs field-wise):
1. Get a view of vectors being merged across segments, as an iterator
of `(docid, vector, original segment idx, ord in original segment)`.
2. Compute hash of each vector, and maintain a map of `hash -> (field
number, original segment idx, ord in original segment)`.
3. On hash collision, we have two paths:
1. IF both vectors (the one being merged + the one that
collided) come from the _same segment, which is also de-duplicating_: we re-use
the vector equality that has been resolved earlier (i.e. whether both vectors
point to the same location in the original segment's dictionary).
2. ELSE load the other vector onto heap for an explicit
equality check. This happens when:
1. Hash collisions occur across segments, where
equality has not been resolved earlier.
2. When a normal flat format was used earlier, and we
switched to the de-duplicating one now (e.g. during Lucene upgrade).
4. The logic to write to the dictionary is the same as flush.

#### Indexing Time Complexity

1. One additional hash of each vector during indexing and merge (equivalent
to one vector similarity computation done during HNSW?).
2. Adding into the hash map (constant on average, but expansion of map on
reaching capacity can add overhead).
3. Equality checks for all duplicated vectors + hash-collisions.
1. During indexing, these are all on-heap -- adding a fixed cost.
2. During merging, this is not expensive when both vectors are present
in the same segment (already resolved during indexing, which is re-used).
3. However, cross-segment equality checks can become costly: loading
vectors in a random-access fashion onto heap (page cache misses).
4. Perhaps we can use a hash with more bits to reduce false positives,
or only guarantee de-duplication within a document?

#### Query Time Cost

1. One additional lookup per vector operation:
1. Earlier: vectors were stored contiguously, so the offset in the flat
file was a direct computation of `ord * vectorByteSize`.
2. Now, vectors across all fields are stored in a single file, with
multiple ordinals possibly referencing the same vector (i.e. no longer 1:1).
3. The offset in the "vector dictionary" is something like
`ordToPosition[ord] * vectorByteSize` instead.
4. Currently, `ordToPosition` is an array stored on-heap, perhaps this
can be index-backed too?

All-in-all, it does not seem _too_ expensive compared to the flat format?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add a de-duplicating flat vector format [lucene]

Reply via email to