kaivalnp commented on PR #15979:
URL: https://github.com/apache/lucene/pull/15979#issuecomment-4349663487

   Iterated with the AI to iron out some performance bottlenecks.
   
   **Disclaimer**: There's generous use of AI involved in the code, so 
functions may be more verbose than required! I have reviewed **most** of it 
locally, and will do a refactor to try and make it cleaner / more human 
friendly soon.
   
   #### Current Approach
   
   See `dedup-vectors-format-design.md` in this PR for the AI-generated 
summary, but the flow is roughly:
   1. During indexing, maintain a per-field `List` of vectors on-heap ([same as 
the flat 
format](https://github.com/apache/lucene/blob/23d400cabd8e63d26d594ec94e866ef5e22d2a12/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsWriter.java#L319)).
   2. During flush (occurs field-wise):
        1. Raw vectors across _all_ fields are stored in a single "vector 
dictionary".
        2. Iterate over the `List` of vectors in a field, compute hash for each 
vector and store it in a map of `hash -> (field number, ordinal in field)`.
        3. On hash collision, check for equality with the vector referenced in 
the map:
                1. IF vectors are equal, do not write a separate copy to the 
"vector dictionary" and point to the existing entry.
                2. ELSE use a hash probing strategy (move to `hash + 1`, then 
`hash + 2`, and so on).
        4. At the end, only unique vectors are stored in the dictionary.
        5. Each field stores an additional `ord -> position in dictionary` 
mapping (also see "Query Time Cost").
   3. During merge (also occurs field-wise):
        1. Get a view of vectors being merged across segments, as an iterator 
of `(docid, vector, original segment idx, ord in original segment)`.
        2. Compute hash of each vector, and maintain a map of `hash -> (field 
number, original segment idx, ord in original segment)`.
        3. On hash collision, we have two paths:
                1. IF both vectors (the one being merged + the one that 
collided) come from the _same segment, which is also de-duplicating_: we re-use 
the vector equality that has been resolved earlier (i.e. whether both vectors 
point to the same location in the original segment's dictionary).
                2. ELSE load the other vector onto heap for an explicit 
equality check. This happens when:
                        1. Hash collisions occur across segments, where 
equality has not been resolved earlier.
                        2. When a normal flat format was used earlier, and we 
switched to the de-duplicating one now (e.g. during Lucene upgrade).
        4. The logic to write to the dictionary is the same as flush.
   
   #### Indexing Time Complexity
   
   1. One additional hash of each vector during indexing and merge (equivalent 
to one vector similarity computation done during HNSW?).
   2. Adding into the hash map (constant on average, but expansion of map on 
reaching capacity can add overhead).
   3. Equality checks for all duplicated vectors + hash-collisions.
        1. During indexing, these are all on-heap -- adding a fixed cost.
        2. During merging, this is not expensive when both vectors are present 
in the same segment (already resolved during indexing, which is re-used).
        3. However, cross-segment equality checks can become costly: loading 
vectors in a random-access fashion onto heap (page cache misses).
        4. Perhaps we can use a hash with more bits to reduce false positives, 
or only guarantee de-duplication within a document?
   
   #### Query Time Cost
   
   1. One additional lookup per vector operation:
        1. Earlier: vectors were stored contiguously, so the offset in the flat 
file was a direct computation of `ord * vectorByteSize`.
        2. Now, vectors across all fields are stored in a single file, with 
multiple ordinals possibly referencing the same vector (i.e. no longer 1:1).
        3. The offset in the "vector dictionary" is something like 
`ordToPosition[ord] * vectorByteSize` instead.
        4. Currently, `ordToPosition` is an array stored on-heap, perhaps this 
can be index-backed too?
   
   All-in-all, it does not seem _too_ expensive compared to the flat format?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to