kaivalnp commented on PR #15979: URL: https://github.com/apache/lucene/pull/15979#issuecomment-4349663487
Iterated with the AI to iron out some performance bottlenecks. **Disclaimer**: There's generous use of AI involved in the code, so functions may be more verbose than required! I have reviewed **most** of it locally, and will do a refactor to try and make it cleaner / more human friendly soon. #### Current Approach See `dedup-vectors-format-design.md` in this PR for the AI-generated summary, but the flow is roughly: 1. During indexing, maintain a per-field `List` of vectors on-heap ([same as the flat format](https://github.com/apache/lucene/blob/23d400cabd8e63d26d594ec94e866ef5e22d2a12/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsWriter.java#L319)). 2. During flush (occurs field-wise): 1. Raw vectors across _all_ fields are stored in a single "vector dictionary". 2. Iterate over the `List` of vectors in a field, compute hash for each vector and store it in a map of `hash -> (field number, ordinal in field)`. 3. On hash collision, check for equality with the vector referenced in the map: 1. IF vectors are equal, do not write a separate copy to the "vector dictionary" and point to the existing entry. 2. ELSE use a hash probing strategy (move to `hash + 1`, then `hash + 2`, and so on). 4. At the end, only unique vectors are stored in the dictionary. 5. Each field stores an additional `ord -> position in dictionary` mapping (also see "Query Time Cost"). 3. During merge (also occurs field-wise): 1. Get a view of vectors being merged across segments, as an iterator of `(docid, vector, original segment idx, ord in original segment)`. 2. Compute hash of each vector, and maintain a map of `hash -> (field number, original segment idx, ord in original segment)`. 3. On hash collision, we have two paths: 1. IF both vectors (the one being merged + the one that collided) come from the _same segment, which is also de-duplicating_: we re-use the vector equality that has been resolved earlier (i.e. whether both vectors point to the same location in the original segment's dictionary). 2. ELSE load the other vector onto heap for an explicit equality check. This happens when: 1. Hash collisions occur across segments, where equality has not been resolved earlier. 2. When a normal flat format was used earlier, and we switched to the de-duplicating one now (e.g. during Lucene upgrade). 4. The logic to write to the dictionary is the same as flush. #### Indexing Time Complexity 1. One additional hash of each vector during indexing and merge (equivalent to one vector similarity computation done during HNSW?). 2. Adding into the hash map (constant on average, but expansion of map on reaching capacity can add overhead). 3. Equality checks for all duplicated vectors + hash-collisions. 1. During indexing, these are all on-heap -- adding a fixed cost. 2. During merging, this is not expensive when both vectors are present in the same segment (already resolved during indexing, which is re-used). 3. However, cross-segment equality checks can become costly: loading vectors in a random-access fashion onto heap (page cache misses). 4. Perhaps we can use a hash with more bits to reduce false positives, or only guarantee de-duplication within a document? #### Query Time Cost 1. One additional lookup per vector operation: 1. Earlier: vectors were stored contiguously, so the offset in the flat file was a direct computation of `ord * vectorByteSize`. 2. Now, vectors across all fields are stored in a single file, with multiple ordinals possibly referencing the same vector (i.e. no longer 1:1). 3. The offset in the "vector dictionary" is something like `ordToPosition[ord] * vectorByteSize` instead. 4. Currently, `ordToPosition` is an array stored on-heap, perhaps this can be index-backed too? All-in-all, it does not seem _too_ expensive compared to the flat format? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
