benwtrent opened a new pull request, #14078:
URL: https://github.com/apache/lucene/pull/14078
This adds a binary quantized format for vectors. The key ideas are:
- Centroid centered vectors
- Asymmetric quantization
- Individually optimized scalar quantization
This gives Lucene a single scalar quantization format that supports
high-quality vector retrieval, even down to a single bit.
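
To make the three ideas concrete, here is a minimal sketch of centering a vector on a centroid and quantizing it to one bit per dimension. It is not the code in this PR: the class and method names are hypothetical, and the per-vector interval is simply the min/max of the centered vector, standing in for the optimized interval that the PR computes.

```java
// Illustrative sketch only; names are hypothetical and not part of Lucene.
final class OneBitQuantizerSketch {

  /** Packed bits plus the per-vector values that a format like this would store. */
  static final class Quantized {
    final byte[] packedBits; // one bit per dimension, most significant bit first
    final float lower;       // lower end of the per-vector interval
    final float upper;       // upper end of the per-vector interval
    final int setBits;       // sum of the quantized components

    Quantized(byte[] packedBits, float lower, float upper, int setBits) {
      this.packedBits = packedBits;
      this.lower = lower;
      this.upper = upper;
      this.setBits = setBits;
    }
  }

  static Quantized quantize(float[] vector, float[] centroid) {
    int dim = vector.length;
    float[] centered = new float[dim];
    float lower = Float.POSITIVE_INFINITY;
    float upper = Float.NEGATIVE_INFINITY;
    // Step 1: center the vector on the centroid and track its range.
    for (int i = 0; i < dim; i++) {
      centered[i] = vector[i] - centroid[i];
      lower = Math.min(lower, centered[i]);
      upper = Math.max(upper, centered[i]);
    }
    // Step 2: ASSUMPTION: use the plain min/max as the interval. The PR
    // optimizes this interval per vector; that step is omitted here.
    float midpoint = (lower + upper) / 2f;
    // Step 3: with a single bit, a component maps to 1 when it lies above
    // the midpoint of the interval and to 0 otherwise.
    byte[] packed = new byte[(dim + 7) / 8];
    int setBits = 0;
    for (int i = 0; i < dim; i++) {
      if (centered[i] > midpoint) {
        packed[i >> 3] |= (byte) (1 << (7 - (i & 7)));
        setBits++;
      }
    }
    return new Quantized(packed, lower, upper, setBits);
  }
}
```

Because the interval is stored per vector, each vector can use the full one-bit range regardless of how other vectors in the segment are distributed.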
For all similarity types, each vector is laid out on disk as follows:
| quantized vector | lower quantile | upper quantile | additional correction | sum quantized components |
| - | - | - | - | - |
| (vector_dimension/8) bytes | float | float | float | short |
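
As a quick check of what this layout costs per vector, the sketch below adds up the fields in the table. The class and method names are hypothetical, and rounding the bit count up to a whole byte for dimensions that are not a multiple of 8 is an assumption; the table only states `vector_dimension/8`.

```java
// Illustrative sketch only; names are hypothetical and not part of Lucene.
final class BinaryVectorLayoutSketch {

  /** Bytes needed for one quantized vector plus its corrective terms. */
  static int bytesPerVector(int dims) {
    // ASSUMPTION: round up to a whole byte when dims is not a multiple of 8.
    int quantizedBytes = (dims + 7) / 8;
    int lowerQuantile = Float.BYTES;        // float
    int upperQuantile = Float.BYTES;        // float
    int additionalCorrection = Float.BYTES; // float
    int sumQuantized = Short.BYTES;         // short
    return quantizedBytes + lowerQuantile + upperQuantile
        + additionalCorrection + sumQuantized;
  }

  /** Bytes needed for the same vector stored as raw 32-bit floats. */
  static int rawFloatBytes(int dims) {
    return dims * Float.BYTES;
  }

  public static void main(String[] args) {
    for (int dims : new int[] {768, 1024}) {
      System.out.println(dims + " dims: " + bytesPerVector(dims)
          + " bytes quantized vs " + rawFloatBytes(dims) + " bytes raw");
    }
  }
}
```

For 768 dimensions this gives 96 + 14 = 110 bytes per vector, against 3072 bytes for raw floats.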
During segment merge and HNSW graph building, a second, temporary file is
written containing the query-quantized vectors, centered on the configured
centroids. One downside is that this temporary file is larger than the regular
quantized vector index, because asymmetric quantization stores the query-side
vectors at higher precision to preserve information. The file is deleted once
the merge completes. I think it can eventually be removed altogether.
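
The asymmetry described above can be sketched as follows: the query side is quantized to several bits per dimension while the indexed side keeps one bit, and the two are combined in an integer dot product. This is only an illustration under stated assumptions, not the PR's scoring code: the names are hypothetical, the query bit width is a free parameter here (the description above does not state it), and the corrective terms needed to turn this raw value into a similarity score are left out.

```java
// Illustrative sketch only; names are hypothetical and not part of Lucene.
final class AsymmetricDotSketch {

  /** Maps x into an integer level in [0, 2^bits - 1] over the interval [lower, upper]. */
  static int quantize(float x, float lower, float upper, int bits) {
    if (upper <= lower) {
      return 0; // degenerate interval: every value maps to level 0
    }
    int maxLevel = (1 << bits) - 1;
    float clamped = Math.max(lower, Math.min(upper, x));
    return Math.round((clamped - lower) / (upper - lower) * maxLevel);
  }

  /**
   * Raw integer dot product between a multi-bit query (one level per
   * dimension) and a 1-bit indexed vector packed most significant bit first.
   */
  static int dot(byte[] queryLevels, byte[] packedIndexBits) {
    int sum = 0;
    for (int i = 0; i < queryLevels.length; i++) {
      int bit = (packedIndexBits[i >> 3] >> (7 - (i & 7))) & 1;
      sum += bit * queryLevels[i];
    }
    return sum;
  }
}
```

With `b` bits per dimension on the query side and one bit on the indexed side, the query-quantized copy needs roughly `b` times the space of the packed index vectors, which is consistent with the temporary file being the larger of the two.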
Here are the results for _Recall@10|50_:
| Dataset | Old PR | This PR | Improvement (recall points) |
| --- | --- | --- | --- |
| Cohere 768 | 0.933 | 0.938 | 0.5% |
| Cohere 1024 | 0.932 | 0.945 | 1.3% |
| E5-Small-v2 | 0.972 | 0.975 | 0.3% |
| GIST-1M | 0.740 | 0.989 | 24.9% |
Even with the optimization step, HNSW indexing time increases only
marginally:
| Dataset | Old PR | This PR | Difference |
| --- | --- | --- | --- |
| Cohere 768 | 368.62s | 372.95s | +1% |
| Cohere 1024 | 307.09s | 314.08s | +2% |
| E5-Small-v2 | 227.37s | 229.83s | +1% |
The consistent improvement in recall and the flexibility to use different bit
widths make this format and quantization technique strongly preferable.
Eventually, we should consider moving the existing scalar quantization over to
this new optimized quantizer. However, that would change the on-disk format
and scoring, so I didn't do it in this PR.
supersedes: https://github.com/apache/lucene/pull/13651
Co-Authors: @tveasey @john-wagster @mayya-sharipova