Re: [PR] Add a de-duplicating flat vector format [lucene]

via GitHub Thu, 23 Apr 2026 21:44:04 -0700


kaivalnp commented on PR #15979:
URL: https://github.com/apache/lucene/pull/15979#issuecomment-4310658702


   `luceneutil` KNN benchmarks on Cohere v3 vectors, 1024d, `DOT_PRODUCT` 
similarity, 1M vectors, 10K queries, no quantization.
   
   Important: `filterStrategy = index-time-filter` is used here (added in 
https://github.com/mikemccand/luceneutil/pull/468), which simply creates a 
second vector field containing a randomly selected `filterSelectivity` 
proportion of docs, and uses that smaller field for search.
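
To make the setup concrete, here is a minimal sketch of the index-time-filter idea as described above. It is not the `luceneutil` code: the field names `vector` and `filtered_vector`, the helper names, and the fixed-size random sample are all assumptions for illustration.

```python
import random


def select_filtered_docs(num_docs, filter_selectivity, seed=42):
    """Pick a random subset of doc ids covering `filter_selectivity` of all docs."""
    rng = random.Random(seed)
    k = round(num_docs * filter_selectivity)
    return set(rng.sample(range(num_docs), k))


def build_docs(vectors, filter_selectivity, seed=42):
    """Every doc gets the main vector field; selected docs also get the
    same vector in a second, smaller field (hypothetical field names)."""
    selected = select_filtered_docs(len(vectors), filter_selectivity, seed)
    docs = []
    for doc_id, vec in enumerate(vectors):
        doc = {"vector": vec}
        if doc_id in selected:
            # Same vector values indexed again: this is the duplication
            # that a de-duplicating format could avoid storing twice.
            doc["filtered_vector"] = vec
        docs.append(doc)
    return docs
```

Searching the smaller field then stands in for a filtered search over the full field, at the cost of storing the selected vectors a second time unless the format de-duplicates them.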
   
   `main`
   
   ```
   recall  latency(ms)  netCPU  visited  index(s)  index_docs/s  force_merge(s)  index_size(MB)  filterSelectivity
    0.988        2.987   2.982     8122    239.73       4171.41          418.68         6077.99               0.50
    0.991        2.398   2.397     7415    211.53       4727.37          296.96         4862.33               0.20
   ```
   
   This PR
   
   ```
   recall  latency(ms)  netCPU  visited  index(s)  index_docs/s  force_merge(s)  index_size(MB)  filterSelectivity
    0.988        2.724   2.720     8129    248.46       4024.79          466.43         4130.17               0.50
    0.991        2.173   2.170     7414    214.40       4664.16          396.85         4085.20               0.20
   ```
   
   As expected, if the same vector is indexed in a second field for 50% of 
docs, the index size increases by ~50% on `main` (roughly 4GB -> 6GB). With 
the de-duplicating vector format, however, the same vectors are re-used and 
there is (almost) no increase in index size!
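
A back-of-the-envelope estimate of the raw vector storage shows where these sizes come from. This sketch assumes 4-byte float32 values, sizes in MiB, and ignores the HNSW graph and other index files, so the reported `index_size(MB)` values are expected to come out somewhat higher. The function names are made up for illustration.

```python
NUM_DOCS = 1_000_000
DIMS = 1024
BYTES_PER_FLOAT = 4  # assumption: unquantized float32 vectors


def raw_vector_mb(num_vectors, dims=DIMS):
    """Raw storage for `num_vectors` float32 vectors, in MiB."""
    return num_vectors * dims * BYTES_PER_FLOAT / (1024 * 1024)


def expected_sizes(filter_selectivity):
    """Estimated raw vector storage with and without de-duplication."""
    primary = raw_vector_mb(NUM_DOCS)
    secondary = raw_vector_mb(round(NUM_DOCS * filter_selectivity))
    return {
        # Second field stores its own copy of every selected vector.
        "without_dedup": primary + secondary,
        # Second field re-uses the vectors already stored for the first.
        "with_dedup": primary,
    }


for selectivity in (0.5, 0.2):
    print(selectivity, expected_sizes(selectivity))
```

Under these assumptions the primary field alone needs about 3906 MiB, and a duplicated second field adds about 1953 MiB at selectivity 0.5 or about 781 MiB at 0.2, which is in line with the growth seen on `main` and its absence with this PR.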
   
   Indexing is slightly slower (<5%) and force-merges are noticeably slower 
(roughly 11% to 34% in the runs above; currently looking into speeding this 
up).
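
The slowdowns can be checked directly against the two tables. The snippet below only re-does that arithmetic on the numbers reported above; the dictionary layout and key names are made up for illustration.

```python
# Timings in seconds, keyed by filterSelectivity, copied from the tables above.
MAIN = {
    0.50: {"index_s": 239.73, "force_merge_s": 418.68},
    0.20: {"index_s": 211.53, "force_merge_s": 296.96},
}
THIS_PR = {
    0.50: {"index_s": 248.46, "force_merge_s": 466.43},
    0.20: {"index_s": 214.40, "force_merge_s": 396.85},
}


def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def slowdowns():
    """Percent slowdown of this PR relative to main, per selectivity and metric."""
    return {
        sel: {
            metric: pct_change(MAIN[sel][metric], THIS_PR[sel][metric])
            for metric in MAIN[sel]
        }
        for sel in MAIN
    }


for sel, metrics in slowdowns().items():
    for metric, pct in metrics.items():
        print(f"selectivity={sel:.2f} {metric}: {pct:+.1f}%")
```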

