kaivalnp commented on PR #15979: URL: https://github.com/apache/lucene/pull/15979#issuecomment-4310658702
`luceneutil` KNN benchmarks on Cohere v3 vectors, 1024d, `DOT_PRODUCT` similarity, 1M vectors, 10K queries, no quantization.

Important: `filterStrategy = index-time-filter` is used here (added in https://github.com/mikemccand/luceneutil/pull/468), which simply creates a new vector field with randomly selected `filterSelectivity` proportion of docs, and uses the smaller field for search.

`main`
```
recall  latency(ms)  netCPU  visited  index(s)  index_docs/s  force_merge(s)  index_size(MB)  filterSelectivity
 0.988        2.987   2.982     8122    239.73       4171.41          418.68         6077.99               0.50
 0.991        2.398   2.397     7415    211.53       4727.37          296.96         4862.33               0.20
```

This PR
```
recall  latency(ms)  netCPU  visited  index(s)  index_docs/s  force_merge(s)  index_size(MB)  filterSelectivity
 0.988        2.724   2.720     8129    248.46       4024.79          466.43         4130.17               0.50
 0.991        2.173   2.170     7414    214.40       4664.16          396.85         4085.20               0.20
```

As we can imagine, if the same vector is indexed in a second field for 50% of docs, the index size increases by ~50% on `main` (roughly 4GB -> 6GB). However with the de-duplicating vector format, the same vectors are re-used and there is (almost) no increase in index size!

Indexing is slightly slower (<5%) + merges are non-trivially slower (up to 33%, currently looking into speeding this up).
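
For anyone who wants to re-derive the percentages quoted above, here is a small sketch that computes the relative change between `main` and this PR for a few columns. The numbers are copied from the two tables; the dictionary layout and the `pct_change` helper are my own naming for illustration, not anything from `luceneutil`.

```python
# Benchmark values copied from the tables above, keyed by filterSelectivity.
# The structure and names here are illustrative only.
main = {
    0.50: {"latency_ms": 2.987, "index_s": 239.73,
           "force_merge_s": 418.68, "index_size_mb": 6077.99},
    0.20: {"latency_ms": 2.398, "index_s": 211.53,
           "force_merge_s": 296.96, "index_size_mb": 4862.33},
}
this_pr = {
    0.50: {"latency_ms": 2.724, "index_s": 248.46,
           "force_merge_s": 466.43, "index_size_mb": 4130.17},
    0.20: {"latency_ms": 2.173, "index_s": 214.40,
           "force_merge_s": 396.85, "index_size_mb": 4085.20},
}


def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# deltas[selectivity][metric] = percent change of this PR relative to main.
deltas = {
    sel: {metric: pct_change(main[sel][metric], this_pr[sel][metric])
          for metric in main[sel]}
    for sel in main
}

if __name__ == "__main__":
    for sel in sorted(deltas, reverse=True):
        print(f"filterSelectivity={sel:.2f}")
        for metric, delta in deltas[sel].items():
            print(f"  {metric:>14}: {delta:+.1f}%")
```

Negative values mean this PR is lower than `main` (smaller index, lower latency); positive values mean it is higher (slower indexing and force-merge). The largest merge slowdown is the 0.20 selectivity row, at roughly 33.6%.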
