Re: [I] Support multiple HNSW graphs backed by the same vectors [lucene]

via GitHub Fri, 19 Sep 2025 07:17:07 -0700


mikemccand commented on issue #14758:
URL: https://github.com/apache/lucene/issues/14758#issuecomment-3312379720


   These are awesome results @kaivalnp!  And this was only 200K docs -- would 
the gains be larger or smaller with bigger indices?
   
   Also, it's quite disturbing that even at a not-so-restrictive filter (50%), 
the recall is already quite a bit worse than with the index-time filter (0.882 
vs 0.929).  And then it gets worse -- at a 5% accept filter it's 0.830 vs 0.980.
   
   Let's get your luceneutil changes merged -- this is useful for benchmarking 
-- I'll try to review that PR soon.
   
   These gains are awesome -- and for starters Lucene users can simply 
duplicate their vectors (like you did for this test), wasting index storage.  
That already works today.
   
   So ... how do we make progress on NOT wasting disk storage (dedup somehow)?  
If we dedup only within one doc, that is (maybe?) simpler than dedup across 
docs.  The only downside is probably that some Lucene users will have 
substantial dups across docs, and cross-doc dedup would have helped them.  But 
perhaps such users should use index-time joins instead... or maybe they don't 
realize they have dups.
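
   To make the storage trade-off concrete, here is a minimal sketch (plain 
Python, not Lucene code) that counts how many vectors would need to be stored 
under no dedup, within-doc dedup, and cross-doc dedup.  The field names 
`vec_all` and `vec_filtered` and the tiny corpus are hypothetical, and the 
sketch assumes dups are exact (bit-identical) copies.

```python
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]
Doc = Dict[str, Vector]  # hypothetical: field name -> vector


def stored_no_dedup(docs: List[Doc]) -> int:
    # Every field of every doc stores its own copy of the vector.
    return sum(len(doc) for doc in docs)


def stored_within_doc_dedup(docs: List[Doc]) -> int:
    # Identical vectors inside one doc are stored once; nothing is
    # shared between docs.
    return sum(len(set(doc.values())) for doc in docs)


def stored_across_docs_dedup(docs: List[Doc]) -> int:
    # Identical vectors are stored once for the whole corpus.
    return len({vec for doc in docs for vec in doc.values()})


# Hypothetical corpus: each doc duplicates one vector into two fields,
# and the first two docs also happen to share the same vector.
docs = [
    {"vec_all": (0.1, 0.2), "vec_filtered": (0.1, 0.2)},
    {"vec_all": (0.1, 0.2), "vec_filtered": (0.1, 0.2)},
    {"vec_all": (0.3, 0.4), "vec_filtered": (0.3, 0.4)},
]

print(stored_no_dedup(docs))           # 6
print(stored_within_doc_dedup(docs))   # 3
print(stored_across_docs_dedup(docs))  # 2
```

   In this toy case within-doc dedup already recovers the storage wasted by 
duplicating a vector into a second field; cross-doc dedup only helps further 
when different docs carry the same vector.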


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


