mikemccand commented on issue #14758: URL: https://github.com/apache/lucene/issues/14758#issuecomment-3312379720
These are awesome results @kaivalnp! And this was only 200K docs. With larger indices, would the gains be more or less?

Also, it's quite disturbing that even at a not-so-restrictive filter (50%), the recall is already quite a bit worse than with the index-time filter (0.882 vs 0.929). And then it gets worse: at a 5% accept filter it's 0.830 vs 0.980.

Let's get your luceneutil changes merged, since this is useful for benchmarking. I'll try to review that PR soon.

These gains are awesome, and for starters Lucene users can simply duplicate their vectors (like you did for this test), at the cost of wasted index storage. That already works today.

So ... how do we make progress on NOT wasting disk storage (dedup somehow)? If we dedup only within one doc, it is (maybe?) simpler than dedup across docs. The only downside is that some Lucene users probably have substantial dups across docs, and cross-doc dedup would have helped them. But perhaps such users should use index-time joins instead ... or maybe they don't realize they have dups.
