mikemccand commented on PR #15472: URL: https://github.com/apache/lucene/pull/15472#issuecomment-3655554236
Thanks @abernardi597 -- this sounds like awesome progress.

It's curious how recall of JVector is not as good with no quantization, but then as we quantize it gets better (vs Lucene), and then exceeds Lucene at extreme 1 bit quantization. Do you have 8 bit quantization results for that [last table](https://github.com/apache/lucene/pull/15472#issuecomment-3647265760)? We could see if it participates in that trend too...

I see the JVector indices are somewhat larger (~15%) than the Lucene indices in your [last table](https://github.com/apache/lucene/pull/15472#issuecomment-3647265760)?

How did you make that table, with the alternating HNSW / JVector rows? Is this part of your pending PR changes for luceneutil? That's a nice capability! We could use it for comparing Faiss as well. Maybe open a separate luceneutil spinoff for that?

Does this mean the HNSW graph is still bushier in JVector? Or that maybe JVector is less efficient in how it stores its graph (which would make sense -- it's optimizing for fewer disk seeks, not necessarily total storage?).

Actually `knnPerfTest.py` in luceneutil prints stats about the graphs, at least for Lucene's HNSW `KnnVectorsFormat`, like this:

```
Leaf 0 has 3 layers
Leaf 0 has 103976 documents
Graph level=2 size=2, Fanout min=1, mean=1.00, max=1, meandelta=0.00
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1   1
Graph level=1 size=1549, Fanout min=1, mean=14.82, max=64, meandelta=5443.59
%   0  10  20  30  40  50  60  70  80  90 100
    0   4   6   8  10  13  15  18  21  28  64
Graph level=0 size=103976, Fanout min=1, mean=18.57, max=128, meandelta=4003.06
%   0  10  20  30  40  50  60  70  80  90 100
    0   5   7   9  11  14  17  21  27  38 128
```

This is telling us the percentiles for node-neighbor-count, I think. So at graph level=0 (the bottom-most one, which has 1:1 vector <-> node), the P40 node has 11 neighbors.

If we implement that for JVector then we could draw a more direct comparison of their HNSW graphs? Maybe open a spinoff issue in luceneutil?
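As a rough illustration of the kind of per-level summary described above, here is a minimal Python sketch that computes min/mean/max and decile percentiles of node-neighbor counts from adjacency lists. Everything in it is an assumption for illustration: `fanout_stats` is a hypothetical helper, the input shape (one list of neighbor ids per node) is made up, and the nearest-rank percentile rule is a guess that may not match what `knnPerfTest.py` actually does. It does not call any Lucene or JVector API.

```python
from statistics import mean


def fanout_stats(neighbors, percentiles=range(0, 101, 10)):
    """Summarize node-neighbor counts for one graph level.

    neighbors: one list of neighbor ids per node (hypothetical input shape,
    not an actual Lucene or JVector data structure).
    """
    counts = sorted(len(n) for n in neighbors)
    if not counts:
        raise ValueError("graph level has no nodes")
    last = len(counts) - 1
    # Nearest-rank percentile on the sorted counts (an assumption; the real
    # tool may use a different rule). Integer math avoids float rounding.
    pct = {p: counts[(p * last + 50) // 100] for p in percentiles}
    return {
        "size": len(counts),
        "min": counts[0],
        "mean": mean(counts),
        "max": counts[-1],
        "percentiles": pct,
    }


# Toy graph level with five nodes; not real Lucene or JVector data.
toy_level = [[1, 2], [0], [0, 3, 4], [2], [2]]
stats = fanout_stats(toy_level)
print(
    f"size={stats['size']} min={stats['min']} "
    f"mean={stats['mean']:.2f} max={stats['max']}"
)
print(" ".join(f"P{p}={v}" for p, v in stats["percentiles"].items()))
```

Running the same summary over neighbor lists extracted from each format's graph would give comparable numbers per level, and the `mean` field alone would be enough for a coarse bushiness comparison.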
Does JVector even expose enough transparency/APIs to compute these stats? We could at least compare the mean neighbor count of both for a coarse comparison.
