Re: [PR] [WIP] Revisiting JVector codec [lucene]

via GitHub Mon, 15 Dec 2025 05:09:01 -0800


mikemccand commented on PR #15472:
URL: https://github.com/apache/lucene/pull/15472#issuecomment-3655554236


   Thanks @abernardi597 -- this sounds like awesome progress.
   
   It's curious that JVector's recall is not as good as Lucene's with no 
quantization, but then improves relative to Lucene as we quantize, and finally 
exceeds Lucene at extreme 1-bit quantization.
   
   Do you have 8-bit quantization results for that [last 
table](https://github.com/apache/lucene/pull/15472#issuecomment-3647265760)?  
We could see if it participates in that trend too...
   
   I see the JVector indices are somewhat larger (~15%) than the Lucene indices 
in your [last 
table](https://github.com/apache/lucene/pull/15472#issuecomment-3647265760).
   
   How did you make that table, with the alternating HNSW / JVector rows?  Is 
this part of your pending PR changes for luceneutil?  That's a nice capability! 
 We could use it for comparing Faiss as well.  Maybe open a separate luceneutil 
spinoff for that?
   
   Does the larger index size mean the HNSW graph is still bushier in JVector?  Or that maybe 
JVector is less efficient in how it stores its graph (which would make sense -- 
it's optimizing for fewer disk seeks, not necessarily total storage?).  
Actually `knnPerfTest.py` in luceneutil prints stats about the graphs, at least 
for Lucene's HNSW `KnnVectorsFormat`, like this:
   
   ```
   Leaf 0 has 3 layers
   Leaf 0 has 103976 documents
   Graph level=2 size=2, Fanout min=1, mean=1.00, max=1, meandelta=0.00
   %   0  10  20  30  40  50  60  70  80  90 100
       0   1   1   1   1   1   1   1   1   1   1
   Graph level=1 size=1549, Fanout min=1, mean=14.82, max=64, meandelta=5443.59
   %   0  10  20  30  40  50  60  70  80  90 100
       0   4   6   8  10  13  15  18  21  28  64
   Graph level=0 size=103976, Fanout min=1, mean=18.57, max=128, meandelta=4003.06
   %   0  10  20  30  40  50  60  70  80  90 100
       0   5   7   9  11  14  17  21  27  38 128
   ```
   
   I think this is telling us the percentiles of the per-node neighbor count.  
So on graph level=0 (the bottom-most one, with a 1:1 vector <-> node mapping), 
nodes at P40 have 11 neighbors.
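
   For illustration, here is a rough sketch of how such a fanout summary could 
be computed from per-node neighbor lists.  This is not luceneutil's actual 
code: `fanout_stats` is a hypothetical helper, and the percentile rule is an 
assumption that may not match what `knnPerfTest.py` prints.

```python
def fanout_stats(neighbors):
    """Summarize neighbor counts for one graph level.

    neighbors: list of neighbor-id lists, one entry per node on the level
    (hypothetical input shape, assumed for this sketch).
    """
    counts = sorted(len(n) for n in neighbors)
    size = len(counts)
    if size == 0:
        return None
    # Assumed percentile rule: index p% of the way into the sorted counts,
    # clamped to the last element for p=100.
    percentiles = [
        counts[min(size - 1, (p * size) // 100)] for p in range(0, 101, 10)
    ]
    return {
        "size": size,
        "min": counts[0],
        "mean": sum(counts) / size,
        "max": counts[-1],
        "percentiles": percentiles,
    }


if __name__ == "__main__":
    # Toy graph, not real benchmark data.
    toy_level = [[1], [0, 2], [1, 3, 4], [2], [2]]
    print(fanout_stats(toy_level))
```

   Running the same summary per level for both formats would give directly 
comparable rows.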
   
   If we implement that for JVector then we could draw a more direct comparison 
of how their HNSW graphs differ.  Maybe open a spinoff issue in luceneutil?  
Does JVector even expose enough transparency/APIs to compute these stats?  We 
could at least compare the mean neighbor count of both for a coarse comparison.
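
   A minimal sketch of that coarse comparison, assuming both graphs can be 
dumped as per-node neighbor lists (whether JVector exposes this is exactly the 
open question).  `mean_fanout` and `compare_mean_fanout` are hypothetical 
names, not existing luceneutil functions.

```python
def mean_fanout(neighbors):
    """Mean neighbor count over all nodes; 0.0 for an empty graph."""
    if not neighbors:
        return 0.0
    return sum(len(n) for n in neighbors) / len(neighbors)


def compare_mean_fanout(lucene_neighbors, jvector_neighbors):
    """Return (lucene_mean, jvector_mean, jvector/lucene ratio)."""
    lucene_mean = mean_fanout(lucene_neighbors)
    jvector_mean = mean_fanout(jvector_neighbors)
    # A ratio above 1.0 would suggest the JVector graph is bushier.
    ratio = jvector_mean / lucene_mean if lucene_mean else float("nan")
    return lucene_mean, jvector_mean, ratio


if __name__ == "__main__":
    # Toy adjacency lists, not real benchmark data.
    lucene_toy = [[1], [0]]
    jvector_toy = [[1, 2], [0, 2], [0, 1]]
    print(compare_mean_fanout(lucene_toy, jvector_toy))
```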


