mayya-sharipova commented on pull request #536:
URL: https://github.com/apache/lucene/pull/536#issuecomment-1003460979


   > I think it would be prudent to check the size increase/decrease from this change for some dataset/parameter choices
   
   I've checked the index sizes and the size actually increased by 4-5%:
   
   **glove-100-angular**
   Before the change: 517 MB; after the change: 542 MB
   
   **sift-128-euclidean**
   Before the change: 542 MB; after the change: 564 MB
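
As a quick check of the 4-5% figure, the relative increase can be recomputed from the sizes reported above. This is only an arithmetic sketch; the helper name `pct_increase` is made up for illustration.

```python
def pct_increase(before_mb: float, after_mb: float) -> float:
    """Relative size increase in percent."""
    return (after_mb - before_mb) / before_mb * 100.0


# Index sizes in MB as reported in this comment.
glove = pct_increase(517, 542)
sift = pct_increase(542, 564)

print(f"glove-100-angular:  +{glove:.1f}%")
print(f"sift-128-euclidean: +{sift:.1f}%")
```

Both datasets land between 4% and 5%.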
   
   With the proposed design, even though we save space by not storing offsets, we encode each node's neighbours as fixed-width *Int* instead of the current variable-length *VInt*, which increases disk usage.
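
To illustrate why *Int* takes more space than *VInt*, here is a minimal sketch that counts the bytes each encoding needs for a list of neighbour ids. It assumes the usual VInt scheme of 7 payload bits per byte and that raw (not delta-encoded) non-negative ids are written; the function names and the sample ids are hypothetical and not taken from the PR.

```python
def vint_len(value: int) -> int:
    """Bytes needed for a non-negative int as a VInt (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("sketch only covers non-negative values")
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


def encoded_sizes(neighbours: list[int]) -> tuple[int, int]:
    """Return (vint_bytes, int_bytes) for a list of neighbour ids."""
    vint_bytes = sum(vint_len(v) for v in neighbours)
    int_bytes = 4 * len(neighbours)  # fixed-width 32-bit Int
    return vint_bytes, int_bytes


# Hypothetical neighbour ids for one node.
sample = [5, 130, 20000, 999999]
vint_bytes, int_bytes = encoded_sizes(sample)
print(vint_bytes, int_bytes)  # 9 16
```

Small ids fit in one or two VInt bytes, while a fixed-width Int always costs four, so the saving from dropping offsets can be outweighed on disk.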
   
   On the upside:
   - we save heap memory because offsets no longer need to be loaded. For an index with 1M docs this is approximately 1,000,000 nodes * 8 bytes per node = 8 MB. That does not look like much, but it grows proportionally with more indices, more vector fields, and more docs; for a field with 100M docs it would be 800 MB.
   - no performance degradation from the extra step of calculating offsets:
   
   **glove-100-angular**
   
   |             | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ----------- | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10  |           0.496 |     3549.216 |            0.481 |      4027.582 |
   | n_cands=20  |           0.560 |     3423.073 |            0.553 |      3369.245 |
   | n_cands=40  |           0.635 |     2686.622 |            0.631 |      2633.146 |
   | n_cands=80  |           0.708 |     1889.805 |            0.707 |      1890.202 |
   | n_cands=120 |           0.747 |     1476.286 |            0.748 |      1451.970 |
   | n_cands=200 |           0.790 |     1037.742 |            0.791 |      1013.580 |
   | n_cands=400 |           0.840 |      607.183 |            0.841 |       572.152 |
   | n_cands=600 |           0.865 |      433.513 |            0.865 |       402.504 |
   | n_cands=800 |           0.880 |      341.052 |            0.881 |       320.057 |
   
   **sift-128-euclidean**
   
   |             | baseline recall | baseline QPS | candidate recall | candidate QPS |
   | ----------- | --------------: | -----------: | ---------------: | ------------: |
   | n_cands=10  |           0.747 |     3891.531 |            0.745 |      3926.015 |
   | n_cands=20  |           0.817 |     3359.364 |            0.817 |      3365.934 |
   | n_cands=40  |           0.889 |     2590.605 |            0.889 |      2568.544 |
   | n_cands=80  |           0.944 |     1798.558 |            0.944 |      1806.776 |
   | n_cands=120 |           0.964 |     1383.721 |            0.964 |      1425.713 |
   | n_cands=200 |           0.983 |      973.862 |            0.983 |      1002.114 |
   | n_cands=400 |           0.994 |      586.816 |            0.994 |       599.229 |
   | n_cands=600 |           0.997 |      427.128 |            0.997 |       437.296 |
   | n_cands=800 |           0.998 |      341.178 |            0.998 |       349.690 |
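
The heap estimate in the list above, and the idea of calculating an offset instead of loading it, can be sketched as follows. The first function only reproduces the arithmetic from this comment. The second assumes a hypothetical fixed-width layout in which every node owns a slot of one neighbour count plus `max_conn` Int entries; that layout, and both function names, are assumptions for illustration rather than the format implemented in the PR.

```python
BYTES_PER_OFFSET = 8  # one long offset per node, as in the estimate above
BYTES_PER_INT = 4


def offsets_heap_bytes(num_nodes: int) -> int:
    """Heap needed to hold one stored offset per node."""
    return num_nodes * BYTES_PER_OFFSET


def neighbour_block_offset(node: int, max_conn: int) -> int:
    """Computed file offset of a node's neighbour block.

    Assumes (hypothetically) that each node has a fixed-size slot:
    one Int for the neighbour count, then max_conn Int entries.
    """
    slot_bytes = (1 + max_conn) * BYTES_PER_INT
    return node * slot_bytes


print(offsets_heap_bytes(1_000_000))    # 8000000   (about 8 MB)
print(offsets_heap_bytes(100_000_000))  # 800000000 (about 800 MB)
print(neighbour_block_offset(3, 16))    # 204
```

With a fixed slot size the offset is a single multiplication per lookup, which is consistent with the QPS numbers above staying close to the baseline.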
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


