[ 
https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571594#comment-17571594
 ] 

Michael Sokolov commented on LUCENE-10404:
------------------------------------------

Here is a test using GloVe 100-dim vectors plus much more aggressive indexing 
settings, and we can see that here the IntIntHashMap is adding cost

h3. baseline

{{
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.991    0.92   10000   50      64      500     150     12068
0.996    1.11   10000   100     64      500     200     0
0.999    1.45   10000   200     64      500     300     0
1.000    1.94   10000   400     64      500     500     0
0.955    2.53   100000  50      64      500     150     463142
0.973    3.03   100000  100     64      500     200     0
0.988    4.44   100000  200     64      500     300     0
0.997    6.57   100000  400     64      500     500     0
0.895    3.44   1000000 50      64      500     150     9811483
0.920    4.33   1000000 100     64      500     200     0
0.950    6.20   1000000 200     64      500     300     0
0.974    9.53   1000000 400     64      500     500     0
}}

h3. IntIntHashMap

{{
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.991    1.03   10000   50      64      500     150     13274
0.996    1.24   10000   100     64      500     200     0
0.999    1.62   10000   200     64      500     300     0
1.000    2.09   10000   400     64      500     500     0
0.955    2.47   100000  50      64      500     150     485131
0.973    3.31   100000  100     64      500     200     0
0.988    4.66   100000  200     64      500     300     0
0.997    7.26   100000  400     64      500     500     0
0.895    3.58   1000000 50      64      500     150     10173818
0.920    4.49   1000000 100     64      500     200     0
0.950    6.45   1000000 200     64      500     300     0
0.974    9.91   1000000 400     64      500     500     0
}}


> Use hash set for visited nodes in HNSW search?
> ----------------------------------------------
>
>                 Key: LUCENE-10404
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10404
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Julie Tibshirani
>            Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited 
> using a BitSet. We could look into using something like IntHashSet instead. I 
> tried out the idea quickly by switching to IntIntHashMap (which has already 
> been copied from hppc) and saw an improvement in index performance. 
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change 
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For 
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search 
> visits ~1000 - 15,000 docs depending on the recall. This number can increase 
> when searching with deleted docs, especially if you hit a "pathological" case 
> where the deleted docs happen to be closest to the query vector.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to