Julie Tibshirani created LUCENE-10404:
-----------------------------------------

             Summary: Use hash set for visited nodes in HNSW search?
                 Key: LUCENE-10404
                 URL: https://issues.apache.org/jira/browse/LUCENE-10404
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Julie Tibshirani


While searching each layer, HNSW tracks the nodes it has already visited using 
a BitSet. We could look into using something like IntHashSet instead. I tried 
out the idea quickly by switching to IntIntHashMap (which has already been 
copied from hppc) and saw an improvement in index performance. 

**Baseline:** 760896 msec to write vectors
**Using IntIntHashMap:** 733017 msec to write vectors

I noticed search performance actually got a little bit worse with the change -- 
that is something to look into.

For background, it's good to be aware that HNSW can visit a lot of nodes. For 
example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search 
visits ~1000 - 15,000 docs depending on the recall. This number can increase 
when searching with deleted docs, especially if you hit a "pathological" case 
where the deleted docs happen to be closest to the query vector.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to