[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julie Tibshirani updated LUCENE-10404: -------------------------------------- Description: While searching each layer, HNSW tracks the nodes it has already visited using a BitSet. We could look into using something like IntHashSet instead. I tried out the idea quickly by switching to IntIntHashMap (which has already been copied from hppc) and saw an improvement in index performance. *Baseline:* 760896 msec to write vectors *Using IntIntHashMap:* 733017 msec to write vectors I noticed search performance actually got a little bit worse with the change -- that is something to look into. For background, it's good to be aware that HNSW can visit a lot of nodes. For example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search visits ~1000 - 15,000 docs depending on the recall. This number can increase when searching with deleted docs, especially if you hit a "pathological" case where the deleted docs happen to be closest to the query vector. was: While searching each layer, HNSW tracks the nodes it has already visited using a BitSet. We could look into using something like IntHashSet instead. I tried out the idea quickly by switching to IntIntHashMap (which has already been copied from hppc) and saw an improvement in index performance. **Baseline:** 760896 msec to write vectors **Using IntIntHashMap:** 733017 msec to write vectors I noticed search performance actually got a little bit worse with the change -- that is something to look into. For background, it's good to be aware that HNSW can visit a lot of nodes. For example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search visits ~1000 - 15,000 docs depending on the recall. This number can increase when searching with deleted docs, especially if you hit a "pathological" case where the deleted docs happen to be closest to the query vector. > Use hash set for visited nodes in HNSW search? > ---------------------------------------------- > > Key: LUCENE-10404 > URL: https://issues.apache.org/jira/browse/LUCENE-10404 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Julie Tibshirani > Priority: Minor > > While searching each layer, HNSW tracks the nodes it has already visited > using a BitSet. We could look into using something like IntHashSet instead. I > tried out the idea quickly by switching to IntIntHashMap (which has already > been copied from hppc) and saw an improvement in index performance. > *Baseline:* 760896 msec to write vectors > *Using IntIntHashMap:* 733017 msec to write vectors > I noticed search performance actually got a little bit worse with the change > -- that is something to look into. > For background, it's good to be aware that HNSW can visit a lot of nodes. For > example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search > visits ~1000 - 15,000 docs depending on the recall. This number can increase > when searching with deleted docs, especially if you hit a "pathological" case > where the deleted docs happen to be closest to the query vector. -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org