[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?

Michael Sokolov (Jira) Fri, 22 Jul 2022 15:11:06 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570195#comment-17570195
 ]


Michael Sokolov commented on LUCENE-10404:
------------------------------------------

The default `topK` in KnnGraphTester is 100, so these test runs maintain 
result queues of 100 or 150 when searching. During indexing the queue size is 
driven by beamWidth, and 32/64 is lower than is typical, I think. Still, it is 
encouraging that we see gains both in searching (queue sizes of 100-150) and 
in indexing (queue sizes of 32-64).

I won't be able to run more tests for a few days, but I agree that it would be 
interesting to see how the gains correlate with the queue sizes. I just wanted 
to get a quick look first. I will run some more exhaustive tests next week.

> Use hash set for visited nodes in HNSW search?
> ----------------------------------------------
>
>                 Key: LUCENE-10404
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10404
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Julie Tibshirani
>            Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited 
> using a BitSet. We could look into using something like IntHashSet instead. I 
> tried out the idea quickly by switching to IntIntHashMap (which has already 
> been copied from hppc) and saw an improvement in indexing performance. 
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change 
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For 
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search 
> visits ~1,000 to 15,000 docs depending on the recall. This number can increase 
> when searching with deleted docs, especially if you hit a "pathological" case 
> where the deleted docs happen to be closest to the query vector.
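
To make the trade-off in the description concrete, here is a minimal sketch of 
the two ways of tracking visited nodes. It is not Lucene's code: it assumes 
java.util.BitSet and java.util.HashSet as stand-ins for the BitSet and the 
hppc-derived IntIntHashMap mentioned above, and the names VisitedSketch, 
Visited, markVisited and countDistinct are made up for illustration.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Lucene.
final class VisitedSketch {

    /** Records a node and reports whether this is the first time it is seen. */
    interface Visited {
        boolean markVisited(int node);
    }

    /** Bit-set variant: one bit per node in the graph, whether visited or not. */
    static final class BitSetVisited implements Visited {
        private final BitSet bits;

        BitSetVisited(int graphSize) {
            this.bits = new BitSet(graphSize);
        }

        @Override
        public boolean markVisited(int node) {
            if (bits.get(node)) {
                return false;
            }
            bits.set(node);
            return true;
        }
    }

    /** Hash-set variant: storage grows with the number of nodes actually visited. */
    static final class HashSetVisited implements Visited {
        // Boxed Integers keep the sketch dependency-free; a primitive int set
        // would avoid the boxing overhead.
        private final Set<Integer> seen = new HashSet<>();

        @Override
        public boolean markVisited(int node) {
            return seen.add(node);
        }
    }

    /** Counts how many candidates are seen for the first time, in order. */
    static int countDistinct(Visited visited, int[] candidates) {
        int firstVisits = 0;
        for (int candidate : candidates) {
            if (visited.markVisited(candidate)) {
                firstVisits++;
            }
        }
        return firstVisits;
    }

    public static void main(String[] args) {
        int graphSize = 1_200_000; // roughly the glove-100-angular doc count
        int[] candidates = {5, 17, 5, 999_999, 17, 42};
        System.out.println(countDistinct(new BitSetVisited(graphSize), candidates));
        System.out.println(countDistinct(new HashSetVisited(), candidates));
    }
}
```

Both variants report the same set of first visits; they differ in cost. The 
bit set pays for the whole graph (allocation and clearing scale with the doc 
count), while the hash set pays per visited node with a higher per-operation 
cost. Which one wins presumably depends on how many nodes a search visits 
relative to the graph size, which is why the queue sizes matter here.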


