[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572514#comment-17572514 ] Michael Sokolov commented on LUCENE-10404: --

Yes, I think your understanding is correct. Another difference is the dimension of the vectors, which was 256 in the first case and 100 in the second. I think this second case will tend to emphasize the cost of the graph traversal, since dot-product costs will be a smaller proportion of the overall time spent. Indeed, I think some specialization would help there.

> Use hash set for visited nodes in HNSW search?
> --
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Julie Tibshirani
> Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited using a BitSet. We could look into using something like IntHashSet instead. I tried out the idea quickly by switching to IntIntHashMap (which has already been copied from hppc) and saw an improvement in index performance.
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search visits ~1,000 - 15,000 docs depending on the recall. This number can increase when searching with deleted docs, especially if you hit a "pathological" case where the deleted docs happen to be closest to the query vector.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
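To make the traversal-vs-distance trade-off above concrete, here is a minimal, hypothetical sketch (plain Java collections, not Lucene's implementation) of the shape of a per-layer graph expansion: every candidate edge pays one visited-set probe before any vector math, so when the dot product is cheap (100-dim vs 256-dim vectors) that probe becomes a larger share of the total cost.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

class TraversalSketch {
    // Illustrative only: counts nodes reachable from an entry point.
    // neighbors[i] lists the graph neighbors of node i.
    static int countVisited(int[][] neighbors, int entryPoint) {
        Set<Integer> visited = new HashSet<>(); // stand-in for the visited tracker
        Deque<Integer> candidates = new ArrayDeque<>();
        candidates.push(entryPoint);
        visited.add(entryPoint);
        while (!candidates.isEmpty()) {
            int node = candidates.pop();
            for (int n : neighbors[node]) {
                if (visited.add(n)) { // one set probe per candidate edge
                    // real HNSW would score the candidate's vector here;
                    // the probe above is pure traversal overhead
                    candidates.push(n);
                }
            }
        }
        return visited.size();
    }
}
```

Real HNSW also maintains a bounded priority queue of best candidates; the sketch keeps only the part relevant to this issue, the visited-set probe on every edge.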
[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572080#comment-17572080 ] Julie Tibshirani commented on LUCENE-10404: ---

Checking that I understand the numbers: in addition to indexing slowing down, it looks like search latency is a bit worse too. The main difference is that the graph is better connected (higher maxConn) and we explore more nodes during indexing (higher beamWidth). The slowdown is not huge but significant (~5% slower for both indexing and search). This would support the theory that for larger numbers of 'visited' nodes, the IntIntHashMap solution doesn't perform as well. Maybe we could consider a more custom approach, as Adrien suggests?
[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571594#comment-17571594 ] Michael Sokolov commented on LUCENE-10404: --

Here is a test using GloVe 100-dim vectors plus much more aggressive indexing settings, and we can see that here the IntIntHashMap is adding cost.

h3. baseline

{{recall  latency  nDoc  fanout  maxConn  beamWidth  visited  index ms}}
{{0.991   0.92     1     50      64       500        150      12068}}
{{0.996   1.11     1     100     64       500        200      0}}
{{0.999   1.45     1     200     64       500        300      0}}
{{1.000   1.94     1     400     64       500        500      0}}
{{0.955   2.53     10    50      64       500        150      463142}}
{{0.973   3.03     10    100     64       500        200      0}}
{{0.988   4.44     10    200     64       500        300      0}}
{{0.997   6.57     10    400     64       500        500      0}}
{{0.895   3.44     100   50      64       500        150      9811483}}
{{0.920   4.33     100   100     64       500        200      0}}
{{0.950   6.20     100   200     64       500        300      0}}
{{0.974   9.53     100   400     64       500        500      0}}

h3. IntIntHashMap

{{recall  latency  nDoc  fanout  maxConn  beamWidth  visited  index ms}}
{{0.991   1.03     1     50      64       500        150      13274}}
{{0.996   1.24     1     100     64       500        200      0}}
{{0.999   1.62     1     200     64       500        300      0}}
{{1.000   2.09     1     400     64       500        500      0}}
{{0.955   2.47     10    50      64       500        150      485131}}
{{0.973   3.31     10    100     64       500        200      0}}
{{0.988   4.66     10    200     64       500        300      0}}
{{0.997   7.26     10    400     64       500        500      0}}
{{0.895   3.58     100   50      64       500        150      10173818}}
{{0.920   4.49     100   100     64       500        200      0}}
{{0.950   6.45     100   200     64       500        300      0}}
{{0.974   9.91     100   400     64       500        500      0}}
[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570195#comment-17570195 ] Michael Sokolov commented on LUCENE-10404: --

The default {{topK}} in KnnGraphTester is 100, so these test runs are maintaining results queues of 100 or 150 (when searching). During indexing this is driven by beamWidth, and 32/64 is lower than is typical, I think. Still, I think it's encouraging that we see gains in both searching (when the queue size is 100-150) and indexing (when it is 32-64). I won't be able to run more tests for a few days, but I agree that it would be interesting to see how the gains correlate with the queue sizes. I was motivated to get a quick look! I will run some more exhaustive tests next week.
[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570111#comment-17570111 ] Julie Tibshirani commented on LUCENE-10404: ---

Those numbers look good! Is my understanding right that these experiments use k=10, and fanout = 0 and 50? Maybe we could also try with a high fanout (like 100 or 500) to double-check the case where we need to visit a larger number of nodes.
[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569429#comment-17569429 ] Michael Sokolov commented on LUCENE-10404: --

I tried using IntIntHashMap (mapping to 1 for visited nodes) and it does indeed seem to give a small speedup. I haven't had a chance to run luceneutil or look at profiler output, but here are some numbers from KnnGraphTester for an internal dataset. The numbers can be a bit noisy, but they are consistently better for the hash map version.

h3. IntIntHashMap

{{recall latency nDoc fanout maxConn beamWidth visited index ms}}
{{0.935 0.37 1 0 16 32 100 1566}}
{{0.965 0.49 1 50 16 32 150 0}}
{{0.962 0.41 1 0 16 64 100 2655}}
{{0.982 0.57 1 50 16 64 150 0}}
{{0.941 0.38 1 0 32 32 100 1473}}
{{0.969 0.51 1 50 32 32 150 0}}
{{0.966 0.45 1 0 32 64 100 2611}}
{{0.985 0.59 1 50 32 64 150 0}}
{{0.907 0.52 10 0 16 32 100 19850}}
{{0.940 0.72 10 50 16 32 150 0}}
{{0.941 0.60 10 0 16 64 100 38614}}
{{0.966 0.84 10 50 16 64 150 0}}
{{0.916 0.55 10 0 32 32 100 19243}}
{{0.949 0.75 10 50 32 32 150 0}}
{{0.952 0.66 10 0 32 64 100 38205}}
{{0.973 0.93 10 50 32 64 150 0}}
{{0.859 0.66 100 0 16 32 100 273112}}
{{0.897 0.92 100 50 16 32 150 0}}
{{0.917 0.85 100 0 16 64 100 523325}}
{{0.946 1.06 100 50 16 64 150 0}}
{{0.874 0.80 100 0 32 32 100 274816}}
{{0.913 1.05 100 50 32 32 150 0}}
{{0.929 0.98 100 0 32 64 100 564762}}

h3. baseline

{{recall latency nDoc fanout maxConn beamWidth visited index ms}}
{{0.935 0.38 1 0 16 32 100 1614}}
{{0.965 0.50 1 50 16 32 150 0}}
{{0.962 0.45 1 0 16 64 100 2687}}
{{0.982 0.57 1 50 16 64 150 0}}
{{0.941 0.40 1 0 32 32 100 1504}}
{{0.969 0.51 1 50 32 32 150 0}}
{{0.966 0.44 1 0 32 64 100 2652}}
{{0.985 0.58 1 50 32 64 150 0}}
{{0.907 0.54 10 0 16 32 100 21449}}
{{0.940 0.74 10 50 16 32 150 0}}
{{0.941 0.64 10 0 16 64 100 39962}}
{{0.966 0.88 10 50 16 64 150 0}}
{{0.916 0.59 10 0 32 32 100 20554}}
{{0.949 0.80 10 50 32 32 150 0}}
{{0.952 0.72 10 0 32 64 100 40980}}
{{0.973 1.04 10 50 32 64 150 0}}
{{0.859 0.75 100 0 16 32 100 300514}}
{{0.897 0.96 100 50 16 32 150 0}}
{{0.917 0.84 100 0 16 64 100 563259}}
{{0.946 1.12 100 50 16 64 150 0}}
{{0.874 0.86 100 0 32 32 100 303186}}
{{0.913 1.09 100 50 32 32 150 0}}
{{0.929 1.04 100 0 32 64 100 580725}}
{{0.958 1.38 100 50 32 64 150 0}}
[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569237#comment-17569237 ] Julie Tibshirani commented on LUCENE-10404: ---

As a note, LUCENE-10592 changes the indexing strategy so that we build the graph as each document is added, instead of waiting until 'flush'. In the PR, graph building still shares a single FixedBitSet to track the 'visited' set, but it's continuously resized since we don't know the full number of docs up-front. So maybe switching to a hash set could help even more after that change is merged.
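A minimal illustration of the resizing concern (using {{java.util.BitSet}} as a stand-in for the continuously-resized FixedBitSet; this is not the Lucene code): a bitset tracker's footprint scales with the largest node id seen so far, while a hash set would scale with the number of nodes actually visited.

```java
import java.util.BitSet;

class GrowableVisited {
    // java.util.BitSet grows automatically when a high bit is set;
    // Lucene's FixedBitSet would need an explicit copy-and-grow here.
    private final BitSet bits = new BitSet();

    /** Returns true if the node had not been visited before. */
    boolean add(int node) {
        if (bits.get(node)) {
            return false; // already visited
        }
        bits.set(node);
        return true;
    }

    /** Words the tracker currently spans: driven by the largest node id, not by #visited. */
    int sizeInWords() {
        return (bits.length() + 63) / 64;
    }

    void clear() {
        bits.clear();
    }
}
```

Adding just one high-numbered node forces the tracker to span the whole id range, which is exactly the overhead a per-visit-sized hash set would avoid.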
[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486894#comment-17486894 ] Adrien Grand commented on LUCENE-10404: ---

{quote}For example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search visits ~1,000 - 15,000 docs depending on the recall.{quote}

So if some searches at index time need to visit ~15k docs, then we'd likely need a hash set with a backing array of 32,768 entries (the smallest power of two that is more than 2x the number of items in the hash set). Since we'd never shrink the backing array, this means that even when we only visit 1,000 nodes, we would then have to clear the 32k entries of the backing array, so the hashset#clear calls might not be free (though still cheaper than on FixedBitSet). This makes me wonder if we should consider using a specialized hash set implementation for this use-case, e.g. one that uses a FixedBitSet to track used buckets instead of sentinel values (which I believe is what HPPC's IntHashSet uses). This would make {{IntHashSet#add}} calls a bit more expensive (maybe?), but clearing the set would then only consist of clearing the bitset that tracks used slots instead of the backing array, so we'd need to clear 32,768 bits = 4kB of data instead of 32,768 * 4 bytes = 128kB.
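The specialized set suggested above can be sketched as follows. This is a hypothetical illustration, not Lucene or HPPC code: an open-addressed int set whose {{clear()}} resets only a bitset of used buckets, so clearing costs capacity/8 bytes instead of capacity*4 bytes. ({{java.util.BitSet}} stands in for FixedBitSet; capacity is fixed and there is no growth or overflow handling, for brevity.)

```java
import java.util.BitSet;

class SparseIntSet {
    private final int[] keys;  // backing array; stale values are ignored via 'used'
    private final BitSet used; // marks occupied buckets; clearing this empties the set
    private final int mask;

    SparseIntSet(int capacityPowerOfTwo) {
        keys = new int[capacityPowerOfTwo];
        used = new BitSet(capacityPowerOfTwo);
        mask = capacityPowerOfTwo - 1;
    }

    /** Returns true if the value was newly added, i.e. not visited before.
     *  Slightly costlier than a sentinel-based add: one extra bitset probe per slot. */
    boolean add(int value) {
        int slot = mix(value) & mask;
        while (used.get(slot)) {      // linear probing over occupied buckets
            if (keys[slot] == value) {
                return false;         // already present
            }
            slot = (slot + 1) & mask;
        }
        keys[slot] = value;
        used.set(slot);
        return true;
    }

    /** Zeroes capacity/64 longs (4kB for 32,768 buckets) instead of
     *  capacity ints (128kB); the int[] is left dirty on purpose. */
    void clear() {
        used.clear();
    }

    private static int mix(int v) {
        int h = v * 0x9E3779B9;       // cheap integer hash to spread sequential ids
        return h ^ (h >>> 16);
    }
}
```

The trade-off matches the comment above: add() pays an extra bitset lookup per probe, while clear() touches 32x less memory than resetting the backing array.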