[
https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569420#comment-17569420
]
Michael Sokolov edited comment on LUCENE-10404 at 7/21/22 1:39 PM:
-------------------------------------------------------------------
I tried using IntIntHashMap (mapping each visited node to 1) and it does indeed
seem to give a small speedup. I haven't had a chance to run luceneutil or look
at profiler output yet, but here are some numbers from KnnGraphTester on an
internal dataset. The numbers can be a bit noisy, but they are consistently
better for the hash map version.
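For context, the change under discussion swaps the BitSet used to track visited nodes for a hash-based structure. The sketch below contrasts the two strategies; it is a simplified, self-contained illustration, not Lucene's actual code. `java.util.HashSet<Integer>` stands in for Lucene's hppc-derived IntIntHashMap, and the method names are hypothetical.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Sketch of the two visited-node tracking strategies (assumed names;
// not Lucene's real API).
public class VisitedSetSketch {

  // Baseline: a BitSet sized to the whole graph, allocated up front
  // even though a search typically touches only a small fraction of
  // the nodes.
  static int countNewVisitsBitSet(int[] candidates, int maxNode) {
    BitSet visited = new BitSet(maxNode);
    int newVisits = 0;
    for (int node : candidates) {
      if (!visited.get(node)) {
        visited.set(node);
        newVisits++;
      }
    }
    return newVisits;
  }

  // Hash-based alternative: memory grows with the nodes actually
  // visited, so allocation cost no longer scales with graph size.
  static int countNewVisitsHashSet(int[] candidates) {
    Set<Integer> visited = new HashSet<>();
    int newVisits = 0;
    for (int node : candidates) {
      if (visited.add(node)) { // add() returns false if already present
        newVisits++;
      }
    }
    return newVisits;
  }

  public static void main(String[] args) {
    int[] candidates = {5, 9, 5, 1_000_000, 9, 42};
    System.out.println(countNewVisitsBitSet(candidates, 1_000_001)); // 4
    System.out.println(countNewVisitsHashSet(candidates));           // 4
  }
}
```

The trade-off the benchmarks below probe: the hash set avoids allocating and clearing O(maxNode) bits per search, at the cost of hashing on every membership check.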
h3. IntIntHashMap
{{recall latency nDoc fanout maxConn beamWidth visited index ms}}
{{0.935 0.37 10000 0 16 32 100 1566}}
{{0.965 0.49 10000 50 16 32 150 0}}
{{0.962 0.41 10000 0 16 64 100 2655}}
{{0.982 0.57 10000 50 16 64 150 0}}
{{0.941 0.38 10000 0 32 32 100 1473}}
{{0.969 0.51 10000 50 32 32 150 0}}
{{0.966 0.45 10000 0 32 64 100 2611}}
{{0.985 0.59 10000 50 32 64 150 0}}
{{0.907 0.52 100000 0 16 32 100 19850}}
{{0.940 0.72 100000 50 16 32 150 0}}
{{0.941 0.60 100000 0 16 64 100 38614}}
{{0.966 0.84 100000 50 16 64 150 0}}
{{0.916 0.55 100000 0 32 32 100 19243}}
{{0.949 0.75 100000 50 32 32 150 0}}
{{0.952 0.66 100000 0 32 64 100 38205}}
{{0.973 0.93 100000 50 32 64 150 0}}
{{0.859 0.66 1000000 0 16 32 100 273112}}
{{0.897 0.92 1000000 50 16 32 150 0}}
{{0.917 0.85 1000000 0 16 64 100 523325}}
{{0.946 1.06 1000000 50 16 64 150 0}}
more to come – pushed ctrl-enter instead of enter ...
h3. baseline
{{recall latency nDoc fanout maxConn beamWidth visited index ms}}
{{0.935 0.38 10000 0 16 32 100 1614}}
{{0.965 0.50 10000 50 16 32 150 0}}
{{0.962 0.45 10000 0 16 64 100 2687}}
{{0.982 0.57 10000 50 16 64 150 0}}
{{0.941 0.40 10000 0 32 32 100 1504}}
{{0.969 0.51 10000 50 32 32 150 0}}
{{0.966 0.44 10000 0 32 64 100 2652}}
{{0.985 0.58 10000 50 32 64 150 0}}
{{0.907 0.54 100000 0 16 32 100 21449}}
{{0.940 0.74 100000 50 16 32 150 0}}
{{0.941 0.64 100000 0 16 64 100 39962}}
{{0.966 0.88 100000 50 16 64 150 0}}
{{0.916 0.59 100000 0 32 32 100 20554}}
{{0.949 0.80 100000 50 32 32 150 0}}
{{0.952 0.72 100000 0 32 64 100 40980}}
{{0.973 1.04 100000 50 32 64 150 0}}
{{0.859 0.75 1000000 0 16 32 100 300514}}
{{0.897 0.96 1000000 50 16 32 150 0}}
{{0.917 0.84 1000000 0 16 64 100 563259}}
{{0.946 1.12 1000000 50 16 64 150 0}}
{{0.874 0.86 1000000 0 32 32 100 303186}}
{{0.913 1.09 1000000 50 32 32 150 0}}
{{0.929 1.04 1000000 0 32 64 100 580725}}
{{0.958 1.38 1000000 50 32 64 150 0}}
> Use hash set for visited nodes in HNSW search?
> ----------------------------------------------
>
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Julie Tibshirani
> Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited
> using a BitSet. We could look into using something like IntHashSet instead. I
> tried out the idea quickly by switching to IntIntHashMap (which has already
> been copied from hppc) and saw an improvement in index performance.
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search
> visits roughly 1,000 to 15,000 docs depending on the recall. This number can
> increase when searching with deleted docs, especially if you hit a
> "pathological" case where the deleted docs happen to be closest to the query
> vector.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)