[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572514#comment-17572514 ] Michael Sokolov commented on LUCENE-10404: --

Yes, I think your understanding is correct. Another difference is the dimension of the vectors, which was 256 in the first case and 100 in the second. I think this second case will tend to emphasize the cost of the graph traversal, since dot-product costs will be a smaller proportion of the overall time spent. Indeed, I think some specialization would help there.

> Use hash set for visited nodes in HNSW search?
> --
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Julie Tibshirani
> Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited using a BitSet. We could look into using something like IntHashSet instead. I tried out the idea quickly by switching to IntIntHashMap (which has already been copied from hppc) and saw an improvement in index performance.
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search visits ~1,000 - 15,000 docs depending on the recall. This number can increase when searching with deleted docs, especially if you hit a "pathological" case where the deleted docs happen to be closest to the query vector.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
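To make the traversal-vs-distance trade-off above concrete, here is a minimal, hypothetical sketch (plain Java collections, not Lucene's implementation) of the shape of a per-layer graph expansion: every candidate edge pays one visited-set probe before any vector math, so when the dot product is cheap (100-dim vs 256-dim vectors) that probe becomes a larger share of the total cost.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

class TraversalSketch {
    // Illustrative only: counts nodes reachable from an entry point.
    // neighbors[i] lists the graph neighbors of node i.
    static int countVisited(int[][] neighbors, int entryPoint) {
        Set<Integer> visited = new HashSet<>(); // stand-in for the visited tracker
        Deque<Integer> candidates = new ArrayDeque<>();
        candidates.push(entryPoint);
        visited.add(entryPoint);
        while (!candidates.isEmpty()) {
            int node = candidates.pop();
            for (int n : neighbors[node]) {
                if (visited.add(n)) { // one set probe per candidate edge
                    // real HNSW would score the candidate's vector here;
                    // the probe above is pure traversal overhead
                    candidates.push(n);
                }
            }
        }
        return visited.size();
    }
}
```

Real HNSW also maintains a bounded priority queue of best candidates; the sketch keeps only the part relevant to this issue, the visited-set probe on every edge.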
[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572080#comment-17572080 ] Julie Tibshirani commented on LUCENE-10404: ---

Checking that I understand the numbers: in addition to indexing slowing down, it looks like search latency is a bit worse too. The main difference is that the graph is better connected (higher maxConn) and we explore more nodes during indexing (higher beamWidth). The slowdown is not huge but significant (~5% slower for both indexing and search). This would support the theory that for larger numbers of 'visited' nodes, the IntIntHashMap solution doesn't perform as well. Maybe we could consider a more custom approach, as Adrien suggests?
[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571594#comment-17571594 ] Michael Sokolov commented on LUCENE-10404: --

Here is a test using GloVe 100-dim vectors plus much more aggressive indexing settings, and we can see that here the IntIntHashMap is adding cost.

h3. baseline

{{recall  latency  nDoc  fanout  maxConn  beamWidth  visited  index ms}}
{{0.991   0.92     1     50      64       500        150      12068}}
{{0.996   1.11     1     100     64       500        200      0}}
{{0.999   1.45     1     200     64       500        300      0}}
{{1.000   1.94     1     400     64       500        500      0}}
{{0.955   2.53     10    50      64       500        150      463142}}
{{0.973   3.03     10    100     64       500        200      0}}
{{0.988   4.44     10    200     64       500        300      0}}
{{0.997   6.57     10    400     64       500        500      0}}
{{0.895   3.44     100   50      64       500        150      9811483}}
{{0.920   4.33     100   100     64       500        200      0}}
{{0.950   6.20     100   200     64       500        300      0}}
{{0.974   9.53     100   400     64       500        500      0}}

h3. IntIntHashMap

{{recall  latency  nDoc  fanout  maxConn  beamWidth  visited  index ms}}
{{0.991   1.03     1     50      64       500        150      13274}}
{{0.996   1.24     1     100     64       500        200      0}}
{{0.999   1.62     1     200     64       500        300      0}}
{{1.000   2.09     1     400     64       500        500      0}}
{{0.955   2.47     10    50      64       500        150      485131}}
{{0.973   3.31     10    100     64       500        200      0}}
{{0.988   4.66     10    200     64       500        300      0}}
{{0.997   7.26     10    400     64       500        500      0}}
{{0.895   3.58     100   50      64       500        150      10173818}}
{{0.920   4.49     100   100     64       500        200      0}}
{{0.950   6.45     100   200     64       500        300      0}}
{{0.974   9.91     100   400     64       500        500      0}}
[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570195#comment-17570195 ] Michael Sokolov commented on LUCENE-10404: --

The default {{topK}} in KnnGraphTester is 100, so these test runs are maintaining results queues of 100 or 150 (when searching). During indexing this is driven by beamWidth, and 32/64 is lower than is typical, I think. Still, I think it's encouraging that we see gains in both searching (when the queue size is 100-150) and indexing (when it is 32-64). I won't be able to run more tests for a few days, but I agree that it would be interesting to see how the gains correlate with the queue sizes. I was motivated to get a quick look! I will run some more exhaustive tests next week.
[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570111#comment-17570111 ] Julie Tibshirani commented on LUCENE-10404: ---

Those numbers look good! Is my understanding right that these experiments use k=10, and fanout = 0 and 50? Maybe we could also try with a high fanout (like 100 or 500) to double-check the case where we need to visit a larger number of nodes.
[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569429#comment-17569429 ] Michael Sokolov commented on LUCENE-10404: --

I tried using IntIntHashMap (mapping to 1 for visited nodes) and it does indeed seem to give a small speedup. I haven't had a chance to run luceneutil or look at profiler output, but here are some numbers from KnnGraphTester for an internal dataset. The numbers can be a bit noisy, but they are consistently better for the hash map version.

h3. IntIntHashMap

{{recall latency nDoc fanout maxConn beamWidth visited index ms}}
{{0.935 0.37 1 0 16 32 100 1566}}
{{0.965 0.49 1 50 16 32 150 0}}
{{0.962 0.41 1 0 16 64 100 2655}}
{{0.982 0.57 1 50 16 64 150 0}}
{{0.941 0.38 1 0 32 32 100 1473}}
{{0.969 0.51 1 50 32 32 150 0}}
{{0.966 0.45 1 0 32 64 100 2611}}
{{0.985 0.59 1 50 32 64 150 0}}
{{0.907 0.52 10 0 16 32 100 19850}}
{{0.940 0.72 10 50 16 32 150 0}}
{{0.941 0.60 10 0 16 64 100 38614}}
{{0.966 0.84 10 50 16 64 150 0}}
{{0.916 0.55 10 0 32 32 100 19243}}
{{0.949 0.75 10 50 32 32 150 0}}
{{0.952 0.66 10 0 32 64 100 38205}}
{{0.973 0.93 10 50 32 64 150 0}}
{{0.859 0.66 100 0 16 32 100 273112}}
{{0.897 0.92 100 50 16 32 150 0}}
{{0.917 0.85 100 0 16 64 100 523325}}
{{0.946 1.06 100 50 16 64 150 0}}
{{0.874 0.80 100 0 32 32 100 274816}}
{{0.913 1.05 100 50 32 32 150 0}}
{{0.929 0.98 100 0 32 64 100 564762}}

h3. baseline

{{recall latency nDoc fanout maxConn beamWidth visited index ms}}
{{0.935 0.38 1 0 16 32 100 1614}}
{{0.965 0.50 1 50 16 32 150 0}}
{{0.962 0.45 1 0 16 64 100 2687}}
{{0.982 0.57 1 50 16 64 150 0}}
{{0.941 0.40 1 0 32 32 100 1504}}
{{0.969 0.51 1 50 32 32 150 0}}
{{0.966 0.44 1 0 32 64 100 2652}}
{{0.985 0.58 1 50 32 64 150 0}}
{{0.907 0.54 10 0 16 32 100 21449}}
{{0.940 0.74 10 50 16 32 150 0}}
{{0.941 0.64 10 0 16 64 100 39962}}
{{0.966 0.88 10 50 16 64 150 0}}
{{0.916 0.59 10 0 32 32 100 20554}}
{{0.949 0.80 10 50 32 32 150 0}}
{{0.952 0.72 10 0 32 64 100 40980}}
{{0.973 1.04 10 50 32 64 150 0}}
{{0.859 0.75 100 0 16 32 100 300514}}
{{0.897 0.96 100 50 16 32 150 0}}
{{0.917 0.84 100 0 16 64 100 563259}}
{{0.946 1.12 100 50 16 64 150 0}}
{{0.874 0.86 100 0 32 32 100 303186}}
{{0.913 1.09 100 50 32 32 150 0}}
{{0.929 1.04 100 0 32 64 100 580725}}
{{0.958 1.38 100 50 32 64 150 0}}
[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569237#comment-17569237 ] Julie Tibshirani commented on LUCENE-10404: ---

As a note, LUCENE-10592 changes the indexing strategy so that we build the graph as each document is added, instead of waiting until 'flush'. In the PR, graph building still shares a single FixedBitSet to track the 'visited' set, but it's continuously resized since we don't know the full number of docs up-front. So maybe switching to a hash set could help even more after that change is merged.
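A minimal illustration of the resizing concern (using {{java.util.BitSet}} as a stand-in for the continuously-resized FixedBitSet; this is not the Lucene code): a bitset tracker's footprint scales with the largest node id seen so far, while a hash set would scale with the number of nodes actually visited.

```java
import java.util.BitSet;

class GrowableVisited {
    // java.util.BitSet grows automatically when a high bit is set;
    // Lucene's FixedBitSet would need an explicit copy-and-grow here.
    private final BitSet bits = new BitSet();

    /** Returns true if the node had not been visited before. */
    boolean add(int node) {
        if (bits.get(node)) {
            return false; // already visited
        }
        bits.set(node);
        return true;
    }

    /** Words the tracker currently spans: driven by the largest node id, not by #visited. */
    int sizeInWords() {
        return (bits.length() + 63) / 64;
    }

    void clear() {
        bits.clear();
    }
}
```

Adding just one high-numbered node forces the tracker to span the whole id range, which is exactly the overhead a per-visit-sized hash set would avoid.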
[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486894#comment-17486894 ] Adrien Grand commented on LUCENE-10404: ---

{quote}For example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search visits ~1,000 - 15,000 docs depending on the recall.{quote}

So if some searches at index time need to visit ~15k docs, then we'd likely need a hash set with a backing array of 32,768 entries (the smallest power of two that is more than 2x the number of items in the hash set). Since we'd never shrink the backing array, this means that even when we only visit 1,000 nodes, we would then have to clear the 32k entries of the backing array, so the hashset#clear calls might not be free (though still cheaper than on FixedBitSet). This makes me wonder if we should consider using a specialized hash set implementation for this use-case, e.g. one that uses a FixedBitSet to track used buckets instead of sentinel values (which I believe is what HPPC's IntHashSet uses). This would make {{IntHashSet#add}} calls a bit more expensive (maybe?), but clearing the set would then only consist of clearing the bitset that tracks used slots instead of the backing array, so we'd need to clear 32,768 bits = 4kB of data instead of 32,768 * 4 bytes = 128kB.
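The specialized set suggested above can be sketched as follows. This is a hypothetical illustration, not Lucene or HPPC code: an open-addressed int set whose {{clear()}} resets only a bitset of used buckets, so clearing costs capacity/8 bytes instead of capacity*4 bytes. ({{java.util.BitSet}} stands in for FixedBitSet; capacity is fixed and there is no growth or overflow handling, for brevity.)

```java
import java.util.BitSet;

class SparseIntSet {
    private final int[] keys;  // backing array; stale values are ignored via 'used'
    private final BitSet used; // marks occupied buckets; clearing this empties the set
    private final int mask;

    SparseIntSet(int capacityPowerOfTwo) {
        keys = new int[capacityPowerOfTwo];
        used = new BitSet(capacityPowerOfTwo);
        mask = capacityPowerOfTwo - 1;
    }

    /** Returns true if the value was newly added, i.e. not visited before.
     *  Slightly costlier than a sentinel-based add: one extra bitset probe per slot. */
    boolean add(int value) {
        int slot = mix(value) & mask;
        while (used.get(slot)) {      // linear probing over occupied buckets
            if (keys[slot] == value) {
                return false;         // already present
            }
            slot = (slot + 1) & mask;
        }
        keys[slot] = value;
        used.set(slot);
        return true;
    }

    /** Zeroes capacity/64 longs (4kB for 32,768 buckets) instead of
     *  capacity ints (128kB); the int[] is left dirty on purpose. */
    void clear() {
        used.clear();
    }

    private static int mix(int v) {
        int h = v * 0x9E3779B9;       // cheap integer hash to spread sequential ids
        return h ^ (h >>> 16);
    }
}
```

The trade-off matches the comment above: add() pays an extra bitset lookup per probe, while clear() touches 32x less memory than resetting the backing array.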