[
https://issues.apache.org/jira/browse/LUCENE-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Sokolov updated LUCENE-9644:
------------------------------------
Description:
This will replace the simple nearest neighbor selection with a criterion that
takes into account the distance of the neighbors from each other. It is seen to
provide dramatically improved recall on at least two datasets, and is what is
being used by our reference implementation, hnswlib. The basic idea is that
when selecting the nearest neighbors to associate with a new node added to the
graph, we filter using a diversity criterion. If a candidate neighbor is closer
to an already-added (closer to the new node) neighbor than it is to the new
node, then we pass over it, moving on to more-distant, but presumably more
diverse neighbors. The same criterion is also (re-) applied to the neighbor
nodes' neighbors, since we add the links bidirectionally.
h2. Results:
h3. GloVe/Wikipedia
h4. baseline
||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited|| index
ms||
|0.643| 0.77| 100000| 50| 32| 64| 1742| 22830|
|0.671| 0.95| 100000| 100| 32| 64| 2141| 0|
|0.704| 1.32| 100000| 200| 32| 64| 2923| 0|
|0.739| 2.04| 100000| 400| 32| 64| 4382| 0|
|0.470| 0.91| 1000000| 50| 32| 64| 2068| 337081|
|0.496| 1.21| 1000000| 100| 32| 64| 2548| 0|
|0.533| 1.77| 1000000| 200| 32| 64| 3479| 0|
|0.573| 2.58| 1000000| 400| 32| 64| 5257| 0|
h4. diverse
||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited|| index
ms||
|0.801| 0.57| 100000| 50| 32| 64| 593| 17985|
|0.840| 0.67| 100000| 100| 32| 64| 738| 0|
|0.883| 0.97| 100000| 200| 32| 64| 1018| 0|
|0.921| 1.36| 100000| 400| 32| 64| 1502| 0|
|0.723| 0.71| 1000000| 50| 32| 64| 860| 298383|
|0.761| 0.77| 1000000| 100| 32| 64| 1058| 0|
|0.806| 1.06| 1000000| 200| 32| 64| 1442| 0|
|0.854| 1.67| 1000000| 400| 32| 64| 2159| 0|
h3. Dataset from work:
h4. baseline
||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited|| index
ms||
|0.933| 1.41| 100000| 50| 32| 64| 1496| 35462|
|0.948| 1.39| 100000| 100| 32| 64| 1872| 0|
|0.961| 2.10| 100000| 200| 32| 64| 2591| 0|
|0.972| 3.04| 100000| 400| 32| 64| 3939| 0|
|0.827| 1.34| 1000000| 50| 32| 64| 1676| 535802|
|0.854| 1.76| 1000000| 100| 32| 64| 2056| 0|
|0.887| 2.47| 1000000| 200| 32| 64| 2761| 0|
|0.907| 3.75| 1000000| 400| 32| 64| 4129| 0|
h4. diverse
||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited|| index
ms||
|0.966| 1.18| 100000| 50| 32| 64| 1480| 37656|
|0.977| 1.46| 100000| 100| 32| 64| 1832| 0|
|0.988| 2.00| 100000| 200| 32| 64| 2472| 0|
|0.995| 3.14| 100000| 400| 32| 64| 3629| 0|
|0.944| 1.34| 1000000| 50| 32| 64| 1780| 526834|
|0.959| 1.71| 1000000| 100| 32| 64| 2222| 0|
|0.975| 2.30| 1000000| 200| 32| 64| 3041| 0|
|0.986| 3.56| 1000000| 400| 32| 64| 4543| 0|
was:
This will replace the simple nearest neighbor selection with a criterion that
takes into account the distance of the neighbors from each other. It is seen to
provide dramatically improved recall on at least two datasets, and is what is
being used by our reference implementation, hnswlib. The basic idea is that
when selecting the nearest neighbors to associate with a new node added to the
graph, we filter using a diversity criterion. If a candidate neighbor is closer
to an already-added (closer to the new node) neighbor than it is to the new
node, then we pass over it, moving on to more-distant, but presumably more
diverse neighbors. The same criterion is also (re-) applied to the neighbor
nodes' neighbors, since we add the links bidirectionally.
## Results:
### GloVe/Wikipedia
#### baseline
||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited|| index
ms||
|0.643| 0.77| 100000| 50| 32| 64| 1742| 22830|
|0.671| 0.95| 100000| 100| 32| 64| 2141| 0|
|0.704| 1.32| 100000| 200| 32| 64| 2923| 0|
|0.739| 2.04| 100000| 400| 32| 64| 4382| 0|
|0.470| 0.91| 1000000| 50| 32| 64| 2068| 337081|
|0.496| 1.21| 1000000| 100| 32| 64| 2548| 0|
|0.533| 1.77| 1000000| 200| 32| 64| 3479| 0|
|0.573| 2.58| 1000000| 400| 32| 64| 5257| 0|
#### diverse
||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited|| index
ms||
|0.801| 0.57| 100000| 50| 32| 64| 593| 17985|
|0.840| 0.67| 100000| 100| 32| 64| 738| 0|
|0.883| 0.97| 100000| 200| 32| 64| 1018| 0|
|0.921| 1.36| 100000| 400| 32| 64| 1502| 0|
|0.723| 0.71| 1000000| 50| 32| 64| 860| 298383|
|0.761| 0.77| 1000000| 100| 32| 64| 1058| 0|
|0.806| 1.06| 1000000| 200| 32| 64| 1442| 0|
|0.854| 1.67| 1000000| 400| 32| 64| 2159| 0|
### Dataset from work:
#### baseline
||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited|| index
ms||
|0.933| 1.41| 100000| 50| 32| 64| 1496| 35462|
|0.948| 1.39| 100000| 100| 32| 64| 1872| 0|
|0.961| 2.10| 100000| 200| 32| 64| 2591| 0|
|0.972| 3.04| 100000| 400| 32| 64| 3939| 0|
|0.827| 1.34| 1000000| 50| 32| 64| 1676| 535802|
|0.854| 1.76| 1000000| 100| 32| 64| 2056| 0|
|0.887| 2.47| 1000000| 200| 32| 64| 2761| 0|
|0.907| 3.75| 1000000| 400| 32| 64| 4129| 0|
#### diverse
||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited|| index
ms||
|0.966| 1.18| 100000| 50| 32| 64| 1480| 37656|
|0.977| 1.46| 100000| 100| 32| 64| 1832| 0|
|0.988| 2.00| 100000| 200| 32| 64| 2472| 0|
|0.995| 3.14| 100000| 400| 32| 64| 3629| 0|
|0.944| 1.34| 1000000| 50| 32| 64| 1780| 526834|
|0.959| 1.71| 1000000| 100| 32| 64| 2222| 0|
|0.975| 2.30| 1000000| 200| 32| 64| 3041| 0|
|0.986| 3.56| 1000000| 400| 32| 64| 4543| 0|
> HNSW diverse neighbor selection heuristic
> -----------------------------------------
>
> Key: LUCENE-9644
> URL: https://issues.apache.org/jira/browse/LUCENE-9644
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael Sokolov
> Priority: Major
>
> This will replace the simple nearest neighbor selection with a criterion that
> takes into account the distance of the neighbors from each other. It is seen
> to provide dramatically improved recall on at least two datasets, and is what
> is being used by our reference implementation, hnswlib. The basic idea is
> that when selecting the nearest neighbors to associate with a new node added
> to the graph, we filter using a diversity criterion. If a candidate neighbor
> is closer to an already-added (closer to the new node) neighbor than it is to
> the new node, then we pass over it, moving on to more-distant, but presumably
> more diverse neighbors. The same criterion is also (re-) applied to the
> neighbor nodes' neighbors, since we add the links bidirectionally.
> h2. Results:
> h3. GloVe/Wikipedia
> h4. baseline
> ||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited|| index
> ms||
> |0.643| 0.77| 100000| 50| 32| 64| 1742| 22830|
> |0.671| 0.95| 100000| 100| 32| 64| 2141| 0|
> |0.704| 1.32| 100000| 200| 32| 64| 2923| 0|
> |0.739| 2.04| 100000| 400| 32| 64| 4382| 0|
> |0.470| 0.91| 1000000| 50| 32| 64| 2068| 337081|
> |0.496| 1.21| 1000000| 100| 32| 64| 2548| 0|
> |0.533| 1.77| 1000000| 200| 32| 64| 3479| 0|
> |0.573| 2.58| 1000000| 400| 32| 64| 5257| 0|
> h4. diverse
> ||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited|| index
> ms||
> |0.801| 0.57| 100000| 50| 32| 64| 593| 17985|
> |0.840| 0.67| 100000| 100| 32| 64| 738| 0|
> |0.883| 0.97| 100000| 200| 32| 64| 1018| 0|
> |0.921| 1.36| 100000| 400| 32| 64| 1502| 0|
> |0.723| 0.71| 1000000| 50| 32| 64| 860| 298383|
> |0.761| 0.77| 1000000| 100| 32| 64| 1058| 0|
> |0.806| 1.06| 1000000| 200| 32| 64| 1442| 0|
> |0.854| 1.67| 1000000| 400| 32| 64| 2159| 0|
> h3. Dataset from work:
> h4. baseline
> ||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited|| index
> ms||
> |0.933| 1.41| 100000| 50| 32| 64| 1496| 35462|
> |0.948| 1.39| 100000| 100| 32| 64| 1872| 0|
> |0.961| 2.10| 100000| 200| 32| 64| 2591| 0|
> |0.972| 3.04| 100000| 400| 32| 64| 3939| 0|
> |0.827| 1.34| 1000000| 50| 32| 64| 1676| 535802|
> |0.854| 1.76| 1000000| 100| 32| 64| 2056| 0|
> |0.887| 2.47| 1000000| 200| 32| 64| 2761| 0|
> |0.907| 3.75| 1000000| 400| 32| 64| 4129| 0|
> h4. diverse
> ||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited|| index
> ms||
> |0.966| 1.18| 100000| 50| 32| 64| 1480| 37656|
> |0.977| 1.46| 100000| 100| 32| 64| 1832| 0|
> |0.988| 2.00| 100000| 200| 32| 64| 2472| 0|
> |0.995| 3.14| 100000| 400| 32| 64| 3629| 0|
> |0.944| 1.34| 1000000| 50| 32| 64| 1780| 526834|
> |0.959| 1.71| 1000000| 100| 32| 64| 2222| 0|
> |0.975| 2.30| 1000000| 200| 32| 64| 3041| 0|
> |0.986| 3.56| 1000000| 400| 32| 64| 4543| 0|
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]