[ 
https://issues.apache.org/jira/browse/LUCENE-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Sokolov updated LUCENE-9644:
------------------------------------
    Description: 
This will replace the simple nearest neighbor selection with a criterion that 
takes into account the distance of the neighbors from each other. It is seen to 
provide dramatically improved recall on at least two datasets, and is what is 
being used by our reference implementation, hnswlib. The basic idea is that 
when selecting  the nearest neighbors to associate with a new node added to the 
graph, we filter using a diversity criterion. If a candidate neighbor is closer 
to an already-added (closer to the new node) neighbor than it is to the new 
node, then we pass over it, moving on to more-distant, but presumably more 
diverse neighbors. The same criterion is also (re-) applied to the neighbor 
nodes' neighbors, since we add the links bidirectionally.

h2. Results:
h3. GloVe/Wikipedia

h4. baseline
||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited||    index 
ms||
|0.643|  0.77|  100000| 50|     32|     64|     1742|   22830|
|0.671|  0.95|  100000| 100|    32|     64|     2141|   0|
|0.704|  1.32|  100000| 200|    32|     64|     2923|   0|
|0.739|  2.04|  100000| 400|    32|     64|     4382|   0|
|0.470|  0.91|  1000000|        50|     32|     64|     2068|   337081|
|0.496|  1.21|  1000000|        100|    32|     64|     2548|   0|
|0.533|  1.77|  1000000|        200|    32|     64|     3479|   0|
|0.573|  2.58|  1000000|        400|    32|     64|     5257|   0|

h4. diverse
||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited||    index 
ms||
|0.801|  0.57|  100000| 50|     32|     64|     593|    17985|
|0.840|  0.67|  100000| 100|    32|     64|     738|    0|
|0.883|  0.97|  100000| 200|    32|     64|     1018|   0|
|0.921|  1.36|  100000| 400|    32|     64|     1502|   0|
|0.723|  0.71|  1000000|        50|     32|     64|     860|    298383|
|0.761|  0.77|  1000000|        100|    32|     64|     1058|   0|
|0.806|  1.06|  1000000|        200|    32|     64|     1442|   0|
|0.854|  1.67|  1000000|        400|    32|     64|     2159|   0|


h3. Dataset from work:
h4. baseline
||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited||    index 
ms||
|0.933|  1.41|  100000| 50|     32|     64|     1496|   35462|
|0.948|  1.39|  100000| 100|    32|     64|     1872|   0|
|0.961|  2.10|  100000| 200|    32|     64|     2591|   0|
|0.972|  3.04|  100000| 400|    32|     64|     3939|   0|
|0.827|  1.34|  1000000|        50|     32|     64|     1676|   535802|
|0.854|  1.76|  1000000|        100|    32|     64|     2056|   0|
|0.887|  2.47|  1000000|        200|    32|     64|     2761|   0|
|0.907|  3.75|  1000000|        400|    32|     64|     4129|   0|

h4. diverse
||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited||    index 
ms||
|0.966|  1.18|  100000| 50|     32|     64|     1480|   37656|
|0.977|  1.46|  100000| 100|    32|     64|     1832|   0|
|0.988|  2.00|  100000| 200|    32|     64|     2472|   0|
|0.995|  3.14|  100000| 400|    32|     64|     3629|   0|
|0.944|  1.34|  1000000|        50|     32|     64|     1780|   526834|
|0.959|  1.71|  1000000|        100|    32|     64|     2222|   0|
|0.975|  2.30|  1000000|        200|    32|     64|     3041|   0|
|0.986|  3.56|  1000000|        400|    32|     64|     4543|   0|


  was:
This will replace the simple nearest neighbor selection with a criterion that 
takes into account the distance of the neighbors from each other. It is seen to 
provide dramatically improved recall on at least two datasets, and is what is 
being used by our reference implementation, hnswlib. The basic idea is that 
when selecting  the nearest neighbors to associate with a new node added to the 
graph, we filter using a diversity criterion. If a candidate neighbor is closer 
to an already-added (closer to the new node) neighbor than it is to the new 
node, then we pass over it, moving on to more-distant, but presumably more 
diverse neighbors. The same criterion is also (re-) applied to the neighbor 
nodes' neighbors, since we add the links bidirectionally.

## Results:
### GloVe/Wikipedia

#### baseline
||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited||    index 
ms||
|0.643|  0.77|  100000| 50|     32|     64|     1742|   22830|
|0.671|  0.95|  100000| 100|    32|     64|     2141|   0|
|0.704|  1.32|  100000| 200|    32|     64|     2923|   0|
|0.739|  2.04|  100000| 400|    32|     64|     4382|   0|
|0.470|  0.91|  1000000|        50|     32|     64|     2068|   337081|
|0.496|  1.21|  1000000|        100|    32|     64|     2548|   0|
|0.533|  1.77|  1000000|        200|    32|     64|     3479|   0|
|0.573|  2.58|  1000000|        400|    32|     64|     5257|   0|

#### diverse
||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited||    index 
ms||
|0.801|  0.57|  100000| 50|     32|     64|     593|    17985|
|0.840|  0.67|  100000| 100|    32|     64|     738|    0|
|0.883|  0.97|  100000| 200|    32|     64|     1018|   0|
|0.921|  1.36|  100000| 400|    32|     64|     1502|   0|
|0.723|  0.71|  1000000|        50|     32|     64|     860|    298383|
|0.761|  0.77|  1000000|        100|    32|     64|     1058|   0|
|0.806|  1.06|  1000000|        200|    32|     64|     1442|   0|
|0.854|  1.67|  1000000|        400|    32|     64|     2159|   0|


### Dataset from work:
#### baseline
||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited||    index 
ms||
|0.933|  1.41|  100000| 50|     32|     64|     1496|   35462|
|0.948|  1.39|  100000| 100|    32|     64|     1872|   0|
|0.961|  2.10|  100000| 200|    32|     64|     2591|   0|
|0.972|  3.04|  100000| 400|    32|     64|     3939|   0|
|0.827|  1.34|  1000000|        50|     32|     64|     1676|   535802|
|0.854|  1.76|  1000000|        100|    32|     64|     2056|   0|
|0.887|  2.47|  1000000|        200|    32|     64|     2761|   0|
|0.907|  3.75|  1000000|        400|    32|     64|     4129|   0|

#### diverse
||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited||    index 
ms||
|0.966|  1.18|  100000| 50|     32|     64|     1480|   37656|
|0.977|  1.46|  100000| 100|    32|     64|     1832|   0|
|0.988|  2.00|  100000| 200|    32|     64|     2472|   0|
|0.995|  3.14|  100000| 400|    32|     64|     3629|   0|
|0.944|  1.34|  1000000|        50|     32|     64|     1780|   526834|
|0.959|  1.71|  1000000|        100|    32|     64|     2222|   0|
|0.975|  2.30|  1000000|        200|    32|     64|     3041|   0|
|0.986|  3.56|  1000000|        400|    32|     64|     4543|   0|



> HNSW diverse neighbor selection heuristic
> -----------------------------------------
>
>                 Key: LUCENE-9644
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9644
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Major
>
> This will replace the simple nearest neighbor selection with a criterion that 
> takes into account the distance of the neighbors from each other. It is seen 
> to provide dramatically improved recall on at least two datasets, and is what 
> is being used by our reference implementation, hnswlib. The basic idea is 
> that when selecting  the nearest neighbors to associate with a new node added 
> to the graph, we filter using a diversity criterion. If a candidate neighbor 
> is closer to an already-added (closer to the new node) neighbor than it is to 
> the new node, then we pass over it, moving on to more-distant, but presumably 
> more diverse neighbors. The same criterion is also (re-) applied to the 
> neighbor nodes' neighbors, since we add the links bidirectionally.
> h2. Results:
> h3. GloVe/Wikipedia
> h4. baseline
> ||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited||  index 
> ms||
> |0.643|        0.77|  100000| 50|     32|     64|     1742|   22830|
> |0.671|        0.95|  100000| 100|    32|     64|     2141|   0|
> |0.704|        1.32|  100000| 200|    32|     64|     2923|   0|
> |0.739|        2.04|  100000| 400|    32|     64|     4382|   0|
> |0.470|        0.91|  1000000|        50|     32|     64|     2068|   337081|
> |0.496|        1.21|  1000000|        100|    32|     64|     2548|   0|
> |0.533|        1.77|  1000000|        200|    32|     64|     3479|   0|
> |0.573|        2.58|  1000000|        400|    32|     64|     5257|   0|
> h4. diverse
> ||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited||  index 
> ms||
> |0.801|        0.57|  100000| 50|     32|     64|     593|    17985|
> |0.840|        0.67|  100000| 100|    32|     64|     738|    0|
> |0.883|        0.97|  100000| 200|    32|     64|     1018|   0|
> |0.921|        1.36|  100000| 400|    32|     64|     1502|   0|
> |0.723|        0.71|  1000000|        50|     32|     64|     860|    298383|
> |0.761|        0.77|  1000000|        100|    32|     64|     1058|   0|
> |0.806|        1.06|  1000000|        200|    32|     64|     1442|   0|
> |0.854|        1.67|  1000000|        400|    32|     64|     2159|   0|
> h3. Dataset from work:
> h4. baseline
> ||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited||  index 
> ms||
> |0.933|        1.41|  100000| 50|     32|     64|     1496|   35462|
> |0.948|        1.39|  100000| 100|    32|     64|     1872|   0|
> |0.961|        2.10|  100000| 200|    32|     64|     2591|   0|
> |0.972|        3.04|  100000| 400|    32|     64|     3939|   0|
> |0.827|        1.34|  1000000|        50|     32|     64|     1676|   535802|
> |0.854|        1.76|  1000000|        100|    32|     64|     2056|   0|
> |0.887|        2.47|  1000000|        200|    32|     64|     2761|   0|
> |0.907|        3.75|  1000000|        400|    32|     64|     4129|   0|
> h4. diverse
> ||recall||latency ms||nDoc||fanout|| maxConn|| beamWidth|| visited||  index 
> ms||
> |0.966|        1.18|  100000| 50|     32|     64|     1480|   37656|
> |0.977|        1.46|  100000| 100|    32|     64|     1832|   0|
> |0.988|        2.00|  100000| 200|    32|     64|     2472|   0|
> |0.995|        3.14|  100000| 400|    32|     64|     3629|   0|
> |0.944|        1.34|  1000000|        50|     32|     64|     1780|   526834|
> |0.959|        1.71|  1000000|        100|    32|     64|     2222|   0|
> |0.975|        2.30|  1000000|        200|    32|     64|     3041|   0|
> |0.986|        3.56|  1000000|        400|    32|     64|     4543|   0|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to