[ https://issues.apache.org/jira/browse/LUCENE-10593 ]


    Alessandro Benedetti deleted comment on LUCENE-10593:
    -----------------------------------------------

was (Author: alessandro.benedetti):
Hi @msokolov @mayya-sharipova and @jtibshirani , I have finally finished my 
performance tests.
Initially the results were worse in this branch, I found that suspicious as I 
expected the removal of the BoundChecker and the removal of the reverse 
mechanism to outweigh the additional division in the distance measure during 
graph building and searching.

After a deep investigation I found the culprit (you see it in the latest 
commit).


{code:java}
if (neighborSimilarity >= score) {
if ((neighborSimilarity < score) == false) { // this version improves the 
performance dramatically in both indexing/searching
{code}


After that fix, the results are very encouraging.
There are strong speedup for both angular and euclidean distances, both for 
indexing and searching.
*If this is validated we are getting a great cleanup of the code and also a 
nice performance boost.*
I'll have my colleague @eliaporciani to repeat the tests on Apple M1.

The following tests were executed on Intellij running the 
org.apache.lucene.util.hnsw.KnnGraphTester.
2.4 GHz 8-Core Intel Core i9 - 32 GB 2667 MHz DDR4


{noformat}
`INDEXING EUCLIDEAN

-beamWidthIndex 100 -maxConn 16 -ndoc 80000 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5
 -metric euclidean

ORIGINAL
IW 0 [2022-06-22T14:00:12.647030Z; main]: 64335 msec to write vectors
IW 0 [2022-06-22T14:01:57.425108Z; main]: 65710 msec to write vectors
IW 0 [2022-06-22T14:03:18.052900Z; main]: 64817 msec to write vectors

THIS BRANCH
IW 0 [2022-06-22T14:04:50.683607Z; main]: 6597 msec to write vectors
IW 0 [2022-06-22T14:05:34.090801Z; main]: 6687 msec to write vectors
IW 0 [2022-06-22T14:06:00.268309Z; main]: 6564 msec to write vectors

INDEXING ANGULAR

-beamWidthIndex 100 -maxConn 16 -ndoc 80000 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5
 -metric angular

ORIGINAL
IW 0 [2022-06-22T13:55:45.401310Z; main]: 32897 msec to write vectors
IW 0 [2022-06-22T13:56:39.737642Z; main]: 33255 msec to write vectors
IW 0 [2022-06-22T13:57:31.172709Z; main]: 32576 msec to write vectors

THIS BRANCH
IW 0 [2022-06-22T13:52:06.085790Z; main]: 25261 msec to write vectors
IW 0 [2022-06-22T13:52:51.022766Z; main]: 25775 msec to write vectors
IW 0 [2022-06-22T13:53:47.565833Z; main]: 24523 msec to write vectors`

`SEARCH EUCLIDEAN

-niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 80000 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5
 -search 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/sift-128-euclidean.hdf5
 -metric euclidean

ORIGINAL
completed 500 searches in 1026 ms: 487 QPS CPU time=1025ms
completed 500 searches in 1030 ms: 485 QPS CPU time=1029ms
completed 500 searches in 1031 ms: 484 QPS CPU time=1030ms

THIS BRANCH
completed 500 searches in 46 ms: 10869 QPS CPU time=46ms
completed 500 searches in 46 ms: 10869 QPS CPU time=46ms
completed 500 searches in 47 ms: 10638 QPS CPU time=46ms

SEARCH ANGULAR

-niter 500 -beamWidthIndex 100 -maxConn 16 -ndoc 80000 -reindex -docs 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5
 -search 
/Users/sease/JavaProjects/ann-benchmarks/ann_benchmarks/datasets/lastfm-64-dot.hdf5
 -metric angular

ORIGINAL
completed 500 searches in 154 ms: 3246 QPS CPU time=153ms
completed 500 searches in 162 ms: 3086 QPS CPU time=162ms
completed 500 searches in 166 ms: 3012 QPS CPU time=166ms

THIS BRANCH
completed 500 searches in 62 ms: 8064 QPS CPU time=62ms
completed 500 searches in 65 ms: 7692 QPS CPU time=65ms
completed 500 searches in 63 ms: 7936 QPS CPU time=62ms
`
{noformat}



Please correct me in case I did anything wrong, it's the first time I was using 
the org.apache.lucene.util.hnsw.KnnGraphTester

I am proceeding in running additional performance tests on different datasets.
Functional tests are all green.

> VectorSimilarityFunction reverse removal
> ----------------------------------------
>
>                 Key: LUCENE-10593
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10593
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Alessandro Benedetti
>            Priority: Major
>              Labels: vector-based-search
>
> org.apache.lucene.index.VectorSimilarityFunction#EUCLIDEAN similarity behaves 
> in an opposite way in comparison to the other similarities:
> A higher similarity score means higher distance, for this reason, has been 
> marked with "reversed" and a function is present to map from the similarity 
> to a score (where higher means closer, like in all other similarities.)
> Having this counterintuitive behavior with no apparent explanation I could 
> find(please correct me if I am wrong) brings a lot of nasty side effects for 
> the code readability, especially when combined with the NeighbourQueue that 
> has a "reversed" itself.
> In addition, it complicates also the usage of the pattern:
> Result Queue -> MIN HEAP
> Candidate Queue -> MAX HEAP
> In HNSW searchers.
> The proposal in my Pull Request aims to:
> 1) the Euclidean similarity just returns the score, in line with the other 
> similarities, with the formula currently used to move from distance to score
> 2) simplify the code, removing the bound checker that's not necessary anymore
> 3) refactor here and there to be in line with the simplification
> 4) refactor of NeighborQueue to clearly state when it's a MIN_HEAP or 
> MAX_HEAP, now debugging is much easier and understanding the HNSW code is 
> much more intuitive



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to