Hi all,

FINGER [1] is a recent paper that describes how to accelerate graph-based ANN search by using a faster, approximate distance measure for most query-node/graph-node comparisons instead of doing full N-dimensional vector calculations.
I've implemented a proof of concept of FINGER for Lucene HNSW (in memory), but I have not been able to replicate the paper's claims of massive speedup; instead I see modest to negligible speedup. For example, for the nytimes-256 dataset at (M, ef) = (16, 100) and topK=10 [2], I get

  Vanilla HNSW:  0.72 s to search 10k vectors, recall = 71.5%
  HNSW-FINGER:   0.70 s to search 10k vectors, recall = 68.9%

For the same dataset at (M, ef) = (128, 512), I get

  Vanilla HNSW:  3.17 s to search 10k vectors, recall = 99.1%
  HNSW-FINGER:   1.93 s to search 10k vectors, recall = 97.3%

The FINGER paper does not give exact M/ef values, but at recall = 70% on this dataset, Figure 4 shows a speedup from 7000 ops/s to 17000 ops/s, almost 2.5x.

I've looked at this code every way I can think of and I'm stuck; I can't see what I'm doing wrong or differently. The approximation code is fairly straightforward arithmetic, and profiling confirms that the approximate-similarity function is in fact ~10x faster than the exact-similarity function. The problem seems to be that the search still needs to make a significant number of exact-similarity calls: at (16, 100) it makes about 2/3 as many exact-similarity calls as vanilla HNSW, and at (128, 512) about 1/4 as many (hence the performance improvement there). [3]

I believe this is because the approximated similarity simply isn't high-fidelity enough. The reason it's hard to be sure is that there are two key components to the FINGER approximate distance:

1. Calculate the optimal basis to project the vectors onto, and
2. Use the signs of the vectors projected onto that basis to estimate the angle between them (a standalone sketch of this step is in the P.P.S. below).

It's very difficult to rule out bugs in either of these parts since there are no "known good" values to compare with. Compounding this challenge, #2 is a probabilistic result, so you really have to look at averages over many calls.

Here are some of the properties of my implementation that I have verified behave as expected:

1. The basis vectors are orthonormal, and they preserve more variance and introduce less error than randomly chosen basis vectors.
2. For trivial cases, the basis vectors match what an independent numpy implementation derives.
3. As r (the number of basis vectors) increases, the error of the estimated similarity relative to the actual similarity decreases.
4. Using the actual angle in the projected space instead of the Hamming approximation improves things, but (it seems to me) not so much that it indicates the approximation is broken.

If someone can take a look and see if I missed something, I'd appreciate it. Code is at https://github.com/jbellis/lucene/tree/concurrent-finger; the highlights are

- FingerMetadata: where all the work happens
- TestFingerMetadata: verifies properties 1-3 above
- HnswGraphTestCase.testFinger*: compares FINGER and vanilla results on random data
- HnswSearcher: a cleaned-up HnswGraphSearcher with a Builder that adds FINGER support (search for useFinger)
- VectorMath: utility methods building on Apache Commons Math, which I copied into the source tree as the hnsw.math package to get things working for now

Finally, the harness I used for testing nytimes and the other datasets is at https://github.com/jbellis/hnswdemo/tree/finger. (It works with the hdf5 datasets provided by https://github.com/erikbern/ann-benchmarks/.)

Much obliged for any ideas you have!

P.S. I'm also available on ASF Slack #lucene-dev if you want to discuss there.
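P.P.S. For anyone who wants a feel for component #2 without reading the paper, here is a minimal, self-contained Java sketch of sign-based angle estimation. It is illustrative only, not code from my branch: it uses a random Gaussian basis rather than the learned basis FINGER derives, and the class and method names are made up for the example. It rests on the standard random-hyperplane result: the probability that two vectors fall on opposite sides of a random hyperplane is theta/pi, so theta can be estimated as pi * hamming / r.

import java.util.Random;

public class SignAngleEstimate {
  // Sign pattern of v projected onto each of the r basis vectors.
  static boolean[] signs(float[] v, float[][] basis) {
    boolean[] s = new boolean[basis.length];
    for (int i = 0; i < basis.length; i++) {
      double dot = 0;
      for (int j = 0; j < v.length; j++) {
        dot += v[j] * basis[i][j];
      }
      s[i] = dot >= 0;
    }
    return s;
  }

  // Estimate cos(u, v) from the sign patterns: theta ~= pi * hamming / r.
  static double approxCosine(boolean[] a, boolean[] b) {
    int hamming = 0;
    for (int i = 0; i < a.length; i++) {
      if (a[i] != b[i]) {
        hamming++;
      }
    }
    return Math.cos(Math.PI * hamming / a.length);
  }

  static double dot(float[] u, float[] v) {
    double s = 0;
    for (int j = 0; j < u.length; j++) {
      s += u[j] * v[j];
    }
    return s;
  }

  public static void main(String[] args) {
    int dim = 256, r = 64, trials = 10_000;
    Random rnd = new Random(42);

    // Random Gaussian basis for illustration; FINGER learns an optimized
    // basis from the graph's vectors instead.
    float[][] basis = new float[r][dim];
    for (float[] row : basis) {
      for (int j = 0; j < dim; j++) {
        row[j] = (float) rnd.nextGaussian();
      }
    }

    // Average |estimated - exact| cosine over many random pairs. The
    // Hamming fraction is an unbiased estimate of theta/pi, so the mean
    // error should shrink as r grows (cf. property 3 above).
    double err = 0;
    for (int t = 0; t < trials; t++) {
      float[] u = new float[dim], v = new float[dim];
      for (int j = 0; j < dim; j++) {
        u[j] = (float) rnd.nextGaussian();
        v[j] = (float) rnd.nextGaussian();
      }
      double exact = dot(u, v) / Math.sqrt(dot(u, u) * dot(v, v));
      double approx = approxCosine(signs(u, basis), signs(v, basis));
      err += Math.abs(approx - exact);
    }
    System.out.printf("mean |cosine error| at r=%d: %.4f%n", r, err / trials);
  }
}

A harness along these lines (with the learned basis swapped in) is roughly the shape of the check behind property 3: rerun it at increasing r and watch the mean error fall.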
[1] https://dl.acm.org/doi/pdf/10.1145/3543507.3583318
[2] The FINGER paper prefers to measure Recall@10 even though many datasets offer Recall@100. I did not see a massive difference in FINGER's performance advantage either way.
[3] See Algorithm 2, line 19 in the paper for when exact similarity is performed; this corresponds to line 257 in HnswSearcher.

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced