+1 for a pull request
Thanks
Michael
Am 27.04.23 um 20:53 schrieb Ishan Chattopadhyaya:
+1, please contribute to Lucene. Thanks!
On Thu, 27 Apr, 2023, 10:59 pm Jonathan Ellis, <jbel...@gmail.com> wrote:
Hi all,
I've created an HNSW index implementation that allows for
concurrent build and querying. On my i9-12900 (8 performance
cores and 8 efficiency) I get a bit less than 10x speedup of wall
clock time for building and querying the "siftsmall" and "sift"
datasets from http://corpus-texmex.irisa.fr/. The small dataset is
10k vectors while the large is 1M. This speedup feels pretty good
for a data structure that isn't completely parallelizable, and
it's good to see that it's consistent as the dataset gets larger.
The concurrent classes achieve identical recall compared to the
non-concurrent versions within my ability to test it, and are
drop-in replacements for OnHeapHnswGraph and HnswGraphBuilder; I
use threadlocals to work around the places where the existing API
assumes no concurrency.
The concurrent classes also pass the existing test suite with the
exception of the ram usage ones; the estimator doesn't know about
AtomicReference etc. (Big thanks to Michael Sokolov for
testAknnDiverse which made it much easier to track down subtle
problems!)
My motivation is
1. It is faster to query a single on-heap hnsw index, than to
query multiple such indexes and combine the result.
2. Even with some contention necessarily occurring during building
of the index, we still come out way ahead in terms of total
efficiency vs creating per-thread indexes and combining them,
since combining such indexes boils down to "pick the largest and
then add all the other nodes normally," you don't really benefit
from having computed the others previously.
I am currently adding this to Cassandra as code in our repo, but
my preference would be to upstream it. Is Lucene open to a pull
request?
--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced