+1, please contribute to Lucene. Thanks! On Thu, 27 Apr, 2023, 10:59 pm Jonathan Ellis, <jbel...@gmail.com> wrote:
> Hi all, > > I've created an HNSW index implementation that allows for concurrent build > and querying. On my i9-12900 (8 performance cores and 8 efficiency) I get > a bit less than 10x speedup of wall clock time for building and querying > the "siftsmall" and "sift" datasets from http://corpus-texmex.irisa.fr/. > The small dataset is 10k vectors while the large is 1M. This speedup feels > pretty good for a data structure that isn't completely parallelizable, and > it's good to see that it's consistent as the dataset gets larger. > > The concurrent classes achieve identical recall compared to the > non-concurrent versions within my ability to test it, and are drop-in > replacements for OnHeapHnswGraph and HnswGraphBuilder; I use threadlocals > to work around the places where the existing API assumes no concurrency. > > The concurrent classes also pass the existing test suite with the > exception of the ram usage ones; the estimator doesn't know about > AtomicReference etc. (Big thanks to Michael Sokolov for testAknnDiverse > which made it much easier to track down subtle problems!) > > My motivation is > > 1. It is faster to query a single on-heap hnsw index, than to query > multiple such indexes and combine the result. > 2. Even with some contention necessarily occurring during building of the > index, we still come out way ahead in terms of total efficiency vs creating > per-thread indexes and combining them, since combining such indexes boils > down to "pick the largest and then add all the other nodes normally," you > don't really benefit from having computed the others previously. > > I am currently adding this to Cassandra as code in our repo, but my > preference would be to upstream it. Is Lucene open to a pull request? > > -- > Jonathan Ellis > co-founder, http://www.datastax.com > @spyced >