+1, please contribute to Lucene. Thanks!

On Thu, 27 Apr, 2023, 10:59 pm Jonathan Ellis, <jbel...@gmail.com> wrote:

> Hi all,
>
> I've created an HNSW index implementation that allows for concurrent build
> and querying.  On my i9-12900 (8 performance cores and 8 efficiency) I get
> a bit less than 10x speedup of wall clock time for building and querying
> the "siftsmall" and "sift" datasets from http://corpus-texmex.irisa.fr/.
> The small dataset is 10k vectors while the large is 1M.  This speedup feels
> pretty good for a data structure that isn't completely parallelizable, and
> it's good to see that it's consistent as the dataset gets larger.
>
> The concurrent classes achieve identical recall compared to the
> non-concurrent versions within my ability to test it, and are drop-in
> replacements for OnHeapHnswGraph and HnswGraphBuilder; I use threadlocals
> to work around the places where the existing API assumes no concurrency.
>
> The concurrent classes also pass the existing test suite with the
> exception of the ram usage ones; the estimator doesn't know about
> AtomicReference etc.  (Big thanks to Michael Sokolov for testAknnDiverse
> which made it much easier to track down subtle problems!)
>
> My motivation is
>
> 1. It is faster to query a single on-heap hnsw index, than to query
> multiple such indexes and combine the result.
> 2. Even with some contention necessarily occurring during building of the
> index, we still come out way ahead in terms of total efficiency vs creating
> per-thread indexes and combining them, since combining such indexes boils
> down to "pick the largest and then add all the other nodes normally," you
> don't really benefit from having computed the others previously.
>
> I am currently adding this to Cassandra as code in our repo, but my
> preference would be to upstream it.  Is Lucene open to a pull request?
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>

Reply via email to