That's great! We were discussing exactly this here:
https://github.com/apache/lucene/pull/12169

It would also help with the new token filter :)
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>


On Thu, 27 Apr 2023 at 19:29, Jonathan Ellis <jbel...@gmail.com> wrote:

> Hi all,
>
> I've created an HNSW index implementation that allows for concurrent build
> and querying.  On my i9-12900 (8 performance and 8 efficiency cores) I get
> a bit less than 10x speedup of wall clock time for building and querying
> the "siftsmall" and "sift" datasets from http://corpus-texmex.irisa.fr/.
> The small dataset is 10k vectors while the large is 1M.  This speedup feels
> pretty good for a data structure that isn't completely parallelizable, and
> it's good to see that it's consistent as the dataset gets larger.
>
> The concurrent classes achieve identical recall compared to the
> non-concurrent versions within my ability to test it, and are drop-in
> replacements for OnHeapHnswGraph and HnswGraphBuilder; I use threadlocals
> to work around the places where the existing API assumes no concurrency.
>
> The concurrent classes also pass the existing test suite with the
> exception of the RAM usage ones; the estimator doesn't know about
> AtomicReference etc.  (Big thanks to Michael Sokolov for testAknnDiverse
> which made it much easier to track down subtle problems!)
>
> My motivation is twofold:
>
> 1. It is faster to query a single on-heap HNSW index than to query
> multiple such indexes and combine the results.
> 2. Even with some contention necessarily occurring while building the
> index, we still come out well ahead in total efficiency versus creating
> per-thread indexes and combining them.  Combining such indexes boils
> down to "pick the largest and then add all the other nodes normally," so
> you don't really benefit from having computed the others previously.
>
> I am currently adding this to Cassandra as code in our repo, but my
> preference would be to upstream it.  Is Lucene open to a pull request?
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
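For readers unfamiliar with the thread-local workaround mentioned in the quoted message, here is a minimal sketch in plain Java. It is not the code from the proposed patch and does not use any Lucene API: ScratchSearcher, sumRange and runConcurrently are hypothetical names, and the int[] buffer merely stands in for whatever per-search scratch state an API keeps when it assumes a single caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a class whose API assumes one caller at a time.
final class ScratchSearcher {
    // One scratch buffer per thread instead of a single shared mutable field.
    private static final ThreadLocal<int[]> SCRATCH =
        ThreadLocal.withInitial(() -> new int[16]);

    // Sums the ids in [start, start + count) using the calling thread's buffer.
    static long sumRange(int start, int count) {
        int[] scratch = SCRATCH.get();
        long total = 0;
        for (int offset = 0; offset < count; offset += scratch.length) {
            int n = Math.min(scratch.length, count - offset);
            for (int i = 0; i < n; i++) {
                scratch[i] = start + offset + i; // fill the per-thread buffer
            }
            for (int i = 0; i < n; i++) {
                total += scratch[i];             // consume it
            }
        }
        return total;
    }

    // Runs sumRange on a small pool and returns the results in task order.
    static List<Long> runConcurrently(int tasks, int count) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int t = 0; t < tasks; t++) {
                final int start = t * count;
                Callable<Long> task = () -> sumRange(start, count);
                futures.add(pool.submit(task));
            }
            List<Long> results = new ArrayList<>();
            for (Future<Long> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because every thread gets its own buffer, callers need no locking around the scratch state; the trade-off is one buffer per live thread.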

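To make the first motivation in the quoted message concrete, this is roughly the extra step a caller needs when querying several separate indexes: each index returns its own top hits, and those lists must then be merged into one global top-k. The sketch below is an assumption-laden illustration, not Lucene code; TopKMerge, Hit and merge are hypothetical names, and a higher score is assumed to mean a closer match.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopKMerge {
    // Hypothetical (id, score) pair; higher score is assumed to be better.
    static final class Hit {
        final int id;
        final float score;

        Hit(int id, float score) {
            this.id = id;
            this.score = score;
        }
    }

    // Merges per-index hit lists into a single list of the k best hits,
    // ordered from best to worst.
    static List<Hit> merge(List<List<Hit>> perIndexHits, int k) {
        // Min-heap on score: the weakest retained hit sits at the head.
        PriorityQueue<Hit> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        for (List<Hit> hits : perIndexHits) {
            for (Hit h : hits) {
                heap.offer(h);
                if (heap.size() > k) {
                    heap.poll(); // evict the current weakest hit
                }
            }
        }
        List<Hit> merged = new ArrayList<>(heap);
        merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return merged;
    }
}
```

With a single shared index this merge step disappears, and each query traverses one graph rather than one per index.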