Draft PR is posted here: https://github.com/apache/lucene/pull/12254
This depends on my PR to use HashMap in the non-concurrent OnHeapHnswGraph (because that PR updates the tests to not assume sorted order of nodes in a given level): https://github.com/apache/lucene/pull/12248 On Fri, Apr 28, 2023 at 8:14 AM Jonathan Ellis <jbel...@gmail.com> wrote: > Great, I will work on squashing to get a clean PR. > > One thing I am struggling with is the RamUsageTester. Here is the > stacktrace: > https://gist.github.com/jbellis/20676b0e23f43751cbe8834a8def0d12 > > Apparently RamUsageTester tries to flip private fields to public so it can > introspect them, but the JVM modularization locks this down for internal > classes like ThreadLocal. Unclear to me why this is the first time this > problem has come up or how to fix it. > > On Fri, Apr 28, 2023 at 2:18 AM Alessandro Benedetti <a.benede...@sease.io> > wrote: > >> That's great! And we were talking about this exactly here: >> https://github.com/apache/lucene/pull/12169 >> >> It would also help with the new token filter :) >> -------------------------- >> *Alessandro Benedetti* >> Director @ Sease Ltd. >> *Apache Lucene/Solr Committer* >> *Apache Solr PMC Member* >> >> e-mail: a.benede...@sease.io >> >> >> *Sease* - Information Retrieval Applied >> Consulting | Training | Open Source >> >> Website: Sease.io <http://sease.io/> >> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter >> <https://twitter.com/seaseltd> | Youtube >> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github >> <https://github.com/seaseltd> >> >> >> On Thu, 27 Apr 2023 at 19:29, Jonathan Ellis <jbel...@gmail.com> wrote: >> >>> Hi all, >>> >>> I've created an HNSW index implementation that allows for concurrent >>> build and querying. On my i9-12900 (8 performance cores and 8 efficiency) >>> I get a bit less than 10x speedup of wall clock time for building and >>> querying the "siftsmall" and "sift" datasets from >>> http://corpus-texmex.irisa.fr/. The small dataset is 10k vectors while >>> the large is 1M. This speedup feels pretty good for a data structure that >>> isn't completely parallelizable, and it's good to see that it's consistent >>> as the dataset gets larger. >>> >>> The concurrent classes achieve identical recall compared to the >>> non-concurrent versions within my ability to test it, and are drop-in >>> replacements for OnHeapHnswGraph and HnswGraphBuilder; I use threadlocals >>> to work around the places where the existing API assumes no concurrency. >>> >>> The concurrent classes also pass the existing test suite with the >>> exception of the ram usage ones; the estimator doesn't know about >>> AtomicReference etc. (Big thanks to Michael Sokolov for testAknnDiverse >>> which made it much easier to track down subtle problems!) >>> >>> My motivation is >>> >>> 1. It is faster to query a single on-heap hnsw index, than to query >>> multiple such indexes and combine the result. >>> 2. Even with some contention necessarily occurring during building of >>> the index, we still come out way ahead in terms of total efficiency vs >>> creating per-thread indexes and combining them, since combining such >>> indexes boils down to "pick the largest and then add all the other nodes >>> normally," you don't really benefit from having computed the others >>> previously. >>> >>> I am currently adding this to Cassandra as code in our repo, but my >>> preference would be to upstream it. Is Lucene open to a pull request? >>> >>> -- >>> Jonathan Ellis >>> co-founder, http://www.datastax.com >>> @spyced >>> >> > > -- > Jonathan Ellis > co-founder, http://www.datastax.com > @spyced > -- Jonathan Ellis co-founder, http://www.datastax.com @spyced