Re: Concurrent HNSW index

Michael Wechner Thu, 27 Apr 2023 13:06:11 -0700

+1 for a pull request

Thanks


Michael

On 27.04.23 at 20:53, Ishan Chattopadhyaya wrote:

+1, please contribute to Lucene. Thanks!

On Thu, 27 Apr, 2023, 10:59 pm Jonathan Ellis, <jbel...@gmail.com> wrote:

    Hi all,

    I've created an HNSW index implementation that allows for
    concurrent build and querying.  On my i9-12900 (8 performance
    cores and 8 efficiency) I get a bit less than 10x speedup of wall
    clock time for building and querying the "siftsmall" and "sift"
    datasets from http://corpus-texmex.irisa.fr/. The small dataset is
    10k vectors while the large is 1M. This speedup feels pretty good
    for a data structure that isn't completely parallelizable, and
    it's good to see that it's consistent as the dataset gets larger.

    The concurrent classes achieve identical recall compared to the
    non-concurrent versions within my ability to test it, and are
    drop-in replacements for OnHeapHnswGraph and HnswGraphBuilder; I
    use threadlocals to work around the places where the existing API
    assumes no concurrency.
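
The threadlocal workaround mentioned above can be illustrated roughly as follows. This is only a minimal sketch and not the code from the proposed patch: ScratchSearcher, collect and runConcurrently are hypothetical names made up for illustration, not Lucene APIs. It assumes the non-concurrent code reused one shared scratch buffer per instance, and shows how giving each thread its own buffer lets several callers run at once without locking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical class (not part of Lucene): per-thread scratch state behind
// an API that was originally written assuming a single caller.
class ScratchSearcher {
    // Each thread lazily receives its own private buffer on first use.
    private static final ThreadLocal<List<Integer>> SCRATCH =
        ThreadLocal.withInitial(ArrayList::new);

    // Collects the ids 0, step, 2*step, ... below n, using the calling
    // thread's scratch list as working space.
    static List<Integer> collect(int n, int step) {
        List<Integer> scratch = SCRATCH.get();
        scratch.clear();
        for (int i = 0; i < n; i += step) {
            scratch.add(i);
        }
        // Copy out, because the scratch list is reused by the next call.
        return new ArrayList<>(scratch);
    }

    // Calls collect() from several pool threads at once (step = 1..threads)
    // and returns the results in submission order.
    static List<List<Integer>> runConcurrently(int threads, int n) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<Integer>>> futures = new ArrayList<>();
            for (int t = 1; t <= threads; t++) {
                final int step = t;
                futures.add(pool.submit(() -> collect(n, step)));
            }
            List<List<Integer>> results = new ArrayList<>();
            for (Future<List<Integer>> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because every thread only ever touches its own list, the callers do not need to synchronize on the scratch state; the trade-off is one buffer per thread instead of one per index.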

    The concurrent classes also pass the existing test suite, with the
    exception of the RAM usage tests; the estimator doesn't know about
    AtomicReference etc.  (Big thanks to Michael Sokolov for
    testAknnDiverse which made it much easier to track down subtle
    problems!)

    My motivation is twofold:

    1. It is faster to query a single on-heap HNSW index than to
    query multiple such indexes and combine the results.
    2. Even with some contention necessarily occurring while building
    the index, we still come out way ahead in total efficiency
    compared with creating per-thread indexes and combining them.
    Combining such indexes boils down to "pick the largest and then
    add all the other nodes normally," so you don't really benefit
    from having computed the others previously.

    I am currently adding this to Cassandra as code in our repo, but
    my preference would be to upstream it.  Is Lucene open to a pull
    request?

--Jonathan Ellis

    co-founder, http://www.datastax.com
    @spyced
