Concurrent HNSW index, take two

2023-07-06 Thread Jonathan Ellis
Hi all,

I first published a concurrent HNSW PR in April, which turned out to be a
bit premature.  There was a lot of code churn as I fixed bugs and improved
performance.  Sorry about that!

This code has been available as part of DataStax Astra’s public vector
search preview for almost a month now, and the pace of change has slowed to
a crawl, so I think it’s a good time to take another stab at contributing
it to Lucene.  The new PR is here:
https://github.com/apache/lucene/pull/12421

Here’s how the performance looks, using my SIFT test harness at
https://github.com/jbellis/hnswdemo/tree/lucene-bench.  Numbers are
averaged across 5 test runs:

Serial construction of 1M nodes: 292.3s
Parallel construction, 1 thread: 379.0s = 129.7% of serial
Parallel construction, 2 threads: 191.3s = 50.5% of parallel/1
Parallel construction, 4 threads: 96.1s = 25.4% of parallel/1
Parallel construction, 8 threads: 52.6s = 13.8% of parallel/1

Serial queries of 100k vectors / top 100: 38.3s
Parallel queries, 1 thread: 41.6s = 108.6% of serial
Parallel queries, 2 threads: 21.0s = 50.5% of parallel/1
Parallel queries, 4 threads: 10.3s = 24.7% of parallel/1
Parallel queries, 8 threads: 5.3s = 12.7% of parallel/1

To summarize, the concurrent class adds roughly 30% overhead during
construction and about 10% overhead at query time.  It parallelizes
construction nearly perfectly through 4 threads, with some additional
inefficiency becoming visible at 8.  (This is probably the effect of having
to do more vector comparisons against the concurrently added nodes – I
would expect it to remain relatively small rather than explode as thread
counts increase.)  Queries scale nearly perfectly out to at least 8 threads.

ConcurrentOnHeapHnswGraph is very similar to OnHeapHnswGraph, but with
concurrent backing Collections.  The main addition is a getView operation
to provide a threadsafe snapshot of the graph for searches.  The View uses
a CompletionTracker class to provide a kind of snapshot isolation –
otherwise it is impossible to prevent partially added nodes from causing
very difficult-to-debug race conditions.  This mechanism is used during
construction as well as for user-invoked searches.

(The initial CompletionTracker implementation was just an AtomicInteger
clock and a Map keyed by boxed node ids, but the constant box/unbox
introduced significant CPU and GC overhead.  The current implementation is
the result of optimizing that.)
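
To make the snapshot-isolation idea concrete, here is a minimal sketch of a
completion clock that avoids boxed keys.  The names (CompletionClock,
markComplete, viewTime) and the fixed-size array are illustrative only, not
the code in the PR:

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicIntegerArray;

    final class CompletionClock {
      private final AtomicInteger clock = new AtomicInteger();
      // completion time per node id; 0 means "not yet fully added"
      private final AtomicIntegerArray completedAt;

      CompletionClock(int maxNodes) {
        completedAt = new AtomicIntegerArray(maxNodes);
      }

      /** Called by the adding thread once its node is fully linked into the graph. */
      void markComplete(int node) {
        completedAt.set(node, clock.incrementAndGet());
      }

      /** A View records the clock value at the moment it is created... */
      int viewTime() {
        return clock.get();
      }

      /** ...and only treats nodes as present if they completed at or before that time. */
      boolean visibleIn(int node, int viewTime) {
        int t = completedAt.get(node);
        return t != 0 && t <= viewTime;
      }
    }

A search running against such a View never observes a node whose neighbor
lists are still being built, which is exactly the class of race described
above.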

ConcurrentHnswGraphBuilder adds an incremental RAM-used estimate as a
return value to addGraphNode, and a buildAsync method that takes an
ExecutorService for fine-grained control over parallelism.  Otherwise, it
follows the API of HnswGraphBuilder closely.  (Closely enough that my
original PR extracted a common interface and added factories so they could
be plugged in interchangeably, but this has been removed following
Michael's feedback.)
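
Purely to illustrate the kind of control an ExecutorService gives the
caller – this is not the PR's API, and addNode below is a hypothetical
stand-in for a thread-safe addGraphNode-style call:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    class ParallelBuildSketch {
      /** Stand-in for a builder whose addNode is safe to call from many threads. */
      interface ThreadSafeGraphBuilder {
        void addNode(int node);
      }

      static void buildParallel(ThreadSafeGraphBuilder builder, int nodeCount, int threads)
          throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
          List<Future<?>> futures = new ArrayList<>();
          for (int i = 0; i < nodeCount; i++) {
            final int node = i;
            futures.add(pool.submit(() -> builder.addNode(node)));
          }
          for (Future<?> f : futures) {
            f.get(); // wait for completion and surface any failure
          }
        } finally {
          pool.shutdown();
        }
      }
    }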

The key to achieving good concurrency while maintaining correctness without
synchronization is that we track in-progress node additions across all
threads in a ConcurrentSkipListSet.  After adding ourselves in addGraphNode,
we take a snapshot of this set (this is O(1) for CSLS) and consider all
other in-progress updates as neighbor candidates (subject to the normal
level constraints).
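
Here is a hedged sketch of that pattern with hypothetical names; the real
code takes the cheap snapshot described above, while this version copies
into a HashSet purely for brevity:

    import java.util.HashSet;
    import java.util.Set;
    import java.util.concurrent.ConcurrentSkipListSet;

    class InProgressTrackingSketch {
      private final ConcurrentSkipListSet<Integer> insertionsInProgress =
          new ConcurrentSkipListSet<>();

      void addGraphNode(int node) {
        insertionsInProgress.add(node);
        try {
          // Everything else currently being inserted is also a neighbor candidate,
          // subject to the usual level constraints.
          Set<Integer> concurrentCandidates = new HashSet<>(insertionsInProgress);
          concurrentCandidates.remove(node);
          linkNode(node, concurrentCandidates);
        } finally {
          insertionsInProgress.remove(node);
        }
      }

      private void linkNode(int node, Set<Integer> concurrentCandidates) {
        // search the completed graph plus the concurrent candidates and wire up
        // neighbor links here (omitted)
      }
    }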

In general, the main concern with the Concurrent Builder compared to the
serial one is to make sure that partially complete operations never “leak”
to other threads.  In particular,
* Neighbor manipulation has been encapsulated in ConcurrentNeighborSet.
CNS implements a copy-on-write NeighborArray (a minimal sketch of the idea
follows this list) – my initial implementation used a ConcurrentSkipListSet,
but that was significantly slower, since even during construction there are
many more “iterate through the neighbors” operations than “change the
neighbors” operations.  We have to subclass NeighborArray here to
efficiently handle the case of concurrently inserting the same node (as a
forward-link and a back-link).
* Entry point updating is not done until the new node has been fully added.
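
As promised above, here is a minimal copy-on-write sketch.  It is not the
ConcurrentNeighborSet code, just the shape of the idea: readers iterate an
immutable array snapshot, writers replace the whole array with
compareAndSet, and concurrent inserts of the same node (a forward-link
racing a back-link) collapse to a single entry:

    import java.util.Arrays;
    import java.util.concurrent.atomic.AtomicReference;

    final class CopyOnWriteNeighbors {
      private final AtomicReference<int[]> neighbors = new AtomicReference<>(new int[0]);

      /** Readers get an immutable snapshot; iteration never observes a partial update. */
      int[] snapshot() {
        return neighbors.get();
      }

      /** Writers copy, modify, and CAS; retry if another thread won the race. */
      void insert(int node) {
        while (true) {
          int[] current = neighbors.get();
          for (int existing : current) {
            if (existing == node) {
              return; // already linked, e.g. the forward-link beat this back-link
            }
          }
          int[] updated = Arrays.copyOf(current, current.length + 1);
          updated[current.length] = node;
          if (neighbors.compareAndSet(current, updated)) {
            return;
          }
        }
      }
    }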

One more point is a little subtle –
* When a new node is added, it considers as candidates both the nodes
already in the graph and the other nodes concurrently in the process of
being added. It does this by taking a snapshot of the "insertionsInProgress"
set when the node begins adding itself, and using both the snapshot and the
current graph state as candidate sources. This allows the graph to remain
connected even when many nodes are added concurrently.
* The subtle part is that we compute diversity *separately* for the
fully-added and the in-progress candidates.  This is because there is an
asymmetry between initial links (you need to be diverse or you’re
discarded) and backlinks added later (diversity is not enforced again until
you bump up against max connections).  Treating them separately allows us
to get very similar results to what you would get adding each node
serially.  See the discussion in addGraphNode about over-pruning for an
example.
* I think we could 

Re: Concurrent HNSW index

2023-04-28 Thread Jonathan Ellis
Draft PR is posted here: https://github.com/apache/lucene/pull/12254

This depends on my PR to use HashMap in the non-concurrent OnHeapHnswGraph
(because that PR updates the tests to not assume sorted order of nodes in a
given level): https://github.com/apache/lucene/pull/12248



-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced


Re: Concurrent HNSW index

2023-04-28 Thread Jonathan Ellis
Great, I will work on squashing to get a clean PR.

One thing I am struggling with is the RamUsageTester.  Here is the
stacktrace: https://gist.github.com/jbellis/20676b0e23f43751cbe8834a8def0d12

Apparently RamUsageTester tries to flip private fields to public so it can
introspect them, but JVM modularization locks this down for internal
classes like ThreadLocal.  It's unclear to me why this is the first time
this problem has come up, or how to fix it.
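
If it helps anyone reproduce this, my understanding – an assumption about
the failure mode, not something I have verified against RamUsageTester's
internals – is that it is the standard strong-encapsulation error:
setAccessible on a java.base field throws InaccessibleObjectException
unless the module is opened explicitly.

    import java.lang.reflect.Field;

    public class ModuleAccessDemo {
      public static void main(String[] args) throws Exception {
        // threadLocalHashCode is a private field of java.lang.ThreadLocal
        Field f = ThreadLocal.class.getDeclaredField("threadLocalHashCode");
        // Throws InaccessibleObjectException on modular JDKs unless run with
        // --add-opens java.base/java.lang=ALL-UNNAMED
        f.setAccessible(true);
      }
    }

Running with that --add-opens flag makes the demo above pass; whether that
is the right fix for the test suite is a separate question.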


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced


Re: Concurrent HNSW index

2023-04-28 Thread Alessandro Benedetti
That's great! And we were talking about this exactly here:
https://github.com/apache/lucene/pull/12169

It would also help with the new token filter :)
--
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io
LinkedIn | Twitter | Youtube | Github




Re: Concurrent HNSW index

2023-04-27 Thread Michael Wechner

+1 for a pull request

Thanks

Michael

On 27.04.23 at 20:53, Ishan Chattopadhyaya wrote:

+1, please contribute to Lucene. Thanks!




Re: Concurrent HNSW index

2023-04-27 Thread Ishan Chattopadhyaya
+1, please contribute to Lucene. Thanks!



Concurrent HNSW index

2023-04-27 Thread Jonathan Ellis
Hi all,

I've created an HNSW index implementation that allows for concurrent build
and querying.  On my i9-12900 (8 performance cores and 8 efficiency) I get
a bit less than 10x speedup of wall clock time for building and querying
the "siftsmall" and "sift" datasets from http://corpus-texmex.irisa.fr/.
The small dataset is 10k vectors while the large is 1M.  This speedup feels
pretty good for a data structure that isn't completely parallelizable, and
it's good to see that it's consistent as the dataset gets larger.

The concurrent classes achieve identical recall compared to the
non-concurrent versions within my ability to test it, and are drop-in
replacements for OnHeapHnswGraph and HnswGraphBuilder; I use threadlocals
to work around the places where the existing API assumes no concurrency.
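
To illustrate the threadlocal workaround (hypothetical names, not the
actual classes): wherever the existing API keeps per-builder scratch state,
each building thread can get its own copy instead.

    final class PerThreadScratch {
      // Each thread reuses its own candidate buffer rather than sharing one
      // field on the builder, so threads never observe each other's partial state.
      private final ThreadLocal<float[]> buffer =
          ThreadLocal.withInitial(() -> new float[1024]);

      float[] get() {
        return buffer.get();
      }
    }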

The concurrent classes also pass the existing test suite with the exception
of the ram usage ones; the estimator doesn't know about AtomicReference
etc.  (Big thanks to Michael Sokolov for testAknnDiverse which made it much
easier to track down subtle problems!)

My motivation is

1. It is faster to query a single on-heap HNSW index than to query multiple
such indexes and combine the results.
2. Even with some contention necessarily occurring while building the index,
we still come out way ahead in total efficiency vs. creating per-thread
indexes and combining them: combining such indexes boils down to "pick the
largest and then add all the other nodes normally," so you don't really
benefit from having computed the others previously.

I am currently adding this to Cassandra as code in our repo, but my
preference would be to upstream it.  Is Lucene open to a pull request?

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced