Hi all, I first published a concurrent HNSW PR in April, which turned out to be a bit premature. There was a lot of code churn as I fixed bugs and improved performance. Sorry about that!
This code has been available as part of DataStax Astra’s public vector search preview for almost a month now, and the pace of change has slowed to a crawl, so I think it’s a good time to take another stab at contributing it to Lucene. The new PR is here: https://github.com/apache/lucene/pull/12421 Here’s how the performance looks, using my SIFT test harness at https://github.com/jbellis/hnswdemo/tree/lucene-bench. Numbers are averaged across 5 test runs: Serial construction of 1M nodes: 292.3s Parallel construction, 1 thread: 379.0s = 129.7% of serial Parallel construction, 2 threads: 191.3s = 50.5% of parallel/1 Parallel construction, 4 threads: 96.1s = 25.4% of parallel/1 Parallel construction, 8 threads: 52.6s = 13.8% of parallel/1 Serial queries of 100k vectors / top 100: 38.3s Parallel queries, 1 thread: 41.6s = 1.09% of serial Parallel queries, 2 threads: 21.0s = 50.5% of parallel/1 Parallel queries, 4 threads: 10.3s = 24.7% of parallel/1 Parallel queries, 8 threads: 5.3s = 12.7% of parallel/1 To summarize, there is about a 30% overhead during construction and 10% overhead at query time from using the concurrent class. The concurrent class parallelizes construction nearly perfectly through 4 threads, with some additional inefficiency becoming visible at 8. (Probably this is the effect of having to do more vector comparisons across the concurrently added nodes – I would expect this to remain relatively small and not exploding as thread counts increase.) Queries scale close to perfectly to at least 8 threads. ConcurrentOnHeapHnswGraph is very similar to OnHeapHnswGraph, but with concurrent backing Collections. The main addition is a getView operation to provide a threadsafe snapshot of the graph for searches. The View uses a CompletionTracker class to provide a kind of snapshot isolation – otherwise it is impossible to prevent partially added nodes from causing very difficult to debug race conditions. This is used during construction as well as for user-invoked searches. (The initial CompletionTracker implementation was just an AtomicInteger clock and a Map<node Integer, clock Integer>, but the constant box/unbox introduced significant CPU and GC overhead. The current implementation is a result of optimizing that.) ConcurrentHnswGraphBuilder adds an incremental ram used estimate as a return value to addGraphNode, and a buildAsync method that takes an ExecutorService for fine-grained control over parallelism. Otherwise, it follows the API of HnswGraphBuilder closely. (Closely enough that my original PR extracted a common interface and added factories so that they could be plugged in interchangeably, but this is now removed after Michael’s feedback.) The key to achieving good concurrency while maintaining correctness without synchronization is, we track in-progress node additions across all threads in a ConcurrentSkipListSet. After adding ourselves in addGraphNode, we take a snapshot of this set (this is O(1) for CSLS), and consider all other in-progress updates as neighbor candidates (subject to normal level constraints). In general, the main concern with the Concurrent Builder compared to the serial is to make sure that partially complete operations never “leak” to other threads. In particular, * Neighbor manipulation has been encapsulated in ConcurrentNeighborSet. CNS implements a copy-on-write NeighborArray – my initial implementation used a ConcurrentSkipListSet but this was significantly slower since even during construction, there are many more “iterate through the neighbors” operations than “change the neighbors.” We have to subclass NeighborArray here to be able to efficiently handle the case of concurrently inserting the same node (as a forward-link and a back-link). * Entry point updating is not done until the new node has been fully added. One more point is a little subtle – * When a new node is added, it considers both existing nodes in the graph as candidates, as well as other nodes concurrently in the process of being added. It does this by taking a snapshot of the "insertionsInProgress" set when the node begins adding itself, and using both the snapshot and the current graph state as candidate sources. This allows the graph to remain connected even when many nodes are added concurrently. * The subtle part is that we compute diversity *separately* for the fully-added and the in-progress candidates. This is because there is an asymmetry between initial links (you need to be diverse or you’re discarded) and backlinks added later (diversity is not enforced again until you bump up against max connections). Treating them separately allows us to get very similar results to what you would get adding each node serially. See the discussion in addGraphNode about over-pruning for an example. * I think we could simplify this linking of new nodes to consider fully-added and in-progress candidates as a group instead of separately, if we implemented the paper’s “keep pruned connections” suggestion. I have experimented with this, and I think the recall:memory used benefits are good, but it deserves a followup ticket. The main graph test suites are identical except for changes to perform concurrent graph builds instead of serial, and the addition of testing the incremental ram usage estimate from ConcurrentHnswGraphBuilder::addGraphNode. I also added new tests for ConcurrentNeighborSet. Changes to existing code, mostly minor: * HnswGraphSearcher gets a new searchConcurrent method that uses a GrowableBitSet, since the size of the graph may change during the search. For the same reason, removed "assert friendOrd < size" from the search methods. * Moved JavaUtilBitSet out of BaseBitSetTestCase and renamed to GrowableBitSet. * Several NeighborArray fields move to protected since we're subclassing it now. * Updates OnHeapHnswGraph ramBytesUsed for the TreeMap -> HashMap change made in 3c163745, although apparently it’s close enough either way to mostly not matter. * Added --add-opens in build.gradle for tests to allow RamUsageEstimator to compute exact values when checking correctness of my estimates. * Added HnswGraph.addNode (with default unsupportedoperation) to document the shared expectations in one place. -- Jonathan Ellis co-founder, http://www.datastax.com @spyced