krickert commented on PR #15676: URL: https://github.com/apache/lucene/pull/15676#issuecomment-3878265771
# Collaborative HNSW Shard Pruning: Visited/Recall Analysis

Yup, @benwtrent. The visited/recall curve is the only honest way to judge this. Here is what I am seeing on **1.47M deduped 1024-dim vectors** across **8 independent shards** (evenly distributed):

| K | Mode | Merged Recall | Nodes Visited |
|-----|---------------|---------------|---------------|
| 10 | Baseline | 0.738 | 22,788 |
| 10 | Collaborative | 0.664 | 16,168 |
| 100 | Baseline | 0.796 | 50,785 |
| 100 | Collaborative | 0.806 | 77,716 |

At **K=10**, collaborative pruning "wins" on performance but fails on recall: it is essentially cutting off the bridge paths required to reach local clusters. At **K=100**, we recover the recall, but the safety mechanisms (delayed bar application and slack buffers) cause us to over-explore, doing *more* work than the baseline.

The difficulty is that, unlike Lucene segments, independent shard indices don't share a consistent global HNSW topology. The assumption that shards are "random samples" is theoretically sound, but in practice, sharing the raw *k*th-best score from the first-finishing shard is too aggressive: it doesn't account for the variance in *when* a shard actually "finds" its target neighborhood during traversal.

I'm prototyping a distributed gRPC setup to test this with realistic network isolation (to eliminate the memory-bus contention we see in a single JVM). However, the core question remains: should the collaborative bar be a raw score, or a heuristic that factors in the shard count and expected local density? A rough sketch of the heuristic option follows below.

My ultimate goal is to make very large result sets ($K \ge 1000$) fast by allowing irrelevant shards to "bail out" early (see the second sketch at the end of this comment), but I'm seeing that I'll need a better way to derive that global threshold.
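To make the heuristic option concrete, here is a minimal sketch of what I'm experimenting with. The class name, the log-of-shard-count back-off, and the slack constant are all placeholders of mine, not anything in this PR:

```java
// Sketch only: relax a shard's local kth-best similarity before sharing it
// as the collaborative bar. The log(shardCount) back-off is a placeholder
// heuristic, not a derived bound.
public final class CollaborativeBar {

  private CollaborativeBar() {}

  /**
   * With S independent shards, each shard holds roughly 1/S of the true
   * top-K, so the locally observed kth-best score overestimates the global
   * cutoff. Back it off in proportion to log(S): more shards means more
   * variance in when a shard reaches its target neighborhood, so a weaker bar.
   */
  static float relaxedBar(float localKthBest, int shardCount, float slack) {
    if (shardCount <= 1) {
      return localKthBest; // single shard: the local bar is the global bar
    }
    return localKthBest - slack * (float) Math.log(shardCount);
  }

  public static void main(String[] args) {
    float localKth = 0.83f; // kth-best cosine similarity seen on one shard
    System.out.printf("local=%.3f shared=%.3f%n",
        localKth, relaxedBar(localKth, 8, 0.02f));
  }
}
```

The open question is whether the slack should also scale with expected local density (e.g., estimated from the shard's own score distribution) rather than being a fixed constant.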

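For the large-K bail-out, the shape I have in mind is roughly the loop below. Again a sketch: the frontier queue, the visit budget, and the traversal structure are assumptions about how this could be wired up, not the actual Lucene searcher code:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Sketch: a shard abandons its traversal once the best candidate left on its
// frontier cannot beat the shared collaborative bar.
final class ShardBailOut {

  /** Candidate node with its similarity score (higher is better). */
  record Candidate(int node, float score) {}

  /**
   * Greedy best-first expansion that stops as soon as the frontier's best
   * candidate falls below the shared bar, letting an irrelevant shard exit
   * long before it exhausts its visit budget.
   */
  static int traverse(PriorityQueue<Candidate> frontier, float sharedBar, int visitBudget) {
    int visited = 0;
    while (!frontier.isEmpty() && visited < visitBudget) {
      Candidate best = frontier.poll();
      if (best.score() < sharedBar) {
        break; // nothing left on this shard can make the global top-K
      }
      visited++;
      // expand best.node()'s neighbors here, scoring them and pushing
      // survivors back onto the frontier (omitted for brevity)
    }
    return visited;
  }

  public static void main(String[] args) {
    PriorityQueue<Candidate> frontier =
        new PriorityQueue<>(Comparator.comparingDouble(Candidate::score).reversed());
    frontier.add(new Candidate(0, 0.91f));
    frontier.add(new Candidate(1, 0.74f));
    frontier.add(new Candidate(2, 0.55f));
    System.out.println("visited = " + traverse(frontier, 0.70f, 1000)); // prints 2
  }
}
```

With a well-calibrated bar, a shard whose neighborhood is irrelevant to the query should exit after a handful of pops instead of burning its full visit budget, which is where the $K \ge 1000$ win would come from.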