krickert commented on PR #15676: URL: https://github.com/apache/lucene/pull/15676#issuecomment-3878265771
# Collaborative HNSW Shard Pruning: Visited/Recall Analysis

Yup, @benwtrent. The visited/recall curve is the only honest way to judge this. Here is what I am seeing on **1.47M deduped 1024-dim vectors** across **8 independent shards** (evenly distributed):

| K | Mode | Merged Recall | Nodes Visited |
|-----|---------------|---------------|---------------|
| 10 | Baseline | 0.738 | 22,788 |
| 10 | Collaborative | 0.664 | 16,168 |
| 100 | Baseline | 0.796 | 50,785 |
| 100 | Collaborative | 0.806 | 77,716 |

At **K=10**, collaborative pruning "wins" on performance but fails on recall: it is essentially cutting off the bridge paths required to reach local clusters. At **K=100**, we recover the recall, but the safety mechanisms (delayed bar application and slack buffers) cause us to over-explore, doing *more* work than the baseline.

The difficulty is that, unlike Lucene segments, independent shard indices don't share a consistent global HNSW topology. The assumption that shards are "random samples" is theoretically sound, but in practice, sharing the raw *k*th-best score from the first-finishing shard is too aggressive: it doesn't account for the variance in *when* a shard actually "finds" its target neighborhood during traversal.

I'm prototyping a distributed gRPC setup to test this with realistic network isolation (to eliminate the memory-bus contention we see in a single JVM). However, the core question remains: should the collaborative bar be a raw score, or a heuristic that factors in the shard count and expected local density? A rough sketch of the heuristic option follows below.

My ultimate goal is to make very large result sets ($K \ge 1000$) fast by allowing irrelevant shards to "bail out" early (see the second sketch at the end of this comment), but I'm seeing that I'll need a better way to derive that global threshold.
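To make the heuristic option concrete, here is a minimal sketch of what I'm experimenting with. The class name, the log-of-shard-count back-off, and the slack constant are all placeholders of mine, not anything in this PR:

```java
// Sketch only: relax a shard's local kth-best similarity before sharing it
// as the collaborative bar. The log(shardCount) back-off is a placeholder
// heuristic, not a derived bound.
public final class CollaborativeBar {

  private CollaborativeBar() {}

  /**
   * With S independent shards, each shard holds roughly 1/S of the true
   * top-K, so the locally observed kth-best score overestimates the global
   * cutoff. Back it off in proportion to log(S): more shards means more
   * variance in when a shard reaches its target neighborhood, so a weaker bar.
   */
  static float relaxedBar(float localKthBest, int shardCount, float slack) {
    if (shardCount <= 1) {
      return localKthBest; // single shard: the local bar is the global bar
    }
    return localKthBest - slack * (float) Math.log(shardCount);
  }

  public static void main(String[] args) {
    float localKth = 0.83f; // kth-best cosine similarity seen on one shard
    System.out.printf("local=%.3f shared=%.3f%n",
        localKth, relaxedBar(localKth, 8, 0.02f));
  }
}
```

The open question is whether the slack should also scale with expected local density (e.g., estimated from the shard's own score distribution) rather than being a fixed constant.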

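For the large-K bail-out, the shape I have in mind is roughly the loop below. Again a sketch: the frontier queue, the visit budget, and the traversal structure are assumptions about how this could be wired up, not the actual Lucene searcher code:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Sketch: a shard abandons its traversal once the best candidate left on its
// frontier cannot beat the shared collaborative bar.
final class ShardBailOut {

  /** Candidate node with its similarity score (higher is better). */
  record Candidate(int node, float score) {}

  /**
   * Greedy best-first expansion that stops as soon as the frontier's best
   * candidate falls below the shared bar, letting an irrelevant shard exit
   * long before it exhausts its visit budget.
   */
  static int traverse(PriorityQueue<Candidate> frontier, float sharedBar, int visitBudget) {
    int visited = 0;
    while (!frontier.isEmpty() && visited < visitBudget) {
      Candidate best = frontier.poll();
      if (best.score() < sharedBar) {
        break; // nothing left on this shard can make the global top-K
      }
      visited++;
      // expand best.node()'s neighbors here, scoring them and pushing
      // survivors back onto the frontier (omitted for brevity)
    }
    return visited;
  }

  public static void main(String[] args) {
    PriorityQueue<Candidate> frontier =
        new PriorityQueue<>(Comparator.comparingDouble(Candidate::score).reversed());
    frontier.add(new Candidate(0, 0.91f));
    frontier.add(new Candidate(1, 0.74f));
    frontier.add(new Candidate(2, 0.55f));
    System.out.println("visited = " + traverse(frontier, 0.70f, 1000)); // prints 2
  }
}
```

With a well-calibrated bar, a shard whose neighborhood is irrelevant to the query should exit after a handful of pops instead of burning its full visit budget, which is where the $K \ge 1000$ win would come from.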