krickert commented on PR #15676: URL: https://github.com/apache/lucene/pull/15676#issuecomment-3871612413
**Summary so far**

I now have a controlled multi-shard test harness. The current luceneutil doesn't include a "concurrent shard/join" scenario (which makes sense, since that's typically a Solr/OpenSearch concern).

**What I've confirmed so far**

* Single-index recall is unchanged with this patch.
* The luceneutil unit tests show modest pruning (~3%) at low K, with no recall loss.

**Why the new harness**

This is just a stop-gap to test this feature. I'll be glad to introduce these tests into luceneutil afterward and discuss how to do that.

@benwtrent regarding testing Lucene for sharding purposes, would you be open to a separate effort to extend [luceneutil](https://github.com/mikemccand/luceneutil/) to cover those use cases? I didn't want to bloat the util, and it's a pretty complicated setup.

Higher K values need large, realistic datasets. I reviewed luceneutil, but building a one-off shard tool felt safer than adding large test infrastructure there. I'll happily move these tests into luceneutil once there's a multi-shard test path.

**Current status / issue**

In the new shard harness I'm seeing a large speedup (~50%) and a big latency reduction, but also a significant recall drop with 16 shards (~10%) compared to the standard Lucene jar (and both baselines are lower than I expected). I'm currently isolating whether this is a data-collection issue or a bug.

**When stable, planned test matrix**

* K values: {10, 100, 1000, 10000}
* Shards: {2, 4, 8, 16}
* Datasets: {1-sentence Wikipedia, 3-sentence Wikipedia}
* Models: {MiniLM, E5}

That comes to 4 × 4 × 2 × 2 = 64 test runs, which I'll start once the results are more stable. The testing project I pushed automates most of this (it took ~3 hours to set up locally and requires Docker).

Baseline runs against current Lucene are done. Next I'll repeat them with the PR jar and measure pruning/visit reductions. This should provide a definitive signal for the distributed use case if a streaming update of the global bar is used.

Dataset: ~2M Wikipedia vectors.
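
For reference, a minimal sketch of how the planned matrix above could be enumerated, together with a plain recall@K check. This is illustrative only and is not the harness's actual code: the names `build_test_matrix` and `recall_at_k`, the dataset labels, and the definition of recall as top-K overlap with exact results are all assumptions.

```python
from itertools import product

# Values taken from the planned test matrix; the string labels are
# hypothetical and only serve to identify a run.
K_VALUES = (10, 100, 1000, 10000)
SHARD_COUNTS = (2, 4, 8, 16)
DATASETS = ("wikipedia-1-sentence", "wikipedia-3-sentence")
MODELS = ("MiniLM", "E5")


def build_test_matrix():
    """Return one config dict per (K, shards, dataset, model) combination."""
    return [
        {"k": k, "shards": shards, "dataset": dataset, "model": model}
        for k, shards, dataset, model in product(
            K_VALUES, SHARD_COUNTS, DATASETS, MODELS
        )
    ]


def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k doc ids that also appear in the approximate top-k.

    Assumes both lists are ordered best-first; returns 0.0 when there is
    no exact result to compare against.
    """
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / len(exact_top)


if __name__ == "__main__":
    matrix = build_test_matrix()
    print(f"{len(matrix)} runs planned")
    print(matrix[0])
```

Each run would then be executed twice, once against the baseline jar and once against the PR jar, so that recall and visit counts can be compared per configuration.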
