krickert commented on PR #15676:
URL: https://github.com/apache/lucene/pull/15676#issuecomment-3871612413

   **Summary so far**
   
   I now have a controlled multi‑shard test harness. The current luceneutil doesn’t include a concurrent shard/join scenario, which makes sense: that’s typically a Solr/OpenSearch concern.
   
   **What I’ve confirmed so far**
   
   * Single‑index recall is unchanged with this patch.
   * The luceneutil unit tests show modest pruning (~3%) at low K, with no 
recall loss.
   
   **Why the new harness**
   
   This harness is just a stop‑gap to test this feature; I’ll be glad to contribute it to luceneutil afterwards and discuss how best to do that. @benwtrent, since this needs Lucene to be tested in a sharded setup, would you be open to a separate effort to extend [luceneutil](https://github.com/mikemccand/luceneutil/) to cover those use cases? I didn’t want to bloat the util, and it’s a pretty complicated setup.
   
   Higher K values need large, realistic datasets. I reviewed luceneutil, but 
building a one‑off shard tool felt safer than adding large test infrastructure 
there. I’ll happily move these tests into luceneutil once there’s a multi‑shard 
test path.
   
   **Current status / issue**
   
   In the new shard harness I’m seeing a large speedup (~50%) and a big latency reduction, but also a significant recall drop (~10%) at 16 shards compared to the standard Lucene jar (and both baselines are lower than I expected). I’m currently isolating whether this is a data‑collection issue or an actual bug.
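
   For clarity on what that recall number measures: the harness merges each shard’s top‑k by similarity, keeps the global top‑k, and compares it against a single‑index ground truth. A minimal sketch of that scoring step (names are illustrative, not the actual harness code, and it assumes doc ids are globally unique across shards):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

public class ShardedRecall {
  // Illustrative hit: a globally unique doc id plus its vector similarity score.
  record Hit(int docId, float score) {}

  // Merge the per-shard top-k lists and keep the global top-k by score.
  static List<Hit> mergeTopK(List<List<Hit>> perShardTopK, int k) {
    List<Hit> all = new ArrayList<>();
    perShardTopK.forEach(all::addAll);
    all.sort(Comparator.comparingDouble(Hit::score).reversed());
    return all.subList(0, Math.min(k, all.size()));
  }

  // recall@k = fraction of the ground-truth ids found in the merged top-k.
  static double recallAtK(List<Hit> merged, Set<Integer> groundTruthIds, int k) {
    long found = merged.stream().filter(h -> groundTruthIds.contains(h.docId())).count();
    return (double) found / k;
  }
}
```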
   
   **When stable, planned test matrix**
   
   K values: {10, 100, 1000, 10000}
   Shards: {2, 4, 8, 16}
   Datasets: {1‑sentence Wikipedia, 3‑sentence Wikipedia}
   Models: {MiniLM, E5}
   
   That comes to 64 runs (4 K values × 4 shard counts × 2 datasets × 2 models), which I’ll kick off once results look stable.
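
   For reference, a minimal sketch of how the harness enumerates that matrix (names are illustrative, not the actual harness code):

```java
import java.util.ArrayList;
import java.util.List;

public class RunMatrix {
  // Illustrative run descriptor; the real harness uses its own config objects.
  record Run(int k, int shards, String dataset, String model) {}

  public static void main(String[] args) {
    int[] ks = {10, 100, 1000, 10000};
    int[] shardCounts = {2, 4, 8, 16};
    String[] datasets = {"wiki-1sentence", "wiki-3sentence"};
    String[] models = {"MiniLM", "E5"};

    List<Run> runs = new ArrayList<>();
    for (int k : ks) {
      for (int shards : shardCounts) {
        for (String dataset : datasets) {
          for (String model : models) {
            runs.add(new Run(k, shards, dataset, model));
          }
        }
      }
    }
    System.out.println(runs.size()); // 4 * 4 * 2 * 2 = 64
  }
}
```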
   
   The testing project I pushed automates most of this (it took ~3 hours to set up locally and requires Docker).
   
   Baseline runs against current Lucene are done. Next I’ll repeat with the PR 
jar and measure pruning/visit reductions.
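
   By "pruning/visit reductions" I just mean the relative drop in visited vectors per query between the two jars; the numbers in the sketch below are made‑up examples:

```java
public class VisitStats {
  // Relative reduction in visited vectors between a baseline run and a PR-jar run.
  // The visited counts come from the harness's per-query stats (illustrative).
  static double visitReduction(long visitedBaseline, long visitedPatched) {
    return 1.0 - (double) visitedPatched / (double) visitedBaseline;
  }

  public static void main(String[] args) {
    // Made-up example: 120k visited on baseline vs 78k with the patch -> 35% reduction.
    System.out.printf("%.1f%%%n", 100 * visitReduction(120_000, 78_000));
  }
}
```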
   
   This should provide a definitive signal for the distributed use case if a 
streaming update of the global bar is used.
   
   Dataset: ~2M Wikipedia vectors.
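
   To spell out what I mean above by a streaming update of the global bar (a conceptual sketch only, not this PR’s implementation): each shard publishes its current k‑th best similarity once its local queue is full, and the shared floor is the max of those published values. Since the global k‑th best can only be at least that high, anything scoring below the floor is not globally competitive and can be pruned.

```java
import java.util.concurrent.atomic.DoubleAccumulator;

// Conceptual sketch only: a shared similarity floor that shards update while searching.
// The max of the per-shard k-th best similarities is a valid lower bound on the global
// k-th best, so anything scoring below it can be skipped as non-competitive.
public class GlobalBar {
  private final DoubleAccumulator floor =
      new DoubleAccumulator(Math::max, Double.NEGATIVE_INFINITY);

  // Called by a shard once it has collected k local hits: publish its k-th best similarity.
  public void publishLocalKthBest(double kthBestSimilarity) {
    floor.accumulate(kthBestSimilarity);
  }

  // Read by shards during search; candidates below this floor are not globally competitive.
  public double currentFloor() {
    return floor.get();
  }
}
```

   In a distributed setup the publish step is the streaming update over the wire, and how often shards push/pull the floor is the main knob trading network chatter against extra graph exploration.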

