krickert commented on PR #15676: URL: https://github.com/apache/lucene/pull/15676#issuecomment-3927856021

> @krickert
>
> Your final numbers don't indicate recall. Please, we need to see what the Pareto frontier (how recall changes with increasing efSearch) looks like for the following scenarios as they reflect optimal, baseline, and candidate:
>
> - All vectors within a single shard (this is baseline optimal)
> - Vectors independently searched between shards (baseline multi-index/shard)
> - Vectors searched with your collaborative search (candidate multi-index/shard).
>
> Last I saw from your benchmarks, at k:100 collaborative was much worse.
>
> In most real world data, each index/shard will have a random subset of the entire dataset, your tests should reflect this as well.
>
> The Pareto frontier should likely be "recall vs. total vectors compared". And that for the latter two benchmarks, they are done against the exact same graphs/indices as reindexing isn't necessary to test with the collaborative searcher.
>
> I don't think benchmarking between machines is necessary. If the collaborative searching isn't useful when all shards are on the same machine (thus sharing information overhead is at its lowest), I doubt it will be helpful at all once overheads increase.

No problem. So The benchmark comparison
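
For reference, the "recall vs. total vectors compared" frontier requested in the quoted review could be computed roughly as below. This is only a sketch, not code from the PR: the function names (`recall_at_k`, `merge_shard_results`, `pareto_frontier`) are hypothetical, scores are assumed to be "higher is closer", and the cost of a run is assumed to be the number of vector comparisons summed over all shards for one efSearch setting.

```python
import heapq
from typing import List, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[int], ground_truth: Set[int], k: int) -> float:
    """Fraction of the exact top-k neighbours found in the first k retrieved ids."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in ground_truth)
    return hits / k


def merge_shard_results(
    per_shard: Sequence[Sequence[Tuple[float, int]]], k: int
) -> List[int]:
    """Merge per-shard (score, doc_id) hits into a global top-k list of ids.

    Assumption: a higher score means a closer vector.
    """
    all_hits = (hit for shard in per_shard for hit in shard)
    return [doc_id for _, doc_id in heapq.nlargest(k, all_hits)]


def pareto_frontier(points: Sequence[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Keep the (vectors_compared, recall) points that no cheaper-or-equal run beats.

    Points are sorted by cost; a point survives only if its recall is strictly
    higher than every point with lower or equal cost.
    """
    frontier: List[Tuple[int, float]] = []
    best_recall = float("-inf")
    for cost, recall in sorted(points, key=lambda p: (p[0], -p[1])):
        if recall > best_recall:
            frontier.append((cost, recall))
            best_recall = recall
    return frontier
```

Each of the three scenarios (single shard, independent shards, collaborative shards) would contribute one list of `(vectors_compared, recall)` points, one point per efSearch value, and the three resulting frontiers can then be plotted on the same axes.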
