benwtrent commented on PR #15676:
URL: https://github.com/apache/lucene/pull/15676#issuecomment-3927772689
@krickert
Your final numbers don't report recall. Please, we need to see what the
Pareto frontier (how recall changes with increasing efSearch) looks like for
the following scenarios, which reflect the optimal, baseline, and candidate cases:
- All vectors within a single shard (this is baseline optimal)
- Vectors independently searched between shards (baseline multi-index/shard)
- Vectors searched with your collaborative search (candidate
multi-index/shard).
Last I saw from your benchmarks, at k:100 collaborative was much worse.
In most real-world data, each index/shard will have a random subset of the
entire dataset; your tests should reflect this as well.
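
As a minimal sketch of what a random split could look like in a test harness (Python, not Lucene code; `assign_random_shards` and its parameters are hypothetical names used only for illustration), vector ids can be shuffled with a fixed seed and dealt out across shards:

```python
import random


def assign_random_shards(num_vectors, num_shards, seed=0):
    """Split vector ids 0..num_vectors-1 into num_shards random subsets.

    Hypothetical helper for a benchmark harness; not part of Lucene.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    ids = list(range(num_vectors))
    rng.shuffle(ids)
    # Deal the shuffled ids round-robin so shard sizes differ by at most one.
    return [ids[i::num_shards] for i in range(num_shards)]
```

Each returned list would then be indexed as its own shard, so no shard ends up with a clustered or otherwise favorable slice of the dataset.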
The Pareto frontier should likely be "recall vs. total vectors compared".
The latter two benchmarks should also be run against the exact same
graphs/indices, as reindexing isn't necessary to test with the collaborative
searcher.
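
A rough sketch of how such a frontier could be computed from benchmark runs, assuming each run (one efSearch setting in one scenario) is reduced to a `(vectors_compared, recall)` pair, with the comparison counts summed across all shards for the multi-shard scenarios. The function names here are hypothetical and this is plain Python, not Lucene code:

```python
def recall_at_k(retrieved, ground_truth, k):
    """Fraction of the true top-k ids that appear in the retrieved top-k."""
    truth = set(ground_truth[:k])
    return len(truth.intersection(retrieved[:k])) / k


def pareto_frontier(points):
    """Keep only the non-dominated (vectors_compared, recall) points.

    A point is dropped if another point reaches at least the same recall
    with no more vector comparisons.
    """
    frontier = []
    best_recall = float("-inf")
    # Sort by cost ascending; for equal cost, try the higher recall first.
    for cost, recall in sorted(points, key=lambda p: (p[0], -p[1])):
        if recall > best_recall:
            frontier.append((cost, recall))
            best_recall = recall
    return frontier
```

Computing one frontier per scenario and plotting the three on the same axes would show directly whether the candidate reaches a given recall with fewer total comparisons than independent shard search.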
I don't think benchmarking between machines is necessary. If the
collaborative searching isn't useful when all shards are on the same machine
(where the overhead of sharing information is at its lowest), I doubt it will be
helpful at all once overheads increase.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]