At its core, a Lucene search works by locating terms and resolving the
matching documents from them. For standard term queries, a term is
located by a process akin to binary search, which means roughly log(n)
seeks to find the term. Say you have 10M terms in your corpus. Stored
in a single field in a single index with a single segment, locating a
term would take log2(10M) ~= 24 seeks. This is of course very
simplified.
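
As a rough back-of-the-envelope check of that estimate (plain Java
arithmetic, no Lucene API involved; the 10M figure and the class name
are just taken from / made up for the example above):

    public class SingleIndexSeeks {
        public static void main(String[] args) {
            long terms = 10_000_000L;
            // binary-search-style lookup: ~log2(n) seeks to locate one term
            double seeks = Math.log(terms) / Math.log(2);
            System.out.printf("~%.0f seeks for one term lookup%n", Math.ceil(seeks)); // ~24
        }
    }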

When you have 63 indexes, log(n) works against you. Even with the
unrealistic assumption that the 10M terms are evenly distributed and
without duplicates, the number of seeks for a search that hits all parts
will still be 63 * log(10M/63) ~= 63 * 18 = 1134. And we haven't even
begun to estimate the merging part.
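
The same illustrative arithmetic for the sharded case (again assuming
the unrealistic even split across 63 indexes described above; the class
name is hypothetical):

    public class ShardedSeeks {
        public static void main(String[] args) {
            long terms = 10_000_000L;
            int shards = 63;
            // each index holds ~terms/shards terms and is searched independently
            double perShard = Math.ceil(Math.log((double) terms / shards) / Math.log(2)); // ~18
            System.out.printf("~%.0f seeks across all indexes%n", shards * perShard);     // ~1134
        }
    }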
This is true, but if the indexes are kept on 63 separate servers, those seeks will be carried out in parallel. The OP did indicate his indexes would be on different servers, I think? I still agree with your overall point: at this scale a single server is probably best. And if there are performance issues, I think the usual approach is to create multiple mirrored copies (slaves) rather than to shard. Sharding is useful for very large indexes, i.e. indexes too big to store on disk and cache in memory on one commodity box.

-Mike
