At its core, a Lucene search works by locating terms and resolving the
matching documents from them. For standard term queries, a term is
located by a process akin to binary search, which means roughly log(n)
seeks to find the term. Say you have 10M terms in your corpus. Stored
in a single field in a single index with a single segment, locating a
term would take log2(10M) ~= 24 seeks. This is of course very
simplified.
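
As a rough back-of-the-envelope check of that estimate (plain Java
arithmetic, no Lucene API involved; the 10M figure and the class name
are just taken from / made up for the example above):

    public class SingleIndexSeeks {
        public static void main(String[] args) {
            long terms = 10_000_000L;
            // binary-search-style lookup: ~log2(n) seeks to locate one term
            double seeks = Math.log(terms) / Math.log(2);
            System.out.printf("~%.0f seeks for one term lookup%n", Math.ceil(seeks)); // ~24
        }
    }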

When you have 63 indexes, log(n) works against you. Even with the
unrealistic assumption that the 10M terms are evenly distributed and
without duplicates, the number of seeks for a search that hits all parts
will still be 63 * log(10M/63) ~= 63 * 18 = 1134. And we haven't even
begun to estimate the merging part.
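
The same illustrative arithmetic for the sharded case (again assuming
the unrealistic even split across 63 indexes described above; the class
name is hypothetical):

    public class ShardedSeeks {
        public static void main(String[] args) {
            long terms = 10_000_000L;
            int shards = 63;
            // each index holds ~terms/shards terms and is searched independently
            double perShard = Math.ceil(Math.log((double) terms / shards) / Math.log(2)); // ~18
            System.out.printf("~%.0f seeks across all indexes%n", shards * perShard);     // ~1134
        }
    }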
This is true, but if the indexes are kept on 63 separate servers, those seeks will be carried out in parallel. The OP did indicate his indexes would be on different servers, I think? I still agree with your overall point: at this scale a single server is probably best. And if there are performance issues, I think the usual approach is to create multiple mirrored copies (slaves) rather than to shard. Sharding is useful for very large indexes, i.e. indexes too big to store on disk and cache in memory on one commodity box.

-Mike
