That is already being discussed. The catch is that, at present, Solr
relies on an even distribution of documents across shards for consistent
scoring, so you can't just add another shard, assign it a hash range,
and be done with it.

The reason comes down to the scoring mechanism used, TF/IDF (term
frequency/inverse document frequency). The IDF portion asks "how many
documents in the index contain this term?", and each shard answers that
from its own index. If a shard holds only two documents, its IDF will be
very different from that of a shard holding 2 million docs, resulting in
different scores for equivalent documents depending on which shard they
are in.
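
To make that concrete, here is a toy calculation. It is only a sketch:
the formula is the classic Lucene-style IDF, 1 + ln(numDocs / (docFreq + 1)),
which I am assuming for illustration, and the function name and shard
sizes are made up.

```python
import math

def idf(doc_freq, num_docs):
    # Classic Lucene-style IDF, assumed here for illustration only.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical shards: the same term is found in 1 document on a tiny
# shard (2 docs) and in 10 documents on a large shard (2 million docs).
small_shard_idf = idf(doc_freq=1, num_docs=2)
large_shard_idf = idf(doc_freq=10, num_docs=2_000_000)

print(small_shard_idf, large_shard_idf)
```

The same term gets a far larger IDF on the large shard, so an equivalent
document would be scored differently purely because of where it lives.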

Currently, the only solution to this is to distribute your documents
evenly. That means that if you have four shards and you create a fifth,
you'd need to send 1/5 of the documents from each existing shard to the
new shard, which is not a trivial task.
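
As a quick check of the rebalancing arithmetic, here is a small sketch.
It assumes the existing shards are evenly loaded to begin with, and the
function name is made up for this example.

```python
from fractions import Fraction

def fraction_to_move(old_shards, new_shards):
    """Share of each old shard's documents that must move so that every
    shard ends up equal. Assumes the old shards start out evenly loaded."""
    before = Fraction(1, old_shards)  # each old shard's share of all docs
    after = Fraction(1, new_shards)   # each shard's share after rebalancing
    return (before - after) / before

# Going from four shards to five:
per_shard = fraction_to_move(4, 5)
print(per_shard)
```

Each old shard drops from 1/4 to 1/5 of all documents, so it gives up
1/5 of what it holds, and every existing shard has to take part.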

I believe the JIRA ticket covering this was mentioned earlier in this
thread.

Upayavira

On Mon, Oct 8, 2012, at 04:33 PM, Radim Kolar wrote:
> Do it the way it is done in the Cassandra database. Adding a new node
> and redistributing data can be done on a live system without problems.
> It looks like this:
> 
> Every Cassandra node has a key range assigned. Instead of assigning
> keys to nodes by hash(key) mod nodes, every node owns a portion of the
> hash keyspace. The portions do not need to be the same size; one node
> can have a larger portion of the keyspace than another.
> 
> Say the maximum possible value of the hash function is 12.
> 
> shard1 - 1-4
> shard2 - 5-8
> shard3 - 9-12
> 
> Now let's add a new shard. In Cassandra, adding a new shard by default
> cuts an existing one in half, so you will have
> shard1 - 1-2
> shard2 - 3-4
> shard3 - 5-8
> shard4 - 9-12
> 
> See? You only needed to move documents from the old shard1. Usually
> you add more than one shard during a reorganization, so you rarely
> need to rebalance the cluster by moving every node to a different
> position in the hash keyspace.
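
For reference, the range split described in the quoted message can be
sketched as below. This is only an illustration of the idea, not
Cassandra's or Solr's actual code; the function name and the
list-of-tuples layout are made up.

```python
def split_range(ranges, index):
    """Split the inclusive (lo, hi) range at `index` in half.

    Returns the new list of ranges and the newly created upper half,
    which is the only part of the keyspace whose documents must move.
    """
    lo, hi = ranges[index]
    if lo >= hi:
        raise ValueError("range is too small to split")
    mid = (lo + hi) // 2
    new_range = (mid + 1, hi)
    new_ranges = ranges[:index] + [(lo, mid), new_range] + ranges[index + 1:]
    return new_ranges, new_range

# Keyspace 1..12 over three shards, as in the quoted example.
before = [(1, 4), (5, 8), (9, 12)]
after, new_range = split_range(before, 0)

# Hash values whose documents move to the new shard.
moved = list(range(new_range[0], new_range[1] + 1))
print(after, moved)
```

Only the hash values in the new range change owner; the other shards
keep everything they had. Note that this says nothing about scoring:
after the split the shards are no longer the same size.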
