I’ve been trying to learn all I can about Solr for a year now, and I still
scratch my head when it comes to the effects of shards and replicas on performance.

----- info about my setup ----
We have a SolrCloud setup with 6 nodes.
Each collection has 2 shards and 2 replicas. Each shard is about 100GB.
Each collection has around 400M documents.
We have ~20 collections like this, and their number is growing.
Each node has 50GB of memory, 24GB of which is given to the Solr heap.

We do a lot of faceting and streaming expressions.
--------------------------------------
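For scale, here is a back-of-envelope calculation from the numbers above (my own arithmetic; the even spread of cores across nodes is an assumption):

```python
# Rough index-size vs. OS page cache math from the setup described above.
# All input numbers come from the post; even placement is an assumption.
collections = 20
shards_per_collection = 2
replicas_per_shard = 2
shard_size_gb = 100

nodes = 6
node_ram_gb = 50
solr_heap_gb = 24

# Total index data on disk across the cluster (every replica is a full copy).
total_index_gb = collections * shards_per_collection * replicas_per_shard * shard_size_gb

# Index data each node hosts if cores are spread evenly.
index_per_node_gb = total_index_gb / nodes

# Memory left for the OS page cache after the Solr heap is taken out.
page_cache_gb = node_ram_gb - solr_heap_gb

print(f"total index: {total_index_gb} GB")                      # 8000 GB
print(f"per node: {index_per_node_gb:.0f} GB of index "
      f"vs ~{page_cache_gb} GB of page cache")                  # ~1333 GB vs ~26 GB
```

So each node holds roughly 50x more index data than it has page cache, which may matter as much for query latency as the shard/replica layout itself.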


Replicas: I don’t need much fault tolerance; if a node goes down, my queries
can stop, so a minimum of 2 replicas is enough for me if adding more replicas
won’t help with performance. Are more replicas just a waste of disk space then?
On the other hand, if more nodes had replicas of the same collection, could
they execute queries in parallel so the workload would be split over more nodes?


Shards: If a shard is too big for a single node, splitting it helps, but when
it’s small, splitting it just adds more distributed coordination work, right?
So 2 shards may be a sweet spot for me, or would I get better performance from
more, smaller shards?


Let’s say I had only 1 replica for each collection but split it into 6 shards,
one on every node.
Or I had 2 shards (1 shard is too big for a single node, I think) but 3
replicas: 2x3=6 cores, one on every node.

How would each of these layouts affect performance?
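To make the two layouts concrete, this is how I would create them with the standard Collections API CREATE call (the host, port, and collection names here are placeholders, not my real setup):

```shell
# Layout A: 6 shards x 1 replica = 6 cores, one per node, no redundancy.
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=colA&numShards=6&replicationFactor=1"

# Layout B: 2 shards x 3 replicas = 6 cores, one per node, every shard on 3 nodes.
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=colB&numShards=2&replicationFactor=3"
```

Both layouts put exactly one core on each of the 6 nodes; they differ only in how the 400M documents are partitioned versus copied.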

Also, we do a lot of multi-collection searches (solr/col1,col2,col3.../select),
so one query sometimes goes to 10 different collections. In that case, even if
the result set is very small, the query takes a long time to complete.
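Part of this is fan-out, I think: as I understand distributed queries, the coordinating node has to send a sub-request to (at least) one replica of every shard of every collection in the request, then merge the results. With the current layout:

```python
# Fan-out of one multi-collection query (numbers from my setup above).
collections_in_query = 10
shards_per_collection = 2

# One sub-request per shard per collection, before any refinement phases
# (faceting, for example, can add a second round trip per shard).
shard_subrequests = collections_in_query * shards_per_collection
print(shard_subrequests)  # 20
```

So even a tiny result set waits on the slowest of 20 shard responses, which would explain the latency I’m seeing.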

Many thanks if you read until here!

--uyilmaz
