On 1/8/2019 12:12 PM, Johnston, Charlie wrote:
We have been using Solr 6.5.1 leveraging SolrCloud backed by ZooKeeper for a 
multi-client, multi-node cluster for several months now and have been having a 
few stability/recovery issues we’d like to confirm if they are fixed in Solr 7 
or not. We run 3 large Solr nodes in the cluster (each with 64 GBs of heap, 
100’s of collections, and 9000+ cores).

Managing that many indexes is currently SolrCloud's Achilles' heel.

Splitting the cluster across more Solr nodes will help to some degree, but dealing with thousands of replicas in a single cluster is simply not going to scale.  In addition to splitting the indexes across more nodes, you may also need to create multiple clusters so that each cluster is managing a smaller number of shard replicas.
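
To judge how the replicas are spread out before deciding how to split things up, you could count replicas per node from the Collections API CLUSTERSTATUS output. The following is only a minimal sketch. The base URL is an assumption about your setup, the function names are made up for illustration, and it assumes the usual cluster / collections / shards / replicas layout of the JSON response, with a node_name on each replica.

```python
import json
from collections import Counter
from urllib.request import urlopen


def count_replicas_per_node(cluster_status):
    """Count replicas hosted on each node in a parsed CLUSTERSTATUS response.

    Hypothetical helper: expects the parsed JSON as a dict.
    """
    counts = Counter()
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for collection in collections.values():
        for shard in collection.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                # Replicas without a node_name are grouped under "unknown".
                counts[replica.get("node_name", "unknown")] += 1
    return counts


def fetch_cluster_status(base_url):
    """Fetch CLUSTERSTATUS as a dict.

    base_url is an assumption, for example "http://localhost:8983/solr".
    """
    url = base_url.rstrip("/") + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urlopen(url) as response:
        return json.load(response)
```

Calling count_replicas_per_node(fetch_cluster_status("http://localhost:8983/solr")).most_common() against one of your nodes would list the nodes with the most replicas first. With thousands of cores the response will be large, so expect the request to take a while.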

This is a known problem, and work is constantly underway to improve the situation.  I need to repeat the experiments that I did on SOLR-7191 on a much newer version so that I have a better idea of whether things have improved in the 7.x versions.

The discussion on SOLR-7191 is long and very dense, but it might be worth reading.  You have about twice as many cores as I was creating in my experiments, which means that Solr will be processing more messages for recovery operations:

https://issues.apache.org/jira/browse/SOLR-7191

I think that SOLR-10265 pinpoints the central problem that causes these issues.  Some of its sub-issues have been implemented, which MIGHT mean that 7.x is a lot better off:

https://issues.apache.org/jira/browse/SOLR-10265

Thanks,
Shawn
