Hey there,

we are running a SolrCloud cluster with 4 nodes, all with the same config. Each node has 8 GB of memory, of which 6 GB is assigned to the JVM. This is maybe too much, but it has worked for a long time.
We currently run with 2 shards, 2 replicas and 11 collections. The complete data-dir is about 5.3 GB. I think we should move some JVM heap back to the OS.

We are running Solr 5.2.1. As I could not see any SolrCloud-related bugs in the release notes for 5.3.0 and 5.3.1, we did not bother to upgrade first.

One of our nodes (Node A) reports these errors:

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://10.41.199.201:9004/solr/catalogue: Invalid version (expected 2, but 101) or the data in not in 'javabin' format

Stacktrace: https://gist.github.com/bjoernhaeuser/46ac851586a51f8ec171

Shortly after (4 seconds), this happens on a *different* node (Node B):

Stopping recovery for core=suggestion coreNodeName=core_node2

There is no stacktrace for this, but it happens for all 11 collections.

6 seconds after that, Node C reports these errors:

org.apache.solr.common.SolrException: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /configs/customers/params.json

Stacktrace: https://gist.github.com/bjoernhaeuser/45a244dc32d74ac989f8

This also happens for all 11 collections.

After that, different errors appear:

OverseerAutoReplicaFailoverThread had an error in its thread work loop.:org.apache.solr.common.SolrException: Error reading cluster properties

cancelElection did not find election node to remove /overseer_elect/election/6507903311068798704-10.41.199.192:9004_solr-n_0000000112

At that point the cluster is broken and stops responding to most queries. At the same time, ZooKeeper looks okay.

The cluster cannot heal itself from that situation, so we are forced to take manual action: we restart node after node and hope that SolrCloud eventually recovers. This sometimes takes several minutes and several restarts of various nodes.

We can provide more log data if needed.

Is there anywhere we can start digging to find the underlying cause of this problem?

Thanks in advance
Björn