Hey there,

We are running a SolrCloud cluster with 4 nodes, all with the same
configuration. Each node has 8 GB of memory, 6 GB of which is assigned
to the JVM. This may be too much, but it has worked for a long time.

We currently run with 2 shards, 2 replicas and 11 collections. The
complete data-dir is about 5.3 GB.
I think we should move some JVM heap back to the OS.
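
To make the numbers concrete, here is a rough sketch of the per-node
memory split in Python. It assumes the worst case, that the complete
5.3 GB data-dir sits on a single node, and it ignores JVM overhead
outside the heap. The function names are only for illustration.

```python
# Rough per-node memory budget, all values in GB.
TOTAL_RAM_GB = 8.0   # physical memory per node
JVM_HEAP_GB = 6.0    # heap currently assigned to the JVM
DATA_DIR_GB = 5.3    # complete data-dir; ASSUMPTION: all of it on one node


def page_cache_budget(total_ram_gb, jvm_heap_gb):
    """Memory left outside the heap (shared by the OS and the page cache)."""
    return total_ram_gb - jvm_heap_gb


def cacheable_fraction(total_ram_gb, jvm_heap_gb, index_gb):
    """Upper bound on the share of the index that fits in the page cache."""
    return min(1.0, page_cache_budget(total_ram_gb, jvm_heap_gb) / index_gb)


if __name__ == "__main__":
    # Compare the current heap size with two smaller candidates.
    for heap in (JVM_HEAP_GB, 4.0, 2.0):
        left = page_cache_budget(TOTAL_RAM_GB, heap)
        share = cacheable_fraction(TOTAL_RAM_GB, heap, DATA_DIR_GB)
        print(f"heap={heap:.1f} GB  left for OS={left:.1f} GB  "
              f"index cacheable<={share:.0%}")
```

With the current setting only 2 GB remain outside the heap, so under
this assumption well under half of the index could stay in the OS cache.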

We are running Solr 5.2.1. As I could not see any SolrCloud-related
bug fixes in the release notes for 5.3.0 and 5.3.1, we did not bother
to upgrade first.

One of our nodes (Node A) reports these errors:

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at http://10.41.199.201:9004/solr/catalogue: Invalid
version (expected 2, but 101) or the data in not in 'javabin' format

Stacktrace: https://gist.github.com/bjoernhaeuser/46ac851586a51f8ec171

And shortly after (4 seconds) this happens on a *different* node (Node B):

Stopping recovery for core=suggestion coreNodeName=core_node2

There is no stack trace for this, but it happens for all 11 collections.

6 seconds after that Node C reports these errors:

org.apache.solr.common.SolrException:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /configs/customers/params.json

Stacktrace: https://gist.github.com/bjoernhaeuser/45a244dc32d74ac989f8

This also happens for all 11 collections.

And then different errors happen:

OverseerAutoReplicaFailoverThread had an error in its thread work
loop.:org.apache.solr.common.SolrException: Error reading cluster
properties

cancelElection did not find election node to remove
/overseer_elect/election/6507903311068798704-10.41.199.192:9004_solr-n_0000000112

At that point the cluster is broken and stops responding to most
queries. At the same time, ZooKeeper looks okay.

The cluster cannot heal itself from that situation, so we are forced to
take manual action: we restart node after node and hope that SolrCloud
eventually recovers. This sometimes takes several minutes and several
restarts of various nodes.
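
In case it helps with reproducing this, below is a minimal sketch of how
the replica states could be polled while the nodes are being restarted.
It assumes the Collections API CLUSTERSTATUS action and the JSON layout
it returns (cluster -> collections -> shards -> replicas -> state). The
helper names are hypothetical, and it does not cross-check against the
live_nodes list.

```python
import json
from urllib.request import urlopen


def fetch_cluster_status(base_url):
    """Fetch CLUSTERSTATUS as a dict; base_url ends in /solr (assumption)."""
    url = base_url + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def unhealthy_replicas(status):
    """Return sorted (collection, shard, replica, state) for non-active replicas."""
    bad = []
    collections = status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                state = replica.get("state")
                if state != "active":
                    bad.append((coll_name, shard_name, replica_name, state))
    return sorted(bad)
```

Calling `unhealthy_replicas(fetch_cluster_status("http://10.41.199.201:9004/solr"))`
against one of the nodes would then list every replica that is still
down or recovering.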

We can provide more log data if needed.

Is there anywhere we can start digging to find the underlying
cause of this problem?

Thanks in advance
Björn
