Hello all,

We experienced two major problems in two days in one of our data centers.
Here is our setup: 15 nodes, 3 shards, one replica per node, around 50 GB of 
index per shard.
We are running Solr 4.10.4 on Ubuntu servers using JDK 1.8.0u51.
We have an ensemble of 5 ZooKeeper nodes to coordinate the cluster.

We usually have an update rate of around 500 updates/s coming from SolrJ clients.

Suddenly, for an unknown reason, one of the shard leaders was not able to connect 
to any of its replicas and initiated a recovery on all of them.
At this point we were not able to perform any queries on the entire cluster.
Of our 15 nodes, some were responding but most were not answering at all 
(on all shards).
Their CPU usage was low, so I used VisualVM to see what was going on.
It appeared that the hung nodes were using around 600 threads, most of them 
being "httpShardExecutor" threads: around 100 running and many parked.
We restarted one of these nodes, and as soon as it started it created these 600 
threads again.
We finally managed to get our cluster back by stopping all incoming traffic 
and restarting the leader node of the affected shard; everything was back 
within a few minutes.
I was wondering if we hit 
SOLR-7109<https://issues.apache.org/jira/browse/SOLR-7109> but I'm not sure 
about this.
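
For anyone who wants to check for the same symptom without VisualVM, the 
thread counts above can also be taken from a thread dump (for example one 
saved with `jstack <pid> > dump.txt`). This is only a minimal sketch: the 
function name `count_threads` and the sample dump are made up for 
illustration, and it assumes the usual JDK 8 jstack layout (a quoted thread 
name on the header line, followed by a `java.lang.Thread.State:` line).

```python
import re
from collections import Counter

# Header line of a thread in a jstack-style dump, e.g.
#   "httpShardExecutor-3-thread-12" #87 prio=5 ...
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# State line that follows the header, e.g.
#   java.lang.Thread.State: WAITING (parking)
THREAD_STATE = re.compile(r'^\s*java\.lang\.Thread\.State:\s+(?P<state>.+?)\s*$')


def count_threads(dump_text, prefix="httpShardExecutor"):
    """Count threads whose name starts with `prefix`, grouped by state."""
    counts = Counter()
    current = None  # name of the thread whose state line we expect next
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current = header.group("name")
            continue
        state = THREAD_STATE.match(line)
        if state and current is not None:
            if current.startswith(prefix):
                counts[state.group("state")] += 1
            current = None
    return counts


# Hypothetical sample, NOT taken from the affected cluster.
SAMPLE_DUMP = '''\
"httpShardExecutor-3-thread-1" #101 prio=5 os_prio=0 tid=0x01 nid=0x11 runnable [0x0a]
   java.lang.Thread.State: RUNNABLE
\tat java.net.SocketInputStream.socketRead0(Native Method)

"httpShardExecutor-3-thread-2" #102 prio=5 os_prio=0 tid=0x02 nid=0x12 waiting on condition [0x0b]
   java.lang.Thread.State: WAITING (parking)
\tat sun.misc.Unsafe.park(Native Method)

"httpShardExecutor-3-thread-3" #103 prio=5 os_prio=0 tid=0x03 nid=0x13 waiting on condition [0x0c]
   java.lang.Thread.State: WAITING (parking)
\tat sun.misc.Unsafe.park(Native Method)

"qtp1-20" #20 prio=5 os_prio=0 tid=0x04 nid=0x14 runnable [0x0d]
   java.lang.Thread.State: RUNNABLE
'''

if __name__ == "__main__":
    for state, n in count_threads(SAMPLE_DUMP).most_common():
        print(state, n)
```

Running this against dumps from a hung node and a healthy node should make 
the difference in running vs. parked "httpShardExecutor" threads easy to 
compare.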

Any help would be appreciated.
Stephan
