Hello all,

We experienced two major problems in two days in one of our data centers. Here is our setup: 15 nodes, 3 shards, one replica per node, around 50 GB of index per shard. We are running Solr 4.10.4 on Ubuntu servers using JDK 1.8.0u51. We have an ensemble of 5 ZooKeeper nodes to coordinate the cluster.
We usually have an update rate of around 500 updates/s coming from SolrJ clients. Suddenly, for an unknown reason, one of the shard leaders was unable to connect to any of its slaves and initiated a recovery on all of them. At that point we could not run any queries against the cluster. Of our 15 nodes, a few were responding, but most were not answering at all (across all shards).

CPU usage on the unresponsive nodes was low, so I used VisualVM to see what was going on. The hung nodes were each using around 600 threads, most of them "httpShardExecutor" threads: around 100 running and many more parked. We restarted one of these nodes, and as soon as it came up it created the same 600 threads again.

We finally recovered the cluster by stopping all incoming traffic and restarting the master node of the affected shard; everything was back within a few minutes.

I was wondering whether we hit SOLR-7109<https://issues.apache.org/jira/browse/SOLR-7109>, but I'm not sure. Any help would be appreciated.

Stephan