On 6/3/2015 6:39 PM, Joseph Obernberger wrote:
> Hi All - I've run into a problem where every once in a while one or more
> of the shards (27 shard cluster) will lose connection to zookeeper and
> report "updates are disabled". In addition to the CLUSTERSTATUS
> timeout errors, which don't seem to cause any issue, this one certainly
> does as that shard no longer takes any (you guessed it!) updates!
> We are using Zookeeper with 7 nodes (7 servers in our quorum).
> The stack trace is:

Other messages you have sent talk about Solr 5.x, and one of them mentions a 16-node cluster with a 2.9 terabyte index, with the index data stored on HDFS. I'm going to venture a guess that you don't have anywhere near enough RAM for proper disk caching, leading to general performance issues, which ultimately cause timeouts.

With HDFS, I'm not sure whether the OS disk cache on the Solr server matters very much, or whether that cache needs to be on the HDFS servers. I would guess the latter. Also, if your storage networking is gigabit or slower, HDFS may have significantly more latency than local storage. For good network storage speed, you want 10Gb Ethernet or InfiniBand.

If it's Solr 5.x and you are using the included startup scripts, then long GC pauses are probably not a major issue. The startup scripts include significant GC tuning. If you have deployed in your own container, a lack of GC tuning might be the problem. GC tuning is definitely required.

Here is where I have written down everything I've learned about Solr performance problems, most of which are due to one problem or another with memory:

https://wiki.apache.org/solr/SolrPerformanceProblems

Is your ZooKeeper database on local storage or HDFS? I would suggest keeping that on local storage for optimal performance.

Thanks,
Shawn
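
As a rough illustration of the disk-cache guess above, the arithmetic can be sketched in a few lines of Python. The 2.9 TB and 16-node figures come from the thread; everything else is an assumption made for the example: the index is spread evenly across nodes, 1 TB is counted as 1024 GB, replicas and other processes are ignored, and the 64 GB RAM / 16 GB heap values are hypothetical placeholders, not the poster's actual hardware. With HDFS the relevant cache may sit on the HDFS servers instead, so treat the result as an order-of-magnitude check only.

```python
def per_node_index_gb(total_index_tb: float, nodes: int) -> float:
    """Index size each node holds, assuming an even spread and 1 TB = 1024 GB."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return total_index_tb * 1024 / nodes


def cache_fraction(ram_gb: float, heap_gb: float, index_gb: float) -> float:
    """Fraction of the local index that fits in RAM left over after the JVM heap."""
    if index_gb <= 0:
        raise ValueError("index_gb must be positive")
    free_for_cache = max(0.0, ram_gb - heap_gb)  # memory the OS could use for caching
    return min(1.0, free_for_cache / index_gb)


if __name__ == "__main__":
    index_gb = per_node_index_gb(2.9, 16)  # figures mentioned in the thread
    # HYPOTHETICAL hardware: 64 GB RAM per node, 16 GB Solr heap.
    fraction = cache_fraction(ram_gb=64, heap_gb=16, index_gb=index_gb)
    print(f"{index_gb:.1f} GB of index per node, {fraction:.0%} cacheable")
```

Under those placeholder numbers each node holds roughly 186 GB of index and only about a quarter of it could be cached, which is the kind of shortfall the reply suspects.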