Folks,

We use Consul for service discovery, health checking, and load balancing. We 
have a 3-node SolrCloud 6.3.6 cluster with about a dozen single-shard 
collections; most of those are well under 10GB, but one is 120GB. We have a 
separate 3-node ZooKeeper ensemble.

We've been having some stability issues with OOM errors and occasional random 
recovery events. We're working through some options to resolve at least one of 
the root causes (out-of-control queries), but this event popped up today in 
the log at the same time that 4 health checks failed (we use the equivalent of 
a curl request to /admin/ping to determine availability of each collection on 
each node).
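
For concreteness, each check is roughly equivalent to the sketch below. The 
function names and the 5-second limit are placeholders for illustration, not 
our actual Consul configuration.

```shell
#!/bin/sh
# Build the ping URL for one collection on one node.
# $1 = host:port of the Solr node, $2 = collection name (both placeholders).
ping_url() {
    printf 'http://%s/solr/%s/admin/ping' "$1" "$2"
}

# Succeed (exit status 0) only if the ping handler answers HTTP 200
# within 5 seconds; any other status, or no answer at all, counts as a failure.
check_collection() {
    status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
        "$(ping_url "$1" "$2")")
    [ "$status" = "200" ]
}
```

Something like `check_collection solr1:8983 mycollection` would then be run 
per collection per node, with a non-zero exit status marking that collection 
as unavailable.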

5/12/2021, 3:18:35 PM
WARN false
ConnectionManager
zkClient has disconnected

In researching `zkClient has disconnected`, it seems that the recommendation is 
to simply increase the timeout. We had it set to the default of 15s; after 
updating our `solr.in.sh` file and restarting Solr on each of the instances, 
it's now at 30s.
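
For reference, the change is along these lines. I'm assuming the 
`ZK_CLIENT_TIMEOUT` variable from the stock `solr.in.sh`, with the value given 
in milliseconds; treat it as a sketch if your file differs.

```shell
# solr.in.sh
# ZooKeeper client timeout in milliseconds (was 15000, the default).
ZK_CLIENT_TIMEOUT="30000"
```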

Can anyone provide some insight on why Solr would time out trying to reach 
ZooKeeper? And why is 15s too short? And could this cause the health check to 
fail?

If there's some specific documentation on how Solr interacts with ZooKeeper, I'm 
happy to read it; I just can't seem to find it.

Many thanks,

--jamie
