Hi

Thanks for reporting.
As it may take some time before we get ZK 3.5.x out there, it would be nice to 
have a fix in the meantime.
Do you plan to make our zkClient somehow explicitly validate that all given zk 
nodes are “good”?
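
Something like this is what I have in mind (just a sketch with example 
hostnames, and assuming the ZK four-letter words are enabled):

  # ask each configured ZK host for its mode via the 'srvr' four-letter word;
  # a healthy ensemble member reports "leader" or "follower", never "standalone"
  for host in zoo1 zoo2 zoo3; do
    echo -n "$host: "
    echo srvr | nc $host 2181 | grep Mode
  done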

Or is there some way we could fix this with documentation?
I imagine that if we always recommend using a chroot, e.g. 
ZK_HOST=zoo1,zoo2,zoo3/solr, then doing a mkroot would be a requirement before 
ZK can be used at all. And I assume that in that case, if one of the ZK nodes 
got restarted with a missing or wrong configuration, it would start up with 
some other data folder(?) and refuse to serve any data whatsoever, since the 
/solr root would not exist?
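
If I remember correctly, recent versions have a mkroot command in the bin/solr 
zk CLI for exactly this, something along the lines of (hosts and chroot taken 
from the example above):

  bin/solr zk mkroot /solr -z zoo1:2181,zoo2:2181,zoo3:2181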

I’d say, even if this is not a Solr bug per se, it is still worthy of a JIRA 
issue.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 14 Mar 2017, at 00:11, Ben DeMott <ben.dem...@gmail.com> wrote:
> 
> So wanted to throw this out there, and get any feedback.
> 
> We had a persistent issue with our Solr clusters doing crazy things, from 
> running out of file-descriptors, to having replication issues, to filling up 
> the /overseer/queue .... Just some of the log Exceptions:
> 
> o.e.j.s.ServerConnector java.io.IOException: Too many open files
> 
> o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error trying 
> to proxy request for url: 
> http://10.50.64.4:8983/solr/efc-jobsearch-col/select
> 
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ClusterState 
> says we are the leader 
> (http://10.50.64.4:8983/solr/efc-jobsearch-col_shard1_replica2), but locally 
> we don't think so. Request came from null
> 
> o.a.s.c.Overseer Bad version writing to ZK using compare-and-set, will force 
> refresh cluster state: KeeperErrorCode = BadVersion for 
> /collections/efc-jobsearch-col/state.json
> 
> IndexFetcher File _5oz.nvd did not match. expected checksum is 3661731988 and 
> actual is checksum 840593658. expected length is 271091 and actual length is 
> 271091
> 
> 
> ...
> 
> I'll get to the point quickly.  This was all caused by the Zookeeper 
> configuration on a particular node getting reset for a period of seconds, 
> and the service being restarted automatically.  When this happened, Solr's 
> connection to Zookeeper was reset, and Solr would reconnect to that 
> Zookeeper node, which now had a blank configuration and was running in 
> "STANDALONE" mode.  The changes Solr registered in ZK over that connection 
> were never propagated to the rest of the ensemble.
> 
> As a result the cversion of /live_nodes would be ahead of the other servers 
> by a version or two, but the zxids would all be in sync.  The nodes would 
> never re-synchronize; as far as Zookeeper is concerned everything is synced 
> up properly.  Also /live_nodes would be a mismatched mess: empty or 
> inconsistent depending on which ZK node Solr's connections were pointed at, 
> so client connections would get back some, wrong, or no "live nodes".
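> 
> For anyone trying to spot this, comparing the reported server mode and the 
> stat of /live_nodes across the ensemble members shows it, roughly like this 
> (hostnames are examples):
> 
>   echo srvr | nc zoo1 2181 | grep -E 'Zxid|Mode'   # standalone vs leader/follower
>   zkCli.sh -server zoo1:2181 stat /live_nodes      # compare cversion between servers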
> 
> Now, the Zookeeper documentation specifically tells you never to connect to 
> an inconsistent group of servers, as it will play havoc with Zookeeper, and 
> it did exactly that.
> 
> As of Zookeeper 3.5 there is an option to NEVER allow a node to run in 
> standalone mode, which we will be using once a stable version is released.
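> 
> (For reference, I believe that is the standaloneEnabled flag in zoo.cfg, set 
> on every ensemble member, e.g.:
> 
>   # excerpt of zoo.cfg on each member (ZK 3.5.x); server names are examples
>   standaloneEnabled=false
>   server.1=zoo1:2888:3888
>   server.2=zoo2:2888:3888
>   server.3=zoo3:2888:3888
> )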
> 
> It caused absolute havoc within our cluster.
> 
> So to summarize: if a Zookeeper ensemble host ever goes into "standalone", 
> even temporarily, Solr will be disconnected and may then reconnect 
> (depending on which ZK node it picks), and its updates will never be 
> synchronized. It also won't be able to coordinate any of its Cloud operations.
> 
> So in the interest of being a good internet citizen I'm writing this up: is 
> there any desire for a patch that would provide a configuration or JVM option 
> to refuse to connect to ZK nodes running in standalone mode?   Obviously the 
> built-in ZK server that comes with Solr runs in standalone mode, so this 
> would only be an option for solr.in.sh.... But it would prevent Solr from 
> bringing the entire cluster down in the event a single ZK server was 
> temporarily misconfigured or lost its configuration for some reason.
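> 
> Something along the lines of a property set from solr.in.sh (the name below 
> is completely made up, just to illustrate the idea):
> 
>   # hypothetical property, does not exist in Solr today
>   SOLR_OPTS="$SOLR_OPTS -Dsolr.zk.refuseStandaloneNodes=true"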
> 
> Maybe this isn't worth addressing.  Thoughts?
> 
