Cluster with no overseer?

2019-05-21 Thread Walter Underwood
We have a 6.6.2 cluster in prod that appears to have no overseer. In /overseer_elect on ZK, there is an election folder, but no leader document. An OVERSEERSTATUS request fails with a timeout. I’m going to try ADDROLE, but I’d be delighted to hear any other ideas. We’ve diverted all the traffic
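A minimal sketch of the two checks described in this message, assuming a Solr node at a hypothetical base URL and assuming the children of /overseer_elect have already been listed with whatever ZooKeeper client is at hand. The helper names are illustrative only and are not part of Solr.

```python
from urllib.parse import urlencode


def overseer_status_url(base_url):
    """Build the Collections API URL for an OVERSEERSTATUS request."""
    query = urlencode({"action": "OVERSEERSTATUS", "wt": "json"})
    return f"{base_url.rstrip('/')}/admin/collections?{query}"


def has_overseer_leader(overseer_elect_children):
    """Return True if /overseer_elect has a 'leader' child.

    The argument is the list of child names under /overseer_elect,
    obtained separately (assumption: no ZooKeeper client is used here).
    """
    return "leader" in overseer_elect_children


# Hypothetical host; substitute a real Solr node.
print(overseer_status_url("http://localhost:8983/solr"))
# The state reported in this thread: an election folder but no leader.
print(has_overseer_leader(["election"]))
```

A request to the generated URL should still be made with a client-side timeout, since the report here is that the call hangs rather than returning an error.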

Re: Cluster with no overseer?

2019-05-21 Thread Walter Underwood
ADDROLE times out after 180 seconds. This seems to be an unrecoverable state for the cluster, so that is a pretty serious bug. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog)

Re: Cluster with no overseer?

2019-05-21 Thread Will Martin
+1 Will Martin DEVOPS ENGINEER 540.454.9565 8609 WESTWOOD CENTER DR, SUITE 475 VIENNA, VA 22182 geturgently.com

Re: Cluster with no overseer?

2019-05-21 Thread Will Martin
Walter. Can I cross-post to zk-dev? Will Martin

Re: Cluster with no overseer?

2019-05-21 Thread Walter Underwood
Yes, please. I have the logs from each of the Zookeepers. We are running 3.4.12. wunder

Re: Cluster with no overseer?

2019-05-21 Thread Will Martin
Worked with Fusion and Zookeeper at GSA for 18 months: admin role. Before blowing it away, you could try:
- id a candidate node, with a snapshot you just might think is old enough to be robust.
- clean data for zk nodes otherwise.
- bring up the chosen node and wait for it to settle [wish i could

Re: Cluster with no overseer?

2019-05-22 Thread Erick Erickson
Walter: I have no idea what the root cause is here, this really shouldn’t happen. But the Overseer role (and I’m assuming you’re talking Solr’s Overseer) is assigned similarly to a shard leader, the same election process happens. All the election nodes are ephemeral ZK nodes. Solr’s Overseer i
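The election described above can be sketched roughly as follows. This assumes the election entries are ZooKeeper ephemeral sequential nodes whose names end in `-n_` followed by a zero-padded sequence number, and that the entry with the lowest sequence number is the one expected to become leader. The name format and the sample names are assumptions for illustration, not taken from this cluster.

```python
import re

# Assumed suffix of an election node name: "-n_" plus digits.
_SEQ = re.compile(r"-n_(\d+)$")


def sequence_of(election_node):
    """Extract the numeric sequence from an election node name."""
    match = _SEQ.search(election_node)
    if match is None:
        raise ValueError(f"no sequence suffix in {election_node!r}")
    return int(match.group(1))


def expected_leader(election_nodes):
    """Return the node with the lowest sequence, or None if empty."""
    if not election_nodes:
        return None
    return min(election_nodes, key=sequence_of)


# Hypothetical child names under the election folder.
nodes = [
    "72057594038-host-b:8983_solr-n_0000000014",
    "72057594038-host-a:8983_solr-n_0000000012",
    "72057594038-host-c:8983_solr-n_0000000013",
]
print(expected_leader(nodes))
```

If the election folder has entries but no leader is ever recorded, comparing the expected winner against the live nodes may help narrow down which Solr instance is failing to take over.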

Re: Cluster with no overseer?

2019-05-22 Thread Walter Underwood
Thanks, we’ll try that. Bouncing one Solr node doesn’t fix it, because we did a rolling restart yesterday. wunder

Re: Cluster with no overseer?

2019-05-22 Thread Erick Erickson
Good luck, this kind of assumes that your ZK ensemble is healthy of course...

Re: Cluster with no overseer?

2019-05-22 Thread Walter Underwood
The ZK ensemble appears to be OK. It is the Solr-related stuff that is borked. There are 110 items in /overseer/collection-queue-work/, which doesn’t seem healthy. If it is really hosed, I’ll shut down all the nodes, clean out the files in Zookeeper and start over. wunder

Re: Cluster with no overseer?

2019-05-22 Thread Erick Erickson
110 isn’t all that many, well within the normal range _assuming_ that they are being processed…. When you restart Solr, every state change operation writes an operation to the work queue which can mount up. Perhaps you’re hitting: https://issues.apache.org/jira/browse/SOLR-13416? In which case
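One way to test the "assuming that they are being processed" condition is to list the queue's children twice, some time apart, and compare the two listings. The sketch below assumes the child names have already been fetched by some ZooKeeper client; the `qn-` names are hypothetical examples, and the helper names are not part of Solr.

```python
def queue_progress(before, after):
    """Compare two listings of a work queue's child names."""
    before_set, after_set = set(before), set(after)
    return {
        # Present earlier, gone now: something consumed them.
        "consumed": sorted(before_set - after_set),
        # New since the first listing.
        "added": sorted(after_set - before_set),
        # Present in both listings: possibly stuck, possibly just waiting.
        "still_queued": sorted(before_set & after_set),
    }


def is_being_processed(before, after):
    """True if at least one earlier item has left the queue.

    An empty first listing yields False, since nothing could be consumed.
    """
    return bool(set(before) - set(after))


# Hypothetical listings taken a minute apart.
first = ["qn-0000000001", "qn-0000000002"]
second = ["qn-0000000002", "qn-0000000003"]
print(queue_progress(first, second))
print(is_being_processed(first, second))
```

A queue whose count stays at 110 with identical item names across several samples would point to nothing consuming it, which fits a cluster with no active Overseer.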