I worked with Fusion and Zookeeper at GSA for 18 months in an admin role. Before blowing it away, you could try:
- Identify a candidate node with a snapshot you think is old enough to be robust.
- Clean the data for the other zk nodes.
- Bring up the chosen node and wait for it to settle [wish I could remember why I called what I saw that].
- Bring up the other nodes one at a time. Let each one fully sync as a follower of the new leader.
- They should each in turn request the snapshot from the leader.

Then you have to align your collections with the ensemble. For the life of me I can't remember there being anything particularly tricky about that with Fusion, which means I can't remember what I did... or have it doc'd at home. ;-)

Will Martin
DEVOPS ENGINEER
540.454.9565

8609 WESTWOOD CENTER DR, SUITE 475
VIENNA, VA 22182
geturgently.com

On Tue, May 21, 2019 at 11:40 PM Walter Underwood <wun...@wunderwood.org> wrote:

> Yes, please. I have the logs from each of the Zookeepers.
>
> We are running 3.4.12.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> > On May 21, 2019, at 6:49 PM, Will Martin <wmar...@urgent.ly> wrote:
> >
> > Walter. Can I cross-post to zk-dev?
> >
> > Will Martin
> >
> >> On May 21, 2019, at 9:26 PM, Will Martin <wmar...@urgent.ly> wrote:
> >>
> >> +1
> >>
> >> Will Martin
> >>
> >> On Tue, May 21, 2019 at 7:39 PM Walter Underwood <wun...@wunderwood.org> wrote:
> >> ADDROLE times out after 180 seconds. This seems to be an unrecoverable state for the cluster, so that is a pretty serious bug.
> >>
> >> wunder
> >> Walter Underwood
> >>
> >> > On May 21, 2019, at 4:10 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> >> >
> >> > We have a 6.6.2 cluster in prod that appears to have no overseer. In /overseer_elect on ZK, there is an election folder, but no leader document. An OVERSEERSTATUS request fails with a timeout.
> >> >
> >> > I’m going to try ADDROLE, but I’d be delighted to hear any other ideas. We’ve diverted all the traffic to the backing cluster, so we can blow this one away and rebuild.
> >> >
> >> > Looking at the Zookeeper logs, I see a few instances of network failures across all three nodes.
> >> >
> >> > wunder
> >> > Walter Underwood
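
For the "bring up one node at a time and let it sync" step in the reply at the top of this thread, here is a rough sketch of how one might check each node's role before starting the next. It is an illustration only, not what was actually run at the time. It assumes the `srvr` four-letter command is enabled on the ensemble and that the client port is 2181. The function names and the `zk1`/`zk2` hostnames are made up for the example. A reply with no `Mode:` line (for instance, a node that is not yet serving requests) is treated as "not ready".

```python
import socket
import time
from typing import Callable, Optional, Set


def parse_mode(srvr_output: str) -> Optional[str]:
    """Return the value of the 'Mode:' line in `srvr` output, or None if absent."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def query_srvr(host: str, port: int = 2181, timeout: float = 5.0) -> str:
    """Send the `srvr` four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def wait_for_mode(
    host: str,
    wanted: Set[str],
    port: int = 2181,
    attempts: int = 30,
    delay: float = 2.0,
    query: Callable[[str, int], str] = query_srvr,
) -> Optional[str]:
    """Poll a node until it reports one of the wanted modes.

    Returns the mode that matched, or None if every attempt failed.
    """
    for _ in range(attempts):
        try:
            mode = parse_mode(query(host, port))
        except OSError:  # connection refused, timeout, etc.
            mode = None
        if mode in wanted:
            return mode
        time.sleep(delay)
    return None
```

Usage would be something like calling `wait_for_mode("zk2", {"follower", "leader"})` after starting each node, and only moving on to the next node once it returns a mode instead of None.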