Is there any particular error or stack trace in the logs? If ZooKeeper is killing itself, it should log that; otherwise some other external system is responsible. I am sorry, I don't know Exhibitor.
Hope that helps
Enrico

On Wed, Oct 2, 2019 at 21:40, Jerry Hebert <[email protected]> wrote:

> Hi Jörn,
>
> No, this was a very intermittent issue. We've been running this ensemble
> for about four years now and have never seen this problem, so it seems to
> be super heisenbuggy. Our upgrade process will be more involved than what
> you described (we're switching networks, instance types, underlying
> automation and removing Exhibitor) but I'm glad you asked because I have a
> question about that too. :)
>
> Are you saying that a 3.5.5 node can synchronize with a 3.4.11 ensemble? I
> wasn't sure if that would work or not. E.g., maybe I could bring up the new
> 3.5.5 ensemble and temporarily form a 10-node ensemble (five 3.4.11 nodes,
> five 3.5.5 nodes), let them sync and then kill off the old 3.4.11 boxes?
>
> Thanks,
> Jerry
>
> On Wed, Oct 2, 2019 at 12:29 PM Jörn Franke <[email protected]> wrote:
>
> > Have you tried to stop the node, delete the data and log directory,
> > upgrade to 3.5.5, start the node and wait until it is synchronized?
> >
> > > On 02.10.2019 at 20:14, Jerry Hebert <[email protected]> wrote:
> > >
> > > Hi all,
> > >
> > > My first post here! I'm hoping you all might be able to offer some
> > > guidance or redirect me to an existing ticket. We have a five-node
> > > ensemble on 3.4.11 that we're currently in the process of upgrading to
> > > 3.5.5. We recently saw some bizarre behavior in our ensemble that I was
> > > hoping to find some sort of pre-existing ticket or discussion about,
> > > but I was having difficulty finding hits for this in Jira.
> > >
> > > The behavior that we saw from our metrics is that one of our nodes (not
> > > sure if it was a follower or a leader) started to demonstrate
> > > instability (high CPU, high RAM) and it crashed. Not a big deal, but as
> > > soon as it crashed, the other four nodes all immediately restarted,
> > > resulting in a short outage. One node crashing should never cause an
> > > ensemble restart of course, so I assumed that this must be a bug in ZK.
> > > The nodes that restarted had no indication of errors in their logs,
> > > they just simply restarted. Does this sound familiar to any of you?
> > >
> > > Also, we are using Exhibitor on that ensemble so it's also possible
> > > that the restart was caused by Exhibitor.
> > >
> > > My hope is that this issue will be behind us once the 3.5.5 upgrade is
> > > complete but I'd ideally like to find some concrete evidence of this.
> > >
> > > Thanks!
> > > Jerry
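
For the "start the node and wait until it is synchronized" step in the quoted thread, and for finding out whether a given node is a leader or a follower, one option is ZooKeeper's `srvr` four-letter command on the client port. The following is only a minimal sketch under stated assumptions: the helper names (`four_letter_word`, `parse_srvr`, `is_serving`) are made up for illustration, the host and port are whatever your nodes use, and `srvr` has to be permitted by `4lw.commands.whitelist` on 3.5.x. It treats a node reporting `Mode: leader` or `Mode: follower` as having joined the quorum; it says nothing about whether a 3.5.5 node can join a 3.4.11 ensemble.

```python
import socket


def four_letter_word(host, port, cmd="srvr", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_srvr(text):
    """Turn the 'Key: value' lines of a srvr reply into a dict.

    Lines without a colon (for example the 'not currently serving
    requests' message) are skipped, so the result is then empty.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def is_serving(info):
    """True if the node reports itself as a quorum leader or follower."""
    return info.get("Mode") in ("leader", "follower")
```

Calling `is_serving(parse_srvr(four_letter_word("zk1.example.com", 2181)))` in a loop with a short sleep (the hostname is a placeholder) would give a rough "wait until it has joined" check after restarting a node. Logging the `Mode` value for each node at regular intervals would also record which node was the leader if the crash happens again.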
