Re: One node crashing in 3.4.11 triggered a full ensemble restart

Jerry Hebert Wed, 02 Oct 2019 13:52:33 -0700

Hi Enrico,

The nodes that restarted did not have any errors in their logs; they seemed
to simply restart successfully, so I think your hunch about the external
system is probably correct.


Could you comment on my second question above regarding cross-version
migration or should I make a new thread?

> Are you saying that a 3.5.5 node can synchronize with a 3.4.11 ensemble? I
> wasn't sure if that would work or not. e.g., maybe I could bring up the new
> 3.5.5 ensemble and temporarily form a 10-node ensemble (five 3.4.11 nodes,
> five 3.5.5 nodes), let them sync and then kill off the old 3.4.11 boxes?


Thanks!
Jerry

On Wed, Oct 2, 2019 at 1:12 PM Enrico Olivelli <[email protected]> wrote:

> Any particular error/stacktrace in the logs?
> If it is ZooKeeper that is killing itself, it should log it; otherwise it is
> some other external system. I am sorry, I don't know Exhibitor.
>
> Hope that helps
> Enrico
>
> On Wed, Oct 2, 2019, 21:40 Jerry Hebert <[email protected]> wrote:
>
> > Hi Jörn,
> >
> > No, this was a very intermittent issue. We've been running this ensemble
> > for about four years now and have never seen this problem so it seems to be
> > super heisenbuggy. Our upgrade process will be more involved than what you
> > described (we're switching networks, instance types, underlying automation
> > and removing Exhibitor) but I'm glad you asked because I have a question
> > about that too. :)
> >
> > Are you saying that a 3.5.5 node can synchronize with a 3.4.11 ensemble? I
> > wasn't sure if that would work or not. e.g., maybe I could bring up the new
> > 3.5.5 ensemble and temporarily form a 10-node ensemble (five 3.4.11 nodes,
> > five 3.5.5 nodes), let them sync and then kill off the old 3.4.11 boxes?
> >
> > Thanks,
> > Jerry
> >
> > On Wed, Oct 2, 2019 at 12:29 PM Jörn Franke <[email protected]> wrote:
> >
> > > Have you tried to stop the node, delete the data and log directory,
> > > upgrade to 3.5.5, start the node and wait until it is synchronized?
> > >
> > > > On 02.10.2019 at 20:14, Jerry Hebert <[email protected]> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > My first post here! I'm hoping you all might be able to offer some guidance
> > > > or redirect me to an existing ticket. We have a five node ensemble on
> > > > 3.4.11 that we're currently in the process of upgrading to 3.5.5. We
> > > > recently saw some bizarre behavior in our ensemble that I was hoping to
> > > > find some sort of pre-existing ticket or discussion about but I was having
> > > > difficulty finding hits for this in Jira.
> > > >
> > > > The behavior that we saw from our metrics is that one of our nodes (not
> > > > sure if it was a follower or a leader) started to demonstrate
> > > > instability (high CPU, high RAM) and it crashed. Not a big deal, but as
> > > > soon as it crashed, all of the other four nodes immediately restarted,
> > > > resulting in a short outage. One node crashing should never cause an
> > > > ensemble restart of course, so I assumed that this must be a bug in ZK. The
> > > > nodes that restarted had no indication of errors in their logs; they just
> > > > simply restarted. Does this sound familiar to any of you?
> > > >
> > > > Also, we are using Exhibitor on that ensemble so it's also possible that
> > > > the restart was caused by Exhibitor.
> > > >
> > > > My hope is that this issue will be behind us once the 3.5.5 upgrade is
> > > > complete but I'd ideally like to find some concrete evidence of this.
> > > >
> > > > Thanks!
> > > > Jerry
> > >
> >
>
