[
https://issues.apache.org/jira/browse/ZOOKEEPER-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495988#comment-14495988
]
Mike Lundy commented on ZOOKEEPER-2167:
---------------------------------------
I can't check the full log for another 10-12 hours (unfortunately I snipped
that part out of what I posted here, though I still have the original logs),
but it's possible, even probable, that they haven't been updated yet: this
happened during a rolling restart of the cluster to permanently remove node 2
from the ensemble. (This cluster is smaller than the ones we usually run, and
resizing it is kind of sketchy because with so few nodes there's no margin for
error. As I said above, we've seen this on larger ensembles too; it's just
easier to repro the problem with the smaller cluster.)
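To make the "no margin for error" point concrete, here is a rough sketch of the
arithmetic, assuming the default majority quorum (more than half of the
configured voters). The function names are made up for illustration and are not
part of ZooKeeper.

```python
def quorum_size(configured_voters: int) -> int:
    """Smallest strict majority of the configured ensemble (assumed majority quorum)."""
    return configured_voters // 2 + 1


def failure_margin(configured_voters: int, running_voters: int) -> int:
    """How many more running voters can be lost before quorum is gone."""
    return running_voters - quorum_size(configured_voters)


if __name__ == "__main__":
    # (configured, running) pairs for the 5 -> 4 resize described in this issue.
    scenarios = [
        (5, 5),  # before the resize, everything up
        (5, 4),  # node 2 stopped, others still on the old 5-node config
        (5, 3),  # one more node down for its restart, old config
        (4, 4),  # new 4-node config, everything up
        (4, 3),  # new config, one node down for its restart
    ]
    for configured, running in scenarios:
        print(
            f"configured={configured} running={running} "
            f"quorum={quorum_size(configured)} "
            f"margin={failure_margin(configured, running)}"
        )
```

Under either the old or the new configuration the quorum is 3, so with one node
removed and another down for its restart, every remaining node has to take part
for the ensemble to make progress.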
It's hard to tell just from the description, but yeah, it's possible that this
is the same as https://issues.apache.org/jira/browse/ZOOKEEPER-2164; if my
problem is the same, that means that it's not the restart but the stop that
causes the problem... hm. It usually takes a few hours to repro the problem,
but we do currently have a pretty reliable repro; I plan to run some more
experiments tomorrow and hopefully learn something (but if anyone who
understands zk better than I do has any insights, they'd be appreciated).
> Restarting current leader node sometimes results in a permanent loss of quorum
> ------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-2167
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2167
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.4.6
> Reporter: Mike Lundy
> Attachments: fails-to-rejoin-quorum.gz
>
>
> I'm seeing an issue where a restart of the current leader node results in a
> long-term / permanent loss of quorum (I've only waited 30 minutes, but it
> doesn't look like it's making any progress). Restarting the same instance
> _again_ seems to resolve the problem.
> To me, this looks a lot like the issue described in
> https://issues.apache.org/jira/browse/ZOOKEEPER-1026, but I'm filing this
> separately for the moment in case I am wrong.
> Notes on the attached log:
> 1) If you search for XXX in the log, you'll see my annotations marking when
> the process was told to terminate and when it was reported to have finished
> terminating, and then the same for the start.
> 2) To save you the trouble of figuring it out, here's the zkid <=> ip mapping:
> zid=1, ip=10.20.0.19
> zid=2, ip=10.20.0.18
> zid=3, ip=10.20.0.20
> zid=4, ip=10.20.0.21
> zid=5, ip=10.20.0.22
> 3) It's important to note that this log was taken during a rolling
> service restart to remove an instance; in this case, zid #2 / 10.20.0.18 is
> the one being removed, so if you see a conspicuous silence from that service,
> that's why.
> 4) I've been unable to reproduce this problem _except_ during cluster size
> changes, so I suspect that may be related; it's also important to note that
> this test is going from 5 -> 4 (which means, since we remove one and then do
> a rolling restart, we are actually temporarily dropping to 3). I know this is
> not a recommended thing (this is more of a stress test). We have seen this
> same problem on larger cluster sizes; it just seems easier to reproduce on
> smaller sizes.
> 5) The log starts roughly at the point 10.20.0.21 / zid=4 wins the election
> during the final quorum; zid=4 is the one whose shutdown triggers the problem.