[ https://issues.apache.org/jira/browse/ZOOKEEPER-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497517#comment-14497517 ]

Mike Lundy commented on ZOOKEEPER-2167:
---------------------------------------

I've attached a more complete log; it goes back much further and shows the 
previous (annotated) node-add process as well (we go 3 -> 4 -> 5 -> 4, in each 
case applying the change in a rolling fashion before moving on).

This is the same log as before, just more complete and better annotated (again, 
using XXX to mark where starts and stops happened, and also showing the 
configured ensemble at the time), so everything I said previously still applies 
(the zid/IP mapping is the same, etc.).

> Restarting current leader node sometimes results in a permanent loss of quorum
> ------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2167
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2167
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.6
>            Reporter: Mike Lundy
>         Attachments: fails-to-rejoin-quorum.gz
>
>
> I'm seeing an issue where a restart of the current leader node results in a 
> long-term / permanent loss of quorum (I've only waited 30 minutes, but it 
> doesn't look like it's making any progress). Restarting the same instance 
> _again_ seems to resolve the problem.
>
> To me, this looks a lot like the issue described in 
> https://issues.apache.org/jira/browse/ZOOKEEPER-1026, but I'm filing this 
> separately for the moment in case I am wrong.
>
> Notes on the attached log:
> 1) If you search for XXX in the log, you'll see where I've annotated it to 
> mark when the process was told to terminate, when it reported having done 
> so, and then the same for the start.
> 2) To save you the trouble of figuring it out, here's the zid <=> IP mapping 
> (the zoo.cfg sketch after this list shows the server entries it corresponds to):
> zid=1, ip=10.20.0.19
> zid=2, ip=10.20.0.18
> zid=3, ip=10.20.0.20
> zid=4, ip=10.20.0.21
> zid=5, ip=10.20.0.22
> 3) It's important to note that this log was captured during a rolling 
> service restart to remove an instance; in this case, zid #2 / 10.20.0.18 is 
> the one being removed, so if you see a conspicuous silence from that service, 
> that's why.
> 4) I've been unable to reproduce this problem _except_ during cluster size 
> changes, so I suspect that may be related. It's also important to note that 
> this test goes from 5 -> 4 (which means, since we remove one node and then 
> do a rolling restart, we temporarily drop to 3 live members; see the quorum 
> arithmetic after this list). I know this is not a recommended thing (this is 
> more of a stress test). We have seen this same problem on larger cluster 
> sizes; it just seems easier to reproduce on smaller ones.
> 5) The log starts roughly at the point where 10.20.0.21 / zid=4 wins the 
> election during the final quorum; zid=4 is the one whose shutdown triggers 
> the problem.
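
For reference on note 2: 3.4.x has no dynamic reconfiguration, so the 
zid <=> IP mapping corresponds to static server entries along these lines in 
every node's zoo.cfg (the 2888/3888 quorum/election ports are the conventional 
defaults and an assumption here), with each node's own zid stored in its 
dataDir/myid file:

    server.1=10.20.0.19:2888:3888
    server.2=10.20.0.18:2888:3888
    server.3=10.20.0.20:2888:3888
    server.4=10.20.0.21:2888:3888
    server.5=10.20.0.22:2888:3888

Removing zid=2 then means deleting its server.2 line from every config and 
rolling the remaining nodes, which is exactly the window this log covers.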
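On the quorum arithmetic behind note 4: a majority quorum needs floor(n/2) + 1 
of the n configured voters (the standard ZooKeeper majority rule, nothing 
specific to this log), so the sizes this test passes through work out as:

    # Majority quorum for an ensemble of n voting members.
    def quorum(n):
        return n // 2 + 1

    for n in (5, 4, 3):
        print("ensemble=%d quorum=%d tolerates=%d" % (n, quorum(n), n - quorum(n)))
    # ensemble=5 quorum=3 tolerates=2
    # ensemble=4 quorum=3 tolerates=1
    # ensemble=3 quorum=2 tolerates=1

While only 3 members are alive, a single further failure (for example the 
restarting leader failing to rejoin) is enough to lose quorum, which matches 
the shape of the failure reported here.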



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
