[
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13713933#comment-13713933
]
Flavio Junqueira commented on ZOOKEEPER-1732:
---------------------------------------------
Here is my analysis of the logs.
Server 3 has been elected twice, both times with the support of Server 1:
{noformat}
2013-07-19 10:16:09,746 [myid:3] - DEBUG
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:30103:FastLeaderElection@493] - About to
leave FLE instance: leader=3, zxid=0xb800000099, my id=3, my state=LEADING
2013-07-19 10:16:26,667 [myid:3] - DEBUG
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:30103:FastLeaderElection@493] - About to
leave FLE instance: leader=3, zxid=0xb900000052, my id=3, my state=LEADING
{noformat}
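As a reading aid for the two zxids in these log lines: a ZooKeeper zxid packs the epoch into its high 32 bits and a per-epoch counter into its low 32 bits. The sketch below splits the two logged values so that it is easier to see that they belong to different epochs. The class and method names are made up for illustration and are not ZooKeeper APIs.

```java
// Minimal sketch: split a zxid into (epoch, counter).
// ZxidEpochs, epochOf and counterOf are hypothetical helper names,
// not part of the ZooKeeper code base.
final class ZxidEpochs {

    // High 32 bits of the zxid: the epoch.
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    // Low 32 bits of the zxid: the counter within that epoch.
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        // The two zxids logged by Server 3 when leaving FLE.
        long first = 0xb800000099L;
        long second = 0xb900000052L;

        System.out.printf("first:  epoch=0x%x counter=0x%x%n",
                epochOf(first), counterOf(first));
        System.out.printf("second: epoch=0x%x counter=0x%x%n",
                epochOf(second), counterOf(second));
    }
}
```

Under this reading, the first election carries epoch 0xb8 and the second carries epoch 0xb9, so a vote that still reflects the first value is one epoch behind.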
Server 2 elects Server 3 but loses the connection to Server 3 right after:
{noformat}
2013-07-19 10:16:20,858 [myid:2] - INFO
[QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:30102:Follower@63] - FOLLOWING - LEADER
ELECTION TOOK - 47
2013-07-19 10:16:20,858 [myid:2] - WARN
[RecvWorker:3:QuorumCnxManager$RecvWorker@762] - Connection broken for id 3, my
id = 2, error =
{noformat}
After that, Server 2 doesn't seem to go into a new round of leader election.
Because it is not trying to elect a new leader, its vote reflects the state of
the first leader instance of Server 3.
Now, Server 3 later on loses its connection to Server 1:
{noformat}
2013-07-19 10:16:34,307 [myid:3] - WARN
[RecvWorker:1:QuorumCnxManager$RecvWorker@762] - Connection broken for id 1, my
id = 3, error =
{noformat}
but it doesn't seem to care, so it must have the support of Server 2. Server 2
again seems to be referring to a previous leader instance of Server 3, so its
support for Server 3 must be surviving the crash of Server 3 around "2013-07-19
10:16:20,858 [myid:2]". My guess is that Server 2 is getting confused about
dropping the connection right after electing Server 3 and is trying to
establish a new connection, which succeeds when Server 3 comes back up. I think
there is a race there.
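To make the ordering easier to follow, the log excerpts quoted above can be merged into one timeline sorted by timestamp. This is only a sketch: the events are hand-copied from the excerpts, the short descriptions are my paraphrases, and the class and method names are illustrative.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal sketch: merge the quoted log events into one ordered timeline.
// ElectionTimeline and Event are hypothetical names used for illustration.
final class ElectionTimeline {

    static final class Event {
        final LocalDateTime time;
        final int myid;
        final String what;

        Event(LocalDateTime time, int myid, String what) {
            this.time = time;
            this.myid = myid;
            this.what = what;
        }
    }

    // Timestamp layout used in the log lines, e.g. 2013-07-19 10:16:09,746
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    static Event event(String timestamp, int myid, String what) {
        return new Event(LocalDateTime.parse(timestamp, FMT), myid, what);
    }

    static List<Event> timeline() {
        List<Event> events = new ArrayList<>();
        // Events copied from the excerpts, grouped per server.
        events.add(event("2013-07-19 10:16:09,746", 3, "leaves FLE as LEADING, zxid=0xb800000099"));
        events.add(event("2013-07-19 10:16:26,667", 3, "leaves FLE as LEADING, zxid=0xb900000052"));
        events.add(event("2013-07-19 10:16:34,307", 3, "connection broken for id 1"));
        events.add(event("2013-07-19 10:16:20,858", 2, "FOLLOWING"));
        events.add(event("2013-07-19 10:16:20,858", 2, "connection broken for id 3"));
        // List.sort is stable, so events with equal timestamps keep their order.
        events.sort(Comparator.comparing((Event e) -> e.time));
        return events;
    }

    public static void main(String[] args) {
        for (Event e : timeline()) {
            System.out.println(e.time.format(FMT) + " [myid:" + e.myid + "] " + e.what);
        }
    }
}
```

Sorted this way, Server 2's FOLLOWING line and its broken connection to Server 3 both fall between the two elections of Server 3.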
> ZooKeeper server unable to join established ensemble
> ----------------------------------------------------
>
> Key: ZOOKEEPER-1732
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection
> Affects Versions: 3.4.5
> Environment: Windows 7, Java 1.7
> Reporter: Germán Blanco
> Priority: Blocker
> Fix For: 3.5.0, 3.4.6
>
> Attachments: zklog.tar.gz
>
>
> I have a test in which I do a rolling restart of three ZooKeeper servers and
> it was failing from time to time.
> I ran the tests in a loop until the failure came out and it seems that at
> some point one of the servers is unable to join the ensemble formed by the
> other two.