[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13713933#comment-13713933
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
---------------------------------------------

Here is my analysis of the logs.

Server 3 has been elected twice, both times with the support of Server 1:

{noformat}
2013-07-19 10:16:09,746 [myid:3] - DEBUG [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:30103:FastLeaderElection@493] - About to leave FLE instance: leader=3, zxid=0xb800000099, my id=3, my state=LEADING

2013-07-19 10:16:26,667 [myid:3] - DEBUG [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:30103:FastLeaderElection@493] - About to leave FLE instance: leader=3, zxid=0xb900000052, my id=3, my state=LEADING
{noformat}
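
To make the two zxids above easier to compare, here is a minimal sketch that splits a zxid into its two halves. It relies on the ZooKeeper convention that the high 32 bits of a zxid hold the epoch and the low 32 bits hold a counter; the function name `split_zxid` is only illustrative and is not part of ZooKeeper.

```python
from typing import Tuple


def split_zxid(zxid: int) -> Tuple[int, int]:
    """Split a 64-bit zxid into (epoch, counter).

    High 32 bits: epoch. Low 32 bits: counter within that epoch.
    """
    epoch = zxid >> 32
    counter = zxid & 0xFFFFFFFF
    return epoch, counter


# The two zxids reported by Server 3 when leaving FLE as LEADING.
for raw in ("0xb800000099", "0xb900000052"):
    epoch, counter = split_zxid(int(raw, 16))
    print(f"{raw}: epoch={epoch:#x} counter={counter:#x}")
```

The epoch part moves from 0xb8 to 0xb9 between the two lines, which is consistent with these being two distinct leader instances of Server 3 rather than a repeated report of the same one.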

Server 2 elects Server 3 but loses the connection to Server 3 right after:

{noformat}
2013-07-19 10:16:20,858 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:30102:Follower@63] - FOLLOWING - LEADER ELECTION TOOK - 47
2013-07-19 10:16:20,858 [myid:2] - WARN  [RecvWorker:3:QuorumCnxManager$RecvWorker@762] - Connection broken for id 3, my id = 2, error = 
{noformat}

After that, Server 2 doesn't seem to go into a new round of leader election. 
Because it is not trying to elect a new leader, its vote still reflects the 
state of the first leader instance of Server 3.
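
The ordering behind this claim can be checked by merging the excerpts from both servers into a single timeline. The sketch below does only that: the event list is copied by hand from the log lines quoted so far, and the `EVENTS` and `timeline` names are illustrative.

```python
from datetime import datetime

# (timestamp, server id, event), copied from the log excerpts quoted above.
EVENTS = [
    ("2013-07-19 10:16:09,746", 3, "leaves FLE as LEADING, zxid=0xb800000099"),
    ("2013-07-19 10:16:26,667", 3, "leaves FLE as LEADING, zxid=0xb900000052"),
    ("2013-07-19 10:16:20,858", 2, "FOLLOWING"),
    ("2013-07-19 10:16:20,858", 2, "connection broken for id 3"),
]


def timeline(events):
    """Return events sorted by timestamp; ties keep their input order."""
    def parse(ts):
        return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")
    return sorted(events, key=lambda e: parse(e[0]))


for ts, sid, what in timeline(EVENTS):
    print(f"{ts} [myid:{sid}] {what}")
```

Sorted this way, both Server 2 entries fall between the two LEADING entries of Server 3, so whatever vote Server 2 holds at that point was formed before Server 3's second leader instance.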

Later on, Server 3 loses its connection to Server 1:

{noformat}
2013-07-19 10:16:34,307 [myid:3] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@762] - Connection broken for id 1, my id = 3, error = 
{noformat}

but it doesn't seem to care, so it must still have the support of Server 2. 
Server 2 again seems to be referring to a previous leader instance of Server 3, 
so its support for Server 3 must be surviving the crash of Server 3 around 
"2013-07-19 10:16:20,858 [myid:2]". My guess is that Server 2 gets confused when 
the connection drops right after it elects Server 3: it keeps trying to 
establish a new connection, which succeeds when Server 3 comes back up. I think 
there is a race there.
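
The inference that Server 3 must have the support of Server 2 is just majority arithmetic for a three-server ensemble. A minimal sketch, assuming the default majority quorum (no weights or hierarchical groups); `has_quorum` is a hypothetical helper, not a ZooKeeper API:

```python
def has_quorum(supporters, ensemble_size):
    """True if the supporters form a strict majority of the ensemble.

    `supporters` includes the leader itself.
    """
    return len(set(supporters)) > ensemble_size // 2


ENSEMBLE_SIZE = 3

# Server 3 leading with only its own support after losing Server 1.
print(has_quorum({3}, ENSEMBLE_SIZE))      # False

# Server 3 leading with Server 2 still counted as a follower.
print(has_quorum({3, 2}, ENSEMBLE_SIZE))   # True
```

With Server 1 gone, Server 3 can only keep leading if Server 2 is counted as a supporter, which is why a stale vote on Server 2 would be enough to explain the behaviour.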
                
> ZooKeeper server unable to join established ensemble
> ----------------------------------------------------
>
>                 Key: ZOOKEEPER-1732
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.5
>         Environment: Windows 7, Java 1.7
>            Reporter: Germán Blanco
>            Priority: Blocker
>             Fix For: 3.5.0, 3.4.6
>
>         Attachments: zklog.tar.gz
>
>
> I have a test in which I do a rolling restart of three ZooKeeper servers and 
> it was failing from time to time.
> I ran the tests in a loop until the failure occurred, and it seems that at 
> some point one of the servers is unable to join the ensemble formed by the 
> other two.

