[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

Flavio Junqueira (JIRA) Thu, 25 Jul 2013 01:58:57 -0700

    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719402#comment-13719402
 ]


Flavio Junqueira commented on ZOOKEEPER-1732:
---------------------------------------------

By "agree to vote", don't you need a different message pattern, even if the 
message content is the same? You're still changing the protocol here. Also, we 
don't need agreement, since different processes can have a different opinion 
about who the leader should be. They need to agree before they start a new 
epoch, but that's precisely what the recovery phase of Zab does. It does a bit 
more actually, but the whole state sync up is not relevant to this discussion.

bq. it actually doesn't take part in the leader election logic

This is not entirely true: the LE step exposes a leader that has the highest 
zxid among a quorum of servers. Also, I think that you're using LE to refer to 
the recovery phase of Zab, not to the initial protocol that finds a prospective 
leader.
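
To make the "highest zxid among a quorum" point concrete, here is a minimal 
sketch of how votes could be ordered so that the most up-to-date server is 
preferred. The class and field names (VoteOrderSketch, Vote, epoch, zxid, 
serverId) are hypothetical and chosen for illustration; this is a simplified 
model, not the actual ZooKeeper code.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: these names are assumptions, not ZooKeeper classes.
final class VoteOrderSketch {

    static final class Vote {
        final long epoch;     // epoch the proposed leader has seen (assumed field)
        final long zxid;      // last transaction id seen by the proposed leader
        final long serverId;  // id of the proposed leader, used as a tie-breaker

        Vote(long epoch, long zxid, long serverId) {
            this.epoch = epoch;
            this.zxid = zxid;
            this.serverId = serverId;
        }
    }

    // Returns true if candidate should replace current as the preferred vote:
    // higher epoch first, then higher zxid, then higher server id.
    static boolean supersedes(Vote candidate, Vote current) {
        if (candidate.epoch != current.epoch) {
            return candidate.epoch > current.epoch;
        }
        if (candidate.zxid != current.zxid) {
            return candidate.zxid > current.zxid;
        }
        return candidate.serverId > current.serverId;
    }

    // Picks the preferred vote among those received from a quorum.
    static Vote best(List<Vote> quorumVotes) {
        Vote best = quorumVotes.get(0);
        for (Vote v : quorumVotes) {
            if (supersedes(v, best)) {
                best = v;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Vote> votes = Arrays.asList(
                new Vote(1, 5, 1),
                new Vote(1, 7, 2),
                new Vote(1, 7, 3));
        // Servers 2 and 3 share the highest zxid; the higher id breaks the tie.
        System.out.println("preferred leader: " + best(votes).serverId);
    }
}
```

Under this ordering, a server that has seen more transactions always wins 
over one that has seen fewer, which is the property referred to above.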

bq. The new server just checks if the ensemble has a quorum and the leader is 
alive (sends a notification voting for itself)

I believe we have discussed this point in this jira. As you have observed, the 
ensemble is still able to make progress in the situation you have originally 
described, so the inconsistent LE information doesn't prevent ZooKeeper from 
doing work. The problem is getting a server stuck, which we fix by making sure 
that a follower is able to send notifications with state that reflects the 
latest leader election. 

One option I was actually considering is to loosen the constraint that all 
FOLLOWING/LEADING notifications need to come from the same LE round. This is 
possibly too conservative, so it might be ok to change it, but I need to think 
a bit more about it.
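
As a rough illustration of the constraint discussed above, the sketch below 
counts FOLLOWING/LEADING notifications that support a given leader, once 
requiring that they all carry the same LE round and once ignoring the round. 
All names (RoundFilterSketch, Notification, hasQuorumSameRound, 
hasQuorumAnyRound) are hypothetical; this is a simplified model under those 
assumptions, not the actual election code or a proposed patch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: these names are assumptions, not ZooKeeper classes.
final class RoundFilterSketch {

    enum State { LOOKING, FOLLOWING, LEADING }

    static final class Notification {
        final long senderId;  // server that sent the notification
        final long leaderId;  // leader that the sender reports
        final long round;     // LE round the sender's information comes from
        final State state;    // state of the sender

        Notification(long senderId, long leaderId, long round, State state) {
            this.senderId = senderId;
            this.leaderId = leaderId;
            this.round = round;
            this.state = state;
        }
    }

    // Strict variant: only FOLLOWING/LEADING notifications from the given
    // round count towards a quorum for leaderId.
    static boolean hasQuorumSameRound(List<Notification> received, long leaderId,
                                      long round, int ensembleSize) {
        Set<Long> supporters = new HashSet<>();
        for (Notification n : received) {
            if (n.state != State.LOOKING && n.leaderId == leaderId && n.round == round) {
                supporters.add(n.senderId);
            }
        }
        return supporters.size() > ensembleSize / 2;
    }

    // Loosened variant: FOLLOWING/LEADING notifications count regardless of round.
    static boolean hasQuorumAnyRound(List<Notification> received, long leaderId,
                                     int ensembleSize) {
        Set<Long> supporters = new HashSet<>();
        for (Notification n : received) {
            if (n.state != State.LOOKING && n.leaderId == leaderId) {
                supporters.add(n.senderId);
            }
        }
        return supporters.size() > ensembleSize / 2;
    }

    public static void main(String[] args) {
        // Three-server ensemble: the leader reports round 2, the follower round 1.
        List<Notification> received = Arrays.asList(
                new Notification(1, 1, 2, State.LEADING),
                new Notification(2, 1, 1, State.FOLLOWING));
        System.out.println("same round: " + hasQuorumSameRound(received, 1, 2, 3));
        System.out.println("any round:  " + hasQuorumAnyRound(received, 1, 3));
    }
}
```

In this toy scenario the strict check never sees a quorum because the two 
established servers report different rounds, while the loosened check does. 
Whether loosening is actually safe is exactly the open question above.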
                
> ZooKeeper server unable to join established ensemble
> ----------------------------------------------------
>
>                 Key: ZOOKEEPER-1732
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.5
>         Environment: Windows 7, Java 1.7
>            Reporter: Germán Blanco
>            Priority: Blocker
>             Fix For: 3.5.0, 3.4.6
>
>         Attachments: zklog.tar.gz
>
>
> I have a test in which I do a rolling restart of three ZooKeeper servers and 
> it was failing from time to time.
> I ran the tests in a loop until the failure came out and it seems that at 
> some point one of the servers is unable to join the ensemble formed by the 
> other two.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
