[
https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987517#action_12987517
]
Vishal K commented on ZOOKEEPER-975:
------------------------------------
Hi Flavio,
Do you think that this will be a problem even after we have the patch for
ZOOKEEPER-932?
This is what ZOOKEEPER-475 describes:
----------
* Replica 1 sends a message to itself and to Replica 2 stating that its current
vote is for replica 1;
* Replica 2 sends a message to itself and to Replica 1 stating that its current
vote is for replica 2;
* Replica 1 updates its vote, and sends a message to itself stating that its
current vote is for replica 2;
* Since replica 1 has two votes for 2 in an ensemble of 3 replicas, replica 1
decides to follow 2.
The problem is that replica 2 does not receive a message from 1 stating that it
changed its vote to 2, which prevents 2 from becoming a leader. Now looking
more carefully at why that happened, you can see that when 1 tries to send a
message to 2, QuorumCnxManager in 1 is both shutting down a connection to 2 at
the same time that it is trying to open a new one. The incorrect
synchronization prevents the creation of a new connection, and 1 and 2 end up
not connected.
----------
We no longer have incorrect synchronization. QCM in 1 can still be shutting
down the connection to 2 while 1 is trying to send a notification to 2.
However, the only time 1 will shut down a connection to 2 is when it receives a
new connection request from 2 (or when something is wrong with the connection).
A new connection request arrives precisely when 2 is trying to send a
notification to 1. As a result, 1 will still end up sending a notification to 2
saying that it is following 2. Do you agree?
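The argument above can be sketched as follows. This is a minimal, illustrative model with hypothetical names, not the actual QuorumCnxManager code: the key property is that a connection to a peer is torn down only at the moment that peer opens a replacement connection, so the teardown and the new, usable connection happen in one step and the pending notification can still be delivered.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative model (hypothetical names, not ZooKeeper's actual code):
// a connection to peer 'sid' is dropped only when that peer opens a
// replacement, so the drop and the new connection happen atomically.
class CnxManagerSketch {
    // Maps peer sid -> number of connections accepted so far; a key is
    // present iff some connection to that peer currently exists.
    private final Map<Long, Integer> connectionEpoch = new ConcurrentHashMap<>();

    // Peer 'sid' connected to us because it has a notification to send:
    // the old connection (if any) is shut down and replaced in one step.
    void receiveConnectionRequest(long sid) {
        connectionEpoch.merge(sid, 1, Integer::sum);
    }

    // A notification can be sent iff some connection to 'sid' exists;
    // there is never a window where the old connection is gone but the
    // replacement has not yet been installed.
    boolean canSendTo(long sid) {
        return connectionEpoch.containsKey(sid);
    }
}
```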
> new peer goes in LEADING state even if ensemble is online
> ---------------------------------------------------------
>
> Key: ZOOKEEPER-975
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-975
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.3.2
> Reporter: Vishal K
> Fix For: 3.4.0
>
> Attachments: ZOOKEEPER-975.patch
>
>
> Scenario:
> 1. 2 of the 3 ZK nodes are online
> 2. Third node is attempting to join
> 3. The third node unnecessarily goes into the "LEADING" state
> 4. The third node then goes back to LOOKING (no majority of followers) and
> finally moves to the FOLLOWING state.
> While going through the logs I noticed that a peer C that is trying to
> join an already formed cluster goes into the LEADING state. This is because
> the QuorumCnxManager of A and B sends the entire history of notification
> messages to C. C receives the notification messages that were
> exchanged between A and B when they were forming the cluster.
> In FastLeaderElection.lookForLeader(), due to the following piece of
> code, C quits lookForLeader assuming that it is supposed to lead.
> // If have received from all nodes, then terminate
> if ((self.getVotingView().size() == recvset.size()) &&
>         (self.getQuorumVerifier().getWeight(proposedLeader) != 0)) {
>     self.setPeerState((proposedLeader == self.getId()) ?
>             ServerState.LEADING : learningState());
>     leaveInstance();
>     return new Vote(proposedLeader, proposedZxid);
> } else if (termPredicate(recvset,
> This can cause:
> 1. C to unnecessarily go into the LEADING state, wait for tickTime * initLimit,
> and then restart FLE.
> 2. C to wait for 200 ms (finalizeWait) and then decide based on whatever
> notifications it has received. C could potentially decide to follow an old
> leader, fail to connect to that leader, and then restart FLE. See the code
> below.
> if (termPredicate(recvset,
>         new Vote(proposedLeader, proposedZxid,
>                 logicalclock))) {
>
>     // Verify if there is any change in the proposed leader
>     while ((n = recvqueue.poll(finalizeWait,
>             TimeUnit.MILLISECONDS)) != null) {
>         if (totalOrderPredicate(n.leader, n.zxid,
>                 proposedLeader, proposedZxid)) {
>             recvqueue.put(n);
>             break;
>         }
>     }
> In general, this does not affect the correctness of FLE, since C will
> eventually go back to the FOLLOWING state (A and B won't vote for C).
> However, it delays C from joining the cluster, which can in turn affect
> the recovery time of an application.
> Proposal: A and B should send only the latest (most recent) notification
> instead of the entire history. Does this sound reasonable?
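The proposal above could be sketched as follows. This is a minimal illustration with hypothetical names (not the actual QuorumCnxManager send queues): instead of appending every notification to a per-peer history, keep only the most recent one per peer, so a newly connected peer receives a single up-to-date message.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: keep only the most recent notification per peer,
// rather than a full history, so a newly connected peer gets exactly one
// up-to-date message on (re)connection.
class LatestNotificationBuffer {
    // Maps peer sid -> latest notification payload (names are illustrative).
    private final Map<Long, byte[]> latest = new ConcurrentHashMap<>();

    // Called whenever a notification is queued for a peer: overwrite any
    // stale entry instead of appending to a history.
    void offer(long sid, byte[] notification) {
        latest.put(sid, notification);
    }

    // Called when a connection to a peer is (re)established: resend only
    // the single most recent notification, or nothing if none was queued.
    byte[] toResend(long sid) {
        return latest.get(sid);
    }
}
```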
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.