[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058539#comment-18058539
 ] 

Arup Chauhan edited comment on ZOOKEEPER-4892 at 2/13/26 10:07 PM:
-------------------------------------------------------------------

Hi, I’d like to work on this, I’ll open a draft PR and link it here.

I plan to start with the simpler, spec-only approach: {{update 
receiveVotes[i][i]}}  after {{UpdateProposal}} and adjust 
{{HasQuorums/VoteSet}} so quorum counting matches {{FastLeaderElection.java}} 
(count matching votes from receiveVotes, no implicit self inclusion). I’ll 
validate with TLC on a 3-node model and include the config/results in the PR. 


was (Author: JIRAUSER312424):
Hi, I’d like to work on this, I’ll open a draft PR and link it here.

I plan to start with the simpler, spec-only approach: update receiveVotes[i][i] 
after UpdateProposal(i) and adjust HasQuorums/VoteSet so quorum counting 
matches FastLeaderElection.java (count matching votes from receiveVotes, no 
implicit self inclusion). I’ll validate with TLC on a 3-node model and include 
the config/results in the PR. 

> Inconsistency of the quorum rule of the leader election process between the 
> Zookeeper implmentation and specification.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4892
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4892
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.9.3
>         Environment: Run Zookeeper specification with TLA+ Toolbox 1.8.0 in 
> Win11.
> Run Zookeeper implementation in ubuntu 220.4
>            Reporter: Wenhan Feng
>            Priority: Major
>         Attachments: Inconsistency in quorom rule.png
>
>
> It seems the quorum rule of the leader election process between the Zookeeper 
> implmentation and specification is inconsistent.
> In {*}implementation{*}, according to the code in the _FastLeaderElection_
> class, a node maintains the votes of all nodes in the cluster it considers 
> through a `recvset` (which are modeled into _receiveVotes_ variable in the 
> specification). At the same time, it maintains its own vote through the 
> variables {_}proposedLeader{_}, {_}proposedZxid{_}, and _proposedEpoch_ 
> (which are modeled into _currentVote_ variable in the specification). The 
> quorum rule is that when there are votes from a majority of the nodes in the 
> _recvset_ that match its own vote ({_}proposedLeader{_}, {_}proposedZxid{_}, 
> and _proposedEpoch_) , the node considers the quorum rule is satisfied and 
> moves to the next stage of the election.
> Related codes:
> {code:java}
> // Line 761-780 in FastLeaderElection.java
> protected SyncedLearnerTracker getVoteTracker(Map<Long, Vote> votes, Vote 
> vote) {
>         ....
>         for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
>             if (vote.equals(entry.getValue())) {
>                 voteSet.addAck(entry.getKey());
>             }
>         }
>         return voteSet;
> }
> // Line 1045 in FastLeaderElection.java
> voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, 
> logicalclock.get(), proposedEpoch));
> {code}
> In {*}specification{*}, according to the action _FLEHandleNotmsg_ and 
> _HandleNotmsg_, the node itself is initially included in the set of nodes 
> that have reached consensus. Then, the specification checks whether the votes 
> from other nodes in the _rreceiveVotes_ match the node's _currentVote_. If 
> the number of nodes whose votes match the node's _currentVote_ reaches a 
> majority with the node itself, the quorum is considered reached.
> Related codes:
> {code:java}
> \* Line 216-222 in FastLeaderElection.tla.
> VoteSet(i, msource, rcvset, thisvote, thisround) == \{msource} \union \{s \in 
> (Server \ {msource}): VoteEqual(rcvset[s].vote,
>                                                                               
>                               rcvset[s].round,
>                                                                               
>                               thisvote,
>                                                                               
>                               thisround)}
> HasQuorums(i, msource, rcvset, thisvote, thisround) == LET Q == VoteSet(i, 
> msource, rcvset, thisvote, thisround)
>                                                        IN IF Q \in Quorums 
> THEN TRUE ELSE FALSE
> \* Line 433 in HandleNotmsg action of FastLeaderElection.tla.
> /\ LET hasQuorums == HasQuorums(i, i, receiveVotes'[i], currentVote'[i], 
> n.mround)
> {code}
> We can see that these two mechanisms are not entirely consistent. Consider 
> the following three-node scenario:
> 1. The cluster starts, and at this point, node 1's _recvset_ contains votes 
> from nodes 1, 2, and 3, all of which are initial votes (denoted as 
> `InitialVote`).
> 2. Node 3 sends a _Notification-1_ (leader=N3, zxid=1, electionEpoch=1) to 
> nodes 1, 2, and 3.
> 3. Node 1 receives _Notification-1_ from node 3.
> 4. Node 1 updates its vote to _Notification-1_ and sends _Notification-1_ to 
> nodes 1, 2, and 3 again.
> 5. Node 1 updates its _recvset_. At this point, node 1's `recvset` is \{N1: 
> _InitialVote_, N2: _InitialVote_, N3: _Notification-1_}.
> 6. According to the specification, node 1 now believes that quorom nodes have 
> reached consensus. However, according to the implementation, node 1 needs to 
> receive the _Notification-1_ sent by itself in step 4 and update its 
> _recvset_ (i.e., node 1's vote in the `recvset` must also be set to 
> _Notification-1_) before it can consider the quorum reached.
> This figure "Inconsistency in quorom rule.png" in attachment shows the system 
> state after step 5 more detailedly.
> I noticed that in the current version of the TLA+ specification, developers 
> are attempting to avoid modeling the process of a node sending a message to 
> itself and processing its own message. However, this simplification seems to 
> lead to:
> 1. Inconsistencies between key algorithms in the specification and the 
> implementation. In the example above, the inconsistent quorum rule caused me 
> some confusion when I read the code.
> 2. Omission of some test scenarios. In the example above, There are some 
> possible concurrent behaviors in the cluster after step 5 but before node 1 
> receives its own `Notification` and updates its `recvset` in step 6. These 
> concurrent behaviors cannot be verified with the current version of the 
> specification.
> I believe this issue can be fixed with the following two possible solutions:
> 1. *Simpler solution*: In the _HandleNotmsg\(i\)_ action of the 
> specification, update _receiveVotes[i]_ after performing 
> `UpdateProposal\(i\)`. Then, modify the _HasQuorums_ action to match the 
> implementation's quorum rule.
>    
> 2. *Complete solution*: Modify the _BroadcastNotmsg_ action in the 
> specification to include _Notification_ messages sent to itself in the 
> _recvqueue_, as is actually in the implementation (currently, the 
> specification directly ignores such messages). Also, improve the handling 
> logic for these types of messages in _FLEReceiveNotmsg_ and _FLEHandleNotmsg_ 
> actions. Finally, modify the _HasQuorums_ action to match the 
> implementation's quorum rule.
> If this issue is confirmed, I would be happy to provide a PR.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to