[ https://issues.apache.org/jira/browse/ZOOKEEPER-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenhan Feng updated ZOOKEEPER-4892:
-----------------------------------
    Description: 
It seems the quorum rule of the leader election process is inconsistent between the ZooKeeper 
implementation and the TLA+ specification.

In the {*}implementation{*}, according to the code in the _FastLeaderElection_ 
class, a node tracks the votes of all cluster members it has heard from in a 
`recvset` (modeled as the _receiveVotes_ variable in the specification). At the 
same time, it maintains its own vote in the variables {_}proposedLeader{_}, 
{_}proposedZxid{_}, and _proposedEpoch_ (modeled as the _currentVote_ variable 
in the specification). The quorum rule is: when the votes from a majority of 
the nodes in the _recvset_ match the node's own vote ({_}proposedLeader{_}, 
{_}proposedZxid{_}, and _proposedEpoch_), the node considers the quorum rule 
satisfied and moves to the next stage of the election.

Related code:
{code:java}
// Line 761-780 in FastLeaderElection.java
protected SyncedLearnerTracker getVoteTracker(Map<Long, Vote> votes, Vote vote) {
        ....
        for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
            if (vote.equals(entry.getValue())) {
                voteSet.addAck(entry.getKey());
            }
        }
        return voteSet;
}

// Line 1045 in FastLeaderElection.java
voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch));
{code}
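
To make the implementation's rule concrete, here is a minimal, self-contained sketch (not ZooKeeper code; the class and method names are hypothetical) of how acks are counted against the node's own proposal, assuming a simple majority quorum:

{code:java}
// Hypothetical sketch of the implementation-style quorum rule: a server's
// vote counts as an ack only if the entry recorded in recvset equals the
// node's own current proposal. In particular, the node's OWN entry is only
// counted once its self-notification has been received and stored there.
import java.util.Map;

final class ImplQuorumSketch {
    record Vote(long leader, long zxid, long electionEpoch) {}

    static boolean hasQuorum(Map<Long, Vote> recvset, Vote myVote, int clusterSize) {
        long acks = recvset.values().stream().filter(myVote::equals).count();
        return acks > clusterSize / 2; // simple-majority quorum, for illustration
    }
}
{code}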
In the {*}specification{*}, according to the actions _FLEHandleNotmsg_ and 
_HandleNotmsg_, the node itself is unconditionally included in the set of nodes 
considered to agree. The specification then checks whether the votes from the 
other nodes in _receiveVotes_ match the node's _currentVote_. If the nodes 
whose votes match _currentVote_, together with the node itself, form a 
majority, the quorum is considered reached.

Related code:
{code}
\* Line 216-222 in FastLeaderElection.tla.
VoteSet(i, msource, rcvset, thisvote, thisround) ==
    {msource} \union {s \in (Server \ {msource}):
                        VoteEqual(rcvset[s].vote, rcvset[s].round,
                                  thisvote, thisround)}

HasQuorums(i, msource, rcvset, thisvote, thisround) ==
    LET Q == VoteSet(i, msource, rcvset, thisvote, thisround)
    IN IF Q \in Quorums THEN TRUE ELSE FALSE

\* Line 433 in HandleNotmsg action of FastLeaderElection.tla.
/\ LET hasQuorums == HasQuorums(i, i, receiveVotes'[i], currentVote'[i], n.mround)
{code}

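For comparison, here is the same kind of hypothetical sketch for the specification-style rule, mirroring _VoteSet_: the node itself (_msource_) is counted unconditionally, and only the other servers' entries are compared against _currentVote_ (the round comparison of _VoteEqual_ is folded into the vote's electionEpoch field for brevity):

{code:java}
// Hypothetical sketch of the specification-style quorum rule: the node
// itself is always included in the agreeing set, and only OTHER servers'
// recorded votes are compared against the node's currentVote.
import java.util.Map;

final class SpecQuorumSketch {
    record Vote(long leader, long zxid, long electionEpoch) {}

    static boolean hasQuorum(long self, Map<Long, Vote> receiveVotes,
                             Vote currentVote, int clusterSize) {
        long acks = 1; // the node itself (msource) is counted unconditionally
        for (Map.Entry<Long, Vote> e : receiveVotes.entrySet()) {
            if (e.getKey() != self && currentVote.equals(e.getValue())) {
                acks++;
            }
        }
        return acks > clusterSize / 2; // stands in for Q \in Quorums
    }
}
{code}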


We can see that these two mechanisms are not entirely consistent. Consider the 
following three-node scenario:

1. The cluster starts. At this point, node 1's _recvset_ contains votes from 
nodes 1, 2, and 3, all of which are initial votes (denoted as `InitialVote`).
2. Node 3 sends a _Notification-1_ (leader=N3, zxid=1, electionEpoch=1) to 
nodes 1, 2, and 3.
3. Node 1 receives _Notification-1_ from node 3.
4. Node 1 updates its vote to _Notification-1_ and broadcasts _Notification-1_ 
to nodes 1, 2, and 3.
5. Node 1 updates its _recvset_. At this point, node 1's `recvset` is \{N1: 
_InitialVote_, N2: _InitialVote_, N3: _Notification-1_}.
6. According to the specification, node 1 now believes that a quorum of nodes 
has reached consensus. However, according to the implementation, node 1 must 
first receive the _Notification-1_ it sent to itself in step 4 and update its 
_recvset_ (i.e., node 1's own entry in the `recvset` must also be set to 
_Notification-1_) before it can consider the quorum reached.

The attached figure "Inconsistency in quorom rule.png" shows the system state 
after step 5 in more detail.
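
The divergence can also be reproduced directly from the step-5 state. Below is a self-contained, hypothetical demonstration (the zxid/epoch values of the initial votes are assumed for illustration) in which the two rules disagree on the same `recvset`:

{code:java}
// Hypothetical demonstration of node 1's state after step 5: its own entry
// in recvset is still InitialVote, but currentVote is already Notification-1.
import java.util.Map;

final class Step5Divergence {
    record Vote(long leader, long zxid, long electionEpoch) {}

    public static void main(String[] args) {
        Vote initial1 = new Vote(1, 0, 1);      // N1's InitialVote (assumed zxid=0)
        Vote initial2 = new Vote(2, 0, 1);      // N2's InitialVote (assumed zxid=0)
        Vote notif1   = new Vote(3, 1, 1);      // Notification-1: leader=N3, zxid=1, epoch=1

        Map<Long, Vote> recvset = Map.of(1L, initial1, 2L, initial2, 3L, notif1);
        Vote currentVote = notif1;              // node 1 has already updated its proposal
        int n = 3;

        // Implementation-style: only matching entries in recvset count (here: N3 only).
        long implAcks = recvset.values().stream().filter(currentVote::equals).count();

        // Spec-style: node 1 itself plus matching entries of OTHER servers (here: N1 + N3).
        long specAcks = 1 + recvset.entrySet().stream()
                .filter(e -> e.getKey() != 1L && currentVote.equals(e.getValue()))
                .count();

        System.out.println("implementation quorum: " + (implAcks > n / 2)); // false
        System.out.println("specification quorum:  " + (specAcks > n / 2)); // true
    }
}
{code}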

I noticed that in the current version of the TLA+ specification, the developers 
avoid modeling a node sending a message to itself and processing its own 
message. However, this simplification seems to lead to:
1. Inconsistencies between key algorithms in the specification and the 
implementation. In the example above, the inconsistent quorum rule caused me 
some confusion when I read the code.
2. Omission of some test scenarios. In the example above, there are possible 
concurrent behaviors in the cluster after step 5 but before node 1 receives 
its own `Notification` and updates its `recvset` in step 6. These concurrent 
behaviors cannot be verified with the current version of the specification.

I believe this issue can be fixed with either of the following two solutions:

1. *Simpler solution*: In the _HandleNotmsg\(i\)_ action of the specification, 
update _receiveVotes[i]_ after performing `UpdateProposal\(i\)`. Then modify 
the _HasQuorums_ action to match the implementation's quorum rule (see the 
sketch after this list).

2. *Complete solution*: Modify the _BroadcastNotmsg_ action in the 
specification so that _Notification_ messages a node sends to itself are 
placed in its _recvqueue_, as actually happens in the implementation 
(currently, the specification simply drops such messages). Also, extend the 
handling logic for these messages in the _FLEReceiveNotmsg_ and 
_FLEHandleNotmsg_ actions. Finally, modify the _HasQuorums_ action to match 
the implementation's quorum rule.
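
To illustrate the intended effect of the simpler solution, here is a hypothetical Java sketch (the actual fix would of course be made in the TLA+ actions): once _receiveVotes[i]_ is refreshed right after _UpdateProposal\(i\)_, a single implementation-style count over all entries, the node's own included, reproduces the behavior of `getVoteTracker`:

{code:java}
// Hypothetical sketch of the "simpler solution": refresh node i's own entry
// in receiveVotes to its new proposal, then apply the implementation-style
// count over ALL entries (self included).
import java.util.HashMap;
import java.util.Map;

final class SimplerSolutionSketch {
    record Vote(long leader, long zxid, long electionEpoch) {}

    static boolean hasQuorumAfterFix(long self, Map<Long, Vote> receiveVotes,
                                     Vote currentVote, int clusterSize) {
        Map<Long, Vote> updated = new HashMap<>(receiveVotes);
        updated.put(self, currentVote); // mirrors updating receiveVotes[i]
        long acks = updated.values().stream().filter(currentVote::equals).count();
        return acks > clusterSize / 2;
    }
}
{code}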

If this issue is confirmed, I would be happy to provide a PR.

> Inconsistency of the quorum rule of the leader election process between the 
> ZooKeeper implementation and specification.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4892
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4892
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.9.3
>            Environment: Run the ZooKeeper specification with TLA+ Toolbox 1.8.0 on Windows 11; run the ZooKeeper implementation on Ubuntu 22.04.
>            Reporter: Wenhan Feng
>            Priority: Major
>         Attachments: Inconsistency in quorom rule.png
>
>
> It seems the quorum rule of the leader election process between the Zookeeper 
> implmentation and specification is inconsistent.
> In {*}implementation{*}, according to the code in the _FastLeaderElection_
> class, a node maintains the votes of all nodes in the cluster it considers 
> through a `recvset` (which are modeled into _receiveVotes_ variable in the 
> specification). At the same time, it maintains its own vote through the 
> variables {_}proposedLeader{_}, {_}proposedZxid{_}, and _proposedEpoch_ 
> (which are modeled into _currentVote_ variable in the specification). The 
> quorum rule is that when there are votes from a majority of the nodes in the 
> _recvset_ that match its own vote ({_}proposedLeader{_}, {_}proposedZxid{_}, 
> and _proposedEpoch_) , the node considers the quorum rule is satisfied and 
> moves to the next stage of the election.
> Related codes:
> {code:java}
> // Line 761-780 in FastLeaderElection.java
> protected SyncedLearnerTracker getVoteTracker(Map<Long, Vote> votes, Vote 
> vote) {
>         ....
>         for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
>             if (vote.equals(entry.getValue())) {
>                 voteSet.addAck(entry.getKey());
>             }
>         }
>         return voteSet;
> }
> // Line 1045 in FastLeaderElection.java
> voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, 
> logicalclock.get(), proposedEpoch));
> {code}
> In {*}specification{*}, according to the action _FLEHandleNotmsg_ and 
> _HandleNotmsg_, the node itself is initially included in the set of nodes 
> that have reached consensus. Then, the specification checks whether the votes 
> from other nodes in the _rreceiveVotes_ match the node's _currentVote_. If 
> the number of nodes whose votes match the node's _currentVote_ reaches a 
> majority with the node itself, the quorum is considered reached.
> Related codes:
> {code:java}
> \* Line 216-222 in FastLeaderElection.tla.
> VoteSet(i, msource, rcvset, thisvote, thisround) == \{msource} \union \{s \in 
> (Server \ {msource}): VoteEqual(rcvset[s].vote,
>                                                                               
>                               rcvset[s].round,
>                                                                               
>                               thisvote,
>                                                                               
>                               thisround)}
> HasQuorums(i, msource, rcvset, thisvote, thisround) == LET Q == VoteSet(i, 
> msource, rcvset, thisvote, thisround)
>                                                        IN IF Q \in Quorums 
> THEN TRUE ELSE FALSE
> \* Line 433 in HandleNotmsg action of FastLeaderElection.tla.
> /\ LET hasQuorums == HasQuorums(i, i, receiveVotes'[i], currentVote'[i], 
> n.mround)
> {code}
> We can see that these two mechanisms are not entirely consistent. Consider 
> the following three-node scenario:
> 1. The cluster starts, and at this point, node 1's _recvset_ contains votes 
> from nodes 1, 2, and 3, all of which are initial votes (denoted as 
> `InitialVote`).
> 2. Node 3 sends a _Notification-1_ (leader=N3, zxid=1, electionEpoch=1) to 
> nodes 1, 2, and 3.
> 3. Node 1 receives _Notification-1_ from node 3.
> 4. Node 1 updates its vote to _Notification-1_ and sends _Notification-1_ to 
> nodes 1, 2, and 3 again.
> 5. Node 1 updates its _recvset_. At this point, node 1's `recvset` is \{N1: 
> _InitialVote_, N2: _InitialVote_, N3: _Notification-1_}.
> 6. According to the specification, node 1 now believes that quorom nodes have 
> reached consensus. However, according to the implementation, node 1 needs to 
> receive the _Notification-1_ sent by itself in step 4 and update its 
> _recvset_ (i.e., node 1's vote in the `recvset` must also be set to 
> _Notification-1_) before it can consider the quorum reached.
> This figure "Inconsistency in quorom rule.png" in attachment shows the system 
> state after step 5 more detailedly.
> I noticed that in the current version of the TLA+ specification, developers 
> are attempting to avoid modeling the process of a node sending a message to 
> itself and processing its own message. However, this simplification seems to 
> lead to:
> 1. Inconsistencies between key algorithms in the specification and the 
> implementation. In the example above, the inconsistent quorum rule caused me 
> some confusion when I read the code.
> 2. Omission of some test scenarios. In the example above, There are some 
> possible concurrent behaviors in the cluster after step 5 but before node 1 
> receives its own `Notification` and updates its `recvset` in step 6. These 
> concurrent behaviors cannot be verified with the current version of the 
> specification.
> I believe this issue can be fixed with the following two possible solutions:
> 1. *Simpler solution*: In the _HandleNotmsg\(i\)_ action of the 
> specification, update _receiveVotes[i]_ after performing 
> `UpdateProposal\(i\)`. Then, modify the _HasQuorums_ action to match the 
> implementation's quorum rule.
>    
> 2. *Complete solution*: Modify the _BroadcastNotmsg_ action in the 
> specification to include _Notification_ messages sent to itself in the 
> _recvqueue_, as is actually in the implementation (currently, the 
> specification directly ignores such messages). Also, improve the handling 
> logic for these types of messages in _FLEReceiveNotmsg_ and _FLEHandleNotmsg_ 
> actions. Finally, modify the _HasQuorums_ action to match the 
> implementation's quorum rule.
> If this issue is confirmed, I would be happy to provide a PR.


