[ https://issues.apache.org/jira/browse/ZOOKEEPER-4040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048939#comment-18048939 ]

Xin Chen edited comment on ZOOKEEPER-4040 at 1/6/26 6:38 AM:
-------------------------------------------------------------

[~pf] Hello, did you try this on a newer ZooKeeper version, such as 3.6.2? I 
did, and the issue can still always be reproduced, because the relevant code in 
org.apache.zookeeper.server.quorum.Learner#registerWithLeader has never changed:

 
{code:java}
} else if (newEpoch == self.getAcceptedEpoch()) {
    // since we have already acked an epoch equal to the leaders, we cannot ack
    // again, but we still need to send our lastZxid to the leader so that we can
    // sync with it if it does assume leadership of the epoch.
    // the -1 indicates that this reply should not count as an ack for the new epoch
    wrappedEpochBytes.putInt(-1);
} else {
    throw new IOException("Leaders epoch, " + newEpoch + " is less than accepted epoch, " + self.getAcceptedEpoch());
}
{code}
 

[~symat] [~ztzg] [~fekelund] [~akai12] I propose a seemingly simple fix: when 
this situation is encountered, force the replica's acceptedEpoch to be updated 
to the newEpoch sent by the leader. When the replica then tries to rejoin the 
cluster through another election, it avoids this endless loop of errors and 
joins the cluster successfully.

The following is the proposed bugfix:
{code:java}
} else if (newEpoch == self.getAcceptedEpoch()) {
    ...
} else {
    long acceptedEpoch = self.getAcceptedEpoch();
    LOG.warn("Leaders epoch, {} is less than accepted epoch, {}", newEpoch, acceptedEpoch);
    // To avoid getting stuck in an infinite loop when the acceptedEpoch of a learner is
    // greater than that of the leader, forcibly set the local acceptedEpoch to the leader's epoch.
    self.setAcceptedEpoch(newEpoch);
    // Report the original acceptedEpoch, since it has just been overwritten above.
    throw new IOException("Leaders epoch, " + newEpoch + " is less than accepted epoch, " + acceptedEpoch);
}
{code}
If you are concerned about acceptedEpoch and currentEpoch becoming 
inconsistent, both can be updated together:
{code:java}
} else if (newEpoch == self.getAcceptedEpoch()) {
    ...
} else {
    long acceptedEpoch = self.getAcceptedEpoch();
    LOG.warn("Leaders epoch, {} is less than accepted epoch, {}", newEpoch, acceptedEpoch);
    // To avoid getting stuck in an infinite loop when the acceptedEpoch of a learner is
    // greater than that of the leader, forcibly set both local epochs to the leader's epoch.
    self.setCurrentEpoch(newEpoch);
    self.setAcceptedEpoch(newEpoch);
    // Report the original acceptedEpoch, since it has just been overwritten above.
    throw new IOException("Leaders epoch, " + newEpoch + " is less than accepted epoch, " + acceptedEpoch);
}
{code}
 

If directly rolling back this replica's acceptedEpoch (from a larger value to a 
smaller one) does not fit well with the existing logic, could we instead roll 
back only when the epoch of the cluster leader is not less than the 
currentEpoch of the replica? This seems more reasonable:
{code:java}
} else if (newEpoch == self.getAcceptedEpoch()) {
    ...
} else { // newEpoch < self.getAcceptedEpoch()
    LOG.warn("Leaders epoch, {} is less than accepted epoch, {}", newEpoch, self.getAcceptedEpoch());

    if (newEpoch >= self.getCurrentEpoch()) {
        // When the leader's epoch is less than the learner's acceptedEpoch
        // but not less than the learner's currentEpoch, forcibly set the local
        // currentEpoch and acceptedEpoch to the leader's epoch.
        self.setCurrentEpoch(newEpoch);
        self.setAcceptedEpoch(newEpoch);
        wrappedEpochBytes.putInt(-1);
    } else {
        throw new IOException("Leaders epoch, " + newEpoch + " is less than accepted epoch, " + self.getAcceptedEpoch());
    }
}
{code}
 

I hope someone can reply and let me know whether this fix is feasible. I have 
reproduced the issue following the steps above and verified that with this 
bugfix the replica can join the cluster and return to normal.
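
For reference, here is a minimal standalone sketch of how I inspected the 
persisted epoch values while reproducing. It assumes the standard on-disk 
layout, where the quorum peer stores acceptedEpoch and currentEpoch as 
plain-text files under <dataDir>/version-2; the dataDir path and class name 
below are only examples, not part of the proposed patch:
{code:java}
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Standalone helper (not part of the patch): prints the epoch values that the
// quorum peer persists under <dataDir>/version-2, so the stuck state
// (acceptedEpoch ahead of the cluster) can be confirmed before and after restart.
public class EpochFileCheck {
    public static void main(String[] args) throws Exception {
        // Example dataDir; pass the real one as the first argument.
        Path versionDir = Paths.get(args.length > 0 ? args[0] : "/data/zookeeper", "version-2");
        for (String name : new String[] {"acceptedEpoch", "currentEpoch"}) {
            Path file = versionDir.resolve(name);
            String value = Files.exists(file)
                    ? new String(Files.readAllBytes(file)).trim()
                    : "<missing>";
            System.out.println(name + " = " + value);
        }
    }
}
{code}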

 


> java.io.IOException: Leaders epoch, 1 is less than accepted epoch, 2
> --------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4040
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4040
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.5, 3.5.8, 3.6.2
>            Reporter: pengfei
>            Priority: Major
>         Attachments: image-2020-12-28-18-20-07-842.png, 
> image-2020-12-28-18-23-14-073.png, image-2020-12-28-18-25-31-960.png, 
> image-2020-12-28-18-28-07-015.png
>
>
> h4. Overview (mechanically translated from ZOOKEEPER-4039):
> The acceptedEpoch is too large and the corresponding node cannot join the 
> cluster.
> After the leader receives the acceptedEpoch of more than half of the nodes, 
> it sets its own acceptedEpoch to the maximum of those values plus 1. If the 
> leader goes down at this point, its acceptedEpoch is 1 larger than that of 
> the other nodes. If this node then restarts, is elected leader again, and 
> goes down again, the remaining nodes elect a new leader whose epoch is 
> smaller than the acceptedEpoch of the original node, so the original node 
> keeps switching between the LOOKING and FOLLOWING states.
> Steps to reproduce:
> 3 nodes: server1, server2, server3
> Start server1 and server2, then stop them at the red dot shown below. At 
> this point server2's acceptedEpoch is 1.
> Restart server1 and server2, then stop them at the red dot shown below 
> again. At this point server2's acceptedEpoch is 2.
> Restart server1 and server3 and wait for them to elect server3 as leader, 
> then start server2; the following exception is thrown repeatedly.
> h4. errorlog:
> java.io.IOException: Leaders epoch, 1 is less than accepted epoch, 2
>     at org.apache.zookeeper.server.quorum.Learner.registerWithLeader(Learner.java:353)
>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:78)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1271)
> 2020-12-28 18:09:25,176 [myid:2] - INFO [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2182)(secure=disabled):Follower@201] - shutdown called
> java.lang.Exception: shutdown Follower
>     at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:201)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1275)
>  
> h4. sample:
> cluster of three servers: server1, server2, server3
>  * start server1 and server2, then shut them down when they reach the point 
> shown below; now the acceptedEpoch of server2 is 1, server1 is 0, server3 is 
> 0 !image-2020-12-28-18-23-14-073.png!
>  * then repeat step 1; now the acceptedEpoch of server1 is 0, server2 is 2, 
> server3 is 0 !image-2020-12-28-18-25-31-960.png!
>  * then start server1 and server3, wait until the leader of the cluster is 
> server3, then start server2; the error below is generated 
> !image-2020-12-28-18-28-07-015.png!



