[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sirius updated ZOOKEEPER-4685:
------------------------------
    Description: 
When a follower is processing the NEWLEADER message in SYNC phase, its 
QuorumPeer thread will call {{logRequest(..)}} to submit the txn persistence 
task to the SyncThread. The SyncThread may persist txns and reply ACKs of them 
before replying ACK-LD (i.e. ACK of NEWLEADER) to the leader. This may cause 
the consequence that the leader cannot collect enough number of ACK-LDs 
successfully, followed by the leader's shutdown and a new round of election. 
This introduces unnecessary recovery procedures, consumes extra time before 
servers get into the BROADCAST phase and reduces the service's availability a 
lot. 

The following trace can be generated in the latest version nowadays.

 
h2. Trace

Start the ensemble with three nodes: S{+}0{+}, +S1+ & {+}S2{+}.
 - +S2+ is elected leader.
 - +S2+ logs a new txn <1, 1> and makes a broadcast.
 - +S0+ restarts & +S1+ crashes before receiving the proposal of <1, 1>.
 - +S2+ is elected leader again. 
 - +S2+ syncs with +S0+ using DIFF, and sends the proposal of <1, 1> during 
SYNC.
 - After +S0+ receives NEWLEADER,  {+}S0{+}'s SyncThread may persist the txn 
<1, 1> and reply corresponding ACK to the leader +S2+ before {+}S0{+}'s 
QuorumPeer thread replies ACK-LD to the leader +S2+ .(This is possible because 
txn logging is processed asynchronously by SyncThread! )
 - The corresponding LearnerHandler on +S2+ cannot recognize the ACK of some 
proposal before ACK-LD, and will be blocked at _waitForStartup()_ until the 
leader turns its state to {_}state.RUNNING{_}.
 - However, the QuorumPeer of the leader +S2+ will not receive enough number of 
ACK-LDs before timeout, and then throws _InterruptedException_ during 
{_}waitForNewLeaderAck(..){_}.
 - After that, the leader will shutdown and a new round of election is raised, 
which consumes extra time for establishing the quorum and reduces availability 
a lot.

 
h2. Analysis

*Root Cause:*

Similar to  ZOOKEEPER-4646 , the root cause lies in the asynchronous executions 
by multi-threads on the follower side.The implementation adopts the 
multi-threading style for performance optimization. However, it may bring some 
underlying subtle bugs.

When the follower receives NEWLEADER, it calls {{logRequest(..)}} to submit the 
logging requests to SyncRequestProcessor's queue before replying ACK-LD. The 
SyncThread may be scheduled to persist the txns and reply ACK(s) before the 
QuorumPeer thread replies ACK-LD.

On the leader side, the corresponding learnerHandler will not recognize the ACK 
of PROPOSAL during _waitForNewLeaderAck(..)_ , and then will be blocked at 
_waitForStartup()_ until the leader turns its state to {_}state.RUNNING{_}. 
However, the leader may not receive enough ACK-LDs before timeout, and then 
throws _InterruptedException_ during {_}waitForNewLeaderAck(..){_}. After that, 
the leader will shutdown and the servers go into a new election and recovery 
process. 

This recovery procedure consumes extra recovery time and increase system 
unavailability time.

To some degree it is unnecessary. It can be fixed and optimized by guaranteeing 
the follower side's message order.

 

*Affected Versions:*

The above trace has been generated in multiple versions such as 3.7.1 & 3.8.1 
(the latest stable & current version till now) by our testing tools. The 
affected versions might be more, since the critical update order between the 
follower's replying ACK-LD and replying ACK of PROPOSAL during SYNC stay 
non-deterministic in multiple versions.

 
h2. Possible Fix

Considering this issue and ZOOKEEPER-4646 , one possible fix is to guarantee 
the following partial orders to be satisfied:
 * The follower replies ACK of PROPOSAL only after it replies ACK-LD (i.e. ACK 
of NEWLEADER) to the leader (so as to avoid this issue).
 * The follower replies ACK-LD only after it has persisted the txns that might 
be applied to the leader's datatree before the leader gets into the BROADCAST 
phase (to avoid ZOOKEEPER-4646).

  was:
When a follower is processing the NEWLEADER message in SYNC phase, its 
QuorumPeer thread will call {{logRequest(..)}} to submit the txn persistence 
task to the SyncThread. The SyncThread may persist txns and reply ACKs of them 
before replying ACK-LD (i.e. ACK of NEWLEADER) to the leader. This may cause 
the consequence that the leader cannot collect enough number of ACK-LDs 
successfully, followed by the leader's shutdown and a new round of election. 
This introduces unnecessary recovery procedures, consumes extra time before 
servers get into the BROADCAST phase and reduces the service's availability a 
lot. 

The following trace can be generated in the latest version nowadays.

 
h2. Trace

Start the ensemble with three nodes: S{+}0{+}, +S1+ & {+}S2{+}.
 - +S2+ is elected leader.
 - +S2+ logs a new txn <1, 1> and makes a broadcast.
 - +S0+ restarts & +S1+ crashes before receiving the proposal of <1, 1>.
 - +S2+ is elected leader again. 
 - +S2+ syncs with +S0+ using DIFF, and sends the proposal of <1, 1> during 
SYNC.
 - After +S0+ receives NEWLEADER,  {+}S0{+}'s SyncThread may persist the txn 
<1, 1> and reply corresponding ACK to the leader +S2+ before {+}S0{+}'s 
QuorumPeer thread replies ACK-LD to the leader +S2+ .(This is possible because 
txn logging is processed asynchronously by SyncThread! )
 - The corresponding LearnerHandler on +S2+ cannot recognize the ACK of some 
proposal before ACK-LD, and will be blocked at _waitForStartup()_ until the 
leader turns its state to {_}state.RUNNING{_}.
 - However, the QuorumPeer of the leader +S2+ will not receive enough number of 
ACK-LDs before timeout, and then throws _InterruptedException_ during 
{_}waitForNewLeaderAck(..){_}.
 - After that, the leader will shutdown and a new round of election is raised, 
which consumes extra time for establishing the quorum and reduces availability 
a lot.

 
h2. Analysis

*Root Cause:*

Similar to  ZOOKEEPER-4646 , the root cause lies in the asynchronous executions 
by multi-threads on the follower side.The implementation adopts the 
multi-threading style for performance optimization. However, it may bring some 
underlying subtle bugs.

When the follower receives NEWLEADER, it calls {{logRequest(..)}} to submit the 
logging requests to SyncRequestProcessor's queue before replying ACK-LD. The 
SyncThread may be scheduled to persist the txns and reply ACK(s) before the 
QuorumPeer thread replies ACK-LD.

On the leader side, the corresponding learnerHandler will not recognize the ACK 
of PROPOSAL during _waitForNewLeaderAck(..)_ , and then will be blocked at 
_waitForStartup()_ until the leader turns its state to {_}state.RUNNING{_}. 
However, the leader may not receive enough ACK-LDs before timeout, and then 
throws _InterruptedException_ during {_}waitForNewLeaderAck(..){_}. After that, 
the leader will shutdown and the servers go into a new election and recovery 
process. 

This recovery procedure consumes extra recovery time and increase system 
unavailability time.

To some degree it is unnecessary. It can be fixed and optimized by guaranteeing 
the follower side's message order.

 

*Affected Versions:*

The above trace has been generated in multiple versions such as 3.7.1 & 3.8.1 
(the latest stable & current version till now) by our testing tools. The 
affected versions might be more, since the critical update order between the 
follower's replying ACK-LD and updating its history during SYNC stay 
non-deterministic in multiple versions.

 
h2. Possible Fix

Considering this issue and ZOOKEEPER-4646 , one possible fix is to guarantee 
the following partial orders to be satisfied:
 * The follower replies ACK of PROPOSAL only after it replies ACK-LD (i.e. ACK 
of NEWLEADER) to the leader (so as to avoid this issue).
 * The follower replies ACK-LD only after it has persisted the txns that might 
be applied to the leader's datatree before the leader gets into the BROADCAST 
phase (to avoid ZOOKEEPER-4646).


> Unnecessary system unavailability due to Leader shutdown when follower sent 
> ACK of PROPOSAL before sending ACK of NEWLEADER in log recovery
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4685
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4685
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.6.3, 3.7.0, 3.8.0, 3.7.1, 3.8.1
>            Reporter: Sirius
>            Priority: Major
>
> When a follower is processing the NEWLEADER message in SYNC phase, its 
> QuorumPeer thread will call {{logRequest(..)}} to submit the txn persistence 
> task to the SyncThread. The SyncThread may persist txns and reply ACKs of 
> them before replying ACK-LD (i.e. ACK of NEWLEADER) to the leader. This may 
> cause the consequence that the leader cannot collect enough number of ACK-LDs 
> successfully, followed by the leader's shutdown and a new round of election. 
> This introduces unnecessary recovery procedures, consumes extra time before 
> servers get into the BROADCAST phase and reduces the service's availability a 
> lot. 
> The following trace can be generated in the latest version nowadays.
>  
> h2. Trace
> Start the ensemble with three nodes: S{+}0{+}, +S1+ & {+}S2{+}.
>  - +S2+ is elected leader.
>  - +S2+ logs a new txn <1, 1> and makes a broadcast.
>  - +S0+ restarts & +S1+ crashes before receiving the proposal of <1, 1>.
>  - +S2+ is elected leader again. 
>  - +S2+ syncs with +S0+ using DIFF, and sends the proposal of <1, 1> during 
> SYNC.
>  - After +S0+ receives NEWLEADER,  {+}S0{+}'s SyncThread may persist the txn 
> <1, 1> and reply corresponding ACK to the leader +S2+ before {+}S0{+}'s 
> QuorumPeer thread replies ACK-LD to the leader +S2+ .(This is possible 
> because txn logging is processed asynchronously by SyncThread! )
>  - The corresponding LearnerHandler on +S2+ cannot recognize the ACK of some 
> proposal before ACK-LD, and will be blocked at _waitForStartup()_ until the 
> leader turns its state to {_}state.RUNNING{_}.
>  - However, the QuorumPeer of the leader +S2+ will not receive enough number 
> of ACK-LDs before timeout, and then throws _InterruptedException_ during 
> {_}waitForNewLeaderAck(..){_}.
>  - After that, the leader will shutdown and a new round of election is 
> raised, which consumes extra time for establishing the quorum and reduces 
> availability a lot.
>  
> h2. Analysis
> *Root Cause:*
> Similar to  ZOOKEEPER-4646 , the root cause lies in the asynchronous 
> executions by multi-threads on the follower side.The implementation adopts 
> the multi-threading style for performance optimization. However, it may bring 
> some underlying subtle bugs.
> When the follower receives NEWLEADER, it calls {{logRequest(..)}} to submit 
> the logging requests to SyncRequestProcessor's queue before replying ACK-LD. 
> The SyncThread may be scheduled to persist the txns and reply ACK(s) before 
> the QuorumPeer thread replies ACK-LD.
> On the leader side, the corresponding learnerHandler will not recognize the 
> ACK of PROPOSAL during _waitForNewLeaderAck(..)_ , and then will be blocked 
> at _waitForStartup()_ until the leader turns its state to 
> {_}state.RUNNING{_}. However, the leader may not receive enough ACK-LDs 
> before timeout, and then throws _InterruptedException_ during 
> {_}waitForNewLeaderAck(..){_}. After that, the leader will shutdown and the 
> servers go into a new election and recovery process. 
> This recovery procedure consumes extra recovery time and increase system 
> unavailability time.
> To some degree it is unnecessary. It can be fixed and optimized by 
> guaranteeing the follower side's message order.
>  
> *Affected Versions:*
> The above trace has been generated in multiple versions such as 3.7.1 & 3.8.1 
> (the latest stable & current version till now) by our testing tools. The 
> affected versions might be more, since the critical update order between the 
> follower's replying ACK-LD and replying ACK of PROPOSAL during SYNC stay 
> non-deterministic in multiple versions.
>  
> h2. Possible Fix
> Considering this issue and ZOOKEEPER-4646 , one possible fix is to guarantee 
> the following partial orders to be satisfied:
>  * The follower replies ACK of PROPOSAL only after it replies ACK-LD (i.e. 
> ACK of NEWLEADER) to the leader (so as to avoid this issue).
>  * The follower replies ACK-LD only after it has persisted the txns that 
> might be applied to the leader's datatree before the leader gets into the 
> BROADCAST phase (to avoid ZOOKEEPER-4646).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to