[jira] [Updated] (ZOOKEEPER-4646) Committed txns may be lost if followers reply ACK-LD before writing txns to disk

Sirius (Jira) Thu, 08 Dec 2022 09:25:17 -0800


     [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sirius updated ZOOKEEPER-4646:
------------------------------
    Description: 
When a follower processes the NEWLEADER message in the SYNC phase, it calls 
logRequest(..) to submit the txn persistence task to the SyncRequestProcessor 
thread. The latter does not guarantee that the task finishes before the 
follower replies ACK-LD (i.e. ACK of NEWLEADER) to the leader, which may lead 
to the loss of committed data.
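
The ordering hazard described above can be modelled with a toy example. This is 
a minimal sketch, not ZooKeeper code: the class, the in-memory "disk" list, and 
the latch that stands in for a slow write are all hypothetical names chosen for 
illustration. It only shows that handing work to another thread and then 
replying immediately leaves a window in which the ACK is out but the txn is not 
yet durable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Toy model of the hazard; all names here are hypothetical, not ZooKeeper APIs. */
public class AckBeforePersistSketch {
    /** Stands in for the follower's on-disk txn log. */
    static final List<String> disk = new CopyOnWriteArrayList<>();

    /** Returns what is on "disk" at the moment the follower would send ACK-LD. */
    static List<String> diskContentsWhenAckSent() throws InterruptedException {
        disk.clear();
        ExecutorService syncThread = Executors.newSingleThreadExecutor();
        CountDownLatch writeMayFinish = new CountDownLatch(1);

        // Analogue of logRequest(..): enqueue the txn and return immediately.
        syncThread.submit(() -> {
            try {
                writeMayFinish.await();   // models a write that has not completed yet
                disk.add("<1, 4>");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // The follower replies ACK-LD here without waiting for the write above.
        List<String> snapshotAtAck = new ArrayList<>(disk);

        // A crash at this point would lose <1, 4>. In this sketch the write is
        // allowed to finish so that the executor can shut down cleanly.
        writeMayFinish.countDown();
        syncThread.shutdown();
        syncThread.awaitTermination(5, TimeUnit.SECONDS);
        return snapshotAtAck;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("on disk at ACK-LD: " + diskContentsWhenAckSent());
        System.out.println("on disk later:     " + disk);
    }
}
```

In this model the snapshot taken at ACK time is empty, while the txn only 
reaches the list afterwards. That gap is the window the trace below exploits.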

This problem was first raised in ZOOKEEPER-3911. However, the fix for 
ZOOKEEPER-3911 does not solve the problem at its root. The following trace is 
an example. 
h2. Trace

The trace is basically the same as the one in ZOOKEEPER-3911. (Here we use 
the zxid to represent a txn.)

Start the ensemble with three nodes: +S0+, +S1+ & +S2+.
 - +S2+ is elected leader.
 - All of them have {{_lastLoggedZxid_}} = <1, 3>, {{_lastProcessedZxid_}} = 
<1, 3>.
 - +S2+ logs a new txn <1, 4> and broadcasts it.
 - Shut down +S0+ & +S1+ before they receive the proposal of <1, 4>.
 - Restart +S0+ & +S1+.
 - +S2+ uses DIFF to sync with +S0+ & +S1+.
 - +S0+ & +S1+ send ACK-LD to +S2+ before their SyncRequestProcessor threads 
log txns to disk.
 - Verify that clients of +S2+ have the view of <1, 4>.
 - Shut down +S2+, and make sure to shut down the followers +S0+ and +S1+ 
*before* their SyncRequestProcessor threads persist txns to disk. (This is 
extremely timing-sensitive but possible!)
 - Restart +S0+ and +S1+.
 - Verify that clients of +S0+ and +S1+ do not have the view of <1, 4>, a 
violation of ZAB.

 
h2. Analysis

*Property Violation:*

From the server side, the committed log of the ensemble does not grow 
monotonically. From the client side, a client may read stale data after having 
obtained a newer version, and that newer version can no longer be obtained.
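
The client-side symptom can be expressed as a simple check over the zxids a 
client observes. The sketch below is illustrative only: the class and method 
names are hypothetical, and it assumes the client records the zxid of the 
latest txn reflected in each read. It relies on the zxid layout of epoch in the 
high 32 bits and counter in the low 32 bits, so that <1, 4> compares greater 
than <1, 3>.

```java
/** Hypothetical helper for spotting a non-monotonic client view; not a ZooKeeper API. */
public class ZxidMonotonicityCheck {
    /** Packs <epoch, counter> into one long, with the epoch in the high 32 bits. */
    static long zxid(int epoch, int counter) {
        return ((long) epoch << 32) | (counter & 0xffffffffL);
    }

    /**
     * Returns the index of the first observation that is lower than an
     * earlier one, or -1 if the observed sequence never goes backwards.
     */
    static int firstRegression(long[] observed) {
        long highest = Long.MIN_VALUE;
        for (int i = 0; i < observed.length; i++) {
            if (observed[i] < highest) {
                return i;
            }
            highest = Math.max(highest, observed[i]);
        }
        return -1;
    }

    public static void main(String[] args) {
        // The client sees <1, 3>, then <1, 4>, then <1, 3> again after the restart.
        long[] seenByClient = { zxid(1, 3), zxid(1, 4), zxid(1, 3) };
        System.out.println(firstRegression(seenByClient)); // prints 2
    }
}
```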

Although ZOOKEEPER-4643 shares similar symptoms and property violations, we 
think it is a distinct problem, as it has a different root cause and risk 
pattern than this one. More specifically,
 * ZOOKEEPER-4643: the risk lies in updating currentEpoch before logging txns 
to disk. The bug can be triggered by interrupting the logging of txns after 
currentEpoch is updated. 
 * ZOOKEEPER-4646: the risk lies in replying ACK-LD before logging txns to 
disk. The bug can be triggered by interrupting the logging of txns after 
ACK-LD is sent. 

*Gap between Protocol and Implementation:*

The implementation is multi-threaded for performance. However, this can 
introduce subtle bugs that do not occur at the protocol level. The fix for 
ZOOKEEPER-3911 simply adds the QuorumPeer's call to logRequest(..) inside the 
NEWLEADER processing logic, without further considering the risk of 
asynchronous execution by other threads. 
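
One way to close such a gap, shown here purely as an assumption and not as the 
fix adopted by ZooKeeper, is to make the thread that replies wait until the 
persistence task has completed. In the toy sketch below the class, the 
in-memory "disk" list, and the single-thread executor are hypothetical stand-ins 
for the follower, its txn log, and the SyncRequestProcessor thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Toy model of "persist, then ACK"; all names are hypothetical, not ZooKeeper APIs. */
public class PersistBeforeAckSketch {
    /** Stands in for the follower's on-disk txn log. */
    static final List<String> disk = new CopyOnWriteArrayList<>();

    /** Returns what is on "disk" at the moment the follower would send ACK-LD. */
    static List<String> diskContentsWhenAckSent() throws Exception {
        disk.clear();
        ExecutorService syncThread = Executors.newSingleThreadExecutor();
        try {
            // Hand the txn to the sync thread, but keep a handle on its completion.
            Future<?> persisted = syncThread.submit(() -> {
                disk.add("<1, 4>");
            });

            // Block until the write is done; only then is it safe to reply.
            persisted.get(5, TimeUnit.SECONDS);

            // The ACK-LD would be sent here, after the txn is durable.
            return new ArrayList<>(disk);
        } finally {
            syncThread.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("on disk at ACK-LD: " + diskContentsWhenAckSent());
    }
}
```

The trade-off is that the replying thread now blocks on the disk write, which 
is the latency the asynchronous design was meant to hide.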

*Affected Versions:*

Our testing tools have triggered the above trace in multiple versions, such as 
3.7.1 and 3.8.0 (the latest stable and current versions at the time of 
writing). More versions may be affected, since the critical order between the 
follower replying ACK-LD and updating its history during SYNC remains 
non-deterministic across versions.

 



> Committed txns may be lost if followers reply ACK-LD before writing txns to 
> disk
> --------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4646
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4646
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.8.0
>            Reporter: Sirius
>            Priority: Critical



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
