[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sirius updated ZOOKEEPER-4646:
------------------------------
    Description: 
When a follower is processing the NEWLEADER message in the SYNC phase, it calls 
logRequest(..) to submit the txn persistence task to the SyncRequestProcessor 
thread. The latter does not promise to finish the task before the follower 
replies ACK-LD (i.e., the ACK of NEWLEADER) to the leader, which may lead to 
committed data loss.
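For illustration, here is a minimal, self-contained Java sketch of the problematic ordering (this is not the actual ZooKeeper code; the class and variable names are hypothetical). A background persistence thread stands in for SyncRequestProcessor: the txn is only queued, and the ACK is sent without waiting for it to reach disk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: the follower hands the txn to a background
// persistence thread and acknowledges NEWLEADER right away, so ACK-LD
// can reach the leader before the txn reaches disk.
public class AckBeforeFsync {

    public static List<String> run() throws InterruptedException {
        final List<String> events = Collections.synchronizedList(new ArrayList<>());
        ExecutorService syncProcessor = Executors.newSingleThreadExecutor();
        CountDownLatch slowDisk = new CountDownLatch(1);
        CountDownLatch persisted = new CountDownLatch(1);

        // logRequest(..): the persistence task is only *queued* here.
        syncProcessor.submit(() -> {
            try {
                slowDisk.await();          // simulate a slow fsync
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            events.add("txn persisted");
            persisted.countDown();
        });

        // The follower does NOT wait for the queued task before replying.
        events.add("ACK-LD sent");

        slowDisk.countDown();              // the disk finally catches up
        persisted.await();
        syncProcessor.shutdown();
        return events;                     // ACK-LD precedes persistence
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run());
    }
}
```

A crash in the window between the two events leaves an acknowledged-but-unpersisted txn, which is exactly the trace below.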

Actually, this problem was first raised in ZOOKEEPER-3911. However, the 
fix of ZOOKEEPER-3911 does not solve the problem at the root, as the following 
trace shows.
h2. Trace

The trace is basically the same as the one in ZOOKEEPER-3911. (Here we use 
the zxid to represent a txn.)
 - Start the ensemble with three nodes: {+}S0{+}, +S1+ & {+}S2{+}.
 - +S2+ is elected leader.
 - All of them have {{_lastLoggedZxid_}} = <1, 3>, {{_lastProcessedZxid_}} = 
<1, 3>.
 - +S2+ logs a new txn <1, 4> and makes a broadcast.
 - Shutdown +S0+ & +S1+ before they receive the proposal of <1, 4>.
 - Restart +S0+ & {+}S1{+}.
 - +S2+ uses DIFF to sync with +S0+ & {+}S1{+}.
 - +S0+ & +S1+ send ACK-LD to +S2+ before their SyncRequestProcessor threads 
log txns to disk.
 - Verify that clients of +S2+ have the view of <1, 4>.
 - Shutdown {+}S2{+}, and make sure to shut down the followers +S0+ and +S1+ 
*before* their SyncRequestProcessor threads persist txns to disk. (This is 
extremely timing-sensitive but possible!)
 - Restart +S0+ and {+}S1{+}.
 - Verify that clients of +S0+ and +S1+ do not have the view of <1, 4>, a violation 
of ZAB.
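The steps above can be replayed as a small deterministic simulation (a hypothetical sketch, not ZooKeeper code: zxids within epoch 1 are modelled as plain integers, and each server has a "disk" log and an in-memory view):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical replay of the trace: "disk" holds persisted txns,
// "memory" the in-memory view clients can read.
public class DiffSyncLossTrace {
    public static Map<String, List<Integer>> run() {
        List<Integer> base = Arrays.asList(1, 2, 3);   // up to <1,3>
        Map<String, List<Integer>> disk = new HashMap<>();
        Map<String, List<Integer>> memory = new HashMap<>();
        for (String s : Arrays.asList("S0", "S1", "S2")) {
            disk.put(s, new ArrayList<>(base));
            memory.put(s, new ArrayList<>(base));
        }

        // Leader S2 logs and commits <1,4>; S0 & S1 miss the proposal.
        disk.get("S2").add(4);
        memory.get("S2").add(4);

        // S0 & S1 restart; S2 DIFF-syncs <1,4>; the followers send ACK-LD
        // before their SyncRequestProcessor threads write <1,4> to disk.
        memory.get("S0").add(4);
        memory.get("S1").add(4);
        // (disk of S0/S1 unchanged: the persistence task is still queued)

        // All nodes shut down before the queued writes reach disk.
        // On restart, S0 & S1 rebuild their state from disk only:
        memory.put("S0", new ArrayList<>(disk.get("S0")));
        memory.put("S1", new ArrayList<>(disk.get("S1")));
        return memory;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> view = run();
        // The committed txn <1,4> is gone from the quorum {S0, S1}.
        System.out.println(view.get("S0"));
        System.out.println(view.get("S1"));
    }
}
```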

 
h2. Analysis

*Property Violation:*

From the server side, the committed log of the ensemble does not grow 
monotonically. From the client side, a client may read stale data after 
obtaining a newer version, and that newer version can no longer be obtained.

ZOOKEEPER-4643 shows similar symptoms, but its fix only mitigates the 
occurrence of the problem without solving it at the root.

  was:
When a follower is processing the NEWLEADER message in SYNC phase, it will call 
logRequest(..) to submit the txn persistence task to the SyncRequestProcessor 
thread. The latter does not promise to finish the task before the follower 
replies ACK-LD to the leader, which may trigger committed data loss.

 

When a follower is processing the NEWLEADER message in the SYNC phase, it will 
update its {{_currentEpoch_}} file *before* writing the txns (from the 
PROPOSALs sent by the leader in SYNC) to the log file. Such an order may lead to 
improper truncation of *committed* txns in later rounds.

The critical step to trigger this problem is to make a follower node crash 
right after it updates its {{_currentEpoch_}} file but before writing 
the txns to the log file. The risk is that this node, with incomplete 
committed txns, might be elected leader due to its larger {{_currentEpoch_}} 
and then improperly use TRUNC to ask other nodes to delete their committed txns!
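The crash window can be sketched as follows (a hypothetical, simplified model, not ZooKeeper code; the fields stand in for the on-disk currentEpoch file and txn log):

```java
// Hypothetical sketch of the follower's NEWLEADER path: the currentEpoch
// file is written before the synced txns reach the txn log, so a crash in
// between leaves a high epoch next to an incomplete log.
public class EpochBeforeTxnLog {
    int currentEpochFile = 1;   // epoch persisted on disk
    int lastLoggedTxn = 3;      // txn log persisted up to <1,3>

    // Returns "<epoch on disk> <last txn on disk>" after the crash scenario.
    public static String crashBetweenWrites() {
        EpochBeforeTxnLog follower = new EpochBeforeTxnLog();
        follower.currentEpochFile = 2;  // step 1: epoch file written first
        // -- crash here: step 2 (logging the DIFF'd txn <1,4>) never runs --
        return follower.currentEpochFile + " " + follower.lastLoggedTxn;
    }

    public static void main(String[] args) {
        // The epoch claims round 2, but the log still ends at <1,3>.
        System.out.println(crashBetweenWrites());
    }
}
```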

 
h2. Trace


Here is an example to trigger the bug. (Focus on {{_currentEpoch_}} and 
{{_lastLoggedZxid_}}.)

*Round 1 (running nodes with their acceptedEpoch & currentEpoch set to 1):*
 - Start the ensemble with three nodes: {+}S0{+}, +S1+ & {+}S2{+}.
 - +S2+ is elected leader.
 - For all of them, {{_acceptedEpoch_}} = 1, {{_currentEpoch_}} = 1.
 - Besides, all of them have {{_lastLoggedZxid_}} = <1, 3>, 
{{_lastProcessedZxid_}} = <1, 3>.
 - +S0+ crashes.
 - A new txn <1, 4> is logged and committed by +S1+ & {+}S2{+}. Then, +S1+ & 
+S2+ have {{_lastLoggedZxid_}} = <1, 4>, {{_lastProcessedZxid_}} = <1, 4> 
(clients can read the datatree with the latest zxid <1, 4>).

*Round 2 (running nodes with their acceptedEpoch & currentEpoch set to 2):*
 * +S0+ restarts, +S2+ restarts, and +S1+ crashes.
 * Again, +S2+ is elected leader.
 * During the DISCOVERY phase, +S0+ & +S2+ update their {{_acceptedEpoch_}} to 
2.
 * Then, during the SYNC phase, the leader +S2+ ({{_maxCommittedLog_}} = 
<1, 4>) uses DIFF to sync with the follower +S0+ ({{_lastLoggedZxid_}} = 
<1, 3>), and their {{_currentEpoch_}} will be set to 2 (and written to disk).
 * Note that the follower +S0+ updates its currentEpoch file before writing the 
txns to the log file when receiving NEWLEADER message.
 * *Unfortunately, right after the follower +S0+ finishes updating its 
currentEpoch file, it crashes.*

*Round 3 (running nodes with their acceptedEpoch & currentEpoch set to 3):*
 * +S0+ & +S1+ restart, and +S2+ crashes.
 * Since +S0+ has {{_currentEpoch_}} = 2, +S1+ has {{_currentEpoch_}} = 1, +S0+ 
will be elected leader.
 * During the SYNC phase, the leader +S0+ ({{_maxCommittedLog_}} = <1, 3>) 
will use TRUNC to sync with +S1+ ({{_lastLoggedZxid_}} = <1, 4>). Then, 
+S1+ removes txn <1, 4>.
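The three rounds can likewise be replayed as a small deterministic simulation (a hypothetical sketch, not ZooKeeper code; logs are lists of integer zxid counters within epoch 1):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical replay of Rounds 1-3: each server has a currentEpoch and a
// persisted txn log; TRUNC drops follower txns past the leader's log tail.
public class TruncLossTrace {
    public static Map<String, List<Integer>> run() {
        Map<String, Integer> epoch = new HashMap<>();
        Map<String, List<Integer>> log = new HashMap<>();
        for (String s : Arrays.asList("S0", "S1", "S2")) {
            epoch.put(s, 1);
            log.put(s, new ArrayList<>(Arrays.asList(1, 2, 3)));
        }
        // Round 1: S0 is down; S1 & S2 log and commit <1,4>.
        log.get("S1").add(4);
        log.get("S2").add(4);

        // Round 2: S1 is down; S0 & S2 move to epoch 2. S0 crashes right
        // after writing currentEpoch = 2 but before logging the DIFF'd <1,4>.
        epoch.put("S0", 2);
        epoch.put("S2", 2);

        // Round 3: S2 is down; S0 (epoch 2) beats S1 (epoch 1) and leads.
        // S0's log ends at <1,3>, so it TRUNCs S1 back to <1,3>.
        List<Integer> s1 = log.get("S1");
        while (s1.get(s1.size() - 1) > 3) {
            s1.remove(s1.size() - 1);      // committed <1,4> is deleted
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run().get("S1"));
    }
}
```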


> Committed txns may be lost if followers reply ACK-LD before writing txns to 
> disk
> --------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4646
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4646
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.8.0
>            Reporter: Sirius
>            Priority: Critical
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
