[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sirius updated ZOOKEEPER-4643:
------------------------------
    Attachment:     (was: ZOOKEEPER-4643.pdf)

> Committed txns may be improperly truncated if node crashes right after 
> updating currentEpoch 
> ---------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4643
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4643
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.6.3, 3.7.0, 3.8.0, 3.7.1
>            Reporter: Sirius
>            Priority: Critical
>         Attachments: Trace-ZK-4643.pdf
>
>
> When a follower is processing the NEWLEADER message in SYNC phase, it will 
> update its {{_currentEpoch_}} to the file *before* writing the txns (from the 
> PROPOSALs sent by leader in SYNC) to the log file. Such order may lead to 
> improper truncation of *committed* txns in later rounds.
> The critical step to trigger this problem is to make a follower node crash 
> right after it updates its {{_currentEpoch_}} to the file but before writing 
> the txns to the log file. The potential risk is that, this node with 
> incomplete committed txns might be elected as a leader with its larger 
> {{_currentEpoch_}} and then improperly uses TRUNC to ask nodes to delete 
> their committed txns!
>  
> h2. Trace
> [^ZOOKEEPER-4643.pdf]
> Here is an example to trigger the bug. (Focus on {{_currentEpoch_}} and 
> {{{}_lastLoggedZxid_{}}})
> {*}Round 1 (Running nodes with their acceptedEpoch & currentEpoch set to 
> 1{*}{*}):{*}
>  - Start the ensemble with three nodes: S{+}0{+}, +S1+ & {+}S2{+}.
>  - +S2+ is elected leader.
>  - For all of them, _{{currentEpoch}}_ = 1, {{_lastLoggedZxid_}} (the last 
> zxid in the log)= <1, 3>, {{_lastProcessedZxid_}} = <1, 3>.
>  - +S0+ crashes.
>  - A new txn <1, 4> is logged and committed by +S1+ & {+}S2{+}. Then, +S1+ & 
> +S2+ have {{_lastLoggedZxid_}} = <1, 4>, {{_lastProcessedZxid_}} = <1, 4> .
>  - Verify clients can read the datatree with latest zxid <1, 4>. 
> *Round 2* {*}(Running nodes with their acceptedEpoch & currentEpoch set to 
> 2{*}{*}){*}{*}:{*}
>  * +S0+ & +S2+ restart, and +S1+ crashes.
>  * Again, +S2+ is elected leader.
>  * Then, during the SYNC phase, the leader +S2+ ({{{}_maxCommittedLog_{}}} = 
> <1, 4>) uses DIFF to sync with the follower +S0+ ({{{}_lastLoggedZxid_{}}} = 
> <1, 3>), and their {{_currentEpoch_}} will be set to 2 (and written to disk).
>  * ( Note that the follower +S0+ updates its currentEpoch file before writing 
> the txns to the log file when receiving NEWLEADER message. )
>  * *Unfortunately, right after the follower +S0+ finishes updating its 
> currentEpoch file, it crashes.*
> *Round 3* {*}(Running nodes with their acceptedEpoch & currentEpoch set to 
> 3{*}{*}){*}{*}:{*}
>  * +S0+ & +S1+ restart, and +S2+ crashes.
>  * Since +S0+ has {{_currentEpoch_}} = 2, +S1+ has {{_currentEpoch_}} = 1, 
> +S0+ will be elected leader.
>  * During the SYNC phase, the leader +S0+ ({{{}_maxCommittedLog_{}}} = <1, 
> 3>) will use TRUNC to sync with +S1+ ({{{}_lastLoggedZxid_{}}} = <1, 4>). 
> Then, +S1+ removes txn <1, 4>.  
>  * ( However, <1, 4> was committed and visible by clients before, and is not 
> supposed to be truncated! )
>  * Verify clients of +S0+ & +S1+ do NOT have the view of txn <1, 4>, a 
> violation of ZAB.
>  
> Extra note: The trace can be constructed with quorum nodes alive at any 
> moment with careful time tuning of node crash & restart, e.g., let +S1+ 
> restart before +S0+ crashes at the end of Round 2.
> h2. Analysis
> *Root Cause:*
> When a learner updates its current epoch, it should guarantee that it has 
> already persisted the initial history of the leader (or, taken snapshot). 
> Otherwise, after the current epoch is updated to the file while the history 
> of the learner is not updated yet, environment failures might prevent the 
> latter from going on smoothly. It is dangerous for a node with updated 
> current epoch but stale history to be elected leader. It might truncate 
> committed txns on other nodes.
> *Property Violation:*
>  * From the server side, the ensemble deletes a committed txn, which is not 
> allowed; the committed log of the ensemble does not append monotonically; 
> different nodes have inconsistent committed logs.
>  * From the client side, clients connected to different nodes may have 
> inconsistent views. A client may read stale data after a newer version is 
> obtained. That newer version can only be obtained from certain nodes of the 
> ensemble rather than all nodes. What's worse, that newer version may also be 
> removed later.
> ZOOKEEPER-3911 shows similar symptoms, but its fix only mitigates the 
> occurrance of the problem without solving it at the root. We also raise an 
> issue related to that fix in ZOOKEEPER-4646 . 
> *Gap between Theory & Reality:*
> The critical step here is to interrupt two updates that should be done 
> together. However, there lie some gaps between the theory and the reality. In 
> the Zab paper, a learner updates its current epoch and history during the 
> SYNC phase in an *atomic* action, and the correctness of the Zab protocol is 
> based on that. Unfortunately, in actual environment, a node crash (or other 
> environment failures) may occur at any time, breaking the procedures that 
> will not be interrupted at the protocol level. By digging into the code,  the 
> *synchronized* section in the method {{syncWithLeader(long)}} in 
> {{Learner.java}} can only prevent the procedure from interrupted by other 
> threads, but environment failures can still interrupt the process. We think 
> it important to keep ZooKeeper still in a correct condition even under such 
> type of environment circumstances.
> *Affected Versions:*
> The above trace has been generated in multiple versions such as 3.7.1 & 3.8.0 
> (the latest stable & current version till now) by our testing tools. The 
> affected versions might be more, because the critical update order between 
> the learner's current epoch and its history during SYNC stay unchanged across 
> multiple versions.
>  
> h2. Possible Fix
> In theory, this issue can be avoided by exchanging the order of writing 
> {{_currentEpoch_}} to the currentEpoch file and writing the 
> should-be-committed txns to the log file when the learner is processing 
> NEWLEADER message. Writing txns to the log file before updating 
> {{_currentEpoch_}} to the file is a more {_}c{_}onservative strategy, but it 
> prevents the problem that a node with the incomplete committed txns can win 
> an election with its higher _{{{}currentEpoch{}}}._ 
> When it comes to the implementation of multi-threading, things become a 
> little more complicated. In the procedure of processing NEWLEADER message in 
> the method {{syncWithLeader(long)}} in {{{}Learner.java{}}}, simply 
> exchanging the order of updating {{_currentEpoch_}} and writing txns to the 
> log file cannot fix the problem, because these two actions are conducted 
> asynchronously by different threads. When the QuorumPeer thread calls 
> {{logRequest(..)}} to submit log persistence task to the SyncRequestProcessor 
> thread, the latter thread will write txns to disk asynchronously, without 
> promising to finish that before the node's {{_currentEpoch_}} updated. 
> Considering this issue and ZOOKEEPER-4646, one possible fix is to guarantee 
> the following partial orders to be satisfied:
>  * A learner updates its {{_currentEpoch_}} only after it has persisted the 
> txns that might be applied to the leader's datatree before the leader gets 
> into the BROADCAST phase (so as to avoid this issue). 
>  * A learner replies ACK-LD only after it has persisted the txns that might 
> be applied to the leader's datatree before the leader gets into the BROADCAST 
> phase  (so as to avoid the issue of ZOOKEEPER-4646 ).
> Intuitively, the {{CompletableFuture}} class introduced in Java 8 may be a 
> suitable choice to fix the issue. With the method of {{complete(..)}} and 
> {{whenComplete(..)}} provided by {{{}CompletableFuture{}}}, the 
> SyncRequestProcessor can explicitly set the txn logging process completed and 
> trigger the process of updating {{_currenEpoch_}} and replying ACK-LD.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to