[
https://issues.apache.org/jira/browse/ZOOKEEPER-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sirius updated ZOOKEEPER-4643:
------------------------------
Flags: Important
> Committed txns may be improperly truncated if node crashes right after
> updating currentEpoch
> ---------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-4643
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4643
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum, server
> Affects Versions: 3.6.3, 3.7.0, 3.8.0, 3.7.1
> Reporter: Sirius
> Priority: Critical
>
> When a follower is processing the NEWLEADER message in SYNC phase, it will
> update its {{_currentEpoch_}} to the file *before* writing the txns (from the
> PROPOSALs sent by leader in SYNC) to the log file. Such order may lead to
> improper truncation of *committed* txns in later rounds.
> The critical step to trigger this problem is to make a follower node crash
> right after it updates its {{_currentEpoch_}} to the file but before writing
> the txns to the log file. The potential risk is that, this node with
> incomplete committed txns might be elected as a leader with its larger
> {{_currentEpoch_}} and then improperly uses TRUNC to ask nodes to delete
> their committed txns!
>
> h2. Trace
> Here is an example to trigger the bug. (Focus on {{_currentEpoch_}} and
> {{{}_lastLoggedZxid_{}}})
> {*}Round 1 (Running nodes with their acceptedEpoch & currentEpoch set to
> 1{*}{*}):{*}
> - Start the ensemble with three nodes: S{+}0{+}, +S1+ & {+}S2{+}.
> - +S2+ is elected leader.
> - For all of them, {{_acceptedEpoch_}} = 1, _{{currentEpoch}}_ = 1.
> - Besides, all of them have {{_lastLoggedZxid_}} = <1, 3>,
> {{_lastProcessedZxid_}} = <1, 3>.
> - +S0+ crashes.
> - A new txn <1, 4> is logged and committed by +S1+ & {+}S2{+}. Then, +S1+ &
> +S2+ have {{_lastLoggedZxid_}} = <1, 4>, {{_lastProcessedZxid_}} = <1, 4> (
> Clients can read the datatree with latest zxid <1, 4>).
> *Round 2* {*}(Running nodes with their acceptedEpoch & currentEpoch set to
> 2{*}{*}){*}{*}:{*}
> * +S0+ restarts, +S2+ restarts, and +S1+ crashes.
> * Again, +S2+ is elected leader.
> * During the DISCOVERY phase, +S0+ & +S2+ update their {{_acceptedEpoch_}}
> to 2.
> * Then, during the SYNC phase, the leader +S2+ ({{{}_maxCommittedLog_{}}} =
> <1, 4>) uses DIFF to sync with the follower +S0+ ({{{}_lastLoggedZxid_{}}} =
> <1, 3>), and their {{_currentEpoch_}} will be set to 2 (and written to disk).
> * Note that the follower +S0+ updates its currentEpoch file before writing
> the txns to the log file when receiving NEWLEADER message.
> * *Unfortunately, right after the follower +S0+ finishes updating its
> currentEpoch file, it crashes.*
> *Round 3* {*}(Running nodes with their acceptedEpoch & currentEpoch set to
> 3{*}{*}){*}{*}:{*}
> * +S0+ & +S1+ restarts, and +S2+ crashes.
> * Since +S0+ has {{_currentEpoch_}} = 2, +S1+ has {{_currentEpoch_}} = 1,
> +S0+ will be elected leader.
> * During the SYNC phase, the leader +S0+ ({{{}_maxCommittedLog_{}}} = <1,
> 3>) will use TRUNC to sync with +S1+ ({{{}_lastLoggedZxid_{}}} = <1, 4>).
> Then, +S1+ removes txn <1, 4>.
>
> However, <1, 4> was committed and visible by clients before, and is not
> supposed to be truncated!
>
> Note: The trace can be constructed with quorum nodes alive at any moment with
> careful time tuning of node crash & restart, e.g., let +S1+ restart before
> +S0+ crashes at the end of Round 2.
> h2.
> h2. Analysis
> *Property Violation:*
> From the server side, the ensemble deletes a committed txn, which is not
> allowed. From the client side, a client may read stale data after a newer
> version is obtained, and that newer version cannot be obtained anymore.
> ZOOKEEPER-3911 shows similar symptoms, but its fix only mitigates the
> occurrance of the problem without solving it at the root.
> *Gap between Theory and Reality:*
> The critical step here is to interrupt two updates that should be done
> together. However, there lie some gaps between the theory and the reality. In
> the Zab paper, a follower updates its current epoch and history during the
> SYNC phase in an *atomic* action, and the correctness of the Zab protocol is
> based on that. Unfortunately, in actual environment, a node crash (or other
> environment failures) may occur at any time, breaking the procedures that
> will not be interrupted at the protocol level. By digging into the code, the
> *synchronized* section in the method {{syncWithLeader(long)}} in
> {{Learner.java}} can only prevent the procedure from interrupted by other
> threads, but environment failures can still interrupt the process. We think
> it important to keep ZooKeeper still in a correct condition even under such
> type of environment circumstances.
> *Affected Versions:*
> The above trace has been triggered in multiple versions such as 3.7.1 & 3.8.0
> (the latest stable & current version till now) by our testing tools. The
> affected versions might be more, because the critical update order between
> the follower's current epoch and its history during SYNC stay unchanged
> across multiple versions.
>
> I will provide more materials to demonstrate further details in two days.
>
>
> h2. Possible Fix
> In theory, this issue can be avoided by exchanging the order of writing
> {{_currentEpoch_}} to the currentEpoch file and writing the
> should-be-committed txns to the log file when the follower is processing
> NEWLEADER message. Writing txns to the log file before updating
> {{_currentEpoch_}} to the file is {_}a more c{_}onservative strategy, but it
> prevents the problem that a node with the incomplete committed txns can win
> an election with its higher _{{{}currentEpoch{}}}._
> When it comes to the implementation of multi-threading, things become a
> little more complicated. In the procedure of processing NEWLEADER message in
> the method {{syncWithLeader(long)}} in {{{}Learner.java{}}}, simply
> exchanging the order of updating {{_currentEpoch_}} and writing txns to the
> log file cannot fix the problem, because these two actions are conducted by
> different threads. When the QuorumPeer thread calls {{logRequest(..)}} to
> submit log persistence task to the SyncRequestProcessor thread, the latter
> thread will write txns to disk asynchronously, without promising to finish
> that before the node's {{_currentEpoch_}} updated.
> Above all, the fix at the code level should guarantee the partial order that
> {{_currentEpoch_}} is updated only after the learner has the same initial
> history with leader.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)