[ https://issues.apache.org/jira/browse/ZOOKEEPER-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sirius updated ZOOKEEPER-4643: ------------------------------ Attachment: ZOOKEEPER-4643.pdf > Committed txns may be improperly truncated if node crashes right after > updating currentEpoch > --------------------------------------------------------------------------------------------- > > Key: ZOOKEEPER-4643 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4643 > Project: ZooKeeper > Issue Type: Bug > Components: quorum, server > Affects Versions: 3.6.3, 3.7.0, 3.8.0, 3.7.1 > Reporter: Sirius > Priority: Critical > Attachments: ZOOKEEPER-4643.pdf > > > When a follower is processing the NEWLEADER message in SYNC phase, it will > update its {{_currentEpoch_}} to the file *before* writing the txns (from the > PROPOSALs sent by leader in SYNC) to the log file. Such order may lead to > improper truncation of *committed* txns in later rounds. > The critical step to trigger this problem is to make a follower node crash > right after it updates its {{_currentEpoch_}} to the file but before writing > the txns to the log file. The potential risk is that, this node with > incomplete committed txns might be elected as a leader with its larger > {{_currentEpoch_}} and then improperly uses TRUNC to ask nodes to delete > their committed txns! > > h2. Trace > > Here is an example to trigger the bug. (Focus on {{_currentEpoch_}} and > {{{}_lastLoggedZxid_{}}}) > {*}Round 1 (Running nodes with their acceptedEpoch & currentEpoch set to > 1{*}{*}):{*} > - Start the ensemble with three nodes: S{+}0{+}, +S1+ & {+}S2{+}. > - +S2+ is elected leader. > - For all of them, _{{currentEpoch}}_ = 1, {{_lastLoggedZxid_}} (the last > zxid in the log)= <1, 3>, {{_lastProcessedZxid_}} = <1, 3>. > - +S0+ crashes. > - A new txn <1, 4> is logged and committed by +S1+ & {+}S2{+}. Then, +S1+ & > +S2+ have {{_lastLoggedZxid_}} = <1, 4>, {{_lastProcessedZxid_}} = <1, 4> . > - Verify clients can read the datatree with latest zxid <1, 4>. > *Round 2* {*}(Running nodes with their acceptedEpoch & currentEpoch set to > 2{*}{*}){*}{*}:{*} > * +S0+ & +S2+ restart, and +S1+ crashes. > * Again, +S2+ is elected leader. > * Then, during the SYNC phase, the leader +S2+ ({{{}_maxCommittedLog_{}}} = > <1, 4>) uses DIFF to sync with the follower +S0+ ({{{}_lastLoggedZxid_{}}} = > <1, 3>), and their {{_currentEpoch_}} will be set to 2 (and written to disk). > * ( Note that the follower +S0+ updates its currentEpoch file before writing > the txns to the log file when receiving NEWLEADER message. ) > * *Unfortunately, right after the follower +S0+ finishes updating its > currentEpoch file, it crashes.* > *Round 3* {*}(Running nodes with their acceptedEpoch & currentEpoch set to > 3{*}{*}){*}{*}:{*} > * +S0+ & +S1+ restart, and +S2+ crashes. > * Since +S0+ has {{_currentEpoch_}} = 2, +S1+ has {{_currentEpoch_}} = 1, > +S0+ will be elected leader. > * During the SYNC phase, the leader +S0+ ({{{}_maxCommittedLog_{}}} = <1, > 3>) will use TRUNC to sync with +S1+ ({{{}_lastLoggedZxid_{}}} = <1, 4>). > Then, +S1+ removes txn <1, 4>. > * ( However, <1, 4> was committed and visible by clients before, and is not > supposed to be truncated! ) > * Verify clients of +S0+ & +S1+ do NOT have the view of txn <1, 4>, a > violation of ZAB. > > Extra note: The trace can be constructed with quorum nodes alive at any > moment with careful time tuning of node crash & restart, e.g., let +S1+ > restart before +S0+ crashes at the end of Round 2. > h2. Analysis > *Root Cause:* > When a learner updates its current epoch, it should guarantee that it has > already persisted the initial history of the leader (or, taken snapshot). > Otherwise, after the current epoch is updated to the file while the history > of the learner is not updated yet, environment failures might prevent the > latter from going on smoothly. It is dangerous for a node with updated > current epoch but stale history to be elected leader. It might truncate > committed txns on other nodes. > *Property Violation:* > * From the server side, the ensemble deletes a committed txn, which is not > allowed; the committed log of the ensemble does not append monotonically; > different nodes have inconsistent committed logs. > * From the client side, clients connected to different nodes may have > inconsistent views. A client may read stale data after a newer version is > obtained. That newer version can only be obtained from certain nodes of the > ensemble rather than all nodes. What's worse, that newer version may also be > removed later. > ZOOKEEPER-3911 shows similar symptoms, but its fix only mitigates the > occurrance of the problem without solving it at the root. We also raise an > issue related to that fix in ZOOKEEPER-4646 . > *Gap between Theory & Reality:* > The critical step here is to interrupt two updates that should be done > together. However, there lie some gaps between the theory and the reality. In > the Zab paper, a learner updates its current epoch and history during the > SYNC phase in an *atomic* action, and the correctness of the Zab protocol is > based on that. Unfortunately, in actual environment, a node crash (or other > environment failures) may occur at any time, breaking the procedures that > will not be interrupted at the protocol level. By digging into the code, the > *synchronized* section in the method {{syncWithLeader(long)}} in > {{Learner.java}} can only prevent the procedure from interrupted by other > threads, but environment failures can still interrupt the process. We think > it important to keep ZooKeeper still in a correct condition even under such > type of environment circumstances. > *Affected Versions:* > The above trace has been generated in multiple versions such as 3.7.1 & 3.8.0 > (the latest stable & current version till now) by our testing tools. The > affected versions might be more, because the critical update order between > the learner's current epoch and its history during SYNC stay unchanged across > multiple versions. > > h2. Possible Fix > In theory, this issue can be avoided by exchanging the order of writing > {{_currentEpoch_}} to the currentEpoch file and writing the > should-be-committed txns to the log file when the learner is processing > NEWLEADER message. Writing txns to the log file before updating > {{_currentEpoch_}} to the file is a more {_}c{_}onservative strategy, but it > prevents the problem that a node with the incomplete committed txns can win > an election with its higher _{{{}currentEpoch{}}}._ > When it comes to the implementation of multi-threading, things become a > little more complicated. In the procedure of processing NEWLEADER message in > the method {{syncWithLeader(long)}} in {{{}Learner.java{}}}, simply > exchanging the order of updating {{_currentEpoch_}} and writing txns to the > log file cannot fix the problem, because these two actions are conducted > asynchronously by different threads. When the QuorumPeer thread calls > {{logRequest(..)}} to submit log persistence task to the SyncRequestProcessor > thread, the latter thread will write txns to disk asynchronously, without > promising to finish that before the node's {{_currentEpoch_}} updated. > Considering this issue and ZOOKEEPER-4646, one possible fix is to guarantee > the following partial orders to be satisfied: > * A learner updates its {{_currentEpoch_}} only after it has persisted the > txns that might be applied to the leader's datatree before the leader gets > into the BROADCAST phase (so as to avoid this issue). > * A learner replies ACK-LD only after it has persisted the txns that might > be applied to the leader's datatree before the leader gets into the BROADCAST > phase (so as to avoid the issue of ZOOKEEPER-4646 ). > Intuitively, the {{CompletableFuture}} class introduced in Java 8 may be a > suitable choice to fix the issue. With the method of {{complete(..)}} and > {{whenComplete(..)}} provided by {{{}CompletableFuture{}}}, the > SyncRequestProcessor can explicitly set the txn logging process completed and > trigger the process of updating {{_currenEpoch_}} and replying ACK-LD. -- This message was sent by Atlassian Jira (v8.20.10#820010)