[ https://issues.apache.org/jira/browse/ZOOKEEPER-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sirius updated ZOOKEEPER-4646:
------------------------------
    Affects Version/s: 3.8.1

Committed txns may still be lost if followers crash after replying ACK-LD but before writing txns to disk
----------------------------------------------------------------------------------------------------------

                 Key: ZOOKEEPER-4646
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4646
             Project: ZooKeeper
          Issue Type: Bug
          Components: quorum, server
    Affects Versions: 3.6.3, 3.7.0, 3.8.0, 3.7.1, 3.8.1
            Reporter: Sirius
            Priority: Critical
         Attachments: Trace-ZK-4646.pdf

When a follower processes the NEWLEADER message in the SYNC phase, it calls {{logRequest(..)}} to submit the txn persistence task to the SyncRequestProcessor thread. The SyncRequestProcessor thread persists txns asynchronously and does not guarantee that the task is finished before the follower replies ACK-LD (i.e., the ACK of NEWLEADER) to the leader, which may lead to committed data loss.

This problem was first raised in ZOOKEEPER-3911. However, the fix for ZOOKEEPER-3911 does not solve it at the root: the following trace can still be generated in the latest versions.

h2. Trace

[^Trace-ZK-4646.pdf]

The trace is basically the same as the one in ZOOKEEPER-3911 (see the first comment provided by [~hanm] in that issue). For convenience, we use the zxid to represent a txn here.

Start the ensemble with three nodes: +S0+, +S1+ & +S2+.
- +S2+ is elected leader.
- All of them have the same log, with the last zxid <1, 3>.
- +S2+ logs a new txn <1, 4> and broadcasts it.
- +S0+ & +S1+ crash before they receive the proposal of <1, 4>.
- +S0+ & +S1+ restart.
- +S2+ is elected leader again.
- +S0+ & +S1+ DIFF sync with +S2+.
- +S0+ & +S1+ send ACK-LD to +S2+ *before* their SyncRequestProcessor threads log the txns to disk. (This is possible because txn logging is processed asynchronously!)
- Verify that clients of +S2+ have the view of <1, 4>.
- The followers +S0+ & +S1+ crash *before* their SyncRequestProcessor threads persist the txns to disk. (This is extremely timing sensitive, but possible!)
- +S0+ & +S1+ restart, and +S2+ crashes.
- Verify that clients of +S0+ & +S1+ do NOT have the view of <1, 4>, a violation of ZAB.

Extra note: with careful timing of node shutdown & restart, the trace can be constructed while a quorum of nodes stays alive at every moment, e.g., by shutting down and restarting +S0+ & +S1+ one by one within a short time.

h2. Analysis

*Root Cause:*

The root cause lies in asynchronous execution across multiple threads. When a follower replies ACK-LD, it should guarantee that it has already logged the leader's initial history (according to ZAB). However, txn logging is executed asynchronously by the SyncRequestProcessor thread, so this guarantee can be violated. It is possible that, after the leader receives ACK-LD, it believes the responding follower is in sync and enters the BROADCAST phase, while in fact the follower's history is not yet in sync. At this point, environment failures might prevent the follower from logging successfully. When that node, holding a stale or incomplete committed history, is later elected leader, it may lose txns that have already been committed and applied on the former leader.

The implementation adopts this multi-threading style as a performance optimization. However, it can introduce subtle bugs that would not occur at the protocol level.
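To make the race concrete, here is a minimal, self-contained Java sketch. It is not ZooKeeper code; the class name {{AckBeforeFsyncRace}} and all other identifiers are hypothetical. It only mimics the queue hand-off that {{logRequest(..)}} performs toward the SyncRequestProcessor thread:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch (not ZooKeeper code) of the race: the follower
// ACKs NEWLEADER while a separate thread is still writing the synced
// txns to disk, mirroring SyncRequestProcessor's asynchronous queue.
public class AckBeforeFsyncRace {

    // Queue feeding the async "SyncRequestProcessor"-style logger thread.
    static final BlockingQueue<Long> toLog = new LinkedBlockingQueue<>();
    // Simulated durable storage: txns that actually reached disk.
    static final List<Long> onDisk = new ArrayList<>();

    public static void main(String[] args) throws InterruptedException {
        Thread logger = new Thread(() -> {
            try {
                while (true) {
                    long zxid = toLog.take();
                    Thread.sleep(1000);          // simulated disk latency window
                    synchronized (onDisk) { onDisk.add(zxid); }
                }
            } catch (InterruptedException ie) { /* follower "crashes" here */ }
        });
        logger.start();

        // SYNC phase: the DIFF txn is only *queued*, like logRequest(..) does.
        toLog.put(0x1_0000_0004L);               // txn <1, 4>

        // NEWLEADER handling replies immediately; nothing forces a flush.
        System.out.println("follower sent ACK-LD");

        // Crash before the logger thread persists the queued txn.
        logger.interrupt();
        logger.join();
        synchronized (onDisk) {
            // Typically prints []: <1, 4> was ACKed but never reached disk.
            System.out.println("txns on disk after restart: " + onDisk);
        }
    }
}
{code}

Because nothing forces the queue to drain before the ACK is sent, any crash inside the latency window silently discards txns that the leader already counts as replicated on this follower.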
The fix of ZOOKEEPER-3911 simply calls {{logRequest(..)}} inside the NEWLEADER processing logic to submit the logging requests to SyncRequestProcessor's queue before replying ACK-LD, without further considering the risk of asynchronous execution across threads. When the follower replies ACK-LD and then crashes before its SyncRequestProcessor thread writes the txns to disk, the problem is triggered.

*Property Violation:*

From the server side, the committed log of the ensemble does not grow monotonically; different nodes have inconsistent committed logs. From the client side, clients connected to different nodes may have inconsistent views. A client may read stale data after having obtained a newer version; that newer version can only be obtained from certain nodes of the ensemble rather than from all nodes. What's worse, that newer version may also be removed later.

Although ZOOKEEPER-4643 has similar symptoms and property violations, it should be regarded as a distinct problem, because it has a different root cause and risk pattern. More specifically:
* ZOOKEEPER-4643: the risk lies in updating currentEpoch before logging txns to disk. The bug can be triggered by interrupting the logging of txns after currentEpoch is updated.
* ZOOKEEPER-4646: the risk lies in replying ACK-LD before logging txns to disk. The bug can be triggered by interrupting the logging of txns after ACK-LD is replied.

*Affected Versions:*

The above trace has been generated by our testing tools in multiple versions, including 3.7.1 & 3.8.0 (the latest stable & current versions at the time of writing). More versions may be affected, since the critical ordering between a follower's replying ACK-LD and updating its history during SYNC remains non-deterministic across multiple versions.

h2. Possible Fix

Considering this issue and ZOOKEEPER-4643, one possible fix is to guarantee that the following partial orders are satisfied (see the sketch after this list):
* A follower updates its {{_currentEpoch_}} only after it has persisted the txns that might be applied to the leader's datatree before the leader enters the BROADCAST phase (so as to avoid the issue of ZOOKEEPER-4643).
* A follower replies ACK-LD only after it has persisted the txns that might be applied to the leader's datatree before the leader enters the BROADCAST phase (so as to avoid this issue).
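As an illustration of the second partial order, here is a minimal Java sketch (hypothetical names again, not the actual ZooKeeper patch) in which the follower pushes a flush marker through the logging queue and replies ACK-LD only after the marker, and hence every txn queued before it, has been processed:

{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the proposed ordering: reply ACK-LD only after
// a flush marker has passed through the single-consumer logging queue,
// i.e. after every txn queued during SYNC is durably on disk.
public class AckAfterFlush {

    // A queue entry is either a txn to persist or a flush marker.
    record Entry(long zxid, CountDownLatch flushed) {}

    static final BlockingQueue<Entry> toLog = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws InterruptedException {
        Thread logger = new Thread(() -> {
            try {
                while (true) {
                    Entry e = toLog.take();
                    if (e.flushed() != null) {   // flush marker: everything
                        e.flushed().countDown(); // queued before it is on disk
                        continue;
                    }
                    Thread.sleep(1000);          // simulate fsync of e.zxid()
                }
            } catch (InterruptedException ie) { /* shutdown */ }
        });
        logger.start();

        // SYNC phase: queue the DIFF txn as before.
        toLog.put(new Entry(0x1_0000_0004L, null)); // txn <1, 4>

        // NEWLEADER handling: block on a flush marker *before* ACK-LD,
        // so a crash after the ACK can no longer lose the queued txns.
        CountDownLatch flushed = new CountDownLatch(1);
        toLog.put(new Entry(0L, flushed));
        flushed.await();
        System.out.println("txns persisted; follower sent ACK-LD");

        logger.interrupt();
    }
}
{code}

The first partial order can be enforced in the same way, by awaiting the flush marker before updating {{_currentEpoch_}} rather than before sending ACK-LD.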