[
https://issues.apache.org/jira/browse/ZOOKEEPER-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sirius updated ZOOKEEPER-4646:
------------------------------
Affects Version/s: 3.7.0
3.6.3
> Committed txns may still be lost if followers reply ACK-LD before writing
> txns to disk
> --------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-4646
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4646
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum, server
> Affects Versions: 3.6.3, 3.7.0, 3.8.0, 3.7.1
> Reporter: Sirius
> Priority: Critical
>
> When a follower is processing the NEWLEADER message in SYNC phase, it will
> call {{logRequest(..)}} to submit the txn persistence task to the
> SyncRequestProcessor thread. The latter does not promise to finish the task
> before the follower replies ACK-LD (i.e. ACK of NEWLEADER) to the leader,
> which may lead to committed data loss.
> This problem was first raised in ZOOKEEPER-3911. However, the fix for
> ZOOKEEPER-3911 does not solve the problem at its root. The following
> trace can still occur in the latest version.
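
The ordering hazard described above can be illustrated with a toy model. This is only a sketch under assumptions: `AckBeforePersistSketch`, `syncAndAck` and the `disk` list are hypothetical names, not ZooKeeper classes, and a single-thread executor stands in for the SyncRequestProcessor thread. A latch holds the write back so the outcome is deterministic.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Toy model of the race; none of these names are ZooKeeper classes. */
public class AckBeforePersistSketch {
    // Stands in for the follower's on-disk txn log.
    static final List<String> disk = new CopyOnWriteArrayList<>();

    /** Returns true if the txn was already on "disk" when the ACK went out. */
    static boolean syncAndAck(String zxid) throws InterruptedException {
        disk.clear();
        // Single worker thread, standing in for SyncRequestProcessor.
        ExecutorService syncThread = Executors.newSingleThreadExecutor();
        CountDownLatch diskIsSlow = new CountDownLatch(1);

        // logRequest(..) analogue: hand the write to the other thread and return.
        syncThread.submit(() -> {
            try {
                diskIsSlow.await();          // the write has not reached disk yet
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            disk.add(zxid);
        });

        // The follower replies ACK-LD here, without waiting for the write.
        boolean durableAtAck = disk.contains(zxid);

        diskIsSlow.countDown();              // let the write finish eventually
        syncThread.shutdown();
        syncThread.awaitTermination(5, TimeUnit.SECONDS);
        return durableAtAck;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("durable at ACK time: " + syncAndAck("<1, 4>"));
    }
}
```

In this model the txn does reach the log eventually, but a crash in the window between the ACK and the write would drop it, which is the window the report is about.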
> h2. Trace
> The trace is basically the same as the one in ZOOKEEPER-3911 (See the first
> comment provided by [~hanm] in that issue). For convenience we use the zxid
> to represent a txn here.
> Start the ensemble with three nodes: +S0+, +S1+ & +S2+.
> - +S2+ is elected leader.
> - All of them have the same log with the last zxid <1, 3>.
> - +S2+ logs a new txn <1, 4> and makes a broadcast.
> - +S0+ & +S1+ crash before they receive the proposal of <1, 4>.
> - +S0+ & +S1+ restart.
> - +S2+ is elected leader again.
> - +S0+ & +S1+ DIFF sync with +S2+ .
> - +S0+ & +S1+ send ACK-LD to +S2+ before their SyncRequestProcessor threads
> log txns to disk.
> - Verify clients of +S2+ have the view of <1, 4>. (That means, S2)
> - The followers +S0+ & +S1+ crash *before* their SyncRequestProcessor
> threads persist txns to disk. (This is extremely timing-sensitive but
> possible!)
> - +S0+ & +S1+ restart, and +S2+ crashes.
> - Verify clients of +S0+ & +S1+ do not have the view of <1, 4>, a violation
> of ZAB.
>
> Extra note: The trace can be constructed with a quorum of nodes alive at
> every moment by carefully timing node shutdown & restart, e.g., let +S0+ &
> +S1+ shut down and restart one by one within a short time.
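
The trace can be replayed in a single-threaded toy model to make the lost commit visible. This is a sketch under assumptions, not ZooKeeper code: the `Server` class and its `disk` and `pending` fields are hypothetical, a crash is modelled as dropping whatever is still queued, and the leader is assumed to commit once it holds a quorum of ACK-LDs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Single-threaded replay of the trace; a toy model, not ZooKeeper code. */
public class Zk4646TraceSketch {
    static class Server {
        final List<String> disk = new ArrayList<>();      // survives a crash
        final Deque<String> pending = new ArrayDeque<>(); // queued writes, lost on crash

        Server() { disk.add("<1, 3>"); }                  // shared starting point
        void crashAndRestart() { pending.clear(); }
        String lastZxid() { return disk.get(disk.size() - 1); }
    }

    /** Returns {zxid committed by S2, last zxid on S0, last zxid on S1}. */
    static String[] run() {
        Server s0 = new Server(), s1 = new Server(), s2 = new Server();

        s2.disk.add("<1, 4>");          // leader logs <1, 4>
        s0.crashAndRestart();           // followers crash before the proposal arrives
        s1.crashAndRestart();

        // DIFF sync: followers queue <1, 4> for disk and reply ACK-LD right away.
        s0.pending.add("<1, 4>");
        s1.pending.add("<1, 4>");
        int acks = 1 + 2;               // leader's own vote plus two ACK-LDs
        String committed = acks >= 2 ? s2.lastZxid() : null;

        s0.crashAndRestart();           // queued writes never reached disk
        s1.crashAndRestart();
        // S2 crashes; S0 and S1 alone form the new quorum.
        return new String[] { committed, s0.lastZxid(), s1.lastZxid() };
    }

    public static void main(String[] args) {
        String[] r = run();
        System.out.println("committed on S2: " + r[0]);
        System.out.println("S0 after restart: " + r[1] + ", S1 after restart: " + r[2]);
    }
}
```

The model leaves out elections and epochs entirely; it only shows that a txn acknowledged from memory can be committed by the leader and still be absent from every surviving log.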
> h2. Analysis
> *Property Violation:*
> From the server side, the committed log of the ensemble does not grow
> monotonically; different nodes have inconsistent committed logs. From the
> client side, a client may read stale data after having obtained a newer
> version, and that newer version can no longer be obtained.
> Although ZOOKEEPER-4643 has similar symptoms and property violations, it
> should be regarded as a distinct problem because it has a different root
> cause and risk pattern compared to this one. More specifically,
> * ZOOKEEPER-4643: the risk lies in the order of updating currentEpoch
> before logging txns to disk. The bug can be triggered by interrupting the
> action of logging txns after currentEpoch is updated.
> * ZOOKEEPER-4646: the risk lies in the order of replying ACK-LD before
> logging txns to disk. The bug can be triggered by interrupting the action of
> logging txns after ACK-LD is sent.
> *Gap between Protocol and Implementation:*
> The implementation is multi-threaded for performance. However, this can
> introduce subtle bugs that do not occur at the protocol level. The fix for
> ZOOKEEPER-3911 simply adds the QuorumPeer's call to {{logRequest(..)}}
> inside the NEWLEADER processing logic, without further considering the
> risk of asynchronous execution by other threads.
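
For contrast, the ordering the protocol expects (persist first, then acknowledge) might look like the sketch below. This is one possible illustration, not the fix adopted in ZooKeeper: `PersistThenAckSketch` and `syncThenAck` are hypothetical names, and waiting on a `CompletableFuture` is just one way to make the follower thread observe that the other thread has finished the write.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Toy model of "persist, then ACK"; none of these names are ZooKeeper classes. */
public class PersistThenAckSketch {
    // Stands in for the follower's on-disk txn log.
    static final List<String> disk = new CopyOnWriteArrayList<>();

    /** Returns true if the txn was on "disk" at the moment the ACK would be sent. */
    static boolean syncThenAck(String zxid) {
        disk.clear();
        // Single worker thread, standing in for SyncRequestProcessor.
        ExecutorService syncThread = Executors.newSingleThreadExecutor();
        try {
            // Hand the write to the other thread, but keep a handle on its completion.
            CompletableFuture<Void> persisted =
                CompletableFuture.runAsync(() -> disk.add(zxid), syncThread);

            persisted.join();            // block until the write has completed

            // Only now would the follower reply ACK-LD.
            return disk.contains(zxid);
        } finally {
            syncThread.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("durable at ACK time: " + syncThenAck("<1, 4>"));
    }
}
```

Blocking on the write trades some sync-phase latency for the guarantee that an ACK-LD implies durability; whether that trade-off is acceptable is a design decision the report leaves open.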
> *Affected Versions:*
> The above trace has been triggered in multiple versions, such as 3.7.1 &
> 3.8.0 (the latest stable & current versions at the time of writing), by our
> testing tools. More versions may be affected, since the critical update
> order between the follower's replying ACK-LD and updating its history
> during SYNC stays non-deterministic as the versions evolve.
>