[
https://issues.apache.org/jira/browse/ZOOKEEPER-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sirius updated ZOOKEEPER-4646:
------------------------------
Description:
When a follower is processing the NEWLEADER message in the SYNC phase, it calls
{{logRequest(..)}} to hand the txn persistence task off to the
SyncRequestProcessor thread. That thread does not guarantee to finish the task
before the follower replies ACK-LD (i.e. the ACK of NEWLEADER) to the leader,
which may lead to loss of committed data.
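A minimal sketch of this hazardous ordering (the class and its members below are simplified and hypothetical, not the real ZooKeeper code): the txns are only *enqueued* for persistence, and ACK-LD is written right after the enqueue, so the acknowledgement can race ahead of the fsync.
{code:java}
// Sketch only: hypothetical names, not the actual Learner/SyncRequestProcessor classes.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class FollowerSyncSketch {
    // Queue drained by a background "SyncRequestProcessor"-like thread.
    private final BlockingQueue<byte[]> toBePersisted = new LinkedBlockingQueue<>();

    // Called from NEWLEADER handling: only *enqueues* the txn, no fsync yet.
    void logRequest(byte[] txn) {
        toBePersisted.add(txn);
    }

    void onNewLeader(byte[][] pendingTxns) {
        for (byte[] txn : pendingTxns) {
            logRequest(txn);   // asynchronous persistence request
        }
        sendAckLD();           // BUG WINDOW: ACK-LD may reach the leader before
                               // the background thread has written and fsynced
                               // the txns enqueued above
    }

    private void sendAckLD() { /* write ACK of NEWLEADER to the leader */ }
}
{code}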
This problem was first raised in ZOOKEEPER-3911. However, the fix for
ZOOKEEPER-3911 does not address the root cause, and the following trace can
still occur in the latest versions.
h2. Trace
The trace is basically the same as the one in ZOOKEEPER-3911 (see the first
comment provided by [~hanm] in that issue). For convenience, we use the zxid to
represent a txn here.
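(For readers unfamiliar with the notation: <epoch, counter> denotes the two halves of ZooKeeper's 64-bit zxid, the high 32 bits being the epoch and the low 32 bits the counter, as in the small illustration below.)
{code:java}
// <epoch, counter> maps onto the 64-bit zxid: high 32 bits = epoch, low 32 bits = counter.
public class ZxidNotation {
    public static void main(String[] args) {
        long epoch = 1, counter = 4;
        long zxid = (epoch << 32) | counter;                 // the txn written as <1, 4>
        System.out.println("0x" + Long.toHexString(zxid));   // prints 0x100000004
    }
}
{code}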
Start the ensemble with three nodes: +S0+, +S1+ & +S2+.
- +S2+ is elected leader.
- All of them have the same log with the last zxid <1, 3>.
- +S2+ logs a new txn <1, 4> and broadcasts the proposal.
- +S0+ & +S1+ crash before they receive the proposal of <1, 4>.
- +S0+ & +S1+ restart.
- +S2+ is elected leader again.
- +S0+ & +S1+ DIFF sync with +S2+.
- +S0+ & +S1+ send ACK-LD to +S2+ before their SyncRequestProcessor threads
log txns to disk.
- Verify that clients of +S2+ have the view of <1, 4>. (That means +S2+ has
committed <1, 4>.)
- The followers +S0+ & +S1+ crash *before* their SyncRequestProcessor threads
persist txns to disk. (This is extremely timing-sensitive but possible!)
- +S0+ & +S1+ restart, and +S2+ crashes.
- Verify that clients of +S0+ & +S1+ do not have the view of <1, 4>, a violation
of ZAB. (A client-side sketch of these verification steps follows the trace.)
Extra note: the trace can also be constructed while keeping a quorum of nodes
alive at every moment, with careful timing of node shutdowns & restarts, e.g.,
by letting +S0+ & +S1+ shut down and restart one by one within a short time.
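The two verification steps can be pictured with the standard client API, as in the sketch below (the connect string and the {{/key}} znode are made up purely for illustration).
{code:java}
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class VerifyView {
    public static void main(String[] args) throws Exception {
        // Hypothetical connect string: for the first check the client talks to S2,
        // for the final check to S0/S1 after S2 has crashed.
        ZooKeeper zk = new ZooKeeper("s0:2181,s1:2181,s2:2181", 30000, event -> { });

        Stat stat = new Stat();
        byte[] data = zk.getData("/key", false, stat);  // "/key" is a made-up znode
        // Before the crash, a client of S2 observes the data committed in <1, 4>;
        // after S0 & S1 restart without having persisted that txn and S2 crashes,
        // clients of S0/S1 no longer observe it -- the ZAB violation described above.
        System.out.println(new String(data) + " @ mzxid 0x"
                + Long.toHexString(stat.getMzxid()));
        zk.close();
    }
}
{code}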
h2. Analysis
*Property Violation:*
From the server side, the committed log of the ensemble does not grow
monotonically; different nodes have inconsistent committed logs. From the
client side, a client may read stale data after having obtained a newer
version, and that newer version can never be obtained again.
Although ZOOKEEPER-4643 has similar symptoms and property violations, it should
be regarded as a distinct problem because it has a different root cause and
risk pattern. More specifically,
* ZOOKEEPER-4643: the risk lies in updating currentEpoch before logging txns to
disk. The bug is triggered by interrupting txn logging after currentEpoch has
been updated.
* ZOOKEEPER-4646: the risk lies in replying ACK-LD before logging txns to disk.
The bug is triggered by interrupting txn logging after ACK-LD has been sent.
*Gap between Protocol and Implementation:*
The implementation adopts a multi-threaded style for performance, but this can
introduce subtle bugs that do not arise at the protocol level. The fix for
ZOOKEEPER-3911 simply adds the QuorumPeer's call to {{logRequest(..)}} inside
the NEWLEADER processing logic, without considering that the logging itself is
executed asynchronously by another thread.
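To make the gap concrete, the sketch below (hypothetical names again, not a patch against the real code) shows the ordering the protocol actually requires: the follower must not send ACK-LD until the submitted txns are durable, e.g. by waiting for a completion signal from the persistence thread or by logging synchronously at this point.
{code:java}
// Sketch only: hypothetical names, not the real ZooKeeper classes.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class FollowerSyncOrderingSketch {
    interface TxnLog { void append(byte[] txn); void commit(); /* flush + fsync */ }

    private final TxnLog txnLog;
    FollowerSyncOrderingSketch(TxnLog txnLog) { this.txnLog = txnLog; }

    void onNewLeader(byte[][] pendingTxns) throws InterruptedException {
        CountDownLatch persisted = new CountDownLatch(1);

        // Persistence thread: append and fsync the txns, then signal completion.
        new Thread(() -> {
            for (byte[] txn : pendingTxns) {
                txnLog.append(txn);
            }
            txnLog.commit();            // durable before acknowledging
            persisted.countDown();
        }, "sync-request-processor-sketch").start();

        // Reply ACK-LD only once the txns are durable on disk.
        if (!persisted.await(30, TimeUnit.SECONDS)) {
            throw new IllegalStateException("txns not durable; refusing to ACK NEWLEADER");
        }
        sendAckLD();
    }

    private void sendAckLD() { /* write ACK of NEWLEADER to the leader */ }
}
{code}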
*Affected Versions:*
Our testing tools have triggered the above trace in multiple versions,
including 3.7.1 & 3.8.0 (the latest stable & current versions at the time of
writing). More versions may be affected, since the critical ordering between
the follower's replying ACK-LD and persisting its history during SYNC remains
non-deterministic as the codebase evolves.
> Committed txns may still be lost if followers reply ACK-LD before writing
> txns to disk
> --------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-4646
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4646
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum, server
> Affects Versions: 3.8.0, 3.7.1
> Reporter: Sirius
> Priority: Critical
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)