[
https://issues.apache.org/jira/browse/ZOOKEEPER-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sirius updated ZOOKEEPER-4646:
------------------------------
Description:
When a follower processes the NEWLEADER message in the SYNC phase, it calls
logRequest(..) to submit the txn persistence task to the SyncRequestProcessor
thread. The latter does not guarantee that the task finishes before the follower
replies ACK-LD (i.e. ACK of NEWLEADER) to the leader, which may lead to
committed data loss.
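To make the ordering concern concrete, here is a minimal sketch of "wait for persistence before ACK". It is not ZooKeeper's code: the class FollowerSyncSketch, the field durableLog and the method awaitDurable() are hypothetical names, and a single-thread executor merely stands in for the SyncRequestProcessor thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical model for illustration only; not ZooKeeper's real classes.
class FollowerSyncSketch {
    // Stand-in for the SyncRequestProcessor thread (daemon, so it never blocks JVM exit).
    private final ExecutorService syncThread = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "sync-sketch");
        t.setDaemon(true);
        return t;
    });

    // Stand-in for the on-disk txn log.
    final List<Long> durableLog = new CopyOnWriteArrayList<>();

    // Persistence tasks that were submitted but may not have finished yet.
    private final List<CompletableFuture<Void>> pending = new ArrayList<>();

    // Queues the txn for persistence and returns immediately, like an async hand-off.
    void logRequest(long zxid) {
        pending.add(CompletableFuture.runAsync(() -> durableLog.add(zxid), syncThread));
    }

    // Blocks until every queued txn is in durableLog; only after this is ACK-LD safe.
    void awaitDurable() {
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        pending.clear();
    }

    void shutdown() {
        syncThread.shutdown();
    }
}
```

In this model, replying ACK-LD right after logRequest(..) returns would reproduce the race, while calling awaitDurable() first closes it.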
Actually, this problem was first raised in ZOOKEEPER-3911. However, the
fix of ZOOKEEPER-3911 does not solve the problem at the root. The following
trace is a
h2. Trace
The trace is basically the same as the one in ZOOKEEPER-3911. (Here we use
the zxid to represent a txn.)
- Start the ensemble with three nodes: +S0+, +S1+ & +S2+.
- +S2+ is elected leader.
- All of them have {{_lastLoggedZxid_}} = <1, 3>, {{_lastProcessedZxid_}} =
<1, 3>.
- +S2+ logs a new txn <1, 4> and makes a broadcast.
- Shut down +S0+ & +S1+ before they receive the proposal of <1, 4>.
- Restart +S0+ & +S1+.
- +S2+ uses DIFF to sync with +S0+ & +S1+.
- +S0+ & +S1+ send ACK-LD to +S2+ before their SyncRequestProcessor threads
log txns to disk.
- Verify clients of +S2+ have the view of <1, 4>.
- Shut down +S2+, and make sure to shut down the followers +S0+ and +S1+
*before* their SyncRequestProcessor threads persist txns to disk. (This is
extremely timing sensitive but possible!)
- Restart +S0+ and +S1+.
- Verify clients of +S0+ and +S1+ do not have the view of <1, 4>, a violation
of ZAB.
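The steps above can be replayed in a small deterministic model. This is only a sketch under simplifying assumptions: the class AckBeforePersistTrace and its Follower type are hypothetical, "disk" is whatever survives a crash, and "queued" holds txns handed to the persistence thread but not yet written.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical replay of the trace; not ZooKeeper's real classes.
class AckBeforePersistTrace {
    static final class Follower {
        final List<String> disk = new ArrayList<>();      // survives a crash
        final Deque<String> queued = new ArrayDeque<>();  // lost on a crash

        void logRequest(String zxid) { queued.add(zxid); }  // async hand-off
        void flush() { while (!queued.isEmpty()) disk.add(queued.poll()); }
        void crash() { queued.clear(); }                    // unflushed txns vanish
        boolean has(String zxid) { return disk.contains(zxid); }
    }

    // Replays the trace for S0 and S1; flushBeforeAck toggles the ordering at issue.
    static boolean txnSurvives(boolean flushBeforeAck) {
        Follower[] followers = { new Follower(), new Follower() };
        for (Follower f : followers) {
            f.disk.add("<1, 3>");
            f.logRequest("<1, 4>");   // DIFF sync delivers <1, 4>
            if (flushBeforeAck) {
                f.flush();
            }
            // ACK-LD is sent at this point; the leader S2 then commits <1, 4>.
        }
        for (Follower f : followers) {
            f.crash();                // followers go down; S2 is shut down as well
        }
        return followers[0].has("<1, 4>") && followers[1].has("<1, 4>");
    }

    public static void main(String[] args) {
        System.out.println("ack before persist: " + txnSurvives(false)); // false
        System.out.println("persist before ack: " + txnSurvives(true));  // true
    }
}
```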
h2. Analysis
*Property Violation:*
From the server side, the committed log of the ensemble does not append
monotonically. From the client side, a client may read stale data after a
newer version is obtained, and that newer version cannot be obtained anymore.
ZOOKEEPER-4643 shows similar symptoms, but its fix only mitigates the
occurrence of the problem without solving it at the root.
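The server-side property can be phrased as a prefix check on the committed log. The sketch below is an illustration only: CommittedLogCheck, isPrefix and zxid are hypothetical helper names. The zxid helper assumes the usual 64-bit layout, with the epoch in the high 32 bits and the counter in the low 32 bits.

```java
import java.util.List;

// Hypothetical checker for the "committed log only grows" property.
class CommittedLogCheck {
    // Packs <epoch, counter> into one 64-bit zxid (epoch in the high 32 bits).
    static long zxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    // True if 'before' is a prefix of 'after', i.e. no committed txn disappeared.
    static boolean isPrefix(List<Long> before, List<Long> after) {
        return before.size() <= after.size()
                && before.equals(after.subList(0, before.size()));
    }
}
```

In the trace, the committed log observed through +S2+ ends with <1, 4>, while the log observed later through +S0+ and +S1+ ends with <1, 3>, so isPrefix(earlier, later) would return false.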
was:
When a follower processes the NEWLEADER message in the SYNC phase, it calls
logRequest(..) to submit the txn persistence task to the SyncRequestProcessor
thread. The latter does not guarantee that the task finishes before the follower
replies ACK-LD to the leader, which may trigger committed data loss.
When a follower processes the NEWLEADER message in the SYNC phase, it writes
its {{_currentEpoch_}} to the file *before* writing the txns (from the
PROPOSALs sent by the leader in SYNC) to the log file. This ordering may lead to
improper truncation of *committed* txns in later rounds.
The critical step to trigger this problem is to make a follower node crash
right after it writes its {{_currentEpoch_}} to the file but before it writes
the txns to the log file. The risk is that this node, with incomplete
committed txns, might be elected leader thanks to its larger {{_currentEpoch_}}
and then improperly use TRUNC to ask nodes to delete their committed txns!
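The crash window described here can be shown with a tiny model of a follower's durable state. This is a sketch only: EpochOrderSketch, Disk and crashAfterFirstStep are hypothetical names, and the two branches simply contrast "epoch first" with "txns first".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the two write orders; not ZooKeeper's real classes.
class EpochOrderSketch {
    // Durable state of a follower before NEWLEADER is handled.
    static final class Disk {
        long currentEpoch = 1;
        final List<String> txnLog = new ArrayList<>(Arrays.asList("<1, 3>"));
    }

    // Performs only the first durable write of NEWLEADER handling, then "crashes".
    static Disk crashAfterFirstStep(boolean epochFirst) {
        Disk d = new Disk();
        if (epochFirst) {
            d.currentEpoch = 2;        // crash here: <1, 4> never reaches the log
        } else {
            d.txnLog.add("<1, 4>");    // crash here: epoch stays at 1, log is complete
        }
        return d;
    }
}
```

With epochFirst set to true, the node restarts with the larger epoch but the shorter log, which is the dangerous combination described above.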
h2. Trace
The trace
Here is an example to trigger the bug. (Focus on {{_currentEpoch_}} and
{{_lastLoggedZxid_}})
*Round 1 (Running nodes with their acceptedEpoch & currentEpoch set to 1):*
- Start the ensemble with three nodes: +S0+, +S1+ & +S2+.
- +S2+ is elected leader.
- For all of them, {{_acceptedEpoch_}} = 1, {{_currentEpoch_}} = 1.
- Besides, all of them have {{_lastLoggedZxid_}} = <1, 3>,
{{_lastProcessedZxid_}} = <1, 3>.
- +S0+ crashes.
- A new txn <1, 4> is logged and committed by +S1+ & +S2+. Then, +S1+ &
+S2+ have {{_lastLoggedZxid_}} = <1, 4>, {{_lastProcessedZxid_}} = <1, 4>
(clients can read the datatree with latest zxid <1, 4>).
*Round 2 (Running nodes with their acceptedEpoch & currentEpoch set to 2):*
* +S0+ restarts, +S2+ restarts, and +S1+ crashes.
* Again, +S2+ is elected leader.
* During the DISCOVERY phase, +S0+ & +S2+ update their {{_acceptedEpoch_}} to
2.
* Then, during the SYNC phase, the leader +S2+ ({{_maxCommittedLog_}} =
<1, 4>) uses DIFF to sync with the follower +S0+ ({{_lastLoggedZxid_}} =
<1, 3>), and their {{_currentEpoch_}} will be set to 2 (and written to disk).
* Note that the follower +S0+ updates its currentEpoch file before writing the
txns to the log file when receiving NEWLEADER message.
* *Unfortunately, right after the follower +S0+ finishes updating its
currentEpoch file, it crashes.*
*Round 3 (Running nodes with their acceptedEpoch & currentEpoch set to 3):*
* +S0+ & +S1+ restart, and +S2+ crashes.
* Since +S0+ has {{_currentEpoch_}} = 2, +S1+ has {{_currentEpoch_}} = 1, +S0+
will be elected leader.
* During the SYNC phase, the leader +S0+ ({{_maxCommittedLog_}} = <1, 3>)
will use TRUNC to sync with +S1+ ({{_lastLoggedZxid_}} = <1, 4>). Then,
+S1+ removes txn <1, 4>.
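Round 3 can be checked with a short sketch of the vote ordering and the sync decision. The names ElectionSketch, beats and syncMode are hypothetical. The comparison follows the (epoch, zxid, server id) order used when comparing votes, and syncMode is deliberately simplified: it ignores SNAP and treats every non-TRUNC case as DIFF.

```java
// Hypothetical helpers for illustration; simplified relative to the real election code.
class ElectionSketch {
    // Packs <epoch, counter> into one 64-bit zxid (epoch in the high 32 bits).
    static long zxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    // True if vote A wins over vote B: compare epoch, then zxid, then server id.
    static boolean beats(long epochA, long zxidA, long sidA,
                         long epochB, long zxidB, long sidB) {
        if (epochA != epochB) {
            return epochA > epochB;
        }
        if (zxidA != zxidB) {
            return zxidA > zxidB;
        }
        return sidA > sidB;
    }

    // Simplified sync choice: a follower ahead of the leader's log is told to truncate.
    static String syncMode(long leaderMaxCommitted, long followerLastLogged) {
        return followerLastLogged > leaderMaxCommitted ? "TRUNC" : "DIFF";
    }
}
```

With +S0+ at (currentEpoch 2, <1, 3>) and +S1+ at (currentEpoch 1, <1, 4>), beats(..) favors +S0+ despite its shorter log, and syncMode(zxid(1, 3), zxid(1, 4)) yields TRUNC, matching the trace.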
> Committed txns may be lost if followers reply ACK-LD before writing txns to
> disk
> --------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-4646
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4646
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum, server
> Affects Versions: 3.8.0
> Reporter: Sirius
> Priority: Critical
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)