[
https://issues.apache.org/jira/browse/ZOOKEEPER-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sirius updated ZOOKEEPER-4643:
------------------------------
Description:
When a follower is processing the NEWLEADER message in SYNC phase, it will
update its _currentEpoch_ to the file *before* writing the uncommitted txns to
the log file. Such order may lead to improper truncation of *committed* txns in
later rounds.
The critical step to trigger this problem is to make a follower node crash
right after it updates its _currentEpoch_ to the file but before writing the
uncommitted txns to the log file. The potential risk is that, this node with
incomplete committed txns might be elected as a leader with its larger
_currentEpoch_ and then improperly uses TRUNC to ask nodes to delete their
committed txns!
h2. Trace
Here is an example to trigger the bug. (Focus on _currentEpoch_ and
{_}lastLoggedZxid{_})
{*}Round 1 (Running nodes with their acceptedEpoch & currentEpoch set to
1{*}{*}):{*}
- Start the ensemble with three nodes: S{+}0{+}, +S1+ & {+}S2{+}.
- +S2+ is elected leader.
- For all of them, _acceptedEpoch_ = 1, _currentEpoch_ = 1.
- Besides, all of them have _lastLoggedZxid_ = <1, 3>, _lastProcessedZxid_ =
<1, 3>.
- +S0+ crashes.
- A new txn <1, 4> is logged and committed by +S1+ & {+}S2{+}. Then, +S1+ &
+S2+ have _lastLoggedZxid_ = <1, 4>, _lastProcessedZxid_ = <1, 4> ( Clients can
read the datatree with latest zxid <1, 4>).
*Round 2* {*}(Running nodes with their acceptedEpoch & currentEpoch set to
2{*}{*}){*}{*}:{*}
* +S0+ restarts, +S2+ restarts, and +S1+ crashes.
* Again, +S2+ is elected leader.
* During the DISCOVERY phase, +S0+ & +S2+ update their _acceptedEpoch_ to 2.
* Then, during the SYNC phase, the leader +S2+ ({_}maxCommittedLog{_} = <1,
4>) uses DIFF to sync with the follower +S0+ ({_}lastLoggedZxid{_} = <1, 3>),
and their _currentEpoch_ will be set to 2 (and written to disk).
* Note that the follower +S0+ updates its currentEpoch file before writing the
uncommitted txns to the log file when receiving NEWLEADER message.
* *Unfortunately, right after the follower +S0+ finishes updating its
currentEpoch file, it crashes.*
*Round 3* {*}(Running nodes with their acceptedEpoch & currentEpoch set to
3{*}{*}){*}{*}:{*}
* +S0+ & +S1+ restarts, and +S2+ crashes.
* Since +S0+ has _currentEpoch_ = 2, +S1+ has _currentEpoch_ = 1, +S0+ will be
elected leader.
* During the SYNC phase, the leader +S0+ ({_}maxCommittedLog{_} = <1, 3>) will
use TRUNC to sync with +S1+ ({_}lastLoggedZxid{_} = <1, 4>). Then, +S1+ removes
txn <1, 4>.
However, <1, 4> was committed and visible by clients before, and is not
supposed to be truncated!
Note: The trace can be constructed with quorum nodes alive at any moment with
careful time tuning of node crash & restart, e.g., let +S1+ restart before +S0+
crashes at the end of Round 2.
h2.
h2. Analysis
*Property Violation:*
>From the server side, the ensemble deletes a committed txn, which is not
>allowed. From the client side, a client may read stale data after a newer
>version is obtained, and that newer version cannot be obtained anymore.
ZOOKEEPER-3911 shows similar symptoms, but its fix only mitigates the
occurrance of the problem without solving it at the root.
*Gap between Theory and Reality:*
The critical step here is to interrupt two updates that should be done
together. However, there lie some gaps between the theory and the reality. In
the Zab paper, a follower updates its current epoch and history during the SYNC
phase in an *atomic* action, and the correctness of the Zab protocol is based
on that. Unfortunately, in actual environment, a node crash (or other
environment failures) may occur at any time, breaking the procedures that will
not be interrupted at the protocol level. By digging into the code, the
synchronized section in the method syncWithLeader(long) in Learner.java can
only prevent the procedure from interrupted by other threads, not environment
failures. We think it important to keep ZooKeeper still in a correct condition
even under such type of environment circumstances.
*Affected Versions:*
The above trace has been triggered in version 3.7.1 & 3.8.0 (the latest stable
& current version till now) by our testing tools. The affected versions might
be more, because the critical update order between the follower's current epoch
and its history during SYNC stay unchanged across multiple versions.
I will provide more materials to demonstrate further details in two days.
h2. Possible Fix
Theoretically, this issue can be avoided by exchanging the order of writing
uncommitted txns to the log file and writing _currentEpoch_ to the currentEpoch
file when the follower is processing NEWLEADER message. (See the code of
Learner.java). Writing the uncommitted txns to the log file before updating
_currentEpoch_ to the file is {_}a more c{_}onservative strategy, but it
prevents the problem that a node with the incomplete committed log can win an
election with its higher _currentEpoch._
was:
When a follower is processing the NEWLEADER message in SYNC phase, it will
update its _currentEpoch_ to the file *before* writing the uncommitted txns to
the log file. Such order may lead to improper truncation of *committed* txns in
later rounds.
The critical step to trigger this problem is to make a follower node crash
right after it updates its _currentEpoch_ to the file but before writing the
uncommitted txns to the log file. The potential risk is that, this node with
incomplete committed txns might be elected as a leader with its larger
_currentEpoch_ and then improperly uses TRUNC to ask nodes to delete their
committed txns!
h2. Trace
Here is an example to trigger the bug. (Focus on _currentEpoch_ and
{_}lastLoggedZxid{_})
{*}Round 1 (Running nodes with their acceptedEpoch & currentEpoch set to
1{*}{*}):{*}
- Start the ensemble with three nodes: S{+}0{+}, +S1+ & {+}S2{+}.
- +S2+ is elected leader.
- For all of them, _acceptedEpoch_ = 1, _currentEpoch_ = 1.
- Besides, all of them have _lastLoggedZxid_ = <1, 3>, _lastProcessedZxid_ =
<1, 3>.
- +S0+ crashes.
- A new txn <1, 4> is logged and committed by +S1+ & {+}S2{+}. Then, +S1+ &
+S2+ have _lastLoggedZxid_ = <1, 4>, _lastProcessedZxid_ = <1, 4> ( Clients can
read the datatree with latest zxid <1, 4>).
*Round 2* {*}(Running nodes with their acceptedEpoch & currentEpoch set to
2{*}{*}){*}{*}:{*}
* +S0+ restarts, +S2+ restarts, and +S1+ crashes.
* Again, +S2+ is elected leader.
* During the DISCOVERY phase, +S0+ & +S2+ update their _acceptedEpoch_ to 2.
* Then, during the SYNC phase, the leader +S2+ ({_}maxCommittedLog{_} = <1,
4>) uses DIFF to sync with the follower +S0+ ({_}lastLoggedZxid{_} = <1, 3>),
and their _currentEpoch_ will be set to 2 (and written to disk).
* Note that the follower +S0+ updates its currentEpoch file before writing the
uncommitted txns to the log file when receiving NEWLEADER message.
* *Unfortunately, right after the follower +S0+ finishes updating its
currentEpoch file, it crashes.*
*Round 3* {*}(Running nodes with their acceptedEpoch & currentEpoch set to
3{*}{*}){*}{*}:{*}
* +S0+ & +S1+ restarts, and +S2+ crashes.
* Since +S0+ has _currentEpoch_ = 2, +S1+ has _currentEpoch_ = 1, +S0+ will be
elected leader.
* During the SYNC phase, the leader +S0+ ({_}maxCommittedLog{_} = <1, 3>) will
use TRUNC to sync with +S1+ ({_}lastLoggedZxid{_} = <1, 4>). Then, +S1+ removes
txn <1, 4>.
However, <1, 4> was committed and visible by clients before, and is not
supposed to be truncated!
Note: The trace can be constructed with quorum nodes alive at any moment with
careful time tuning of node crash & restart, e.g., let +S1+ restart before +S0+
crashes at the end of Round 2.
h2.
h2. Analysis
*Property Violation:*
>From the server side, the ensemble deletes a committed txn, which is not
>allowed. From the client side, a client may read stale data after a newer
>version is obtained, and that newer version cannot be obtained anymore.
ZOOKEEPER-3911 shows similar symptoms, but its fix only mitigates the
occurrance of the problem without solving it at the root.
*Gap between Theory and Reality:*
The critical step here is to interrupt two updates that should be done
together. However, there lie some gaps between the theory and the reality. In
the Zab paper, a follower updates its current epoch and history during the SYNC
phase in an *atomic* action, and the correctness of the Zab protocol is based
on that. However, in actual environment, a node crash (or other environment
failures) may occur at any time, breaking the procedures that will not be
interrupted at the protocol level. At the implementation level, the
synchronized section in the method syncWithLeader(long) in Learner.java can
only prevent the procedure from interrupted by other threads, and environment
failures cannot be prevented. We think it important to keep ZooKeeper still in
a correct condition even under such type of environment circumstances.
*Affected Versions:*
The above trace has been triggered in version 3.7.1 & 3.8.0 (the latest stable
& current version till now) by our testing tools. The affected versions might
be more, because the critical update order between the follower's current epoch
and its history during SYNC stay unchanged across multiple versions.
I will provide more materials to demonstrate further details in two days.
h2. Possible Fix
Intuitively, this issue can be avoided by exchanging the order of writing
uncommitted txns to the log file and writing _currentEpoch_ to the currentEpoch
file when the follower is processing NEWLEADER message. (See the code of
Learner.java). Writing the uncommitted txns to the log file before updating
_currentEpoch_ to the file is {_}a more c{_}onservative strategy, but it
prevents the problem that a node with the incomplete committed log can win an
election with its higher _currentEpoch._
> Committed txns may be improperly truncated if node crashes right after
> updating currentEpoch
> ---------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-4643
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4643
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum, server
> Affects Versions: 3.8.0, 3.7.1
> Reporter: Sirius
> Priority: Critical
>
> When a follower is processing the NEWLEADER message in SYNC phase, it will
> update its _currentEpoch_ to the file *before* writing the uncommitted txns
> to the log file. Such order may lead to improper truncation of *committed*
> txns in later rounds.
> The critical step to trigger this problem is to make a follower node crash
> right after it updates its _currentEpoch_ to the file but before writing the
> uncommitted txns to the log file. The potential risk is that, this node with
> incomplete committed txns might be elected as a leader with its larger
> _currentEpoch_ and then improperly uses TRUNC to ask nodes to delete their
> committed txns!
>
> h2. Trace
> Here is an example to trigger the bug. (Focus on _currentEpoch_ and
> {_}lastLoggedZxid{_})
> {*}Round 1 (Running nodes with their acceptedEpoch & currentEpoch set to
> 1{*}{*}):{*}
> - Start the ensemble with three nodes: S{+}0{+}, +S1+ & {+}S2{+}.
> - +S2+ is elected leader.
> - For all of them, _acceptedEpoch_ = 1, _currentEpoch_ = 1.
> - Besides, all of them have _lastLoggedZxid_ = <1, 3>, _lastProcessedZxid_ =
> <1, 3>.
> - +S0+ crashes.
> - A new txn <1, 4> is logged and committed by +S1+ & {+}S2{+}. Then, +S1+ &
> +S2+ have _lastLoggedZxid_ = <1, 4>, _lastProcessedZxid_ = <1, 4> ( Clients
> can read the datatree with latest zxid <1, 4>).
> *Round 2* {*}(Running nodes with their acceptedEpoch & currentEpoch set to
> 2{*}{*}){*}{*}:{*}
> * +S0+ restarts, +S2+ restarts, and +S1+ crashes.
> * Again, +S2+ is elected leader.
> * During the DISCOVERY phase, +S0+ & +S2+ update their _acceptedEpoch_ to 2.
> * Then, during the SYNC phase, the leader +S2+ ({_}maxCommittedLog{_} = <1,
> 4>) uses DIFF to sync with the follower +S0+ ({_}lastLoggedZxid{_} = <1, 3>),
> and their _currentEpoch_ will be set to 2 (and written to disk).
> * Note that the follower +S0+ updates its currentEpoch file before writing
> the uncommitted txns to the log file when receiving NEWLEADER message.
> * *Unfortunately, right after the follower +S0+ finishes updating its
> currentEpoch file, it crashes.*
> *Round 3* {*}(Running nodes with their acceptedEpoch & currentEpoch set to
> 3{*}{*}){*}{*}:{*}
> * +S0+ & +S1+ restarts, and +S2+ crashes.
> * Since +S0+ has _currentEpoch_ = 2, +S1+ has _currentEpoch_ = 1, +S0+ will
> be elected leader.
> * During the SYNC phase, the leader +S0+ ({_}maxCommittedLog{_} = <1, 3>)
> will use TRUNC to sync with +S1+ ({_}lastLoggedZxid{_} = <1, 4>). Then, +S1+
> removes txn <1, 4>.
>
> However, <1, 4> was committed and visible by clients before, and is not
> supposed to be truncated!
>
> Note: The trace can be constructed with quorum nodes alive at any moment with
> careful time tuning of node crash & restart, e.g., let +S1+ restart before
> +S0+ crashes at the end of Round 2.
> h2.
> h2. Analysis
> *Property Violation:*
> From the server side, the ensemble deletes a committed txn, which is not
> allowed. From the client side, a client may read stale data after a newer
> version is obtained, and that newer version cannot be obtained anymore.
> ZOOKEEPER-3911 shows similar symptoms, but its fix only mitigates the
> occurrance of the problem without solving it at the root.
> *Gap between Theory and Reality:*
> The critical step here is to interrupt two updates that should be done
> together. However, there lie some gaps between the theory and the reality. In
> the Zab paper, a follower updates its current epoch and history during the
> SYNC phase in an *atomic* action, and the correctness of the Zab protocol is
> based on that. Unfortunately, in actual environment, a node crash (or other
> environment failures) may occur at any time, breaking the procedures that
> will not be interrupted at the protocol level. By digging into the code, the
> synchronized section in the method syncWithLeader(long) in Learner.java can
> only prevent the procedure from interrupted by other threads, not environment
> failures. We think it important to keep ZooKeeper still in a correct
> condition even under such type of environment circumstances.
> *Affected Versions:*
> The above trace has been triggered in version 3.7.1 & 3.8.0 (the latest
> stable & current version till now) by our testing tools. The affected
> versions might be more, because the critical update order between the
> follower's current epoch and its history during SYNC stay unchanged across
> multiple versions.
>
> I will provide more materials to demonstrate further details in two days.
>
>
> h2. Possible Fix
> Theoretically, this issue can be avoided by exchanging the order of writing
> uncommitted txns to the log file and writing _currentEpoch_ to the
> currentEpoch file when the follower is processing NEWLEADER message. (See the
> code of Learner.java). Writing the uncommitted txns to the log file before
> updating _currentEpoch_ to the file is {_}a more c{_}onservative strategy,
> but it prevents the problem that a node with the incomplete committed log can
> win an election with its higher _currentEpoch._
--
This message was sent by Atlassian Jira
(v8.20.10#820010)