[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sirius updated ZOOKEEPER-4643:
------------------------------
    Description: 
When a follower is processing the NEWLEADER message in SYNC phase, it will 
update its {{_currentEpoch_}} to the file *before* writing the txns (from the 
PROPOSALs sent by leader in SYNC) to the log file. Such order may lead to 
improper truncation of *committed* txns in later rounds.

The critical step to trigger this problem is to make a follower node crash 
right after it updates its {{_currentEpoch_}} to the file but before writing 
the txns to the log file. The potential risk is that, this node with incomplete 
committed txns might be elected as a leader with its larger {{_currentEpoch_}} 
and then improperly uses TRUNC to ask nodes to delete their committed txns!

 
h2. Trace

Here is an example to trigger the bug. (Focus on {{_currentEpoch_}} and 
{{{}_lastLoggedZxid_{}}})

{*}Round 1 (Running nodes with their acceptedEpoch & currentEpoch set to 
1{*}{*}):{*}
 - Start the ensemble with three nodes: S{+}0{+}, +S1+ & {+}S2{+}.
 - +S2+ is elected leader.
 - For all of them, {{_acceptedEpoch_}} = 1, _{{currentEpoch}}_ = 1.
 - Besides, all of them have {{_lastLoggedZxid_}} = <1, 3>, 
{{_lastProcessedZxid_}} = <1, 3>.
 - +S0+ crashes.
 - A new txn <1, 4> is logged and committed by +S1+ & {+}S2{+}. Then, +S1+ & 
+S2+ have {{_lastLoggedZxid_}} = <1, 4>, {{_lastProcessedZxid_}} = <1, 4> ( 
Clients can read the datatree with latest zxid <1, 4>).

*Round 2* {*}(Running nodes with their acceptedEpoch & currentEpoch set to 
2{*}{*}){*}{*}:{*}
 * +S0+ restarts, +S2+ restarts, and +S1+ crashes.
 * Again, +S2+ is elected leader.
 * During the DISCOVERY phase, +S0+ & +S2+ update their {{_acceptedEpoch_}} to 
2.
 * Then, during the SYNC phase, the leader +S2+ ({{{}_maxCommittedLog_{}}} = 
<1, 4>) uses DIFF to sync with the follower +S0+ ({{{}_lastLoggedZxid_{}}} = 
<1, 3>), and their {{_currentEpoch_}} will be set to 2 (and written to disk).
 * Note that the follower +S0+ updates its currentEpoch file before writing the 
txns to the log file when receiving NEWLEADER message.
 * *Unfortunately, right after the follower +S0+ finishes updating its 
currentEpoch file, it crashes.*

*Round 3* {*}(Running nodes with their acceptedEpoch & currentEpoch set to 
3{*}{*}){*}{*}:{*}
 * +S0+ & +S1+ restarts, and +S2+ crashes.
 * Since +S0+ has {{_currentEpoch_}} = 2, +S1+ has {{_currentEpoch_}} = 1, +S0+ 
will be elected leader.
 * During the SYNC phase, the leader +S0+ ({{{}_maxCommittedLog_{}}} = <1, 3>) 
will use TRUNC to sync with +S1+ ({{{}_lastLoggedZxid_{}}} = <1, 4>). Then, 
+S1+ removes txn <1, 4>. 

 

However, <1, 4> was committed and visible by clients before, and is not 
supposed to be truncated! 

 

Note: The trace can be constructed with quorum nodes alive at any moment with 
careful time tuning of node crash & restart, e.g., let +S1+ restart before +S0+ 
crashes at the end of Round 2.
h2.  
h2. Analysis

*Property Violation:*

>From the server side, the ensemble deletes a committed txn, which is not 
>allowed. From the client side, a client may read stale data after a newer 
>version is obtained, and that newer version cannot be obtained anymore.

ZOOKEEPER-3911 shows similar symptoms, but its fix only mitigates the 
occurrance of the problem without solving it at the root.

*Gap between Theory and Reality:*

The critical step here is to interrupt two updates that should be done 
together. However, there lie some gaps between the theory and the reality. In 
the Zab paper, a follower updates its current epoch and history during the SYNC 
phase in an *atomic* action, and the correctness of the Zab protocol is based 
on that. Unfortunately, in actual environment, a node crash (or other 
environment failures) may occur at any time, breaking the procedures that will 
not be interrupted at the protocol level. By digging into the code,  the 
*synchronized* section in the method {{syncWithLeader(long)}} in 
{{Learner.java}} can only prevent the procedure from interrupted by other 
threads, but environment failures can still interrupt the process. We think it 
important to keep ZooKeeper still in a correct condition even under such type 
of environment circumstances.

*Affected Versions:*

The above trace has been triggered in version 3.7.1 & 3.8.0 (the latest stable 
& current version till now) by our testing tools. The affected versions might 
be more, because the critical update order between the follower's current epoch 
and its history during SYNC stay unchanged across multiple versions.

 

I will provide more materials to demonstrate further details in two days.

 

 
h2. Possible Fix

In theory, this issue can be avoided by exchanging the order of writing 
{{_currentEpoch_}} to the currentEpoch file and writing the should-be-committed 
txns to the log file when the follower is processing NEWLEADER message. Writing 
txns to the log file before updating {{_currentEpoch_}} to the file is {_}a 
more c{_}onservative strategy, but it prevents the problem that a node with the 
incomplete committed txns can win an election with its higher 
_{{{}currentEpoch{}}}._ 

When it comes to the implementation of multi-threading, things become a little 
more complicated. In the procedure of processing NEWLEADER message in the 
method {{syncWithLeader(long)}} in {{Learner.java, }}simply exchanging the 
order of updating {{_currentEpoch_}} and writing txns to the log file cannot 
fix the problem, because these two updates are conducted by different threads. 
When the QuorumPeer thread calls {{logRequest(..) }}to submit log persistence 
task to the SyncRequestProcessor thread, the latter thread will write txns to 
disk asynchronously, without promising to finish that before the node's 
{{{_}currentEpoch{_}}} updated. 

Above all, the fix at the code level should guarantee the partial order that 
{{_currentEpoch_}} updated only after the learner has the same initial history 
with leader.

  was:
When a follower is processing the NEWLEADER message in SYNC phase, it will 
update its {{_currentEpoch_}} to the file *before* writing the txns (from the 
PROPOSALs sent by leader in SYNC) to the log file. Such order may lead to 
improper truncation of *committed* txns in later rounds.

The critical step to trigger this problem is to make a follower node crash 
right after it updates its {{_currentEpoch_}} to the file but before writing 
the txns to the log file. The potential risk is that, this node with incomplete 
committed txns might be elected as a leader with its larger {{_currentEpoch_}} 
and then improperly uses TRUNC to ask nodes to delete their committed txns!

 
h2. Trace

Here is an example to trigger the bug. (Focus on {{_currentEpoch_}} and 
{{{}_lastLoggedZxid_{}}})

{*}Round 1 (Running nodes with their acceptedEpoch & currentEpoch set to 
1{*}{*}):{*}
 - Start the ensemble with three nodes: S{+}0{+}, +S1+ & {+}S2{+}.
 - +S2+ is elected leader.
 - For all of them, {{_acceptedEpoch_}} = 1, _{{currentEpoch}}_ = 1.
 - Besides, all of them have {{_lastLoggedZxid_}} = <1, 3>, 
{{_lastProcessedZxid_}} = <1, 3>.
 - +S0+ crashes.
 - A new txn <1, 4> is logged and committed by +S1+ & {+}S2{+}. Then, +S1+ & 
+S2+ have {{_lastLoggedZxid_}} = <1, 4>, {{_lastProcessedZxid_}} = <1, 4> ( 
Clients can read the datatree with latest zxid <1, 4>).

*Round 2* {*}(Running nodes with their acceptedEpoch & currentEpoch set to 
2{*}{*}){*}{*}:{*}
 * +S0+ restarts, +S2+ restarts, and +S1+ crashes.
 * Again, +S2+ is elected leader.
 * During the DISCOVERY phase, +S0+ & +S2+ update their {{_acceptedEpoch_}} to 
2.
 * Then, during the SYNC phase, the leader +S2+ ({{{}_maxCommittedLog_{}}} = 
<1, 4>) uses DIFF to sync with the follower +S0+ ({{{}_lastLoggedZxid_{}}} = 
<1, 3>), and their {{_currentEpoch_}} will be set to 2 (and written to disk).
 * Note that the follower +S0+ updates its currentEpoch file before writing the 
txns to the log file when receiving NEWLEADER message.
 * *Unfortunately, right after the follower +S0+ finishes updating its 
currentEpoch file, it crashes.*

*Round 3* {*}(Running nodes with their acceptedEpoch & currentEpoch set to 
3{*}{*}){*}{*}:{*}
 * +S0+ & +S1+ restarts, and +S2+ crashes.
 * Since +S0+ has {{_currentEpoch_}} = 2, +S1+ has {{_currentEpoch_}} = 1, +S0+ 
will be elected leader.
 * During the SYNC phase, the leader +S0+ ({{{}_maxCommittedLog_{}}} = <1, 3>) 
will use TRUNC to sync with +S1+ ({{{}_lastLoggedZxid_{}}} = <1, 4>). Then, 
+S1+ removes txn <1, 4>. 

 

However, <1, 4> was committed and visible by clients before, and is not 
supposed to be truncated! 

 

Note: The trace can be constructed with quorum nodes alive at any moment with 
careful time tuning of node crash & restart, e.g., let +S1+ restart before +S0+ 
crashes at the end of Round 2.
h2.  
h2. Analysis

*Property Violation:*

>From the server side, the ensemble deletes a committed txn, which is not 
>allowed. From the client side, a client may read stale data after a newer 
>version is obtained, and that newer version cannot be obtained anymore.

ZOOKEEPER-3911 shows similar symptoms, but its fix only mitigates the 
occurrance of the problem without solving it at the root.

*Gap between Theory and Reality:*

The critical step here is to interrupt two updates that should be done 
together. However, there lie some gaps between the theory and the reality. In 
the Zab paper, a follower updates its current epoch and history during the SYNC 
phase in an *atomic* action, and the correctness of the Zab protocol is based 
on that. Unfortunately, in actual environment, a node crash (or other 
environment failures) may occur at any time, breaking the procedures that will 
not be interrupted at the protocol level. By digging into the code,  the 
*synchronized* section in the method {{syncWithLeader(long)}} in 
{{Learner.java}} can only prevent the procedure from interrupted by other 
threads, but environment failures can still interrupt the process. We think it 
important to keep ZooKeeper still in a correct condition even under such type 
of environment circumstances.

*Affected Versions:*

The above trace has been triggered in version 3.7.1 & 3.8.0 (the latest stable 
& current version till now) by our testing tools. The affected versions might 
be more, because the critical update order between the follower's current epoch 
and its history during SYNC stay unchanged across multiple versions.

 

I will provide more materials to demonstrate further details in two days.

 

 
h2. Possible Fix

In theory, this issue can be avoided by exchanging the order of writing 
{{_currentEpoch_}} to the currentEpoch file and writing the should-be-committed 
txns to the log file when the follower is processing NEWLEADER message. Writing 
txns to the log file before updating {{_currentEpoch_}} to the file is {_}a 
more c{_}onservative strategy, but it prevents the problem that a node with the 
incomplete committed txns can win an election with its higher 
_{{{}currentEpoch{}}}._ 

When it comes to the implementation of multi-threading, things become a little 
more complicated. In the procedure of processing NEWLEADER message in the 
method {{syncWithLeader(long)}} in {{{}Learner.java, s{}}}imply exchanging the 
order of updating {{_currentEpoch_}} and writing txns to the log file cannot 
fix the problem, because these two updates are conducted by different threads. 
When the QuorumPeer thread calls {{logRequest(..) }}to submit log persistence 
task to the SyncRequestProcessor thread, the latter thread will write txns to 
disk asynchronously, without promising to finish that before the node's 
{{_currentEpoch_}} updated. 

Above all, the fix at the code level should guarantee the partial order that 
{{_currentEpoch_}} updated only after the learner has the same initial history 
with leader.


> Committed txns may be improperly truncated if node crashes right after 
> updating currentEpoch 
> ---------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4643
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4643
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.8.0, 3.7.1
>            Reporter: Sirius
>            Priority: Critical
>
> When a follower is processing the NEWLEADER message in SYNC phase, it will 
> update its {{_currentEpoch_}} to the file *before* writing the txns (from the 
> PROPOSALs sent by leader in SYNC) to the log file. Such order may lead to 
> improper truncation of *committed* txns in later rounds.
> The critical step to trigger this problem is to make a follower node crash 
> right after it updates its {{_currentEpoch_}} to the file but before writing 
> the txns to the log file. The potential risk is that, this node with 
> incomplete committed txns might be elected as a leader with its larger 
> {{_currentEpoch_}} and then improperly uses TRUNC to ask nodes to delete 
> their committed txns!
>  
> h2. Trace
> Here is an example to trigger the bug. (Focus on {{_currentEpoch_}} and 
> {{{}_lastLoggedZxid_{}}})
> {*}Round 1 (Running nodes with their acceptedEpoch & currentEpoch set to 
> 1{*}{*}):{*}
>  - Start the ensemble with three nodes: S{+}0{+}, +S1+ & {+}S2{+}.
>  - +S2+ is elected leader.
>  - For all of them, {{_acceptedEpoch_}} = 1, _{{currentEpoch}}_ = 1.
>  - Besides, all of them have {{_lastLoggedZxid_}} = <1, 3>, 
> {{_lastProcessedZxid_}} = <1, 3>.
>  - +S0+ crashes.
>  - A new txn <1, 4> is logged and committed by +S1+ & {+}S2{+}. Then, +S1+ & 
> +S2+ have {{_lastLoggedZxid_}} = <1, 4>, {{_lastProcessedZxid_}} = <1, 4> ( 
> Clients can read the datatree with latest zxid <1, 4>).
> *Round 2* {*}(Running nodes with their acceptedEpoch & currentEpoch set to 
> 2{*}{*}){*}{*}:{*}
>  * +S0+ restarts, +S2+ restarts, and +S1+ crashes.
>  * Again, +S2+ is elected leader.
>  * During the DISCOVERY phase, +S0+ & +S2+ update their {{_acceptedEpoch_}} 
> to 2.
>  * Then, during the SYNC phase, the leader +S2+ ({{{}_maxCommittedLog_{}}} = 
> <1, 4>) uses DIFF to sync with the follower +S0+ ({{{}_lastLoggedZxid_{}}} = 
> <1, 3>), and their {{_currentEpoch_}} will be set to 2 (and written to disk).
>  * Note that the follower +S0+ updates its currentEpoch file before writing 
> the txns to the log file when receiving NEWLEADER message.
>  * *Unfortunately, right after the follower +S0+ finishes updating its 
> currentEpoch file, it crashes.*
> *Round 3* {*}(Running nodes with their acceptedEpoch & currentEpoch set to 
> 3{*}{*}){*}{*}:{*}
>  * +S0+ & +S1+ restarts, and +S2+ crashes.
>  * Since +S0+ has {{_currentEpoch_}} = 2, +S1+ has {{_currentEpoch_}} = 1, 
> +S0+ will be elected leader.
>  * During the SYNC phase, the leader +S0+ ({{{}_maxCommittedLog_{}}} = <1, 
> 3>) will use TRUNC to sync with +S1+ ({{{}_lastLoggedZxid_{}}} = <1, 4>). 
> Then, +S1+ removes txn <1, 4>. 
>  
> However, <1, 4> was committed and visible by clients before, and is not 
> supposed to be truncated! 
>  
> Note: The trace can be constructed with quorum nodes alive at any moment with 
> careful time tuning of node crash & restart, e.g., let +S1+ restart before 
> +S0+ crashes at the end of Round 2.
> h2.  
> h2. Analysis
> *Property Violation:*
> From the server side, the ensemble deletes a committed txn, which is not 
> allowed. From the client side, a client may read stale data after a newer 
> version is obtained, and that newer version cannot be obtained anymore.
> ZOOKEEPER-3911 shows similar symptoms, but its fix only mitigates the 
> occurrance of the problem without solving it at the root.
> *Gap between Theory and Reality:*
> The critical step here is to interrupt two updates that should be done 
> together. However, there lie some gaps between the theory and the reality. In 
> the Zab paper, a follower updates its current epoch and history during the 
> SYNC phase in an *atomic* action, and the correctness of the Zab protocol is 
> based on that. Unfortunately, in actual environment, a node crash (or other 
> environment failures) may occur at any time, breaking the procedures that 
> will not be interrupted at the protocol level. By digging into the code,  the 
> *synchronized* section in the method {{syncWithLeader(long)}} in 
> {{Learner.java}} can only prevent the procedure from interrupted by other 
> threads, but environment failures can still interrupt the process. We think 
> it important to keep ZooKeeper still in a correct condition even under such 
> type of environment circumstances.
> *Affected Versions:*
> The above trace has been triggered in version 3.7.1 & 3.8.0 (the latest 
> stable & current version till now) by our testing tools. The affected 
> versions might be more, because the critical update order between the 
> follower's current epoch and its history during SYNC stay unchanged across 
> multiple versions.
>  
> I will provide more materials to demonstrate further details in two days.
>  
>  
> h2. Possible Fix
> In theory, this issue can be avoided by exchanging the order of writing 
> {{_currentEpoch_}} to the currentEpoch file and writing the 
> should-be-committed txns to the log file when the follower is processing 
> NEWLEADER message. Writing txns to the log file before updating 
> {{_currentEpoch_}} to the file is {_}a more c{_}onservative strategy, but it 
> prevents the problem that a node with the incomplete committed txns can win 
> an election with its higher _{{{}currentEpoch{}}}._ 
> When it comes to the implementation of multi-threading, things become a 
> little more complicated. In the procedure of processing NEWLEADER message in 
> the method {{syncWithLeader(long)}} in {{Learner.java, }}simply exchanging 
> the order of updating {{_currentEpoch_}} and writing txns to the log file 
> cannot fix the problem, because these two updates are conducted by different 
> threads. When the QuorumPeer thread calls {{logRequest(..) }}to submit log 
> persistence task to the SyncRequestProcessor thread, the latter thread will 
> write txns to disk asynchronously, without promising to finish that before 
> the node's {{{_}currentEpoch{_}}} updated. 
> Above all, the fix at the code level should guarantee the partial order that 
> {{_currentEpoch_}} updated only after the learner has the same initial 
> history with leader.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to