Sirius created ZOOKEEPER-4643:
---------------------------------
Summary: Committed txns may be improperly truncated when node
crashes right after updating currentEpoch
Key: ZOOKEEPER-4643
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4643
Project: ZooKeeper
Issue Type: Bug
Components: quorum, server
Affects Versions: 3.7.1, 3.8.0
Reporter: Sirius
When a follower is processing the NEWLEADER message in SYNC phase, it will
update its currentEpoch to the currentEpoch file *before* writing the
uncommitted txns to the log file. Such order may lead to improper truncation of
*committed* txns in later rounds.
The critical step is to make a follower node crash right after it updates its
currentEpoch to the file but before writing the uncommitted txns to the log
file.
h2. Trace
Here is an example to trigger the bug. (Focus on the currentEpoch and the
lastLoggedZxid)
*Initial condition:*
- Start the ensemble with three nodes: A, B and C.
- Node C is elected leader.
- For all of them, acceptedEpoch=1, currentEpoch=1.
- Besides, all of them have lastLoggedZxid = <1, 3>, lastProcessedZxid = <1,
3>.
*Step 1:* Node A crashes.
*Step 2:* A new txn is logged and committed by Node B & C. Then, B & C have
lastLoggedZxid = <1, 4>, lastProcessedZxid = <1, 4> ( Clients can read the
datatree with latest zxid <1, 4>).
*Step 3:* Node A restarts, Node C restarts and Node B crashes. Again, C is
elected leader. During the DISCOVERY phase, both A and C update their
acceptedEpoch to 2. Then, during the SYNC phase, the leader C (maxCommittedLog
= <1, 4>) uses DIFF to sync with the follower A (lastLoggedZxid = <1, 3>). and
their currentEpoch will be set to 2 (and written to disk). Note that the
follower A updates its currentEpoch file before writing the uncommitted txns to
the log file when receiving NEWLEADER message. *Unfortunately, right after the
follower A finishes updating its currentEpoch file, it crashes.*
*Step 4:* Node A and B restarts and Node C crashes. Since Node A has
currentEpoch=2, Node C has currentEpoch=1, Node A will be elected leader.
During the SYNC phase, the leader A (maxCommittedLog = <1, 3>) will use TRUNC
to sync with B (lastLoggedZxid = <1, 4>). Then, B removes txn <1, 4>.
However, <1, 4> was committed and visible by clients before, and is not
supposed to be truncated!
(Note: With careful time tuning of crash & restart, the trace can be
constructed with quorum nodes alive at any moment.)
The above trace has been triggered by our testing tools. I will provide more
materials to further demonstrate above case soon. Besides, I think the affected
versions might be more.
h2. Analysis
The critical step here is to interrupt two updates that should be done
together. In the Zab paper, a follower updates its current epoch and history
during the SYNC phase in an *atomic* action, and the correctness of Zab is
based on that. However, in actual environment, a node crash (or other
environment failures) may occur at any time, breaking the procedures that will
not be interrupted at the protocol level. This is the gap between the theory
and the reality. As the system evolves, the implementation of the SYNC phase
becomes much more complicated compared to the original Zab protocol. We think
it important to keep ZooKeeper still in a correct condition even under such
type of environment circumstances.
h2. Possible Fix
Intuitively, this issue can be avoided by exchanging the order of writing
uncommitted txns to the log file and writing currentEpoch to the currentEpoch
file when the follower is processing NEWLEADER message. (See the code of
Learner.java)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)