Adam Binford created HDFS-17710:
-----------------------------------
Summary: Standby node can load unpersisted edits from JournalNode cache
Key: HDFS-17710
URL: https://issues.apache.org/jira/browse/HDFS-17710
Project: Hadoop HDFS
Issue Type: Bug
Components: journal-node
Affects Versions: 3.4.1
Reporter: Adam Binford
A standby or observer node can load edits from the JournalNodes that were never
durably persisted to disk. This can cause the standby or observer node to
incorrectly think that the last committed transaction ID is higher than it
actually is. Here is the scenario that led us to discover this:
We have three NameNodes, NN1, NN2, and NN3. NN1 is active, NN2 is standby, and
NN3 is observer. NN2 was failing to upload fsimage checkpoints to the other
NameNodes, for reasons we are still investigating. But because no checkpoint
could be completed, the JournalNodes could never clean up old edit files. This
led all 3 of our JournalNodes to slowly fill up and eventually run out of disk
space. Because all the JournalNodes store effectively the same things, they all
filled up at nearly the same time.
Since the JournalNodes could no longer write new transactions, NN1 and NN2 both
entered restart loops: as soon as they finished booting up, left safe mode, and
were made active by the ZKFC, they crashed because they could not persist new
transactions. NN3 stayed up in observer mode the whole time, never crashing
since it never tried to write new transactions.
Because they are just on VMs, we simply increased the disk size of the
JournalNodes to get them functioning again. At that point NN1 and NN2 were
still in the process of booting up, so we put NN3 into standby mode so that the
ZKFC could make it active right away, getting our system back online. NN1 and
NN2 then failed to boot up due to a missing edits file on the JournalNodes.
We believe this all stems from the fact that transactions are added to the edit
cache on the JournalNodes [before they are persisted to
disk|https://github.com/apache/hadoop/blob/f38d7072566e88c77e47d1533e4be4c1bd98a06a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java#L433].
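As a rough illustration of why that ordering matters, here is a minimal sketch
(hypothetical class and method names, not the actual Journal.java code) of a
write path that makes a batch visible to in-memory readers before it has been
durably synced to disk:
{code:java}
import java.io.IOException;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch only: hypothetical names, not the real Journal.java.
class JournalNodeSketch {
    // In-memory cache of recent edit batches, keyed by the first txid of each
    // batch, and served to standby/observer NameNodes that tail via RPC.
    private final NavigableMap<Long, byte[]> editCache = new TreeMap<>();

    void journal(long firstTxId, byte[] records) throws IOException {
        // 1. The batch becomes visible to tailing readers here...
        editCache.put(firstTxId, records);

        // 2. ...but durability only happens here. If the write or sync fails
        //    (e.g. "No space left on device"), the cache still holds
        //    transactions that never made it to disk.
        writeAndSyncToEditLogFile(firstTxId, records);
    }

    // Placeholder for writing to the current edits file and fsyncing it.
    private void writeAndSyncToEditLogFile(long firstTxId, byte[] records)
            throws IOException {
        // ...
    }
}
{code}
Caching the batch only after the write and sync succeed (or dropping it from
the cache when they fail) would presumably avoid handing unpersisted
transactions to tailing readers.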
We think what happened is something like:
* Before disks filled up, NN1 successfully committed transaction 0096 to the
JournalNodes.
* NN1 attempted to write transactions 0097 and 0098 to the JournalNodes. These
transactions got added to the edit cache, but then failed to persist to disk
because the disk was full. The write failed on NN1, and it crashed and
restarted. NN2 then became active and entered the same crash-and-restart loop.
* NN3 was tailing the edits, and the JournalNodes all returned transactions
0097 and 0098 from the edit cache. Because of this, NN3 thinks that everything
up through transaction 0098 has been durably persisted (see the sketch after
this list).
* Disk sizes are increased and the JournalNodes are able to write transactions
again.
* NN3 becomes active, thinks that everything up through transaction 0098 has
been committed, and begins writing new transactions starting at 0099, and the
JournalNodes update their committed transaction ID up to 0099.
* No JournalNode actually has transactions 0097 and 0098 written to disk, so
when NN1 and NN2 start up, they fail to load edits from the JournalNodes: the
journals report that edits should exist up through transaction 0099, but no
file containing those edits can be found.
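To make the tailing step above concrete, here is a minimal sketch of the
consumer side (hypothetical names and types, not the real tailing code),
assuming the standby treats anything the JournalNodes return, including edits
served straight from the in-memory cache, as durably committed:
{code:java}
import java.util.List;

// Sketch only: hypothetical names, not the real edit log tailer.
class EditLogTailerSketch {
    private long lastAppliedTxId = 96; // NN3's view before the incident (0096)

    void tailOnce(JournalClient journals) {
        // This call can be answered from the JournalNodes' edit cache, so it
        // may return 0097 and 0098 even though they never reached disk.
        List<Edit> edits = journals.getEditsSince(lastAppliedTxId + 1);
        for (Edit e : edits) {
            applyToNamespace(e);
            lastAppliedTxId = e.txId; // NN3 now believes 0098 is committed
        }
        // When NN3 later becomes active, it starts writing at
        // lastAppliedTxId + 1 (0099), even though 0097 and 0098 were never
        // durably persisted anywhere.
    }

    private void applyToNamespace(Edit e) { /* apply to in-memory namespace */ }

    // Hypothetical helper types, only to keep the sketch self-contained.
    interface JournalClient { List<Edit> getEditsSince(long fromTxId); }
    static class Edit { long txId; }
}
{code}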
I had to manually delete all edits files associated with any transaction >=
0099, and manually edit the committed-txn file back to 0096, to finally get all
the NameNodes to boot back up in a consistent state.