[
https://issues.apache.org/jira/browse/HDFS-17710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18023147#comment-18023147
]
ASF GitHub Bot commented on HDFS-17710:
---------------------------------------
Kimahriman opened a new pull request, #8000:
URL: https://github.com/apache/hadoop/pull/8000
### Description of PR
Update the JournalNode journaling process to persist new edits to disk before
adding them to the edit cache. Previously, standby and observer nodes tailing
the logs could read transactions from the edit cache that had not actually
been durably persisted, leaving the standby or observer in an invalid state
in which the latest committed transaction appears higher than it really is.
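A minimal sketch of the reordered flow (simplified and illustrative, not the actual `Journal.journal()` implementation; the `DurableLog` interface here is a hypothetical stand-in for the on-disk edit log writer): the durable write happens first, so a failed write leaves the cache untouched and tailing readers never see the edits.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Simplified model of the fix: persist edits before caching them.
class JournalSketch {
    // Illustrative stand-in for the on-disk edit log writer.
    interface DurableLog {
        void writeAndSync(byte[] records) throws IOException;
    }

    private final DurableLog log;
    private final List<byte[]> editCache = new ArrayList<>();

    JournalSketch(DurableLog log) {
        this.log = log;
    }

    void journal(byte[] records) throws IOException {
        // Persist first: if the disk write fails, this throws and the
        // edits never become visible to tailing readers via the cache.
        log.writeAndSync(records);
        // Only durably persisted edits are added to the cache.
        editCache.add(records);
    }

    int cachedEdits() {
        return editCache.size();
    }
}
```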
### How was this patch tested?
New UT using a new fault injector method to simulate a disk write failure.
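The fault-injector pattern referenced above generally looks like the sketch below (class and method names are hypothetical, not the actual hook added in this PR): production code calls a no-op singleton hook, and the test swaps in an instance whose hook throws at the desired point.

```java
import java.io.IOException;

// Illustrative fault-injector pattern (names are hypothetical, not the
// actual Hadoop API): production uses a no-op singleton; tests replace
// it to force a failure at a precise point in the write path.
class JournalFaultInjector {
    static JournalFaultInjector instance = new JournalFaultInjector();

    // No-op in production; a test subclass throws here to simulate a
    // disk write failure.
    void beforeDiskWrite() throws IOException {
    }
}

class EditLogWriter {
    void persist(byte[] edits) throws IOException {
        // Injection point: the test hook runs just before the write.
        JournalFaultInjector.instance.beforeDiskWrite();
        // ... actual disk write and sync would go here ...
    }
}
```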
### For code changes:
- [x] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [x] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [x] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [x] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
> Standby node can load unpersisted edit from JournalNode cache
> -------------------------------------------------------------
>
> Key: HDFS-17710
> URL: https://issues.apache.org/jira/browse/HDFS-17710
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: journal-node
> Affects Versions: 3.4.1
> Reporter: Adam Binford
> Priority: Major
> Labels: pull-request-available
>
> A standby or observer node can load edits from the journal node that failed
> to be durably persisted. This can cause the standby or observer node to
> incorrectly think that the last committed transaction ID is higher than it
> actually is. Here is the scenario that led us to discover it:
> We have three NameNodes, NN1, NN2, and NN3. NN1 is active, NN2 is standby,
> and NN3 is observer. NN2 was failing to upload fsimage checkpoints to the
> other NameNodes, for reasons we are still investigating. But because a
> checkpoint was never able to be fully created, the JournalNodes could never
> clean up old edit files. This led all 3 of our JournalNodes to slowly fill up
> and eventually run out of disk space. Because all the JournalNodes store
> effectively the same things, they all filled up at nearly the same time.
> Since the JournalNodes could no longer write new transactions, NN1 and NN2
> both entered restart loops: as soon as each finished booting up, exited safe
> mode, and was made active by the ZKFC, it crashed because it could not
> persist new transactions. NN3 stayed up in observer mode the
> whole time, never crashing as it never tried to write new transactions.
> Because they are just on VMs, we simply increased the disk size of the
> JournalNodes to get them functioning again. After this, NN1 and NN2 were
> still in the process of booting up, so we put NN3 into standby mode so that
> the ZKFC could make it active right away, getting our system back online.
> After this, NN1 and NN2 failed to boot up due to a missing edits file on the
> journal nodes.
> We believe this all stems from the fact that transactions are added to the
> edit cache on the journal nodes [before they are persisted to
> disk|https://github.com/apache/hadoop/blob/f38d7072566e88c77e47d1533e4be4c1bd98a06a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java#L433].
> We think what happened is something like:
> * Before disks filled up, NN1 successfully committed transaction 0096 to the
> Journal Nodes.
> * NN1 attempted to write transactions 0097 and 0098 to the journal nodes.
> These transactions got added to the edit cache, but then failed to persist to
> disk because the disk was full. The write failed on NN1 and it crashed and
> restarted. NN2 then became active and entered the same crash and restart loop.
> * NN3 was tailing the edits, and the journal nodes all returned transactions
> 0097 and 0098 from the edit cache. Because of this, NN3 thinks that
> transactions up through 0098 have been durably persisted.
> * Disk sizes are increased and journal nodes are able to write transactions
> again.
> * NN3 becomes active, thinks that transactions up through 0098 have been committed,
> and begins writing new transactions starting at 0099, and the journal nodes
> update their committed transaction ID up to 0099.
> * No journal nodes actually have transactions 0097 and 0098 written to disk,
> so when NN1 and NN2 start up, they fail to load edits from the journal nodes,
> because the journals think they should have edits up through transaction 0099
> but can't find any file with those edits.
> I had to manually delete all edits files associated with any transaction >=
> 0099, and manually edit the committed-txn file back to 0096 to finally get
> all the NameNodes to boot back up to a consistent state.
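The failure sequence described above can be reproduced with a small model (an assumed, simplified sketch, not Hadoop code) of the pre-fix ordering, where edits enter the cache before the disk write:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Simplified model of the pre-fix JournalNode behavior: edits are
// cached BEFORE being persisted, so a tailing standby/observer can
// see transactions that never made it to disk.
class BuggyJournalModel {
    final List<Long> disk = new ArrayList<>();   // durably persisted txn ids
    final List<Long> cache = new ArrayList<>();  // in-memory edit cache
    boolean diskFull = false;

    void journal(long txId) throws IOException {
        cache.add(txId);  // the bug: cached before the durable write
        if (diskFull) {
            throw new IOException("No space left on device");
        }
        disk.add(txId);
    }

    // Highest transaction a tailing reader can see via the cache.
    long highestTailableTxId() {
        return cache.isEmpty() ? -1 : cache.get(cache.size() - 1);
    }

    // Highest transaction actually persisted to disk.
    long highestDurableTxId() {
        return disk.isEmpty() ? -1 : disk.get(disk.size() - 1);
    }
}
```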
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]