[GitHub] [hadoop] xkrogen commented on pull request #4628: HDFS-16689. NameNode may crash when transitioning to Active with in-progress tailer if there are some abnormal JNs.

GitBox Fri, 12 Aug 2022 12:01:28 -0700


xkrogen commented on PR #4628:
URL: https://github.com/apache/hadoop/pull/4628#issuecomment-1213426216


   @ZanderXu @ferhui sorry for being late to the discussion here.
   
   Please correct me if I'm wrong, but I believe this introduces a correctness 
issue. Unless you can dispute the situation I outline below, we will need to 
revert this PR.
   
   Using `onlyDurableTxns` when initializing the new active is essential to 
maintain a correct transaction history. Envision the following scenario:
   
   1. Currently NN0 is active and JN0-2 all have txn 2 committed.
   2. NN0 attempts to write txn 3. It only succeeds to JN0, and crashes before 
writing to JN1/JN2.
   3. We fail over to NN1, which currently has txns up to 1
   4. NN1 attempts to load most recent state from JNs
   4a. Previously, NN1 uses `onlyDurableTxns=true`, so it will only load txn 2. 
Good!
   4b. With this PR, NN1 uses `onlyDurableTxns=false`, so if it gets a response 
from JN0, it will load and apply txn 2 AND 3.
   
   In situation (4b), txn 3 should NOT be loaded by NN1, because it was not 
durably committed -- it only landed on 1 of 3 JNs. Thus NN1 should discard it 
when failing over.
   
   This can result in a serious data loss issue. Let's say we operate for some 
time after going through 1-4b above, and NN1 writes many new txns. In the 
meantime, the disk on JN0 has failed -- it no longer has a reference to txn 3. 
Since JN1 and JN2 never got txn 3, the only copy of txn 3 is lost. Now if NN1 
crashes and we need to recover the state from the latest fsimage + JNs, we are 
unable to, because there is a hole in the transaction log at txn 3. All state 
at txn > 3 must be thrown away because we don't have a consistent history.
   
   I agree that the situation you described in the original bug report is 
valid, but this is not the right way to solve it. Instead, we need to:
   1. Enhance the startup/failover logic to continue attempting to call 
`doTailEdits()` (with `onlyDurableTxns=true`!) until enough txns have been 
loaded to be current with the JNs. Two options I see are modifying 
`FSNamesystem#startActiveServices()` to either (a) use `getEditLogManifest()` 
to find the `committedTxnId` and use this as the low-watermark for loading 
edits, discarding newer ones (as we do in `selectStreamingInputStreams()`) or 
(b) keep the loading logic the same but catch the `IllegalStateException` and 
have some number of retries where we attempt to re-fetch more edits from the 
JNs. I think (a) is a much more elegant solution.
   2. [optional] To reduce the chance of this situation occurring, it would be 
better if we had a way to allow for `QuorumJournalManager#waitForWriteQuorum()` 
to allow for some configurable delay before returning the results. As-is, as 
soon as 2 of 3 JNs respond, it will return. If you have a situation where JN0 
is lagging, but JN2 is temporarily slow, then JN0 and JN1 will always respond 
first, so you'll never see the durable transactions (which can only be fetched 
with a response from JN1 and JN2). It would be better if you could specify some 
brief extra "wait period" after getting the 2 necessary for quorum, to give a 
chance for a 3rd to respond as well, to get a more full picture. This isn't 
required to fix the issue here, but it would help minimize the occurrence of 
this kind of situation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

[GitHub] [hadoop] xkrogen commented on pull request #4628: HDFS-16689. NameNode may crash when transitioning to Active with in-progress tailer if there are some abnormal JNs.

Reply via email to