xkrogen commented on PR #4628: URL: https://github.com/apache/hadoop/pull/4628#issuecomment-1213426216
@ZanderXu @ferhui sorry for being late to the discussion here. Please correct me if I'm wrong, but I believe this introduces a correctness issue. Unless you can dispute the situation I outline below, we will need to revert this PR. Using `onlyDurableTxns` when initializing the new active is essential to maintain a correct transaction history. Envision the following scenario: 1. Currently NN0 is active and JN0-2 all have txn 2 committed. 2. NN0 attempts to write txn 3. It only succeeds to JN0, and crashes before writing to JN1/JN2. 3. We fail over to NN1, which currently has txns up to 1 4. NN1 attempts to load most recent state from JNs 4a. Previously, NN1 uses `onlyDurableTxns=true`, so it will only load txn 2. Good! 4b. With this PR, NN1 uses `onlyDurableTxns=false`, so if it gets a response from JN0, it will load and apply txn 2 AND 3. In situation (4b), txn 3 should NOT be loaded by NN1, because it was not durably committed -- it only landed on 1 of 3 JNs. Thus NN1 should discard it when failing over. This can result in a serious data loss issue. Let's say we operate for some time after going through 1-4b above, and NN1 writes many new txns. In the meantime, the disk on JN0 has failed -- it no longer has a reference to txn 3. Since JN1 and JN2 never got txn 3, the only copy of txn 3 is lost. Now if NN1 crashes and we need to recover the state from the latest fsimage + JNs, we are unable to, because there is a hole in the transaction log at txn 3. All state at txn > 3 must be thrown away because we don't have a consistent history. I agree that the situation you described in the original bug report is valid, but this is not the right way to solve it. Instead, we need to: 1. Enhance the startup/failover logic to continue attempting to call `doTailEdits()` (with `onlyDurableTxns=true`!) until enough txns have been loaded to be current with the JNs. Two options I see are modifying `FSNamesystem#startActiveServices()` to either (a) use `getEditLogManifest()` to find the `committedTxnId` and use this as the low-watermark for loading edits, discarding newer ones (as we do in `selectStreamingInputStreams()`) or (b) keep the loading logic the same but catch the `IllegalStateException` and have some number of retries where we attempt to re-fetch more edits from the JNs. I think (a) is a much more elegant solution. 2. [optional] To reduce the chance of this situation occurring, it would be better if we had a way to allow for `QuorumJournalManager#waitForWriteQuorum()` to allow for some configurable delay before returning the results. As-is, as soon as 2 of 3 JNs respond, it will return. If you have a situation where JN0 is lagging, but JN2 is temporarily slow, then JN0 and JN1 will always respond first, so you'll never see the durable transactions (which can only be fetched with a response from JN1 and JN2). It would be better if you could specify some brief extra "wait period" after getting the 2 necessary for quorum, to give a chance for a 3rd to respond as well, to get a more full picture. This isn't required to fix the issue here, but it would help minimize the occurrence of this kind of situation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org