[GitHub] [hadoop] ZanderXu commented on pull request #4744: HDFS-16689. Standby NameNode crashes when transitioning to Active with in-progress tailer
ZanderXu commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1294684900

@xkrogen Could you please give this a final review?

@ashutoshcipher @tomscut @ayushtkn @Hexiaoqiao Could you please also review it when you are available?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[GitHub] [hadoop] ZanderXu commented on pull request #4744: HDFS-16689. Standby NameNode crashes when transitioning to Active with in-progress tailer
ZanderXu commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1226679229

```
if (curSegment != null) {
  LOG.warn("Client is requesting a new log segment " + txid +
      " though we are already writing " + curSegment + ". " +
      "Aborting the current segment in order to begin the new one." +
      " ; journal id: " + journalId);
  // The writer may have lost a connection to us and is now
  // re-connecting after the connection came back.
  // We should abort our own old segment.
  abortCurSegment();
}
```

`abortCurSegment()` only aborts the current segment; it does not finalize the current in-progress segment, so it may leave two in-progress segment files on disk.

> So are we agreed that the best way forward is to modify recoverUnclosedStreams() to throw exception on failure, then we can use inProgressOk = false to solve this problem as you originally proposed?

Yes, I fully agree, and I will update this patch accordingly.
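The direction agreed on above (recovery throws on failure, then catch-up reads with `inProgressOk = false`) can be illustrated with a minimal sketch. This is a toy model and not Hadoop code: `FailoverSketch`, `ToyJournal`, `Segment`, `recoverUnclosedSegments`, `selectSegments`, and `catchUpDuringFailover` are hypothetical names chosen for illustration, and the real `recoverUnclosedStreams()` works against JournalNodes rather than an in-memory list.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Toy model of the agreed approach; none of these types are Hadoop classes. */
class FailoverSketch {

  /** Hypothetical stand-in for one edit log segment. */
  static final class Segment {
    final long firstTxId;
    boolean inProgress;

    Segment(long firstTxId, boolean inProgress) {
      this.firstTxId = firstTxId;
      this.inProgress = inProgress;
    }
  }

  /** Hypothetical stand-in for the journal that the standby reads from. */
  static final class ToyJournal {
    final List<Segment> segments = new ArrayList<>();
    // Test hook (an assumption of this sketch) that simulates a failed finalization.
    boolean finalizeShouldFail = false;

    /** Finalizes every in-progress segment and throws instead of ignoring a failure. */
    void recoverUnclosedSegments() throws IOException {
      for (Segment s : segments) {
        if (s.inProgress) {
          if (finalizeShouldFail) {
            throw new IOException(
                "Could not finalize segment starting at txid " + s.firstTxId);
          }
          s.inProgress = false;
        }
      }
    }

    /** Returns the segments, skipping in-progress ones unless inProgressOk is set. */
    List<Segment> selectSegments(boolean inProgressOk) {
      List<Segment> selected = new ArrayList<>();
      for (Segment s : segments) {
        if (inProgressOk || !s.inProgress) {
          selected.add(s);
        }
      }
      return selected;
    }
  }

  /** Failover catch-up: recover first, then read finalized segments only. */
  static List<Segment> catchUpDuringFailover(ToyJournal journal) throws IOException {
    journal.recoverUnclosedSegments();   // a failure here aborts the transition
    return journal.selectSegments(false); // inProgressOk = false
  }
}
```

Because recovery either finalizes every segment or throws, the later read never silently skips an unfinalized segment; in this sketch a failed finalization surfaces as an `IOException` to the caller.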
[GitHub] [hadoop] ZanderXu commented on pull request #4744: HDFS-16689. Standby NameNode crashes when transitioning to Active with in-progress tailer
ZanderXu commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1222443598

@abhishekkarigar Thanks for your attention to this issue. @xkrogen and I will resolve this problem as soon as possible.

@xkrogen Please review the latest patch. Thanks.
[GitHub] [hadoop] ZanderXu commented on pull request #4744: HDFS-16689. Standby NameNode crashes when transitioning to Active with in-progress tailer
ZanderXu commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1221235271

@xkrogen Thanks.

> I think you might have also understood what I was saying in my last comment

Yes, I got it.

> I am thinking we can add a new parameter to LogsPurgeable#selectInputStreams() like preferBulkReads.

This is a good idea; I will update the patch this way.

> In recoverUnclosedStreams, if the finalization fails, it will just ignore it and assume that it will be handled later (by Journal#startLogSegment(), which will automatically close an old stream when you try to open a new one).

> In recoverUnclosedStreams, if the finalization fails, it will just ignore it

Sorry, I had not noticed this. I think ignoring the failure is unreasonable.

> assume that it will be handled later (by Journal#startLogSegment(), which will automatically close an old stream when you try to open a new one).

Sorry, I found this comment, but I could not find the code that finalizes the previous in-progress segment. Could you point me to it? Thanks.

> If there is something preventing the new active from communicating with the JNs, or something preventing the JNs from finalizing the old segment, then the NN will eventually fail to become active regardless.

Yes, I agree. The standby should crash or fail to become active if it cannot finalize the old segment. How about fixing this case in a separate PR?
[GitHub] [hadoop] ZanderXu commented on pull request #4744: HDFS-16689. Standby NameNode crashes when transitioning to Active with in-progress tailer
ZanderXu commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1220527843

@xkrogen I have updated this patch so that in-progress tailing can be enabled or disabled in `QuorumJournalManager`. Please help me review it. If you have a better idea, I'd be happy to revise the patch accordingly.
[GitHub] [hadoop] ZanderXu commented on pull request #4744: HDFS-16689. Standby NameNode crashes when transitioning to Active with in-progress tailer
ZanderXu commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1220377078

> If the active crashed, then the segment won't be finalized, right?

If the active crashed, the standby recovers unclosed streams via `recoverUnclosedStreams` while it is starting active services. So by the time `catchupDuringFailover` runs, the last segment should always be closed.
[GitHub] [hadoop] ZanderXu commented on pull request #4744: HDFS-16689. Standby NameNode crashes when transitioning to Active with in-progress tailer
ZanderXu commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1218994368

The same approach was also used in HDFS-14806; see [here](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/BootstrapStandby.java#L113).
[GitHub] [hadoop] ZanderXu commented on pull request #4744: HDFS-16689. Standby NameNode crashes when transitioning to Active with in-progress tailer
ZanderXu commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1218979907

@xkrogen Thanks for your review and comments. Sorry for the late reply.

Before `catchupDuringFailover`, whether the active crashed or transitioned to standby successfully, the last segment on a majority of JournalNodes should already be finalized.

My idea is for `catchupDuringFailover` to skip `selectRpcInputStreams` and fall back to `selectStreamingInputStreams` with `getEditLogManifest`.

Unfortunately, the only way for `catchupDuringFailover` to skip `selectRpcInputStreams` is to disable in-progress reads, because `inProgressTailingEnabled` in `QuorumJournalManager` cannot be changed.

The key code is as follows:

```
@Override
public void selectInputStreams(Collection<EditLogInputStream> streams,
    long fromTxnId, boolean inProgressOk,
    boolean onlyDurableTxns) throws IOException {
  // Here, catchupDuringFailover should skip this if branch and
  // fall back to selectStreamingInputStreams.
  if (inProgressOk && inProgressTailingEnabled) {
    LOG.debug("Tailing edits starting from txn ID {} via RPC mechanism",
        fromTxnId);
    try {
      Collection<EditLogInputStream> rpcStreams = new ArrayList<>();
      selectRpcInputStreams(rpcStreams, fromTxnId, onlyDurableTxns);
      streams.addAll(rpcStreams);
      return;
    } catch (IOException ioe) {
      LOG.warn("Encountered exception while tailing edits >= " + fromTxnId +
          " via RPC; falling back to streaming.", ioe);
    }
  }
  selectStreamingInputStreams(streams, fromTxnId, inProgressOk,
      onlyDurableTxns);
}
```
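To make the control flow described above easier to follow in isolation, here is a minimal, self-contained sketch of the same "RPC first, streaming as fallback" gate. It is only an illustration under stated assumptions: `SelectSketch` and `StreamSource` are hypothetical names, streams are modeled as plain strings, and the `onlyDurableTxns` parameter is omitted; it is not the real `QuorumJournalManager`.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;

/** Toy model of the selection logic; not the real QuorumJournalManager. */
class SelectSketch {

  /** Hypothetical source of edit streams, modeled here as plain strings. */
  interface StreamSource {
    void select(Collection<String> streams, long fromTxnId) throws IOException;
  }

  private final boolean inProgressTailingEnabled; // fixed at construction time
  private final StreamSource rpcSource;
  private final StreamSource streamingSource;

  SelectSketch(boolean inProgressTailingEnabled,
               StreamSource rpcSource, StreamSource streamingSource) {
    this.inProgressTailingEnabled = inProgressTailingEnabled;
    this.rpcSource = rpcSource;
    this.streamingSource = streamingSource;
  }

  void selectInputStreams(Collection<String> streams, long fromTxnId,
                          boolean inProgressOk) throws IOException {
    // The RPC path is taken only when the caller allows in-progress edits
    // AND the feature is enabled; passing inProgressOk = false skips it.
    if (inProgressOk && inProgressTailingEnabled) {
      try {
        // Collect into a scratch list so a mid-way failure adds nothing.
        Collection<String> rpcStreams = new ArrayList<>();
        rpcSource.select(rpcStreams, fromTxnId);
        streams.addAll(rpcStreams);
        return;
      } catch (IOException ioe) {
        // Fall through to the streaming path below.
      }
    }
    streamingSource.select(streams, fromTxnId);
  }
}
```

In this sketch, a caller that passes `inProgressOk = false` always reaches the streaming source, which mirrors why disabling in-progress reads is the only lever available when the tailing flag itself is fixed.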
[GitHub] [hadoop] ZanderXu commented on pull request #4744: HDFS-16689. Standby NameNode crashes when transitioning to Active with in-progress tailer
ZanderXu commented on PR #4744:
URL: https://github.com/apache/hadoop/pull/4744#issuecomment-1215017852

@ferhui @xkrogen This PR uses `getEditLogManifest()` to fix this problem. Please help me review it, thanks.