[ https://issues.apache.org/jira/browse/HDFS-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824925#comment-17824925 ]
ASF GitHub Bot commented on HDFS-17368:
---------------------------------------

zhuzilong2013 commented on code in PR #6518:
URL: https://github.com/apache/hadoop/pull/6518#discussion_r1518555454

##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -1582,6 +1582,10 @@ void startStandbyServices(final Configuration conf, boolean isObserver)
       standbyCheckpointer = new StandbyCheckpointer(conf, this);
       standbyCheckpointer.start();
     }
+    if (isNoManualAndResourceLowSafeMode()) {
+      LOG.info("Standby should not enter safe mode when resources are low, exiting safe mode.");
+      leaveSafeMode(false);

Review Comment:
   I reused the logic from [HDFS-17231](https://issues.apache.org/jira/browse/HDFS-17231), and I believe there is no issue. HDFS-17231 enables the ANN to automatically exit ResourceLowSafeMode.
   At the same time, I noticed that the 'leaveSafeMode(false)' method also exits 'StartupSafeMode'. I'm not sure if this is an issue; I mentioned this phenomenon in [HDFS-17402](https://issues.apache.org/jira/browse/HDFS-17402). If necessary, I can fix it.

> HA: Standy should exit safemode when resources are from low available
> ---------------------------------------------------------------------
>
>                 Key: HDFS-17368
>                 URL: https://issues.apache.org/jira/browse/HDFS-17368
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Zilong Zhu
>            Assignee: Zilong Zhu
>            Priority: Major
>              Labels: pull-request-available
>
> The NameNodeResourceMonitor automatically enters safemode when it detects
> that the resources are not sufficient. NNRM is only in ANN. If both ANN and
> SNN enter SM due to low resources, and later SNN's disk space is restored,
> SNN will become ANN and ANN will become SNN. However, at this point, SNN
> will not exit the SM, even if the disk is recovered.
> Consider the following scenario:
>  * Initially, nn-1 is active and nn-2 is standby. Resources are insufficient
> in dfs.namenode.name.dir on both nn-1 and nn-2; the NameNodeResourceMonitor
> detects the resource issue and puts nn-1 into safemode.
>  * At this point, nn-1 is in safemode (ON) and active, while nn-2 is in
> safemode (OFF) and standby.
>  * After a period of time, the resources in nn-2's dfs.namenode.name.dir
> recover, triggering failover.
>  * Now, nn-1 is in safe mode (ON) and standby, while nn-2 is in safe mode
> (OFF) and active.
>  * Afterward, the resources in nn-1's dfs.namenode.name.dir recover.
>  * However, since nn-1 is standby but in safemode (ON), it is unable to exit
> safe mode automatically.
> There are two possible ways to fix this issue:
>  # If SNN is detected to be in SM (because of low resources), it will exit.
>  # Or, since we already have HDFS-17231, we can revert HDFS-2914, bringing
> NNRM back to SNN.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
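
The scenario in the quoted issue description can be traced with a small toy model. This is only a sketch under stated assumptions: `SafeModeSketch`, `Node`, `monitorTick`, and the `exitOnStandbyStart` flag are hypothetical names invented for illustration and are not Hadoop classes or APIs. The model assumes the resource monitor runs only on the active node (as the issue states), that the active node leaves resource-low safe mode automatically once resources recover (the behavior attributed to HDFS-17231 in the review comment), and that the proposed change leaves resource-low safe mode when a node starts standby services.

```java
// Toy model of the HDFS-17368 scenario. None of these types exist in Hadoop;
// all names are hypothetical and the logic is deliberately simplified.
public class SafeModeSketch {
    enum HaState { ACTIVE, STANDBY }

    static final class Node {
        final String name;
        HaState state;
        boolean resourcesAvailable = true;
        boolean resourceLowSafeMode = false;
        boolean manualSafeMode = false;
        // Assumption: models the check added in startStandbyServices by the PR.
        final boolean exitOnStandbyStart;

        Node(String name, HaState state, boolean exitOnStandbyStart) {
            this.name = name;
            this.state = state;
            this.exitOnStandbyStart = exitOnStandbyStart;
        }

        // Models the resource monitor, which per the issue runs only on the active node.
        void monitorTick() {
            if (state != HaState.ACTIVE) {
                return; // a standby node is never re-evaluated
            }
            if (!resourcesAvailable) {
                resourceLowSafeMode = true;
            } else if (resourceLowSafeMode && !manualSafeMode) {
                resourceLowSafeMode = false; // automatic exit on the active node
            }
        }

        void transitionToStandby() {
            state = HaState.STANDBY;
            if (exitOnStandbyStart && resourceLowSafeMode && !manualSafeMode) {
                resourceLowSafeMode = false; // proposed fix: leave safe mode on standby start
            }
        }

        void transitionToActive() {
            state = HaState.ACTIVE;
        }

        boolean inSafeMode() {
            return manualSafeMode || resourceLowSafeMode;
        }
    }

    // Replays the bullet list from the issue; returns whether nn-1 ends up in safe mode.
    static boolean runScenario(boolean withFix) {
        Node nn1 = new Node("nn-1", HaState.ACTIVE, withFix);
        Node nn2 = new Node("nn-2", HaState.STANDBY, withFix);

        // Both nodes run low on resources; only the active node's monitor reacts.
        nn1.resourcesAvailable = false;
        nn2.resourcesAvailable = false;
        nn1.monitorTick();
        nn2.monitorTick();

        // nn-2 recovers and a failover happens.
        nn2.resourcesAvailable = true;
        nn1.transitionToStandby();
        nn2.transitionToActive();

        // nn-1 recovers later, but it is now standby.
        nn1.resourcesAvailable = true;
        nn1.monitorTick();
        nn2.monitorTick();

        return nn1.inSafeMode();
    }

    public static void main(String[] args) {
        System.out.println("without fix, nn-1 in safe mode: " + runScenario(false));
        System.out.println("with fix, nn-1 in safe mode: " + runScenario(true));
    }
}
```

In this model the standby stays stuck without the extra check because nothing ever re-evaluates its resources. The sketch does not model StartupSafeMode, so it says nothing about the `leaveSafeMode(false)` concern raised in the review comment.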