[jira] [Commented] (HDFS-17368) HA: Standy should exit safemode when resources are from low available

ASF GitHub Bot (Jira) Sat, 09 Mar 2024 03:07:05 -0800


    [ https://issues.apache.org/jira/browse/HDFS-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824925#comment-17824925 ]


ASF GitHub Bot commented on HDFS-17368:
---------------------------------------

zhuzilong2013 commented on code in PR #6518:
URL: https://github.com/apache/hadoop/pull/6518#discussion_r1518555454


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -1582,6 +1582,10 @@ void startStandbyServices(final Configuration conf, boolean isObserver)
       standbyCheckpointer = new StandbyCheckpointer(conf, this);
       standbyCheckpointer.start();
     }
+    if (isNoManualAndResourceLowSafeMode()) {
+      LOG.info("Standby should not enter safe mode when resources are low, exiting safe mode.");
+      leaveSafeMode(false);

Review Comment:
   I reused the logic from [HDFS-17231](https://issues.apache.org/jira/browse/HDFS-17231), which lets the ANN exit ResourceLowSafeMode automatically, so I don't expect a problem here.
   However, I noticed that `leaveSafeMode(false)` also exits StartupSafeMode. I'm not sure whether that is an issue; I described this behavior in [HDFS-17402](https://issues.apache.org/jira/browse/HDFS-17402).
   If necessary, I can fix it.
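
   A minimal sketch of the StartupSafeMode concern, using a toy model rather than the real FSNamesystem code. The class `SafeModeSketch`, the `Reason` enum, and both `leave*` methods are hypothetical names invented for illustration; Hadoop does not necessarily track safe mode reasons this way.

```java
import java.util.EnumSet;
import java.util.Set;

// Toy model only: these types are hypothetical and are NOT FSNamesystem's API.
final class SafeModeSketch {
  enum Reason { STARTUP, MANUAL, LOW_RESOURCES }

  // Reasons that currently hold this node in safe mode.
  private final Set<Reason> active = EnumSet.noneOf(Reason.class);

  void enter(Reason r) { active.add(r); }

  boolean inSafeMode() { return !active.isEmpty(); }

  boolean has(Reason r) { return active.contains(r); }

  // Models the concern: a blanket leave clears every reason, STARTUP included.
  void leaveAll() { active.clear(); }

  // Narrower alternative: clear only the low-resource reason.
  void leaveLowResourcesOnly() { active.remove(Reason.LOW_RESOURCES); }

  public static void main(String[] args) {
    SafeModeSketch nn = new SafeModeSketch();
    nn.enter(Reason.STARTUP);
    nn.enter(Reason.LOW_RESOURCES);
    nn.leaveLowResourcesOnly();
    System.out.println("still in safe mode: " + nn.inSafeMode());
  }
}
```

   In this model the node stays in safe mode after `leaveLowResourcesOnly()` because the startup reason is still set, whereas `leaveAll()` would clear it as well.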





> HA: Standy should exit safemode when resources are from low available
> ---------------------------------------------------------------------
>
>                 Key: HDFS-17368
>                 URL: https://issues.apache.org/jira/browse/HDFS-17368
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Zilong Zhu
>            Assignee: Zilong Zhu
>            Priority: Major
>              Labels: pull-request-available
>
> The NameNodeResourceMonitor (NNRM) automatically enters safemode (SM) when it 
> detects that resources are not sufficient. NNRM runs only on the ANN. If both 
> ANN and SNN run low on resources, the ANN enters SM; when SNN's disk space is 
> later restored, SNN will become ANN and ANN will become SNN. However, at this 
> point, the new SNN will not exit SM, even after its disk recovers.
> Consider the following scenario:
>  * Initially, nn-1 is active and nn-2 is standby. Resources in 
> dfs.namenode.name.dir are insufficient on both nn-1 and nn-2; the 
> NameNodeResourceMonitor detects the resource issue and puts nn-1 into safemode.
>  * At this point, nn-1 is in safemode (ON) and active, while nn-2 is in 
> safemode (OFF) and standby.
>  * After a period of time, the resources in nn-2's dfs.namenode.name.dir 
> recover, triggering failover.
>  * Now, nn-1 is in safe mode (ON) and standby, while nn-2 is in safe mode 
> (OFF) and active.
>  * Afterward, the resources in nn-1's dfs.namenode.name.dir recover.
>  * However, since nn-1 is standby but in safemode (ON), it is unable to exit 
> safe mode automatically.
> There are two possible ways to fix this issue:
>  # If the SNN is detected to be in SM (because of low resources), it exits SM.
>  # Or, since we already have HDFS-17231, we can revert HDFS-2914, bringing 
> NNRM back to the SNN.
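
The scenario above can be replayed with a small toy simulation. This is a sketch under stated assumptions, not Hadoop code: `FailoverSketch`, `Node`, `monitorTick`, `becomeStandby`, and the `applyOption1` flag are hypothetical names, and the model only captures that the resource monitor runs on the active node and that option 1 clears low-resource safe mode on the standby.

```java
// Toy simulation only: class and method names are hypothetical, not Hadoop APIs.
final class FailoverSketch {
  static final class Node {
    boolean active;
    boolean resourceLowSafeMode;

    Node(boolean active) { this.active = active; }

    // Stands in for the resource monitor, which per the description
    // runs only on the active node.
    void monitorTick(boolean resourcesOk) {
      if (!active) return;
      if (!resourcesOk) resourceLowSafeMode = true;
    }

    // Option 1 from the description: a node that becomes standby
    // drops low-resource safe mode.
    void becomeStandby(boolean applyOption1) {
      active = false;
      if (applyOption1) resourceLowSafeMode = false;
    }

    void becomeActive() { active = true; }
  }

  // Replays the scenario and reports whether nn-1 ends up stuck in safe mode.
  static boolean nn1StuckInSafeMode(boolean applyOption1) {
    Node nn1 = new Node(true);
    Node nn2 = new Node(false);
    nn1.monitorTick(false);          // nn-1 is active and low on disk: enters safe mode
    nn2.monitorTick(false);          // nn-2 is standby: no monitor, stays out of safe mode
    nn1.becomeStandby(applyOption1); // failover after nn-2's disk recovers
    nn2.becomeActive();
    nn1.monitorTick(true);           // nn-1's disk recovers, but no monitor runs on standby
    return nn1.resourceLowSafeMode;
  }

  public static void main(String[] args) {
    System.out.println("without fix, nn-1 stuck: " + nn1StuckInSafeMode(false));
    System.out.println("with option 1, nn-1 stuck: " + nn1StuckInSafeMode(true));
  }
}
```

Without the fix the model leaves nn-1 in safe mode after failover; with option 1 enabled, nn-1 leaves low-resource safe mode when it becomes standby.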



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

