[ https://issues.apache.org/jira/browse/HDFS-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiaoqiao He resolved HDFS-17231.
--------------------------------
    Fix Version/s: 3.4.0
     Hadoop Flags: Reviewed
       Resolution: Fixed

> HA: Safemode should exit when resources are from low to available
> -----------------------------------------------------------------
>
>                 Key: HDFS-17231
>                 URL: https://issues.apache.org/jira/browse/HDFS-17231
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 3.3.4, 3.3.6
>            Reporter: kuper
>            Assignee: kuper
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>         Attachments: 企业微信截图_75d15d37-26b7-4d88-ac0c-8d77e358761b.png
>
>
> The NameNodeResourceMonitor automatically puts the NameNode into safemode
> when it detects that resources are insufficient. When zkfc detects
> insufficient resources, it triggers a failover. Consider the following
> scenario:
> * Initially, nn01 is active and nn02 is standby. Because resources in
>   dfs.namenode.name.dir are insufficient, the NameNodeResourceMonitor
>   detects the problem and puts nn01 into safemode. Subsequently, zkfc
>   triggers a failover.
> * At this point, nn01 is in safemode (ON) and standby, while nn02 is in
>   safemode (OFF) and active.
> * After a period of time, the resources in nn01's dfs.namenode.name.dir
>   recover, which causes a slight instability and triggers a failover again.
> * Now, nn01 is in safemode (ON) and active, while nn02 is in safemode (OFF)
>   and standby.
> * However, since nn01 is active but still in safemode (ON), HDFS can
>   neither be read from nor written to.
> !企业微信截图_75d15d37-26b7-4d88-ac0c-8d77e358761b.png!
> *reproduction*
> # Increase dfs.namenode.resource.du.reserved.
> # Increase ha.health-monitor.check-interval.ms so that nn01 is not switched
>   directly to standby, which would stop the NameNodeResourceMonitor thread.
>   The NameNodeResourceMonitor must first put nn01 into safemode, and only
>   then should nn01 switch to standby.
> # On the active node nn01, use the dd command to create a file that
>   exceeds the threshold, triggering a low-available-disk-space condition.
> # If the nn01 NameNode process is not dead, nn01 ends up in safemode (ON)
>   and standby.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
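
The scenario above can be replayed with a small toy model. This is only an illustrative sketch, not Hadoop code: the class NameNode, the functions failover, run_scenario and cluster_serviceable, and the flag exit_on_recovery are all hypothetical names introduced here. The flag stands in for the behaviour the issue title asks for (leave safemode once resources go from low to available); it is not a description of the actual patch. The model also assumes, following reproduction step 2, that the resource monitor only runs on the active node.

```python
from dataclasses import dataclass


@dataclass
class NameNode:
    """Toy stand-in for a NameNode; not the real Hadoop class."""
    name: str
    active: bool
    safemode: bool = False
    resources_ok: bool = True

    def monitor_tick(self, exit_on_recovery: bool) -> None:
        """One simulated resource check (hypothetical, simplified)."""
        # Assumption: the monitor thread is stopped while the node is standby.
        if not self.active:
            return
        if not self.resources_ok:
            # Low resources: enter safemode, as described in the report.
            self.safemode = True
        elif exit_on_recovery:
            # Assumed remedy: leave safemode once resources are available.
            # This toy has no other safemode causes (manual, startup), which
            # a real implementation would have to tell apart.
            self.safemode = False


def failover(a: NameNode, b: NameNode) -> None:
    """Swap the active/standby roles of the two nodes."""
    a.active, b.active = b.active, a.active


def cluster_serviceable(*nodes: NameNode) -> bool:
    """The cluster can serve requests only if an active node is out of safemode."""
    return any(n.active and not n.safemode for n in nodes)


def run_scenario(exit_on_recovery: bool):
    nn01 = NameNode("nn01", active=True)
    nn02 = NameNode("nn02", active=False)

    nn01.resources_ok = False            # disk fills up on nn01
    nn01.monitor_tick(exit_on_recovery)  # nn01 enters safemode
    failover(nn01, nn02)                 # nn01 standby, nn02 active

    nn01.resources_ok = True             # resources recover on nn01
    failover(nn01, nn02)                 # nn01 active again, nn02 standby
    nn01.monitor_tick(exit_on_recovery)  # monitor runs on the active node
    return nn01, nn02


if __name__ == "__main__":
    for flag in (False, True):
        nn01, nn02 = run_scenario(flag)
        print(flag, nn01.safemode, cluster_serviceable(nn01, nn02))
```

With exit_on_recovery set to False the model ends in the reported state, nn01 active and safemode (ON), so no node can serve requests. With the flag set to True, nn01 leaves safemode on the first check after it becomes active again.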