[ https://issues.apache.org/jira/browse/HDFS-15555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188986#comment-17188986 ]
Akira Ajisaka commented on HDFS-15555:
--------------------------------------

The following code refreshes the cache:
https://github.com/apache/hadoop/blob/b6a3286d27b604322fddc1ec06ad563fd8a9d0f4/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterRpcClient.java#L424-L428

{{failover}} is set to true when the IOException is one of the "unavailable" exceptions, i.e. when RouterRpcClient#isUnavailableException(IOException) returns true:
https://github.com/apache/hadoop/blob/b6a3286d27b604322fddc1ec06ad563fd8a9d0f4/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterRpcClient.java#L441-L445

> RBF: Refresh cacheNS when SocketException occurs
> ------------------------------------------------
>
>                 Key: HDFS-15555
>                 URL: https://issues.apache.org/jira/browse/HDFS-15555
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: rbf
>         Environment: HDFS 3.3.0, Java 11
>            Reporter: Akira Ajisaka
>            Assignee: Akira Ajisaka
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Problem:
> When the active NameNode is restarted and is loading the fsimage, DFSRouters
> slow down significantly.
>
> Investigation:
> When the active NameNode is restarted and is loading the fsimage,
> RouterRpcClient receives a SocketException. Since
> RouterRpcClient#isUnavailableException(IOException) returns false when the
> argument is a SocketException, MembershipNamenodeResolver#cacheNS is not
> refreshed. As a result, the order of the NameNodes returned by
> MembershipNamenodeResolver#getNamenodesForNameserviceId(String) is unchanged
> and the active NameNode is still returned first. RouterRpcClient therefore
> keeps trying to connect to the NameNode that is loading the fsimage.
> After loading the fsimage, the NameNode throws StandbyException. This
> exception is one of the "unavailable" exceptions, so cacheNS is refreshed.
>
> Workaround:
> Stop the NameNode and wait 1 minute before starting it again, instead of
> restarting it directly.
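
To make the investigation above concrete, here is a minimal, self-contained sketch of the failover decision. It is not the RouterRpcClient code: FailoverSketch, CachedResolver, isUnavailableBefore, isUnavailableAfter and handleFailure are hypothetical names, the StandbyException here is a local stand-in for the Hadoop class, and the real list of "unavailable" exceptions is longer than the single one the issue text confirms. Treating SocketException as unavailable is shown only as one possible way to get the behaviour the issue title asks for.

```java
import java.io.IOException;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified, hypothetical model of the failover decision described in the issue. */
public class FailoverSketch {

  /** Local stand-in for Hadoop's StandbyException, so the sketch has no Hadoop dependency. */
  static class StandbyException extends IOException {
    StandbyException(String msg) {
      super(msg);
    }
  }

  /** Hypothetical stand-in for the resolver's cached NameNode ordering (cacheNS). */
  static class CachedResolver {
    private final List<String> order = new ArrayList<>(Arrays.asList("nn1", "nn2"));
    int refreshCount = 0;

    /** Returns a copy of the cached order; the first entry is tried first. */
    List<String> getNamenodes() {
      return new ArrayList<>(order);
    }

    /** Models a cache refresh by moving the failed NameNode to the end of the list. */
    void refresh(String failed) {
      if (order.remove(failed)) {
        order.add(failed);
      }
      refreshCount++;
    }
  }

  /** Behaviour described in the issue: a SocketException is not treated as "unavailable". */
  static boolean isUnavailableBefore(IOException e) {
    return e instanceof StandbyException;
  }

  /** Assumed change: additionally treat SocketException as "unavailable". */
  static boolean isUnavailableAfter(IOException e) {
    return e instanceof StandbyException || e instanceof SocketException;
  }

  /** The cache is refreshed only when the failover flag is set. */
  static void handleFailure(CachedResolver resolver, String namenode, boolean failover) {
    if (failover) {
      resolver.refresh(namenode);
    }
  }

  public static void main(String[] args) {
    IOException e = new SocketException("Connection reset");

    CachedResolver before = new CachedResolver();
    handleFailure(before, "nn1", isUnavailableBefore(e));
    System.out.println("before: " + before.getNamenodes()); // prints "before: [nn1, nn2]"

    CachedResolver after = new CachedResolver();
    handleFailure(after, "nn1", isUnavailableAfter(e));
    System.out.println("after: " + after.getNamenodes()); // prints "after: [nn2, nn1]"
  }
}
```

In the "before" case the cached order never changes, so the restarting NameNode keeps being tried first. That matches the slowdown described in the Problem section.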