Akira Ajisaka created HDFS-15555:
------------------------------------

             Summary: RBF: Refresh cacheNS when SocketException occurs
                 Key: HDFS-15555
                 URL: https://issues.apache.org/jira/browse/HDFS-15555
             Project: Hadoop HDFS
          Issue Type: Sub-task
          Components: rbf
            Reporter: Akira Ajisaka
            Assignee: Akira Ajisaka
Problem:
When the active NameNode is restarted and is loading its fsimage, DFSRouters slow down significantly.

Investigation:
While the restarted NameNode is loading the fsimage, RouterRpcClient receives a SocketException. Because RouterRpcClient#isUnavailableException(IOException) returns false when the argument is a SocketException, MembershipNamenodeResolver#cacheNS is not refreshed. As a result, the order of the NameNodes returned by MembershipNamenodeResolver#getNamenodesForNameserviceId(String) is unchanged and the restarted NameNode is still returned first, so RouterRpcClient keeps trying to connect to the NameNode that is loading the fsimage. Once the fsimage is loaded, the NameNode throws StandbyException. That exception is treated as an "unavailable" exception, so cacheNS is refreshed at that point.

Workaround:
Stop the NameNode and wait 1 minute before starting it, instead of restarting it.
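The mechanism described under Investigation can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the Hadoop code: ToyResolver, StandbyLikeException, onFailure, and the boolean switch are hypothetical names invented for the illustration, and the "refresh" here simply moves the failed NameNode to the end of the list. It shows why a SocketException that is not classified as "unavailable" leaves the cached order untouched, and how classifying it as unavailable (one possible reading of the issue summary) would change the order.

```java
import java.io.IOException;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.List;

public class CacheRefreshSketch {

    /** Hypothetical local stand-in for Hadoop's StandbyException. */
    static class StandbyLikeException extends IOException {
        StandbyLikeException(String msg) { super(msg); }
    }

    /** Toy resolver (hypothetical): the first entry of the cached list is tried first. */
    static class ToyResolver {
        private final List<String> cachedOrder = new ArrayList<>(List.of("nn1", "nn2"));

        List<String> getNamenodes() { return List.copyOf(cachedOrder); }

        // Simplified "refresh": push the failed NameNode to the end of the list.
        void refresh(String failed) {
            if (cachedOrder.remove(failed)) {
                cachedOrder.add(failed);
            }
        }
    }

    /**
     * Simplified classification. With the switch off, only the standby-like
     * exception counts as unavailable, mirroring the behavior in the report.
     */
    static boolean isUnavailable(IOException e, boolean socketExceptionIsUnavailable) {
        if (e instanceof StandbyLikeException) {
            return true;
        }
        return socketExceptionIsUnavailable && e instanceof SocketException;
    }

    /** The cache is refreshed only when the failure is classified as unavailable. */
    static void onFailure(ToyResolver resolver, String namenode, IOException e,
                          boolean socketExceptionIsUnavailable) {
        if (isUnavailable(e, socketExceptionIsUnavailable)) {
            resolver.refresh(namenode);
        }
    }

    public static void main(String[] args) {
        // Reported behavior: SocketException does not trigger a refresh.
        ToyResolver reported = new ToyResolver();
        onFailure(reported, "nn1", new SocketException("Connection reset"), false);
        System.out.println("SocketException ignored: " + reported.getNamenodes());

        // Assumed fix: SocketException is treated as unavailable.
        ToyResolver assumedFix = new ToyResolver();
        onFailure(assumedFix, "nn1", new SocketException("Connection reset"), true);
        System.out.println("SocketException handled: " + assumedFix.getNamenodes());
    }
}
```

In the first case the list stays [nn1, nn2], so the next call would again go to the NameNode that is still loading its fsimage; in the second case the list becomes [nn2, nn1].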