[ 
https://issues.apache.org/jira/browse/HDFS-15555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188986#comment-17188986
 ] 

Akira Ajisaka commented on HDFS-15555:
--------------------------------------

The following code refreshes the cache:
https://github.com/apache/hadoop/blob/b6a3286d27b604322fddc1ec06ad563fd8a9d0f4/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterRpcClient.java#L424-L428

{{failover}} is set to true when the IOException is one of the "Unavailable 
Exceptions".
https://github.com/apache/hadoop/blob/b6a3286d27b604322fddc1ec06ad563fd8a9d0f4/hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterRpcClient.java#L441-L445
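
As a rough illustration of the check described above, here is a minimal, self-contained sketch. It is not the actual RouterRpcClient code: StandbyLikeException is a hypothetical stand-in for Hadoop's StandbyException, and the set of types treated as unavailable (ConnectException, EOFException) is an assumption made for the example. It only shows why a plain SocketException would leave {{failover}} false.

```java
import java.io.EOFException;
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketException;

final class FailoverSketch {

  /** Hypothetical stand-in for Hadoop's StandbyException. */
  static class StandbyLikeException extends IOException {
    StandbyLikeException(String msg) {
      super(msg);
    }
  }

  /**
   * Simplified classification. The exact list of types is an assumption;
   * the point is that a plain SocketException matches none of them.
   */
  static boolean isUnavailable(IOException ioe) {
    return ioe instanceof ConnectException
        || ioe instanceof EOFException
        || ioe instanceof StandbyLikeException;
  }

  public static void main(String[] args) {
    // Standby-style error: failover would be set, so the cache is refreshed.
    System.out.println(isUnavailable(new StandbyLikeException("standby")));
    // Plain SocketException: failover stays false, so no refresh happens.
    System.out.println(isUnavailable(new SocketException("Connection reset")));
  }
}
```

Note that java.net.ConnectException is a subclass of SocketException, so a check on ConnectException alone does not cover other socket errors such as a connection reset.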

> RBF: Refresh cacheNS when SocketException occurs
> ------------------------------------------------
>
>                 Key: HDFS-15555
>                 URL: https://issues.apache.org/jira/browse/HDFS-15555
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: rbf
>         Environment: HDFS 3.3.0, Java 11
>            Reporter: Akira Ajisaka
>            Assignee: Akira Ajisaka
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Problem:
> When the active NameNode is restarted and is loading the fsimage, DFSRouters 
> slow down significantly.
> Investigation:
> When the active NameNode is restarted and is loading the fsimage, 
> RouterRpcClient receives a SocketException. Since 
> RouterRpcClient#isUnavailableException(IOException) returns false when the 
> argument is a SocketException, MembershipNameNodeResolver#cacheNS is not 
> refreshed. As a result, the order of the NameNodes returned by 
> MembershipNameNodeResolver#getNamenodesForNameserviceId(String) is unchanged 
> and the active NameNode is still returned first. Therefore RouterRpcClient 
> keeps trying to connect to the NameNode that is loading the fsimage.
> After loading the fsimage, the NameNode throws StandbyException. That 
> exception is one of the 'Unavailable Exceptions', so cacheNS is refreshed.
> Workaround:
> Stop the NameNode and wait 1 minute before starting it again, instead of 
> restarting it.
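
The stale-ordering effect described in the investigation can be sketched with a toy cache. This is only an illustration under stated assumptions: CachedOrderSketch, report, and invalidate are hypothetical names, not the MembershipNameNodeResolver API, and the example assumes the other NameNode has already been reported as active while the restarted one loads its fsimage.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CachedOrderSketch {
  // Latest reported state per NameNode; true means active.
  private final Map<String, Boolean> reported = new LinkedHashMap<>();
  // Cached ordering; null means it must be rebuilt on the next lookup.
  private List<String> cache;

  void report(String nn, boolean active) {
    reported.put(nn, active);
  }

  /** Drops the cached ordering (what a refresh on failover would do). */
  void invalidate() {
    cache = null;
  }

  /** Returns NameNodes with active ones first, using the cache if present. */
  List<String> getNamenodes() {
    if (cache == null) {
      List<String> sorted = new ArrayList<>(reported.keySet());
      // Key is false for active NameNodes, and false sorts before true.
      sorted.sort(Comparator.comparing((String nn) -> !reported.get(nn)));
      cache = sorted;
    }
    return cache;
  }

  public static void main(String[] args) {
    CachedOrderSketch resolver = new CachedOrderSketch();
    resolver.report("nn1", true);
    resolver.report("nn2", false);
    System.out.println(resolver.getNamenodes()); // [nn1, nn2]

    // nn1 restarts; assume nn2 is now reported as active.
    resolver.report("nn1", false);
    resolver.report("nn2", true);
    // No invalidation happened, so the stale order is still returned.
    System.out.println(resolver.getNamenodes()); // [nn1, nn2]

    // Once the cache is invalidated, the order reflects the new state.
    resolver.invalidate();
    System.out.println(resolver.getNamenodes()); // [nn2, nn1]
  }
}
```

In this toy model, an exception that does not trigger invalidate() leaves callers retrying nn1 first, which matches the slowdown reported above.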


