[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753002#comment-16753002
 ] 

Fei Hui edited comment on HDFS-14230 at 1/26/19 11:25 AM:
----------------------------------------------------------

[~elgoiri] Digging deeper into the code:
* If HA is configured, the retry policy is FailoverOnNetworkExceptionRetry 
(NameNodeProxies.java).
* invoke in RetryInvocationHandler.java makes the RPC call and handles any 
exception.
* If the RPC server throws StandbyException, the failover action is 
RetryAction.RetryDecision.FAILOVER_AND_RETRY: 
FailoverOnNetworkExceptionRetry#shouldRetry (RetryPolicies.java) returns 
FAILOVER_AND_RETRY if failovers < maxFailovers.
* The client delays for some milliseconds.
* The client calls proxyProvider#performFailover, which advances the namenode 
index and switches the current proxy.
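
The decision step in the list above can be sketched roughly as follows. This is a simplified illustration, not the actual Hadoop code: RetryDecisionSketch, its Decision enum and the nested StandbyException are hypothetical stand-ins, and only the two cases discussed in this issue are modelled (the real policy distinguishes more exception types).

```java
import java.io.IOException;

// Simplified sketch; these are stand-in types, not the real Hadoop classes.
class RetryDecisionSketch {
    enum Decision { FAIL, FAILOVER_AND_RETRY }

    // Hypothetical stand-in for org.apache.hadoop.ipc.StandbyException.
    static class StandbyException extends IOException {
        StandbyException(String message) { super(message); }
    }

    // Assumption: models only "standby means fail over" and "failover budget
    // exhausted means fail". Everything else is treated as a hard failure here.
    static Decision shouldRetry(Exception e, int failovers, int maxFailovers) {
        if (failovers >= maxFailovers) {
            return Decision.FAIL;
        }
        if (e instanceof StandbyException) {
            return Decision.FAILOVER_AND_RETRY;
        }
        return Decision.FAIL;
    }

    public static void main(String[] args) {
        // A standby server leads to a failover while the budget lasts.
        System.out.println(shouldRetry(new StandbyException("standby"), 0, 15));
        // Any other exception is not retried in this sketch.
        System.out.println(shouldRetry(new IOException("No namenode available"), 0, 15));
    }
}
```

Under these assumptions, the exception type alone decides whether the client gets another chance, which is why the type thrown by the router matters.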


Overall, an HA client alternates between the namenodes until it reaches 
maxFailoverAttempts (default 15). So during a rolling upgrade, an HA client 
succeeds when it accesses the namenodes directly, but fails when it accesses 
the router.
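
To make the alternation concrete, here is a small simulation of that loop. It is only a sketch under stated assumptions: FailoverLoopSketch and all of its parameters are hypothetical, the delay between attempts is left out, and a namenode is simply assumed to become active from a given attempt number onward.

```java
// Simplified simulation of a client failing over between namenodes.
// All names are hypothetical; the real client also sleeps between attempts.
class FailoverLoopSketch {

    // Returns the number of calls made until one succeeds, or -1 if the
    // failover budget is used up first.
    // namenodes:         how many namenodes the client rotates through
    // activeIndex:       index of the namenode that eventually becomes active
    // activeFromAttempt: first attempt (0-based) at which it is active
    // maxFailovers:      failover budget (the comment above cites 15 as default)
    static int attemptsUntilSuccess(int namenodes, int activeIndex,
                                    int activeFromAttempt, int maxFailovers) {
        int current = 0;    // index of the namenode currently targeted
        int failovers = 0;  // failovers performed so far
        for (int attempt = 0; ; attempt++) {
            boolean active = current == activeIndex && attempt >= activeFromAttempt;
            if (active) {
                return attempt + 1;
            }
            // The targeted namenode is standby: fail over if the budget allows.
            if (failovers >= maxFailovers) {
                return -1;
            }
            failovers++;
            current = (current + 1) % namenodes;  // move on to the next namenode
        }
    }

    public static void main(String[] args) {
        // Two namenodes, the second one becomes active at attempt 4.
        System.out.println(attemptsUntilSuccess(2, 1, 4, 15));
        // No namenode ever becomes active within the budget.
        System.out.println(attemptsUntilSuccess(2, 1, Integer.MAX_VALUE, 15));
    }
}
```

In this model a short window with no active namenode is absorbed by the retries, as long as a namenode becomes active before the failover budget runs out.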





> RBF: Throw StandbyException instead of IOException when no namenodes available
> ------------------------------------------------------------------------------
>
>                 Key: HDFS-14230
>                 URL: https://issues.apache.org/jira/browse/HDFS-14230
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.2.0, 3.1.1, 2.9.2, 3.0.3
>            Reporter: Fei Hui
>            Assignee: Fei Hui
>            Priority: Major
>         Attachments: HDFS-14230-HDFS-13891.001.patch, 
> HDFS-14230-HDFS-13891.002.patch
>
>
> Failover usually happens when upgrading namenodes, and for a few seconds 
> there are no active namenodes. Accessing HDFS through the router fails during 
> this window, which can cause jobs to fail or hang. Some Hive job logs 
> follow:
> {code:java}
> 2019-01-03 16:12:08,337 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 
> 133.33 sec
> MapReduce Total cumulative CPU time: 2 minutes 13 seconds 330 msec
> Ended Job = job_1542178952162_24411913
> Launching Job 4 out of 6
> Exception in thread "Thread-86" java.lang.RuntimeException: 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): No namenode 
> available under nameservice Cluster3
>     at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.shouldRetry(RouterRpcClient.java:328)
>     at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:488)
>     at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:495)
>     at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:385)
>     at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeSequential(RouterRpcClient.java:760)
>     at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getFileInfo(RouterRpcServer.java:1152)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:849)
>     at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1867)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2130)
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>  Operation category READ is not supported in state standby
>     at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1804)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1338)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3925)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1014)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:849)
>     at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1867)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2130)
> {code}
> Looking into the code, maybe we can throw StandbyException when no namenodes 
> are available. The client will then fail only after some retries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
