[ 
https://issues.apache.org/jira/browse/HDFS-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754679#comment-16754679
 ] 

CR Hota commented on HDFS-14230:
--------------------------------

[~elgoiri] [~brahmareddy] [~ferhui] Thanks for sharing your thoughts.

Here are my thoughts on this area.

I am inclined towards the approach where we do NOT retry from the same router.
Retrying within the same router can spike RPC queue times for clients trying
to access other clusters that are healthy. Multiple retries from the same
router keep handler threads blocked for longer and affect other clients in the
process. Having clients go to a separate router would, at a high level, mean
re-queuing the RPC, which gives better overall fairness.
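
To make the fail-fast idea concrete, here is a minimal sketch. It is not the
actual RouterRpcClient code: StandbyException and Namenode below are
hypothetical local stand-ins, and the real router sees these errors wrapped in
RemoteException. Each namenode is tried once, and if none is active the
handler thread is released by throwing to the client instead of looping.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class RouterFailFastSketch {

    /** Hypothetical stand-in for org.apache.hadoop.ipc.StandbyException. */
    static class StandbyException extends IOException {
        StandbyException(String msg) {
            super(msg);
        }
    }

    /** Hypothetical view of one downstream namenode. */
    interface Namenode {
        String getFileInfo(String path) throws IOException;
    }

    /**
     * Tries every namenode of a nameservice exactly once. If all of them are
     * in standby, throws StandbyException so the client can fail over to
     * another router; the handler thread is not held for further retries.
     */
    static String invoke(String nameservice, List<Namenode> namenodes,
                         String path) throws IOException {
        for (Namenode nn : namenodes) {
            try {
                return nn.getFileInfo(path);
            } catch (StandbyException e) {
                // This namenode is in standby; move on to the next one.
            }
        }
        throw new StandbyException(
            "No namenode available under nameservice " + nameservice);
    }

    public static void main(String[] args) {
        // Simulates a failover window: both namenodes report standby.
        Namenode standby = path -> {
            throw new StandbyException(
                "Operation category READ is not supported in state standby");
        };
        try {
            invoke("Cluster3", Arrays.asList(standby, standby), "/tmp/file");
        } catch (StandbyException e) {
            System.out.println("Client should try another router: "
                + e.getMessage());
        } catch (IOException e) {
            System.out.println("Unexpected failure: " + e.getMessage());
        }
    }
}
```

The point of the sketch is only that the number of downstream calls per client
request stays bounded by the number of namenodes, so a nameservice that is
mid-failover cannot pin a handler thread.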

The Router is fundamentally different from the HA client. The current HA client
need not be multi-tenant aware. The Router has to be more multi-tenant aware
(think of each downstream namenode as a tenant) and fair. HDFS-14090 plans to
introduce better resource isolation; in that case RetriableException would
make more sense, as dedicated/isolated resources will be allotted per namenode.

> RBF: Throw StandbyException instead of IOException when no namenodes available
> ------------------------------------------------------------------------------
>
>                 Key: HDFS-14230
>                 URL: https://issues.apache.org/jira/browse/HDFS-14230
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.2.0, 3.1.1, 2.9.2, 3.0.3
>            Reporter: Fei Hui
>            Assignee: Fei Hui
>            Priority: Major
>         Attachments: HDFS-14230-HDFS-13891.001.patch, 
> HDFS-14230-HDFS-13891.002.patch
>
>
> Failover usually happens when upgrading namenodes. For a few seconds there
> is no active namenode, and accessing HDFS through the router fails during
> that window. This can make jobs fail or hang. Some Hive job logs are as
> follows:
> {code:java}
> 2019-01-03 16:12:08,337 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 
> 133.33 sec
> MapReduce Total cumulative CPU time: 2 minutes 13 seconds 330 msec
> Ended Job = job_1542178952162_24411913
> Launching Job 4 out of 6
> Exception in thread "Thread-86" java.lang.RuntimeException: 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): No namenode 
> available under nameservice Cluster3
>     at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.shouldRetry(RouterRpcClient.java:328)
>     at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:488)
>     at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:495)
>     at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:385)
>     at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeSequential(RouterRpcClient.java:760)
>     at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getFileInfo(RouterRpcServer.java:1152)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:849)
>     at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1867)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2130)
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>  Operation category READ is not supported in state standby
>     at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1804)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1338)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3925)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1014)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:849)
>     at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1867)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2130)
> {code}
> Looking deeper into the code, maybe we can throw StandbyException when no
> namenodes are available. The client will fail after some retries.



