[ https://issues.apache.org/jira/browse/HADOOP-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085760#comment-13085760 ]
Uma Maheswara Rao G commented on HADOOP-7488:
---------------------------------------------

Hi Konstantin, thanks a lot for taking a look at this issue.

{quote}
If rpcTimeout > 0 then {{handleTimeout()}} will throw SocketTimeoutException instead of going into ping loop. Can you control the required behavior by setting rpcTimeout > 0 rather introducing the # of pings limit.
{quote}

Yes, we can control it with this parameter as well. I am planning to add the code below in the DataNode when getting the proxy:

{code}
// get NN proxy
DatanodeProtocol dnp =
    (DatanodeProtocol) RPC.waitForProxy(DatanodeProtocol.class,
        DatanodeProtocol.versionID, nnAddr, conf, socketTimeout, Long.MAX_VALUE);
{code}

Here socketTimeout is passed as the rpcTimeout. This property is already used as the rpcTimeout for createInterDataNodeProtocolProxy: {{this.socketTimeout = conf.getInt(DFS_CLIENT_SOCKET_TIMEOUT_KEY, HdfsConstants.READ_TIMEOUT);}}

But my question is: if I use socketTimeout (default 60*1000 ms) as the rpcTimeout, the default behavior will change. I don't want to change the default behavior here. Any suggestion for this? (One possible opt-in approach is sketched at the end of this comment.)

{quote}
DataNodes and TaskTrackers are designed to ping NN and JT infinitely, because during startup you cannot predict when NN will come online as it depends on the size of the image and edits. Also when NN becomes busy it is important for DNs to keep retrying rather than assuming the NN is dead.
{quote}

Yes. But in some scenarios, such as a network unplug, timeouts may occur, and because of the current timeout handling the system will be blocked unnecessarily for a long time. As far as I know, even if we throw that timeout exception out to the JT or DN, they will handle it and retry again in their offerService methods, except in the condition below:

{code}
catch(RemoteException re) {
  String reClass = re.getClassName();
  if (UnregisteredNodeException.class.getName().equals(reClass)
      || DisallowedDatanodeException.class.getName().equals(reClass)
      || IncorrectVersionException.class.getName().equals(reClass)) {
    LOG.warn("blockpool " + blockPoolId + " is shutting down", re);
    shouldServiceRun = false;
    return;
  }
{code}

{quote}
And even if they don't this should be an HDFS change not generic IPC change, which affects many Hadoop components
{quote}

What I felt is that this issue applies to all the components that use Hadoop IPC. I also planned to retain the default behavior as it is, so that the other components are not affected; a user who really requires it can tune the configuration parameter based on his requirement. Anyway, since we decided to use rpcTimeout, only the IPC user code should pass this value, and in that case this becomes an HDFS-specific change. We also need to check MapReduce (the same situation applies to the JT).

{quote}
As for HA I don't know what you did for HA and therefore cannot understand what problem you are trying to solve here. I can guess that you want DNs switch to another NN when they timeout rather than retrying. In this case you should be able to use rpcTimeout
{quote}

Yes, your guess is correct :-) In our HA solution we are using the *BackupNode*, and the switching framework is a *Zookeeper based* LeaderElection. DNs have both the active and standby node addresses configured, and on any failure they try to switch to the other NN. The scenario here is: we unplugged the active NN's network card, and then all DNs were blocked for a long time.
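
To make the "keep the default behavior" idea concrete, here is a rough sketch (not an actual patch) of how the DataNode could read an optional, opt-in rpc timeout and pass it to the same {{RPC.waitForProxy}} overload shown above. The configuration key name {{dfs.datanode.namenode.rpc-timeout}} is only a placeholder chosen for illustration; with its default of 0 the client keeps the existing infinite ping loop, and only a user who sets a positive value gets the {{SocketTimeoutException}} behavior from {{handleTimeout()}}.

{code}
// Sketch only -- "dfs.datanode.namenode.rpc-timeout" is a placeholder key,
// not an existing configuration property. Default 0 keeps today's behavior
// (infinite ping loop); a positive value makes handleTimeout() throw
// SocketTimeoutException so the DN can react (e.g. switch to the standby NN).
int nnRpcTimeout = conf.getInt("dfs.datanode.namenode.rpc-timeout", 0);

// get NN proxy with the optional rpc timeout
DatanodeProtocol dnp =
    (DatanodeProtocol) RPC.waitForProxy(DatanodeProtocol.class,
                                        DatanodeProtocol.versionID,
                                        nnAddr, conf,
                                        nnRpcTimeout,      // rpcTimeout
                                        Long.MAX_VALUE);   // wait-for-proxy deadline
{code}

This way the generic IPC code is untouched and only the HDFS side decides whether to pass a non-zero rpcTimeout, which matches the point about keeping the change HDFS-specific.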
--Thanks

> When Namenode network is unplugged, DFSClient operations waits for ever
> -----------------------------------------------------------------------
>
>                 Key: HADOOP-7488
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7488
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HADOOP-7488.patch
>
>
> When NN/DN is shutdown gracefully, the DFSClient operations which are waiting
> for a response from NN/DN, will throw exception & come out quickly
> But when the NN/DN network is unplugged, the DFSClient operations which are
> waiting for a response from NN/DN, waits for ever.