[ https://issues.apache.org/jira/browse/HADOOP-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085760#comment-13085760 ]
Uma Maheswara Rao G commented on HADOOP-7488:
---------------------------------------------

Hi Konstantin, thanks a lot for taking a look at this issue.

{quote}
If rpcTimeout > 0 then {{handleTimeout()}} will throw SocketTimeoutException instead of going into ping loop. Can you control the required behavior by setting rpcTimeout > 0 rather introducing the # of pings limit.
{quote}

Yes, we can control it with this parameter as well. I am planning to add the code below in the DataNode when getting the proxy:

{code}
// get NN proxy
DatanodeProtocol dnp =
    (DatanodeProtocol) RPC.waitForProxy(DatanodeProtocol.class,
        DatanodeProtocol.versionID, nnAddr, conf, socketTimeout, Long.MAX_VALUE);
{code}

Here socketTimeout is passed as the rpcTimeout. This property is already used as the rpcTimeout for createInterDataNodeProtocolProxy: {{this.socketTimeout = conf.getInt(DFS_CLIENT_SOCKET_TIMEOUT_KEY, HdfsConstants.READ_TIMEOUT);}}

But my question is: if I use socketTimeout (default 60*1000 ms) as the rpcTimeout, the default behavior will change. I don't want to change the default behavior here. Any suggestion for this? (One possible opt-in approach is sketched at the end of this comment.)

{quote}
DataNodes and TaskTrackers are designed to ping NN and JT infinitely, because during startup you cannot predict when NN will come online as it depends on the size of the image and edits. Also when NN becomes busy it is important for DNs to keep retrying rather than assuming the NN is dead.
{quote}

Yes. But in some scenarios, such as a network unplug, timeouts may occur, and because of the current timeout handling the system will be blocked unnecessarily for a long time. As far as I know, even if we throw that timeout exception out to the JT or DN, they will handle it and retry again in their offerService methods, except in the condition below:

{code}
catch(RemoteException re) {
  String reClass = re.getClassName();
  if (UnregisteredNodeException.class.getName().equals(reClass)
      || DisallowedDatanodeException.class.getName().equals(reClass)
      || IncorrectVersionException.class.getName().equals(reClass)) {
    LOG.warn("blockpool " + blockPoolId + " is shutting down", re);
    shouldServiceRun = false;
    return;
  }
{code}

{quote}
And even if they don't this should be an HDFS change not generic IPC change, which affects many Hadoop components
{quote}

What I felt is that this issue applies to all the components that use Hadoop IPC. I also planned to retain the default behavior as it is, so that the other components are not affected; a user who really requires it can tune the configuration parameter based on his requirement. Anyway, since we decided to use rpcTimeout, only the IPC user code should pass this value, and in that case this becomes an HDFS-specific change. We also need to check MapReduce (the same situation applies to the JT).

{quote}
As for HA I don't know what you did for HA and therefore cannot understand what problem you are trying to solve here. I can guess that you want DNs switch to another NN when they timeout rather than retrying. In this case you should be able to use rpcTimeout
{quote}

Yes, your guess is correct :-) In our HA solution we are using the *BackupNode*, and the switching framework is a *Zookeeper based* LeaderElection. DNs have both the active and standby node addresses configured, and on any failure they try to switch to the other NN. The scenario here is: we unplugged the active NN's network card, and then all DNs were blocked for a long time.
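
To make the "keep the default behavior" idea concrete, here is a rough sketch (not an actual patch) of how the DataNode could read an optional, opt-in rpc timeout and pass it to the same {{RPC.waitForProxy}} overload shown above. The configuration key name {{dfs.datanode.namenode.rpc-timeout}} is only a placeholder chosen for illustration; with its default of 0 the client keeps the existing infinite ping loop, and only a user who sets a positive value gets the {{SocketTimeoutException}} behavior from {{handleTimeout()}}.

{code}
// Sketch only -- "dfs.datanode.namenode.rpc-timeout" is a placeholder key,
// not an existing configuration property. Default 0 keeps today's behavior
// (infinite ping loop); a positive value makes handleTimeout() throw
// SocketTimeoutException so the DN can react (e.g. switch to the standby NN).
int nnRpcTimeout = conf.getInt("dfs.datanode.namenode.rpc-timeout", 0);

// get NN proxy with the optional rpc timeout
DatanodeProtocol dnp =
    (DatanodeProtocol) RPC.waitForProxy(DatanodeProtocol.class,
                                        DatanodeProtocol.versionID,
                                        nnAddr, conf,
                                        nnRpcTimeout,      // rpcTimeout
                                        Long.MAX_VALUE);   // wait-for-proxy deadline
{code}

This way the generic IPC code is untouched and only the HDFS side decides whether to pass a non-zero rpcTimeout, which matches the point about keeping the change HDFS-specific.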
--Thanks

> When Namenode network is unplugged, DFSClient operations waits for ever
> -----------------------------------------------------------------------
>
>                 Key: HADOOP-7488
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7488
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HADOOP-7488.patch
>
>
> When NN/DN is shutdown gracefully, the DFSClient operations which are waiting
> for a response from NN/DN, will throw exception & come out quickly
> But when the NN/DN network is unplugged, the DFSClient operations which are
> waiting for a response from NN/DN, waits for ever.