[jira] Resolved: (HDFS-262) On a busy cluster, it is possible for the client to believe it cannot fetch a block when the client or datanodes are running slowly

Todd Lipcon (JIRA) Thu, 24 Sep 2009 22:43:49 -0700

     [ 
https://issues.apache.org/jira/browse/HDFS-262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Todd Lipcon resolved HDFS-262.
------------------------------

    Resolution: Cannot Reproduce
      Assignee: Todd Lipcon

Hi Jim,

I believe this is the behavior already implemented. The client sleeps for 3 
seconds, then calls openInfo() once, which causes the block locations to be 
refreshed.

Resolving - feel free to reopen if I misunderstood.

> On a busy cluster, it is possible for the client to believe it cannot fetch a 
> block when the client or datanodes are running slowly
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-262
>                 URL: https://issues.apache.org/jira/browse/HDFS-262
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>         Environment: 100 node cluster, fedora, 1TB disk per machine available 
> for HDFS (two spindles) 16GB RAM, 8 cores
> running datanode, TaskTracker, HBaseRegionServer and the task being executed 
> by the TaskTracker. 
>            Reporter: Jim Kellerman
>            Assignee: Todd Lipcon
>
> On a heavily loaded node, the communication between a DFSClient and a 
> datanode can time out or fail, leading DFSClient to believe the datanode is 
> non-responsive even though the datanode is, in fact, healthy. DFSClient may 
> run through all the retries for that datanode and then mark it "dead".
> This can continue as DFSClient iterates through the other datanodes for the 
> block it is looking for, until DFSClient declares that it can't find any 
> servers for that block, even though all n datanodes (where n = replication 
> factor) are healthy (but slow) and have valid copies of the block.
> It is also possible that the process running the DFSClient is too slow and 
> misses (or times out) responses from the data node, resulting in the 
> DFSClient believing that the datanode is dead.
> Another possibility is that the block has been moved from one or more 
> datanodes since DFSClient$DFSInputStream.chooseDataNode() found the locations 
> of the block.
> When the retries for each datanode and all datanodes are exhausted, 
> DFSClient$DFSInputStream.chooseDataNode() issues the warning:
> {code}
>           if (nodes == null || nodes.length == 0) {
>             LOG.info("No node available for block: " + blockInfo);
>           }
>           LOG.info("Could not obtain block " + block.getBlock()
>               + " from any node:  " + ie);
> {code}
> It would be an improvement, with no performance impact under normal 
> conditions, if DFSClient, on deciding that it cannot find the block anywhere, 
> retried finding the block by calling 
> {code}
> private static LocatedBlocks callGetBlockLocations()
> {code}
>  
> *once*, to attempt to recover from machine(s) being too busy, or the block 
> being relocated since the initial call to callGetBlockLocations(). If the 
> second attempt to find the block, based on what the namenode told DFSClient, 
> also fails, then issue the messages and give up by throwing the exception it 
> does today.
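
The retry-once idea described above can be sketched roughly as follows. This 
is only an illustration, not the actual DFSClient code: the class 
RefetchOnceSketch, the interfaces LocationSource and BlockReader, and the 
method readWithOneRefetch are all hypothetical stand-ins, and the sleep 
duration is a parameter (the comment above mentions 3 seconds).

```java
import java.io.IOException;
import java.util.List;

// Hypothetical sketch of "refetch block locations once, then give up".
// None of these names come from Hadoop; they only illustrate the control flow.
class RefetchOnceSketch {

    /** Stand-in for asking the namenode where a block lives. */
    interface LocationSource {
        List<String> locate(String blockId) throws IOException;
    }

    /** Stand-in for reading a block from a single datanode. */
    interface BlockReader {
        byte[] read(String datanode, String blockId) throws IOException;
    }

    static byte[] readWithOneRefetch(String blockId,
                                     LocationSource namenode,
                                     BlockReader reader,
                                     long sleepMillis)
            throws IOException, InterruptedException {
        IOException last = null;
        // Pass 0 uses the initial locations; pass 1 is the single refetch.
        for (int pass = 0; pass < 2; pass++) {
            if (pass == 1) {
                // Give busy datanodes a moment before asking the namenode again.
                Thread.sleep(sleepMillis);
            }
            List<String> nodes = namenode.locate(blockId);
            for (String node : nodes) {
                try {
                    return reader.read(node, blockId);
                } catch (IOException e) {
                    // Treat this node as failed for this pass and try the next one.
                    last = e;
                }
            }
        }
        // Both passes failed: give up, as the client does today.
        throw new IOException(
                "Could not obtain block " + blockId + " from any node", last);
    }
}
```

With this shape, a read that fails on every datanode in the first pass costs 
one extra namenode lookup and one sleep before the exception is thrown, and a 
read that succeeds in the first pass is unaffected.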

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
