[ https://issues.apache.org/jira/browse/HDFS-262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Todd Lipcon resolved HDFS-262.
------------------------------

    Resolution: Cannot Reproduce
      Assignee: Todd Lipcon

Hi Jim, I believe this is the behavior already implemented. It sleeps for 3 seconds, then calls openInfo() once, which causes the block locations to be refreshed.

Resolving - feel free to reopen if I misunderstood.

> On a busy cluster, it is possible for the client to believe it cannot fetch a
> block when the client or datanodes are running slowly
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-262
>                 URL: https://issues.apache.org/jira/browse/HDFS-262
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>         Environment: 100 node cluster, fedora, 1TB disk per machine available
> for HDFS (two spindles), 16GB RAM, 8 cores,
> running datanode, TaskTracker, HBaseRegionServer and the task being executed
> by the TaskTracker.
>            Reporter: Jim Kellerman
>            Assignee: Todd Lipcon
>
> On a heavily loaded node, the communication between a DFSClient and a
> datanode can time out or fail, leading DFSClient to believe the datanode is
> non-responsive even though the datanode is, in fact, healthy. It may run
> through all the retries for that datanode, leading DFSClient to mark the
> datanode "dead".
> This can continue as DFSClient iterates through the other datanodes for the
> block it is looking for, and then DFSClient will declare that it can't find
> any servers for that block (even though all n (where n = replication factor)
> datanodes are healthy (but slow) and have valid copies of the block).
> It is also possible that the process running the DFSClient is too slow and
> misses (or times out) responses from the datanode, resulting in the
> DFSClient believing that the datanode is dead.
> Another possibility is that the block has been moved from one or more
> datanodes since DFSClient$DFSInputStream.chooseDataNode() found the locations
> of the block.
> When the retries for each datanode and all datanodes are exhausted,
> DFSClient$DFSInputStream.chooseDataNode() issues the warning:
> {code}
>         if (nodes == null || nodes.length == 0) {
>           LOG.info("No node available for block: " + blockInfo);
>         }
>         LOG.info("Could not obtain block " + block.getBlock() + " from any node: " + ie);
> {code}
> It would be an improvement, and would not impact performance under normal
> conditions, if DFSClient, on deciding that it cannot find the block anywhere,
> retried finding the block by calling
> {code}
>       private static LocatedBlocks callGetBlockLocations()
> {code}
> *once*, to attempt to recover from machine(s) being too busy, or the block
> being relocated since the initial call to callGetBlockLocations(). If the
> second attempt to find the block, based on what the namenode told DFSClient,
> also fails, then issue the messages and give up by throwing the exception it
> does today.
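
For reference, here is a minimal Java sketch of the retry-once behaviour discussed in this issue: try every datanode reported for the block, and if all of them fail, wait, ask for the block locations again exactly once, and only then give up. The names RetryOnceBlockReader, LocationSource and BlockFetcher and the configurable delay are hypothetical stand-ins for illustration. This is not the actual DFSClient code, and the real client's node bookkeeping is more involved.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical sketch only; none of these types exist in Hadoop. */
final class RetryOnceBlockReader {

    /** Stand-in for the namenode lookup (callGetBlockLocations() in the issue). */
    interface LocationSource {
        List<String> locate(String blockId) throws IOException;
    }

    /** Stand-in for reading a block from one datanode. */
    interface BlockFetcher {
        byte[] fetch(String datanode, String blockId) throws IOException;
    }

    private final LocationSource locations;
    private final BlockFetcher fetcher;
    private final long refreshDelayMillis; // assumed knob; the comment above mentions 3 seconds

    RetryOnceBlockReader(LocationSource locations, BlockFetcher fetcher, long refreshDelayMillis) {
        this.locations = locations;
        this.fetcher = fetcher;
        this.refreshDelayMillis = refreshDelayMillis;
    }

    byte[] read(String blockId) throws IOException {
        IOException last = null;
        // Pass 0 uses the initial locations; pass 1 uses locations fetched again.
        for (int pass = 0; pass < 2; pass++) {
            if (pass == 1) {
                sleepBeforeRefresh(); // give busy nodes a chance to recover
            }
            for (String node : locations.locate(blockId)) {
                try {
                    return fetcher.fetch(node, blockId);
                } catch (IOException e) {
                    last = e; // node is skipped for the rest of this pass only
                }
            }
        }
        // Both passes failed: give up, as the client does today.
        throw new IOException("Could not obtain block " + blockId + " from any node", last);
    }

    private void sleepBeforeRefresh() throws IOException {
        try {
            Thread.sleep(refreshDelayMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting to refresh block locations", e);
        }
    }
}
```

Because the second pass starts from a fresh lookup, a datanode that merely timed out in the first pass gets another chance, and a block that was moved is found at its new location. Normal reads are unaffected, since the extra lookup only happens after every node has failed.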