[
https://issues.apache.org/jira/browse/HADOOP-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634852#action_12634852
]
ZhuGuanyin commented on HADOOP-4291:
------------------------------------
It seems that after trying all datanodes, the client clears the dead-node
list and retries, which sends it into an infinite loop.
We added some debug logging as follows:
In DFSInputStream.blockSeekTo():
private synchronized DatanodeInfo blockSeekTo(long target) throws IOException {
  while (s == null) {
    LOG.info("blockSeekTo step 1");
    DNAddrPair retval = chooseDataNode(targetBlock);
    LOG.info("blockSeekTo step 2");
    try {
      blockReader = BlockReader.newBlockReader();
      return chosenNode;
    } catch (IOException ex) {
      LOG.info("blockSeekTo step 3");
      addToDeadNodes(chosenNode);
      if (s != null) {
        try {
          s.close();
        } catch (IOException iex) {
          LOG.info("blockSeekTo step 4");
        }
      }
      s = null;
      LOG.info("blockSeekTo step 5");
    }
    LOG.info("blockSeekTo step 6");
  }
  return chosenNode;
}
In DFSInputStream.chooseDataNode():
private DNAddrPair chooseDataNode(LocatedBlock block)
    throws IOException {
  LOG.info("chooseDataNode() step 1");
  while (true) {
    LOG.info("chooseDataNode() step 2");
    DatanodeInfo[] nodes = block.getLocations();
    try {
      LOG.info("chooseDataNode() step 3, failures = " + failures);
      DatanodeInfo chosenNode = bestNode(nodes, deadNodes);
      LOG.info("chooseDataNode() step 4");
      InetSocketAddress targetAddr =
          DataNode.createSocketAddr(chosenNode.getName());
      LOG.info("chooseDataNode() step 5");
      return new DNAddrPair(chosenNode, targetAddr);
    } catch (IOException ie) {
      String blockInfo = block.getBlock() + " file=" + src;
      LOG.info("chooseDataNode() step 6, failures = " + failures);
      if (failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
        throw new IOException("Could not obtain block: " + blockInfo);
      }
      if (nodes == null || nodes.length == 0) {
        LOG.info("No node available for block: " + blockInfo);
      }
      LOG.info("Could not obtain block " + block.getBlock() +
               " from any node: " + ie);
      try {
        Thread.sleep(3000);
      } catch (InterruptedException iex) {
      }
      LOG.info("chooseDataNode() step 7, failures = " + failures);
      deadNodes.clear(); //2nd option is to remove only nodes[blockId]
      openInfo();
      failures++;
      LOG.info("chooseDataNode() step 8, failures = " + failures);
      continue;
    }
  }
}
After running ./hadoop dfs -cat /1.txt, we get the following output:
[EMAIL PROTECTED] baidu.com ~]$ ./hadoop fs -cat /1.txt
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 4
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 5
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 2
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 3
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 5
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 6
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 4
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 5
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 2
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 3
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 5
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 6
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 4
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 5
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 2
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 3
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 5
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 6
08/09/26 21:00:44 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:44 INFO fs.DFSClient: chooseDataNode() step 6, failures = 0
08/09/26 21:00:44 INFO fs.DFSClient: Could not obtain block blk_1225 from any node: java.io.IOException: No live nodes contain current block
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 7, failures = 0
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 8, failures = 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 3, failures = 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 4
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 5
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 2
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 3
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 5
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 6
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 4
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 5
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 2
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 3
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 5
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 6
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 4
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 5
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 2
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 3
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 5
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 6
08/09/26 21:00:47 INFO fs.DFSClient: blockSeekTo step 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 1
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 2
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 3, failures = 0
08/09/26 21:00:47 INFO fs.DFSClient: chooseDataNode() step 6, failures = 0
08/09/26 21:00:47 INFO fs.DFSClient: Could not obtain block blk_1225 from any node: java.io.IOException: No live nodes contain current block
.........................................................................................................
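In the output above, the failures counter reaches 1 and is back at 0 on the
next pass through blockSeekTo(), so the MAX_BLOCK_ACQUIRE_FAILURES check never
fires. The debug excerpt does not show where the counter is reset, so the
following is only a minimal simulation under stated assumptions: the class
RetryLoopSketch, the method simulate(), the stepCap parameter, the limit of 3,
and the placement of the reset (right after a node is chosen) are all
hypothetical and are not taken from the Hadoop source. It models every replica
as corrupt and compares a counter that is reset with one that persists.

```java
import java.util.HashSet;
import java.util.Set;

public class RetryLoopSketch {
    // Hypothetical limit; the real constant's value is not shown in the excerpt.
    static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;

    /**
     * Simulates reading a block whose replicas all fail on connect.
     * Returns the number of connection attempts made before giving up,
     * or -1 if stepCap attempts were made without ever giving up.
     */
    static int simulate(int replicas, boolean resetFailures, int stepCap) {
        Set<Integer> deadNodes = new HashSet<>();
        int failures = 0;
        int attempts = 0;
        while (attempts < stepCap) {            // outer loop: blockSeekTo()
            int chosen = -1;
            while (chosen < 0) {                // inner loop: chooseDataNode()
                for (int n = 0; n < replicas; n++) {
                    if (!deadNodes.contains(n)) {
                        chosen = n;
                        break;
                    }
                }
                if (chosen < 0) {               // "No live nodes contain current block"
                    if (failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
                        return attempts;        // "Could not obtain block"
                    }
                    deadNodes.clear();          // forget which nodes already failed
                    failures++;
                }
            }
            if (resetFailures) {
                failures = 0;                   // assumed reset, as the log suggests
            }
            attempts++;
            deadNodes.add(chosen);              // connect fails: every replica is corrupt
        }
        return -1;                              // never gave up within stepCap attempts
    }

    public static void main(String[] args) {
        System.out.println(simulate(3, false, 1000)); // prints 12
        System.out.println(simulate(3, true, 1000));  // prints -1
    }
}
```

With a persistent counter the simulated client gives up after four rounds over
the three replicas; with the reset it only stops because of the artificial
step cap, which matches the repeating pattern in the log.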
> MapReduce Streaming job hang when all replications of the input file has
> corrupted!
> -----------------------------------------------------------------------------------
>
> Key: HADOOP-4291
> URL: https://issues.apache.org/jira/browse/HADOOP-4291
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.18.1
> Reporter: ZhuGuanyin
> Priority: Critical
>
> In some special cases, all replicas of a given file have been truncated to
> zero length, but the namenode still holds the original size (we don't know
> why). The MapReduce streaming job will hang if mapred.task.timeout is not
> specified and the input files include this corrupted file; even the dfs
> shell "cat" hangs when fetching data from this corrupted file.
> We found that the job hangs at DFSInputStream.blockSeekTo() when choosing a
> datanode. The following test shows this:
> 1) Copy a small file to HDFS.
> 2) Find the file's blocks, log in to the datanodes that hold them, and
> truncate the blocks to zero length.
> 3) Cat the file through the dfs shell "cat".
> 4) The cat command enters an infinite loop.
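Until the loop itself is bounded, a caller can guard a possibly hanging read
with its own deadline, similar in spirit to what mapred.task.timeout does for
a task. The following is a generic sketch using only java.util.concurrent; the
class ReadWatchdogSketch and the method callWithTimeout() are hypothetical
names, and nothing here is Hadoop API. Note that cancellation relies on thread
interruption, so it only stops work that responds to interrupts.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ReadWatchdogSketch {
    /**
     * Runs task on a daemon thread and waits at most timeoutMillis for it.
     * Returns the result, or Optional.empty() on timeout or interruption.
     * The task must not return null.
     */
    static <T> Optional<T> callWithTimeout(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "read-watchdog");
            t.setDaemon(true);                  // do not keep the JVM alive if the task hangs
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return Optional.of(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true);                // interrupt the worker thread
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A task that finishes quickly returns its value.
        System.out.println(callWithTimeout(() -> "ok", 1000).isPresent());
        // A task that blocks far longer than the deadline is abandoned.
        System.out.println(callWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never";
        }, 100).isPresent());
    }
}
```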