[
https://issues.apache.org/jira/browse/HADOOP-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Christian Kunz updated HADOOP-4132:
-----------------------------------
Summary: high rate of task failures because of bad or full datanodes (was:
high rate of task failures because of bad of full datanodes)
> high rate of task failures because of bad or full datanodes
> -----------------------------------------------------------
>
> Key: HADOOP-4132
> URL: https://issues.apache.org/jira/browse/HADOOP-4132
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.17.1
> Reporter: Christian Kunz
> Priority: Blocker
>
> With 0.17 we see a high rate of task failures because the same bad datanodes
> are repeatedly reported as firstBadLink. We never saw this in 0.16.
> After running fewer than 20,000 map tasks, more than 2,500 of them reported
> one particular datanode as firstBadLink, with a typical exception of the form:
> 08/09/09 14:41:14 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 189000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xxx.yyy.zzz.ttt:38788 remote=/xxx.yyy.zzz.ttt:50010]
> 08/09/09 14:41:14 INFO dfs.DFSClient: Abandoning block blk_-3650954811734254315
> 08/09/09 14:41:14 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
> 08/09/09 14:44:29 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 189000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xxx.yyy.zzz.ttt:39014 remote=/xxx.yyy.zzz.ttt:50010]
> 08/09/09 14:44:29 INFO dfs.DFSClient: Abandoning block blk_8665387817606483066
> 08/09/09 14:44:29 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
> 08/09/09 14:47:35 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010
> 08/09/09 14:47:35 INFO dfs.DFSClient: Abandoning block blk_8475261758012143524
> 08/09/09 14:47:35 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
> 08/09/09 14:50:42 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010
> 08/09/09 14:50:42 INFO dfs.DFSClient: Abandoning block blk_4847638219960634858
> 08/09/09 14:50:42 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
> 08/09/09 14:50:48 WARN dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
> 08/09/09 14:50:48 WARN dfs.DFSClient: Error Recovery for block blk_4847638219960634858 bad datanode[2]
> Exception in thread "main" java.io.IOException: Could not get block locations. Aborting...
> With several such bad datanodes, the probability of a job failing rises sharply.
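
To see which datanodes are named most often, the DFSClient messages quoted above can be tallied from the task logs. The following is a minimal sketch, not part of Hadoop: the function name `summarize` and the `SAMPLE` list are hypothetical, the match strings are copied from the log messages in this report, and it assumes each log entry sits on a single line in the real task log.

```python
import re
from collections import Counter

# Message fragments copied from the DFSClient log lines quoted in the report.
FIRST_BAD_LINK = re.compile(r"Bad connect ack with firstBadLink (\S+)")
ABANDONING = "Abandoning block"
GAVE_UP = "Unable to create new block"


def summarize(lines):
    """Return (Counter of firstBadLink nodes, abandoned blocks, give-ups).

    Hypothetical helper for illustration; `lines` is any iterable of
    log lines, for example an open task log file.
    """
    bad_links = Counter()
    abandoned = 0
    gave_up = 0
    for line in lines:
        match = FIRST_BAD_LINK.search(line)
        if match:
            # group(1) is the host:port the client reported as the bad link.
            bad_links[match.group(1)] += 1
        if ABANDONING in line:
            abandoned += 1
        if GAVE_UP in line:
            gave_up += 1
    return bad_links, abandoned, gave_up


# A few entries from the excerpt in this report, one log entry per line.
SAMPLE = [
    "08/09/09 14:47:35 INFO dfs.DFSClient: Exception in createBlockOutputStream "
    "java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010",
    "08/09/09 14:47:35 INFO dfs.DFSClient: Abandoning block blk_8475261758012143524",
    "08/09/09 14:50:42 INFO dfs.DFSClient: Exception in createBlockOutputStream "
    "java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010",
    "08/09/09 14:50:42 INFO dfs.DFSClient: Abandoning block blk_4847638219960634858",
    "08/09/09 14:50:48 WARN dfs.DFSClient: DataStreamer Exception: "
    "java.io.IOException: Unable to create new block.",
]

if __name__ == "__main__":
    bad_links, abandoned, gave_up = summarize(SAMPLE)
    # most_common() lists the most frequently reported datanodes first.
    for node, count in bad_links.most_common():
        print(f"{node}\t{count}")
    print(f"abandoned blocks: {abandoned}, writes given up: {gave_up}")
```

Running the same function over the logs of all failed tasks (passing an open file instead of `SAMPLE`) should show whether the failures cluster on one or a few datanodes, as described above.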
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.