[
https://issues.apache.org/jira/browse/HADOOP-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630254#action_12630254
]
Robert Chansler commented on HADOOP-4132:
-----------------------------------------
The NameNode should be able to continue to provide administrative functions and
file access even if file creations must be delayed or deferred. If the policy
is to refuse creations, the client should receive an unambiguous message.
So what should be the policy? It is probably infeasible to allocate the very
last block. Is the cluster full when it takes too long to find a block? When
too many DataNodes report high utilization? When replication attempts fail too
often?
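
To make the discussion concrete, below is a minimal sketch of what such a refusal policy might look like at block allocation time. It is an illustration only, not the actual FSNamesystem code: the class name, threshold, and counters are assumptions. The point is simply that administrative calls and reads are untouched, and a client asking for a new block gets an explicit exception rather than a timeout.

{code:java}
import java.io.IOException;

class BlockAllocationPolicy {
    // Assumed threshold: refuse creations once this fraction of DataNodes
    // report high utilization. Purely illustrative, not a real config key.
    private final double maxFullDatanodeFraction = 0.95;

    boolean clusterConsideredFull(int fullDatanodes, int totalDatanodes) {
        if (totalDatanodes == 0) {
            return true; // no live targets at all
        }
        return ((double) fullDatanodes / totalDatanodes) >= maxFullDatanodeFraction;
    }

    // Called only before allocating a new block; administrative functions and
    // reads would bypass this check, so the NameNode stays responsive.
    void checkCanAllocateBlock(int fullDatanodes, int totalDatanodes)
            throws IOException {
        if (clusterConsideredFull(fullDatanodes, totalDatanodes)) {
            // Unambiguous, client-visible refusal instead of a long timeout.
            throw new IOException("Cannot allocate block: " + fullDatanodes
                + " of " + totalDatanodes + " DataNodes report high utilization");
        }
    }
}
{code}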
> high rate of task failures because of bad or full datanodes
> -----------------------------------------------------------
>
> Key: HADOOP-4132
> URL: https://issues.apache.org/jira/browse/HADOOP-4132
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.17.1
> Reporter: Christian Kunz
>
> With 0.17 we notice a high rate of task failures because the same bad
> datanodes are reported repeatedly as firstBadLink. We never saw this in 0.16.
> After running fewer than 20,000 map tasks, more than 2,500 of them reported
> one particular datanode as firstBadLink, with typical exceptions of the form:
> 08/09/09 14:41:14 INFO dfs.DFSClient: Exception in createBlockOutputStream
> java.net.SocketTimeoutException: 189000 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
> local=/xxx.yyy.zzz.ttt:38788 remote=/xxx.yyy.zzz.ttt:50010]
> 08/09/09 14:41:14 INFO dfs.DFSClient: Abandoning block
> blk_-3650954811734254315
> 08/09/09 14:41:14 INFO dfs.DFSClient: Waiting to find target node:
> xxx.yyy.zzz.ttt:50010
> 08/09/09 14:44:29 INFO dfs.DFSClient: Exception in createBlockOutputStream
> java.net.SocketTimeoutException: 189000 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
> local=/xxx.yyy.zzz.ttt:39014 remote=/xxx.yyy.zzz.ttt:50010]
> 08/09/09 14:44:29 INFO dfs.DFSClient: Abandoning block blk_8665387817606483066
> 08/09/09 14:44:29 INFO dfs.DFSClient: Waiting to find target node:
> xxx.yyy.zzz.ttt:50010
> 08/09/09 14:47:35 INFO dfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010
> 08/09/09 14:47:35 INFO dfs.DFSClient: Abandoning block blk_8475261758012143524
> 08/09/09 14:47:35 INFO dfs.DFSClient: Waiting to find target node:
> xxx.yyy.zzz.ttt:50010
> 08/09/09 14:50:42 INFO dfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010
> 08/09/09 14:50:42 INFO dfs.DFSClient: Abandoning block blk_4847638219960634858
> 08/09/09 14:50:42 INFO dfs.DFSClient: Waiting to find target node:
> xxx.yyy.zzz.ttt:50010
> 08/09/09 14:50:48 WARN dfs.DFSClient: DataStreamer Exception:
> java.io.IOException: Unable to create new block.
> 08/09/09 14:50:48 WARN dfs.DFSClient: Error Recovery for block
> blk_4847638219960634858 bad datanode[2]
> Exception in thread "main" java.io.IOException: Could not get block
> locations. Aborting...
> With several such bad datanodes, the probability of whole jobs failing increases substantially.
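
The pattern above shows the client abandoning block after block and then being handed the same bad DataNode again. A rough sketch of the kind of client-side exclusion that would avoid this is shown below. The interfaces and method names (getBlockTargets, tryOpenPipeline) are placeholders, not the real DFSClient/ClientProtocol API; the sketch only illustrates remembering the node reported as firstBadLink before asking the NameNode for new targets.

{code:java}
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

class BlockWriterSketch {

    // Placeholder for the NameNode call; nodes in 'excluded' are skipped when
    // choosing targets. Not the real ClientProtocol signature.
    interface NamenodeStub {
        String[] getBlockTargets(String file, Set<String> excluded) throws IOException;
    }

    // Placeholder for opening the write pipeline; returns the address of the
    // first bad DataNode (the firstBadLink), or null if the pipeline came up.
    interface PipelineOpener {
        String tryOpenPipeline(String[] targets) throws IOException;
    }

    static final int MAX_RETRIES = 3;

    String[] allocateWithExclusion(String file, NamenodeStub namenode,
                                   PipelineOpener opener) throws IOException {
        Set<String> excluded = new HashSet<String>();
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            String[] targets = namenode.getBlockTargets(file, excluded);
            String firstBadLink = opener.tryOpenPipeline(targets);
            if (firstBadLink == null) {
                return targets; // pipeline established, start streaming the block
            }
            // Remember the bad node so the next request avoids it, instead of
            // abandoning the block and getting the same node handed back.
            excluded.add(firstBadLink);
        }
        throw new IOException("Unable to create new block after " + MAX_RETRIES
            + " attempts; excluded nodes: " + excluded);
    }
}
{code}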