high rate of task failures because of bad datanodes
---------------------------------------------------

                 Key: HADOOP-4132
                 URL: https://issues.apache.org/jira/browse/HADOOP-4132
             Project: Hadoop Core
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.17.1
            Reporter: Christian Kunz


With 0.17 we see a high rate of task failures because the same bad datanodes 
are repeatedly reported as firstBadLink. We never saw this in 0.16.

After running fewer than 20,000 map tasks, more than 2,500 of them reported 
one particular datanode as firstBadLink, with a typical exception of the form:

08/09/09 14:41:14 INFO dfs.DFSClient: Exception in createBlockOutputStream 
java.net.SocketTimeoutException: 189000 millis timeout while waiting for 
channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/xxx.yyy.zzz.ttt:38788 remote=/xxx.yyy.zzz.ttt:50010]
08/09/09 14:41:14 INFO dfs.DFSClient: Abandoning block blk_-3650954811734254315
08/09/09 14:41:14 INFO dfs.DFSClient: Waiting to find target node: 
xxx.yyy.zzz.ttt:50010
08/09/09 14:44:29 INFO dfs.DFSClient: Exception in createBlockOutputStream 
java.net.SocketTimeoutException: 189000 millis timeout while waiting for 
channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/xxx.yyy.zzz.ttt:39014 remote=/xxx.yyy.zzz.ttt:50010]
08/09/09 14:44:29 INFO dfs.DFSClient: Abandoning block blk_8665387817606483066
08/09/09 14:44:29 INFO dfs.DFSClient: Waiting to find target node: 
xxx.yyy.zzz.ttt:50010
08/09/09 14:47:35 INFO dfs.DFSClient: Exception in createBlockOutputStream 
java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010
08/09/09 14:47:35 INFO dfs.DFSClient: Abandoning block blk_8475261758012143524
08/09/09 14:47:35 INFO dfs.DFSClient: Waiting to find target node: 
xxx.yyy.zzz.ttt:50010
08/09/09 14:50:42 INFO dfs.DFSClient: Exception in createBlockOutputStream 
java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010
08/09/09 14:50:42 INFO dfs.DFSClient: Abandoning block blk_4847638219960634858
08/09/09 14:50:42 INFO dfs.DFSClient: Waiting to find target node: 
xxx.yyy.zzz.ttt:50010
08/09/09 14:50:48 WARN dfs.DFSClient: DataStreamer Exception: 
java.io.IOException: Unable to create new block.
08/09/09 14:50:48 WARN dfs.DFSClient: Error Recovery for block 
blk_4847638219960634858 bad datanode[2]
Exception in thread "main" java.io.IOException: Could not get block locations. 
Aborting...

With several such bad datanodes in a cluster, the probability of job failure 
rises substantially.
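
To see which datanodes are reported most often, the client logs can be 
tallied by the address in the "firstBadLink" message. The following is a 
minimal sketch, not part of Hadoop: the function name count_first_bad_links 
is hypothetical, and the pattern assumes the exact message wording shown in 
the log excerpt above, which other versions may phrase differently.

```python
import re
import sys
from collections import Counter

# Assumption: the message wording matches the DFSClient output quoted in
# this report ("Bad connect ack with firstBadLink <host:port>").
FIRST_BAD_LINK = re.compile(r"Bad connect ack with firstBadLink (\S+)")


def count_first_bad_links(lines):
    """Return a Counter mapping datanode address to number of reports."""
    counts = Counter()
    for line in lines:
        match = FIRST_BAD_LINK.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Read log lines from standard input and list the most-reported nodes.
    for node, reports in count_first_bad_links(sys.stdin).most_common(10):
        print(f"{reports:8d}  {node}")
```

Saved under a name of your choosing and fed the concatenated task logs on 
standard input, it lists the ten most frequently reported datanodes. It only 
counts explicit firstBadLink messages; the SocketTimeoutException entries do 
not name the failing pipeline node in the same way and are ignored here.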

