Hi All,

We have an HDFS cluster with ~200 nodes, and for some reason it is divided into 4 MR clusters that share the same HDFS. Recently we have been seeing a lot of SocketTimeoutExceptions in the datanode logs, such as:
2012-02-24 11:57:51,882 WARN datanode.DataNode (DataXceiver.java:readBlock(236)) - DatanodeRegistration(.....):Got exception while serving blk_-5205544551109548677_55590565 to /xx.xx.xx.xx: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/... remote=/...]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:214)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:114)

This has actually been happening for months, but only recently have there been lots of "timeout while waiting for channel to be ready for *write*" errors (no read timeouts), and the "remote" hosts all come from one of the MR clusters. Monitoring that MR cluster in Ganglia shows no constant heavy load. I tested the network by scp'ing a large file between hosts, and I don't think it's a network problem.

I did some googling for this problem and found https://issues.apache.org/jira/browse/HDFS-770 , which may be related but is not resolved.

On some datanodes, we also found "Out of socket memory" in dmesg. (Does it hurt? Does it need some kernel tuning? uname -a: Linux ... 2.6.18-194.el5 #1 SMP Fri Apr 2 14:58:14 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux)

If anybody has an idea about this, please help :) Thanks in advance.

--
Kindest Regards,
Clay Chiang
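P.S. Two things I am considering trying, though I haven't verified either yet, so corrections are welcome. First, the 480000 ms in the exception matches the default of dfs.datanode.socket.write.timeout (8 minutes), so raising that timeout (or setting it to 0 to disable it, which I have seen suggested as a workaround in the HDFS-770 discussion) might at least quiet the warnings. A rough sketch; the path to hdfs-site.xml is a guess for our install, and the 1200000 value is just an example:

    # Check whether a custom write timeout is already set
    # (config path is a guess; adjust for your installation)
    grep -A1 dfs.datanode.socket.write.timeout /etc/hadoop/conf/hdfs-site.xml

    # If not, the property could be added to hdfs-site.xml, e.g.:
    #   <property>
    #     <name>dfs.datanode.socket.write.timeout</name>
    #     <value>1200000</value>   <!-- ms; 0 disables the timeout -->
    #   </property>
    # Datanodes would need a restart to pick this up.

Of course this only hides the symptom if the real problem is slow readers on that one MR cluster.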
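Second, for the "Out of socket memory" messages: as far as I understand, the kernel prints that when TCP either hits the high-water mark in net.ipv4.tcp_mem (measured in pages) or exceeds net.ipv4.tcp_max_orphans. What I would check and tune (the new values below are guesses, to be sized to each machine's RAM, not recommendations):

    # Current limits and usage
    cat /proc/sys/net/ipv4/tcp_mem          # low / pressure / high, in pages
    cat /proc/sys/net/ipv4/tcp_max_orphans
    cat /proc/net/sockstat                  # TCP "mem" field = pages in use

    # Raise the limits if TCP memory is the bottleneck
    sysctl -w net.ipv4.tcp_mem="196608 262144 393216"
    sysctl -w net.ipv4.tcp_max_orphans=131072
    # Add the same settings to /etc/sysctl.conf to persist across reboots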