Hi All,

We have an HDFS cluster with ~200 nodes, and for some reason it is divided into 4 MR clusters that share the same HDFS. Recently we have been seeing a lot of SocketTimeoutExceptions in the datanode logs, such as:
2012-02-24 11:57:51,882 WARN datanode.DataNode (DataXceiver.java:readBlock(236)) - DatanodeRegistration(.....):Got exception while serving blk_-5205544551109548677_55590565 to /xx.xx.xx.xx: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/... remote=/...]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:214)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:114)

This has actually been happening for months, but only recently have there been lots of "timeout while waiting for channel to be ready for *write*" errors (no read timeouts), and the "remote" hosts all come from one of the MR clusters. Monitoring that MR cluster in Ganglia shows no constant heavy load. I tested the network by scp'ing a large file between hosts, and I don't think it's a network problem.

I did some googling for this problem and found https://issues.apache.org/jira/browse/HDFS-770 , which may be related but is not resolved.

On some datanodes, we also found "Out of socket memory" in dmesg. (Does it hurt? Does it need some kernel tuning? uname -a: Linux ... 2.6.18-194.el5 #1 SMP Fri Apr 2 14:58:14 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux)

If anybody has an idea about this, please help :) Thanks in advance.

--
Kindest Regards,
Clay Chiang
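P.S. Two things I am considering trying, though I haven't verified either yet, so corrections are welcome. First, the 480000 ms in the exception matches the default of dfs.datanode.socket.write.timeout (8 minutes), so raising that timeout (or setting it to 0 to disable it, which I have seen suggested as a workaround in the HDFS-770 discussion) might at least quiet the warnings. A rough sketch; the path to hdfs-site.xml is a guess for our install, and the 1200000 value is just an example:

    # Check whether a custom write timeout is already set
    # (config path is a guess; adjust for your installation)
    grep -A1 dfs.datanode.socket.write.timeout /etc/hadoop/conf/hdfs-site.xml

    # If not, the property could be added to hdfs-site.xml, e.g.:
    #   <property>
    #     <name>dfs.datanode.socket.write.timeout</name>
    #     <value>1200000</value>   <!-- ms; 0 disables the timeout -->
    #   </property>
    # Datanodes would need a restart to pick this up.

Of course this only hides the symptom if the real problem is slow readers on that one MR cluster.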
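Second, for the "Out of socket memory" messages: as far as I understand, the kernel prints that when TCP either hits the high-water mark in net.ipv4.tcp_mem (measured in pages) or exceeds net.ipv4.tcp_max_orphans. What I would check and tune (the new values below are guesses, to be sized to each machine's RAM, not recommendations):

    # Current limits and usage
    cat /proc/sys/net/ipv4/tcp_mem          # low / pressure / high, in pages
    cat /proc/sys/net/ipv4/tcp_max_orphans
    cat /proc/net/sockstat                  # TCP "mem" field = pages in use

    # Raise the limits if TCP memory is the bottleneck
    sysctl -w net.ipv4.tcp_mem="196608 262144 393216"
    sysctl -w net.ipv4.tcp_max_orphans=131072
    # Add the same settings to /etc/sysctl.conf to persist across reboots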