Hi, we ran into a problem where a datanode hung after a socket exception (connection reset) occurred on it. From then on, HBase running over the HDFS could not access the data tables (-ROOT- and .META. were not affected) until we manually stopped the bad datanode. Our environment is an 8-datanode cluster with regionservers running on top of it. The namenode and the HBase master run on 2 machines separate from these 8 nodes. The Hadoop version is 0.20.2 and HBase is 0.20.6.
The log entries related to the troubled data block are as follows (we collected them from multiple datanodes and the HBase regionserver), sorted by time (though there is a slight clock difference between nodes). I have three questions:

1. Why is the socket exception (connection reset) caught, yet the datanode wk008 still hangs?
2. Why did user table regions become inaccessible through HBase when only one datanode failed?
3. Is there a known bugfix for this issue?

hadoop-hadoop-datanode-str-wk008.p-prd.log.2012-05-20:2012-05-20 17:13:49,854 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_9016752944216030896_4468475 src: /192.168.128.114:41922 dest: /192.168.128.114:50010
2012-05-20 17:14:37,295 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.128.114:41922 remote=/192.168.128.114:50010]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2314)
hbase-hadoop-regionserver-str-wk008.p-prd.log.2012-05-20:2012-05-20 17:14:37,295 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_9016752944216030896_4468475 bad datanode[0] 192.168.128.114:50010
hbase-hadoop-regionserver-str-wk008.p-prd.log.2012-05-20:2012-05-20 17:14:37,295 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_9016752944216030896_4468475 in pipeline 192.168.128.114:50010, 192.168.128.104:50010, 192.168.128.105:50010: bad datanode 192.168.128.114:50010
hadoop-hadoop-datanode-str-wk008.p-prd.log.2012-05-20:2012-05-20 17:15:19,910 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_9016752944216030896_4468475 2 Exception java.net.SocketException: Connection reset
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
    at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
    at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.write(DataTransferProtocol.java:132)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:875)
    at java.lang.Thread.run(Thread.java:619)

/* after this message wk008 stops generating any log messages until it is rebooted */

hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20 17:17:20,843 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_9016752944216030896_4468475 src: /192.168.128.114:55438 dest: /192.168.128.104:50010
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20 17:17:20,858 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_9016752944216030896_4468475 src: /192.168.128.104:43219 dest: /192.168.128.105:50010
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20 17:18:08,307 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls recoverBlock(block=blk_9016752944216030896_4468475, targets=[192.168.128.104:50010, 192.168.128.105:50010])
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20 17:18:08,322 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: oldblock=blk_9016752944216030896_4468475(length=1040384), newblock=blk_9016752944216030896_4468477(length=1040384), datanode=192.168.128.104:50010

/* after wk008 is shut down and restarted */

hadoop-hadoop-datanode-str-wk007.p-prd.log.2012-05-20:2012-05-20 18:36:07,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_9016752944216030896_4468477 src: /192.168.128.104:35297 dest: /192.168.128.109:50010
hadoop-hadoop-datanode-str-wk007.p-prd.log.2012-05-20:2012-05-20 18:36:07,700 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_9016752944216030896_4468477 src: /192.168.128.104:35297 dest: /192.168.128.109:50010 of size 2112732
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20 18:36:33,266 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_9016752944216030896_4468475 java.net.SocketException: Connection reset
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20 18:36:33,266 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_9016752944216030896_4468475 terminating
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20 18:36:33,266 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_9016752944216030896_4468475 1 : Thread is interrupted.
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20 18:36:33,266 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_9016752944216030896_4468475 1 Exception java.net.SocketException: Socket closed
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20 18:36:33,266 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_9016752944216030896_4468475 received exception java.io.IOException: Interrupted receiveBlock
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20 18:36:33,267 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_9016752944216030896_4468475 java.io.EOFException: while trying to read 65557 bytes
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20 18:36:33,267 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_9016752944216030896_4468475 Interrupted.
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20 18:36:33,267 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_9016752944216030896_4468475 terminating
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20 18:36:33,267 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_9016752944216030896_4468475 received exception java.io.EOFException: while trying to read 65557 bytes
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20 18:36:33,312 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: oldblock=blk_9016752944216030896_4468475(length=1040384), newblock=blk_9016752944216030896_4468477(length=1040384), datanode=192.168.128.105:50010
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20 18:36:33,329 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_9016752944216030896_4468477 src: /192.168.128.114:54157 dest: /192.168.128.104:50010
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20 18:36:33,329 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Reopen already-open Block for append blk_9016752944216030896_4468477
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20 18:36:33,331 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_9016752944216030896_4468477 src: /192.168.128.104:46723 dest: /192.168.128.105:50010
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20 18:36:33,331 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Reopen already-open Block for append blk_9016752944216030896_4468477
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20 18:36:33,426 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Changing block file offset of block blk_9016752944216030896_4468477 from 0 to 1040384 meta file offset to 8135
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20 18:36:33,436 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Changing block file offset of block blk_9016752944216030896_4468477 from 0 to 1040384 meta file offset to 8135
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20 18:36:33,503 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.128.104:46723, dest: /192.168.128.105:50010, bytes: 2112732, op: HDFS_WRITE, cliID: DFSClient_1758679091, srvID: DS-750403221-192.168.128.105-50010-1301616586785, blockid: blk_9016752944216030896_4468477
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20 18:36:33,503 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.128.114:54157, dest: /192.168.128.104:50010, bytes: 2112732, op: HDFS_WRITE, cliID: DFSClient_1758679091, srvID: DS-1756587443-192.168.128.104-50010-1301616567281, blockid: blk_9016752944216030896_4468477
hadoop-hadoop-datanode-str-wk003.p-prd.log.2012-05-20:2012-05-20 18:36:33,503 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_9016752944216030896_4468477 terminating
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20 18:36:33,503 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_9016752944216030896_4468477 terminating
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20 18:38:14,041 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.128.104:50010, storageID=DS-1756587443-192.168.128.104-50010-1301616567281, infoPort=50075, ipcPort=50020) Starting thread to transfer block blk_9016752944216030896_4468477 to 192.168.128.109:50010
hadoop-hadoop-datanode-str-wk002.p-prd.log.2012-05-20:2012-05-20 18:38:14,150 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.128.104:50010, storageID=DS-1756587443-192.168.128.104-50010-1301616567281, infoPort=50075, ipcPort=50020):Transmitted block blk_9016752944216030896_4468477 to /192.168.128.109:50010
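
In case anyone wants to build the same time-sorted view from their own logs, here is a minimal Python sketch of how such a merge could be done. It assumes the input is `grep`-style output with one entry per line in the form `<logfile>:<timestamp> <message>`, as in the excerpts above. The names `parse` and `merge_by_time` are made up for this example and are not part of Hadoop or HBase. Stack-trace continuation lines and lines without a file prefix are simply skipped, and the sort cannot correct for clock differences between nodes.

```python
import re
from datetime import datetime

# Assumed input shape: "<logfile>:<YYYY-MM-DD HH:MM:SS,mmm> <message>"
LINE_RE = re.compile(
    r"^(?P<file>[^:]+):"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<msg>.*)$"
)


def parse(line):
    """Return (timestamp, logfile, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
    return ts, m.group("file"), m.group("msg")


def merge_by_time(lines, block_id=None):
    """Sort matching entries by timestamp, optionally keeping one block only."""
    entries = []
    for line in lines:
        parsed = parse(line.rstrip("\n"))
        if parsed is None:
            continue  # stack-trace frame or other unprefixed line
        if block_id is not None and block_id not in parsed[2]:
            continue
        entries.append(parsed)
    # sorted() is stable, so entries with equal timestamps keep input order
    return sorted(entries, key=lambda e: e[0])
```

Usage would be something like `merge_by_time(open("grep_output.txt"), "blk_9016752944216030896")`, where `grep_output.txt` is a hypothetical file holding the grep results from all nodes.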