Hi,

I have a Hadoop & HBase cluster running Hadoop 1.1.2 and HBase 0.94.7, and I've noticed an issue that stops the cluster from running normally. My use case: several MR jobs read data from one HBase table in the map phase and write data to 3 different tables during the reduce phase. I create the table handles on my own; I don't use TableOutputFormat. The only way out I have found is to restart the region server daemon on the affected region server.
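For context, the reduce side looks roughly like the sketch below. This is a simplified reconstruction, not the real job: the table names, column layout, and routing logic are placeholders; only the pattern of opening the HTable handles myself in setup() and closing them in cleanup(), instead of going through TableOutputFormat, matches what I actually do.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MultiTableReducer extends Reducer<Text, Text, NullWritable, NullWritable> {

        private HTable tableA;
        private HTable tableB;
        private HTable tableC;

        @Override
        protected void setup(Context context) throws IOException {
            // One handle per target table, opened once per reduce task.
            Configuration conf = HBaseConfiguration.create(context.getConfiguration());
            tableA = new HTable(conf, "table_a");   // placeholder table names
            tableB = new HTable(conf, "table_b");
            tableC = new HTable(conf, "table_c");
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException {
            for (Text value : values) {
                Put put = new Put(Bytes.toBytes(key.toString()));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(value.toString()));
                // The real jobs decide per record which of the three tables
                // gets the Put; that routing is elided here.
                tableA.put(put);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException {
            // Closing each handle also flushes its client-side write buffer.
            tableA.close();
            tableB.close();
            tableC.close();
        }
    }

When the problem hits, the logs look like this.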
On the namenode:

cat namenode.2013-10-02 | grep "blk_3136705509461132997_43329"

Wed Oct 02 13:32:17 2013 GMT namenode 3852-0@namenode:0 [INFO] (IPC Server handler 29 on 22700) org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247. blk_3136705509461132997_43329
Wed Oct 02 13:33:38 2013 GMT namenode 3852-0@namenode:0 [INFO] (IPC Server handler 13 on 22700) org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_3136705509461132997_43329, newgenerationstamp=43366, newlength=40045568, newtargets=[10.81.18.101:50010], closeFile=false, deleteBlock=false)

On the region server:

cat regionserver.2013-10-02 | grep "1380720737247"

Wed Oct 02 13:32:17 2013 GMT regionserver 5854-0@datanode1:0 [INFO] (regionserver60020.logRoller) org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720701436, entries=149, filesize=63934833. for /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247
Wed Oct 02 13:33:37 2013 GMT regionserver 5854-0@datanode1:0 [WARN] (DataStreamer for file /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247 block blk_3136705509461132997_43329) org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_3136705509461132997_43329 bad datanode[0] 10.80.40.176:50010
Wed Oct 02 13:33:37 2013 GMT regionserver 5854-0@datanode1:0 [WARN] (DataStreamer for file /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247 block blk_3136705509461132997_43329) org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_3136705509461132997_43329 in pipeline 10.80.40.176:50010, 10.81.111.8:50010, 10.81.18.101:50010: bad datanode 10.80.40.176:50010
Wed Oct 02 13:33:43 2013 GMT regionserver 5854-0@datanode1:0 [INFO] (regionserver60020.logRoller) org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247 retrying...
Wed Oct 02 13:33:43 2013 GMT regionserver 5854-0@datanode1:0 [INFO] (regionserver60020.logRoller) org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247 retrying...
Wed Oct 02 13:33:44 2013 GMT regionserver 5854-0@datanode1:0 [INFO] (regionserver60020.logRoller) org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbase/.logs/datanode1,60020,1380637389766/datanode1%2C60020%2C1380637389766.1380720737247 retrying...
That retry message just keeps repeating:

cat regionserver.2013-10-02 | grep "1380720737247" | grep 'Could not complete' | wc -l

5640

In the logs of the datanode that runs on the same host as the region server:

cat datanode.2013-10-02 | grep "blk_3136705509461132997_43329"

Wed Oct 02 13:32:17 2013 GMT datanode 5651-0@datanode1:0 [INFO] (org.apache.hadoop.hdfs.server.datanode.DataXceiver@ca6b1e3) org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_3136705509461132997_43329 src: /10.80.40.176:36721 dest: /10.80.40.176:50010
Wed Oct 02 13:33:37 2013 GMT datanode 5651-0@datanode1:0 [INFO] (org.apache.hadoop.hdfs.server.datanode.DataXceiver@ca6b1e3) org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.80.40.176:50010, storageID=DS-812180968-10.80.40.176-50010-1380263000454, infoPort=50075, ipcPort=50020): Exception writing block blk_3136705509461132997_43329 to mirror 10.81.111.8:50010
Wed Oct 02 13:33:37 2013 GMT datanode 5651-0@datanode1:0 [INFO] (org.apache.hadoop.hdfs.server.datanode.DataXceiver@ca6b1e3) org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_3136705509461132997_43329 java.io.IOException: Connection reset by peer
Wed Oct 02 13:33:38 2013 GMT datanode 5651-0@datanode1:0 [INFO] (PacketResponder 2 for Block blk_3136705509461132997_43329) org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_3136705509461132997_43329 2 : Thread is interrupted.
Wed Oct 02 13:33:38 2013 GMT datanode 5651-0@datanode1:0 [INFO] (PacketResponder 2 for Block blk_3136705509461132997_43329) org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_3136705509461132997_43329 terminating
Wed Oct 02 13:33:38 2013 GMT datanode 5651-0@datanode1:0 [INFO] (org.apache.hadoop.hdfs.server.datanode.DataXceiver@ca6b1e3) org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_3136705509461132997_43329 received exception java.io.IOException: Connection reset by peer

On datanode2, which datanode1 calls to replicate the WAL:

cat datanode.2013-10-02 | grep "blk_3136705509461132997_43329"

Wed Oct 02 13:32:17 2013 GMT datanode 5986-0@datanode2:0 [INFO] (org.apache.hadoop.hdfs.server.datanode.DataXceiver@6afc190b) org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_3136705509461132997_43329 src: /10.80.40.176:57279 dest: /10.81.111.8:50010
Wed Oct 02 13:33:37 2013 GMT datanode 5986-0@datanode2:0 [INFO] (IPC Server handler 18 on 50020) org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls recoverBlock(block=blk_3136705509461132997_43329, targets=[10.81.111.8:50010, 10.81.18.101:50010])
Wed Oct 02 13:33:37 2013 GMT datanode 5986-0@datanode2:0 [WARN] (PacketResponder 1 for Block blk_3136705509461132997_43329) org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.run():
java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:135)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:326)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
    at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
    at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.write(DataTransferProtocol.java:135)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:939)
    at java.lang.Thread.run(Thread.java:662)
Wed Oct 02 13:33:37 2013 GMT datanode 5986-0@datanode2:0 [INFO] (PacketResponder 1 for Block blk_3136705509461132997_43329) org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_3136705509461132997_43329 1 Exception java.nio.channels.ClosedChannelException
Wed Oct 02 13:33:37 2013 GMT datanode 5986-0@datanode2:0 [INFO] (org.apache.hadoop.hdfs.server.datanode.DataXceiver@6afc190b) org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_3136705509461132997_43329 java.nio.channels.ClosedByInterruptException
Wed Oct 02 13:33:37 2013 GMT datanode 5986-0@datanode2:0 [INFO] (org.apache.hadoop.hdfs.server.datanode.DataXceiver@6afc190b) org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_3136705509461132997_43329 received exception java.io.IOException: Interrupted receiveBlock
Wed Oct 02 13:33:38 2013 GMT datanode 5986-0@datanode2:0 [INFO] (PacketResponder 1 for Block blk_3136705509461132997_43329) org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_3136705509461132997_43329 terminating
Wed Oct 02 13:33:38 2013 GMT datanode 5986-0@datanode2:0 [WARN] (IPC Server handler 18 on 50020) org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to getBlockMetaDataInfo for block (=blk_3136705509461132997_43329) from datanode (=10.81.111.8:50010)
java.io.IOException: Block blk_3136705509461132997_43329 length is 39980544 does not match block file length 40045568
    at org.apache.hadoop.hdfs.server.datanode.FSDataset.validateBlockMetadata(FSDataset.java:1861)
    at org.apache.hadoop.hdfs.server.datanode.FSDataset.startBlockRecovery(FSDataset.java:2166)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.startBlockRecovery(DataNode.java:1783)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1972)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2114)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)

To me this looks like an HDFS issue, but I cannot find the root cause. I checked OS file descriptors and the HDFS xcievers limit (dfs.datanode.max.xcievers); everything looks OK. Any idea is welcome.

Thanks in advance,
Ionut I.
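P.S. In case it matters, this is the setting I double-checked in hdfs-site.xml on the datanodes. The 4096 value is the one commonly recommended for HBase clusters (the Hadoop 1.x default is 256); I only verified that ours is not left at the default.

    <property>
      <!-- upper bound on concurrent DataXceiver threads per datanode -->
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>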