Hello All,

With the help of the Hadoop community we've identified why the data node died. However, I'm still trying to resolve my HBase issues.

Is it normal that when a data node goes down it brings down 3 region servers with it? I was under the impression that the HBase region servers had some kind of failover mechanism that would prevent this. Since there are multiple copies of the data stored, it doesn't make sense that the inaccessibility of one copy causes all other copies to also become inaccessible.
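
In case it helps frame the question, here is roughly how the replication factor can be confirmed from a client. This is only a sketch; the /hbase path is a placeholder and the class name is made up:

<snip>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Rough sketch only: print the replication factor HDFS reports for files under a path.
// Pass a real store file or HLog directory; "/hbase" below is just a placeholder.
// Directories are reported with replication 0, so only the file entries are interesting.
public class CheckReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args.length > 0 ? args[0] : "/hbase");
        for (FileStatus status : fs.listStatus(path)) {
            System.out.println(status.getPath() + " replication=" + status.getReplication());
        }
        fs.close();
    }
}
</snip>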

~ Peter

On Thu, 21 Jun 2012 11:57:55 -0400, Peter Naudus <pnau...@dataraker.com> wrote:

Hello All,

As this problem has both a Hadoop and HBase component, rather than posting the same message to both groups, I'm posting the datanode portion of this problem under the title of "Single disk failure (with HDFS-457 applied) causes data node to die" to the Hadoop users group.

In our production cluster we have 9 servers, each with 4 disks. Each server runs both a data node and a region server.
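
Since we run with HDFS-457 applied, I'm assuming the failed-volume tolerance is controlled by a property along the lines of dfs.datanode.failed.volumes.tolerated (the exact name may depend on the patch level). A minimal sketch for checking what value the datanode configuration resolves to:

<snip>
import org.apache.hadoop.conf.Configuration;

// Minimal sketch: print the failed-volume tolerance the datanode configuration resolves to.
// The property name is an assumption for an HDFS-457-style build; a default of 0 would mean
// a single failed disk still shuts the data node down.
public class VolumeTolerance {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");   // assumes hdfs-site.xml is on the classpath
        int tolerated = conf.getInt("dfs.datanode.failed.volumes.tolerated", 0);
        System.out.println("dfs.datanode.failed.volumes.tolerated = " + tolerated);
    }
}
</snip>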

This morning one of our servers lost one of its disks. This caused a cascading effect which resulted in the data node and 3 of our 9 region servers going down. Up to that point the system was in the middle of compactions and everything was running normally.

All snippets are from the region server log on xxx.xxx.xxx.123

<snip>
12/06/21 10:26:54 INFO regionserver.HRegion: completed compaction on region fpl,P.2563737_TS3600_D.1328072400,1330649094970.68469595e2cca48506cd6756c7a9e6c3. after 2mins, 42sec
12/06/21 10:26:54 INFO regionserver.HRegion: Starting compaction on region fpl,P.69362_TS3600_D.1298955600,1322201140663.fb0ebebfe2257b13d596d74d608a822a.
12/06/21 10:26:54 INFO regionserver.Store: Started compaction of 2 file(s) in cf=D into hdfs://xxxx:8020/hbase/xxxx/fb0ebebfe2257b13d596d74d608a822a/.tmp, seqid=83988687, totalSize=2.0g
12/06/21 10:29:46 INFO regionserver.Store: Completed major compaction of 2 file(s), new file=hdfs://xxxx:8020/hbase/xxxx/fb0ebebfe2257b13d596d74d608a822a/D/1647142169914040514, size=2.0g; total size for store is 2.0g
</snip>

Then the disk on xxx.xxx.xxx.129 failed, causing the data node on that server to die:

<snip>
12/06/21 12:35:23 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-8250761279849076686_833850java.io.IOException: Bad response 1 for block blk_-8250761279849076686_833850 from datanode xxx.xxx.xxx.129:50010
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2664)
...
12/06/21 12:35:23 INFO hdfs.DFSClient: Error Recovery for block blk_-8250761279849076686_833850 waiting for responder to exit.
12/06/21 12:35:23 WARN wal.HLog: IPC Server handler 1 on 60020 took 51641 ms appending an edit to hlog; editcount=1367, len~=34.2k
12/06/21 12:35:23 WARN hdfs.DFSClient: Error Recovery for block blk_-8250761279849076686_833850 bad datanode[1] xxx.xxx.xxx.129:50010
</snip>

Other than xxx.xxx.xxx.129, this data block was also stored on xxx.xxx.xxx.127 and xxx.xxx.xxx.123 (the data node local to this region server). However, the region server was unable to read from those data nodes either. The strange thing is that although *.129 went down, the other data nodes remained up, yet block recovery against them kept failing, as the log below shows.
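
For what it's worth, the replica placement for the affected file can be double-checked from the client side with something like the sketch below (the path argument is a placeholder for the HLog / store file that contains the block above):

<snip>
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Rough sketch: list which hosts the namenode reports for each block of a given file.
// Pass the HLog or store file path as the first argument.
public class BlockHosts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]);
        FileStatus status = fs.getFileStatus(path);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}
</snip>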

<snip>
12/06/21 12:36:23 WARN hdfs.DFSClient: Error Recovery for block blk_-8250761279849076686_833850 failed because recovery from primary datanode xxx.xxx.xxx.123:50010 failed 2 times. Pipeline was xxx.xxx.xxx.123:50010, xxx.xxx.xxx.129:50010, xxx.xxx.xxx.xxx:50010. Will retry...
12/06/21 12:36:23 WARN hdfs.DFSClient: Error Recovery for block blk_-8250761279849076686_833850 bad datanode[1] xxx.xxx.xxx.129:50010
12/06/21 12:36:23 WARN hdfs.DFSClient: Error Recovery for block blk_-8250761279849076686_833850 in pipeline xxx.xxx.xxx.123:50010, xxx.xxx.xxx.129:50010, xxx.xxx.xxx.127:50010: bad datanode xxx.xxx.xxx.129:50010
12/06/21 12:36:23 WARN ipc.HBaseServer: IPC Server handler 16 on 60020 caught: java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)

12/06/21 12:36:23 WARN hdfs.DFSClient: Failed recovery attempt #1 from primary datanode xxxx.xxx.xxx.123:50010
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block blk_-8250761279849076686_833850 is already being recovered, ignoring this request to recover it.
at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1765)
at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1950)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)

at org.apache.hadoop.ipc.Client.call(Client.java:1107)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at $Proxy10.recoverBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2793)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
</snip>

This is repeated until:

<snip>
12/06/21 12:36:23 WARN hdfs.DFSClient: Error Recovery for block blk_-8250761279849076686_833850 failed because recovery from primary datanode xxx.xxx.xxx.123:50010 failed 6 times. Pipeline was xxx.xxx.xxx.123:50010, xxx.xxx.xxx.129:50010, 10.128.204.127:50010. Marking primary datanode as bad.

12/06/21 12:36:23 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@1492dba1) from xxx.xxx.xxx.127:51903: output error

12/06/21 12:36:23 WARN ipc.HBaseServer: IPC Server handler 23 on 60020 caught: java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)

12/06/21 12:37:23 WARN hdfs.DFSClient: Failed recovery attempt #0 from primary datanode xxx.xxx.xxx.127:50010
java.net.SocketTimeoutException: Call to /xxx.xxx.xxx.127:50020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xxx.xxx.xxx.123:54996 remote=/xxx.xxx.xxx.127:50020]
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1138)
at org.apache.hadoop.ipc.Client.call(Client.java:1110)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at $Proxy10.recoverBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2793)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xxx.xxx.xxx.123:54996 remote=/xxx.xxx.xxx.127:50020]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.FilterInputStream.read(FilterInputStream.java:116)
at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:375)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:815)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:724)
</snip>

This was repeated until xxx.xxx.xxx.123, xxx.xxx.xxx.127, and xxx.xxx.xxx.129 were all marked as bad.

After this, I saw what I'm assuming are problems with HBase trying to talk to other region servers:

<snip>
12/06/21 12:39:21 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@54d0259b) from xxx.xxx.xxx.127:54700: output error
12/06/21 12:39:21 WARN ipc.HBaseServer: IPC Server handler 8 on 60020 caught: java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)

12/06/21 12:39:21 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@2d550e9d) from xxx.xxx.xxx.125:36524: output error
12/06/21 12:39:21 WARN ipc.HBaseServer: IPC Server handler 4 on 60020 caught: java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at org.apache.hadoop.hbase.ipc.HBaseServer.channelIO(HBaseServer.java:1389)
at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)

12/06/21 12:39:21 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@36c3ec1f) from xxx.xxx.xxx.128:45630: output error
12/06/21 12:39:21 WARN ipc.HBaseServer: IPC Server handler 17 on 60020 caught: java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at org.apache.hadoop.hbase.ipc.HBaseServer.channelIO(HBaseServer.java:1389)
at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
</snip>

The region servers on xxx.xxx.xxx.123 (this region server), xxx.xxx.xxx.128, and xxx.xxx.xxx.129 all went down. However, it is strange that there were also I/O errors talking to xxx.xxx.xxx.125 and xxx.xxx.xxx.127, which didn't go down.

Eventually, the region server died:

<snip>
java.io.InterruptedIOException: Aborting compaction of store D in region fpl,P.1596002_TS3600_D.1304226000,1334196560513.4106274c5a8852493fc20d2e50a7e428. because user requested stop.
at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:961)
at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:743)
at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:808)
at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:748)
at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)

12/06/21 12:39:24 ERROR hdfs.DFSClient: Exception closing file /hbase/.logs/xxxxxxpnb003.dataraker.net,60020,1338412070372/xxxxxxpnb003.dataraker.net%3A60020.1340272250127 : java.io.IOException: Error Recovery for block blk_-8250761279849076686_833850 failed because recovery from primary datanode 10.128.204.129:50010 failed 6 times. Pipeline was xxx.xxx.xxx.xxx:50010. Aborting...
java.io.IOException: Error Recovery for block blk_-8250761279849076686_833850 failed because recovery from primary datanode xxx.xxx.xxx.xxx:50010 failed 6 times. Pipeline was xxx.xxx.xxx.129:50010. Aborting...
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2833)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
12/06/21 12:39:24 INFO regionserver.ShutdownHook: Shutdown hook finished.
</snip>

The other two region servers that died have similar logs.

I've uploaded the log to: http://www.linuxlefty.com/hbase-hbase-regionserver-xxxxxpnb003.dataraker.net.out

--

Sincerely,

    ~Peter

Peter Naudus
DataRaker





--

Sincerely,

    ~Peter

Peter Naudus
DataRaker
Cell: 917.689.8451
Work: 703.639.4010

