We are running an 80-node cluster:
HDFS version: 0.20.2-cdh3u5
HBase version: 0.90.6-cdh3u5
The issue is that region servers crash infrequently; so far it has happened roughly once a week, never on the same day or at the same time.
The error we are getting in the RegionServer logs is:
2014-11-26 09:11:04,460 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hd073.xxxxxxxx,60020,1407311682582, load=(requests=0, regions=227, usedHeap=9293, maxHeap=12250): IOE in log roller
java.io.IOException: cannot get log writer
    at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:677)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:624)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:560)
    at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:96)
Caused by: java.io.IOException: java.io.IOException: Call to %NAMENODE%:8020 failed on local exception: java.io.IOException: Connection reset by peer
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:106)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:674)
    ... 3 more
Caused by: java.io.IOException: Call to %NAMENODE%:8020 failed on local exception: java.io.IOException: Connection reset by peer
    at org.apache.hadoop.ipc.Client.wrapException(Client.java:1187)
    at org.apache.hadoop.ipc.Client.call(Client.java:1155)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
    at $Proxy7.create(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy7.create(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3417)
    at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:751)
    at org.apache.hadoop.hdfs.DistributedFileSystem.createNonRecursive(DistributedFileSystem.java:200)
    at org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:653)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:444)
    at sun.reflect.GeneratedMethodAccessor364.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:87)
    ... 4 more
Caused by: java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
    at sun.nio.ch.IOUtil.read(IOUtil.java:175)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
    at java.io.FilterInputStream.read(FilterInputStream.java:116)
    at java.io.FilterInputStream.read(FilterInputStream.java:116)
    at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:376)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at java.io.DataInputStream.readInt(DataInputStream.java:370)
    at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:858)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:767)
2014-11-26 09:11:04,460 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Call to %NAMENODE%:8020 failed on local exception: java.io.IOException: Connection reset by peer
    at org.apache.hadoop.ipc.Client.wrapException(Client.java:1187)
    at org.apache.hadoop.ipc.Client.call(Client.java:1155)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
    at $Proxy7.addBlock(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy7.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3719)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3586)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2400(DFSClient.java:2792)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2987)
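For reference, here is a rough sketch of the standalone write test I'm planning to run from an affected node, to exercise the same NameNode create()/addBlock() path the WAL roll goes through but without involving HBase (the class name and test path are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Standalone HDFS write probe (sketch only). If the NameNode connection
// from this host is flaky, this should fail with the same kind of
// "Call to ...:8020 failed" exception as the log roller.
public class HdfsWriteProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        Path p = new Path("/tmp/hdfs-write-probe-" + System.currentTimeMillis());
        long start = System.currentTimeMillis();
        FSDataOutputStream out = fs.create(p);     // NameNode create RPC
        out.writeBytes("probe\n");
        out.close();                               // flush forces block allocation (addBlock) and the DataNode pipeline
        fs.delete(p, true);
        System.out.println("write ok in " + (System.currentTimeMillis() - start) + " ms");
    }
}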
The servers aren't under any major load, but they appear to be having trouble communicating with the NameNode. There are what appear to be corresponding errors in the DataNode log. Those look like:
2014-11-26 00:02:15,423 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.100.2.76:50010, storageID=DS-562360767-10.100.2.76-50010-1358397869707, infoPort=50075, ipcPort=50020):Got exception while serving blk_-5442848061718769346_625833634 to /10.100.2.76:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.100.2.76:50010 remote=/10.100.2.76:55462]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:397)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:279)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:175)
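If I'm reading that right, the 480000 ms is the default DataNode socket write timeout (dfs.datanode.socket.write.timeout, 8 minutes), i.e. the DataNode sat for 8 minutes waiting for a reader on the same host (10.100.2.76) to accept data. A throwaway sketch just to confirm what that value is on our nodes (not a suggestion to raise it, since that would only hide whatever is stalling the reader):

import org.apache.hadoop.conf.Configuration;

// Prints the effective DataNode socket write timeout on this node, to
// confirm the 480000 ms above is the default rather than something we
// overrode. Loading hdfs-site.xml explicitly is an assumption about
// where the setting lives in our deployment.
public class PrintDnWriteTimeout {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");
        long timeoutMs = conf.getLong("dfs.datanode.socket.write.timeout", 8 * 60 * 1000L);
        System.out.println("dfs.datanode.socket.write.timeout = " + timeoutMs + " ms");
    }
}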
What I'm having trouble proving, so that I can make an educated guess at a fix, is whether this is a genuine communication problem with the NameNode (i.e. that server itself is struggling), or whether the real problem is local write latency, with the timeouts caused by resource contention on the DataNode/RegionServer host itself.
We are running a RegionServer, DataNode, and TaskTracker on each worker node.
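To try to separate the two cases, one thing I'm considering is a small probe on every worker that times a cheap NameNode RPC every few seconds, so I can see whether latency spikes are cluster-wide (pointing at the NameNode) or confined to the node that aborts (pointing at local resource pressure, e.g. a long GC pause in the colocated processes). A rough sketch; the 10-second interval is an arbitrary choice:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// NameNode latency probe (sketch). exists("/") is a single NameNode RPC
// with no DataNode involvement, so slow samples here point at the
// NameNode round trip rather than at block reads or writes.
public class NameNodeLatencyProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path root = new Path("/");

        while (true) {
            long start = System.currentTimeMillis();
            fs.exists(root);  // getFileInfo RPC to the NameNode
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(System.currentTimeMillis() + " exists(/) took " + elapsed + " ms");
            Thread.sleep(10000L);
        }
    }
}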
Any insight or suggestions would be much appreciated.
Thanks,
Adam Wilhelm