We are running an 80-node cluster:
HDFS version: 0.20.2-cdh3u5
HBase version: 0.90.6-cdh3u5

The issue we have is that region servers crash intermittently. So far it has
happened about once a week, never on the same day or at the same time.

The error we are getting in RegionServer logs is:

2014-11-26 09:11:04,460 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hd073.xxxxxxxx,60020,1407311682582, load=(requests=0, regions=227, usedHeap=9293, maxHeap=12250): IOE in log roller
java.io.IOException: cannot get log writer
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:677)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:624)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:560)
        at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:96)
Caused by: java.io.IOException: java.io.IOException: Call to %NAMENODE%:8020 failed on local exception: java.io.IOException: Connection reset by peer
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:106)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:674)
        ... 3 more
Caused by: java.io.IOException: Call to %NAMENODE%:8020 failed on local exception: java.io.IOException: Connection reset by peer
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:1187)
        at org.apache.hadoop.ipc.Client.call(Client.java:1155)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
        at $Proxy7.create(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy7.create(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3417)
        at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:751)
        at org.apache.hadoop.hdfs.DistributedFileSystem.createNonRecursive(DistributedFileSystem.java:200)
        at org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:653)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:444)
        at sun.reflect.GeneratedMethodAccessor364.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:87)
        ... 4 more
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
        at sun.nio.ch.IOUtil.read(IOUtil.java:175)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:376)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:858)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:767)
2014-11-26 09:11:04,460 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Call to %NAMENODE%:8020 failed on local exception: java.io.IOException: Connection reset by peer
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:1187)
        at org.apache.hadoop.ipc.Client.call(Client.java:1155)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
        at $Proxy7.addBlock(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy7.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3719)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3586)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2400(DFSClient.java:2792)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2987)

The servers aren't under any major load, but they appear to be having trouble
communicating with the NameNode. There are what appear to be corresponding
errors in the DataNode log. Those look like:

2014-11-26 00:02:15,423 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.100.2.76:50010, storageID=DS-562360767-10.100.2.76-50010-1358397869707, infoPort=50075, ipcPort=50020):Got exception while serving blk_-5442848061718769346_625833634 to /10.100.2.76:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.100.2.76:50010 remote=/10.100.2.76:55462]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:397)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:279)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:175)

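One thing I'm planning to do is pull the few minutes around each abort out of 
the RegionServer, DataNode, and NameNode logs so I can read them side by side. 
Roughly something like this (the log paths are just placeholders for ours; the 
timestamp format matches the log lines above):

#!/usr/bin/env python
# Rough sketch: print every log line whose timestamp falls inside a window
# around a known crash time, for each log of interest, so the RS abort, the
# DN timeout and whatever the NameNode was doing can be compared directly.
from datetime import datetime, timedelta

CRASH = datetime.strptime("2014-11-26 09:11:04,460", "%Y-%m-%d %H:%M:%S,%f")
WINDOW = timedelta(minutes=5)
LOGS = [
    "/var/log/hbase/hbase-hbase-regionserver-hd073.log",    # placeholder paths
    "/var/log/hadoop/hadoop-hdfs-datanode-hd073.log",
    "/var/log/hadoop/hadoop-hdfs-namenode-namenode01.log",
]

def parse_ts(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
    except ValueError:
        return None

for path in LOGS:
    print("==== %s ====" % path)
    with open(path) as f:
        for line in f:
            ts = parse_ts(line)
            if ts is not None and abs(ts - CRASH) <= WINDOW:
                print(line.rstrip())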

What I am having trouble proving, before I can make an educated guess at a fix, 
is which of two things is going on: an actual communication problem with the 
NameNode caused by issues on that server, or local write problems where the 
timeouts come from resource contention on the DataNode/RegionServer host itself.
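
On the NameNode side, I was thinking of running a crude connect-latency probe 
against the NameNode RPC port (8020) from a couple of region servers, to see 
whether connects slow down or fail around the times of the aborts. A rough 
sketch (the hostname is a placeholder for our %NAMENODE%; interval and timeout 
are arbitrary):

#!/usr/bin/env python
# Rough sketch: time a TCP connect to the NameNode RPC port every few seconds
# and log the result, so spikes or failures can be lined up with the
# "Call to %NAMENODE%:8020 failed" aborts.
import socket
import time
from datetime import datetime

NAMENODE_HOST = "namenode.example.com"   # placeholder for our NameNode hostname
NAMENODE_PORT = 8020
CONNECT_TIMEOUT_SECS = 5.0
INTERVAL_SECS = 10

while True:
    start = time.time()
    try:
        conn = socket.create_connection((NAMENODE_HOST, NAMENODE_PORT),
                                        timeout=CONNECT_TIMEOUT_SECS)
        conn.close()
        status = "ok"
    except socket.error as err:
        status = "FAIL %s" % err
    elapsed_ms = (time.time() - start) * 1000
    print("%s connect %s:%d %s %.1f ms" % (
        datetime.now().strftime("%Y-%m-%d %H:%M:%S,%f")[:-3],
        NAMENODE_HOST, NAMENODE_PORT, status, elapsed_ms))
    time.sleep(INTERVAL_SECS)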

We are running RS, DN, and TT on each of the worker servers.
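
On the local side, since all three daemons share each box, I was considering 
sampling load average and disk throughput from /proc on a few workers to see 
whether local pressure lines up with the 480000 ms write timeouts. Another 
rough sketch (Linux-only; the sample interval is arbitrary):

#!/usr/bin/env python
# Rough sketch: periodically sample /proc/loadavg and aggregate sector counts
# from /proc/diskstats on a worker node, timestamping each sample so it can be
# matched against the DataNode SocketTimeoutExceptions.
import time
from datetime import datetime

INTERVAL_SECS = 30

def read_loadavg():
    with open("/proc/loadavg") as f:
        return f.read().split()[:3]    # 1, 5 and 15 minute load averages

def read_io_sectors():
    """Sum sectors read/written across devices listed in /proc/diskstats."""
    # Note: on kernels that report full stats for partitions this double-counts
    # (disk + its partitions); the trend over time is what matters here.
    reads, writes = 0, 0
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 10:
                reads += int(fields[5])    # sectors read
                writes += int(fields[9])   # sectors written
    return reads, writes

prev = read_io_sectors()
while True:
    time.sleep(INTERVAL_SECS)
    cur = read_io_sectors()
    load = read_loadavg()
    print("%s load=%s sectors_read=%d sectors_written=%d" % (
        datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "/".join(load), cur[0] - prev[0], cur[1] - prev[1]))
    prev = cur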

Any insight or suggestions would be much appreciated.

Thanks,


Adam Wilhelm
