You can confirm whether it's really a GC issue by looking at the master log: if the log splitting started before the errors in the region server log, then something happened to the region server. It's usually GC, but it could be something else.
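As a sketch, the two settings discussed in this thread would go in hbase-site.xml roughly like this (the 300000 ms value is only an example, not a recommendation):

```xml
<!-- hbase-site.xml: example values only -->
<configuration>
  <!-- Raise the ZooKeeper session timeout so that a long GC pause is
       less likely to expire the region server's session (default 180000 ms). -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>300000</value>
  </property>
  <!-- MemStore-Local Allocation Buffers, enabled in this thread to reduce
       heap fragmentation and the long GC pauses it causes. -->
  <property>
    <name>hbase.hregion.memstore.mslab.enabled</name>
    <value>true</value>
  </property>
</configuration>
</imports>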
The fact that mslab helped does point to a GC issue. Regarding configuration, you can raise zookeeper.session.timeout above the default 180000 ms (3 minutes). It is also worth making sure that your region server and ZooKeeper agree on the same timeout when the region server starts; depending on how you start ZooKeeper, it might end up being different. The line looks like this:

17:39:20,189 INFO org.apache.zookeeper.server.NIOServerCnxn: Established session 0xsome_hex with negotiated timeout 180000 for client blah_address

J-D

On Mon, Apr 18, 2011 at 9:33 AM, Andy Sautins <[email protected]> wrote:
>
>   J-D, thanks for the response. I suspect you are right that the error
> message was a red herring. However, I don't see "Client session timed out"
> in any of our logs when the region server dies, so I'm not sure it is the
> situation you are thinking of. Note that we are going to enable GC logging
> to get an idea of how long we pause in GC in our region servers. We also set
> hbase.hregion.memstore.mslab.enabled to true in hbase-site.xml, which seems
> to have helped, although I don't have too many data points at this point to
> confirm. Before we enabled mslab we had region servers die every day or two;
> after enabling mslab we just went 4 days without a region server dying, which
> was better.
>
>   Following down our latest region server death, it seems to have died
> because it couldn't close the hlog, which sounds like the situation you
> mentioned in your response, although I'm not sure I fully understand IO
> fencing. The stack traces are below.
>
>   Are there any configuration parameters that could help keep the region
> server from dying when it encounters this situation? We generally just
> restart the region server and it recovers nicely, but it would be ideal not
> to need to log in and restart region servers due to this situation.
>
>   At this point it sounds like my best bet is to try to understand whether
> we are still seeing big GC pauses even with mslab enabled, to identify if
> that is our problem.
>
>   Thanks for your help
>
>   Andy
>
> 2011-04-18 02:04:12,862 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append. Requesting close of hlog
> java.io.IOException: Error Recovery for block blk_-4878958038737199123_30707596 failed because recovery from primary datanode 10.18.0.28:50010 failed 6 times. Pipeline was 10.18.0.28:50010. Aborting...
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
> 2011-04-18 02:04:12,862 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append. Requesting close of hlog
> java.io.IOException: Error Recovery for block blk_-4878958038737199123_30707596 failed because recovery from primary datanode 10.18.0.28:50010 failed 6 times. Pipeline was 10.18.0.28:50010. Aborting...
>
>   The above exception seems to correlate to the following error on the
> datanode.
>
> 2011-04-18 02:04:08,773 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to getBlockMetaDataInfo for block (=blk_-4878958038737199123_30707596) from datanode (=10.18.0.28:50010)
> java.io.IOException: Block blk_-4878958038737199123_30711904 length is 45855569 does not match block file length 45853290
>         at org.apache.hadoop.hdfs.server.datanode.FSDataset.validateBlockMetadata(FSDataset.java:1661)
>         at org.apache.hadoop.hdfs.server.datanode.FSDataset.startBlockRecovery(FSDataset.java:1921)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.startBlockRecovery(DataNode.java:1658)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1772)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1930)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>         at java.lang.reflect.Method.invoke(Unknown Source)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1415)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1411)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Unknown Source)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1409)
> 2011-04-18 02:04:08,774 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 50020, call recoverBlock(blk_-4878958038737199123_30707596, false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@2cda6e31) from 10.18.0.28:52651: error: java.io.IOException: All datanodes failed: block=blk_-4878958038737199123_30707596, datanodeids=[10.18.0.28:50010]
>
>   And the block does look like it is associated with a log file:
>
> 2011-04-18 01:37:17,661 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /user/hbase/.logs/hd28.dfs.returnpath.net,60020,1302812828956/hd28.dfs.returnpath.net%3A60020.1303112237535. blk_-4878958038737199123_30707596
>
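For reference, the GC logging mentioned in the thread is typically enabled through the region server's JVM options in hbase-env.sh. A minimal sketch, assuming a Sun/Oracle Java 6 HotSpot JVM (the flag names and log path here are examples, not a recommendation):

```shell
# hbase-env.sh: example GC logging flags for a HotSpot Java 6 JVM.
# Pause times in the resulting log show whether GC stalls are
# approaching the ZooKeeper session timeout.
export HBASE_OPTS="$HBASE_OPTS -verbose:gc \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/log/hbase/gc-hbase.log"
```

Correlating long "application stopped" entries in that log with the region server's death times is one way to confirm (or rule out) GC as the cause.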
