What is your distributed hardware/services configuration? Where are your masters and slaves located, and what are the specs of each?
You have automatic major compaction set to zero, yet the issues appear near a major compaction event. Are you running manual compactions while a heavy put load is in progress?

On Tue, Dec 3, 2013 at 6:45 PM, Bill Sanchez <bill.sanchez2...@gmail.com> wrote:
> Hello,
>
> I am seeking some advice on my HBase issue. I am trying to configure a
> system that will eventually load and store approximately 50GB-80GB of data
> daily. This data consists of files that are roughly 3MB-5MB each, with some
> reaching 20MB and some as small as 1MB. The load job does roughly 20,000
> puts to the same table, spread across an initial set of 20 pre-split regions
> on 20 region servers. During the first load I see some splitting (ending
> with around 50 regions), and in subsequent loads the number of regions
> goes much higher.
>
> After running similarly sized loads about 4 or 5 times, I start to see the
> following behavior that I cannot explain. The table in question has
> VERSIONS=1, and some of these test loads use the same data, but not all.
> Below is a summary of the behavior along with a few of the configuration
> settings I have tried so far.
>
> Environment:
>
> HBase 0.94.13-security with Kerberos enabled
> ZooKeeper 3.4.5
> Hadoop 1.0.4
>
> Symptoms:
>
> 1. Requests per second fall to 0 for all region servers
> 2. Log files show socket timeout exceptions after waiting for scans of META
> 3. Region servers sometimes eventually show up as dead
> 4. Once HBase reaches a broken state, some regions show up as in a
>    transition state indefinitely
> 5. All of these issues seem to happen around the time of major compaction
>    events
>
> This issue seems to be sensitive to hbase.rpc.timeout, which I increased
> significantly, but that only lengthened the amount of time until I see
> socket timeout exceptions.
>
> A few notes:
>
> 1. I don't see massive GC in the GC log.
> 2. Originally Snappy compression was enabled, but as a test I turned it
>    off and it doesn't seem to make any difference in the testing.
> 3. The WAL is disabled for the table involved in the load.
> 4. TeraSort appears to run normally in HDFS.
> 5. The HBase randomWrite and randomRead tests appear to run normally on
>    this cluster (although randomWrite does not write anywhere close to
>    3MB-5MB).
> 6. Ganglia is available in my environment.
>
> Settings already altered:
>
> 1. hbase.rpc.timeout=900000 (I realize this may be too high)
> 2. hbase.regionserver.handler.count=100
> 3. ipc.server.max.callqueue.size=10737418240
> 4. hbase.regionserver.lease.period=900000
> 5. hbase.hregion.majorcompaction=0 (I have been manually compacting
>    between loads with no difference in behavior)
> 6. hbase.hregion.memstore.flush.size=268435456
> 7. dfs.datanode.max.xcievers=131072
> 8. dfs.datanode.handler.count=100
> 9. ipc.server.listen.queue.size=256
> 10. -Xmx16384m -XX:+UseConcMarkSweepGC -XX:+UseMembar -verbose:gc
>     -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/logs/gc.log -Xms16384m
>     -XX:PrintFLSStatistics=1 -XX:+CMSParallelRemarkEnabled
>     -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseParNewGC
> 11. I have tried other GC settings, but they don't seem to have any real
>     impact on GC performance in this case.
>
> Any advice is appreciated.
>
> Thanks
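For reference, here is a minimal sketch of what the WAL-disabled load and a manual major compaction between loads might look like with the 0.94-era Java client. The table name "load_table", family "cf", row key, and payload size are hypothetical placeholders, not taken from the original post; it is only meant to make the question about where the compactions are triggered concrete.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LoadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            // Hypothetical table and family names; substitute the real ones.
            HTable table = new HTable(conf, "load_table");
            table.setAutoFlush(false);                   // batch puts client-side
            table.setWriteBufferSize(64L * 1024 * 1024); // larger buffer for 3MB-5MB values

            // One representative put with the WAL disabled, as described in the post.
            Put put = new Put(Bytes.toBytes("rowkey-0001"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("payload"), new byte[3 * 1024 * 1024]);
            put.setWriteToWAL(false);
            table.put(put);

            table.flushCommits();
            table.close();

            // Manual major compaction between loads; this call is asynchronous,
            // so running it while the next heavy put load starts would overlap them.
            HBaseAdmin admin = new HBaseAdmin(conf);
            admin.majorCompact("load_table");
            admin.close();
        }
    }

If the compaction request is issued like this without waiting for it to finish, it can still be running when the next load begins, which would match the symptom of problems clustering around compaction time.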