What is your distributed hardware/services configuration? Where do your
masters and slaves run, and what is the spec of each?

You have the major compaction interval set to zero, yet the issues happen
near a major compaction event. Are you running manual compactions during a
heavy put operation?
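
For reference, by "manual compactions" I mean something along these lines
against the 0.94 client API (just a sketch; the table name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ManualCompact {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // Asynchronously requests a major compaction of every region in the
      // table; running this while the put job is active would explain why
      // the trouble lines up with compaction events.
      admin.majorCompact("mytable");
    } finally {
      admin.close();
    }
  }
}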


On Tue, Dec 3, 2013 at 6:45 PM, Bill Sanchez <bill.sanchez2...@gmail.com> wrote:

> Hello,
>
> I am seeking some advice on my HBase issue.  I am trying to configure a
> system that will eventually load and store approximately 50GB-80GB of data
> daily.  This data consists of files that are roughly 3MB-5MB each, with
> some reaching 20MB and some as small as 1MB.  The load job does roughly
> 20,000 puts to the same table, spread across an initial set of 20
> pre-split regions on 20 region servers.  During the first load I see some
> splitting (ending with around 50 regions), and in subsequent loads the
> number of regions goes much higher.
>
> After running similarly sized loads 4 or 5 times, I start to see the
> following behavior, which I cannot explain.  The table in question has
> VERSIONS=1, and some of these test loads reuse the same data, but not all
> of them do.  Below is a summary of the behavior along with a few of the
> configuration settings I have tried so far.
>
> Environment:
>
> HBase 0.94.13-security with Kerberos enabled
> Zookeeper 3.4.5
> Hadoop 1.0.4
>
> Symptoms:
>
> 1.  Requests per second fall to 0 for all region servers
> 2.  Log files show socket timeout exceptions after waiting for scans of
> META
> 3.  Region servers sometimes end up listed as dead
> 4.  Once HBase reaches a broken state, some regions remain in transition
> indefinitely
> 5.  All of these issues seem to happen around the time of major compaction
> events
>
> This issue seems to be sensitive to hbase.rpc.timeout, which I increased
> significantly, but that only served to lengthen the amount of time until I
> see the socket timeout exceptions.
>
> A few notes:
>
> 1.  I don't see massive GC in the gc log.
> 2.  Originally Snappy compression was enabled, but as a test I turned it
> off and it doesn't seem to make any difference.
> 3.  The WAL is disabled for the table involved in the load
> 4.  TeraSort appears to run normally in HDFS
> 5.  The HBase randomWrite and randomRead tests appear to run normally on
> this cluster (although randomWrite does not write anywhere close to
> 3MB-5MB)
> 6.  Ganglia is available in my environment
>
> Settings already altered:
>
> 1.  hbase.rpc.timeout=900000 (I realize this may be too high)
> 2.  hbase.regionserver.handler.count=100
> 3.  ipc.server.max.callqueue.size=10737418240
> 4.  hbase.regionserver.lease.period=900000
> 5.  hbase.hregion.majorcompaction=0 (I have been manually compacting
> between loads with no difference in behavior)
> 6.  hbase.hregion.memstore.flush.size=268435456
> 7.  dfs.datanode.max.xcievers=131072
> 8.  dfs.datanode.handler.count=100
> 9.  ipc.server.listen.queue.size=256
> 10.  -Xmx16384m -XX:+UseConcMarkSweepGC -XX:+UseMembar -verbose:gc
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/logs/gc.log -Xms16384m
> -XX:PrintFLSStatistics=1 -XX:+CMSParallelRemarkEnabled
> -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseParNewGC
> 11. I have tried other GC settings, but they don't seem to have any real
> impact on GC performance in this case
>
> Any advice is appreciated.
>
> Thanks
>
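
Also for reference, my reading of the load you describe (pre-split table,
VERSIONS=1, WAL off on the puts) is roughly the sketch below, against the
0.94 client; the table name, family, payload size and split boundaries are
all made up, so correct me if the real job differs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Create the table pre-split into 20 regions, keeping one version per cell.
    HTableDescriptor desc = new HTableDescriptor("docs");   // placeholder name
    HColumnDescriptor family = new HColumnDescriptor("f");  // placeholder family
    family.setMaxVersions(1);                                // VERSIONS => 1
    desc.addFamily(family);
    byte[][] splits = new byte[19][];                        // 19 split keys => 20 regions
    for (int i = 0; i < splits.length; i++) {
      splits[i] = Bytes.toBytes(String.format("%02d", i + 1));
    }
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(desc, splits);
    admin.close();

    // Write one multi-MB value per row with the WAL turned off, as in note 3.
    HTable table = new HTable(conf, "docs");
    byte[] payload = new byte[4 * 1024 * 1024];              // stand-in for a file body
    for (int i = 0; i < 20000; i++) {
      Put put = new Put(Bytes.toBytes(String.format("%02d-%08d", i % 20, i)));
      put.setWriteToWAL(false);
      put.add(Bytes.toBytes("f"), Bytes.toBytes("data"), payload);
      table.put(put);
    }
    table.close();
  }
}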

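One more thought on your setting 1: most of the values in your list live in
hbase-site.xml on the cluster, but if I remember right hbase.rpc.timeout is
also read on the client side, so the process doing the puts needs to see the
same value. A client-side override would look roughly like this (again just a
sketch with a placeholder table name):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class ClientTimeoutOverride {
  public static void main(String[] args) throws Exception {
    // Loads hbase-site.xml from the classpath, then overrides the RPC timeout
    // (milliseconds) for this client only; the value mirrors setting 1 above.
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.rpc.timeout", 900000);

    HTable table = new HTable(conf, "mytable");  // placeholder table name
    // ... issue the puts with the longer timeout in effect ...
    table.close();
  }
}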