Hi

Firstly, let me compliment the HBase team on a great piece of software. We're running a few clusters that are working well, but we're really struggling with a new one I'm trying to set up and could use a bit of help. I have read as much as I can but just can't seem to get it right.
The difference between this cluster and the others is that this one's load is 99% writes. Each write contains about 40 columns to a single table and column family, and the total data size varies between about 1 and 2 KB. The load per server varies between 20 and 90 requests per second at different times of the day. The row keys are UUIDs, so writes are uniformly distributed across the (currently 60) regions.

The problem seems to be that after some time a GC cycle takes longer than expected on one of the regionservers and the master kills that regionserver. This morning I ran the system up until the first regionserver failure and recorded the data with Ganglia. I have attached the following Ganglia graphs:

- hbase.regionserver.compactionQueueSize
- hbase.regionserver.memstoreSizeMB
- requests_per_minute (to the service that calls HBase)
- request_processing_time (of the service that calls HBase)

Any assistance would be greatly appreciated. I did have GC logging on, so I have access to all that data too.

Best regards
Simon Kelly

*Cluster details*
*----------------------*
It's running on 5 machines with the following specs:

- CPUs: 4 x 2.39 GHz
- RAM: 8 GB
- Ubuntu 10.04.2 LTS

The Hadoop cluster (version 1.0.1, r1243785) runs across all the machines and has 8 TB of capacity (60% unused). On top of that is HBase version 0.92.1, r1298924. All the servers run Hadoop datanodes and HBase regionservers. One server hosts the Hadoop primary namenode and the HBase master. Three servers form the ZooKeeper quorum.

The HBase config is as follows:

- HBASE_OPTS="-Xmn128m -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70"
- HBASE_HEAPSIZE=4096
- hbase.rootdir : hdfs://server1:8020/hbase
- hbase.cluster.distributed : true
- hbase.zookeeper.property.clientPort : 2222
- hbase.zookeeper.quorum : server1,server2,server3
- zookeeper.session.timeout : 30000
- hbase.regionserver.maxlogs : 16
- hbase.regionserver.handler.count : 50
- hbase.regionserver.codecs : lzo
- hbase.master.startup.retainassign : false
- hbase.hregion.majorcompaction : 0

(For the benefit of those without the attachments, I'll describe the graphs:

- 0900 - system starts
- 1010 - memstore reaches 1.2 GB and flushes to 500 MB; a few HBase compactions happen and there is a slight increase in request_processing_time
- 1040 - memstore reaches 1.0 GB and flushes to 500 MB (no HBase compactions)
- 1110 - memstore reaches 1.0 GB and flushes to 300 MB; a few more HBase compactions happen and a slightly larger increase in request_processing_time
- 1200 - memstore reaches 1.3 GB and flushes to 200 MB; more HBase compactions and an increase in request_processing_time
- 1230 - HBase logs for server1 record "We slept 13318ms instead of 3000ms", regionserver1 is killed by the master, and request_processing_time goes way up
- 1326 - HBase logs for server3 record "We slept 77377ms instead of 3000ms" and regionserver2 is killed by the master)
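
In case the shape of the write load matters for diagnosis, here is a minimal sketch of what each write roughly looks like from our client (HBase 0.92 API). The table name, column family, and column/value names below are illustrative placeholders, not our real schema:

    import java.util.UUID;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "events");   // table name is illustrative
            byte[] family = Bytes.toBytes("d");          // single column family

            // Row key is a random UUID, so writes spread evenly across the regions.
            byte[] rowKey = Bytes.toBytes(UUID.randomUUID().toString());

            Put put = new Put(rowKey);
            // Roughly 40 columns per write, 1-2 KB of data in total.
            for (int i = 0; i < 40; i++) {
                put.add(family, Bytes.toBytes("col" + i), Bytes.toBytes("value" + i));
            }
            table.put(put);

            table.close();
        }
    }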