I'm looking at ganglia, so the numbers are somewhat approximate (this is for a server that crashed about half an hour ago after running out of memory):

Store files are hovering just below 1k. Over the last 24 hours it has varied by about 100 files (I'm looking at hbase.regionserver.storefiles).

Block cache count is about 24k, varying by about 2k. Our block cache free goes between 0.7G and 0.4G. It looks like we have almost 3G free after restarting a region server.

The evicted block count went from 210k to 320k over a 24 hour period. Hit ratio is close to 100 (the graph isn't very detailed, so I'm guessing it is around 98-99%).

Block cache size stays at about 2GB.
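
To put the eviction numbers above in perspective, here is a rough back-of-the-envelope sketch in Python. It only uses the approximate figures I read off the ganglia graphs; the variable names are mine and not HBase metric names.

```python
# Approximate figures read off the ganglia graphs quoted above.
evicted_start = 210_000        # evicted block count at the start of the window
evicted_end = 320_000          # evicted block count 24 hours later
window_seconds = 24 * 60 * 60  # length of the window in seconds

evictions = evicted_end - evicted_start
evictions_per_second = evictions / window_seconds

cached_blocks = 24_000         # approximate block cache count
# How many times over the resident block count was evicted in a day.
turnover_per_day = evictions / cached_blocks

print(f"{evictions} evictions in 24h, about {evictions_per_second:.2f} per second")
print(f"roughly {turnover_per_day:.1f}x the cached block count per day")
```

That works out to a little over one eviction per second, which together with the high hit ratio suggests the block cache is turning over slowly rather than thrashing.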


On 10/30/2012 6:21 PM, Jeff Whiting wrote:
We have no coprocessors.  We are running replication from this cluster to 
another one.

What is the best way to see how many store files we have? Or checking on the 
block cache?


On 10/30/2012 12:43 AM, ramkrishna vasudevan wrote:

Are you using any coprocessors? Can you see how many store files are

The number of blocks getting cached will give you an idea too.


On Tue, Oct 30, 2012 at 4:25 AM, Jeff Whiting <> wrote:

We have 6 region servers, each given 10G of memory for hbase.  Each region server
has an average of about 100 regions and across the cluster we are averaging
about 100 requests / second with a pretty even read / write load.  We are
running cdh4 (0.92.1-cdh4.0.1, rUnknown)

Looking over our load and our requests, I feel that the 10GB of memory
should be enough to handle the load and that we shouldn't really be pushing
the memory limits.

However, what we are seeing is that our memory usage goes up slowly until
the region server starts sputtering due to GC issues, and it will
eventually get timed out by zookeeper and be killed.

We'll see aborts like this in the log:
2012-10-29 08:10:52,132 FATAL 
ABORTING region server,60020,1351233245547:
Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException:
Server REPORT rejected; currently processing,60020,1351233245547
as dead server
2012-10-29 08:10:52,250 FATAL 
RegionServer abort: loaded coprocessors are: []
2012-10-29 08:10:52,392 FATAL 
ABORTING region server,60020,1351233245547:
0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf received
expired from ZooKeeper, aborting
2012-10-29 08:10:52,401 FATAL 
RegionServer abort: loaded coprocessors are: []

Which are "caused" by:
2012-10-29 08:07:40,646 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 29014ms instead of 3000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see****zkexpired<>
2012-10-29 08:08:39,074 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 28121ms instead of 3000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see****zkexpired<>
2012-10-29 08:09:13,261 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 31124ms instead of 3000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see****zkexpired<>
2012-10-29 08:09:45,536 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 32209ms instead of 3000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see****zkexpired<>
2012-10-29 08:10:18,103 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 32557ms instead of 3000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see****zkexpired<>
2012-10-29 08:10:51,896 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 33741ms instead of 3000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see****zkexpired<>
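
To track how these pauses grow over time, the Sleeper warnings can be scraped out of the region server log. The following is a minimal Python sketch, assuming the message text keeps the "slept Nms instead of Nms" wording shown above; the function name pause_overruns is my own and the sample is two of the warnings quoted above, shortened.

```python
import re

# Matches the "slept <actual>ms instead of <expected>ms" part of the warning.
SLEPT = re.compile(r"slept (\d+)ms instead of (\d+)ms")

def pause_overruns(log_text):
    """Return (slept_ms, expected_ms) pairs for every Sleeper warning found."""
    # Search the whole text so that wrapped log lines are still picked up.
    return [(int(slept), int(expected)) for slept, expected in SLEPT.findall(log_text)]

sample = """\
2012-10-29 08:07:40,646 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 29014ms instead of 3000ms, this is likely due to a long garbage
2012-10-29 08:10:51,896 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 33741ms instead of 3000ms, this is likely due to a long garbage
"""

pauses = pause_overruns(sample)
# Largest overshoot beyond the expected sleep time, in milliseconds.
worst = max(slept - expected for slept, expected in pauses)
print(pauses, worst)
```

Once a pause like these exceeds the ZooKeeper session timeout, the session expires and the region server aborts as in the FATAL lines above.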

We'll also see a bunch of responseTooSlow and operationTooSlow as GC kicks
in and really kills the region server's performance.

We have the jvm metrics kicking out to ganglia and looking at
jvm.RegionServer.metrics.memHeapUsedM you can see that it will go up
over time and eventually run out of memory.  I can also see in
hmaster:60010/master-status that the usedHeapMB just goes up and I can make
a pretty educated guess as to what server will go down next. It will take
several days to a week of continuous running (after restarting a region
server) before we have a potential problem.
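
The "educated guess" about which server goes down next can be made a bit more systematic by fitting a straight line to heap-usage samples. This is only a sketch in Python: the samples below are made-up numbers for illustration, not readings from our cluster, and hours_until_full is a hypothetical helper. Heap usage is a sawtooth because of GC, so the samples should be the low points after collections rather than raw readings.

```python
def hours_until_full(samples, heap_limit_mb):
    """Fit a least-squares line to (hour, used_mb) samples and return how many
    hours after the last sample the line reaches heap_limit_mb.
    Returns None if usage is not growing."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    covariance = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance          # growth in MB per hour
    if slope <= 0:
        return None
    intercept = mean_m - slope * mean_t
    t_full = (heap_limit_mb - intercept) / slope
    return t_full - samples[-1][0]

# Hypothetical post-GC heap usage samples: (hours since restart, used MB).
samples = [(0, 4000), (24, 5000), (48, 6000), (72, 7000)]
remaining = hours_until_full(samples, heap_limit_mb=10000)
print(f"about {remaining:.0f} hours until the heap limit at this rate")
```

A linear fit is the simplest possible model; a leak that accelerates would reach the limit sooner than this predicts.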

Our next one to go will probably be ds6 and jmap -heap shows:
concurrent mark-sweep generation:
    capacity = 10398531584 (9916.8125MB)
    used     = 9036165000 (8617.558479309082MB)
    free     = 1362366584 (1299.254020690918MB)
    86.89847145248619% used

So we are using 86% of the 10GB heap allocated to the concurrent mark and
sweep generation.  Looking at ds6 in the web interface, which has
information about the non-rpc tasks it is running, it doesn't show
any compactions or any background tasks happening. Nor are there any active
rpc calls that are longer than 0 seconds (it seems to be handling the
requests just fine).
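
As a sanity check, the percentage and free space in the jmap -heap output above can be re-derived from the raw byte counts. A small Python sketch using only the numbers quoted for ds6:

```python
# Values copied from the jmap -heap output for ds6 quoted above.
capacity_bytes = 10398531584
used_bytes = 9036165000

MB = 1024 * 1024  # jmap reports megabytes as 2^20 bytes

percent_used = 100.0 * used_bytes / capacity_bytes
free_mb = (capacity_bytes - used_bytes) / MB

print(f"{percent_used:.2f}% used, {free_mb:.1f}MB free")
```

The results agree with what jmap printed, so the old generation really does have only about 1.3GB of headroom left.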

At this point I feel somewhat lost as to how to debug the problem. I'm not
sure what to do next to figure out what is going on.  Any suggestions as to
what to look for or debug where the memory is being used? I can generate
heap dumps via jmap (although it effectively kills the region server) but I
don't really know what to look for to see where the memory is going. I also
have jmx setup on each region server and can connect to it that way.
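
One lighter-weight option than a full heap dump may be the per-class histogram from `jmap -histo <pid>`, taken twice some days apart and compared (it still pauses the JVM while it runs). Below is a rough Python sketch of that comparison. The two snapshots are made-up numbers for illustration only, not output from our servers, and parse_histo and biggest_growth are hypothetical helper names.

```python
import re

# One data row of a `jmap -histo` listing: rank, #instances, #bytes, class name.
ROW = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+)")

def parse_histo(text):
    """Map class name -> bytes for each data row of a histogram listing."""
    sizes = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            sizes[match.group(3)] = int(match.group(2))
    return sizes

def biggest_growth(before, after, n=2):
    """Return the n classes whose byte totals grew the most between snapshots."""
    deltas = {name: size - before.get(name, 0) for name, size in after.items()}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:n]

# Made-up histogram snapshots, for illustration only.
snapshot_day1 = """\
 num     #instances         #bytes  class name
----------------------------------------------
   1:         50000      400000000  [B
   2:        200000        9600000  org.apache.hadoop.hbase.KeyValue
"""
snapshot_day3 = """\
 num     #instances         #bytes  class name
----------------------------------------------
   1:         90000      900000000  [B
   2:        210000       10080000  org.apache.hadoop.hbase.KeyValue
"""

growth = biggest_growth(parse_histo(snapshot_day1), parse_histo(snapshot_day3))
print(growth)
```

The classes at the top of the growth list are the ones to chase in a full heap dump, since steady growth between snapshots is a better hint of a leak than a large absolute size.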


Jeff Whiting
Qualtrics Senior Software Engineer
