Also, can any of the other call queues just fill up forever and cause an OOME as well? I don't see any code that limits the queue size based on the amount of memory the queued calls are using, so it seems like any of them (priorityCallQueue, replicationQueue, or callQueue, which all live in HBaseServer) could suffer from the same problem I'm seeing with replicationQueue. Some thought may need to be put into how those queues are handled, how big we allow them to get, and when to block or drop the call rather than run out of memory. It does look like the queue has a maximum number of items, as defined by ipc.server.max.callqueue.size, but I don't see anything that bounds the bytes those items hold.
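
To make what I have in mind concrete, here's a rough sketch (this is not the actual HBaseServer code; the class and the names like maxQueuedBytes and approxBytes are made up) of a call queue bounded by an approximate byte budget as well as an item count, so it refuses new calls instead of growing until the heap is gone:

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not HBase code: a call queue bounded by an approximate
// byte budget as well as a maximum item count, so it pushes back instead of
// queueing until the heap is exhausted.
class BoundedCallQueue {

  // Wrap each call with its size estimate so we can decrement on take().
  static class SizedCall {
    final Object call;
    final long approxBytes;
    SizedCall(Object call, long approxBytes) {
      this.call = call;
      this.approxBytes = approxBytes;
    }
  }

  private final LinkedBlockingQueue<SizedCall> queue;
  private final AtomicLong queuedBytes = new AtomicLong(0);
  private final long maxQueuedBytes;

  BoundedCallQueue(int maxItems, long maxQueuedBytes) {
    this.queue = new LinkedBlockingQueue<SizedCall>(maxItems); // item cap, like ipc.server.max.callqueue.size
    this.maxQueuedBytes = maxQueuedBytes;                      // byte cap, which I don't see anywhere today
  }

  /** Refuse the call (so the caller can block, retry, or drop it) rather than OOM. */
  boolean offer(Object call, long approxBytes) {
    if (queuedBytes.get() + approxBytes > maxQueuedBytes) {
      return false; // over the memory budget
    }
    boolean accepted = queue.offer(new SizedCall(call, approxBytes)); // false if the item cap is hit
    if (accepted) {
      queuedBytes.addAndGet(approxBytes);
    }
    return accepted;
  }

  Object take() throws InterruptedException {
    SizedCall sized = queue.take();
    queuedBytes.addAndGet(-sized.approxBytes);
    return sized.call;
  }
}

Whether a refused call should block, retry, or just be dropped probably wants to differ per queue (a dropped replication call should get retried by the source cluster, while a dropped client call surfaces as an error), but at least the failure would be bounded instead of an OOME.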

~Jeff

On 11/1/2012 5:44 PM, Jeff Whiting wrote:
So this is some of what I'm seeing as I go through the profiles:

(a) 2GB -- org.apache.hadoop.hbase.io.hfile.LruBlockCache
    This looks like the block cache, and we aren't having any problems with that...

(b) 1.4GB -- org.apache.hadoop.hbase.regionserver.HRegionServer -- java.util.concurrent.ConcurrentHashMap$Segment[]
    It looks like this belongs to the member variable "onlineRegions", which has a member variable "segments".
    I'm guessing these are the memstores that HBase is currently holding onto.

(c) 4.3GB -- java.util.concurrent.LinkedBlockingQueue -- org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server
    This is the one that keeps growing without shrinking and is causing us to run out of memory. However, the cause isn't as immediately clear in MAT as it was for the other two. These seem to be the references to the LinkedBlockingQueue (you'll need a wide monitor to read it well):
Class Name                                                | Shallow Heap | Retained Heap
---------------------------------------------------------------------------------------------------------------------------------------------------------
java.util.concurrent.LinkedBlockingQueue @ 0x2aaab3583d30 |           80 | 4,616,431,568
|- myCallQueue org.apache.hadoop.hbase.ipc.HBaseServer$Handler @ 0x2aaab50d3c70 REPL IPC Server handler 2 on 60020 Thread | 192 | 384,392
|- myCallQueue org.apache.hadoop.hbase.ipc.HBaseServer$Handler @ 0x2aaab50d3d30 REPL IPC Server handler 1 on 60020 Thread | 192 | 384,392
|- myCallQueue org.apache.hadoop.hbase.ipc.HBaseServer$Handler @ 0x2aaab50d3df0 REPL IPC Server handler 0 on 60020 Thread | 192 | 205,976
|- replicationQueue org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server @ 0x2aaab392dbe0 | 240 | 3,968
---------------------------------------------------------------------------------------------------------------------------------------------------------

So it looks like the same queue is shared between myCallQueue (in the REPL handler threads) and replicationQueue. JProfiler is showing the same thing. I'm having a hard time figuring out much more than that.
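
In case it helps to picture the ownership MAT is reporting, here is my mental model of it (heavily simplified and partly guessed; only the field names myCallQueue and replicationQueue come from the dump): one unbounded queue owned by the server, with each REPL handler thread holding a reference to that same queue, so the 4.3GB of retained heap is really the queued call objects themselves.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified model of what I think MAT is showing: one unbounded queue owned
// by the server, and each "REPL IPC Server handler N" thread referencing it.
// Class and method names here are illustrative, not the real HBase code.
class ServerSketch {
  // The unbounded queue (replicationQueue in the dump) that retains the 4.3GB.
  final BlockingQueue<Object> replicationQueue = new LinkedBlockingQueue<Object>();

  class Handler extends Thread {
    // Same object as replicationQueue above, just another incoming reference,
    // which is why MAT lists one queue with several referrers.
    final BlockingQueue<Object> myCallQueue = replicationQueue;

    @Override
    public void run() {
      try {
        while (!isInterrupted()) {
          Object call = myCallQueue.take();
          // process(call)... if the handlers fall behind the inbound
          // replication traffic, the queue just keeps growing.
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }
}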

(d) 977MB -- "Other" (no common root)
    This just seems to be other stuff going on in the region server, but I'm not really concerned about it, as I don't think it is the culprit.


Overall it looks like this has to do with replication. This cluster is in the middle of a replication chain A -> B -> C, where this cluster is B. Can we tell whether it is running out of memory because it is being replicated to, or because it is trying to replicate somewhere else?

Thanks,
~Jeff

On 10/30/2012 11:39 PM, Stack wrote:
On Mon, Oct 29, 2012 at 3:55 PM, Jeff Whiting <je...@qualtrics.com> wrote:
However, what we are seeing is that our memory usage goes up slowly until the
region server starts sputtering due to GC issues, and it eventually gets
timed out by ZooKeeper and killed.

Hey Jeff.  You have GC logging enabled?  It might not tell you more than
you already know, i.e. that something is retaining more and more objects
over time.  You have a dumped heap?  What have you used to poke at
it?  You generally want to find the objects with the deepest retained size
(not all profilers let you do this, though).  That is usually enough to
give you a clue.

Anything particular about the character of your load?  Ram asks if any
big cells in the mix?

St.Ack



At this point I feel somewhat lost as to how to debug the problem. I'm not
sure what to do next to figure out what is going on. Any suggestions as to
what to look for, or how to debug where the memory is being used? I can generate
heap dumps via jmap (although it effectively kills the region server), but I
don't really know what to look for to see where the memory is going. I also
have JMX set up on each region server and can connect to it that way.
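
(For what it's worth, since JMX is already open on every region server, this is roughly what I have in mind for triggering a dump remotely instead of shelling in for jmap; the host, port, and output path below are placeholders for our setup, and it still pauses the JVM while the dump is written:)

import javax.management.JMX;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import com.sun.management.HotSpotDiagnosticMXBean;

// Trigger a heap dump on a remote region server over its JMX port, as an
// alternative to running jmap on the box. Host, port, and file path are
// placeholders; the dump file is written on the remote host's disk.
public class RemoteHeapDump {
  public static void main(String[] args) throws Exception {
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://regionserver-host:10102/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url);
    try {
      MBeanServerConnection conn = connector.getMBeanServerConnection();
      HotSpotDiagnosticMXBean diag = JMX.newMXBeanProxy(conn,
          new ObjectName("com.sun.management:type=HotSpotDiagnostic"),
          HotSpotDiagnosticMXBean.class);
      // live=true dumps only reachable objects, which keeps the file smaller.
      diag.dumpHeap("/tmp/regionserver-heap.hprof", true);
    } finally {
      connector.close();
    }
  }
}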

Thanks,
~Jeff

--
Jeff Whiting
Qualtrics Senior Software Engineer
je...@qualtrics.com



--
Jeff Whiting
Qualtrics Senior Software Engineer
je...@qualtrics.com
