[ https://issues.apache.org/jira/browse/HBASE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929831#action_12929831 ]

Jean-Daniel Cryans commented on HBASE-3206:
-------------------------------------------

Things we could also do when under pressure:

 - jettison the block cache
 - switch new incoming calls that request block caching over to non-caching
 - force flush big memstores
 - close some handlers (I wonder if it's even possible)
 - force lower the scanner caching of incoming next() invocations
 - kill requests that are in flight (evil!), preferably the ones that have 
been running the longest or are consuming the most resources

I guess it's going to take a lot of careful crafting since doing any of these 
could make things even worse.
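
A rough sketch of what acting on the first three items above could look like. 
This is hypothetical: the BlockCache and MemStoreFlusher interfaces below are 
stand-ins for whatever the region server actually exposes, not real HBase APIs.

{noformat}
// Hypothetical sketch only: BlockCache and MemStoreFlusher are stand-in
// interfaces, not the real HBase ones.
public class PressureResponder {

  interface BlockCache { void clear(); }             // stand-in: drop all cached blocks
  interface MemStoreFlusher { void flushLargest(); } // stand-in: flush the biggest memstores

  private final BlockCache blockCache;
  private final MemStoreFlusher flusher;
  private volatile boolean cacheBlocksOnRead = true; // consulted by new incoming reads

  public PressureResponder(BlockCache blockCache, MemStoreFlusher flusher) {
    this.blockCache = blockCache;
    this.flusher = flusher;
  }

  /** Called when we decide the region server is in a slow-GC loop. */
  public void onPressure() {
    blockCache.clear();          // jettison the block cache
    cacheBlocksOnRead = false;   // calls that ask to cache blocks become non-caching
    flusher.flushLargest();      // force flush big memstores
    // closing handlers, capping scanner caching and killing in-flight
    // requests would plug in here if they turn out to be feasible
  }

  /** New read requests check this before caching blocks they load. */
  public boolean shouldCacheBlocks() {
    return cacheBlocksOnRead;
  }
}
{noformat}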

> Detect slow GC loops of death
> -----------------------------
>
>                 Key: HBASE-3206
>                 URL: https://issues.apache.org/jira/browse/HBASE-3206
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Jean-Daniel Cryans
>             Fix For: 0.92.0
>
>
> Something that has been bothering me for a while is figuring out when a 
> region server is slow because of frequent, small GC pauses. I usually go 
> into that RS's GC output, watch it for a while, and then decide whether 
> it's under some kind of memory pressure. Here's an example (grepped "Full" 
> from the GC log):
> {noformat}
> 12:03:42.460-0800: [Full GC [CMS2010-11-08T12:03:43.081-0800: 
> [CMS-concurrent-mark: 4.381/5.819 secs] [Times: user=60.51 sys=2.54, 
> real=5.82 secs] 
> 12:04:06.916-0800: [Full GC [CMS2010-11-08T12:04:07.316-0800: 
> [CMS-concurrent-mark: 4.006/5.080 secs] [Times: user=55.16 sys=2.13, 
> real=5.08 secs] 
> 12:04:32.559-0800: [Full GC [CMS2010-11-08T12:04:33.286-0800: 
> [CMS-concurrent-mark: 4.133/5.303 secs] [Times: user=53.61 sys=2.40, 
> real=5.30 secs] 
> 12:05:24.299-0800: [Full GC [CMS2010-11-08T12:05:25.397-0800: 
> [CMS-concurrent-sweep: 1.325/1.388 secs] [Times: user=4.66 sys=0.15, 
> real=1.38 secs] 
> 12:05:50.069-0800: [Full GC [CMS2010-11-08T12:05:50.240-0800: 
> [CMS-concurrent-mark: 4.831/6.346 secs] [Times: user=69.43 sys=2.76, 
> real=6.35 secs] 
> 12:06:16.146-0800: [Full GC [CMS2010-11-08T12:06:16.631-0800: 
> [CMS-concurrent-mark: 4.942/7.010 secs] [Times: user=69.25 sys=2.69, 
> real=7.01 secs] 
> 12:07:08.899-0800: [Full GC [CMS2010-11-08T12:07:10.033-0800: 
> [CMS-concurrent-sweep: 1.197/1.202 secs] [Times: user=1.96 sys=0.04, 
> real=1.20 secs] 
> 12:08:01.871-0800: [Full GC [CMS2010-11-08T12:08:01.949-0800: 
> [CMS-concurrent-mark: 4.154/5.443 secs] [Times: user=61.11 sys=2.29, 
> real=5.44 secs] 
> 12:08:53.343-0800: [Full GC [CMS2010-11-08T12:08:53.549-0800: 
> [CMS-concurrent-mark: 4.447/5.713 secs] [Times: user=65.19 sys=2.42, 
> real=5.72 secs] 
> 12:09:42.841-0800: [Full GC [CMS2010-11-08T12:09:43.664-0800: 
> [CMS-concurrent-mark: 4.025/5.053 secs] [Times: user=51.40 sys=2.02, 
> real=5.06 secs]
> {noformat}
> In this case, that RS's TaskTracker was down, so it was getting all of the 
> job's non-local map tasks at the same time at the end of the job, 
> generating >1000% CPU usage. With scanner caching set to 10k, it's easy to 
> understand that there's memory pressure, since all those objects in flight 
> aren't accounted for.
> One solution I was thinking of is a sleeper thread that sleeps for 1 second 
> in a loop and logs whenever it notices it slept for a bit more than 1 
> second. Then, say, the region server records that it saw a few of those 
> within x minutes and decides to somehow throttle the traffic.
> What I often saw is that if this situation goes unnoticed, we end up GCing 
> more and more, and in some cases I saw a region server go almost zombie for 
> 2 hours before finally having its lease expire.
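
A minimal sketch of the sleeper-thread idea from the description above, 
assuming a plain Java thread and a hypothetical 500 ms slack threshold 
before a pause gets reported.

{noformat}
// Sleeps ~1s in a loop and reports whenever the wall-clock gap is noticeably
// longer than requested, which usually means the JVM (or the box) was paused.
public class PauseDetector implements Runnable {

  private static final long SLEEP_MS = 1000;     // nominal sleep interval
  private static final long WARN_EXTRA_MS = 500; // hypothetical slack before reporting

  @Override
  public void run() {
    long last = System.currentTimeMillis();
    while (!Thread.currentThread().isInterrupted()) {
      try {
        Thread.sleep(SLEEP_MS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      long now = System.currentTimeMillis();
      long extra = now - last - SLEEP_MS;        // time lost beyond the requested sleep
      if (extra > WARN_EXTRA_MS) {
        System.err.println("Detected pause of ~" + extra + "ms, likely GC");
        // a region server could count how many of these it sees within a few
        // minutes and start throttling traffic once that count gets too high
      }
      last = now;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Thread t = new Thread(new PauseDetector(), "pause-detector");
    t.start();
    t.join();                                    // run until interrupted
  }
}
{noformat}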

