[ https://issues.apache.org/jira/browse/HBASE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929831#action_12929831 ]
Jean-Daniel Cryans commented on HBASE-3206:
-------------------------------------------

Things we could also do when under pressure:
- jettison the block cache
- switch new incoming calls that are requesting to cache the blocks to non-caching
- force flush big memstores
- close some handlers (I wonder if it's even possible)
- force lower the scanner caching of incoming next() invocations
- kill requests that are in flight (evil!), preferably the ones that are the longest running or taking the most resources

I guess it's going to take a lot of careful crafting since doing any of these could make things even worse.

> Detect slow GC loops of death
> -----------------------------
>
>                 Key: HBASE-3206
>                 URL: https://issues.apache.org/jira/browse/HBASE-3206
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Jean-Daniel Cryans
>             Fix For: 0.92.0
>
>
> Something that has been bothering me for a while was to understand when a
> region server was being slow because of frequent and small GC pauses. I
> usually go into that RS's GC output, watch it going for a while, then decide
> if it's under some kind of memory pressure.
> Here's an example (grepped "Full" from the GC log):
> {noformat}
> 12:03:42.460-0800: [Full GC [CMS2010-11-08T12:03:43.081-0800: [CMS-concurrent-mark: 4.381/5.819 secs] [Times: user=60.51 sys=2.54, real=5.82 secs]
> 12:04:06.916-0800: [Full GC [CMS2010-11-08T12:04:07.316-0800: [CMS-concurrent-mark: 4.006/5.080 secs] [Times: user=55.16 sys=2.13, real=5.08 secs]
> 12:04:32.559-0800: [Full GC [CMS2010-11-08T12:04:33.286-0800: [CMS-concurrent-mark: 4.133/5.303 secs] [Times: user=53.61 sys=2.40, real=5.30 secs]
> 12:05:24.299-0800: [Full GC [CMS2010-11-08T12:05:25.397-0800: [CMS-concurrent-sweep: 1.325/1.388 secs] [Times: user=4.66 sys=0.15, real=1.38 secs]
> 12:05:50.069-0800: [Full GC [CMS2010-11-08T12:05:50.240-0800: [CMS-concurrent-mark: 4.831/6.346 secs] [Times: user=69.43 sys=2.76, real=6.35 secs]
> 12:06:16.146-0800: [Full GC [CMS2010-11-08T12:06:16.631-0800: [CMS-concurrent-mark: 4.942/7.010 secs] [Times: user=69.25 sys=2.69, real=7.01 secs]
> 12:07:08.899-0800: [Full GC [CMS2010-11-08T12:07:10.033-0800: [CMS-concurrent-sweep: 1.197/1.202 secs] [Times: user=1.96 sys=0.04, real=1.20 secs]
> 12:08:01.871-0800: [Full GC [CMS2010-11-08T12:08:01.949-0800: [CMS-concurrent-mark: 4.154/5.443 secs] [Times: user=61.11 sys=2.29, real=5.44 secs]
> 12:08:53.343-0800: [Full GC [CMS2010-11-08T12:08:53.549-0800: [CMS-concurrent-mark: 4.447/5.713 secs] [Times: user=65.19 sys=2.42, real=5.72 secs]
> 12:09:42.841-0800: [Full GC [CMS2010-11-08T12:09:43.664-0800: [CMS-concurrent-mark: 4.025/5.053 secs] [Times: user=51.40 sys=2.02, real=5.06 secs]
> {noformat}
> In this case, that RS's TT was down so it was getting all the non-local maps
> at the end of the job at the same time... generating >1000% CPU usage. With
> scanner caching set to 10k, it's easy to understand that there's memory
> pressure since we have all those objects in flight that we don't account for.
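As an aside, the wall-clock cost of the pauses in a log like the one above can be totaled with a small parser keyed on the `real=... secs` field. This is only an illustrative sketch; the class and method names are made up and this is not HBase code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper (not from HBase): sums the wall-clock ("real")
// seconds reported on CMS GC log lines like the ones quoted above.
public class GcRealTimeSum {
    private static final Pattern REAL = Pattern.compile("real=([0-9.]+) secs");

    // Returns the total of all "real=X.XX secs" values found in the lines.
    public static double sumRealSeconds(String[] lines) {
        double total = 0.0;
        for (String line : lines) {
            Matcher m = REAL.matcher(line);
            if (m.find()) {
                total += Double.parseDouble(m.group(1));
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String[] sample = {
            "[CMS-concurrent-mark: 4.381/5.819 secs] [Times: user=60.51 sys=2.54, real=5.82 secs]",
            "[CMS-concurrent-mark: 4.006/5.080 secs] [Times: user=55.16 sys=2.13, real=5.08 secs]"
        };
        System.out.println("total real GC seconds: " + sumRealSeconds(sample));
    }
}
```

Run against the ten lines above, this makes the cumulative stall time over the six-minute window concrete, which is exactly the kind of signal a detector could act on.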
> One solution I was thinking of was to have a sleeper thread that sleeps for 1
> sec all the time and outputs when it sees that it slept for a bit more than 1
> sec. Then let's say the region server records that it saw a few of those
> within x minutes and decides to somehow throttle the traffic.
>
> What I often saw is that if this situation goes unnoticed, we end up GCing
> more and more, and in some cases I saw a region server go almost zombie for
> 2 hours before finally getting its lease expired.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
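The sleeper-thread idea described in the comment could be sketched roughly as below. This is not the HBase implementation, just an illustration of the technique; the class name, thresholds, and window handling are all invented for the example:

```java
// Sketch of the proposed pause detector: a daemon thread sleeps for a
// fixed interval and measures how long the sleep actually took. If the
// JVM was stalled (e.g. by GC), the measured time overshoots the target.
// All names and thresholds here are illustrative, not HBase code.
public class JvmPauseSleeper implements Runnable {
    private static final long TARGET_SLEEP_MS = 1000;   // intended 1s sleep
    private static final long PAUSE_THRESHOLD_MS = 200; // overshoot counted as a pause

    private int pausesObserved = 0;

    // How much longer than targetMs the interval [startNanos, endNanos] took.
    static long overshootMs(long startNanos, long endNanos, long targetMs) {
        return (endNanos - startNanos) / 1_000_000L - targetMs;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                Thread.sleep(TARGET_SLEEP_MS);
            } catch (InterruptedException ie) {
                return; // shutting down
            }
            long overshoot = overshootMs(start, System.nanoTime(), TARGET_SLEEP_MS);
            if (overshoot > PAUSE_THRESHOLD_MS) {
                // The JVM was stalled for roughly `overshoot` ms.
                pausesObserved++;
                System.out.println("Slept " + overshoot + " ms too long; "
                        + pausesObserved + " pauses observed so far");
                // A real server would count pauses inside a sliding time
                // window and start throttling traffic past some limit.
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(new JvmPauseSleeper(), "pause-detector");
        t.setDaemon(true);
        t.start();
        Thread.sleep(3000); // let the detector run briefly
    }
}
```

The appeal of this approach is that it needs no GC-log parsing at all: any stall long enough to matter shows up directly as sleep overshoot inside the process itself.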