On Tue, Dec 6, 2011 at 5:10 PM, Derek Wollenstein <de...@klout.com> wrote:
> -XX:NewSize=192m -XX:MaxNewSize=192m

What if you tried a bigger newsize? Maybe the YGCs would run longer but they'd happen less often?
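Something like the below is all I mean (just a sketch -- I'm assuming you set your GC flags via HBASE_OPTS in conf/hbase-env.sh, and 512m is only an arbitrary value to experiment with; adjust to however you actually pass JVM args):

  export HBASE_OPTS="$HBASE_OPTS -XX:NewSize=512m -XX:MaxNewSize=512m"

A larger new gen generally means fewer (if slightly longer) ParNew pauses, and less gets promoted prematurely into the old gen.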
It's only the YGC pauses that are giving you issue?

> We've also set hfile.block.cache.size to 0.5 (believing that increasing the
> space for cache should improve performance; I'm willing to accept that
> this could be causing problems, I just haven't seen this reported)

This should be fine. If you are doing mostly reads, or trying to make reads come out of cache, upping the cache makes sense.

> This seems to continue for about 16 hours

Is it for sure duration and not a change in the loading? If you did a rolling restart at that point -- it'd mess up latencies for a short while, and the server is brought back online with the same regions (see the graceful_stop.sh script described in http://hbase.apache.org/book.html) -- does GC'ing show the same profile afterward?

> But as we hit some critical time, we end up stuck running multiple young
> GCs/second. Note that they appear to be successful (if I'm reading the
> logs correctly)
>
> 2011-12-06T23:49:42.132+0000: 49409.770: [GC 49409.771: [ParNew: 166477K->7992K(176960K), 0.0390800 secs] 11642960K->11486691K(19297180K), 0.0392470 secs] [Times: user=0.22 sys=0.00, real=0.04 secs]
> 2011-12-06T23:49:42.522+0000: 49410.161: [CMS-concurrent-abortable-preclean: 0.247/0.589 secs] [Times: user=1.04 sys=0.02, real=0.58 secs]
> 2011-12-06T23:49:42.523+0000: 49410.162: [GC[YG occupancy: 149306 K (176960 K)]49410.162: [Rescan (parallel) , 0.0314250 secs]49410.193: [weak refs processing, 0.0000890 secs] [1 CMS-remark: 11478698K(19120220K)] 11628005K(19297180K), 0.0316410 secs] [Times: user=0.17 sys=0.01, real=0.03 secs]
> 2011-12-06T23:49:42.555+0000: 49410.194: [CMS-concurrent-sweep-start]
> 2011-12-06T23:49:42.597+0000: 49410.236: [GC 49410.236: [ParNew: 165304K->7868K(176960K), 0.0405890 secs] 11643677K->11487303K(19297180K), 0.0407690 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
> 2011-12-06T23:49:43.048+0000: 49410.687: [GC 49410.687: [ParNew: 165180K->6485K(176960K), 0.0389860 secs] 11027946K->10869726K(19297180K), 0.0392000 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
> 2011-12-06T23:49:43.181+0000: 49410.819: [CMS-concurrent-sweep: 0.542/0.625 secs] [Times: user=1.73 sys=0.02, real=0.62 secs]
> 2011-12-06T23:49:43.181+0000: 49410.819: [CMS-concurrent-reset-start]
> 2011-12-06T23:49:43.232+0000: 49410.870: [CMS-concurrent-reset: 0.051/0.051 secs] [Times: user=0.10 sys=0.00, real=0.05 secs]
> 2011-12-06T23:49:43.481+0000: 49411.120: [GC 49411.120: [ParNew: 163797K->7150K(176960K), 0.0409170 secs] 10380339K->10224698K(19297180K), 0.0410870 secs] [Times: user=0.24 sys=0.00, real=0.04 secs]
> 2011-12-06T23:49:43.920+0000: 49411.559: [GC 49411.559: [ParNew: 164462K->8178K(176960K), 0.0394640 secs] 10382010K->10226321K(19297180K), 0.0396290 secs] [Times: user=0.22 sys=0.00, real=0.04 secs]
> 2011-12-06T23:49:44.355+0000: 49411.994: [GC 49411.994: [ParNew: 165490K->8303K(176960K), 0.0367330 secs] 10383633K->10227244K(19297180K), 0.0368970 secs] [Times: user=0.22 sys=0.00, real=0.03 secs]
> 2011-12-06T23:49:44.785+0000: 49412.424: [GC 49412.424: [ParNew: 165615K->10439K(176960K), 0.0398080 secs] 10384556K->10229598K(19297180K), 0.0399870 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
> 2011-12-06T23:49:45.225+0000: 49412.864: [GC 49412.864: [ParNew: 167751K->13171K(176960K), 0.0393970 secs] 10386910K->10233071K(19297180K), 0.0395730 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
> ...

Above is hard to read all wrapped in mail. Put up more of the log in a pastebin?
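In the meantime, here is a rough way to eyeball the log you already have (a sketch only, untested -- it assumes the format shown above and a log file named gc.log; use -r instead of -E on older GNU sed):

  grep ParNew gc.log \
    | sed -nE 's/^([^ ]+): .* ([0-9]+)K->([0-9]+)K\(([0-9]+)K\), [0-9.]+ secs\] \[Times.*/\1 \3/p'

That prints the wall-clock timestamp and the total heap occupancy (KB) left after each ParNew. If that number climbs steadily between CMS cycles, the old gen really is filling up and that is the thing to chase.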
> Note that we are doing about 2 YGCs/second (about 80ms each). I believe that this
> situation represents us using up some undesirable amount of memory
> resources, but I can't find any particular reason why we're going from
> using a reasonable amount of GC to using GC constantly. One theory is
> that this is the result of something in the young generation being
> promoted? But I'm not really sure what would be causing
> this pathological behavior. Any ideas or alternative places to look would
> be greatly appreciated.

Do all latencies go up, or just the 95th/99th percentile?

St.Ack
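P.S. If you do try the rolling restart, the usual recipe is roughly the below (a sketch -- double-check the flags against the usage printed by graceful_stop.sh on your version first):

  for rs in $(cat conf/regionservers); do
    ./bin/graceful_stop.sh --restart --reload --debug $rs
  done

It moves a regionserver's regions off, stops and restarts it, then reloads the same regions back on, so the region layout stays the same while the JVM gets a fresh start.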