On Tue, Dec 6, 2011 at 5:10 PM, Derek Wollenstein <de...@klout.com> wrote:
> -XX:NewSize=192m -XX:MaxNewSize=192m

What if you tried a bigger newsize?  Maybe the YGCs would run longer
but they'd happen less often?
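For example, something like the below -- 512m is just a guess to start
experimenting from, not a recommendation:

  -XX:NewSize=512m -XX:MaxNewSize=512m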

Is it only the YGC pauses that are giving you issues?


> We've also set hfile.block.cache.size to 0.5 (believing that increasing the
> space for cache should improve performance; I'm willing to accept that
> this could be causing problems, I just haven't seen this reported)
>

This should be fine. If you are doing mostly reads or trying to make
reads come out of cache, upping cache makes sense.
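For reference, the setting lives in hbase-site.xml; a minimal snippet
with the 0.5 you mention would look like:

  <property>
    <name>hfile.block.cache.size</name>
    <value>0.5</value>
  </property>

Just make sure block cache plus memstore together still leave the rest
of the heap some headroom.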


> This seems to continue for about 16 hours

Is it for sure the duration and not a change in the loading?

If you did a rolling restart at this time -- it'd mess up latencies
for a short time, but the server comes back online w/ the same
regions (see the graceful_stop.sh script in the
http://hbase.apache.org/book.html) -- does the GC'ing have the same profile?
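Usage is something along these lines (check the script's help for the
full option set; HOSTNAME is the regionserver to bounce):

  $ ./bin/graceful_stop.sh --restart --reload --debug HOSTNAME

--restart brings the server back up after the stop and --reload moves
its regions back onto it.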


> But as we hit some critical time, we end up stuck running multiple young
> GCs/second.  Note that they all appear to be successful (if I'm reading the
> logs correctly)
> 2011-12-06T23:49:42.132+0000: 49409.770: [GC 49409.771: [ParNew: 166477K->7992K(176960K), 0.0390800 secs] 11642960K->11486691K(19297180K), 0.0392470 secs] [Times: user=0.22 sys=0.00, real=0.04 secs]
> 2011-12-06T23:49:42.522+0000: 49410.161: [CMS-concurrent-abortable-preclean: 0.247/0.589 secs] [Times: user=1.04 sys=0.02, real=0.58 secs]
> 2011-12-06T23:49:42.523+0000: 49410.162: [GC[YG occupancy: 149306 K (176960 K)]49410.162: [Rescan (parallel) , 0.0314250 secs]49410.193: [weak refs processing, 0.0000890 secs] [1 CMS-remark: 11478698K(19120220K)] 11628005K(19297180K), 0.0316410 secs] [Times: user=0.17 sys=0.01, real=0.03 secs]
> 2011-12-06T23:49:42.555+0000: 49410.194: [CMS-concurrent-sweep-start]
> 2011-12-06T23:49:42.597+0000: 49410.236: [GC 49410.236: [ParNew: 165304K->7868K(176960K), 0.0405890 secs] 11643677K->11487303K(19297180K), 0.0407690 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
> 2011-12-06T23:49:43.048+0000: 49410.687: [GC 49410.687: [ParNew: 165180K->6485K(176960K), 0.0389860 secs] 11027946K->10869726K(19297180K), 0.0392000 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
> 2011-12-06T23:49:43.181+0000: 49410.819: [CMS-concurrent-sweep: 0.542/0.625 secs] [Times: user=1.73 sys=0.02, real=0.62 secs]
> 2011-12-06T23:49:43.181+0000: 49410.819: [CMS-concurrent-reset-start]
> 2011-12-06T23:49:43.232+0000: 49410.870: [CMS-concurrent-reset: 0.051/0.051 secs] [Times: user=0.10 sys=0.00, real=0.05 secs]
> 2011-12-06T23:49:43.481+0000: 49411.120: [GC 49411.120: [ParNew: 163797K->7150K(176960K), 0.0409170 secs] 10380339K->10224698K(19297180K), 0.0410870 secs] [Times: user=0.24 sys=0.00, real=0.04 secs]
> 2011-12-06T23:49:43.920+0000: 49411.559: [GC 49411.559: [ParNew: 164462K->8178K(176960K), 0.0394640 secs] 10382010K->10226321K(19297180K), 0.0396290 secs] [Times: user=0.22 sys=0.00, real=0.04 secs]
> 2011-12-06T23:49:44.355+0000: 49411.994: [GC 49411.994: [ParNew: 165490K->8303K(176960K), 0.0367330 secs] 10383633K->10227244K(19297180K), 0.0368970 secs] [Times: user=0.22 sys=0.00, real=0.03 secs]
> 2011-12-06T23:49:44.785+0000: 49412.424: [GC 49412.424: [ParNew: 165615K->10439K(176960K), 0.0398080 secs] 10384556K->10229598K(19297180K), 0.0399870 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
> 2011-12-06T23:49:45.225+0000: 49412.864: [GC 49412.864: [ParNew: 167751K->13171K(176960K), 0.0393970 secs] 10386910K->10233071K(19297180K), 0.0395730 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
> ...

GC logs are hard to read wrapped up in mail.  Can you put up more of
the log in a pastebin?
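FWIW, to read one of those ParNew lines: 166477K->7992K(176960K) is
young gen occupancy before->after the collection (young gen capacity
in parens), and 11642960K->11486691K(19297180K) is the same for the
whole heap.  So yes, the young collections themselves look healthy --
each reclaims most of the new space in ~40ms.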


> Note that we are doing about 2 YGCs/second (about 80ms/second of pause
> time total).  I believe that this situation represents us using up some
> undesirable amount of memory resources, but I can't find any particular
> reason why we're going from using a reasonable amount of GC to using GC
> constantly.  One theory is that this is the result of something in the
> young generation being promoted?  But I'm not really sure what would be
> causing this pathological behavior.  Any ideas or alternative places to
> look would be greatly appreciated.
>


Do all latencies go up, or just the 95th/99th percentiles?
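On the promotion theory, you could add something along the lines of
the below to your GC flags to watch how objects age through the
survivor spaces (standard HotSpot flags, nothing HBase-specific):

  -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution

If the tenuring output shows objects routinely surviving to the max
age and getting promoted, that'd back up the promotion theory.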

St.Ack
