java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)


On Tue, Jan 11, 2011 at 2:57 PM, Wayne <wav...@gmail.com> wrote:

> What is shared across all nodes that could stop everything? Originally I
> suspected the node with the .META. table and GC pauses but could never find
> the smoking gun. What JVM are you running? We use u23 and I am starting to
> consider rolling back to u16 per the previous post.
>
> On Tue, Jan 11, 2011 at 5:51 PM, Ted Dunning <tdunn...@maprtech.com>
> wrote:
>
> > I am running YCSB to load a modest amount of data (2 M rows) and see
> > something similar.
> >
> > For example:
> >
> > 10 sec: 135327 operations; 13505.69 current ops/sec; [INSERT AverageLatency(ms)=0.84]
> > 20 sec: 204440 operations; 6883.08 current ops/sec; [INSERT AverageLatency(ms)=3.51]
> > 30 sec: 250166 operations; 4572.14 current ops/sec; [INSERT AverageLatency(ms)=0.25]
> > 40 sec: 348603 operations; 9843.7 current ops/sec; [INSERT AverageLatency(ms)=3.02]
> > * 50 sec: 348603 operations; 0 current ops/sec; *
> > 60 sec: 447040 operations; 9821.11 current ops/sec; [INSERT AverageLatency(ms)=3.59]
> > 70 sec: 504461 operations; 5741.53 current ops/sec; [INSERT AverageLatency(ms)=1.88]
> > 80 sec: 598797 operations; 9433.6 current ops/sec; [INSERT AverageLatency(ms)=2.53]
> > 90 sec: 676724 operations; 7791.92 current ops/sec; [INSERT AverageLatency(ms)=1.49]
> > 100 sec: 758754 operations; 8202.18 current ops/sec; [INSERT AverageLatency(ms)=2.71]
> > 110 sec: 779263 operations; 2050.9 current ops/sec; [INSERT AverageLatency(ms)=2.26]
> > 120 sec: 890004 operations; 11060.83 current ops/sec; [INSERT AverageLatency(ms)=2.48]
> > 130 sec: 996641 operations; 10662.63 current ops/sec; [INSERT AverageLatency(ms)=1.56]
> > * 140 sec: 996641 operations; 0 current ops/sec; *
> > 150 sec: 1111600 operations; 11494.75 current ops/sec; [INSERT AverageLatency(ms)=3.81]
> > 160 sec: 1175494 operations; 6371.56 current ops/sec; [INSERT AverageLatency(ms)=3.08]
> > 170 sec: 1209919 operations; 3442.16 current ops/sec; [INSERT AverageLatency(ms)=0.04]
> > * 180 sec: 1209919 operations; 0 current ops/sec; *
> > 190 sec: 1265127 operations; 5515.84 current ops/sec; [INSERT AverageLatency(ms)=10.25]
> > 200 sec: 1382184 operations; 11705.7 current ops/sec; [INSERT AverageLatency(ms)=0.43]
> > 210 sec: 1431400 operations; 4921.11 current ops/sec; [INSERT AverageLatency(ms)=2.39]
> > * 220 sec: 1431400 operations; 0 current ops/sec; *
> > 230 sec: 1490315 operations; 5869.2 current ops/sec; [INSERT AverageLatency(ms)=9.58]
> > 240 sec: 1513430 operations; 2311.27 current ops/sec; [INSERT AverageLatency(ms)=0.29]
> > 250 sec: 1599562 operations; 8613.2 current ops/sec; [INSERT AverageLatency(ms)=3.86]
> > 260 sec: 1725739 operations; 12616.44 current ops/sec; [INSERT AverageLatency(ms)=1.35]
> > 270 sec: 1823203 operations; 9746.4 current ops/sec; [INSERT AverageLatency(ms)=1.39]
> > 280 sec: 1911486 operations; 8827.42 current ops/sec; [INSERT AverageLatency(ms)=1.25]
> > 290 sec: 2000000 operations; 8850.51 current ops/sec; [INSERT AverageLatency(ms)=1.41]
> >
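> > For reference, the load above came from a stock YCSB client invocation roughly
> > like the following; the classpath, column family value, and thread count are
> > placeholders rather than the exact command, and -s is the flag that prints the
> > 10-second status lines:
> >
> > java -cp build/ycsb.jar:db/hbase/lib/*:/path/to/hbase/conf \
> >     com.yahoo.ycsb.Client -load -db com.yahoo.ycsb.db.HBaseClient \
> >     -P workloads/workloada -p columnfamily=family -p recordcount=2000000 \
> >     -threads 10 -s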
> >
> > This is with 0.89 running on 5 region servers on a non-HDFS, append-enabled
> > substrate.
> >
> > I am logging GCs on both the master and the region servers and I don't see a
> > correlation with either.  I am still verifying that, so it could be wrong, but
> > I am pretty sure not.  What I do see is lots of small collections that look
> > roughly like this:
> >
> > {Heap before GC invocations=181 (full 23):
> >  par new generation   total 19136K, used 17895K [0x00007f9aa40a0000,
> > 0x00007f9aa5560000, 0x00007f9aabda0000)
> >  eden space 17024K,  99% used [0x00007f9aa40a0000, 0x00007f9aa513b058,
> > 0x00007f9aa5140000)
> >  from space 2112K,  42% used [0x00007f9aa5350000, 0x00007f9aa542ef38,
> > 0x00007f9aa5560000)
> >  to   space 2112K,   0% used [0x00007f9aa5140000, 0x00007f9aa5140000,
> > 0x00007f9aa5350000)
> >  concurrent mark-sweep generation total 252700K, used 83769K
> > [0x00007f9aabda0000, 0x00007f9abb467000, 0x00007f9ae28a0000)
> >  concurrent-mark-sweep perm gen total 29120K, used 17377K
> > [0x00007f9ae28a0000, 0x00007f9ae4510000, 0x00007f9ae7ca0000)
> > 542.472: [GC 542.472: [ParNew
> > Desired survivor size 1081344 bytes, new threshold 4 (max 4)
> > - age   1:     274408 bytes,     274408 total
> > - age   2:      47352 bytes,     321760 total
> > - age   3:      10304 bytes,     332064 total
> > - age   4:      10472 bytes,     342536 total
> > : 17895K->510K(19136K), 0.0586100 secs] 101665K->84784K(271836K) icms_dc=1 , 0.0586740 secs] [Times: user=0.36 sys=0.00, real=0.06 secs]
> > Heap after GC invocations=182 (full 23):
> >  par new generation   total 19136K, used 510K [0x00007f9aa40a0000,
> > 0x00007f9aa5560000, 0x00007f9aabda0000)
> >  eden space 17024K,   0% used [0x00007f9aa40a0000, 0x00007f9aa40a0000,
> > 0x00007f9aa5140000)
> >  from space 2112K,  24% used [0x00007f9aa5140000, 0x00007f9aa51bfa98,
> > 0x00007f9aa5350000)
> >  to   space 2112K,   0% used [0x00007f9aa5350000, 0x00007f9aa5350000,
> > 0x00007f9aa5560000)
> >  concurrent mark-sweep generation total 252700K, used 84274K
> > [0x00007f9aabda0000, 0x00007f9abb467000, 0x00007f9ae28a0000)
> >  concurrent-mark-sweep perm gen total 29120K, used 17377K
> > [0x00007f9ae28a0000, 0x00007f9ae4510000, 0x00007f9ae7ca0000)
> > }
> > 542.730: [CMS-concurrent-mark: 0.060/2.344 secs] [Times: user=1.48
> > sys=0.05,
> > real=2.35 secs]
> > 542.730: [CMS-concurrent-preclean-start]
> > 542.744: [CMS-concurrent-preclean: 0.014/0.014 secs] [Times: user=0.01
> > sys=0.00, real=0.01 secs]
> > 542.744: [CMS-concurrent-abortable-preclean-start]
> >
> >
> > It is unusual to see objects survive 4 generations.  There are clearly full
> > collections going on, but not a lot of them, and the heap is small enough on
> > this system that they should be fast.
> >
> > I will be gathering more information.
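> >
> > A log with this shape (heap-before/after dumps, the tenuring-age table, the
> > icms_dc field) comes from CMS/ParNew options roughly like the following; the
> > values are illustrative rather than my exact flags:
> >
> > -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
> > -XX:MaxTenuringThreshold=4 -XX:SurvivorRatio=8 \
> > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
> > -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution
> >
> > MaxTenuringThreshold=4 is what produces the "new threshold 4 (max 4)" line and
> > lets objects age through four survivor generations before promotion.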
> >
> > On Tue, Jan 11, 2011 at 2:34 PM, Wayne <wav...@gmail.com> wrote:
> >
> > >  We have very frequent cluster-wide pauses that stop all reads and writes
> > > for seconds. We are constantly loading data to this cluster of 10 nodes.
> > > These pauses can happen as frequently as every minute but sometimes are not
> > > seen for 15+ minutes. Basically watching the region server list with request
> > > counts is the only evidence of what is going on. All reads and writes
> > > totally stop, and if there is ever any activity it is on the node hosting
> > > the .META. table with a request count of region count + 1. This problem
> > > seems to be worse with a larger region size. We tried a 1GB region size and
> > > saw this more than we saw actual activity (and stopped using a larger region
> > > size because of it). We went back to the default region size and it was
> > > better, but we had too many regions, so now we are up to 512M for a region
> > > size and we are seeing it more again.
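> > >
> > > The region size knob here is hbase.hregion.max.filesize in hbase-site.xml;
> > > for the 512M case it would be set to roughly the following (value in bytes,
> > > shown only as an illustration):
> > >
> > > <property>
> > >   <name>hbase.hregion.max.filesize</name>
> > >   <value>536870912</value>
> > > </property>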
> > >
> > > Does anyone know what this is? We have dug into all of the logs to find
> > > some sort of pause but are not able to find anything. Is this a WAL (HLog)
> > > roll? Is this a region split or compaction? Of course our biggest fear is a
> > > GC pause on the master, but we do not have GC logging turned on for the
> > > master to tell. What could possibly stop the entire cluster from working
> > > for seconds at a time so frequently?
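> > >
> > > Presumably we could turn GC logging on for the master in hbase-env.sh with
> > > something along these lines (HBASE_OPTS applies to every HBase daemon started
> > > from that file, and the log path is just an example), which would at least
> > > show whether a master GC pause lines up with the stalls:
> > >
> > > export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-gc-master.log"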
> > >
> > > Thanks in advance for any ideas of what could be causing this.
> > >
> >
>
