A split should not cause all nodes to halt. But if you start a job that has good distribution on an otherwise clean cluster, it can appear that way, because the splits happen roughly at the same time. I have seen that sometimes.

Friso
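If the load is going into a fresh table, one way to take the edge off those near-simultaneous first splits is to pre-split the table before starting the job, so the regions already exist and the first splits do not all fire at once. A rough sketch with the Java client follows; the table name, column family and split keys are made up for illustration, and it assumes the HBaseAdmin.createTable overload that takes split keys is present in the client version you run:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Hypothetical table name and column family; substitute your own.
    HTableDescriptor table = new HTableDescriptor("mytable");
    table.addFamily(new HColumnDescriptor("d"));

    // Illustrative split points; in practice pick keys that match the
    // real key distribution of the load.
    byte[][] splits = new byte[][] {
        Bytes.toBytes("1"),
        Bytes.toBytes("2"),
        Bytes.toBytes("3"),
        Bytes.toBytes("4")
    };

    // Creates the table with (splits.length + 1) regions up front, so the
    // loader writes into several regions immediately and the first wave of
    // splits does not land on every region server at roughly the same time.
    admin.createTable(table, splits);
  }
}
```

That does not remove splits entirely, but it avoids the first round of them hitting all region servers at the same moment.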
On 12 jan 2011, at 13:50, Wayne wrote:

> Do you see pauses across all nodes at the same time? Does a split pause *ALL* nodes, and if so, why?
>
> On Wed, Jan 12, 2011 at 4:35 AM, Friso van Vollenhoven <fvanvollenho...@xebia.com> wrote:
>
>> We have been on u23 for some time and are doing fine with HBase 0.89 and CDH3b2, also with write-heavy loads. We see RS activity drop around memstore flushes, compactions and especially splits.
>>
>> Friso
>>
>>
>> On 11 jan 2011, at 23:57, Wayne wrote:
>>
>>> What is shared across all nodes that could stop everything? Originally I suspected the node with the .META. table and GC pauses, but could never find the smoking gun. What JVM are you running? We use u23 and I am starting to consider rolling back to u16 per the previous post.
>>>
>>> On Tue, Jan 11, 2011 at 5:51 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
>>>
>>>> I am running YCSB to load a modest amount of data (2M rows) and see something similar.
>>>>
>>>> For example:
>>>>
>>>> 10 sec: 135327 operations; 13505.69 current ops/sec; [INSERT AverageLatency(ms)=0.84]
>>>> 20 sec: 204440 operations; 6883.08 current ops/sec; [INSERT AverageLatency(ms)=3.51]
>>>> 30 sec: 250166 operations; 4572.14 current ops/sec; [INSERT AverageLatency(ms)=0.25]
>>>> 40 sec: 348603 operations; 9843.7 current ops/sec; [INSERT AverageLatency(ms)=3.02]
>>>> *50 sec: 348603 operations; 0 current ops/sec;*
>>>> 60 sec: 447040 operations; 9821.11 current ops/sec; [INSERT AverageLatency(ms)=3.59]
>>>> 70 sec: 504461 operations; 5741.53 current ops/sec; [INSERT AverageLatency(ms)=1.88]
>>>> 80 sec: 598797 operations; 9433.6 current ops/sec; [INSERT AverageLatency(ms)=2.53]
>>>> 90 sec: 676724 operations; 7791.92 current ops/sec; [INSERT AverageLatency(ms)=1.49]
>>>> 100 sec: 758754 operations; 8202.18 current ops/sec; [INSERT AverageLatency(ms)=2.71]
>>>> 110 sec: 779263 operations; 2050.9 current ops/sec; [INSERT AverageLatency(ms)=2.26]
>>>> 120 sec: 890004 operations; 11060.83 current ops/sec; [INSERT AverageLatency(ms)=2.48]
>>>> 130 sec: 996641 operations; 10662.63 current ops/sec; [INSERT AverageLatency(ms)=1.56]
>>>> *140 sec: 996641 operations; 0 current ops/sec;*
>>>> 150 sec: 1111600 operations; 11494.75 current ops/sec; [INSERT AverageLatency(ms)=3.81]
>>>> 160 sec: 1175494 operations; 6371.56 current ops/sec; [INSERT AverageLatency(ms)=3.08]
>>>> 170 sec: 1209919 operations; 3442.16 current ops/sec; [INSERT AverageLatency(ms)=0.04]
>>>> *180 sec: 1209919 operations; 0 current ops/sec;*
>>>> 190 sec: 1265127 operations; 5515.84 current ops/sec; [INSERT AverageLatency(ms)=10.25]
>>>> 200 sec: 1382184 operations; 11705.7 current ops/sec; [INSERT AverageLatency(ms)=0.43]
>>>> 210 sec: 1431400 operations; 4921.11 current ops/sec; [INSERT AverageLatency(ms)=2.39]
>>>> *220 sec: 1431400 operations; 0 current ops/sec;*
>>>> 230 sec: 1490315 operations; 5869.2 current ops/sec; [INSERT AverageLatency(ms)=9.58]
>>>> 240 sec: 1513430 operations; 2311.27 current ops/sec; [INSERT AverageLatency(ms)=0.29]
>>>> 250 sec: 1599562 operations; 8613.2 current ops/sec; [INSERT AverageLatency(ms)=3.86]
>>>> 260 sec: 1725739 operations; 12616.44 current ops/sec; [INSERT AverageLatency(ms)=1.35]
>>>> 270 sec: 1823203 operations; 9746.4 current ops/sec; [INSERT AverageLatency(ms)=1.39]
>>>> 280 sec: 1911486 operations; 8827.42 current ops/sec; [INSERT AverageLatency(ms)=1.25]
>>>> 290 sec: 2000000 operations; 8850.51 current ops/sec; [INSERT AverageLatency(ms)=1.41]
>>>>
>>>> This is with 0.89 running on 5 region servers on a non-HDFS, append-enabled substrate.
>>>>
>>>> I am logging GCs on both the master and the region servers and I don't see a correlation with either. I am still verifying that, so it could be wrong, but I am pretty sure not. What I do see is lots of small compactions that look roughly like this:
>>>>
>>>> {Heap before GC invocations=181 (full 23):
>>>>  par new generation total 19136K, used 17895K [0x00007f9aa40a0000, 0x00007f9aa5560000, 0x00007f9aabda0000)
>>>>   eden space 17024K, 99% used [0x00007f9aa40a0000, 0x00007f9aa513b058, 0x00007f9aa5140000)
>>>>   from space 2112K, 42% used [0x00007f9aa5350000, 0x00007f9aa542ef38, 0x00007f9aa5560000)
>>>>   to space 2112K, 0% used [0x00007f9aa5140000, 0x00007f9aa5140000, 0x00007f9aa5350000)
>>>>  concurrent mark-sweep generation total 252700K, used 83769K [0x00007f9aabda0000, 0x00007f9abb467000, 0x00007f9ae28a0000)
>>>>  concurrent-mark-sweep perm gen total 29120K, used 17377K [0x00007f9ae28a0000, 0x00007f9ae4510000, 0x00007f9ae7ca0000)
>>>> 542.472: [GC 542.472: [ParNew
>>>> Desired survivor size 1081344 bytes, new threshold 4 (max 4)
>>>> - age 1: 274408 bytes, 274408 total
>>>> - age 2: 47352 bytes, 321760 total
>>>> - age 3: 10304 bytes, 332064 total
>>>> - age 4: 10472 bytes, 342536 total
>>>> : 17895K->510K(19136K), 0.0586100 secs] 101665K->84784K(271836K) icms_dc=1, 0.0586740 secs] [Times: user=0.36 sys=0.00, real=0.06 secs]
>>>> Heap after GC invocations=182 (full 23):
>>>>  par new generation total 19136K, used 510K [0x00007f9aa40a0000, 0x00007f9aa5560000, 0x00007f9aabda0000)
>>>>   eden space 17024K, 0% used [0x00007f9aa40a0000, 0x00007f9aa40a0000, 0x00007f9aa5140000)
>>>>   from space 2112K, 24% used [0x00007f9aa5140000, 0x00007f9aa51bfa98, 0x00007f9aa5350000)
>>>>   to space 2112K, 0% used [0x00007f9aa5350000, 0x00007f9aa5350000, 0x00007f9aa5560000)
>>>>  concurrent mark-sweep generation total 252700K, used 84274K [0x00007f9aabda0000, 0x00007f9abb467000, 0x00007f9ae28a0000)
>>>>  concurrent-mark-sweep perm gen total 29120K, used 17377K [0x00007f9ae28a0000, 0x00007f9ae4510000, 0x00007f9ae7ca0000)
>>>> }
>>>> 542.730: [CMS-concurrent-mark: 0.060/2.344 secs] [Times: user=1.48 sys=0.05, real=2.35 secs]
>>>> 542.730: [CMS-concurrent-preclean-start]
>>>> 542.744: [CMS-concurrent-preclean: 0.014/0.014 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
>>>> 542.744: [CMS-concurrent-abortable-preclean-start]
>>>>
>>>> It is unusual to see things last 4 generations. There are clearly full collections going on, but not a lot of them, and the heap is small enough on this system that it should be fast.
>>>>
>>>> I will be gathering more information.
>>>>
>>>> On Tue, Jan 11, 2011 at 2:34 PM, Wayne <wav...@gmail.com> wrote:
>>>>
>>>>> We have very frequent cluster-wide pauses that stop all reads and writes for seconds. We are constantly loading data to this cluster of 10 nodes. These pauses can happen as frequently as every minute, but sometimes are not seen for 15+ minutes. Basically, watching the region server list with request counts is the only evidence of what is going on. All reads and writes totally stop, and if there is ever any activity it is on the node hosting the .META. table, with a request count of region count + 1. This problem seems to be worse with a larger region size. We tried a 1GB region size and saw this more than we saw actual activity (and stopped using a larger region size because of it). We went back to the default region size and it was better, but we had too many regions, so now we are up to 512M for a region size and we are seeing it more again.
>>>>>
>>>>> Does anyone know what this is? We have dug into all of the logs to find some sort of pause but are not able to find anything. Is this a WAL (HLog) roll? Is this a region split or compaction? Of course our biggest fear is a GC pause on the master, but we do not have Java GC logging turned on for the master to tell. What could possibly stop the entire cluster from working for seconds at a time, very frequently?
>>>>>
>>>>> Thanks in advance for any ideas of what could be causing this.
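On the region size experiments: if it helps, the split threshold can also be set per table rather than cluster-wide, which makes it cheaper to try 512M or 1G on just the table being loaded. A rough sketch with the Java client follows; the table name and family are made up, and it assumes HTableDescriptor.setMaxFileSize is available in the client version in use:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class PerTableRegionSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Hypothetical table name and column family; substitute your own.
    HTableDescriptor table = new HTableDescriptor("mytable");
    table.addFamily(new HColumnDescriptor("d"));

    // Per-table region size: a region becomes a split candidate once a
    // store file grows past this threshold (512 MB here), overriding the
    // cluster-wide hbase.hregion.max.filesize default for this table only.
    table.setMaxFileSize(512L * 1024 * 1024);

    admin.createTable(table);
  }
}
```

For the master GC question, turning on GC logging for the master JVM in hbase-env.sh (for example adding -verbose:gc -XX:+PrintGCDetails -Xloggc:<path> to its JVM options) would at least confirm or rule out master pauses.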