You're welcome.  Glad you figured it out!  Very strange, and it totally
makes sense that this is a kernel/OS bug.

-ryan

On Tue, Aug 17, 2010 at 5:28 PM, George P. Stathis <[email protected]> wrote:
> OK, high load averages while idle is a confirmed Lucid issue:
> https://bugs.launchpad.net/ubuntu-on-ec2/+bug/574910
>
> Will downgrade back to 9.10 and try again. Thanks for helping out,
> Ryan. If you hadn't confirmed that this is not normal, I would have
> wasted a lot more time looking in the wrong places. Now I just have to
> rebuild us a new cluster...
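>
> For anyone else hitting this, a quick way to confirm the release/kernel
> and the idle load on a box (all stock tools, nothing HBase-specific) is
> something like:
>
>     lsb_release -ds       # distribution release string
>     uname -r              # running kernel version
>     cat /proc/loadavg     # 1/5/15-minute load averages + run queue size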
>
> -GS
>
> On Tue, Aug 17, 2010 at 7:39 PM, George P. Stathis <[email protected]> 
> wrote:
>> Yep. Load average definitely goes back down to 0.10 or below once
>> hadoop/hbase is shut off. The same issue is happening on the slaves as
>> well, BTW. We are running on Ubuntu 10.04, which is the only recent
>> change other than moving to a cluster setup. We were running a
>> pseudo-distributed setup before on one machine running Ubuntu 9.10;
>> top showed just as much time under %id, but the load average was close
>> to zero when idle! The environment is hypervisored as well, so I
>> suspect that's not really the issue... :-) Know anyone running on an
>> Ubuntu 10.04 setup?
>>
>> -GS
>>
>> On Tue, Aug 17, 2010 at 7:08 PM, Ryan Rawson <[email protected]> wrote:
>>> This is odd: the run queue has quite a few processes in it, but the
>>> actual CPU data says "nothing's happening".
>>>
>>> I can't tell which processes are in the run queue here. Does the load
>>> average go down when you take hadoop/hbase down?
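>>>
>>> Something like this should do it (untested here; assumes the stock
>>> stop scripts under whatever $HADOOP_HOME/$HBASE_HOME point at on your
>>> boxes):
>>>
>>>     $HBASE_HOME/bin/stop-hbase.sh     # stop the HBase daemons
>>>     $HADOOP_HOME/bin/stop-all.sh      # stop the DFS + MapReduce daemons
>>>     watch -n 5 cat /proc/loadavg      # watch the 1-minute average settle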
>>>
>>> If so, that seems to identify the locus of the problem, but it doesn't
>>> really explain WHY this is so.  Considering I've never really seen this
>>> before, I'd just write it off to running in a hypervisored environment
>>> and blame other people stealing all your CPUs :-)
>>>
>>> On Tue, Aug 17, 2010 at 4:03 PM, George P. Stathis <[email protected]> 
>>> wrote:
>>>> vmstat 2 output for 2 minutes below. Looks like everything is idle (GitHub
>>>> gist if it's easier to read: http://gist.github.com/532512):
>>>>
>>>>
>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>>>>  0  0      0 15097116 248428 1398444    0    0     0    50    5   24  0  0 100  0
>>>>  0  0      0 15096948 248428 1398444    0    0     0     0  281  281  0  0 100  0
>>>>  0  0      0 15096948 248428 1398444    0    0     0     0  279  260  0  0 100  0
>>>>  0  0      0 15096948 248428 1398444    0    0     0     0  199  216  0  0 100  0
>>>>  4  0      0 15096612 248428 1398448    0    0     0     0  528  467  0  0 100  0
>>>>  0  0      0 15096612 248428 1398448    0    0     0     0  208  213  0  0 100  0
>>>>  4  0      0 15096460 248428 1398448    0    0     0     0  251  261  0  0 100  0
>>>>  0  0      0 15096460 248428 1398448    0    0     0    12  242  248  0  0 100  0
>>>>  0  0      0 15096460 248428 1398448    0    0     0    34  228  230  0  0 100  0
>>>>  0  0      0 15096476 248428 1398448    0    0     0     0  266  272  0  0 100  0
>>>>  0  0      0 15096324 248428 1398448    0    0     0    10  179  206  0  0 100  0
>>>>  1  0      0 15096340 248428 1398448    0    0     0     0  225  254  0  0 100  0
>>>>  1  0      0 15096188 248428 1398448    0    0     0     0  263  245  0  0 100  0
>>>>  0  0      0 15096188 248428 1398448    0    0     0     2  169  210  0  0 100  0
>>>>  0  0      0 15096188 248428 1398448    0    0     0     0  201  238  0  0 100  0
>>>>  0  0      0 15096036 248428 1398448    0    0     0    10  174  202  0  0 100  0
>>>>  0  0      0 15096036 248428 1398448    0    0     0     6  207  222  0  0 100  0
>>>>  0  0      0 15095884 248428 1398448    0    0     0     0  198  242  0  0 100  0
>>>>  2  0      0 15095884 248428 1398448    0    0     0     0  177  215  0  0 100  0
>>>>  0  0      0 15095884 248428 1398448    0    0     0     0  244  265  0  0 100  0
>>>>  0  0      0 15095732 248428 1398448    0    0     0     4  197  222  0  0 100  0
>>>>  6  0      0 15095732 248428 1398448    0    0     0     6  267  260  0  0 100  0
>>>>  0  0      0 15095732 248428 1398448    0    0     0     0  240  239  0  0 100  0
>>>>  0  0      0 15095580 248428 1398448    0    0     0     8  180  210  0  0 100  0
>>>>  5  0      0 15095580 248428 1398448    0    0     0     0  193  224  0  0 100  0
>>>>  1  0      0 15095580 248428 1398448    0    0     0    36  161  191  0  0 100  0
>>>>  0  0      0 15095428 248428 1398448    0    0     0     0  176  216  0  0 100  0
>>>>  4  0      0 15095428 248428 1398448    0    0     0     0  202  236  0  0 100  0
>>>>  0  0      0 15095428 248428 1398448    0    0     0     6  191  220  0  0 100  0
>>>>  1  0      0 15095428 248428 1398448    0    0     0     0  188  238  0  0 100  0
>>>>  2  0      0 15095276 248428 1398448    0    0     0     0  174  206  0  0 100  0
>>>>  1  0      0 15095276 248428 1398448    0    0     0     0  225  249  0  0 100  0
>>>>  0  0      0 15095124 248428 1398448    0    0     0     0  222  263  0  0 100  0
>>>>  1  0      0 15095124 248428 1398448    0    0     0     6  187  236  0  0 100  0
>>>>  5  0      0 15094940 248428 1398448    0    0     0     0  453  434  0  0 100  0
>>>>  4  0      0 15094788 248428 1398448    0    0     0     2  227  225  0  0 100  0
>>>>  0  0      0 15094788 248428 1398448    0    0     0     0  213  236  0  0 100  0
>>>>  4  0      0 15094788 248428 1398448    0    0     0     6  257  253  0  0 100  0
>>>>  0  0      0 15094636 248428 1398448    0    0     0     0  215  230  0  0 100  0
>>>>  0  0      0 15094652 248428 1398448    0    0     0     0  259  285  0  0 100  0
>>>>  0  0      0 15094500 248428 1398448    0    0     0    14  194  219  0  0 100  0
>>>>  0  0      0 15094516 248428 1398448    0    0     0     0  227  257  0  0 100  0
>>>>  4  0      0 15094516 248428 1398448    0    0     0    36  266  263  0  0 100  0
>>>>  1  0      0 15094516 248428 1398448    0    0     0     0  202  213  0  0 100  0
>>>>  0  0      0 15094364 248428 1398448    0    0     0     0  204  240  0  0 100  0
>>>>  0  0      0 15094212 248428 1398448    0    0     0     6  161  194  0  0 100  0
>>>>  0  0      0 15094212 248428 1398448    0    0     0     0  191  215  0  0 100  0
>>>>  1  0      0 15094212 248428 1398448    0    0     0     0  216  238  0  0 100  0
>>>>  5  0      0 15094212 248428 1398448    0    0     0     6  169  202  0  0 100  0
>>>>  0  0      0 15094060 248428 1398448    0    0     0     0  172  216  0  0 100  0
>>>>  2  0      0 15094060 248428 1398448    0    0     0     6  201  196  0  0 100  0
>>>>  1  0      0 15094060 248428 1398448    0    0     0     0  196  218  0  0 100  0
>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>>>>  0  0      0 15093908 248428 1398448    0    0     0     0  206  236  0  0 100  0
>>>>  4  0      0 15093908 248428 1398448    0    0     0     0  197  219  0  0 100  0
>>>>  0  0      0 15093908 248428 1398448    0    0     0     0  186  227  0  0 100  0
>>>>  0  0      0 15093756 248428 1398448    0    0     0     0  168  182  0  0 100  0
>>>>  0  0      0 15093756 248428 1398448    0    0     0     0  206  239  0  0 100  0
>>>>  0  0      0 15093604 248428 1398448    0    0     0     6  281  248  0  0 100  0
>>>>  0  0      0 15093452 248428 1398448    0    0     0     0  185  198  0  0 100  0
>>>>  5  0      0 15093452 248428 1398448    0    0     0     0  265  253  0  0 100  0
>>>>  0  0      0 15093300 248428 1398448    0    0     0    36  194  211  0  0 100  0
>>>>  0  0      0 15093300 248428 1398448    0    0     0     0  228  242  0  0 100  0
>>>>  0  0      0 15093300 248428 1398448    0    0     0     0  290  262  0  0 100  0
>>>>  0  0      0 15093300 248428 1398448    0    0     0     6  187  207  0  0 100  0
>>>>
>>>>
>>>> On Tue, Aug 17, 2010 at 6:54 PM, Ryan Rawson <[email protected]> wrote:
>>>>> What does vmstat say?  Run it like 'vmstat 2' for a minute or two and
>>>>> paste the results.
>>>>>
>>>>> With no CPU being consumed by Java, it seems like there must be
>>>>> another hidden variable here: some zombied process, perhaps, or some
>>>>> kind of super IO wait, or something else.
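>>>>>
>>>>> A cheap way to rule both of those out (plain procps, nothing exotic)
>>>>> is to list anything stuck in zombie (Z) or uninterruptible IO sleep (D):
>>>>>
>>>>>     ps -eo state,pid,ppid,wchan:20,cmd | awk '$1 ~ /^[DZ]/'
>>>>>
>>>>> If that comes back empty while the load average stays high, the run
>>>>> queue numbers are even harder to explain.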
>>>>>
>>>>> Since you are running in a hypervisored environment, I can't really say
>>>>> what is happening to your instance, although one would think the load
>>>>> average numbers would be unaffected by outside processes.
>>>>>
>>>>> On Tue, Aug 17, 2010 at 3:49 PM, George P. Stathis <[email protected]> 
>>>>> wrote:
>>>>>> Actually, there is nothing in %wa but a ton sitting in %id. This is
>>>>>> from the Master:
>>>>>>
>>>>>> top - 18:30:24 up 5 days, 20:10,  1 user,  load average: 2.55, 1.99, 1.25
>>>>>> Tasks:  89 total,   1 running,  88 sleeping,   0 stopped,   0 zombie
>>>>>> Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
>>>>>> Mem:  17920228k total,  2795464k used, 15124764k free,   248428k buffers
>>>>>> Swap:        0k total,        0k used,        0k free,  1398388k cached
>>>>>>
>>>>>> I have atop installed, which reports the hadoop/hbase Java daemons as
>>>>>> the most active processes (though they barely take any CPU time and
>>>>>> spend most of their time sleeping):
>>>>>>
>>>>>> ATOP - domU-12-31-39-18-1 2010/08/17  18:31:46               10 seconds elapsed
>>>>>> PRC | sys   0.01s | user   0.00s | #proc     89 | #zombie    0 | #exit      0 |
>>>>>> CPU | sys      0% | user      0% | irq       0% | idle    200% | wait      0% |
>>>>>> cpu | sys      0% | user      0% | irq       0% | idle    100% | cpu000 w  0% |
>>>>>> CPL | avg1   2.55 | avg5    2.12 | avg15   1.35 | csw     2397 | intr    2034 |
>>>>>> MEM | tot   17.1G | free   14.4G | cache   1.3G | buff  242.6M | slab  193.1M |
>>>>>> SWP | tot    0.0M | free    0.0M |              | vmcom   1.6G | vmlim   8.5G |
>>>>>> NET | transport   | tcpi     330 | tcpo     169 | udpi     566 | udpo     147 |
>>>>>> NET | network     | ipi      896 | ipo      316 | ipfrw      0 | deliv    896 |
>>>>>> NET | eth0   ---- | pcki     777 | pcko     197 | si  248 Kbps | so   70 Kbps |
>>>>>> NET | lo     ---- | pcki     119 | pcko     119 | si    9 Kbps | so    9 Kbps |
>>>>>>
>>>>>>  PID  CPU COMMAND-LINE                                                  1/1
>>>>>> 17613   0% atop
>>>>>> 17150   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor
>>>>>> 16527   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>>>>>> 16839   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>>>>>> 16735   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>>>>>> 17083   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor
>>>>>>
>>>>>> Same with htop:
>>>>>>
>>>>>>  PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
>>>>>> 16527 ubuntu    20   0 2352M   98M 10336 S  0.0  0.6  0:42.05 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/var/log/h
>>>>>> 16735 ubuntu    20   0 2403M 81544 10236 S  0.0  0.5  0:01.56 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/var/log/h
>>>>>> 17083 ubuntu    20   0 4557M 45388 10912 S  0.0  0.3  0:00.65 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -server -XX:+Heap
>>>>>>     1 root      20   0 23684  1880  1272 S  0.0  0.0  0:00.23 /sbin/init
>>>>>>   587 root      20   0  247M  4088  2432 S  0.0  0.0 -596523h-14:-8 /usr/sbin/console-kit-daemon --no-daemon
>>>>>>  3336 root      20   0 49256  1092   540 S  0.0  0.0  0:00.36 /usr/sbin/sshd
>>>>>> 16430 nobody    20   0 34408  3704  1060 S  0.0  0.0  0:00.01 gmond
>>>>>> 17150 ubuntu    20   0 2519M  112M 11312 S  0.0  0.6 -596523h-14:-8 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -server -XX
>>>>>>
>>>>>>
>>>>>> So I'm a bit perplexed. Are there any Hadoop/HBase-specific tricks
>>>>>> that can reveal what these processes are doing?
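>>>>>>
>>>>>> The only idea I have so far is taking thread dumps of the daemons,
>>>>>> e.g. (assuming the Sun JDK tools are on the PATH, using the PIDs from
>>>>>> the listing above and running as the user that owns them):
>>>>>>
>>>>>>     jstack 16527 > /tmp/jstack-16527.txt
>>>>>>     jstack 17150 > /tmp/jstack-17150.txt
>>>>>>
>>>>>> but I'm not sure a dump will show much if the threads are all just
>>>>>> parked waiting for work.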
>>>>>>
>>>>>> -GS
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 17, 2010 at 6:14 PM, Jean-Daniel Cryans 
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> It's not normal, but then again I don't have access to your machines
>>>>>>> so I can only speculate.
>>>>>>>
>>>>>>> Does "top" show you which process is in %wa? If so, and it's a Java
>>>>>>> process, can you figure out what's going on in there?
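>>>>>>>
>>>>>>> If top itself doesn't make it obvious, iostat (from the sysstat
>>>>>>> package, if you have it installed) will at least tell you whether the
>>>>>>> disks are actually busy, e.g.:
>>>>>>>
>>>>>>>     iostat -x 2 5     # extended device stats, 2s interval, 5 samples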
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Tue, Aug 17, 2010 at 11:03 AM, George Stathis <[email protected]> 
>>>>>>> wrote:
>>>>>>> > Hello,
>>>>>>> >
>>>>>>> > We have just set up a new cluster on EC2 using Hadoop 0.20.2 and HBase
>>>>>>> > 0.20.3. Our small setup right now consists of one master and four
>>>>>>> > slaves with a replication factor of 2:
>>>>>>> >
>>>>>>> > Master: xLarge instance with 2 CPUs and 17.5 GB RAM - runs 1 namenode,
>>>>>>> > 1 secondarynamenode, 1 jobtracker, 1 hbasemaster and 1 zookeeper
>>>>>>> > (which uses its own dedicated EBS drive)
>>>>>>> > Slaves: xLarge instances with 2 CPUs and 17.5 GB RAM each - each runs
>>>>>>> > 1 datanode, 1 tasktracker and 1 regionserver
>>>>>>> >
>>>>>>> > We have also installed Ganglia to monitor the cluster stats, as we are
>>>>>>> > about to run some performance tests, but right out of the box we are
>>>>>>> > noticing high system loads (especially on the master node) without any
>>>>>>> > activity happening on the cluster. The CPUs are not being utilized at
>>>>>>> > all, yet Ganglia is reporting almost all nodes in the red, as the 1-,
>>>>>>> > 5- and 15-minute load averages are above 100% most of the time (i.e.
>>>>>>> > there are more than two processes at a time competing for the 2 CPUs'
>>>>>>> > time).
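>>>>>>> >
>>>>>>> > (To rule out Ganglia itself, the same numbers can be read straight
>>>>>>> > from the OS on any node, e.g.:
>>>>>>> >
>>>>>>> >     grep -c ^processor /proc/cpuinfo   # CPU count, 2 on these instances
>>>>>>> >     cat /proc/loadavg                  # raw 1/5/15-minute load averages
>>>>>>> >
>>>>>>> > and anything sustained above 2.00 means more runnable or waiting
>>>>>>> > processes than CPUs.)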
>>>>>>> >
>>>>>>> > Question 1: Is this normal?
>>>>>>> > Question 2: Does it matter, given that each process barely uses any of
>>>>>>> > the CPU time?
>>>>>>> >
>>>>>>> > Thank you in advance and pardon the noob questions.
>>>>>>> >
>>>>>>> > -GS
>>>>>>> >
>>>>>>
>>>>>
>>>>
>>>
>>
>
