This is odd: the run queue has quite a few processes in it, but the actual cpu data says 'nothing's happening'.
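If you want to see exactly which processes are sitting on the run queue when the load is up, something along these lines should show it (a rough sketch, untested on your boxes; on Linux both runnable (R) and uninterruptible-sleep (D) processes count toward the load average):

  # list everything currently runnable or stuck in uninterruptible sleep
  ps -eo state,pid,user,comm | awk '$1 ~ /^[RD]/'

  # or sample it a few times while watching the load
  for i in $(seq 1 10); do date; ps -eo state,pid,comm | grep '^[RD]'; sleep 2; done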
I can't tell from the vmstat output which processes are in the run queue; does the load average go down when you take hadoop/hbase down? If so, that identifies the locus of the problem, but it doesn't really explain WHY it is so. Considering I've never really seen this before, I'd just write it off to using a hypervisor'ed environment and blame other people stealing all your cpus :-)

On Tue, Aug 17, 2010 at 4:03 PM, George P. Stathis <[email protected]> wrote:
> vmstat 2 for 2 mins below. Looks like everything is idle (github gist paste if it's easier to read: http://gist.github.com/532512):
>
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 0 0 0 15097116 248428 1398444 0 0 0 50 5 24 0 0 100 0
> 0 0 0 15096948 248428 1398444 0 0 0 0 281 281 0 0 100 0
> 0 0 0 15096948 248428 1398444 0 0 0 0 279 260 0 0 100 0
> 0 0 0 15096948 248428 1398444 0 0 0 0 199 216 0 0 100 0
> 4 0 0 15096612 248428 1398448 0 0 0 0 528 467 0 0 100 0
> 0 0 0 15096612 248428 1398448 0 0 0 0 208 213 0 0 100 0
> 4 0 0 15096460 248428 1398448 0 0 0 0 251 261 0 0 100 0
> 0 0 0 15096460 248428 1398448 0 0 0 12 242 248 0 0 100 0
> 0 0 0 15096460 248428 1398448 0 0 0 34 228 230 0 0 100 0
> 0 0 0 15096476 248428 1398448 0 0 0 0 266 272 0 0 100 0
> 0 0 0 15096324 248428 1398448 0 0 0 10 179 206 0 0 100 0
> 1 0 0 15096340 248428 1398448 0 0 0 0 225 254 0 0 100 0
> 1 0 0 15096188 248428 1398448 0 0 0 0 263 245 0 0 100 0
> 0 0 0 15096188 248428 1398448 0 0 0 2 169 210 0 0 100 0
> 0 0 0 15096188 248428 1398448 0 0 0 0 201 238 0 0 100 0
> 0 0 0 15096036 248428 1398448 0 0 0 10 174 202 0 0 100 0
> 0 0 0 15096036 248428 1398448 0 0 0 6 207 222 0 0 100 0
> 0 0 0 15095884 248428 1398448 0 0 0 0 198 242 0 0 100 0
> 2 0 0 15095884 248428 1398448 0 0 0 0 177 215 0 0 100 0
> 0 0 0 15095884 248428 1398448 0 0 0 0 244 265 0 0 100 0
> 0 0 0 15095732 248428 1398448 0 0 0 4 197 222 0 0 100 0
> 6 0 0 15095732 248428 1398448 0 0 0 6 267 260 0 0 100 0
> 0 0 0 15095732 248428 1398448 0 0 0 0 240 239 0 0 100 0
> 0 0 0 15095580 248428 1398448 0 0 0 8 180 210 0 0 100 0
> 5 0 0 15095580 248428 1398448 0 0 0 0 193 224 0 0 100 0
> 1 0 0 15095580 248428 1398448 0 0 0 36 161 191 0 0 100 0
> 0 0 0 15095428 248428 1398448 0 0 0 0 176 216 0 0 100 0
> 4 0 0 15095428 248428 1398448 0 0 0 0 202 236 0 0 100 0
> 0 0 0 15095428 248428 1398448 0 0 0 6 191 220 0 0 100 0
> 1 0 0 15095428 248428 1398448 0 0 0 0 188 238 0 0 100 0
> 2 0 0 15095276 248428 1398448 0 0 0 0 174 206 0 0 100 0
> 1 0 0 15095276 248428 1398448 0 0 0 0 225 249 0 0 100 0
> 0 0 0 15095124 248428 1398448 0 0 0 0 222 263 0 0 100 0
> 1 0 0 15095124 248428 1398448 0 0 0 6 187 236 0 0 100 0
> 5 0 0 15094940 248428 1398448 0 0 0 0 453 434 0 0 100 0
> 4 0 0 15094788 248428 1398448 0 0 0 2 227 225 0 0 100 0
> 0 0 0 15094788 248428 1398448 0 0 0 0 213 236 0 0 100 0
> 4 0 0 15094788 248428 1398448 0 0 0 6 257 253 0 0 100 0
> 0 0 0 15094636 248428 1398448 0 0 0 0 215 230 0 0 100 0
> 0 0 0 15094652 248428 1398448 0 0 0 0 259 285 0 0 100 0
> 0 0 0 15094500 248428 1398448 0 0 0 14 194 219 0 0 100 0
> 0 0 0 15094516 248428 1398448 0 0 0 0 227 257 0 0 100 0
> 4 0 0 15094516 248428 1398448 0 0 0 36 266 263 0 0 100 0
> 1 0 0 15094516 248428 1398448 0 0 0 0 202 213 0 0 100 0
> 0 0 0 15094364 248428 1398448 0 0 0 0 204 240 0 0 100 0
> 0 0 0 15094212 248428 1398448 0 0 0 6 161 194 0 0 100 0
> 0 0 0 15094212 248428 1398448 0 0 0 0 191 215 0 0 100 0
> 1 0 0 15094212 248428 1398448 0 0 0 0 216 238 0 0 100 0
> 5 0 0 15094212 248428 1398448 0 0 0 6 169 202 0 0 100 0
> 0 0 0 15094060 248428 1398448 0 0 0 0 172 216 0 0 100 0
> 2 0 0 15094060 248428 1398448 0 0 0 6 201 196 0 0 100 0
> 1 0 0 15094060 248428 1398448 0 0 0 0 196 218 0 0 100 0
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 0 0 0 15093908 248428 1398448 0 0 0 0 206 236 0 0 100 0
> 4 0 0 15093908 248428 1398448 0 0 0 0 197 219 0 0 100 0
> 0 0 0 15093908 248428 1398448 0 0 0 0 186 227 0 0 100 0
> 0 0 0 15093756 248428 1398448 0 0 0 0 168 182 0 0 100 0
> 0 0 0 15093756 248428 1398448 0 0 0 0 206 239 0 0 100 0
> 0 0 0 15093604 248428 1398448 0 0 0 6 281 248 0 0 100 0
> 0 0 0 15093452 248428 1398448 0 0 0 0 185 198 0 0 100 0
> 5 0 0 15093452 248428 1398448 0 0 0 0 265 253 0 0 100 0
> 0 0 0 15093300 248428 1398448 0 0 0 36 194 211 0 0 100 0
> 0 0 0 15093300 248428 1398448 0 0 0 0 228 242 0 0 100 0
> 0 0 0 15093300 248428 1398448 0 0 0 0 290 262 0 0 100 0
> 0 0 0 15093300 248428 1398448 0 0 0 6 187 207 0 0 100 0
>
> On Tue, Aug 17, 2010 at 6:54 PM, Ryan Rawson <[email protected]> wrote:
>> What does vmstat say? Run it like 'vmstat 2' for a minute or two and paste the results.
>>
>> With no cpu being consumed by java, it seems like there must be another hidden variable here. Some zombied process perhaps, or some kind of super IO wait, or something else.
>>
>> Since you are running in a hypervisor environment, I can't really say what is happening to your instance, although one would think the LA numbers would be unaffected by outside processes.
>>
>> On Tue, Aug 17, 2010 at 3:49 PM, George P. Stathis <[email protected]> wrote:
>>> Actually, there is nothing in %wa but a ton sitting in %id. This is from the Master:
>>>
>>> top - 18:30:24 up 5 days, 20:10, 1 user, load average: 2.55, 1.99, 1.25
>>> Tasks: 89 total, 1 running, 88 sleeping, 0 stopped, 0 zombie
>>> Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.2%st
>>> Mem: 17920228k total, 2795464k used, 15124764k free, 248428k buffers
>>> Swap: 0k total, 0k used, 0k free, 1398388k cached
>>>
>>> I have atop installed, which reports the hadoop/hbase java daemons as the most active processes (barely taking any CPU time though, and in sleep mode most of the time):
>>>
>>> ATOP - domU-12-31-39-18-1 2010/08/17 18:31:46 10 seconds elapsed
>>> PRC | sys 0.01s | user 0.00s | #proc 89 | #zombie 0 | #exit 0 |
>>> CPU | sys 0% | user 0% | irq 0% | idle 200% | wait 0% |
>>> cpu | sys 0% | user 0% | irq 0% | idle 100% | cpu000 w 0% |
>>> CPL | avg1 2.55 | avg5 2.12 | avg15 1.35 | csw 2397 | intr 2034 |
>>> MEM | tot 17.1G | free 14.4G | cache 1.3G | buff 242.6M | slab 193.1M |
>>> SWP | tot 0.0M | free 0.0M | | vmcom 1.6G | vmlim 8.5G |
>>> NET | transport | tcpi 330 | tcpo 169 | udpi 566 | udpo 147 |
>>> NET | network | ipi 896 | ipo 316 | ipfrw 0 | deliv 896 |
>>> NET | eth0 ---- | pcki 777 | pcko 197 | si 248 Kbps | so 70 Kbps |
>>> NET | lo ---- | pcki 119 | pcko 119 | si 9 Kbps | so 9 Kbps |
>>>
>>> PID CPU COMMAND-LINE 1/1
>>> 17613 0% atop
>>> 17150 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor
>>> 16527 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>>> 16839 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>>> 16735 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>>> 17083 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor
>>>
>>> Same with htop:
>>>
>>> PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
>>> 16527 ubuntu 20 0 2352M 98M 10336 S 0.0 0.6 0:42.05 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/var/log/h
>>> 16735 ubuntu 20 0 2403M 81544 10236 S 0.0 0.5 0:01.56 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/var/log/h
>>> 17083 ubuntu 20 0 4557M 45388 10912 S 0.0 0.3 0:00.65 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -server -XX:+Heap
>>> 1 root 20 0 23684 1880 1272 S 0.0 0.0 0:00.23 /sbin/init
>>> 587 root 20 0 247M 4088 2432 S 0.0 0.0 -596523h-14:-8 /usr/sbin/console-kit-daemon --no-daemon
>>> 3336 root 20 0 49256 1092 540 S 0.0 0.0 0:00.36 /usr/sbin/sshd
>>> 16430 nobody 20 0 34408 3704 1060 S 0.0 0.0 0:00.01 gmond
>>> 17150 ubuntu 20 0 2519M 112M 11312 S 0.0 0.6 -596523h-14:-8 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -server -XX
>>>
>>> So I'm a bit perplexed. Are there any hadoop / hbase specific tricks that can reveal what these processes are doing?
>>>
>>> -GS
>>>
>>> On Tue, Aug 17, 2010 at 6:14 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>> It's not normal, but then again I don't have access to your machines, so I can only speculate.
>>>>
>>>> Does "top" show you which process is in %wa? If so, and it's a java process, can you figure out what's going on in there?
>>>>
>>>> J-D
>>>>
>>>> On Tue, Aug 17, 2010 at 11:03 AM, George Stathis <[email protected]> wrote:
>>>> > Hello,
>>>> >
>>>> > We have just set up a new cluster on EC2 using Hadoop 0.20.2 and HBase 0.20.3. Our small setup as of right now consists of one master and four slaves with a replication factor of 2:
>>>> >
>>>> > Master: xLarge instance with 2 CPUs and 17.5 GB RAM - runs 1 namenode, 1 secondarynamenode, 1 jobtracker, 1 hbasemaster, and 1 zookeeper (which uses its own dedicated EBS drive)
>>>> > Slaves: xLarge instances with 2 CPUs and 17.5 GB RAM each - each runs 1 datanode, 1 tasktracker, and 1 regionserver
>>>> >
>>>> > We have also installed Ganglia to monitor the cluster stats, as we are about to run some performance tests. But right out of the box we are noticing high system loads (especially on the master node) without any activity happening on the cluster. The CPUs are not being utilized at all, yet Ganglia is reporting almost all nodes in the red, since the 1, 5 and 15 minute load averages are above 100% most of the time (i.e. there are more than two processes at a time competing for the 2 CPUs' time).
>>>> >
>>>> > Question 1: is this normal?
>>>> > Question 2: does it matter, since each process barely uses any CPU time?
>>>> >
>>>> > Thank you in advance, and pardon the noob questions.
>>>> >
>>>> > -GS
>>>> >
>>>
>>
>
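Re: George's question further up about hadoop/hbase-specific tricks for seeing what the daemons are doing: the usual approach is a plain JVM thread dump rather than anything HBase-specific. Roughly, using one of the java pids from the listings above (a sketch, assuming the stock Sun JDK 6 shown in those command lines; run it as the same user that owns the process):

  # dump all thread stacks for one of the daemons, e.g. pid 17150 from the listing above
  jstack 17150 > /tmp/threads-17150.txt

  # or send SIGQUIT; the JVM prints the dump to the daemon's stdout, i.e. the .out file under the hadoop/hbase log dir
  kill -QUIT 17150

Taking a couple of dumps a few seconds apart shows whether the threads are doing real work or just parked waiting.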
