This is odd: the run queue has quite a few processes in it, but the actual cpu data says 'nothing's happening'.
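If you want to see exactly which processes are sitting on the run queue when the load is up, something along these lines should show it (a rough sketch, untested on your boxes; on Linux both runnable (R) and uninterruptible-sleep (D) processes count toward the load average):

  # list everything currently runnable or stuck in uninterruptible sleep
  ps -eo state,pid,user,comm | awk '$1 ~ /^[RD]/'

  # or sample it a few times while watching the load
  for i in $(seq 1 10); do date; ps -eo state,pid,comm | grep '^[RD]'; sleep 2; done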
I can't tell from the vmstat output which processes are in the run queue; does the load average go down when you take hadoop/hbase down? If so, that identifies the locus of the problem, but it doesn't really explain WHY it is so. Considering I've never really seen this before, I'd just write it off to using a hypervisor'ed environment and blame other people stealing all your cpus :-)

On Tue, Aug 17, 2010 at 4:03 PM, George P. Stathis <[email protected]> wrote:
> vmstat 2 for 2 mins below. Looks like everything is idle (github gist paste if it's easier to read: http://gist.github.com/532512):
>
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 0 0 0 15097116 248428 1398444 0 0 0 50 5 24 0 0 100 0
> 0 0 0 15096948 248428 1398444 0 0 0 0 281 281 0 0 100 0
> 0 0 0 15096948 248428 1398444 0 0 0 0 279 260 0 0 100 0
> 0 0 0 15096948 248428 1398444 0 0 0 0 199 216 0 0 100 0
> 4 0 0 15096612 248428 1398448 0 0 0 0 528 467 0 0 100 0
> 0 0 0 15096612 248428 1398448 0 0 0 0 208 213 0 0 100 0
> 4 0 0 15096460 248428 1398448 0 0 0 0 251 261 0 0 100 0
> 0 0 0 15096460 248428 1398448 0 0 0 12 242 248 0 0 100 0
> 0 0 0 15096460 248428 1398448 0 0 0 34 228 230 0 0 100 0
> 0 0 0 15096476 248428 1398448 0 0 0 0 266 272 0 0 100 0
> 0 0 0 15096324 248428 1398448 0 0 0 10 179 206 0 0 100 0
> 1 0 0 15096340 248428 1398448 0 0 0 0 225 254 0 0 100 0
> 1 0 0 15096188 248428 1398448 0 0 0 0 263 245 0 0 100 0
> 0 0 0 15096188 248428 1398448 0 0 0 2 169 210 0 0 100 0
> 0 0 0 15096188 248428 1398448 0 0 0 0 201 238 0 0 100 0
> 0 0 0 15096036 248428 1398448 0 0 0 10 174 202 0 0 100 0
> 0 0 0 15096036 248428 1398448 0 0 0 6 207 222 0 0 100 0
> 0 0 0 15095884 248428 1398448 0 0 0 0 198 242 0 0 100 0
> 2 0 0 15095884 248428 1398448 0 0 0 0 177 215 0 0 100 0
> 0 0 0 15095884 248428 1398448 0 0 0 0 244 265 0 0 100 0
> 0 0 0 15095732 248428 1398448 0 0 0 4 197 222 0 0 100 0
> 6 0 0 15095732 248428 1398448 0 0 0 6 267 260 0 0 100 0
> 0 0 0 15095732 248428 1398448 0 0 0 0 240 239 0 0 100 0
> 0 0 0 15095580 248428 1398448 0 0 0 8 180 210 0 0 100 0
> 5 0 0 15095580 248428 1398448 0 0 0 0 193 224 0 0 100 0
> 1 0 0 15095580 248428 1398448 0 0 0 36 161 191 0 0 100 0
> 0 0 0 15095428 248428 1398448 0 0 0 0 176 216 0 0 100 0
> 4 0 0 15095428 248428 1398448 0 0 0 0 202 236 0 0 100 0
> 0 0 0 15095428 248428 1398448 0 0 0 6 191 220 0 0 100 0
> 1 0 0 15095428 248428 1398448 0 0 0 0 188 238 0 0 100 0
> 2 0 0 15095276 248428 1398448 0 0 0 0 174 206 0 0 100 0
> 1 0 0 15095276 248428 1398448 0 0 0 0 225 249 0 0 100 0
> 0 0 0 15095124 248428 1398448 0 0 0 0 222 263 0 0 100 0
> 1 0 0 15095124 248428 1398448 0 0 0 6 187 236 0 0 100 0
> 5 0 0 15094940 248428 1398448 0 0 0 0 453 434 0 0 100 0
> 4 0 0 15094788 248428 1398448 0 0 0 2 227 225 0 0 100 0
> 0 0 0 15094788 248428 1398448 0 0 0 0 213 236 0 0 100 0
> 4 0 0 15094788 248428 1398448 0 0 0 6 257 253 0 0 100 0
> 0 0 0 15094636 248428 1398448 0 0 0 0 215 230 0 0 100 0
> 0 0 0 15094652 248428 1398448 0 0 0 0 259 285 0 0 100 0
> 0 0 0 15094500 248428 1398448 0 0 0 14 194 219 0 0 100 0
> 0 0 0 15094516 248428 1398448 0 0 0 0 227 257 0 0 100 0
> 4 0 0 15094516 248428 1398448 0 0 0 36 266 263 0 0 100 0
> 1 0 0 15094516 248428 1398448 0 0 0 0 202 213 0 0 100 0
> 0 0 0 15094364 248428 1398448 0 0 0 0 204 240 0 0 100 0
> 0 0 0 15094212 248428 1398448 0 0 0 6 161 194 0 0 100 0
> 0 0 0 15094212 248428 1398448 0 0 0 0 191 215 0 0 100 0
> 1 0 0 15094212 248428 1398448 0 0 0 0 216 238 0 0 100 0
> 5 0 0 15094212 248428 1398448 0 0 0 6 169 202 0 0 100 0
> 0 0 0 15094060 248428 1398448 0 0 0 0 172 216 0 0 100 0
> 2 0 0 15094060 248428 1398448 0 0 0 6 201 196 0 0 100 0
> 1 0 0 15094060 248428 1398448 0 0 0 0 196 218 0 0 100 0
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 0 0 0 15093908 248428 1398448 0 0 0 0 206 236 0 0 100 0
> 4 0 0 15093908 248428 1398448 0 0 0 0 197 219 0 0 100 0
> 0 0 0 15093908 248428 1398448 0 0 0 0 186 227 0 0 100 0
> 0 0 0 15093756 248428 1398448 0 0 0 0 168 182 0 0 100 0
> 0 0 0 15093756 248428 1398448 0 0 0 0 206 239 0 0 100 0
> 0 0 0 15093604 248428 1398448 0 0 0 6 281 248 0 0 100 0
> 0 0 0 15093452 248428 1398448 0 0 0 0 185 198 0 0 100 0
> 5 0 0 15093452 248428 1398448 0 0 0 0 265 253 0 0 100 0
> 0 0 0 15093300 248428 1398448 0 0 0 36 194 211 0 0 100 0
> 0 0 0 15093300 248428 1398448 0 0 0 0 228 242 0 0 100 0
> 0 0 0 15093300 248428 1398448 0 0 0 0 290 262 0 0 100 0
> 0 0 0 15093300 248428 1398448 0 0 0 6 187 207 0 0 100 0
>
> On Tue, Aug 17, 2010 at 6:54 PM, Ryan Rawson <[email protected]> wrote:
>> What does vmstat say? Run it like 'vmstat 2' for a minute or two and paste the results.
>>
>> With no cpu being consumed by java, it seems like there must be another hidden variable here. Some zombied process perhaps, or some kind of super IO wait, or something else.
>>
>> Since you are running in a hypervisor environment, I can't really say what is happening to your instance, although one would think the LA numbers would be unaffected by outside processes.
>>
>> On Tue, Aug 17, 2010 at 3:49 PM, George P. Stathis <[email protected]> wrote:
>>> Actually, there is nothing in %wa but a ton sitting in %id. This is from the Master:
>>>
>>> top - 18:30:24 up 5 days, 20:10, 1 user, load average: 2.55, 1.99, 1.25
>>> Tasks: 89 total, 1 running, 88 sleeping, 0 stopped, 0 zombie
>>> Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.2%st
>>> Mem: 17920228k total, 2795464k used, 15124764k free, 248428k buffers
>>> Swap: 0k total, 0k used, 0k free, 1398388k cached
>>>
>>> I have atop installed, which reports the hadoop/hbase java daemons as the most active processes (barely taking any CPU time though, and in sleep mode most of the time):
>>>
>>> ATOP - domU-12-31-39-18-1 2010/08/17 18:31:46 10 seconds elapsed
>>> PRC | sys 0.01s | user 0.00s | #proc 89 | #zombie 0 | #exit 0 |
>>> CPU | sys 0% | user 0% | irq 0% | idle 200% | wait 0% |
>>> cpu | sys 0% | user 0% | irq 0% | idle 100% | cpu000 w 0% |
>>> CPL | avg1 2.55 | avg5 2.12 | avg15 1.35 | csw 2397 | intr 2034 |
>>> MEM | tot 17.1G | free 14.4G | cache 1.3G | buff 242.6M | slab 193.1M |
>>> SWP | tot 0.0M | free 0.0M | | vmcom 1.6G | vmlim 8.5G |
>>> NET | transport | tcpi 330 | tcpo 169 | udpi 566 | udpo 147 |
>>> NET | network | ipi 896 | ipo 316 | ipfrw 0 | deliv 896 |
>>> NET | eth0 ---- | pcki 777 | pcko 197 | si 248 Kbps | so 70 Kbps |
>>> NET | lo ---- | pcki 119 | pcko 119 | si 9 Kbps | so 9 Kbps |
>>>
>>> PID CPU COMMAND-LINE 1/1
>>> 17613 0% atop
>>> 17150 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor
>>> 16527 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>>> 16839 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>>> 16735 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>>> 17083 0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor
>>>
>>> Same with htop:
>>>
>>> PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
>>> 16527 ubuntu 20 0 2352M 98M 10336 S 0.0 0.6 0:42.05 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/var/log/h
>>> 16735 ubuntu 20 0 2403M 81544 10236 S 0.0 0.5 0:01.56 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/var/log/h
>>> 17083 ubuntu 20 0 4557M 45388 10912 S 0.0 0.3 0:00.65 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -server -XX:+Heap
>>> 1 root 20 0 23684 1880 1272 S 0.0 0.0 0:00.23 /sbin/init
>>> 587 root 20 0 247M 4088 2432 S 0.0 0.0 -596523h-14:-8 /usr/sbin/console-kit-daemon --no-daemon
>>> 3336 root 20 0 49256 1092 540 S 0.0 0.0 0:00.36 /usr/sbin/sshd
>>> 16430 nobody 20 0 34408 3704 1060 S 0.0 0.0 0:00.01 gmond
>>> 17150 ubuntu 20 0 2519M 112M 11312 S 0.0 0.6 -596523h-14:-8 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -server -XX
>>>
>>> So I'm a bit perplexed. Are there any hadoop / hbase specific tricks that can reveal what these processes are doing?
>>>
>>> -GS
>>>
>>> On Tue, Aug 17, 2010 at 6:14 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>> It's not normal, but then again I don't have access to your machines, so I can only speculate.
>>>>
>>>> Does "top" show you which process is in %wa? If so, and it's a java process, can you figure out what's going on in there?
>>>>
>>>> J-D
>>>>
>>>> On Tue, Aug 17, 2010 at 11:03 AM, George Stathis <[email protected]> wrote:
>>>> > Hello,
>>>> >
>>>> > We have just set up a new cluster on EC2 using Hadoop 0.20.2 and HBase 0.20.3. Our small setup as of right now consists of one master and four slaves with a replication factor of 2:
>>>> >
>>>> > Master: xLarge instance with 2 CPUs and 17.5 GB RAM - runs 1 namenode, 1 secondarynamenode, 1 jobtracker, 1 hbasemaster, and 1 zookeeper (which uses its own dedicated EBS drive)
>>>> > Slaves: xLarge instances with 2 CPUs and 17.5 GB RAM each - each runs 1 datanode, 1 tasktracker, and 1 regionserver
>>>> >
>>>> > We have also installed Ganglia to monitor the cluster stats, as we are about to run some performance tests. But right out of the box we are noticing high system loads (especially on the master node) without any activity happening on the cluster. The CPUs are not being utilized at all, yet Ganglia is reporting almost all nodes in the red, since the 1, 5 and 15 minute load averages are above 100% most of the time (i.e. there are more than two processes at a time competing for the 2 CPUs' time).
>>>> >
>>>> > Question 1: is this normal?
>>>> > Question 2: does it matter, since each process barely uses any CPU time?
>>>> >
>>>> > Thank you in advance, and pardon the noob questions.
>>>> >
>>>> > -GS
>>>> >
>>>
>>
>
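Re: George's question further up about hadoop/hbase-specific tricks for seeing what the daemons are doing: the usual approach is a plain JVM thread dump rather than anything HBase-specific. Roughly, using one of the java pids from the listings above (a sketch, assuming the stock Sun JDK 6 shown in those command lines; run it as the same user that owns the process):

  # dump all thread stacks for one of the daemons, e.g. pid 17150 from the listing above
  jstack 17150 > /tmp/threads-17150.txt

  # or send SIGQUIT; the JVM prints the dump to the daemon's stdout, i.e. the .out file under the hadoop/hbase log dir
  kill -QUIT 17150

Taking a couple of dumps a few seconds apart shows whether the threads are doing real work or just parked waiting.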
