Re: KVM with hugepages generate huge load with two guests
Hi,

So, nobody has any idea what's going wrong with all these massive IRQs and spin_locks that cause the virtual machines to almost completely stop? :(

Thanks,
Dmitry

On Wed, Dec 1, 2010 at 5:38 AM, Dmitry Golubev lastg...@gmail.com wrote:

Hi,

Sorry it took so long to reply to you - there are only a few moments when I can poke a production server, and I need to notify people in advance about that :(

> Can you post kvm_stat output while slowness is happening? 'perf top' on
> the host? and on the guest?

I ran 'perf top', and the first thing I saw is that while the guest is on acpi_pm, it shows a more or less normal number of IRQs (under 1000/s). However, when I switched back to the default (which is nohz with kvm_clock), there are 40 times (!!!) more IRQs under normal operation (about 40,000/s). When the slowdown is happening, there are a lot of _spin_lock events and a lot of messages like:

WARNING: failed to keep up with mmap data. Last read 810 msecs ago.

As I said before, switching to acpi_pm does not save the day, but it makes the situation a lot more workable (i.e., the servers recover faster from the period of slowness). During slowdowns on acpi_pm I also see _spin_lock.

Raw data follows.

vmstat 5 on the host:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b swpd   free  buff  cache  si so   bi   bo    in    cs us sy id wa
 0  0    0 131904 13952 205872   0  0    0   24  2495  9813  6  3 91  0
 0  0    0 132984 13952 205872   0  0    0   47  2596  9851  5  3 91  1
 1  0    0 132148 13952 205872   0  0    0   54  2644 10559  3  3 93  1
 0  1    0 129084 13952 205872   0  0    0   38  3039  9752  7  3 87  2
 6  0    0 126388 13952 205872   0  0    0  311 15619  9009 42 17 39  2
 9  0    0 125868 13960 205872   0  0    6   86  4659  6504 98  2  0  0
 8  0    0 123320 13960 205872   0  0    0   26  4682  6649 98  2  0  0
 8  0    0 126252 13960 205872   0  0    0  124  4923  6776 98  2  0  0
 8  0    0 125376 13960 205872   0  0  136   11  4287  5865 98  2  0  0
 9  0    0 123812 13960 205872   0  0  205   51  4497  6134 98  2  0  0
 8  0    0 126020 13960 205872   0  0  904   26  4483  5999 98  2  0  0
 8  0    0 124052 13960 205872   0  0   15   10  4397  6200 98  2  0  0
 8  0    0 125928 13960 205872   0  0   14   41  4335  5823 98  2  0  0
 8  0    0 126184 13960 205872   0  0    6   14  4966  6588 98  2  0  0
 8  0    0 123588 13960 205872   0  0  143   18  5234  6891 98  2  0  0
 8  0    0 126640 13960 205872   0  0    6   91  5554  7334 98  2  0  0
 8  0    0 123144 13960 205872   0  0  146   11  5235  7145 98  2  0  0
 8  0    0 125856 13968 205872   0  0 1282   98  5481  7159 98  2  0  0
 9 19    0 124124 13968 205872   0  0  782 2433  8587  8987 97  3  0  0
 8  0    0 122584 13968 205872   0  0  432   90  5359  6960 98  2  0  0
 8  0    0 125320 13968 205872   0  0 3074   52  5448  7095 97  3  0  0
 8  0    0 121436 13968 205872   0  0 2519   81  5714  7279 98  2  0  0
 8  0    0 124436 13968 205872   0  0    1   56  5242  6864 98  2  0  0
 8  0    0 111324 13968 205872   0  0    2   22 10660  6686 97  3  0  0
 8  0    0 107824 13968 205872   0  0    0   24 14329  8147 97  3  0  0
 8  0    0 110420 13968 205872   0  0    0   68 13486  6985 98  2  0  0
 8  0    0 110024 13968 205872   0  0    0   19 13085  6659 98  2  0  0
 8  0    0 109932 13968 205872   0  0    0    3 12952  6415 98  2  0  0
 8  0    0 108552 13968 205880   0  0    2   41 13400  7349 98  2  0  0

A few shots with kvm_stat on the host:

Every 2.0s: kvm_stat -1                                Wed Dec 1 04:45:47 2010

efer_reload                  0      0
exits                 56264102  14074
fpu_reload              311506     50
halt_exits             4733166    935
halt_wakeup            3845079    840
host_state_reload      8795964   4085
hypercalls                   0      0
insn_emulation        13573212   7249
insn_emulation_fail          0      0
invlpg                 1846050     20
io_exits               3579406    843
irq_exits              3038887   4879
irq_injections         5242157   3681
irq_window              124361    540
largepages                2253      0
mmio_exits               64274     20
mmu_cache_miss          664011     16
mmu_flooded             164506      1
mmu_pde_zapped          212686      8
mmu_pte_updated         729268      0
mmu_pte_write         81323616    551
mmu_recycled               277      0
mmu_shadow_zapped       652691     23
mmu_unsync                5630      8
nmi_injections               0      0
nmi_window                   0      0
pf_fixed              17470658    218
pf_guest
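For reference, the counters quoted above can be captured with stock tools; a rough recipe (a sketch - kvm_stat ships with qemu-kvm and reads its numbers from debugfs, so the mount step may already be done on your distribution):

# on the host
mount -t debugfs none /sys/kernel/debug    # only if debugfs is not mounted yet
kvm_stat -1                                # one-shot counter dump, as quoted above
watch -n 2 'kvm_stat -1'                   # 2-second refresh, matching the captures

# profile the host (and, separately, the guest) during the slowdown
perf top

# inside the guest: check which clocksource is currently active
cat /sys/devices/system/clocksource/clocksource0/current_clocksource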
Re: KVM with hugepages generate huge load with two guests
Hi,

Sorry it took so long to reply to you - there are only a few moments when I can poke a production server, and I need to notify people in advance about that :(

> Can you post kvm_stat output while slowness is happening? 'perf top' on
> the host? and on the guest?

I ran 'perf top', and the first thing I saw is that while the guest is on acpi_pm, it shows a more or less normal number of IRQs (under 1000/s). However, when I switched back to the default (which is nohz with kvm_clock), there are 40 times (!!!) more IRQs under normal operation (about 40,000/s). When the slowdown is happening, there are a lot of _spin_lock events and a lot of messages like:

WARNING: failed to keep up with mmap data. Last read 810 msecs ago.

As I said before, switching to acpi_pm does not save the day, but it makes the situation a lot more workable (i.e., the servers recover faster from the period of slowness). During slowdowns on acpi_pm I also see _spin_lock.

Raw data follows.

vmstat 5 on the host:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b swpd   free  buff  cache  si so   bi   bo    in    cs us sy id wa
 0  0    0 131904 13952 205872   0  0    0   24  2495  9813  6  3 91  0
 0  0    0 132984 13952 205872   0  0    0   47  2596  9851  5  3 91  1
 1  0    0 132148 13952 205872   0  0    0   54  2644 10559  3  3 93  1
 0  1    0 129084 13952 205872   0  0    0   38  3039  9752  7  3 87  2
 6  0    0 126388 13952 205872   0  0    0  311 15619  9009 42 17 39  2
 9  0    0 125868 13960 205872   0  0    6   86  4659  6504 98  2  0  0
 8  0    0 123320 13960 205872   0  0    0   26  4682  6649 98  2  0  0
 8  0    0 126252 13960 205872   0  0    0  124  4923  6776 98  2  0  0
 8  0    0 125376 13960 205872   0  0  136   11  4287  5865 98  2  0  0
 9  0    0 123812 13960 205872   0  0  205   51  4497  6134 98  2  0  0
 8  0    0 126020 13960 205872   0  0  904   26  4483  5999 98  2  0  0
 8  0    0 124052 13960 205872   0  0   15   10  4397  6200 98  2  0  0
 8  0    0 125928 13960 205872   0  0   14   41  4335  5823 98  2  0  0
 8  0    0 126184 13960 205872   0  0    6   14  4966  6588 98  2  0  0
 8  0    0 123588 13960 205872   0  0  143   18  5234  6891 98  2  0  0
 8  0    0 126640 13960 205872   0  0    6   91  5554  7334 98  2  0  0
 8  0    0 123144 13960 205872   0  0  146   11  5235  7145 98  2  0  0
 8  0    0 125856 13968 205872   0  0 1282   98  5481  7159 98  2  0  0
 9 19    0 124124 13968 205872   0  0  782 2433  8587  8987 97  3  0  0
 8  0    0 122584 13968 205872   0  0  432   90  5359  6960 98  2  0  0
 8  0    0 125320 13968 205872   0  0 3074   52  5448  7095 97  3  0  0
 8  0    0 121436 13968 205872   0  0 2519   81  5714  7279 98  2  0  0
 8  0    0 124436 13968 205872   0  0    1   56  5242  6864 98  2  0  0
 8  0    0 111324 13968 205872   0  0    2   22 10660  6686 97  3  0  0
 8  0    0 107824 13968 205872   0  0    0   24 14329  8147 97  3  0  0
 8  0    0 110420 13968 205872   0  0    0   68 13486  6985 98  2  0  0
 8  0    0 110024 13968 205872   0  0    0   19 13085  6659 98  2  0  0
 8  0    0 109932 13968 205872   0  0    0    3 12952  6415 98  2  0  0
 8  0    0 108552 13968 205880   0  0    2   41 13400  7349 98  2  0  0

A few shots with kvm_stat on the host:

Every 2.0s: kvm_stat -1                                Wed Dec 1 04:45:47 2010

efer_reload                  0      0
exits                 56264102  14074
fpu_reload              311506     50
halt_exits             4733166    935
halt_wakeup            3845079    840
host_state_reload      8795964   4085
hypercalls                   0      0
insn_emulation        13573212   7249
insn_emulation_fail          0      0
invlpg                 1846050     20
io_exits               3579406    843
irq_exits              3038887   4879
irq_injections         5242157   3681
irq_window              124361    540
largepages                2253      0
mmio_exits               64274     20
mmu_cache_miss          664011     16
mmu_flooded             164506      1
mmu_pde_zapped          212686      8
mmu_pte_updated         729268      0
mmu_pte_write         81323616    551
mmu_recycled               277      0
mmu_shadow_zapped       652691     23
mmu_unsync                5630      8
nmi_injections               0      0
nmi_window                   0      0
pf_fixed              17470658    218
pf_guest            1335220581
remote_tlb_flush     189893096
request_irq                  0      0
signal_exits                 0      0
tlb_flush              5827433    108

Every 2.0s: kvm_stat -1                                Wed Dec 1 04:47:33 2010

efer_reload                  0      0
exits
Re: KVM with hugepages generate huge load with two guests
Just out of curiosity: did you try updating the BIOS on your motherboard? The issue you're facing seems to be quite unique, and I've seen more than once how various weird issues were fixed just by updating the BIOS. Provided the vendor actually did their own homework, fixed something, and released the fixes too... ;)

Thank you for the reply - I really appreciate that somebody found time to answer. Unfortunately for this investigation, I already upgraded the BIOS a few months ago. I just checked - there are no newer versions. I do see, however, that many people advise switching to the acpi_pm clocksource (and thus disabling the nohz option) when similar problems are experienced - I did not invent this workaround (I got the idea here: http://forum.proxmox.com/threads/5144-100-CPU-on-host-VM-hang-every-night?p=29143#post29143 ). Looks like an ancient bug. I even upgraded my qemu-kvm to version 0.13 without any significant change in this behavior.

It is really weird, though, how one guest can work fine while two start messing with each other. Shouldn't there be some kind of isolation between them? They both start to behave exactly the same way at exactly the same time. And it does not happen once a month or a year, but pretty frequently.

Thanks,
Dmitry
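For reference, the clocksource workaround discussed above is applied inside the guest; a minimal sketch (the runtime sysfs write is the standard interface, while the GRUB line is an assumption about the guest's bootloader and kernel image name):

# switch at runtime (lasts until the next reboot):
echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource

# make it permanent on the guest kernel command line, e.g. in GRUB:
#   kernel /vmlinuz-2.6.32 ro root=... clocksource=acpi_pm nohz=off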
Re: KVM with hugepages generate huge load with two guests
Thanks for the answer.

> Are you sure it is hugepages related?

Well, empirically it looked like either something hugepages-related or a regression between qemu-kvm 0.12.3 and 0.12.5, as this did not happen until I upgraded (needed to avoid disk corruption caused by a bug in 0.12.3) and enabled hugepages. However, since the frequency of the problem does seem related to how much memory each guest consumes (more memory = the problem appears faster), and since in the beginning the guests' memory consumption might not have hit some kind of threshold, maybe it is not really hugepages-related.

> Can you post kvm_stat output while slowness is happening? 'perf top' on
> the host? and on the guest?

OK, I will test this and write back.

Thanks,
Dmitry
Re: KVM with hugepages generate huge load with two guests
185212 340 243 9 1591 1212 10 14 76 1
 r  b swpd   free  buff  cache  si so   bi   bo    in   cs us sy id wa
 9  0 1228 223644 17016 185164   0  0 2872  857 18515 5134 45 20  9 26
 0  3 1228 224840 17016 185164   0  0 1080  786  8281 5490 35 12 21 33
 2  0 1228 222032 17016 185164   0  0 1184   99 21056 3713 26 17 48  9
 1  0 1228 221784 17016 185164   0  0 2075   69  3089 3749  9  7 73 11
 3  0 1228 220544 17016 185164   0  0 1501  150  3815 3520  7  8 73 12
 3  0 1228 219736 17024 185164   0  0 1129  103  7726 4177 20 11 60  9
 0  4 1228 217224 17024 185164   0  0 2844  211  6068 4643  9  7 60 23

Thanks,
Dmitry

On Thu, Nov 18, 2010 at 8:53 AM, Dmitry Golubev lastg...@gmail.com wrote:

Hi,

Sorry to bother you again. I have more info:

> 1. router with 32MB of RAM (hugepages) and 1 VCPU
> ...
> Is it too much to have 3 guests with hugepages?

OK, this router is also out of the equation - I disabled hugepages for it. That should also leave additional pages available to the other guests.

I think this should be pretty reproducible... Two identical 64-bit Linux 2.6.32 guests with 3500MB of virtual RAM and 4 VCPUs each, running on a Core2Quad (4 real cores) machine with 8GB of RAM and 3546 2MB hugepages, on a 64-bit Linux 2.6.35 host (libvirt 0.8.3) from Ubuntu Maverick. Still no swapping, and the effect is pretty much the same: one guest runs well; two guests work for some minutes, then slow down a few hundred times, showing huge load both inside (unlimited rapid growth of the load average) and outside (the host load does not make it unresponsive, though - but it is loaded to the max). The load growth on the host is instant and finite (the change in the 'r' column indicates this sudden rise):

# vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b swpd   free  buff cache  si so  bi  bo    in    cs us sy id wa
 1  3    0 194220 30680 76712   0  0 319  28  2633  1960  6  6 67 20
 1  2    0 193776 30680 76712   0  0   4 231 55081 78491  3 39 17 41
10  1    0 185508 30680 76712   0  0   4  87 53042 34212 55 27  9  9
12  0    0 185180 30680 76712   0  0   2  95 41007 21990 84 16  0  0

Thanks,
Dmitry

On Wed, Nov 17, 2010 at 4:19 AM, Dmitry Golubev lastg...@gmail.com wrote:

Hi,

Maybe you remember that I wrote a few weeks ago about a KVM CPU load problem with hugepages. The problem was left hanging; however, I now have some new information. The description remains the same, but I have decreased both the guest memory and the number of hugepages:

RAM = 8GB, hugepages = 3546

Total of 2 virtual machines:
1. router with 32MB of RAM (hugepages) and 1 VCPU
2. linux guest with 3500MB of RAM (hugepages) and 4 VCPUs

Everything works fine until I start the second Linux guest, with the same 3500MB of guest RAM, also in hugepages, and also 4 VCPUs. The rest of the description is the same as before: after a while the host shows a load average of about 8 (on a Core2Quad), and it seems that both big guests consume exactly the same amount of resources. The host seems responsive, though. Inside the guests, however, things are not so good - the load skyrockets to at least 20. The guests are not responsive, and even a 'ps' executes inappropriately slowly (it may take a few minutes - here, unlike on the host, which shows the jump in resource consumption instantly, the load builds up and the machine seems to become slower over time). It also seems that the more memory the guests use, the faster the problem appears. Still, at least a gig of RAM is free in each guest, and there is no swap activity inside the guests.

The most important thing - the reason I went back and quoted an older message rather than the last one - is that there is no more swap activity on the host, so the previous track of thought may also be wrong, and I have returned to the beginning. There is plenty of RAM now, and swap on the host is always at 0 as seen in 'top'. And there is 100% CPU load, equally shared between the two large guests. To stop the load, I can destroy either large guest. Additionally, I have just discovered that suspending either large guest works as well. Moreover, after a resume, the load does not come back for a while. Both methods stop the high load instantly (in under a second).

As you were asking for a 'top' inside the guest, here it is:

top - 03:27:27 up 42 min,  1 user,  load average: 18.37, 7.68, 3.12
Tasks: 197 total,  23 running, 174 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us, 89.2%sy,  0.0%ni, 10.5%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   3510912k total,  1159760k used,  2351152k free,    62568k buffers
Swap:  4194296k total,        0k used,  4194296k free,   484492k cached

  PID USER PR NI VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
12303 root 20  0    0    0    0 R  100  0.0 0:33.72  vpsnetclean
11772 99   20  0 149m  11m 2104 R   82  0.3 0:15.10  httpd
10906 99   20  0 149m  11m 2124 R   73  0.3 0:11.52  httpd
10247 99   20  0 149m  11m
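The suspend/resume observation above is easy to exercise from the host with libvirt; a sketch, with 'bigguest1' standing in for whichever large guest (the domain name is hypothetical):

virsh suspend bigguest1   # pauses the qemu-kvm process; per the report, host load drops at once
sleep 10
virsh resume bigguest1    # the load reportedly stays away for a while after this

# the heavier variant described earlier in the thread:
virsh destroy bigguest1   # hard-stops the guest; the other guest recovers instantly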
Re: KVM with hugepages generate huge load with two guests
Hi,

Sorry to bother you again. I have more info:

> 1. router with 32MB of RAM (hugepages) and 1 VCPU
> ...
> Is it too much to have 3 guests with hugepages?

OK, this router is also out of the equation - I disabled hugepages for it. That should also leave additional pages available to the other guests.

I think this should be pretty reproducible... Two identical 64-bit Linux 2.6.32 guests with 3500MB of virtual RAM and 4 VCPUs each, running on a Core2Quad (4 real cores) machine with 8GB of RAM and 3546 2MB hugepages, on a 64-bit Linux 2.6.35 host (libvirt 0.8.3) from Ubuntu Maverick. Still no swapping, and the effect is pretty much the same: one guest runs well; two guests work for some minutes, then slow down a few hundred times, showing huge load both inside (unlimited rapid growth of the load average) and outside (the host load does not make it unresponsive, though - but it is loaded to the max). The load growth on the host is instant and finite (the change in the 'r' column indicates this sudden rise):

# vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b swpd   free  buff cache  si so  bi  bo    in    cs us sy id wa
 1  3    0 194220 30680 76712   0  0 319  28  2633  1960  6  6 67 20
 1  2    0 193776 30680 76712   0  0   4 231 55081 78491  3 39 17 41
10  1    0 185508 30680 76712   0  0   4  87 53042 34212 55 27  9  9
12  0    0 185180 30680 76712   0  0   2  95 41007 21990 84 16  0  0

Thanks,
Dmitry

On Wed, Nov 17, 2010 at 4:19 AM, Dmitry Golubev lastg...@gmail.com wrote:

Hi,

Maybe you remember that I wrote a few weeks ago about a KVM CPU load problem with hugepages. The problem was left hanging; however, I now have some new information. The description remains the same, but I have decreased both the guest memory and the number of hugepages:

RAM = 8GB, hugepages = 3546

Total of 2 virtual machines:
1. router with 32MB of RAM (hugepages) and 1 VCPU
2. linux guest with 3500MB of RAM (hugepages) and 4 VCPUs

Everything works fine until I start the second Linux guest, with the same 3500MB of guest RAM, also in hugepages, and also 4 VCPUs. The rest of the description is the same as before: after a while the host shows a load average of about 8 (on a Core2Quad), and it seems that both big guests consume exactly the same amount of resources. The host seems responsive, though. Inside the guests, however, things are not so good - the load skyrockets to at least 20. The guests are not responsive, and even a 'ps' executes inappropriately slowly (it may take a few minutes - here, unlike on the host, which shows the jump in resource consumption instantly, the load builds up and the machine seems to become slower over time). It also seems that the more memory the guests use, the faster the problem appears. Still, at least a gig of RAM is free in each guest, and there is no swap activity inside the guests.

The most important thing - the reason I went back and quoted an older message rather than the last one - is that there is no more swap activity on the host, so the previous track of thought may also be wrong, and I have returned to the beginning. There is plenty of RAM now, and swap on the host is always at 0 as seen in 'top'. And there is 100% CPU load, equally shared between the two large guests. To stop the load, I can destroy either large guest. Additionally, I have just discovered that suspending either large guest works as well. Moreover, after a resume, the load does not come back for a while. Both methods stop the high load instantly (in under a second).

As you were asking for a 'top' inside the guest, here it is:

top - 03:27:27 up 42 min,  1 user,  load average: 18.37, 7.68, 3.12
Tasks: 197 total,  23 running, 174 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us, 89.2%sy,  0.0%ni, 10.5%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   3510912k total,  1159760k used,  2351152k free,    62568k buffers
Swap:  4194296k total,        0k used,  4194296k free,   484492k cached

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
12303 root  20  0     0    0    0 R  100  0.0 0:33.72  vpsnetclean
11772 99    20  0  149m  11m 2104 R   82  0.3 0:15.10  httpd
10906 99    20  0  149m  11m 2124 R   73  0.3 0:11.52  httpd
10247 99    20  0  149m  11m 2128 R   31  0.3 0:05.39  httpd
 3916 root  20  0 86468  11m 1476 R   16  0.3 0:15.14  cpsrvd-ssl
10919 99    20  0  149m  11m 2124 R    8  0.3 0:03.43  httpd
11296 99    20  0  149m  11m 2112 R    7  0.3 0:03.26  httpd
12265 99    20  0  149m  11m 2088 R    7  0.3 0:08.01  httpd
12317 root  20  0 99.6m 1384  716 R    7  0.0 0:06.57  crond
12326 503   20  0  8872   96   72 R    7  0.0 0:01.13  php
 3634 root  20  0 74804 1176  596 R    6  0.0 0:12.15  crond
11864 32005 20  0 87224  13m 2528 R    6  0.4 0:30.84  cpsrvd-ssl
12275 root  20  0 30628 9976 1364 R    6  0.3 0:24.68  cpgs_chk
11305 99    20  0  149m  11m 2104 R    6  0.3 0:02.53
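For reference, the host-side hugepage setup being described (3546 pages of 2MB) is normally done along these lines; a sketch - the mount point is an assumption, and the reservation is best made early after boot, before host memory fragments:

echo 3546 > /proc/sys/vm/nr_hugepages        # reserve the 2MB pages on the host
mkdir -p /dev/hugepages
mount -t hugetlbfs hugetlbfs /dev/hugepages  # qemu-kvm allocates guest RAM from here
grep Huge /proc/meminfo                      # shows HugePages_Total/Free/Rsvd, the
                                             # counters cited later in this thread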
Re: KVM with hugepages generate huge load with two guests
25 days, 23:38,  2 users,  load average: 8.50, 5.07, 10.39
Tasks: 133 total,   1 running, 132 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.1%us,  0.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   8193472k total,  8071776k used,   121696k free,    45296k buffers
Swap: 11716412k total,        0k used, 11714844k free,   197236k cached

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 8426 libvirt- 20  0 3771m  27m 3904 S  199  0.3  10:28.33 kvm
 8374 libvirt- 20  0 3815m  32m 3908 S  199  0.4   8:11.53 kvm
 1557 libvirt- 20  0  225m 7720 2092 S    1  0.1 436:54.45 kvm
   72 root     20  0     0    0    0 S    0  0.0   6:22.54 kondemand/3
  379 root     20  0     0    0    0 S    0  0.0  58:20.99 md3_raid5
    1 root     20  0 23768 1944 1228 S    0  0.0   0:00.95 init
    2 root     20  0     0    0    0 S    0  0.0   0:00.24 kthreadd
    3 root     20  0     0    0    0 S    0  0.0   0:12.66 ksoftirqd/0
    4 root     RT  0     0    0    0 S    0  0.0   0:07.58 migration/0
    5 root     RT  0     0    0    0 S    0  0.0   0:00.00 watchdog/0
    6 root     RT  0     0    0    0 S    0  0.0   0:15.05 migration/1
    7 root     20  0     0    0    0 S    0  0.0   0:19.64 ksoftirqd/1
    8 root     RT  0     0    0    0 S    0  0.0   0:00.00 watchdog/1
    9 root     RT  0     0    0    0 S    0  0.0   0:07.21 migration/2
   10 root     20  0     0    0    0 S    0  0.0   0:41.74 ksoftirqd/2
   11 root     RT  0     0    0    0 S    0  0.0   0:00.00 watchdog/2
   12 root     RT  0     0    0    0 S    0  0.0   0:13.62 migration/3
   13 root     20  0     0    0    0 S    0  0.0   0:24.63 ksoftirqd/3
   14 root     RT  0     0    0    0 S    0  0.0   0:00.00 watchdog/3
   15 root     20  0     0    0    0 S    0  0.0   1:17.11 events/0
   16 root     20  0     0    0    0 S    0  0.0   1:33.30 events/1
   17 root     20  0     0    0    0 S    0  0.0   4:15.28 events/2
   18 root     20  0     0    0    0 S    0  0.0   1:13.49 events/3
   19 root     20  0     0    0    0 S    0  0.0   0:00.00 cpuset
   20 root     20  0     0    0    0 S    0  0.0   0:00.02 khelper
   21 root     20  0     0    0    0 S    0  0.0   0:00.00 netns
   22 root     20  0     0    0    0 S    0  0.0   0:00.00 async/mgr
   23 root     20  0     0    0    0 S    0  0.0   0:00.00 pm
   25 root     20  0     0    0    0 S    0  0.0   0:02.47 sync_supers
   26 root     20  0     0    0    0 S    0  0.0   0:03.86 bdi-default

Please help...

Thanks,
Dmitry

On Sat, Oct 2, 2010 at 1:30 AM, Marcelo Tosatti mtosa...@redhat.com wrote:

On Thu, Sep 30, 2010 at 12:07:15PM +0300, Dmitry Golubev wrote:

Hi,

I am not sure what's really happening, but every few hours (unpredictably) two virtual machines (Linux 2.6.32) start to generate huge CPU loads. It looks like some kind of loop is unable to complete or something... So the idea is:

1. I have two Linux 2.6.32 x64 guests (OpenVZ, Proxmox project) running on a Linux 2.6.35 x64 host (Ubuntu Maverick) with a Q6600 Core2Quad, on qemu-kvm 0.12.5 and libvirt 0.8.3, plus another small 32-bit Linux virtual machine (16MB of RAM) with a router inside (I doubt it contributes to the problem).

2. All these machines use hugetlbfs. The server has 8GB of RAM; I reserved 3696 huge pages (page size is 2MB) on the server, and I am running the main guests with 3550MB of virtual memory each. The third guest, as I wrote before, takes 16MB of virtual memory.

3. Once run, the guests reserve huge pages for themselves normally. As mem-prealloc is the default, they grab all the memory they should have, leaving 6 pages unreserved (HugePages_Free - HugePages_Rsvd = 6) at all times - so, as I understand it, they should not want to get any more, right?

4. All the virtual machines run perfectly normally, without any disturbances, for a few hours. They do not, however, use all their memory, so maybe the issue arises when they pass some kind of threshold.

5. At some point in time both guests exhibit CPU load over the top (16-24). At the same time, the host works perfectly well, showing a load of 8, with both kvm processes using the CPU equally and fully. This point in time is unpredictable - it can be anything from one to twenty hours, but it will be less than a day. Sometimes the load disappears in a moment, but usually it stays like that, and everything works extremely slowly (even a 'ps' command takes some 2-5 minutes to execute).

6. If I am patient, I can start rebooting the guest systems - once they have restarted, everything returns to normal. If I destroy one of the guests (virsh destroy), the other one starts working normally at once (!).

I am relatively new to KVM and I am absolutely lost here. I have not experienced such problems before, but recently I upgraded from Ubuntu Lucid (I think it was Linux 2.6.32, qemu-kvm 0.12.3 and libvirt 0.7.5) and started to use hugepages
Re: KVM with hugepages generate huge load with two guests
> Please don't top post.

Sorry.

> Please use 'top' to find out which processes are busy; the aggregate
> statistics don't help to find out what the problem is.

The thing is - all the more or less active processes become busy, like httpd, etc. - I can't identify any single process that generates all the load. I see at least 10 different processes in the list that look busy in each guest... From what I can see, there is nothing out of the ordinary in the guest 'top', except that the whole guest becomes extremely slow. But OK, I will try to reproduce the problem a few hours later and send you the whole 'top' output if it is required.

Thanks,
Dmitry
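A batch-mode capture is the easiest way to get the complete 'top' output onto the list rather than a single screenful; a small sketch:

# inside the guest, while the load is high:
top -b -n 1 > /tmp/guest-top.txt    # full process list, no curses UI
ps auxww --sort=-%cpu | head -30    # alternative: the top 30 CPU consumers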
Re: KVM with hugepages generate huge load with two guests
1648 99 1 0 0
 r  b swpd  free  buff cache  si so  bi  bo   in   cs  us sy id wa
 8  0 7528 58992  7096 31920   0  0   0  45 2536 1745 100  0  0  0
 9  1 7528 59024  7096 31920   0  0   1  38 2548 1635  99  1  0  0
 8  0 7528 58768  7096 31920   0  0   0  44 2496 1741 100  0  0  0
 8  0 7528 59388  7096 31920   0  0   0  18 2429 1617 100  0  0  0
 8  0 7528 58868  7096 31920   0  0   0  51 2600 1745 100  0  0  0
 8  0 7528 59156  7104 31920   0  0   0  47 2441 1682 100  0  0  0
 9  0 7528 57380  7104 31920   0  0   0  40 2709 1690 100  1  0  0
 8  0 7528 58056  7104 31920   0  0   1   8 3127 1629 100  0  0  0
 9  0 7528 57544  7104 31920   0  0   0  36 2615 1704 100  0  0  0
 8  0 7528 57196  7104 31920   0  0   0  26 2530 1710 100  0  0  0
 8  0 7528 59792  6836 28648   0  0   0  42 2613 1761 100  0  0  0
 8  0 7528 59156  6844 29036   0  0  78  64 2641 1757 100  0  0  0
 8  0 7528 59576  6844 29164   0  0  26  26 2462 1632 100  0  0  0
 8  0 7528 59716  6844 29164   0  0   0   8 2414 1706 100  0  0  0
 8  0 7528 59600  6844 29164   0  0   0  48 2505 1649 100  0  0  0
 8  0 7528 59600  6844 29164   0  0   0   9 2373 1648  99  1  0  0
 8  0 7528 59492  6844 29164   0  0   0  10 2387 1564 100  1  0  0
 8  0 7528 59624  6844 29164   0  0   0  40 2551 1691 100  0  0  0
 8  0 7528 59080  6844 29164   0  0   0  50 2733 1643 100  0  0  0
 8  0 7528 58956  6844 29164   0  0   0  29 2823 1652 100  0  0  0
 8  1 7528 58624  6844 29164   0  0   0  34 2478 1633 100  0  0  0
 8  0 7528 58716  6844 29164   0  0   0  39 2398 1688  99  1  0  0
 8  0 7528 57592  6844 30228   0  0 212  30 2373 1666 100  1  0  0
 8  0 7528 57468  6852 30228   0  0   0  51 2453 1695 100  0  0  0
 8  0 7528 58244  6852 30228   0  0   0  26 2756 1617  99  1  0  0
 8  0 7528 58244  6852 30228   0  0   0 112 3872 1952  99  1  0  0
 9  1 7528 58320  6852 30228   0  0   0  48 2718 1719 100  0  0  0
 8  0 7528 58204  6852 30228   0  0   0  17 2692 1697 100  0  0  0
 8  0 7528 59220  6852 30228   0  0   5  48 4666 1651  98  2  0  0
 9  0 7528 57716  6852 30228   0  0   0 101 5128 1874  98  2  0  0
 9  0 7528 55692  6860 30228   0  0   5 100 5875 1825  97  3  0  0
 9  1 7528 55668  6860 30228   0  0   0 156 3910 1960  99  1  0  0
 8  0 7528 55668  6860 30228   0  0   0  38 2578 1671 100  0  0  0
 8  0 7528 55600  6860 30228   0  0   2  81 2783 1888 100  0  0  0
 9  1 7528 59660  5188 28320   0  0   0  50 2601 1918 100  0  0  0
 8  0 7528 63280  5196 28328   0  0   2  63 4347 1855  99  1  0  0
 8  0 7528 62560  5196 28328   0  0   0 101 3383 1748  99  1  0  0
 9  0 7528 62132  5196 28328   0  0   1  50 2656 1724 100  0  0  0

One guest showed this:

top - 23:11:35 up 53 min,  1 user,  load average: 20.42, 14.61, 7.26
Tasks: 205 total,  40 running, 165 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.2%us, 99.8%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3510920k total,   972876k used,  2538044k free,    30256k buffers
Swap:  4194296k total,        0k used,  4194296k free,   321288k cached

The other one:

top - 23:11:12 up 1 day, 9:19,  1 user,  load average: 19.38, 14.54, 7.40
Tasks: 219 total,  15 running, 204 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.2%us, 99.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   3510920k total,  1758688k used,  1752232k free,   298000k buffers
Swap:  4194296k total,        0k used,  4194296k free,   577068k cached

Thanks,
Dmitry

On Sun, Oct 3, 2010 at 12:28 PM, Avi Kivity a...@redhat.com wrote:

On 09/30/2010 11:07 AM, Dmitry Golubev wrote:

> Hi, I am not sure what's really happening, but every few hours
> (unpredictably) two virtual machines (Linux 2.6.32) start to generate
> huge CPU loads. It looks like some kind of loop is unable to complete
> or something...

What does 'top' inside the guest show when this is happening?

--
error compiling committee.c: too many arguments to function
Re: KVM with hugepages generate huge load with two guests
Hi,

Thanks for the reply. Well, although there is plenty of RAM left (about 100MB), some swap space was used during operation:

Mem:   8193472k total,  8089788k used,   103684k free,     5768k buffers
Swap: 11716412k total,    36636k used, 11679776k free,   103112k cached

I am not sure why, though. Are you saying that there are bursts of memory usage that push some pages to swap, and that those pages are not unswapped even though they are in use? I will try to replicate the problem now and send you a better printout from the moment the problem happens. I have not noticed anything unusual when I was watching the system - there was plenty of RAM free and a few megabytes in swap... Is there any kind of check I can try while the problem is occurring? Or should I free 50-100MB from hugepages so the system will be stable again?

Thanks,
Dmitry

On Sat, Oct 2, 2010 at 1:30 AM, Marcelo Tosatti mtosa...@redhat.com wrote:

On Thu, Sep 30, 2010 at 12:07:15PM +0300, Dmitry Golubev wrote:

Hi,

I am not sure what's really happening, but every few hours (unpredictably) two virtual machines (Linux 2.6.32) start to generate huge CPU loads. It looks like some kind of loop is unable to complete or something... So the idea is:

1. I have two Linux 2.6.32 x64 guests (OpenVZ, Proxmox project) running on a Linux 2.6.35 x64 host (Ubuntu Maverick) with a Q6600 Core2Quad, on qemu-kvm 0.12.5 and libvirt 0.8.3, plus another small 32-bit Linux virtual machine (16MB of RAM) with a router inside (I doubt it contributes to the problem).

2. All these machines use hugetlbfs. The server has 8GB of RAM; I reserved 3696 huge pages (page size is 2MB) on the server, and I am running the main guests with 3550MB of virtual memory each. The third guest, as I wrote before, takes 16MB of virtual memory.

3. Once run, the guests reserve huge pages for themselves normally. As mem-prealloc is the default, they grab all the memory they should have, leaving 6 pages unreserved (HugePages_Free - HugePages_Rsvd = 6) at all times - so, as I understand it, they should not want to get any more, right?

4. All the virtual machines run perfectly normally, without any disturbances, for a few hours. They do not, however, use all their memory, so maybe the issue arises when they pass some kind of threshold.

5. At some point in time both guests exhibit CPU load over the top (16-24). At the same time, the host works perfectly well, showing a load of 8, with both kvm processes using the CPU equally and fully. This point in time is unpredictable - it can be anything from one to twenty hours, but it will be less than a day. Sometimes the load disappears in a moment, but usually it stays like that, and everything works extremely slowly (even a 'ps' command takes some 2-5 minutes to execute).

6. If I am patient, I can start rebooting the guest systems - once they have restarted, everything returns to normal. If I destroy one of the guests (virsh destroy), the other one starts working normally at once (!).

I am relatively new to KVM and I am absolutely lost here. I have not experienced such problems before, but recently I upgraded from Ubuntu Lucid (I think it was Linux 2.6.32, qemu-kvm 0.12.3 and libvirt 0.7.5) and started to use hugepages. These two virtual machines are not normally run on the same host system (I have a corosync/pacemaker cluster with DRBD storage), but when one of the hosts is not available, they end up running on the same host. That is the reason I have not noticed this earlier. Unfortunately, I don't have any spare hardware to experiment with, and this is a production system, so my debugging options are rather limited.

Do you have any ideas what could be wrong?

Is there swapping activity on the host when this happens?
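To answer that question empirically, watch the si/so columns on the host (pages swapped in/out per second); a sketch - the VmSwap field needs a 2.6.34+ kernel, and 'kvm' as the process name matches the 'top' output shown in this thread:

vmstat 5                            # non-zero si/so while loaded = host is swapping
for p in $(pidof kvm); do
    echo "pid $p:"
    grep VmSwap /proc/$p/status     # per-process swap usage
done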
Re: KVM with hugepages generate huge load with two guests
OK, I have repeated the problem. The two machines were working fine for a few hours with some services not running (these would take up about a gigabyte more in total). I started those services again, and some 40 minutes later the problem reappeared (it may be a coincidence, though I don't think so). In the 'top' output it looks like this:

top - 03:38:10 up 2 days, 20:08,  1 user,  load average: 9.60, 6.92, 5.36
Tasks: 143 total,   3 running, 140 sleeping,   0 stopped,   0 zombie
Cpu(s): 85.7%us,  4.2%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi, 10.0%si,  0.0%st
Mem:   8193472k total,  8056700k used,   136772k free,     4912k buffers
Swap: 11716412k total,    64884k used, 11651528k free,    55640k cached

  PID USER     PR NI  VIRT RES  SHR S %CPU %MEM     TIME+ COMMAND
21306 libvirt- 20  0 3781m 10m 2408 S  190  0.1  31:36.09 kvm
 4984 libvirt- 20  0 3771m 19m 1440 S  180  0.2 390:30.04 kvm

Compare this to the previous shot I sent (taken a few hours earlier) and, in my opinion, you will not see much difference. Note that I have 8GB of RAM and the two VMs together take up 7GB. There is nothing else running on the server except the VMs and the cluster software (drbd, pacemaker, etc.). Right now the drbd sync process is taking some CPU resources - that is why the libvirt processes do not show as 200% (physically, it is a quad-core processor). Is almost 1GB really not enough for KVM to support two 3.5GB guests? I see 136MB of free memory right now - it is not even used...

Thanks,
Dmitry

On Sat, Oct 2, 2010 at 2:50 AM, Dmitry Golubev lastg...@gmail.com wrote:

Hi,

Thanks for the reply. Well, although there is plenty of RAM left (about 100MB), some swap space was used during operation:

Mem:   8193472k total,  8089788k used,   103684k free,     5768k buffers
Swap: 11716412k total,    36636k used, 11679776k free,   103112k cached

I am not sure why, though. Are you saying that there are bursts of memory usage that push some pages to swap, and that those pages are not unswapped even though they are in use? I will try to replicate the problem now and send you a better printout from the moment the problem happens. I have not noticed anything unusual when I was watching the system - there was plenty of RAM free and a few megabytes in swap... Is there any kind of check I can try while the problem is occurring? Or should I free 50-100MB from hugepages so the system will be stable again?

Thanks,
Dmitry

On Sat, Oct 2, 2010 at 1:30 AM, Marcelo Tosatti mtosa...@redhat.com wrote:

On Thu, Sep 30, 2010 at 12:07:15PM +0300, Dmitry Golubev wrote:

Hi,

I am not sure what's really happening, but every few hours (unpredictably) two virtual machines (Linux 2.6.32) start to generate huge CPU loads. It looks like some kind of loop is unable to complete or something... So the idea is:

1. I have two Linux 2.6.32 x64 guests (OpenVZ, Proxmox project) running on a Linux 2.6.35 x64 host (Ubuntu Maverick) with a Q6600 Core2Quad, on qemu-kvm 0.12.5 and libvirt 0.8.3, plus another small 32-bit Linux virtual machine (16MB of RAM) with a router inside (I doubt it contributes to the problem).

2. All these machines use hugetlbfs. The server has 8GB of RAM; I reserved 3696 huge pages (page size is 2MB) on the server, and I am running the main guests with 3550MB of virtual memory each. The third guest, as I wrote before, takes 16MB of virtual memory.

3. Once run, the guests reserve huge pages for themselves normally. As mem-prealloc is the default, they grab all the memory they should have, leaving 6 pages unreserved (HugePages_Free - HugePages_Rsvd = 6) at all times - so, as I understand it, they should not want to get any more, right?

4. All the virtual machines run perfectly normally, without any disturbances, for a few hours. They do not, however, use all their memory, so maybe the issue arises when they pass some kind of threshold.

5. At some point in time both guests exhibit CPU load over the top (16-24). At the same time, the host works perfectly well, showing a load of 8, with both kvm processes using the CPU equally and fully. This point in time is unpredictable - it can be anything from one to twenty hours, but it will be less than a day. Sometimes the load disappears in a moment, but usually it stays like that, and everything works extremely slowly (even a 'ps' command takes some 2-5 minutes to execute).

6. If I am patient, I can start rebooting the guest systems - once they have restarted, everything returns to normal. If I destroy one of the guests (virsh destroy), the other one starts working normally at once (!).

I am relatively new to KVM and I am absolutely lost here. I have not experienced such problems before, but recently I upgraded from Ubuntu Lucid (I think it was Linux 2.6.32, qemu-kvm 0.12.3 and libvirt 0.7.5) and started to use hugepages. These two virtual machines are not normally run on the same host system (I have a corosync/pacemaker cluster with DRBD storage), but when one of the hosts is not available
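For what it's worth, the memory arithmetic behind the 'is almost 1GB enough' question, using the figures from this thread (the split of the remainder is an assumption - each qemu-kvm process also needs ordinary, non-hugepage memory for its own heap, device emulation and page tables):

3696 huge pages x 2 MB        = 7392 MB pinned for hugetlbfs
2 guests x 3550 MB guest RAM  = 7100 MB of the pool actually handed to guests
8192 MB total - 7392 MB       =  800 MB left for the host kernel, two qemu-kvm
                                 processes, DRBD, pacemaker and the page cache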
KVM with hugepages generate huge load with two guests
Hi,

I am not sure what's really happening, but every few hours (unpredictably) two virtual machines (Linux 2.6.32) start to generate huge CPU loads. It looks like some kind of loop is unable to complete or something... So the idea is:

1. I have two Linux 2.6.32 x64 guests (OpenVZ, Proxmox project) running on a Linux 2.6.35 x64 host (Ubuntu Maverick) with a Q6600 Core2Quad, on qemu-kvm 0.12.5 and libvirt 0.8.3, plus another small 32-bit Linux virtual machine (16MB of RAM) with a router inside (I doubt it contributes to the problem).

2. All these machines use hugetlbfs. The server has 8GB of RAM; I reserved 3696 huge pages (page size is 2MB) on the server, and I am running the main guests with 3550MB of virtual memory each. The third guest, as I wrote before, takes 16MB of virtual memory.

3. Once run, the guests reserve huge pages for themselves normally. As mem-prealloc is the default, they grab all the memory they should have, leaving 6 pages unreserved (HugePages_Free - HugePages_Rsvd = 6) at all times - so, as I understand it, they should not want to get any more, right?

4. All the virtual machines run perfectly normally, without any disturbances, for a few hours. They do not, however, use all their memory, so maybe the issue arises when they pass some kind of threshold.

5. At some point in time both guests exhibit CPU load over the top (16-24). At the same time, the host works perfectly well, showing a load of 8, with both kvm processes using the CPU equally and fully. This point in time is unpredictable - it can be anything from one to twenty hours, but it will be less than a day. Sometimes the load disappears in a moment, but usually it stays like that, and everything works extremely slowly (even a 'ps' command takes some 2-5 minutes to execute).

6. If I am patient, I can start rebooting the guest systems - once they have restarted, everything returns to normal. If I destroy one of the guests (virsh destroy), the other one starts working normally at once (!).

I am relatively new to KVM and I am absolutely lost here. I have not experienced such problems before, but recently I upgraded from Ubuntu Lucid (I think it was Linux 2.6.32, qemu-kvm 0.12.3 and libvirt 0.7.5) and started to use hugepages. These two virtual machines are not normally run on the same host system (I have a corosync/pacemaker cluster with DRBD storage), but when one of the hosts is not available, they end up running on the same host. That is the reason I have not noticed this earlier. Unfortunately, I don't have any spare hardware to experiment with, and this is a production system, so my debugging options are rather limited.

Do you have any ideas what could be wrong?

Thanks,
Dmitry
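For completeness, underneath libvirt the setup described above boils down to a qemu-kvm invocation along these lines (a sketch only - the hugetlbfs mount point is an assumption, and a real invocation adds disk, network and display options):

# guest RAM comes from the hugetlbfs mount; with -mem-path given,
# qemu-kvm 0.12.x preallocates it (the mem-prealloc default mentioned above)
qemu-kvm -m 3550 -smp 4 -mem-path /dev/hugepages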