[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
This bug was fixed in the package linux - 3.13.0-51.84

---
linux (3.13.0-51.84) trusty; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1444141

  * Merged back Ubuntu-3.13.0-49.83 security release

linux (3.13.0-50.82) trusty; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1442285

  [ Andy Whitcroft ]

  * [Config] CONFIG_DEFAULT_MMAP_MIN_ADDR needs to match on armhf and arm64
    - LP: #1418140

  [ Chris J Arges ]

  * [Config] CONFIG_PCIEASPM_DEBUG=y
    - LP: #1398544

  [ Upstream Kernel Changes ]

  * KEYS: request_key() should reget expired keys rather than give EKEYEXPIRED
    - LP: #1124250
  * audit: correctly record file names with different path name types
    - LP: #1439441
  * KVM: x86: Check for nested events if there is an injectable interrupt
    - LP: #1413540
  * be2iscsi: fix memory leak in error path
    - LP: #1440156
  * block: remove old blk_iopoll_enabled variable
    - LP: #1440156
  * be2iscsi: Fix handling timed out MBX completion from FW
    - LP: #1440156
  * be2iscsi: Fix doorbell format for EQ/CQ/RQ as per SLI spec.
    - LP: #1440156
  * be2iscsi: Fix the session cleanup when reboot/shutdown happens
    - LP: #1440156
  * be2iscsi: Fix scsi_cmnd leakage in driver.
    - LP: #1440156
  * be2iscsi : Fix DMA Out of SW-IOMMU space error
    - LP: #1440156
  * be2iscsi: Fix retrieving MCCQ_WRB in non-embedded Mbox path
    - LP: #1440156
  * be2iscsi: Fix exposing Host in sysfs after adapter initialization is complete
    - LP: #1440156
  * be2iscsi: Fix interrupt Coalescing mechanism.
    - LP: #1440156
  * be2iscsi: Fix TCP parameters while connection offloading.
    - LP: #1440156
  * be2iscsi: Fix memory corruption in MBX path
    - LP: #1440156
  * be2iscsi: Fix destroy MCC-CQ before MCC-EQ is destroyed
    - LP: #1440156
  * be2iscsi: add an missing goto in error path
    - LP: #1440156
  * be2iscsi: remove potential junk pointer free
    - LP: #1440156
  * be2iscsi: Fix memory leak in mgmt_set_ip()
    - LP: #1440156
  * be2iscsi: Fix the sparse warning introduced in previous submission
    - LP: #1440156
  * be2iscsi: Fix updating the boot enteries in sysfs
    - LP: #1440156
  * be2iscsi: Fix processing CQE before connection resources are freed
    - LP: #1440156
  * be2iscsi : Fix kernel panic during reboot/shutdown
    - LP: #1440156
  * fixed invalid assignment of 64bit mask to host dma_boundary for scatter gather segment boundary limit.
    - LP: #1440156
  * quota: Store maximum space limit in bytes
    - LP: #1441284
  * ip: zero sockaddr returned on error queue
    - LP: #1441284
  * net: rps: fix cpu unplug
    - LP: #1441284
  * ipv6: stop sending PTB packets for MTU < 1280
    - LP: #1441284
  * netxen: fix netxen_nic_poll() logic
    - LP: #1441284
  * udp_diag: Fix socket skipping within chain
    - LP: #1441284
  * ping: Fix race in free in receive path
    - LP: #1441284
  * bnx2x: fix napi poll return value for repoll
    - LP: #1441284
  * net: don't OOPS on socket aio
    - LP: #1441284
  * bridge: dont send notification when skb->len == 0 in rtnl_bridge_notify
    - LP: #1441284
  * ipv4: tcp: get rid of ugly unicast_sock
    - LP: #1441284
  * ppp: deflate: never return len larger than output buffer
    - LP: #1441284
  * net: sctp: fix passing wrong parameter header to param_type2af in sctp_process_param
    - LP: #1441284
  * ARM: pxa: add regulator_has_full_constraints to corgi board file
    - LP: #1441284
  * ARM: pxa: add regulator_has_full_constraints to poodle board file
    - LP: #1441284
  * ARM: pxa: add regulator_has_full_constraints to spitz board file
    - LP: #1441284
  * hx4700: regulator: declare full constraints
    - LP: #1441284
  * HID: input: fix confusion on conflicting mappings
    - LP: #1441284
  * HID: fixup the conflicting keyboard mappings quirk
    - LP: #1441284
  * megaraid_sas: disable interrupt_mask before enabling hardware interrupts
    - LP: #1441284
  * PCI: Generate uppercase hex for modalias var in uevent
    - LP: #1441284
  * usb: core: buffer: smallest buffer should start at ARCH_DMA_MINALIGN
    - LP: #1441284
  * tty/serial: at91: enable peripheral clock before accessing I/O registers
    - LP: #1441284
  * tty/serial: at91: fix error handling in atmel_serial_probe()
    - LP: #1441284
  * axonram: Fix bug in direct_access
    - LP: #1441284
  * ksoftirqd: Enable IRQs and call cond_resched() before poking RCU
    - LP: #1441284
  * TPM: Add new TPMs to the tail of the list to prevent inadvertent change of dev
    - LP: #1441284
  * char: tpm: Add missing error check for devm_kzalloc
    - LP: #1441284
  * tpm_tis: verify interrupt during init
    - LP: #1441284
  * tpm: Fix NULL return in tpm_ibmvtpm_get_desired_dma
    - LP: #1441284
  * tpm/tpm_i2c_stm_st33: Fix potential bug in tpm_stm_i2c_send
    - LP: #1441284
  * tpm/tpm_i2c_stm_st33: Add status check when reading data on the FIFO
    - LP: #1441284
  * mmc: sdhci-pxa
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
My deployment is still running strong after over 36 hours. No crashes. I will leave it running for a few more days to see whether it recurs, and will report back. @arges, thanks for this fix!
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
@baco-1

1) What kind of hardware are you running on L0? (Running 'ubuntu-bug linux' and filing a bug would collect the necessary info.)
2) What kind of load are you seeing in L0 and L1?
3) Can you give me the output of 'tail /sys/module/kvm_intel/parameters/*'?
4) You could set up crashdump to dump on a hang (if we think it's the right one), or just get a full backtrace on a soft lockup by adding the following to the kernel cmdline: softlockup_all_cpu_backtrace=1

Having a single vCPU could either be reducing the load or avoiding a race; it would be hard to tell without a proper backtrace of the hang itself. This seems like a pretty simple test case; I will put it on my list of things to try to reproduce.
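For convenience, items 1) through 4) can be gathered roughly as follows; a sketch, using only the paths and flags named above:

    # on L0: file a bug with full hardware info attached (item 1)
    ubuntu-bug linux

    # dump the kvm_intel module parameters (item 3)
    tail /sys/module/kvm_intel/parameters/*

    # item 4: get backtraces from all CPUs on the next soft lockup;
    # append this to the kernel command line (e.g. in /etc/default/grub,
    # then run update-grub) and reboot:
    #   softlockup_all_cpu_backtrace=1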
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
@arges For me it's at least partly related... If I don't update the kernel to proposed-updates, I get the following messages; if I use one vCPU instead of two, I don't:

BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:6889]
INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 1, t=15002 jiffies, g=5324, c=5323, q=0)
BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:6889]
...

My stress test is installing OpenStack with cloud-install on a single VM (nested KVM) generated with uvt-kvm and two vCPUs. The backtrace I posted is the result of a manual "force off" (using virt-manager) of the test OpenStack VM. Shortly after cloud-install tries to launch a VM inside the VM, the CPU reaches 100% and all the shells (console included) get stuck. The last kernel message from the VM before I lose control is "[ 942.295014] IPv6: ADDRCONF(NETDEV_UP): virbr0: link is not ready".
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
@baco-1 These backtraces look a bit different from the ones in the original bug. Can you file a new bug describing how you are reproducing this, and gather complete logs? --chris
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
I still have the same issue with kernel 3.16.0-36-generic or 3.13.0-51-generic (proposed-updates).

# KVM HOST (3.16.0-36-generic)
sudo apt-get install linux-signed-generic-lts-utopic/trusty-proposed

# KVM GUEST (3.16.0-36-generic)
sudo apt-get install linux-virtual-lts-utopic/trusty-proposed
apt-get install cloud-installer
cloud-install

[ 1196.920613] kvm: vmptrld (null)/7800 failed
[ 1196.920953] vmwrite error: reg 401e value 31 (err 1)
[ 1196.921243] CPU: 23 PID: 5240 Comm: qemu-system-x86 Not tainted 3.16.0-36-generic #48~14.04.1-Ubuntu
[ 1196.921244] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 11/03/2014
[ 1196.921245] 88202018fb58 81764a5f 880fe496
[ 1196.921248] 88202018fb68 c0a9320d 88202018fb78 c0a878bf
[ 1196.921250] 88202018fba8 c0a8e1cf 880fe496
[ 1196.921252] Call Trace:
[ 1196.921262] [] dump_stack+0x45/0x56
[ 1196.921277] [] vmwrite_error+0x2c/0x2e [kvm_intel]
[ 1196.921280] [] vmcs_writel+0x1f/0x30 [kvm_intel]
[ 1196.921283] [] free_nested.part.73+0x5f/0x170 [kvm_intel]
[ 1196.921286] [] vmx_free_vcpu+0x33/0x70 [kvm_intel]
[ 1196.921305] [] kvm_arch_vcpu_free+0x44/0x50 [kvm]
[ 1196.921312] [] kvm_arch_destroy_vm+0xf2/0x1f0 [kvm]
[ 1196.921318] [] ? synchronize_srcu+0x1d/0x20
[ 1196.921323] [] kvm_put_kvm+0x10e/0x220 [kvm]
[ 1196.921328] [] kvm_vcpu_release+0x18/0x20 [kvm]
[ 1196.921331] [] __fput+0xe4/0x220
[ 1196.921333] [] fput+0xe/0x10
[ 1196.921337] [] task_work_run+0xc4/0xe0
[ 1196.921342] [] do_exit+0x2b8/0xa60
[ 1196.921345] [] ? __unqueue_futex+0x32/0x70
[ 1196.921347] [] ? futex_wait+0x126/0x290
[ 1196.921349] [] ? check_preempt_curr+0x85/0xa0
[ 1196.921351] [] do_group_exit+0x3f/0xa0
[ 1196.921353] [] get_signal_to_deliver+0x1d0/0x6f0
[ 1196.921357] [] do_signal+0x48/0xad0
[ 1196.921359] [] ? __switch_to+0x167/0x590
[ 1196.921361] [] do_notify_resume+0x69/0xb0
[ 1196.921364] [] int_signal+0x12/0x17
[ 1196.921365] vmwrite error: reg 2800 value (err -255)
[ 1196.921733] CPU: 23 PID: 5240 Comm: qemu-system-x86 Not tainted 3.16.0-36-generic #48~14.04.1-Ubuntu
[ 1196.921734] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 11/03/2014
[ 1196.921735] 88202018fb58 81764a5f 880fe496
[ 1196.921736] 88202018fb68 c0a9320d 88202018fb78 c0a878bf
[ 1196.921737] 88202018fba8 c0a8e1e0 880fe496
[ 1196.921739] Call Trace:
[ 1196.921741] [] dump_stack+0x45/0x56
[ 1196.921744] [] vmwrite_error+0x2c/0x2e [kvm_intel]
[ 1196.921746] [] vmcs_writel+0x1f/0x30 [kvm_intel]
[ 1196.921748] [] free_nested.part.73+0x70/0x170 [kvm_intel]
[ 1196.921751] [] vmx_free_vcpu+0x33/0x70 [kvm_intel]
[ 1196.921757] [] kvm_arch_vcpu_free+0x44/0x50 [kvm]
[ 1196.921763] [] kvm_arch_destroy_vm+0xf2/0x1f0 [kvm]
[ 1196.921765] [] ? synchronize_srcu+0x1d/0x20
[ 1196.921770] [] kvm_put_kvm+0x10e/0x220 [kvm]
[ 1196.921774] [] kvm_vcpu_release+0x18/0x20 [kvm]
[ 1196.921775] [] __fput+0xe4/0x220
[ 1196.921777] [] fput+0xe/0x10
[ 1196.921778] [] task_work_run+0xc4/0xe0
[ 1196.921780] [] do_exit+0x2b8/0xa60
[ 1196.921782] [] ? __unqueue_futex+0x32/0x70
[ 1196.921783] [] ? futex_wait+0x126/0x290
[ 1196.921784] [] ? check_preempt_curr+0x85/0xa0
[ 1196.921786] [] do_group_exit+0x3f/0xa0
[ 1196.921788] [] get_signal_to_deliver+0x1d0/0x6f0
[ 1196.921790] [] do_signal+0x48/0xad0
[ 1196.921791] [] ? __switch_to+0x167/0x590
[ 1196.921793] [] do_notify_resume+0x69/0xb0
[ 1196.921795] [] int_signal+0x12/0x17
[ 1270.766540] device vnet3 entered promiscuous mode
[ 1270.865885] device vnet4 entered promiscuous mode
[ 1273.824576] kvm: zapping shadow pages for mmio generation wraparound
[ 1447.725335] kvm [6152]: vcpu0 unhandled rdmsr: 0x606

The L1 VM was created with:

uvt-kvm create \
  --memory 16384 \
  --disk 100 \
  --cpu 2 \
  --ssh-public-key-file uvt-authorized_keys \
  --template uvt-template.xml \
  test release=trusty arch=amd64

(2 vCPUs, SandyBridge, Intel)
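Given the vmwrite errors above, one quick sanity check on the L0 host is whether nested VMX is enabled at all; a sketch, using the same kvm_intel parameters directory referenced elsewhere in this thread:

    cat /sys/module/kvm_intel/parameters/nested   # Y means nested VMX is on
    # if it prints N, nested KVM was never active; enable it via a modprobe
    # option and reload the module (or reboot):
    echo "options kvm_intel nested=1" | sudo tee /etc/modprobe.d/kvm-nested.conf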
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
After speaking to Gema, she will re-test with this kernel installed in L0 in addition to L1. NOTE: This fix needs to be present in both the L0 and L1 kernels.
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
I have been trying to verify this kernel. I haven't seen exactly the soft lockup crash, but I have seen this other one, which may or may not be related; I wanted to make a note of it:

[ 2406.041444] Kernel panic - not syncing: hung_task: blocked tasks
[ 2406.043163] CPU: 1 PID: 35 Comm: khungtaskd Not tainted 3.13.0-51-generic #84-Ubuntu
[ 2406.044223] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[ 2406.044223] 003fffd1 88080ec7fdf0 817225ce 81a62a65
[ 2406.044223] 88080ec7fe68 8171b46d 0008 88080ec7fe78
[ 2406.044223] 88080ec7fe18 88080ec7fe40 0100 0004
[ 2406.044223] Call Trace:
[ 2406.044223] [] dump_stack+0x45/0x56
[ 2406.044223] [] panic+0xc8/0x1d7
[ 2406.044223] [] watchdog+0x296/0x2e0
[ 2406.044223] [] ? reset_hung_task_detector+0x20/0x20
[ 2406.044223] [] kthread+0xd2/0xf0
[ 2406.044223] [] ? kthread_create_on_node+0x1c0/0x1c0
[ 2406.044223] [] ret_from_fork+0x7c/0xb0
[ 2406.044223] [] ? kthread_create_on_node+0x1c0/0x1c0

I have the crashdump for it; let me know how you want to proceed.
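If it helps triage, a first pass over such a crashdump with the crash(8) utility might look like this; a sketch — the dump path and debug-kernel location are assumptions for a default kdump-tools setup with the matching dbgsym package installed:

    sudo apt-get install crash
    crash /usr/lib/debug/boot/vmlinux-3.13.0-51-generic /var/crash/<timestamp>/dump.<timestamp>
    crash> log    # kernel ring buffer, including the hung_task panic above
    crash> ps     # look for tasks stuck in UN (uninterruptible) state
    crash> bt     # backtrace of the panic context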
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
Verified on my reproducers. I'm marking the development task as fixed for this bug. I'll move the upstream investigation to another bug.

** Changed in: linux (Ubuntu)
   Assignee: Chris J Arges (arges) => (unassigned)

** Changed in: linux (Ubuntu)
   Status: Confirmed => Fix Released

** Changed in: linux (Ubuntu)
   Importance: High => Undecided

** Tags removed: verification-needed-trusty
** Tags added: verification-done-trusty
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

** Tags added: verification-needed-trusty
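For reference, a minimal sketch of enabling -proposed and pulling just the kernel from it on trusty (the wiki page above is authoritative; the pocket components and meta-package name are assumptions for a stock install):

    echo "deb http://archive.ubuntu.com/ubuntu/ trusty-proposed restricted main multiverse universe" | \
      sudo tee /etc/apt/sources.list.d/trusty-proposed.list
    sudo apt-get update
    sudo apt-get install linux-generic/trusty-proposed   # only the kernel, not all of -proposed
    sudo reboot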
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
@Andy: So 3.16.0-34 is the kernel with the fix? Any chance that it will also be backported to the 3.13 series?
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
** Changed in: linux (Ubuntu Trusty)
   Status: In Progress => Fix Committed
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
Ran into this bug too on 3.13.0-48. My workaround is to run QEMU on top of KVM (instead of KVM on top of KVM).

devstack local.conf:

[[post-config|$NOVA_CONF]]
[libvirt]
virt_type = qemu

nova.conf:

[libvirt]
virt_type = qemu
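After switching virt_type, the compute service has to be restarted, and the change can be checked per instance; a sketch, assuming a classic devstack/trusty layout (the service name varies by deployment, and <instance-name> is a placeholder):

    sudo service nova-compute restart
    # a freshly booted instance should now be a plain emulated domain:
    virsh dumpxml <instance-name> | grep "<domain type"   # expect type='qemu', not type='kvm'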
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
With a revert of b6b8a145 ('Rework interception of IRQs and NMIs'), the issue does not occur readily with the test case: I was able to run for 1+ hour, whereas generally I can reproduce within 15 minutes. With 9242b5b6 ('KVM: x86: Check for nested events if there is an injectable interrupt') applied, I can also run for 1+ hour without issue. The current 3.13.0 patch level sits between those two commits, which allows this bug to reproduce easily.
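For anyone checking where their own kernel tree sits relative to those two commits, a sketch (the abbreviated commit ids are the ones quoted above):

    cd linux
    git merge-base --is-ancestor b6b8a145 HEAD && echo "has the IRQ/NMI rework"
    git merge-base --is-ancestor 9242b5b6 HEAD && echo "has the nested-events fix"
    # a tree containing the first commit but not the second is in the
    # window where this bug reproduces easily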
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
** Description changed:

  [Impact]

  Upstream discussion: https://lkml.org/lkml/2015/2/11/247

  Certain workloads that need to execute functions on a non-local CPU
  using smp_call_function_* can result in soft lockups with the following
  backtrace:

  PID: 22262  TASK: 8804274bb000  CPU: 1  COMMAND: "qemu-system-x86"
   #0 [88043fd03d18] machine_kexec at 8104ac02
   #1 [88043fd03d68] crash_kexec at 810e7203
   #2 [88043fd03e30] panic at 81719ff4
   #3 [88043fd03ea8] watchdog_timer_fn at 8110d7c5
   #4 [88043fd03ed8] __run_hrtimer at 8108e787
   #5 [88043fd03f18] hrtimer_interrupt at 8108ef4f
   #6 [88043fd03f80] local_apic_timer_interrupt at 81043537
   #7 [88043fd03f98] smp_apic_timer_interrupt at 81733d4f
   #8 [88043fd03fb0] apic_timer_interrupt at 817326dd
  --- ---
   #9 [880426f0d958] apic_timer_interrupt at 817326dd
      [exception RIP: generic_exec_single+130]
      RIP: 810dbe62  RSP: 880426f0da00  RFLAGS: 0202
      RAX: 0002  RBX: 880426f0d9d0  RCX: 0001
      RDX: 8180ad60  RSI:  RDI: 0286
      RBP: 880426f0da30  R8: 8180ad48  R9: 88042713bc68
      R10: 7fe7d1f2dbd0  R11: 0206  R12: 8804274bb000
      R13:  R14: 880407670280  R15:
      ORIG_RAX: ff10  CS: 0010  SS: 0018
  #10 [880426f0da38] smp_call_function_single at 810dbf75
  #11 [880426f0dab0] smp_call_function_many at 810dc3a6
  #12 [880426f0db10] native_flush_tlb_others at 8105c8f7
  #13 [880426f0db38] flush_tlb_mm_range at 8105c9cb
  #14 [880426f0db68] pmdp_splitting_flush at 8105b80d
  #15 [880426f0db88] __split_huge_page at 811ac90b
  #16 [880426f0dc20] split_huge_page_to_list at 811acfb8
  #17 [880426f0dc48] __split_huge_page_pmd at 811ad956
  #18 [880426f0dcc8] unmap_page_range at 8117728d
  #19 [880426f0dda0] unmap_single_vma at 81177341
  #20 [880426f0ddd8] zap_page_range at 811784cd
  #21 [880426f0de90] sys_madvise at 81174fbf
  #22 [880426f0df80] system_call_fastpath at 8173196d
      RIP: 7fe7ca2cc647  RSP: 7fe7be9febf0  RFLAGS: 0293
      RAX: 001c  RBX: 8173196d  RCX:
      RDX: 0004  RSI: 007fb000  RDI: 7fe7be1ff000
      RBP:  R8:  R9: 7fe7d1cd2738
      R10: 7fe7d1f2dbd0  R11: 0206  R12: 7fe7be9ff700
      R13: 7fe7be9ff9c0  R14:  R15:
      ORIG_RAX: 001c  CS: 0033  SS: 002b

+ [Fix]
+
+ Commit 9242b5b60df8b13b469bc6b7be08ff6ebb551ad3 mitigates this issue
+ when b6b8a1451fc40412c57d1 is applied (as in the case of the affected
+ 3.13 distro kernel). However, the issue can still occur in some cases.

  [Workaround]

  In order to avoid this issue, the workload needs to be pinned to CPUs
  such that the function always executes locally. For the nested VM case,
  this means the L1 VM needs to have all vCPUs pinned to a unique CPU.
  This can be accomplished with the following (for 2 vCPUs):

  virsh vcpupin <domain> 0 0
  virsh vcpupin <domain> 1 1

  [Test Case]

  - Deploy openstack on openstack
  - Run tempest on L1 cloud
  - Check kernel log of L1 nova-compute nodes
  (Although this may not necessarily be related to nested KVM)

  Potentially related: https://lkml.org/lkml/2014/11/14/656

  Another test case is to do the following (on affected hardware):

  1) Create an L1 KVM VM with 2 vCPUs (single vCPU case doesn't reproduce)
  2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
  3) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM

  Sometimes this is sufficient to reproduce the issue; I've observed that
  running KSM in the L1 VM can agitate this issue (it calls
  native_flush_tlb_others). If this doesn't reproduce, then you can do the
  following:

  4) Migrate the L2 vCPU randomly (via virsh vcpupin --live OR taskset)
  between L1 vCPUs until the hang occurs.

  -- Original Description:

  When installing qemu-kvm on a VM, KSM is enabled. I have encountered
  this problem in trusty:

  $ lsb_release -a
  Distributor ID: Ubuntu
  Description:    Ubuntu 14.04.1 LTS
  Release:        14.04
  Codename:       trusty
  $ uname -a
  Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

  The way to see the behaviour:
  1) $ more /sys/kernel/mm/ksm/run
     0
  2) $ sudo apt-get install qemu-kvm
  3) $ more /sys/kernel/mm/ksm/run
     1

  To see the soft lockups, deploy a cloud on a virtualised env like
  ctsstack, run tempest on it; the compute nodes of the virtualised
  deployment will eventually stop responding (run tempest 2 times at
  least).
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
** Also affects: linux (Ubuntu Trusty)
   Importance: Undecided
   Status: New

** Changed in: linux (Ubuntu Trusty)
   Assignee: (unassigned) => Chris J Arges (arges)

** Changed in: linux (Ubuntu Trusty)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Trusty)
   Status: New => In Progress

** Description changed:

  [Impact]

  Upstream discussion: https://lkml.org/lkml/2015/2/11/247

  Certain workloads that need to execute functions on a non-local CPU
  using smp_call_function_* can result in soft lockups with the following
  backtrace:

  PID: 22262  TASK: 8804274bb000  CPU: 1  COMMAND: "qemu-system-x86"
   #0 [88043fd03d18] machine_kexec at 8104ac02
   #1 [88043fd03d68] crash_kexec at 810e7203
   #2 [88043fd03e30] panic at 81719ff4
   #3 [88043fd03ea8] watchdog_timer_fn at 8110d7c5
   #4 [88043fd03ed8] __run_hrtimer at 8108e787
   #5 [88043fd03f18] hrtimer_interrupt at 8108ef4f
   #6 [88043fd03f80] local_apic_timer_interrupt at 81043537
   #7 [88043fd03f98] smp_apic_timer_interrupt at 81733d4f
   #8 [88043fd03fb0] apic_timer_interrupt at 817326dd
  --- ---
   #9 [880426f0d958] apic_timer_interrupt at 817326dd
      [exception RIP: generic_exec_single+130]
      RIP: 810dbe62  RSP: 880426f0da00  RFLAGS: 0202
      RAX: 0002  RBX: 880426f0d9d0  RCX: 0001
      RDX: 8180ad60  RSI:  RDI: 0286
      RBP: 880426f0da30  R8: 8180ad48  R9: 88042713bc68
      R10: 7fe7d1f2dbd0  R11: 0206  R12: 8804274bb000
      R13:  R14: 880407670280  R15:
      ORIG_RAX: ff10  CS: 0010  SS: 0018
  #10 [880426f0da38] smp_call_function_single at 810dbf75
  #11 [880426f0dab0] smp_call_function_many at 810dc3a6
  #12 [880426f0db10] native_flush_tlb_others at 8105c8f7
  #13 [880426f0db38] flush_tlb_mm_range at 8105c9cb
  #14 [880426f0db68] pmdp_splitting_flush at 8105b80d
  #15 [880426f0db88] __split_huge_page at 811ac90b
  #16 [880426f0dc20] split_huge_page_to_list at 811acfb8
  #17 [880426f0dc48] __split_huge_page_pmd at 811ad956
  #18 [880426f0dcc8] unmap_page_range at 8117728d
  #19 [880426f0dda0] unmap_single_vma at 81177341
  #20 [880426f0ddd8] zap_page_range at 811784cd
  #21 [880426f0de90] sys_madvise at 81174fbf
  #22 [880426f0df80] system_call_fastpath at 8173196d
      RIP: 7fe7ca2cc647  RSP: 7fe7be9febf0  RFLAGS: 0293
      RAX: 001c  RBX: 8173196d  RCX:
      RDX: 0004  RSI: 007fb000  RDI: 7fe7be1ff000
      RBP:  R8:  R9: 7fe7d1cd2738
      R10: 7fe7d1f2dbd0  R11: 0206  R12: 7fe7be9ff700
      R13: 7fe7be9ff9c0  R14:  R15:
      ORIG_RAX: 001c  CS: 0033  SS: 002b

  [Workaround]

  In order to avoid this issue, the workload needs to be pinned to CPUs
  such that the function always executes locally. For the nested VM case,
  this means the L1 VM needs to have all vCPUs pinned to a unique CPU.
  This can be accomplished with the following (for 2 vCPUs):

  virsh vcpupin <domain> 0 0
  virsh vcpupin <domain> 1 1

  [Test Case]

  - Deploy openstack on openstack
  - Run tempest on L1 cloud
  - Check kernel log of L1 nova-compute nodes
  (Although this may not necessarily be related to nested KVM)

  Potentially related: https://lkml.org/lkml/2014/11/14/656

+ Another test case is to do the following (on affected hardware):
+
+ 1) Create an L1 KVM VM with 2 vCPUs (single vCPU case doesn't reproduce)
+ 2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
+ 3) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM
+
+ Sometimes this is sufficient to reproduce the issue; I've observed that
+ running KSM in the L1 VM can agitate this issue (it calls
+ native_flush_tlb_others). If this doesn't reproduce, then you can do the
+ following:
+
+ 4) Migrate the L2 vCPU randomly (via virsh vcpupin --live OR taskset)
+ between L1 vCPUs until the hang occurs.
+
+ -- Original Description:

  When installing qemu-kvm on a VM, KSM is enabled. I have encountered
  this problem in trusty:

  $ lsb_release -a
  Distributor ID: Ubuntu
  Description:    Ubuntu 14.04.1 LTS
  Release:        14.04
  Codename:       trusty
  $ uname -a
  Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

  The way to see the behaviour:
  1) $ more /sys/kernel/mm/ksm/run
     0
  2) $ sudo apt-get install qemu-kvm
  3) $ more /sys/kernel/mm/ksm/run
     1

  To see the soft lockups, deploy a cloud on a virtualised env like
  ctsstack, run tempest on it; the compute nodes of the virtualised
  deployment will eventually stop responding.
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
@chris: done https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1439394
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
@arosen, This looks like a different soft lockup, and the machine also seems to recover from it. Please file a new bug and be sure to attach logs to it. Describe in detail how to reproduce it as well: what kind of host machine do you have? What VM definition are you using? Etc.

** Description changed:

  [Impact]

- Certain workloads that need to execute functions on a non-local CPU using smp_call_function_* can result in soft lockups with the following backtrace:
+ Upstream discussion: https://lkml.org/lkml/2015/2/11/247
+
+ Certain workloads that need to execute functions on a non-local CPU
+ using smp_call_function_* can result in soft lockups with the following
+ backtrace:

  PID: 22262  TASK: 8804274bb000  CPU: 1  COMMAND: "qemu-system-x86"
   #0 [88043fd03d18] machine_kexec at 8104ac02
   #1 [88043fd03d68] crash_kexec at 810e7203
   #2 [88043fd03e30] panic at 81719ff4
   #3 [88043fd03ea8] watchdog_timer_fn at 8110d7c5
   #4 [88043fd03ed8] __run_hrtimer at 8108e787
   #5 [88043fd03f18] hrtimer_interrupt at 8108ef4f
   #6 [88043fd03f80] local_apic_timer_interrupt at 81043537
   #7 [88043fd03f98] smp_apic_timer_interrupt at 81733d4f
   #8 [88043fd03fb0] apic_timer_interrupt at 817326dd
  --- ---
   #9 [880426f0d958] apic_timer_interrupt at 817326dd
      [exception RIP: generic_exec_single+130]
      RIP: 810dbe62  RSP: 880426f0da00  RFLAGS: 0202
      RAX: 0002  RBX: 880426f0d9d0  RCX: 0001
      RDX: 8180ad60  RSI:  RDI: 0286
      RBP: 880426f0da30  R8: 8180ad48  R9: 88042713bc68
      R10: 7fe7d1f2dbd0  R11: 0206  R12: 8804274bb000
      R13:  R14: 880407670280  R15:
      ORIG_RAX: ff10  CS: 0010  SS: 0018
  #10 [880426f0da38] smp_call_function_single at 810dbf75
  #11 [880426f0dab0] smp_call_function_many at 810dc3a6
  #12 [880426f0db10] native_flush_tlb_others at 8105c8f7
  #13 [880426f0db38] flush_tlb_mm_range at 8105c9cb
  #14 [880426f0db68] pmdp_splitting_flush at 8105b80d
  #15 [880426f0db88] __split_huge_page at 811ac90b
  #16 [880426f0dc20] split_huge_page_to_list at 811acfb8
  #17 [880426f0dc48] __split_huge_page_pmd at 811ad956
  #18 [880426f0dcc8] unmap_page_range at 8117728d
  #19 [880426f0dda0] unmap_single_vma at 81177341
  #20 [880426f0ddd8] zap_page_range at 811784cd
  #21 [880426f0de90] sys_madvise at 81174fbf
  #22 [880426f0df80] system_call_fastpath at 8173196d
      RIP: 7fe7ca2cc647  RSP: 7fe7be9febf0  RFLAGS: 0293
      RAX: 001c  RBX: 8173196d  RCX:
      RDX: 0004  RSI: 007fb000  RDI: 7fe7be1ff000
      RBP:  R8:  R9: 7fe7d1cd2738
      R10: 7fe7d1f2dbd0  R11: 0206  R12: 7fe7be9ff700
      R13: 7fe7be9ff9c0  R14:  R15:
      ORIG_RAX: 001c  CS: 0033  SS: 002b

  [Workaround]

  In order to avoid this issue, the workload needs to be pinned to CPUs
  such that the function always executes locally. For the nested VM case,
  this means the L1 VM needs to have all vCPUs pinned to a unique CPU.
  This can be accomplished with the following (for 2 vCPUs):

  virsh vcpupin <domain> 0 0
  virsh vcpupin <domain> 1 1

  [Test Case]

  - Deploy openstack on openstack
  - Run tempest on L1 cloud
  - Check kernel log of L1 nova-compute nodes
  (Although this may not necessarily be related to nested KVM)

  Potentially related: https://lkml.org/lkml/2014/11/14/656

  -- Original Description:

  When installing qemu-kvm on a VM, KSM is enabled. I have encountered
  this problem in trusty:

  $ lsb_release -a
  Distributor ID: Ubuntu
  Description:    Ubuntu 14.04.1 LTS
  Release:        14.04
  Codename:       trusty
  $ uname -a
  Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

  The way to see the behaviour:
  1) $ more /sys/kernel/mm/ksm/run
     0
  2) $ sudo apt-get install qemu-kvm
  3) $ more /sys/kernel/mm/ksm/run
     1

  To see the soft lockups, deploy a cloud on a virtualised env like
  ctsstack, run tempest on it; the compute nodes of the virtualised
  deployment will eventually stop responding (run tempest 2 times at
  least):

  [24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
  [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
  [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
  [24208.07200
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
I am also hitting this issue in my CI a lot. Here is the trace I'm getting in syslog: http://logs2.aaronorosen.com/85/169585/1/check/dsvm-tempest-full-congress-nodepool/94f8441/logs/syslog.txt.gz#_Apr__1_02_43_44

Is there a workaround for this?
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
@fifieldt Hi, that is the same bug. Things to reduce the hangs right now are:

- Disabling KSM in the L1 guest
- Using the 3.16 kernel on the L0 host
- Pinning L1 vCPUs to L0 host CPUs

Note this doesn't fix the issue; it only (potentially) decreases the frequency of these lockups. --chris
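A sketch of the first two mitigations (the sysfs knob is the one shown in the bug description; note KSM stays off only until something, e.g. the qemu-kvm init script, turns it back on):

    # in the L1 guest: stop KSM scanning
    echo 0 | sudo tee /sys/kernel/mm/ksm/run
    cat /sys/kernel/mm/ksm/run        # expect 0

    # on the L0 host: move to the lts-utopic (3.16) kernel
    sudo apt-get install linux-generic-lts-utopic
    sudo reboot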
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
Hi, just wanted to chime in that this bug also affected me - running OpenStack Juno w/ KVM inside a KVM hypervisor. CPU on the host machine is:

vendor_id  : GenuineIntel
cpu family : 6
model      : 58
model name : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

running 14.04 with the latest packages applied as of today (2015-03-27) for both the host and the guest.

The lockup appeared to happen with one host-guest VM after I altered the number of CPUs allocated to another VM (yet to reboot that VM for the change to take effect), though I had also recently booted a new host-guest-guest VM.

Mar 27 15:12:43 compute ntpd[1775]: peers refreshed
Mar 27 15:12:43 compute ntpd[1775]: new interface(s) found: waking up resolver
Mar 27 15:12:48 compute dnsmasq-dhcp[2044]: DHCPDISCOVER(br100) fa:16:3e:c3:81:22
Mar 27 15:12:48 compute dnsmasq-dhcp[2044]: DHCPOFFER(br100) 203.0.113.27 fa:16:3e:c3:81:22
Mar 27 15:12:48 compute dnsmasq-dhcp[2044]: DHCPREQUEST(br100) 203.0.113.27 fa:16:3e:c3:81:22
Mar 27 15:12:48 compute dnsmasq-dhcp[2044]: DHCPACK(br100) 203.0.113.27 fa:16:3e:c3:81:22 test03
Mar 27 15:15:40 compute kernel: [ 436.12] BUG: soft lockup - CPU#5 stuck for 23s! [ksmd:68]
Mar 27 15:15:40 compute kernel: [ 436.12] Modules linked in: vhost_net vhost macvtap macvlan xt_CHECKSUM ebt_ip ebt_arp ebtable_filter bridge stp llc xt_conntrack xt_nat xt_tcpudp iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables nbd ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi snd_hda_intel cirrus snd_hda_codec ttm snd_hwdep drm_kms_helper snd_pcm drm snd_page_alloc snd_timer syscopyarea snd sysfillrect soundcore sysimgblt dm_multipath i2c_piix4 kvm_intel scsi_dh serio_raw kvm mac_hid lp parport 8139too psmouse 8139cp mii floppy pata_acpi
Mar 27 15:15:40 compute kernel: [ 436.12] CPU: 5 PID: 68 Comm: ksmd Not tainted 3.13.0-46-generic #79-Ubuntu
Mar 27 15:15:40 compute kernel: [ 436.12] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Mar 27 15:15:40 compute kernel: [ 436.12] task: 8802306db000 ti: 8802306e4000 task.ti: 8802306e4000
Mar 27 15:15:40 compute kernel: [ 436.12] RIP: 0010:[] [] generic_exec_single+0x86/0xb0
Mar 27 15:15:40 compute kernel: [ 436.12] RSP: 0018:8802306e5c00 EFLAGS: 0202
Mar 27 15:15:40 compute kernel: [ 436.12] RAX: 0006 RBX: 8802306e5bd0 RCX: 0005
Mar 27 15:15:40 compute kernel: [ 436.12] RDX: 8180ade0 RSI: RDI: 0286
Mar 27 15:15:40 compute kernel: [ 436.12] RBP: 8802306e5c30 R08: 8180adc8 R09: 880232989b48
Mar 27 15:15:40 compute kernel: [ 436.12] R10: 0867 R11: R12:
Mar 27 15:15:40 compute kernel: [ 436.12] R13: R14: R15:
Mar 27 15:15:40 compute kernel: [ 436.12] FS: () GS:88023fd4() knlGS:
Mar 27 15:15:40 compute kernel: [ 436.12] CS: 0010 DS: ES: CR0: 8005003b
Mar 27 15:15:40 compute kernel: [ 436.12] CR2: 7fb0557bf000 CR3: 36b7d000 CR4: 26e0
Mar 27 15:15:40 compute kernel: [ 436.12] Stack:
Mar 27 15:15:40 compute kernel: [ 436.12] 88023fd13f80 0004 0005 81d14300
Mar 27 15:15:40 compute kernel: [ 436.12] 8105c7a0 88023212c380 8802306e5ca8 810dc065
Mar 27 15:15:40 compute kernel: [ 436.12] 000134c0 000134c0 88023fd13f80 88023fd13f80
Mar 27 15:15:40 compute kernel: [ 436.12] Call Trace:
Mar 27 15:15:40 compute kernel: [ 436.12] [] ? leave_mm+0x80/0x80
Mar 27 15:15:40 compute kernel: [ 436.12] [] smp_call_function_single+0xe5/0x190
Mar 27 15:15:40 compute kernel: [ 436.12] [] ? leave_mm+0x80/0x80
Mar 27 15:15:40 compute kernel: [ 436.12] [] ? kvm_handle_hva_range+0x11a/0x180 [kvm]
Mar 27 15:15:40 compute kernel: [ 436.12] [] ? rmap_write_protect+0x80/0x80 [kvm]
Mar 27 15:15:40 compute kernel: [ 436.12] [] smp_call_function_many+0x286/0x2d0
Mar 27 15:15:40 compute kernel: [ 436.12] [] ? leave_mm+0x80/0x80
Mar 27 15:15:40 compute kernel: [ 436.12] [] native_flush_tlb_others+0x37/0x40
Mar 27 15:15:40 compute kernel: [ 436.12] [] flush_tlb_page+0x56/0xa0
Mar 27 15:15:40 compute kernel: [ 436.12] [] ptep_clear_flush+0x48/0x60
Mar 27 15:15:40 compute kernel: [ 436.12] [] try_to_merge_with_ksm_page+0x14f/0x650
Mar 27 15:15:40 compute kernel: [ 436.12] [] ksm_do_scan+0xb96/0xdb0
Mar 27 15:15:40 compute kernel: [ 436.12] [] ksm_scan_thread+0x7f/0x200
Mar 27 15:15:40 compute kernel: [ 436.12] [] ? prepare_
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
Ideas going forward:

1) Instrument kernel for debugging csd_lock
2) Determine which CPUs exhibit this issue
3) Examine pinning more in depth (pin 0-0, 1-2, for example; see the sketch below)
4) Test older and newer kernels to verify the issue
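For idea 3, a sketch of the pin 0-0 / 1-2 experiment; 'trusty-vm' is the L1 guest name used in the reproducer elsewhere in this thread (substitute your own domain name):

    virsh vcpupin trusty-vm 0 0    # vCPU 0 -> host CPU 0
    virsh vcpupin trusty-vm 1 2    # vCPU 1 -> host CPU 2
    virsh vcpuinfo trusty-vm       # confirm the CPU affinity took effect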
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
Stefan, This looks like a separate bug (as we discussed). Please file another bug for this when you have time.

** Description changed:

  [Impact]

- Users of nested KVM for testing openstack have soft lockups as follows:
+ Certain workloads that need to execute functions on a non-local CPU
+ using smp_call_function_* can result in soft lockups with the following
+ backtrace:

  PID: 22262  TASK: 8804274bb000  CPU: 1  COMMAND: "qemu-system-x86"
   #0 [88043fd03d18] machine_kexec at 8104ac02
   #1 [88043fd03d68] crash_kexec at 810e7203
   #2 [88043fd03e30] panic at 81719ff4
   #3 [88043fd03ea8] watchdog_timer_fn at 8110d7c5
   #4 [88043fd03ed8] __run_hrtimer at 8108e787
   #5 [88043fd03f18] hrtimer_interrupt at 8108ef4f
   #6 [88043fd03f80] local_apic_timer_interrupt at 81043537
   #7 [88043fd03f98] smp_apic_timer_interrupt at 81733d4f
   #8 [88043fd03fb0] apic_timer_interrupt at 817326dd
  --- ---
   #9 [880426f0d958] apic_timer_interrupt at 817326dd
      [exception RIP: generic_exec_single+130]
      RIP: 810dbe62  RSP: 880426f0da00  RFLAGS: 0202
      RAX: 0002  RBX: 880426f0d9d0  RCX: 0001
      RDX: 8180ad60  RSI:  RDI: 0286
      RBP: 880426f0da30  R8: 8180ad48  R9: 88042713bc68
      R10: 7fe7d1f2dbd0  R11: 0206  R12: 8804274bb000
      R13:  R14: 880407670280  R15:
      ORIG_RAX: ff10  CS: 0010  SS: 0018
  #10 [880426f0da38] smp_call_function_single at 810dbf75
  #11 [880426f0dab0] smp_call_function_many at 810dc3a6
  #12 [880426f0db10] native_flush_tlb_others at 8105c8f7
  #13 [880426f0db38] flush_tlb_mm_range at 8105c9cb
  #14 [880426f0db68] pmdp_splitting_flush at 8105b80d
  #15 [880426f0db88] __split_huge_page at 811ac90b
  #16 [880426f0dc20] split_huge_page_to_list at 811acfb8
  #17 [880426f0dc48] __split_huge_page_pmd at 811ad956
  #18 [880426f0dcc8] unmap_page_range at 8117728d
  #19 [880426f0dda0] unmap_single_vma at 81177341
  #20 [880426f0ddd8] zap_page_range at 811784cd
  #21 [880426f0de90] sys_madvise at 81174fbf
  #22 [880426f0df80] system_call_fastpath at 8173196d
      RIP: 7fe7ca2cc647  RSP: 7fe7be9febf0  RFLAGS: 0293
      RAX: 001c  RBX: 8173196d  RCX:
      RDX: 0004  RSI: 007fb000  RDI: 7fe7be1ff000
      RBP:  R8:  R9: 7fe7d1cd2738
      R10: 7fe7d1f2dbd0  R11: 0206  R12: 7fe7be9ff700
      R13: 7fe7be9ff9c0  R14:  R15:
      ORIG_RAX: 001c  CS: 0033  SS: 002b

+ [Workaround]
+
+ In order to avoid this issue, the workload needs to be pinned to CPUs
+ such that the function always executes locally. For the nested VM case,
+ this means the L1 VM needs to have all vCPUs pinned to a unique CPU.
+ This can be accomplished with the following (for 2 vCPUs):
+
+ virsh vcpupin <domain> 0 0
+ virsh vcpupin <domain> 1 1

  [Test Case]
  - Deploy openstack on openstack
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
I've added instructions for a workaround. The code paths I've seen in crashes have been the following:

kvm_sched_in -> kvm_arch_vcpu_load -> vmx_vcpu_load -> loaded_vmcs_clear -> smp_call_function_single

pmdp_clear_flush -> flush_tlb_mm_range -> native_flush_tlb_others -> smp_call_function_many

Generally this has been caused by workloads that use nested VMs and stress the L2/L1 VMs (causing non-local-CPU TLB flushing or VMCS clearing). The hang is in csd_lock_wait, waiting for the CSD_FLAG_LOCK bit to be cleared, which can only be triggered by non-local smp_call_function_* calls. Another data point is that this can happen with x2apic as well as with the flat APIC (as tested with nox2apic).
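For reference, the flat-APIC test mentioned above can be repeated by booting with nox2apic; a sketch, assuming a stock GRUB setup:

    # prepend nox2apic to the default kernel command line
    sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="/&nox2apic /' /etc/default/grub
    sudo update-grub && sudo reboot
    cat /proc/cmdline    # confirm nox2apic is present after the reboot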
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
Hrmn... When I repeated the setup, I seem to have triggered some kind of lockup even while bringing up L2. Of course it's hard to say without details of Ryan's dump. However, mine seems to have backtraces in the log which remind me an awful lot of an issue related to punching holes into ext4-based qcow images. Chris had been working on something like this before... He is on a sprint this week. Anyway, my stack trace in the log:

[ 1200.288031] INFO: task qemu-system-x86:4545 blocked for more than 120 seconds.
[ 1200.288712] Not tainted 3.13.0-46-generic #77-Ubuntu
[ 1200.289204] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1200.289892] qemu-system-x86 D 88007fc134c0 0 4545 1 0x
[ 1200.289895] 88007a9c5d28 0082 88007bbd3000 88007a9c5fd8
[ 1200.289897] 000134c0 000134c0 88007bbd3000 88007fc13d58
[ 1200.289898] 88007ffcdee8 0002 8114eef0 88007a9c5da0
[ 1200.289900] Call Trace:
[ 1200.289906] [] ? wait_on_page_read+0x60/0x60
[ 1200.289909] [] io_schedule+0x9d/0x140
[ 1200.289910] [] sleep_on_page+0xe/0x20
[ 1200.289912] [] __wait_on_bit+0x62/0x90
[ 1200.289914] [] wait_on_page_bit+0x7f/0x90
[ 1200.289917] [] ? autoremove_wake_function+0x40/0x40
[ 1200.289919] [] ? pagevec_lookup_tag+0x21/0x30
[ 1200.289921] [] filemap_fdatawait_range+0xf9/0x190
[ 1200.289923] [] filemap_write_and_wait_range+0x3f/0x70
[ 1200.289927] [] ext4_sync_file+0xba/0x320
[ 1200.289930] [] do_fsync+0x51/0x80
[ 1200.289931] [] SyS_fdatasync+0x13/0x20
[ 1200.289933] [] system_call_fastpath+0x1a/0x1f
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
Yeah, will do. I just got distracted and wanted to ensure that the repro was not accidentally another failure path into the out-of-space issue.
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
@smb - after repeating the test a few times, I too ran out of space with the default 8GB VM disk size, resulting in a paused VM. You'll have to re-create the VMs a little bit differently (--disk <size in GB>). ex:

@L0:
sudo uvt-kvm destroy trusty-vm
sudo uvt-kvm create --memory 2048 --disk 40 trusty-vm release=trusty

@L1:
# repeat original repro

ref: http://manpages.ubuntu.com/manpages/trusty/man1/uvt-kvm.1.html
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
Hm, following your instructions I instead run into a situation where the L2 guest gets paused, likely because L1 runs out of disk space. The default in uvtool is 7G, which I would say the L2 stress run fills as it grows the L2 qcow image on L1, which has to fit the initial cloud image and the snapshot of it for L2 into its 7G.
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
I've collected crash dumps and have stored them on an internal Canonical server, as they are 2GB+. Feel free to ping me for access.
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
A few hours later, those two L0 bare-metal host CPUs are still maxed out. In scenarios where L0 is hosting many VMs, such as in a cloud, this bug can be expected to cause significant performance, consistency, and capacity issues on the host and in the cloud as a whole.
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
** Attachment added: "L1-console-log-soft-lockup.png" https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1413540/+attachment/4353984/+files/L1-console-log-soft-lockup.png -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1413540 Title: Trusty soft lockup issues with nested KVM To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1413540/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
** Attachment added: "L0-baremetal-cpu-pegged.png" https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1413540/+attachment/4353983/+files/L0-baremetal-cpu-pegged.png -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1413540 Title: Trusty soft lockup issues with nested KVM To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1413540/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
s/static/sym/ ;-)
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
This does not appear to be specific to OpenStack or tempest. I've reproduced with Trusty on Trusty on Trusty, vanilla qemu/kvm.

Simplified reproducer, with an existing MAAS cluster:

@L0 baremetal:
- Create a Trusty bare metal host from daily images.
- sudo apt-get update -y && sudo apt-get -y install uvtool
- sudo uvt-simplestreams-libvirt sync release=trusty arch=amd64
- sudo uvt-simplestreams-libvirt query
- ssh-keygen
- sudo uvt-kvm create --memory 2048 trusty-vm release=trusty
- sudo virsh shutdown trusty-vm
- # edit /etc/libvirt/qemu/trusty-vm.xml to enable serial console dump to file (see the sketch after this list):
- sudo virsh define /etc/libvirt/qemu/trusty-vm.xml
- sudo virsh start trusty-vm
- # confirm console output:
- sudo tailf /tmp/trusty-vm-console.log
- # take note of the VM's IP:
- sudo uvt-kvm ip trusty-vm
- # ssh into the new VM.

@L1 "trusty-vm":
- sudo apt-get update -y && sudo apt-get -y install uvtool
- sudo uvt-simplestreams-libvirt sync release=trusty arch=amd64
- sudo uvt-simplestreams-libvirt query
- ssh-keygen
- # change .122. to .123. in /etc/libvirt/qemu/networks/default.xml
- # make sure default.xml is static linked inside /etc/libvirt/qemu/networks
- sudo reboot # for good measure
- sudo uvt-kvm create --memory 768 trusty-nest release=trusty
- # take note of the nested VM's IP:
- sudo uvt-kvm ip trusty-nest
- # ssh into the new VM.

@L2 "trusty-nest":
- sudo apt-get update && sudo apt-get install stress
- stress -c 1 -i 1 -m 1 -d 1 -t 600

Now watch the "trusty-vm" console for: [ 496.076004] BUG: soft lockup - CPU#0 stuck for 23s! [ksmd:36]. It happens to me within a couple of minutes. Then both L1 and L2 become unreachable indefinitely, with two cores on L0 stuck at 100%.
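The serial-console edit in the L0 steps above can look like the following; a sketch of the <devices> fragment for /etc/libvirt/qemu/trusty-vm.xml (the log path matches the tailf command above):

    <serial type='file'>
      <source path='/tmp/trusty-vm-console.log'/>
      <target port='0'/>
    </serial>
    <console type='file'>
      <source path='/tmp/trusty-vm-console.log'/>
      <target type='serial' port='0'/>
    </console>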
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
Also FYI: I was not able to reproduce this issue when using Vivid as the bare metal L0.
[Bug 1413540] Re: Trusty soft lockup issues with nested KVM
** Summary changed:

- soft lockup issues with nested KVM VMs running tempest
+ Trusty soft lockup issues with nested KVM