[Kernel-packages] [Bug 1630304] Comment bridged from LTC Bugzilla
--- Comment From sthou...@in.ibm.com 2018-02-23 06:46 EDT---
Any updates? As part of the bug backlog screening exercise, we are looking for updates on bugs that have not seen activity for quite some time. Please post the latest status on this bug so that it can be closed accordingly. Thanks for your support!!

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1630304

Title:
  Ubuntu 16.10 KVM: Issue doing hotplug detach to SRIOV VF

Status in linux package in Ubuntu:
  Opinion

Bug description:
  ---Problem Description---
  I cannot get hotplug attach to work in Ubuntu, but if I try to detach a CX4 VF from a guest I am getting some issues:

  Like in this case:
  [ 474.393308] vfio-pci 0001:01:00.3: No device request channel registered, blocked until released by user
  [ 474.393543] pci 0001:01: 0.3: [PE# 006] Removing DMA window #0
  [ 474.393553] pci 0001:01: 0.3: [PE# 006] Removing DMA window #1
  [ 474.393906] mlx5_core 0001:01:00.3: enabling device ( -> 0002)
  [ 474.393939] mlx5_core 0001:01:00.3: Using 32-bit DMA via iommu
  [ 474.400360] pci 0001:01: 0.3: [PE# 006] Setting up window#0 0..7fff pg=1000
  [ 474.400380] mlx5_core 0001:01:00.3: firmware version: 12.17.226
  [ 474.401341] pci 0001:01: 0.3: [PE# 006] Enabling 64-bit DMA bypass
  [ 474.402284] EEH: Frozen PE#6 on PHB#1 detected
  [ 474.402475] EEH: PE location: Slot4, PHB location: N/A
  [ 474.403699] EEH: This PCI device has failed 1 times in the last hour
  [ 474.403700] EEH: Notify device drivers to shutdown
  [ 474.403707] mlx5_core 0001:01:00.3: mlx5_pci_err_detected was called
  [ 474.403711] mlx5_core 0001:01:00.3: 0001:01:00.3:mlx5_enter_error_state:115:(pid 779): start
  [ 474.403870] mlx5_core 0001:01:00.3: 0001:01:00.3:mlx5_enter_error_state:120:(pid 779): end

  One time I saw:
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.943722] vfio-pci 0001:01:00.3: No device request channel registered, blocked until released by user
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944076] mlx5_core 0001:01:00.3: enabling device ( -> 0002)
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944110] mlx5_core 0001:01:00.3: Using 32-bit DMA via iommu
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944145] pci 0001:01: 0.3: [PE# 006] Removing DMA window #0
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944152] pci 0001:01: 0.3: [PE# 006] Removing DMA window #1
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944195] mlx5_core 0001:01:00.3: firmware version: 12.17.226
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944260] Unable to handle kernel paging request for data at address 0x
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944533] Faulting instruction address: 0xc05b37e0
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944592] Oops: Kernel access of bad area, sig: 11 [#1]
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944636] SMP NR_CPUS=2048 NUMA PowerNV
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944851] Modules linked in: vfio_pci irqbypass vfio_iommu_spapr_tce vfio_virqfd vfio vfio_spapr_eeh xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp kvm_hv kvm_pr kvm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) mlx4_en(OE) ib_addr(OE) ib_netlink(OE) mlx4_core(OE) mlx_compat(OE) bridge stp llc joydev input_leds mac_hid ofpart at24 cmdlinepart powernv_flash ipmi_powernv nvmem_core uio_pdrv_genirq opal_prd mtd ipmi_msghandler uio ibmpowernv powernv_rng binfmt_misc dm_multipath knem(OE) ip_tables x_tables autofs4 hid_generic usbhid hid uas usb_storage ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci devlink libahci [last unloaded: mlx4_core]
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.946007] CPU: 40 PID: 12501 Comm: libvirtd Tainted: G OE 4.7.0unofficial #5
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.946074] task: c00ec319a200 ti: c00ec324c000 task.ti: c00ec324c000
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.946140] NIP: c05b37e0 LR: c05ad070 CTR:
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.946208] REGS: c00ec324f100 TRAP: 0300 Tainted: G OE(4.7.0unofficial)
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.946286] MSR: 90010280b033 CR: 84028844 XER: 2000
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.946533] CFAR: c0008468 DAR: DSISR: 4000 SOFTE: 0
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.946533] GPR00: c05d19c8 c00ec324f380 c13bef00
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.946533] GP
[Kernel-packages] [Bug 1630304] Comment bridged from LTC Bugzilla
--- Comment From mdr...@us.ibm.com 2017-11-21 11:31 EDT---
The patches are upstream and will be included in QEMU 2.11:

1: commit 04162f8f4bcf8c9ae2422def4357289b44208c8c
   Author: Michael Roth
   Date: Mon Oct 16 17:23:13 2017 -0500

   qdev: store DeviceState's canonical path to use when unparenting

2: commit 2fc06c4ac65594ad248e9a9150ebdde9ff5a1253
   Author: Michael Roth
   Date: Mon Oct 16 17:23:14 2017 -0500

   Revert "qdev: Free QemuOpts when the QOM path goes away"

3: commit f7b879e072ae6839b1b1d1312f48fa7f256397e2
   Author: Michael Roth
   Date: Mon Oct 16 17:23:15 2017 -0500

   qdev: defer DEVICE_DEL event until instance_finalize()
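Since the comment above says the fixes are included in QEMU 2.11, a rough way to check whether a host's QEMU is at least that version is sketched below. This is only a sketch under assumptions: the helper names `version_ge` and `qemu_version` are hypothetical, the binary name `qemu-system-ppc64` is assumed from the PowerNV host in the logs, and `sort -V` is assumed to be available (GNU coreutils). A distribution package may also carry the patches as backports on an older version, so the version number alone is not conclusive.

```shell
#!/bin/sh
# Hypothetical helpers; not part of QEMU or libvirt.

# version_ge A B: succeeds if version A is greater than or equal to B.
# Relies on the version sort provided by `sort -V` (assumed available).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# qemu_version [BINARY]: print the version number from the first line of
# "BINARY --version", which normally reads "QEMU emulator version X.Y.Z ...".
qemu_version() {
    "${1:-qemu-system-ppc64}" --version |
        sed -n '1s/.*version \([0-9][0-9.]*\).*/\1/p'
}

# Example use (commented out so that sourcing this file has no side effects):
# if version_ge "$(qemu_version)" 2.11; then
#     echo "QEMU is 2.11 or newer"
# else
#     echo "QEMU is older than 2.11; check for backported patches"
# fi
```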
[Kernel-packages] [Bug 1630304] Comment bridged from LTC Bugzilla
--- Comment From lagar...@br.ibm.com 2017-08-04 16:11 EDT---
Michael sent an updated version of his patch to the QEMU community last week [0].

0. https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08410.html
[Kernel-packages] [Bug 1630304] Comment bridged from LTC Bugzilla
--- Comment From mdr...@us.ibm.com 2016-10-04 18:34 EDT---
Some observations:

1) QEMU appears to be sending the 'device-removed' event prematurely. The output below shows that the device's VFIO group FD is still held open by the QEMU process at the time it signals libvirt that the device unplug/cleanup has completed:

root@ltc-fire1:~# virsh event ltc-fire1-vm3-ubuntu-16.10 --event device-removed && lsof /dev/vfio/7
event 'device-removed' for domain ltc-fire1-vm3-ubuntu-16.10: hostdev0
events received: 1
COMMAND     PID         USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
qemu-syst 31231 libvirt-qemu  26u    CHR  242,0      0t0  750 /dev/vfio/7

2) In response to this event, libvirt issues the following sequence to rebind the VF:

echo $DEVID >/sys/bus/pci/drivers/vfio-pci/unbind
echo $DEVID >/sys/bus/pci/drivers_probe

3) On the VFIO side, this consistently leads to mlx5_core attempting to bind to the device while VFIO is still running its cleanup routines:

[ 120.099498] KVM guest htab at c00f2b00 (order 26), LPID 1
[ 120.208235] pci 0001:01: 0.2: [PE# 005] Setting up window#0 0..3fff pg=1000
[ 138.281730] pci 0001:01: 0.2: [PE# 005] Setting up window#1 800..801 pg=1
[ 396.873573] vfio-pci 0001:01:00.2: No device request channel registered, blocked until released by user
[ 396.873791] pci 0001:01: 0.2: [PE# 005] Removing DMA window #0
[ 396.873796] pci 0001:01: 0.2: [PE# 005] Removing DMA window #1
[ 396.873908] mlx5_core 0001:01:00.2: enabling device ( -> 0002)
[ 396.873940] mlx5_core 0001:01:00.2: Using 32-bit DMA via iommu
[ 396.874034] mlx5_core 0001:01:00.2: firmware version: 12.17.1010

The full cleanup path should include something like:

[ 4762.425039] pci 0001:01: 0.2: [PE# 005] Removing DMA window #0
[ 4762.425043] pci 0001:01: 0.2: [PE# 005] Removing DMA window #1
[ 4762.432014] pci 0001:01: 0.2: [PE# 005] Setting up window#0 0..7fff pg=1000
[ 4762.432018] pci 0001:01: 0.2: [PE# 005] Enabling 64-bit DMA bypass

So the driver is attempting to enable the device before the default DMA windows have been restored.

4) The sleep Carol inserted above in the VFIO cleanup path seems to avoid the issue. This suggests that the reprobe doesn't blindly run but instead waits for a signal of some sort, and that this signaling happens prematurely without the explicit sleep.

This probably needs to be addressed at multiple levels: a fix in QEMU to defer the device-deleted event until VFIO has cleaned up the device, and a fix in the VFIO path to avoid crashing the host if someone were to issue the reprobe manually while the device is still in use.

A possible workaround that might be worth trying in the meantime is specifying managed='no' in the device XML, which according to the libvirt documentation would prevent libvirt from automatically rebinding the device back to its default host driver after unplug. But I saw mention that maybe this wasn't supported yet for KVM, so it's not a given.

--
Status in linux package in Ubuntu:
  New
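The ordering problem in observations 1) to 3) can be illustrated with a small shell sketch for a manual rebind (for example when trying managed='no'): wait until no process holds the VFIO group node open, and only then write to unbind and drivers_probe. This is a sketch under assumptions, not what libvirt or QEMU do. The function names and the BUSY_CHECK, SYSFS_PCI and WAIT_INTERVAL variables are hypothetical and exist only to make the example testable. A closed group FD does not by itself guarantee that VFIO has finished restoring the default DMA windows, so this does not replace the QEMU and VFIO fixes discussed above.

```shell
#!/bin/sh
# Hypothetical helpers for a manual VF rebind; names are illustrative only.

# group_busy PATH: succeeds if at least one process has PATH open.
# lsof exits 0 when it finds an open instance of the file.
group_busy() {
    lsof "$1" >/dev/null 2>&1
}

# wait_for_release PATH [TRIES]: poll until PATH is no longer open.
# Returns 0 once released, 1 if it is still busy after TRIES polls.
wait_for_release() {
    path=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        "${BUSY_CHECK:-group_busy}" "$path" || return 0
        tries=$((tries - 1))
        sleep "${WAIT_INTERVAL:-1}"
    done
    return 1
}

# rebind_vf DEVID GROUP: same unbind/reprobe sequence as in observation 2),
# but refused while the VFIO group node is still held open.
rebind_vf() {
    devid=$1                          # e.g. 0001:01:00.2
    group=$2                          # e.g. /dev/vfio/7
    sysfs=${SYSFS_PCI:-/sys/bus/pci}  # overridable for testing (assumption)
    if ! wait_for_release "$group"; then
        echo "refusing to rebind $devid: $group is still in use" >&2
        return 1
    fi
    echo "$devid" > "$sysfs/drivers/vfio-pci/unbind"
    echo "$devid" > "$sysfs/drivers_probe"
}
```

On a real host this would be run as root, for example `rebind_vf 0001:01:00.2 /dev/vfio/7`, using the device and group shown in the output above. The check-then-write step is itself racy, which is one reason the comment calls for fixes in QEMU and in the VFIO path rather than in a script.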