[Kernel-packages] [Bug 1630304] Comment bridged from LTC Bugzilla

2018-02-23 Thread bugproxy
--- Comment From sthou...@in.ibm.com 2018-02-23 06:46 EDT---
Any updates?

As part of the bug backlog screening exercise, we are looking for
updates on bugs that have not seen any activity for quite some time.

Please post the latest status on this bug so that it can be closed
accordingly.

Thanks for your support!!

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1630304

Title:
  Ubuntu 16.10 KVM: Issue doing hotplug detach to SRIOV VF

Status in linux package in Ubuntu:
  Opinion

Bug description:
  ---Problem Description---
  I can not get hotplug attach to work in Ubuntu but if I try to detach a CX4 VF from a guest I am getting some issues:
  Like in this case:
  [  474.393308] vfio-pci 0001:01:00.3: No device request channel registered, blocked until released by user
  [  474.393543] pci 0001:01: 0.3: [PE# 006] Removing DMA window #0
  [  474.393553] pci 0001:01: 0.3: [PE# 006] Removing DMA window #1
  [  474.393906] mlx5_core 0001:01:00.3: enabling device ( -> 0002)
  [  474.393939] mlx5_core 0001:01:00.3: Using 32-bit DMA via iommu
  [  474.400360] pci 0001:01: 0.3: [PE# 006] Setting up window#0 0..7fff pg=1000
  [  474.400380] mlx5_core 0001:01:00.3: firmware version: 12.17.226
  [  474.401341] pci 0001:01: 0.3: [PE# 006] Enabling 64-bit DMA bypass
  [  474.402284] EEH: Frozen PE#6 on PHB#1 detected
  [  474.402475] EEH: PE location: Slot4, PHB location: N/A
  [  474.403699] EEH: This PCI device has failed 1 times in the last hour
  [  474.403700] EEH: Notify device drivers to shutdown
  [  474.403707] mlx5_core 0001:01:00.3: mlx5_pci_err_detected was called
  [  474.403711] mlx5_core 0001:01:00.3: 0001:01:00.3:mlx5_enter_error_state:115:(pid 779): start
  [  474.403870] mlx5_core 0001:01:00.3: 0001:01:00.3:mlx5_enter_error_state:120:(pid 779): end

  
  One time I saw 
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.943722] vfio-pci 0001:01:00.3: No device request channel registered, blocked until released by user
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944076] mlx5_core 0001:01:00.3: enabling device ( -> 0002)
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944110] mlx5_core 0001:01:00.3: Using 32-bit DMA via iommu
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944145] pci 0001:01: 0.3: [PE# 006] Removing DMA window #0
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944152] pci 0001:01: 0.3: [PE# 006] Removing DMA window #1
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944195] mlx5_core 0001:01:00.3: firmware version: 12.17.226
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944260] Unable to handle kernel paging request for data at address 0x
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944533] Faulting instruction address: 0xc05b37e0
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944592] Oops: Kernel access of bad area, sig: 11 [#1]
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944636] SMP NR_CPUS=2048 NUMA PowerNV
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.944851] Modules linked in: vfio_pci 
irqbypass vfio_iommu_spapr_tce vfio_virqfd vfio vfio_spapr_eeh xt_CHECKSUM 
iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 
nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT 
nf_reject_ipv4 xt_tcpudp kvm_hv kvm_pr kvm ebtable_filter ebtables 
ip6table_filter ip6_tables iptable_filter rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) 
iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) 
mlx5_core(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) mlx4_en(OE) 
ib_addr(OE) ib_netlink(OE) mlx4_core(OE) mlx_compat(OE) bridge stp llc joydev 
input_leds mac_hid ofpart at24 cmdlinepart powernv_flash ipmi_powernv 
nvmem_core uio_pdrv_genirq opal_prd mtd ipmi_msghandler uio ibmpowernv 
powernv_rng binfmt_misc dm_multipath knem(OE) ip_tables x_tables autofs4 
hid_generic usbhid hid uas usb_storage ast i2c_algo_bit ttm drm_kms_helper 
syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci devlink libahci [last 
unloaded: mlx4_core]
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.946007] CPU: 40 PID: 12501 Comm: 
libvirtd Tainted: G   OE   4.7.0unofficial #5
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.946074] task: c00ec319a200 ti: 
c00ec324c000 task.ti: c00ec324c000
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.946140] NIP: c05b37e0 LR: 
c05ad070 CTR: 
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.946208] REGS: c00ec324f100 TRAP: 
0300   Tainted: G   OE(4.7.0unofficial)
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.946286] MSR: 90010280b033 
  CR: 84028844  XER: 2000
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.946533] CFAR: c0008468 DAR: 
 DSISR: 4000 SOFTE: 0
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.946533] GPR00: c05d19c8 
c00ec324f380 c13bef00 
  Sep 13 09:41:32 ltc-fire1 kernel: [70437.946533] GP

[Kernel-packages] [Bug 1630304] Comment bridged from LTC Bugzilla

2017-11-21 Thread bugproxy
--- Comment From mdr...@us.ibm.com 2017-11-21 11:31 EDT---
The patches are upstream and will be included in QEMU 2.11:

1: commit 04162f8f4bcf8c9ae2422def4357289b44208c8c
Author: Michael Roth 
Date:   Mon Oct 16 17:23:13 2017 -0500

qdev: store DeviceState's canonical path to use when unparenting

2: commit 2fc06c4ac65594ad248e9a9150ebdde9ff5a1253
Author: Michael Roth 
Date:   Mon Oct 16 17:23:14 2017 -0500

Revert "qdev: Free QemuOpts when the QOM path goes away"

3: commit f7b879e072ae6839b1b1d1312f48fa7f256397e2
Author: Michael Roth 
Date:   Mon Oct 16 17:23:15 2017 -0500

qdev: defer DEVICE_DEL event until instance_finalize()


[Kernel-packages] [Bug 1630304] Comment bridged from LTC Bugzilla

2017-08-04 Thread bugproxy
--- Comment From lagar...@br.ibm.com 2017-08-04 16:11 EDT---
Michael sent an updated version of his patch to the QEMU community last week [0].

0. https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08410.html


[Kernel-packages] [Bug 1630304] Comment bridged from LTC Bugzilla

2016-10-04 Thread bugproxy
--- Comment From mdr...@us.ibm.com 2016-10-04 18:34 EDT---
Some observations:

1) QEMU appears to be sending the 'device-removed' event prematurely.
The below output shows that the device's VFIO group FD is still open by
the QEMU process at the time it signals libvirt that the device
unplug/cleanup has completed:

root@ltc-fire1:~# virsh event ltc-fire1-vm3-ubuntu-16.10 --event device-removed && lsof /dev/vfio/7
event 'device-removed' for domain ltc-fire1-vm3-ubuntu-16.10: hostdev0
events received: 1

COMMAND     PID         USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
qemu-syst 31231 libvirt-qemu   26u   CHR  242,0      0t0  750 /dev/vfio/7

2) In response to this event, libvirt issues the following sequence to
rebind the VF:

echo $DEVID >/sys/bus/pci/drivers/vfio-pci/unbind
echo $DEVID >/sys/bus/pci/drivers_probe

3) On the VFIO side, this consistently leads to mlx5_core attempting to
bind to the device while VFIO is still running its cleanup routines:

[  120.099498] KVM guest htab at c00f2b00 (order 26), LPID 1
[  120.208235] pci 0001:01: 0.2: [PE# 005] Setting up window#0 0..3fff pg=1000
[  138.281730] pci 0001:01: 0.2: [PE# 005] Setting up window#1 800..801 pg=1
[  396.873573] vfio-pci 0001:01:00.2: No device request channel registered, blocked until released by user
[  396.873791] pci 0001:01: 0.2: [PE# 005] Removing DMA window #0
[  396.873796] pci 0001:01: 0.2: [PE# 005] Removing DMA window #1
[  396.873908] mlx5_core 0001:01:00.2: enabling device ( -> 0002)
[  396.873940] mlx5_core 0001:01:00.2: Using 32-bit DMA via iommu
[  396.874034] mlx5_core 0001:01:00.2: firmware version: 12.17.1010

The full cleanup path should include something like:
[ 4762.425039] pci 0001:01: 0.2: [PE# 005] Removing DMA window #0
[ 4762.425043] pci 0001:01: 0.2: [PE# 005] Removing DMA window #1
[ 4762.432014] pci 0001:01: 0.2: [PE# 005] Setting up window#0 0..7fff pg=1000
[ 4762.432018] pci 0001:01: 0.2: [PE# 005] Enabling 64-bit DMA bypass

So the driver is attempting to enable the device before the default DMA
windows have been restored.

4) The sleep Carol inserted above in the VFIO cleanup path seems to avoid
the issue. This suggests that the reprobe doesn't run blindly but
instead waits for a signal of some sort, and that this signal seems
to arrive prematurely without the explicit sleep.
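
The ordering problem described in observation 3 can be checked mechanically
against a kernel log. The following is only a rough sketch: the function name
check_rebind_order is made up for illustration, and it assumes the log
contains the same message texts as the excerpts above. For simplicity it does
not match the "Removing DMA window" lines to a particular device.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool mentioned in this bug).
# Reads kernel log lines on stdin and reports whether a host driver began
# enabling the given device after the DMA windows were removed but before
# the default window was restored.
# Usage: dmesg | check_rebind_order 0001:01:00.2
check_rebind_order() {
    dev="$1"    # PCI address as printed by the driver, e.g. 0001:01:00.2
    awk -v dev="$dev" '
        # VFIO teardown starts removing the DMA windows.
        /Removing DMA window/        { removing = 1; restored = 0 }
        # The default window is back once DMA bypass is re-enabled.
        /Enabling 64-bit DMA bypass/ { if (removing) restored = 1 }
        # A driver enabling the device between those two points is the race.
        index($0, dev ": enabling device") {
            if (removing && !restored) race = 1
        }
        END {
            if (race) {
                print "RACE: driver probe preceded DMA window restore"
                exit 1
            }
            print "OK: no early probe seen"
        }
    '
}
```

The function returns a non-zero status when the early probe pattern is found,
so it can be used in a test loop around repeated hotplug detach attempts.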

This probably needs to be addressed at multiple levels: a fix in QEMU to
defer the device-deleted event until VFIO has cleaned up the device, and
a fix in the VFIO path to avoid crashing the host if someone were to issue
the reprobe manually while the device is still in use.
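
Until such fixes exist, a manual reprobe could at least be guarded in
userspace. The sketch below is an assumption-laden illustration, not what
libvirt does: the function names, the retry count and the one-second polling
interval are invented here. It polls with lsof, as in observation 1, and only
issues the unbind/reprobe sequence once the VFIO group file is no longer open.
It is a heuristic that narrows the window and may still race with kernel-side
cleanup.

```shell
#!/bin/sh
# Hypothetical guard around the unbind/reprobe sequence.

# Succeeds once no process has the given path open, or fails after
# $2 attempts (default 30), sleeping one second between attempts.
# Caveat: a missing lsof binary or a nonexistent path also counts as "closed".
wait_until_closed() {
    path="$1"
    tries="${2:-30}"
    while [ "$tries" -gt 0 ]; do
        # lsof exits non-zero when it finds no open instance of the file.
        if ! lsof "$path" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Rebind a device to its default host driver only after the VFIO group
# file is no longer held open (for example by the QEMU process).
safe_rebind() {
    devid="$1"      # PCI address, e.g. 0001:01:00.2
    group="$2"      # VFIO group file, e.g. /dev/vfio/7
    if ! wait_until_closed "$group"; then
        echo "VFIO group $group is still in use; not rebinding $devid" >&2
        return 1
    fi
    echo "$devid" >/sys/bus/pci/drivers/vfio-pci/unbind
    echo "$devid" >/sys/bus/pci/drivers_probe
}
```

For the device in the logs above this would be invoked as
"safe_rebind 0001:01:00.2 /dev/vfio/7", run as root.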

A possible workaround that might be worth trying in the meantime is
specifying managed='no' in the device XML, which according to libvirt
documentation would prevent libvirt from automatically rebinding the
device back to default in the host after unplug. But I saw mention that
maybe this wasn't supported yet for KVM, so it's not a given.
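
For anyone wanting to try that workaround, a minimal sketch follows. The
helper name hostdev_xml and the file name vf.xml are assumptions made for
illustration; the PCI address and guest name are the ones from the logs
above. With managed='no' the libvirt documentation leaves binding and
rebinding to the user, so the virsh nodedev-detach and nodedev-reattach steps
are done by hand. Whether this avoids the crash on this setup is untested.

```shell
#!/bin/sh
# Hypothetical helper: print a libvirt <hostdev> element for a PCI device
# with managed='no'.
# Arguments: domain bus slot function (hex, without the 0x prefix).
hostdev_xml() {
    cat <<EOF
<hostdev mode='subsystem' type='pci' managed='no'>
  <source>
    <address domain='0x$1' bus='0x$2' slot='0x$3' function='0x$4'/>
  </source>
</hostdev>
EOF
}

# Example sequence (shown as comments, not executed by this sketch):
#   virsh nodedev-detach pci_0001_01_00_2          # bind the VF to vfio-pci by hand
#   hostdev_xml 0001 01 00 2 >vf.xml
#   virsh attach-device ltc-fire1-vm3-ubuntu-16.10 vf.xml --live
#   virsh detach-device ltc-fire1-vm3-ubuntu-16.10 vf.xml --live
#   virsh nodedev-reattach pci_0001_01_00_2        # manual rebind, once the device is idle
```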

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1630304

Title:
  Ubuntu 16.10 KVM: Issue doing hotplug detach to SRIOV VF

Status in linux package in Ubuntu:
  New
