During the investigation, we noticed that the PCI specification requires
the MSI/MSI-X capability to be disabled after a system boot/reset; from
the PCI Local Bus Specification 3.0, sections 6.8.1.3 and 6.8.2.3:
"[...] MSI Enable: This bit’s state after reset is 0 (MSI is
disabled)."

The PCI layer in the Linux kernel ensures this bit is 0 during its
initialization [0], but in our case that is too late, given that we had
an IRQ storm during the early stages of the kdump kernel boot process.

The idea to resolve the issue was then to disable MSI/MSI-X early in boot, 
using the early-quirks infrastructure in arch/x86, which proved to be a 
successful approach.
Patches will be attached here soon.
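
To illustrate what "disabling MSI/MSI-X" means at the register level, here
is a minimal userspace sketch. It is not the patch itself: it operates on a
plain 256-byte array standing in for one device's configuration space, and
the names disable_msi_msix(), cfg_read16() and cfg_write16() are
hypothetical helpers invented for this example. An actual early quirk would
go through the kernel's early PCI config accessors instead. The offsets and
bit positions are the standard ones from the PCI specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard PCI configuration-space offsets and bits. */
#define PCI_STATUS           0x06   /* Status register */
#define PCI_STATUS_CAP_LIST  0x10   /* device implements a capability list */
#define PCI_CAPABILITY_LIST  0x34   /* pointer to the first capability */
#define PCI_CAP_ID_MSI       0x05
#define PCI_CAP_ID_MSIX      0x11
#define PCI_MSG_CTRL         0x02   /* Message Control, relative to the cap */
#define PCI_MSI_ENABLE       0x0001 /* MSI Enable: bit 0 */
#define PCI_MSIX_ENABLE      0x8000 /* MSI-X Enable: bit 15 */
#define CFG_SIZE             256    /* conventional config space size */

/* Hypothetical accessors over a simulated (little-endian) config space. */
static uint16_t cfg_read16(const uint8_t *cfg, unsigned off)
{
    return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

static void cfg_write16(uint8_t *cfg, unsigned off, uint16_t val)
{
    cfg[off] = (uint8_t)(val & 0xff);
    cfg[off + 1] = (uint8_t)(val >> 8);
}

/*
 * Walk the capability list and clear the MSI / MSI-X enable bits.
 * Returns the number of capabilities that were enabled and got disabled.
 */
int disable_msi_msix(uint8_t *cfg)
{
    int cleared = 0;
    int ttl = 48;   /* guard against a malformed, looping list */
    unsigned pos;

    assert(cfg != NULL);

    if (!(cfg_read16(cfg, PCI_STATUS) & PCI_STATUS_CAP_LIST))
        return 0;   /* no capability list at all */

    pos = cfg[PCI_CAPABILITY_LIST] & 0xfc;
    while (pos >= 0x40 && pos <= CFG_SIZE - 4 && ttl-- > 0) {
        uint8_t id = cfg[pos];
        uint16_t mask = 0;

        if (id == PCI_CAP_ID_MSI)
            mask = PCI_MSI_ENABLE;
        else if (id == PCI_CAP_ID_MSIX)
            mask = PCI_MSIX_ENABLE;

        if (mask) {
            uint16_t ctrl = cfg_read16(cfg, pos + PCI_MSG_CTRL);

            if (ctrl & mask) {
                cfg_write16(cfg, pos + PCI_MSG_CTRL,
                            (uint16_t)(ctrl & ~mask));
                cleared++;
            }
        }
        pos = cfg[pos + 1] & 0xfc;   /* next capability pointer */
    }
    return cleared;
}
```

Calling disable_msi_msix() on a buffer whose MSI or MSI-X capability has
its enable bit set clears that bit and leaves the rest of the Message
Control register untouched; a second call finds nothing left to disable.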

[0]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/probe.c?h=v4.18#n1511

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1797990

Title:
  kdump fail due to an IRQ storm

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Trusty:
  Confirmed
Status in linux source package in Xenial:
  Confirmed
Status in linux source package in Bionic:
  Confirmed
Status in linux source package in Cosmic:
  Confirmed

Bug description:
  We have reports of a kdump failure in Ubuntu (on an x86 machine) that
  was narrowed down to an MSI IRQ storm coming from a PCI network device.

  The bug manifests as a lack of progress in the boot process of the
  kdump kernel, and a storm of kernel messages like:

  [...]
  [  342.265294] do_IRQ: 0.155 No irq handler for vector
  [  342.266916] do_IRQ: 0.155 No irq handler for vector
  [  347.258422] do_IRQ: 14053260 callbacks suppressed
  [...]

  The root cause of the issue is that the kdump kernel kexec process
  does not ensure PCI devices are reset and/or MSI capabilities are
  disabled, so a PCI device could produce a huge number of PCI IRQs,
  which would take all the processing time of the CPU (especially since
  we restrict the kdump kernel to a single CPU).

  This was tested using upstream kernel version 4.18, and the problem
  reproduces.
  In the specific test scenario, the PCI NIC was an "Intel 82599ES
  10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode
  (vfio_pci), under high load on the guest.

