** Description changed:

+ [Impact]
+ 
+  * A kexec/crash kernel might get stuck and fail to boot
+    (for crash kernel, kdump fails to collect a crashdump)
+    if a PCI device is buggy/stuck/looping and triggers a
+    continuous flood of MSI(X) interrupts (that the kernel
+    does not yet know about).
+ 
+  * This fix allowed us to obtain crashdumps when debugging
+    a heavy-load scenario, in which a (heavily loaded) network
+    adapter wouldn't stop triggering MSI-X interrupts even
+    after panic()->kdump kicked in.
+ 
+  * This fix disables MSI(X) in all PCI devices on early
+    boot; this is OK as it's (re-)enabled normally later.
+ 
+ [Test Case]
+ 
+  * A synthetic test-case is not yet available; however,
+    this particular system/workload triggered the problem
+    consistently, and it was used for development/testing.
+ 
+  * We'll update this bug once a synthetic test-case is
+    available; we're working on patching QEMU for this.
+ 
+ [Regression Potential]
+ 
+  * The potential area for regressions is early boot,
+    particularly effects of applying quirks during PCI
+    bus scan, which is changed/broader w/ these patches.
+ 
+  * However, all quirks are applied based on PCI ID
+    matching, so would only apply if actually targeting
+    a new device.
+ 
+  * Moreover, the new quirk is only applied based on
+    a kernel cmdline parameter that is disabled by
+    default, which constrains even further when this
+    is actually in effect.
+ 
+ [Other Info]
+  
+  * The patch series is still under review/discussion
+    upstream, but it's relatively important for Ubuntu
+    users at this point, and after internal discussions
+    we decided to submit it for SRU.
+ 
+  * These are links to the linux-pci archive with the
+    patches [1, 2, 3]
+ 
+    [1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
+        https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com/
+ 
+    [2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
+        https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpicc...@canonical.com/
+ 
+    [3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
+        https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpicc...@canonical.com/
+ 
+ 
+ [Original Description]
+ 
  We have reports of a kdump failure in Ubuntu (on an x86 machine) that was
  narrowed down to an MSI IRQ storm coming from a PCI network device.
  
  The bug manifests as a lack of progress in the boot process of the kdump
  kernel, and a storm of kernel messages like:
  
  [...]
  [  342.265294] do_IRQ: 0.155 No irq handler for vector
  [  342.266916] do_IRQ: 0.155 No irq handler for vector
  [  347.258422] do_IRQ: 14053260 callbacks suppressed
  [...]
  
  The root cause of the issue is that the kdump kernel kexec process does
  not ensure PCI devices are reset and/or MSI capabilities are disabled,
  so a PCI device could produce a huge number of PCI IRQs, which would take
  all the processing time of the CPU (especially since we restrict the
  kdump kernel to a single CPU).
  
  This was tested using upstream kernel version 4.18, and the problem
  reproduces.
  In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit
  [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under
  high load on the guest.
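
As a rough illustration of what "MSI capabilities are disabled" means at
the PCI config-space level, here is a minimal userspace sketch in Python.
It is not the kernel code from the patch series: it works on a simulated
256-byte configuration space (a bytearray), and all helper names
(read16, write16, find_capability, clear_msi_enable_bits,
make_demo_config) are hypothetical. Only the register offsets and
capability IDs are the standard ones from the PCI specification.

```python
# Standard PCI config-space offsets and capability IDs.
PCI_STATUS = 0x06            # 16-bit Status register
PCI_STATUS_CAP_LIST = 0x10   # Status bit 4: capability list present
PCI_CAPABILITY_LIST = 0x34   # pointer to the first capability
PCI_CAP_ID_MSI = 0x05
PCI_CAP_ID_MSIX = 0x11
MSI_FLAGS_ENABLE = 0x0001    # Message Control bit 0 (MSI)
MSIX_FLAGS_ENABLE = 0x8000   # Message Control bit 15 (MSI-X)


def read16(cfg, off):
    """Read a little-endian 16-bit value from the simulated config space."""
    return cfg[off] | (cfg[off + 1] << 8)


def write16(cfg, off, val):
    """Write a little-endian 16-bit value to the simulated config space."""
    cfg[off] = val & 0xFF
    cfg[off + 1] = (val >> 8) & 0xFF


def find_capability(cfg, cap_id):
    """Walk the capability list; return the capability offset or 0."""
    if not (read16(cfg, PCI_STATUS) & PCI_STATUS_CAP_LIST):
        return 0
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC
    for _ in range(48):          # bound the walk in case the list loops
        if pos < 0x40 or cfg[pos] == 0xFF:
            break
        if cfg[pos] == cap_id:
            return pos
        pos = cfg[pos + 1] & 0xFC
    return 0


def clear_msi_enable_bits(cfg):
    """Clear the MSI and MSI-X enable bits; return what was disabled."""
    cleared = []
    for name, cap_id, enable in (("MSI", PCI_CAP_ID_MSI, MSI_FLAGS_ENABLE),
                                 ("MSI-X", PCI_CAP_ID_MSIX, MSIX_FLAGS_ENABLE)):
        pos = find_capability(cfg, cap_id)
        if pos:
            ctrl = read16(cfg, pos + 2)   # Message Control is at cap + 2
            if ctrl & enable:
                write16(cfg, pos + 2, ctrl & ~enable)
                cleared.append(name)
    return cleared


def make_demo_config():
    """Made-up device with MSI at 0x50 and MSI-X at 0x70, both enabled."""
    cfg = bytearray(256)
    write16(cfg, PCI_STATUS, PCI_STATUS_CAP_LIST)
    cfg[PCI_CAPABILITY_LIST] = 0x50
    cfg[0x50], cfg[0x51] = PCI_CAP_ID_MSI, 0x70
    write16(cfg, 0x52, 0x0081)
    cfg[0x70], cfg[0x71] = PCI_CAP_ID_MSIX, 0x00
    write16(cfg, 0x72, 0x803F)
    return cfg


if __name__ == "__main__":
    print(clear_msi_enable_bits(make_demo_config()))
```

With the enable bits cleared, a device should no longer signal MSI/MSI-X
interrupts until a driver enables them again, which is why doing this on
early boot is described above as OK.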

** Description changed:

  [Impact]
  
-  * A kexec/crash kernel might get stuck and fail to boot
-    (for crash kernel, kdump fails to collect a crashdump)
-    if a PCI device is buggy/stuck/looping and triggers a
-    continuous flood of MSI(X) interrupts (that the kernel
-    does not yet know about).
+  * A kexec/crash kernel might get stuck and fail to boot
+    (for crash kernel, kdump fails to collect a crashdump)
+    if a PCI device is buggy/stuck/looping and triggers a
+    continuous flood of MSI(X) interrupts (that the kernel
+    does not yet know about).
  
-  * This fix allowed us to obtain crashdumps when debugging
-    a heavy-load scenario, in which a (heavily loaded) network
-    adapter wouldn't stop triggering MSI-X interrupts even
-    after panic()->kdump kicked in.
+  * This fix allowed us to obtain crashdumps when debugging
+    a heavy-load scenario, in which a (heavily loaded) network
+    adapter wouldn't stop triggering MSI-X interrupts even
+    after panic()->kdump kicked in.
  
   * This fix disables MSI(X) in all PCI devices on early
-    boot; this is OK as it's (re-)enabled normally later.
+    boot (this is OK as it's (re-)enabled normally later)
+    with a kernel cmdline parameter (disabled by default).
  
  [Test Case]
  
-  * A synthetic test-case is not yet available; however,
-    this particular system/workload triggered the problem
-    consistently, and it was used for development/testing.
+  * A synthetic test-case is not yet available; however,
+    this particular system/workload triggered the problem
+    consistently, and it was used for development/testing.
  
-  * We'll update this bug once a synthetic test-case is
-    available; we're working on patching QEMU for this.
+  * We'll update this bug once a synthetic test-case is
+    available; we're working on patching QEMU for this.
  
  [Regression Potential]
  
-  * The potential area for regressions is early boot,
-    particularly effects of applying quirks during PCI
-    bus scan, which is changed/broader w/ these patches.
+  * The potential area for regressions is early boot,
+    particularly effects of applying quirks during PCI
+    bus scan, which is changed/broader w/ these patches.
  
-  * However, all quirks are applied based on PCI ID
-    matching, so would only apply if actually targeting
-    a new device.
+  * However, all quirks are applied based on PCI ID
+    matching, so would only apply if actually targeting
+    a new device.
  
-  * Moreover, the new quirk is only applied based on
-    a kernel cmdline parameter that is disabled by
-    default, which constrains even further when this
-    is actually in effect.
+  * Moreover, the new quirk is only applied based on
+    a kernel cmdline parameter that is disabled by
+    default, which constrains even further when this
+    is actually in effect.
  
  [Other Info]
-  
-  * The patch series is still under review/discussion
-    upstream, but it's relatively important for Ubuntu
-    users at this point, and after internal discussions
-    we decided to submit it for SRU.
  
-  * These are links to the linux-pci archive with the
-    patches [1, 2, 3]
+  * The patch series is still under review/discussion
+    upstream, but it's relatively important for Ubuntu
+    users at this point, and after internal discussions
+    we decided to submit it for SRU.
  
-    [1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
-        https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com/
+  * These are links to the linux-pci archive with the
+    patches [1, 2, 3]
  
-    [2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
-        https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpicc...@canonical.com/
+    [1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
+        https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpicc...@canonical.com/
  
-    [3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
-        https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpicc...@canonical.com/
+    [2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
+        https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpicc...@canonical.com/
  
+    [3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
+        https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpicc...@canonical.com/
  
  [Original Description]
  
  We have reports of a kdump failure in Ubuntu (on an x86 machine) that was
  narrowed down to an MSI IRQ storm coming from a PCI network device.
  
  The bug manifests as a lack of progress in the boot process of the kdump
  kernel, and a storm of kernel messages like:
  
  [...]
  [  342.265294] do_IRQ: 0.155 No irq handler for vector
  [  342.266916] do_IRQ: 0.155 No irq handler for vector
  [  347.258422] do_IRQ: 14053260 callbacks suppressed
  [...]
  
  The root cause of the issue is that the kdump kernel kexec process does
  not ensure PCI devices are reset and/or MSI capabilities are disabled,
  so a PCI device could produce a huge number of PCI IRQs, which would take
  all the processing time of the CPU (especially since we restrict the
  kdump kernel to a single CPU).
  
  This was tested using upstream kernel version 4.18, and the problem
  reproduces.
  In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit
  [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under
  high load on the guest.
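
To help recognize this failure in a captured console log, the following
is a small Python sketch that summarizes the "No irq handler for vector"
storm shown in the kernel messages above. It assumes the two message
shapes quoted in this report, and that the "0.155" field is CPU.vector
(our reading of the message, not something stated in this report). The
names NO_HANDLER, SUPPRESSED and summarize_irq_storm are hypothetical.

```python
import re
from collections import Counter

# Message shapes taken from the log excerpt in this report.
NO_HANDLER = re.compile(r"do_IRQ: (\d+)\.(\d+) No irq handler for vector")
SUPPRESSED = re.compile(r"do_IRQ: (\d+) callbacks suppressed")


def summarize_irq_storm(lines):
    """Count unhandled-vector messages per (cpu, vector) and sum the
    number of messages the kernel reports as rate-limit suppressed."""
    vectors = Counter()
    suppressed = 0
    for line in lines:
        match = NO_HANDLER.search(line)
        if match:
            vectors[(int(match.group(1)), int(match.group(2)))] += 1
            continue
        match = SUPPRESSED.search(line)
        if match:
            suppressed += int(match.group(1))
    return vectors, suppressed


if __name__ == "__main__":
    sample = [
        "[  342.265294] do_IRQ: 0.155 No irq handler for vector",
        "[  342.266916] do_IRQ: 0.155 No irq handler for vector",
        "[  347.258422] do_IRQ: 14053260 callbacks suppressed",
    ]
    per_vector, suppressed = summarize_irq_storm(sample)
    for (cpu, vector), count in per_vector.items():
        print(f"cpu {cpu} vector {vector}: {count} logged")
    print(f"suppressed messages: {suppressed}")
```

A very large suppressed count against a single vector, as in the excerpt
above, is consistent with one device flooding the kdump kernel.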

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1797990

Title:
  kdump fail due to an IRQ storm

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797990/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
