** Description changed:

- impact
- being noticed a lot, only affects 5.4, fix in subsequent failures
+ SRU Justification:
  
- The offending patch was removed in 20.10 and later kernels (it was
- reverted upstream not long after being merged into mainline but we never
- reverted it)
+ [IMPACT]
  
+ A hardware partner reports that this problem is seen frequently by
+ their internal testing teams and is also reported regularly by
+ customers who notice these messages in their logs, which is generating
+ an unusually high volume of support calls from the field.
  
- following error messages are observed
+ Commit d60cd06331a3566d3305b3c7b566e79edf4e2095 was introduced upstream
+ and pulled into the Ubuntu 5.4 kernel between 5.4.0-58.64 and
+ 5.4.0-59.65. These errors were later discovered upstream and the patch
+ was reverted (see [FIX] below). We carry the revert commit in all
+ subsequent Focal HWE kernels starting at 5.12, but the fix was never
+ pulled back into Focal 5.4.
+ 
+ According to the hardware partner:
+ 
+ The following error messages are observed when rebooting a machine that
+ uses the BCM5720 chipset, a widely used 1GbE controller found on LOMs
+ and OCP NICs as well as many PCIe NIC models.
  
  [  146.429212] shutdown[1]: Rebooting.
  [  146.435151] kvm: exiting hardware virtualization
  [  146.575319] megaraid_sas 0000:67:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
  [  148.088133] [qede_unload:2236(eno12409)]Link is down
  [  148.183618] qede 0000:31:00.1: Ending qede_remove successfully
  [  148.518541] [qede_unload:2236(eno12399)]Link is down
  [  148.625066] qede 0000:31:00.0: Ending qede_remove successfully
  [  148.762067] ACPI: Preparing to enter system sleep state S5
  [  148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
  [  148.803731] {1}[Hardware Error]: event severity: recoverable
  [  148.810191] {1}[Hardware Error]:  Error 0, type: fatal
  [  148.816088] {1}[Hardware Error]:   section_type: PCIe error
  [  148.822391] {1}[Hardware Error]:   port_type: 0, PCIe end point
  [  148.829026] {1}[Hardware Error]:   version: 3.0
  [  148.834266] {1}[Hardware Error]:   command: 0x0006, status: 0x0010
  [  148.841140] {1}[Hardware Error]:   device_id: 0000:04:00.0
  [  148.847309] {1}[Hardware Error]:   slot: 0
  [  148.852077] {1}[Hardware Error]:   secondary_bus: 0x00
  [  148.857876] {1}[Hardware Error]:   vendor_id: 0x14e4, device_id: 0x165f
  [  148.865145] {1}[Hardware Error]:   class_code: 020000
  [  148.870845] {1}[Hardware Error]:   aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
  [  148.879842] {1}[Hardware Error]:   aer_uncor_severity: 0x000ef030
  [  148.886575] {1}[Hardware Error]:   TLP Header: 40000001 0000030f 90028090 00000000
  [  148.894823] tg3 0000:04:00.0: AER: aer_status: 0x00100000, aer_mask: 0x00010000
  [  148.902795] tg3 0000:04:00.0: AER:    [20] UnsupReq               (First)
  [  148.910234] tg3 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
  [  148.918806] tg3 0000:04:00.0: AER: aer_uncor_severity: 0x000ef030
  [  148.925558] tg3 0000:04:00.0: AER:   TLP Header: 40000001 0000030f 90028090 00000000
  [  148.933984] reboot: Restarting system
  [  148.938319] reboot: machine restart
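
The register values in the log above can be decoded by hand. The following is a minimal sketch, not part of the original report, that maps the bits of the PCIe AER Uncorrectable Error Status register to their names and checks each set bit against the mask and severity values from the log. The function name `decode_aer` is hypothetical and chosen only for illustration.

```python
# Bit positions of the PCIe AER Uncorrectable Error Status register.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_aer(status, mask, severity):
    """Return (bit, name, masked, fatal) for every known bit set in status."""
    decoded = []
    for bit, name in sorted(AER_UNCOR_BITS.items()):
        flag = 1 << bit
        if status & flag:
            decoded.append((bit, name, bool(mask & flag), bool(severity & flag)))
    return decoded


# Values copied from the aer_uncor_status, aer_uncor_mask and
# aer_uncor_severity fields in the log.
for bit, name, masked, fatal in decode_aer(0x00100000, 0x00010000, 0x000EF030):
    print(f"bit {bit}: {name} (masked={masked}, fatal={fatal})")
```

With the values from the log, the only bit set is bit 20, which agrees with the `[20] UnsupReq` line printed by the tg3 driver. That bit is neither masked nor marked fatal in the severity register, which is consistent with the "event severity: recoverable" line.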
  
- I  have observed the following. when I test older kernel
+ The hardware partner did some bisection and observed the following:
  
  Kernel version   Fatal Error
  5.4.0-42.46      No
  5.4.0-45.49      No
  5.4.0-47.51      No
  5.4.0-48.52      No
  5.4.0-51.56      No
  5.4.0-52.57      No
  5.4.0-53.59      No
  5.4.0-54.60      No
  5.4.0-58.64      No
  5.4.0-59.65      Yes
  5.4.0-60.67      Yes
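
As an illustration of how the table above narrows the regression window, here is a small sketch that sorts the tested kernel versions and reports the last good and first bad one. The helper names `version_key` and `regression_window` are hypothetical, and the data is transcribed from the table rather than measured independently.

```python
# Test results transcribed from the table: True means the fatal error was seen.
RESULTS = {
    "5.4.0-42.46": False,
    "5.4.0-45.49": False,
    "5.4.0-47.51": False,
    "5.4.0-48.52": False,
    "5.4.0-51.56": False,
    "5.4.0-52.57": False,
    "5.4.0-53.59": False,
    "5.4.0-54.60": False,
    "5.4.0-58.64": False,
    "5.4.0-59.65": True,
    "5.4.0-60.67": True,
}


def version_key(version):
    """Turn '5.4.0-58.64' into (5, 4, 0, 58, 64) so versions sort numerically."""
    base, _, abi = version.partition("-")
    return tuple(int(part) for part in base.split(".") + abi.split("."))


def regression_window(results):
    """Return (last_good, first_bad); first_bad is None if nothing failed."""
    last_good = None
    for version in sorted(results, key=version_key):
        if results[version]:
            return last_good, version
        last_good = version
    return last_good, None


last_good, first_bad = regression_window(RESULTS)
print(f"last good: {last_good}, first bad: {first_bad}")
```

The resulting window, 5.4.0-58.64 to 5.4.0-59.65, is the range in which the offending commit was pulled into Ubuntu.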
  
- later I have bisect kernel between 5.4.0-58.64 and 5.4.0-59.65.
+ [FIX]
+ The fix is to apply this patch from upstream:
  
- looks like due to the following patch we are observing this issue. The
- driver is not handling D3 state properly
+ commit 9d3fcb28f9b9750b474811a2964ce022df56336e
+ Author: Josef Bacik <jo...@toxicpanda.com>
+ Date:   Tue Mar 16 22:17:48 2021 -0400
  
- PCI/ACPI: Whitelist hotplug ports for D3 if power managed by ACPI
+     Revert "PM: ACPI: reboot: Use S5 for reboot"
+     
+     This reverts commit d60cd06331a3566d3305b3c7b566e79edf4e2095.
+     
+     This patch causes a panic when rebooting my Dell Poweredge r440.  I do
+     not have the full panic log as it's lost at that stage of the reboot and
+     I do not have a serial console.  Reverting this patch makes my system
+     able to reboot again.
  
- https://kernel.ubuntu.com/git/ubuntu/ubuntu-
- focal.git/commit/?id=b9319dd02269593911403dd5d684368bcef3261d
+ Example:
+ https://code.launchpad.net/~bladernr/ubuntu/+source/linux/+git/focal/+ref/1917471
+ 
+ [TEST CASE]
+ Install the patched kernel on a machine that uses a BCM5720 LOM, reboot
+ the machine, and confirm that the errors no longer appear.
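
One possible way to automate the check in the test case is sketched below. It assumes the console output of the reboot has already been captured as text, for example from a serial console or BMC log, and searches it for the two message signatures shown in the report. The function name `find_aer_errors` and the sample strings are illustrative only.

```python
import re

# Signatures taken from the log in this report: the APEI hardware error
# records and the AER lines printed by the tg3 driver.
SIGNATURES = (
    re.compile(r"\[Hardware Error\]"),
    re.compile(r"\btg3 [0-9a-f:.]+: AER:"),
)


def find_aer_errors(log_text):
    """Return the log lines that match any of the known error signatures."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in SIGNATURES)
    ]


# Three lines copied from the report: one harmless, two error lines.
SAMPLE = "\n".join(
    [
        "[  148.762067] ACPI: Preparing to enter system sleep state S5",
        "[  148.803731] {1}[Hardware Error]: event severity: recoverable",
        "[  148.902795] tg3 0000:04:00.0: AER:    [20] UnsupReq               (First)",
    ]
)

hits = find_aer_errors(SAMPLE)
print(f"{len(hits)} error line(s) found")
```

An empty result on a log captured from the patched kernel would suggest the fix is effective, provided the capture really includes the final messages of the reboot.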

** Summary changed:

- [Regression] Bus Fatal Error observed when reboot on BCM5720
+ [SRU][Regression] Bus Fatal Error observed when reboot on BCM5720

** Summary changed:

- [SRU][Regression] Bus Fatal Error observed when reboot on BCM5720
+ [SRU][Regression] Revert "PM: ACPI: reboot: Use S5 for reboot" which causes Bus Fatal Error when rebooting system with BCM5720 NIC

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1917471

Title:
  [SRU][Regression] Revert "PM: ACPI: reboot: Use S5 for reboot" which
  causes Bus Fatal Error when rebooting system with BCM5720 NIC

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1917471/+subscriptions


-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
