On Mon, 14 Jul 2025 09:21:25 +0000
"Loktionov, Aleksandr" < [email protected]> wrote:

> > On Sun, 2025-05-04 at 13:45 +0200, Laurent Bonnaud wrote:
> > [...]
> > > - Previously the kernel would output an error in
> > >   /var/lib/systemd/pstore/ but would shutdown anyway.
> > >
> > > - Now, with kernel 6.1.135-1, the shutdown is blocked as with
> > >   6.12.x kernels (see below).
> > > <30>[ 961.098671] systemd-shutdown[1]: Rebooting.
> > > <6>[ 961.098743] kvm: exiting hardware virtualization
> > > <6>[ 961.361878] megaraid_sas 0000:17:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
> > > <6>[ 961.414526] ACPI: PM: Preparing to enter system sleep state S5
> > > <0>[ 963.828210] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
> > > <0>[ 963.828213] {1}[Hardware Error]: event severity: fatal
> > > <0>[ 963.828214] {1}[Hardware Error]: Error 0, type: fatal
> > > <0>[ 963.828216] {1}[Hardware Error]: section_type: PCIe error
> > > <0>[ 963.828216] {1}[Hardware Error]: port_type: 0, PCIe end point
> > > <0>[ 963.828217] {1}[Hardware Error]: version: 3.0
> > > <0>[ 963.828218] {1}[Hardware Error]: command: 0x0002, status: 0x0010
> > > <0>[ 963.828220] {1}[Hardware Error]: device_id: 0000:01:00.1
> > > <0>[ 963.828221] {1}[Hardware Error]: slot: 6
> > > <0>[ 963.828222] {1}[Hardware Error]: secondary_bus: 0x00
> > > <0>[ 963.828223] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x1563
> > > <0>[ 963.828224] {1}[Hardware Error]: class_code: 020000
> > > <0>[ 963.828225] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00018000
> > > <0>[ 963.828226] {1}[Hardware Error]: aer_uncor_severity: 0x000ef010
> > > <0>[ 963.828227] {1}[Hardware Error]: TLP Header: 40000001 0000000f 90028090 00000000
> > [...]
> >
> > It seems that this is a known bug in the BIOS of several Dell
> > PowerEdge models including (in this case) the R540.

Yup, R730XD here.

> > A workaround was added to the tg3 driver
> > <https://git.kernel.org/linus/e0efe83ed325277bb70f9435d4d9fc70bebdcca8>
> > and a similar change was proposed (but not accepted) in the i40e
> > driver <https://lore.kernel.org/all/20241227035459.90602-1-[email protected]/>.
> > On this system the error log points to a device handled by the ixgbe
> > driver, and no workaround has been implemented for that.
> >
> > Since this issue seems to affect multiple different NIC vendors and
> > drivers, would it make more sense to implement this workaround as a
> > PCI quirk?

It's not just network devices either.

<5>[965917.449277] sd 4:0:0:0: [sda] Synchronizing SCSI cache
<6>[965917.614364] [drm] PCIE GART of 256M enabled (table at 0x000000F47FF80000).
<6>[965917.820364] [drm] UVD and UVD ENC initialized successfully.
<6>[965917.921559] [drm] VCE initialized successfully.
<6>[965917.926574] amdgpu 0000:04:00.0: [drm] Cannot find any crtc or sizes
<6>[965917.934684] amdgpu 0000:04:00.0: [drm] Cannot find any crtc or sizes
<0>[965919.725575] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
<0>[965919.725582] {1}[Hardware Error]: event severity: fatal
<0>[965919.725587] {1}[Hardware Error]: Error 0, type: fatal
<0>[965919.725591] {1}[Hardware Error]: section_type: PCIe error
<0>[965919.725595] {1}[Hardware Error]: port_type: 1, legacy PCI end point
<0>[965919.725598] {1}[Hardware Error]: version: 1.16
<0>[965919.725602] {1}[Hardware Error]: command: 0x0407, status: 0x0010
<0>[965919.725607] {1}[Hardware Error]: device_id: 0000:04:00.1
<0>[965919.725611] {1}[Hardware Error]: slot: 0
<0>[965919.725614] {1}[Hardware Error]: secondary_bus: 0x00
<0>[965919.725617] {1}[Hardware Error]: vendor_id: 0x1002, device_id: 0xaae0
<0>[965919.725622] {1}[Hardware Error]: class_code: 040300
<0>[965919.725625] {1}[Hardware Error]: aer_cor_status: 0x00002000, aer_cor_mask: 0x000031c0
<0>[965919.725630] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
<0>[965919.725635] {1}[Hardware Error]: aer_uncor_severity: 0x004e7030
<0>[965919.725638] {1}[Hardware Error]: TLP Header: 40008001 00000a0f 96a121a0 00000000
<0>[965919.725646] GHES: Fatal hardware error but panic disabled
<0>[965919.725650] Kernel panic - not syncing: GHES: Fatal hardware error
<4>[965919.725662] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: P O 6.14.11-5-bpo12-pve #1
<4>[965919.725676] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
<4>[965919.725689] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.19.0 12/12/2023
<4>[965919.725694] Call Trace:
<4>[965919.725700] <NMI>
<4>[965919.725706] dump_stack_lvl+0x27/0xa0
<4>[965919.725722] dump_stack+0x10/0x20
<4>[965919.725729] panic+0x358/0x3b0
<4>[965919.725742] __ghes_panic+0x60/0x80
<4>[965919.725756] ghes_notify_nmi+0x1d5/0x380
<4>[965919.725768] nmi_handle.part.0+0x58/0x160
<4>[965919.725781] default_do_nmi+0x131/0x170
<4>[965919.725792] exc_nmi+0x1c4/0x290
<4>[965919.725799] end_repeat_nmi+0xf/0x53
<4>[965919.725816] RIP: 0010:intel_idle+0x51/0x90
<4>[965919.725824] Code: 2d 80 ca 2b 00 eb 52 cc cc cc 48 89 f0 0f 1f 00 31 d2 48 89 d1 0f 01 c8 48 8b 06 a8 08 75 0b b9 01 00 00 00 4c 89 c0 0f 01 c9 <f0> 80 66 02 df f0 83 44 24 fc 00 48 8b 06 a8 08 74 0b 65 81 25 ea
<4>[965919.725830] RSP: 0018:ffffffff8ec03db0 EFLAGS: 00000046
<4>[965919.725837] RAX: 0000000000000020 RBX: ffff8aa2ffa44680 RCX: 0000000000000001
<4>[965919.725841] RDX: 0000000000000000 RSI: ffffffff8ec107c0 RDI: 0000000000000004
<4>[965919.725849] RBP: ffffffff8ec03df0 R08: 0000000000000020 R09: 0000000000000000
<4>[965919.725854] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004
<4>[965919.725857] R13: ffffffff8ee86960 R14: ffffffff8ee86b18 R15: 0000000000000004
<4>[965919.725866] ? intel_idle+0x51/0x90
<4>[965919.725873] ? intel_idle+0x51/0x90
<4>[965919.725879] </NMI>
<4>[965919.725882] <TASK>
<4>[965919.725884] ? cpuidle_enter_state+0x85/0x450
<4>[965919.725895] cpuidle_enter+0x2e/0x50
<4>[965919.725908] call_cpuidle+0x22/0x60
<4>[965919.725918] do_idle+0x1de/0x240
<4>[965919.725925] cpu_startup_entry+0x29/0x30
<4>[965919.725930] rest_init+0xd0/0xd0
<4>[965919.725934] start_kernel+0x779/0xb60
<4>[965919.725941] ? load_ucode_intel_bsp+0x43/0xa0
<4>[965919.725952] x86_64_start_reservations+0x18/0x30
<4>[965919.725961] x86_64_start_kernel+0xbf/0x110
<4>[965919.725968] common_startup_64+0x13e/0x141
<4>[965919.725980] </TASK>
<0>[965919.726136] Kernel Offset: 0xb600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] (rev c7) (prog-if 00 [VGA controller])
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 174
        NUMA node: 0
        IOMMU group: 29
        Region 0: Memory at 38000000000 (64-bit, prefetchable) [size=2G]
        Region 2: Memory at 38080000000 (64-bit, prefetchable) [size=2M]
        Region 4: I/O ports at 2000 [size=256]
        Region 5: Memory at 96a00000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at 96a60000 [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu

04:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin B routed to IRQ 172
        NUMA node: 0
        IOMMU group: 30
        Region 0: Memory at 96a40000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: <access denied>
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel

The card was completely idle and unused for the whole boot session, and
the reboot was routine after patching. The kernel is 6.14.11-5-bpo12 from
Proxmox backports (so Ubuntu backports, essentially).

> I support the idea of PCI workaround, but who will implement it?

