Bug#1104670: [Intel-wired-lan] Bug#1104670: linux-image-6.12.25-amd64: system does not shut down - GHES: Fatal hardware error
> -Original Message-
> From: Intel-wired-lan On Behalf Of Ben Hutchings
> Sent: Saturday, July 12, 2025 5:13 PM
> To: [email protected]; linux-pci [email protected]>; Pavan Chebbi; Michael Chan
> Cc: Laurent Bonnaud; [email protected]; [email protected]
> Subject: Re: [Intel-wired-lan] Bug#1104670: linux-image-6.12.25-amd64: system does not shut down - GHES: Fatal hardware error
>
> Hi all,
>
> On Sun, 2025-05-04 at 13:45 +0200, Laurent Bonnaud wrote:
> [...]
> > - Previously the kernel would output an error in /var/lib/systemd/pstore/ but would shut down anyway.
> >
> > - Now, with kernel 6.1.135-1, the shutdown is blocked as with 6.12.x kernels (see below).
> > --
> > Laurent.
> >
> > <30>[ 961.098671] systemd-shutdown[1]: Rebooting.
> > <6>[ 961.098743] kvm: exiting hardware virtualization
> > <6>[ 961.361878] megaraid_sas :17:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x4009
> > <6>[ 961.414526] ACPI: PM: Preparing to enter system sleep state S5
> > <0>[ 963.828210] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
> > <0>[ 963.828213] {1}[Hardware Error]: event severity: fatal
> > <0>[ 963.828214] {1}[Hardware Error]: Error 0, type: fatal
> > <0>[ 963.828216] {1}[Hardware Error]: section_type: PCIe error
> > <0>[ 963.828216] {1}[Hardware Error]: port_type: 0, PCIe end point
> > <0>[ 963.828217] {1}[Hardware Error]: version: 3.0
> > <0>[ 963.828218] {1}[Hardware Error]: command: 0x0002, status: 0x0010
> > <0>[ 963.828220] {1}[Hardware Error]: device_id: :01:00.1
> > <0>[ 963.828221] {1}[Hardware Error]: slot: 6
> > <0>[ 963.828222] {1}[Hardware Error]: secondary_bus: 0x00
> > <0>[ 963.828223] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x1563
> > <0>[ 963.828224] {1}[Hardware Error]: class_code: 02
> > <0>[ 963.828225] {1}[Hardware Error]: aer_uncor_status: 0x0010, aer_uncor_mask: 0x00018000
> > <0>[ 963.828226] {1}[Hardware Error]: aer_uncor_severity: 0x000ef010
> > <0>[ 963.828227] {1}[Hardware Error]: TLP Header: 4001 000f 90028090
> [...]
>
> It seems that this is a known bug in the BIOS of several Dell PowerEdge
> models including (in this case) the R540.
>
> A workaround was added to the tg3 driver
> <https://git.kernel.org/linus/e0efe83ed325277bb70f9435d4d9fc70bebdcca8>
> and a similar change was proposed (but not accepted) in the i40e driver
> <https://lore.kernel.org/all/20241227035459.90602-1-[email protected]/>.
>
> On this system the error log points to a device handled by the ixgbe
> driver, and no workaround has been implemented for that.
>
> Since this issue seems to affect multiple different NIC vendors and
> drivers, would it make more sense to implement this workaround as a
> PCI quirk?

I support the idea of a PCI workaround, but who will implement it?

Alex

> Ben.
>
> --
> Ben Hutchings
> Experience is directly proportional to the value of equipment destroyed
> - Carolyn Scheppner
Bug#1104670: linux-image-6.12.25-amd64: system does not shut down - GHES: Fatal hardware error
Hi all,
On Sun, 2025-05-04 at 13:45 +0200, Laurent Bonnaud wrote:
[...]
> - Previously the kernel would output an error in /var/lib/systemd/pstore/
> but would shut down anyway.
>
> - Now, with kernel 6.1.135-1, the shutdown is blocked as with 6.12.x
> kernels (see below).
> --
> Laurent.
>
> <30>[ 961.098671] systemd-shutdown[1]: Rebooting.
> <6>[ 961.098743] kvm: exiting hardware virtualization
> <6>[ 961.361878] megaraid_sas :17:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x4009
> <6>[ 961.414526] ACPI: PM: Preparing to enter system sleep state S5
> <0>[ 963.828210] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
> <0>[ 963.828213] {1}[Hardware Error]: event severity: fatal
> <0>[ 963.828214] {1}[Hardware Error]: Error 0, type: fatal
> <0>[ 963.828216] {1}[Hardware Error]: section_type: PCIe error
> <0>[ 963.828216] {1}[Hardware Error]: port_type: 0, PCIe end point
> <0>[ 963.828217] {1}[Hardware Error]: version: 3.0
> <0>[ 963.828218] {1}[Hardware Error]: command: 0x0002, status: 0x0010
> <0>[ 963.828220] {1}[Hardware Error]: device_id: :01:00.1
> <0>[ 963.828221] {1}[Hardware Error]: slot: 6
> <0>[ 963.828222] {1}[Hardware Error]: secondary_bus: 0x00
> <0>[ 963.828223] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x1563
> <0>[ 963.828224] {1}[Hardware Error]: class_code: 02
> <0>[ 963.828225] {1}[Hardware Error]: aer_uncor_status: 0x0010, aer_uncor_mask: 0x00018000
> <0>[ 963.828226] {1}[Hardware Error]: aer_uncor_severity: 0x000ef010
> <0>[ 963.828227] {1}[Hardware Error]: TLP Header: 4001 000f 90028090
[...]
It seems that this is a known bug in the BIOS of several Dell PowerEdge
models including (in this case) the R540.
A workaround was added to the tg3 driver
<https://git.kernel.org/linus/e0efe83ed325277bb70f9435d4d9fc70bebdcca8>
and a similar change was proposed (but not accepted) in the i40e driver.
On this system the error log points to a device handled by the ixgbe
driver, and no workaround has been implemented for that.
Since this issue seems to affect multiple different NIC vendors and
drivers, would it make more sense to implement this workaround as a PCI
quirk?
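
To make the idea concrete, here is a rough sketch of the logic such a quirk might apply, written as plain Python rather than kernel C so it can be run anywhere. It assumes the workaround amounts to clearing the four error-reporting enable bits in the PCIe Device Control register just before a restart. The bit values are the PCI_EXP_DEVCTL_* constants from the kernel's pci_regs.h; the device table and the function name are hypothetical, and a real quirk would presumably also need to match the affected Dell platforms rather than only the NIC.

```python
# PCIe Device Control register error-reporting enable bits
# (values as in the Linux PCI_EXP_DEVCTL_* definitions).
PCI_EXP_DEVCTL_CERE = 0x0001   # Correctable Error Reporting Enable
PCI_EXP_DEVCTL_NFERE = 0x0002  # Non-Fatal Error Reporting Enable
PCI_EXP_DEVCTL_FERE = 0x0004   # Fatal Error Reporting Enable
PCI_EXP_DEVCTL_URRE = 0x0008   # Unsupported Request Reporting Enable

ERROR_REPORTING_MASK = (
    PCI_EXP_DEVCTL_CERE
    | PCI_EXP_DEVCTL_NFERE
    | PCI_EXP_DEVCTL_FERE
    | PCI_EXP_DEVCTL_URRE
)

# Hypothetical match table of (vendor_id, device_id) pairs.
# 0x8086:0x1563 is the device named in the error record of this report.
QUIRK_DEVICES = {(0x8086, 0x1563)}


def devctl_for_restart(vendor_id, device_id, devctl, restarting):
    """Return the Device Control value to write back before a restart.

    Error reporting is switched off only for listed devices and only when
    the system is restarting; in every other case the value is unchanged.
    """
    if restarting and (vendor_id, device_id) in QUIRK_DEVICES:
        return devctl & ~ERROR_REPORTING_MASK & 0xFFFF
    return devctl


if __name__ == "__main__":
    print(hex(devctl_for_restart(0x8086, 0x1563, 0x002F, restarting=True)))
```

Doing this in one place keyed on the device (and platform) would avoid repeating the same change in tg3, i40e, ixgbe and any other affected driver.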
Ben.
--
Ben Hutchings
Experience is directly proportional to the value of equipment destroyed
- Carolyn Scheppner
Bug#1104670: linux-image-6.12.25-amd64: system does not shut down - GHES: Fatal hardware error
Hi,
this bug is similar to bug #1053750 and bug #1034718 that have been archived.
In the 6.1.x kernel branch, the problem has become worse:
- Previously the kernel would output an error in /var/lib/systemd/pstore/ but
would shut down anyway.
- Now, with kernel 6.1.135-1, the shutdown is blocked as with 6.12.x kernels
(see below).
--
Laurent.
<30>[ 961.098671] systemd-shutdown[1]: Rebooting.
<6>[ 961.098743] kvm: exiting hardware virtualization
<6>[ 961.361878] megaraid_sas :17:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x4009
<6>[ 961.414526] ACPI: PM: Preparing to enter system sleep state S5
<0>[ 963.828210] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
<0>[ 963.828213] {1}[Hardware Error]: event severity: fatal
<0>[ 963.828214] {1}[Hardware Error]: Error 0, type: fatal
<0>[ 963.828216] {1}[Hardware Error]: section_type: PCIe error
<0>[ 963.828216] {1}[Hardware Error]: port_type: 0, PCIe end point
<0>[ 963.828217] {1}[Hardware Error]: version: 3.0
<0>[ 963.828218] {1}[Hardware Error]: command: 0x0002, status: 0x0010
<0>[ 963.828220] {1}[Hardware Error]: device_id: :01:00.1
<0>[ 963.828221] {1}[Hardware Error]: slot: 6
<0>[ 963.828222] {1}[Hardware Error]: secondary_bus: 0x00
<0>[ 963.828223] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x1563
<0>[ 963.828224] {1}[Hardware Error]: class_code: 02
<0>[ 963.828225] {1}[Hardware Error]: aer_uncor_status: 0x0010, aer_uncor_mask: 0x00018000
<0>[ 963.828226] {1}[Hardware Error]: aer_uncor_severity: 0x000ef010
<0>[ 963.828227] {1}[Hardware Error]: TLP Header: 4001 000f 90028090
<0>[ 963.828229] GHES: Fatal hardware error but panic disabled
<0>[ 963.828230] Kernel panic - not syncing: GHES: Fatal hardware error
<4>[ 963.828231] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.1.0-34-amd64 #1 Debian 6.1.135-1
<4>[ 963.828234] Hardware name: Dell Inc. PowerEdge R540/0PRWNC, BIOS 2.23.0 01/09/2025
<4>[ 963.828235] Call Trace:
<4>[ 963.828238]
<4>[ 963.828240] dump_stack_lvl+0x44/0x5c
<4>[ 963.828247] panic+0x118/0x2f4
<4>[ 963.828253] __ghes_panic.cold+0x28/0x28
<4>[ 963.828258] ghes_notify_nmi+0x1db/0x370
<4>[ 963.828263] nmi_handle+0x5a/0x120
<4>[ 963.828269] default_do_nmi+0x40/0x130
<4>[ 963.828273] exc_nmi+0x11e/0x150
<4>[ 963.828276] end_repeat_nmi+0x16/0x67
<4>[ 963.828281] RIP: 0010:mwait_idle_with_hints.constprop.0+0x48/0x90
<4>[ 963.828286] Code: 48 89 d1 65 48 8b 04 25 80 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 8f ac b0 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 80 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
<4>[ 963.828287] RSP: 0018:b2e03e18 EFLAGS: 0046
<4>[ 963.828290] RAX: 0020 RBX: 0003 RCX: 0001
<4>[ 963.828292] RDX: RSI: b2fa0160 RDI: 0020
<4>[ 963.828293] RBP: 0003 R08: 0002 R09: 3a518aaa
<4>[ 963.828295] R10: 0018 R11: afc8 R12: b2fa0160
<4>[ 963.828296] R13: b2fa02b0 R14: 0003 R15:
<4>[ 963.828300] ? mwait_idle_with_hints.constprop.0+0x48/0x90
<4>[ 963.828303] ? mwait_idle_with_hints.constprop.0+0x48/0x90
<4>[ 963.828305]
<4>[ 963.828306]
<4>[ 963.828307] intel_idle_ibrs+0x75/0x90
<4>[ 963.828309] cpuidle_enter_state+0x89/0x420
<4>[ 963.828315] cpuidle_enter+0x29/0x40
<4>[ 963.828317] do_idle+0x202/0x2a0
<4>[ 963.828323] cpu_startup_entry+0x26/0x30
<4>[ 963.828326] rest_init+0xca/0xd0
<4>[ 963.828328] arch_call_rest_init+0xa/0x14
<4>[ 963.828333] start_kernel+0x70a/0x733
<4>[ 963.828336] secondary_startup_64_no_verify+0xe5/0xeb
<4>[ 963.828343]
<0>[ 963.828357] Kernel Offset: 0x3040 from 0x8100 (relocation range: 0x8000-0xbfff)
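
The AER register values in the record above can be decoded by hand. The following is a minimal sketch, assuming the standard bit layout of the PCIe AER Uncorrectable Error Status register (the same positions the kernel defines as PCI_ERR_UNC_* in pci_regs.h, shared by the mask and severity registers). The function names are made up for illustration.

```python
# Bit positions of the PCIe AER Uncorrectable Error Status register.
# The mask and severity registers use the same layout.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_aer_uncor(value):
    """Return the names of all known bits set in an AER uncorrectable value."""
    return [
        name
        for bit, name in sorted(AER_UNCOR_BITS.items())
        if value & (1 << bit)
    ]


def reported_fatal(status, mask, severity):
    """Names of errors that are raised, not masked, and classified as fatal."""
    return decode_aer_uncor(status & ~mask & severity)


if __name__ == "__main__":
    # Values taken from the error record in this report.
    status, mask, severity = 0x0010, 0x00018000, 0x000EF010
    print("status:  ", decode_aer_uncor(status))
    print("mask:    ", decode_aer_uncor(mask))
    print("severity:", decode_aer_uncor(severity))
    print("fatal:   ", reported_fatal(status, mask, severity))
```

With the values from this report, the only status bit set is bit 4 (Data Link Protocol Error). It is not masked and is marked fatal in the severity register, which is consistent with the "event severity: fatal" line.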
Bug#1104670: linux-image-6.12.25-amd64: system does not shut down - GHES: Fatal hardware error
Package: src:linux
Version: 6.12.25-1
Severity: normal
Dear Maintainer,
when I try to reboot this system (by entering the "reboot" command), the screen
becomes black and then nothing happens. The system never finishes its shutdown.
Here is a debug log from /var/lib/systemd/pstore:
<30>[ 642.476392] systemd-shutdown[1]: Rebooting.
<6>[ 642.763811] megaraid_sas :17:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x4009
<6>[ 643.140087] ACPI: PM: Preparing to enter system sleep state S5
<0>[ 646.279684] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
<0>[ 646.279686] {1}[Hardware Error]: event severity: fatal
<0>[ 646.279688] {1}[Hardware Error]: Error 0, type: fatal
<0>[ 646.279690] {1}[Hardware Error]: section_type: PCIe error
<0>[ 646.279691] {1}[Hardware Error]: port_type: 0, PCIe end point
<0>[ 646.279692] {1}[Hardware Error]: version: 3.0
<0>[ 646.279693] {1}[Hardware Error]: command: 0x0002, status: 0x0010
<0>[ 646.279694] {1}[Hardware Error]: device_id: :01:00.1
<0>[ 646.279696] {1}[Hardware Error]: slot: 6
<0>[ 646.279697] {1}[Hardware Error]: secondary_bus: 0x00
<0>[ 646.279698] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x1563
<0>[ 646.279699] {1}[Hardware Error]: class_code: 02
<0>[ 646.279700] {1}[Hardware Error]: aer_cor_status: 0x2000, aer_cor_mask: 0x31c1
<0>[ 646.279701] {1}[Hardware Error]: aer_uncor_status: 0x0010, aer_uncor_mask: 0x00018000
<0>[ 646.279702] {1}[Hardware Error]: aer_uncor_severity: 0x000ef010
<0>[ 646.279703] {1}[Hardware Error]: TLP Header: 4001 030f 90028090
<0>[ 646.279706] GHES: Fatal hardware error but panic disabled
<0>[ 646.279707] Kernel panic - not syncing: GHES: Fatal hardware error
<4>[ 646.279709] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: GW 6.12.25-amd64 #1 Debian 6.12.25-1
<4>[ 646.279714] Tainted: [W]=WARN
<4>[ 646.279714] Hardware name: Dell Inc. PowerEdge R540/0PRWNC, BIOS 2.23.0 01/09/2025
<4>[ 646.279716] Call Trace:
<4>[ 646.279719]
<4>[ 646.279721] dump_stack_lvl+0x5d/0x80
<4>[ 646.279727] panic+0x118/0x2db
<4>[ 646.279734] __ghes_panic.cold+0x28/0x28
<4>[ 646.279738] ghes_notify_nmi+0x30e/0x3b0
<4>[ 646.279742] nmi_handle+0x5e/0x120
<4>[ 646.279747] default_do_nmi+0x40/0x130
<4>[ 646.279751] exc_nmi+0x122/0x1a0
<4>[ 646.279754] end_repeat_nmi+0xf/0x53
<4>[ 646.279758] RIP: 0010:intel_idle_ibrs+0x87/0x100
<4>[ 646.279762] Code: 3e 0f ae f0 31 d2 48 89 f0 48 89 d1 0f 01 c8 48 8b 06 a8 08 75 14 66 90 0f 00 2d 80 1a 41 00 b9 01 00 00 00 4c 89 c8 0f 01 c9 80 66 02 df f0 83 44 24 fc 00 48 8b 06 a8 08 74 0b 65 81 25 84
<4>[ 646.279764] RSP: 0018:b5603e28 EFLAGS: 0046
<4>[ 646.279767] RAX: 0020 RBX: 0003 RCX: 0001
<4>[ 646.279768] RDX: RSI: b5610940 RDI: 0001
<4>[ 646.279770] RBP: b57b6e40 R08: R09: 0020
<4>[ 646.279771] R10: 0008 R11: 96781fc34764 R12: b57b6e40
<4>[ 646.279772] R13: b57b6f90 R14: 0003 R15:
<4>[ 646.279776] ? intel_idle_ibrs+0x87/0x100
<4>[ 646.279780] ? intel_idle_ibrs+0x87/0x100
<4>[ 646.279783]
<4>[ 646.279783]
<4>[ 646.279784] cpuidle_enter_state+0x7e/0x420
<4>[ 646.279788] cpuidle_enter+0x2d/0x40
<4>[ 646.279792] do_idle+0x1e5/0x240
<4>[ 646.279798] cpu_startup_entry+0x29/0x30
<4>[ 646.279800] rest_init+0xcc/0xd0
<4>[ 646.279803] start_kernel+0x74c/0x750
<4>[ 646.279809] x86_64_start_reservations+0x24/0x30
<4>[ 646.279813] x86_64_start_kernel+0x95/0xa0
<4>[ 646.279816] common_startup_64+0x13e/0x141
<4>[ 646.279823]
<0>[ 646.279837] Kernel Offset: 0x32a0 from 0x8100 (relocation range: 0x8000-0xbfff)
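
For anyone who wants to pull the fields out of a pstore dump like the one above, here is a small sketch of a parser for the "[Hardware Error]" lines. It is only an illustration: the function name and the regular expressions are my own, and it assumes each log record sits on a single line (lines wrapped by a mail client would need to be rejoined first). Keys that occur more than once, such as device_id, keep all their values in order.

```python
import re

# Matches the payload after the "{N}[Hardware Error]:" prefix, e.g.
# "<0>[ 646.279698] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x1563"
HWERR_RE = re.compile(r"\{\d+\}\[Hardware Error\]:\s*(.*)$")
# Matches "key: value" pairs, where a value runs up to the next comma.
FIELD_RE = re.compile(r"(\w+):\s*([^,]+)")


def parse_hw_error(lines):
    """Collect key/value pairs from [Hardware Error] log lines.

    Returns a dict mapping each key to the list of values seen for it,
    in the order they appeared. Lines without the prefix are ignored.
    """
    fields = {}
    for line in lines:
        match = HWERR_RE.search(line)
        if not match:
            continue
        for key, value in FIELD_RE.findall(match.group(1)):
            fields.setdefault(key, []).append(value.strip())
    return fields


if __name__ == "__main__":
    sample = [
        "<6>[ 643.140087] ACPI: PM: Preparing to enter system sleep state S5",
        "<0>[ 646.279694] {1}[Hardware Error]: device_id: :01:00.1",
        "<0>[ 646.279698] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x1563",
        "<0>[ 646.279701] {1}[Hardware Error]: aer_uncor_status: 0x0010, aer_uncor_mask: 0x00018000",
    ]
    for key, values in parse_hw_error(sample).items():
        print(key, values)
```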
Thanks,
-- Package-specific info:
** Version:
Linux version 6.12.25-amd64 ([email protected])
(x86_64-linux-gnu-gcc-14 (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for
Debian) 2.44) #1 SMP PREEMPT_DYNAMIC Debian 6.12.25-1 (2025-04-25)
** Command line:
BOOT_IMAGE=/boot/vmlinuz-6.12.25-amd64
root=UUID=d4026c7c-61cc-435f-81c5-76194e22454e ro quiet net.ifnames=0
** Tainted: W (512)
* kernel issued warning
** Kernel log:
Unable to read kernel log; any relevant messages should be attached
** Model information
sys_vendor: Dell Inc.
product_name: PowerEdge R540
product_version:
chassis_vendor: Dell Inc.
chassis_version:
bios_vendor: Dell Inc.
bios_version: 2.23.0
board_vendor: Dell Inc.
board_name: 0PRWNC
board_version: A07
** Configuration for modprobe:
blacklist acpi_power_meter
blacklist arkfb
blacklist aty128fb
blacklist atyfb
blacklist radeonfb
blacklist cirrusfb
blacklist cyber2000fb
blacklist kyrofb
blacklist matroxfb_base
blacklist mb862xxfb
blacklist neofb
blacklist pm2fb
blacklist pm3fb
blacklist s3fb
blacklist savagefb
blacklist sisfb
blacklist t

