[cc += Tony, Przemek (ice driver maintainers), start of thread is here:
https://lore.kernel.org/all/[email protected]/
]

On Mon, Feb 02, 2026 at 02:00:55PM +0800, LeoLiu-oc wrote:
> The kernel version I am using is 6.18.6.
[...]
> The complete log of the kernel panic is as follows:
> 
> [  100.304077][  T843] list_del corruption, ffff8881418b79e8->next is 
> LIST_POISON1 (dead000000000100)
> [  100.312989][  T843] ------------[ cut here ]------------
> [  100.318268][  T843] kernel BUG at lib/list_debug.c:56!
> [  100.323380][  T843] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [  100.329250][  T843] CPU: 7 PID: 843 Comm: irq/27-pciehp Tainted: P    W  
> OE     ------- ----  6.6.0-32.7.v2505.ky11.x86_64 #1
> [  100.340793][  T843] Source Version: 
> 71d5b964051132b7772acd935972fca11462bbfe
> [  100.359228][  T843] RIP: 0010:__list_del_entry_valid_or_report+0x7f/0xc0
> [  100.365877][  T843] Code: 66 4b a6 e8 c3 43 a9 ff 0f 0b 48 89 fe 48 c7 c7 
> 10 67 4b a6 e8 b2 43 a9 ff 0f 0b 48 89 fe 48 c7 c7 40 67 4b a6 e8 a1 43 a9 ff 
> <0f> 0b 48 89 fe 48 89 ca 48 c7 c7 78 67 4b a6 e8 8d 43 a9 ff 0f 0b
> [  100.385158][  T843] RSP: 0018:ffffc9000f70fc08 EFLAGS: 00010246
> [  100.391024][  T843] RAX: 000000000000004e RBX: ffff8881418b79e8 RCX: 
> 0000000000000000
> [  100.398781][  T843] RDX: 0000000000000000 RSI: ffff8897df5a32c0 RDI: 
> ffff8897df5a32c0
> [  100.406538][  T843] RBP: ffff8881257f9608 R08: 0000000000000000 R09: 
> 0000000000000003
> [  100.414294][  T843] R10: ffffc9000f70fa90 R11: ffffffffa6fee508 R12: 
> 0000000000000000
> [  100.422050][  T843] R13: ffff8881257f9608 R14: ffff888116507c28 R15: 
> ffff888116507c28
> [  100.429807][  T843] FS:  0000000000000000(0000) GS:ffff8897df580000(0000) 
> knlGS:0000000000000000
> [  100.438511][  T843] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  100.444891][  T843] CR2: 00007f9563bac1c0 CR3: 0000000c4be26004 CR4: 
> 0000000000570ee0
> [  100.452647][  T843] PKRU: 55555554
> [  100.456017][  T843] Call Trace:
> [  100.459129][  T843]  <TASK>
> [  100.461898][  T843]  ice_flow_rem_entry_sync.constprop.0+0x1c/0x90 [ice]
> [  100.468663][  T843]  ice_flow_rem_entry+0x3d/0x60 [ice]
> [  100.473925][  T843]  ice_fdir_erase_flow_from_hw.constprop.0+0x9b/0x100 
> [ice]
> [  100.481078][  T843]  ice_fdir_rem_flow.constprop.0+0x32/0xb0 [ice]
> [  100.487284][  T843]  ice_vsi_manage_fdir+0x7b/0xb0 [ice]
> [  100.492629][  T843]  ice_deinit_features.part.0+0x46/0xc0 [ice]
> [  100.498571][  T843]  ice_remove+0xcf/0x220 [ice]
> [  100.503222][  T843]  pci_device_remove+0x3f/0xb0
> [  100.507798][  T843]  device_release_driver_internal+0x19d/0x220
> [  100.513667][  T843]  pci_stop_bus_device+0x6c/0x90
> [  100.518417][  T843]  pci_stop_and_remove_bus_device+0x12/0x20
> [  100.524110][  T843]  pciehp_unconfigure_device+0x9f/0x160
> [  100.529463][  T843]  pciehp_disable_slot+0x69/0x130
> [  100.534296][  T843]  pciehp_handle_presence_or_link_change+0xfc/0x210
> [  100.540678][  T843]  pciehp_ist+0x204/0x230
> [  100.544824][  T843]  ? __pfx_irq_thread_fn+0x10/0x10
> [  100.549747][  T843]  irq_thread_fn+0x20/0x60
> [  100.553978][  T843]  irq_thread+0xfb/0x1c0
> [  100.558038][  T843]  ? __pfx_irq_thread_dtor+0x10/0x10
> [  100.563130][  T843]  ? __pfx_irq_thread+0x10/0x10
> [  100.567791][  T843]  kthread+0xe5/0x120
> [  100.571594][  T843]  ? __pfx_kthread+0x10/0x10
> [  100.575997][  T843]  ret_from_fork+0x17a/0x1a0
> [  100.580403][  T843]  ? __pfx_kthread+0x10/0x10
> [  100.584805][  T843]  ret_from_fork_asm+0x1a/0x30
> [  100.589384][  T843]  </TASK>
> [  100.592237][  T843] Modules linked in: zxmem(OE) einj amdgpu amdxcp
> gpu_sched drm_exec drm_buddy nft_fib_inet nft_fib_ipv4 nft_fib_ipv6
> nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 zhaoxin_cputemp
> nf_defrag_ipv4 zhaoxin_rng snd_hda_codec_hdmi radeon rfkill
> snd_hda_intel snd_intel_dspcfg irdma i2c_algo_bit snd_intel_sdw_acpi
> ip_set i40e drm_suballoc_helper nf_tables drm_ttm_helper pcicfg(POE)
> snd_hda_codec ib_uverbs sunrpc ttm ib_core snd_hda_core
> drm_display_helper snd_hwdep kvm_intel snd_pcm cec vfat fat
> drm_kms_helper snd_timer kvm video ice snd psmouse soundcore wmi
> acpi_cpufreq pcspkr i2c_zhaoxin sg sch_fq_codel drm fuse backlight
> nfnetlink xfs sd_mod t10_pi sm2_zhaoxin_gmi crct10dif_pclmul
> crc32_pclmul ahci crc32c_intel libahci r8169 ghash_clmulni_intel libata
> sha512_ssse3 serio_raw realtek dm_mirror dm_region_hash dm_log
> dm_multipath dm_mod i2c_dev autofs4
> [  100.674508][  T843] ---[ end trace 0000000000000000 ]---
> [  100.709547][  T843] RIP: 0010:__list_del_entry_valid_or_report+0x7f/0xc0
> [  100.716197][  T843] Code: 66 4b a6 e8 c3 43 a9 ff 0f 0b 48 89 fe 48 c7 c7 
> 10 67 4b a6 e8 b2 43 a9 ff 0f 0b 48 89 fe 48 c7 c7 40 67 4b a6 e8 a1 43 a9 ff 
> <0f> 0b 48 89 fe 48 89 ca 48 c7 c7 78 67 4b a6 e8 8d 43 a9 ff 0f 0b
> [  100.735491][  T843] RSP: 0018:ffffc9000f70fc08 EFLAGS: 00010246
> [  100.741367][  T843] RAX: 000000000000004e RBX: ffff8881418b79e8 RCX: 
> 0000000000000000
> [  100.749137][  T843] RDX: 0000000000000000 RSI: ffff8897df5a32c0 RDI: 
> ffff8897df5a32c0
> [  100.756909][  T843] RBP: ffff8881257f9608 R08: 0000000000000000 R09: 
> 0000000000000003
> [  100.764678][  T843] R10: ffffc9000f70fa90 R11: ffffffffa6fee508 R12: 
> 0000000000000000
> [  100.772448][  T843] R13: ffff8881257f9608 R14: ffff888116507c28 R15: 
> ffff888116507c28
> [  100.780218][  T843] FS:  0000000000000000(0000) GS:ffff8897df580000(0000) 
> knlGS:0000000000000000
> [  100.788934][  T843] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  100.795329][  T843] CR2: 00007f9563bac1c0 CR3: 0000000c4be26004 CR4: 
> 0000000000570ee0
> [  100.803099][  T843] PKRU: 55555554
> [  100.806483][  T843] Kernel panic - not syncing: Fatal exception
> [  100.812794][  T843] Kernel Offset: disabled
> [  100.821613][  T843] pstore: backend (erst) writing error (-28)
> [  100.827481][  T843] ---[ end Kernel panic - not syncing: Fatal exception 
> ]---
> 
> The reason for this kernel panic is that the ice network card driver
> executed the ice_pci_err_detected() for a longer time than the maximum
> waiting time allowed by pciehp. After that, the pciehp_ist() will
> execute the ice network card driver's ice_remove() process. This results
> in the ice_pci_err_detected() having already deleted the list, while the
> ice_remove() is still attempting to delete a list that no longer exists.

This is a bug in the ice driver, not in the pciehp or dpc driver.
I'm not against extending the timeout, but working around a driver
bug is not a good argument for doing so.

You should first try to unbind the ice driver at runtime to see if
there is a general problem in the unbind code path:

echo abcd:ef:gh.i > /sys/bus/pci/drivers/ice/unbind

Replace abcd:ef:gh.i with the domain/bus/device/function of the Ethernet
card.  The dmesg excerpt you've provided unfortunately does not betray
the card's address.

Then try to rebind the driver via the "bind" sysfs attribute.
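
In case it helps, here is a rough sketch of the unbind/rebind test as a
small shell function.  The function name and the SYSFS override are my
own placeholders (the override exists only so the function can be
dry-run against a scratch directory), and it assumes the driver
directory is named "ice":

```shell
#!/bin/sh
# Sketch only: unbind a PCI device from the ice driver, then bind it again.
# SYSFS is an assumed override for dry runs; it defaults to the real /sys.
SYSFS="${SYSFS:-/sys}"

rebind_ice() {
    bdf="${1:-}"    # domain:bus:device.function of the Ethernet card
    drv="$SYSFS/bus/pci/drivers/ice"

    if [ -z "$bdf" ]; then
        echo "usage: rebind_ice <domain:bus:device.function>" >&2
        return 2
    fi
    if [ ! -d "$drv" ]; then
        echo "$drv not found, is the ice driver loaded?" >&2
        return 1
    fi

    # Writing the address to "unbind" detaches the driver from the device;
    # writing it to "bind" attaches the driver again.
    echo "$bdf" > "$drv/unbind" || return 1
    echo "$bdf" > "$drv/bind" || return 1
}
```

Run it as root, e.g. "rebind_ice 0000:01:00.0" (a made-up address, take
the real one from "lspci -D"), and watch dmesg for splats while it runs.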

If this works, the next thing to debug is whether the driver has a
problem with surprise removal.  I'm not fully convinced that the
crash you're seeing is caused by concurrent execution of
ice_pci_err_detected() and ice_remove().  When pciehp unbinds the
driver during DPC recovery, the device is likely inaccessible.
It's possible that ice_remove() behaves differently for an
inaccessible device and that may cause the crash instead of the
concurrent execution of ice_pci_err_detected().

It would also be good to understand why DPC recovery of the Ethernet
card takes this long.  Does it take a long time to come out of reset?
Could the ice driver be changed to allow for faster recovery?

Thanks,

Lukas
