Hi Nam,
On the latest upstream mainline kernel, I am encountering a kernel
crash when attempting to unload the NVMe driver module (rmmod nvme)
on a POWER9 system. The crash appears to be triggered by the recent
work on using MSI parent domains, discussed here:
https://lore.kernel.org/all/[email protected]/
System details:
===============
Architecture: PowerPC (POWER9, IBM 9008-22L)
Kernel: 6.18.0-rc1 (mainline, unmodified)
Platform: pSeries / PHYP
Reproducibility: Always, when running rmmod nvme
Crash trace:
============
Kernel attempted to read user page (8) - exploit attempt? (uid: 0)
BUG: Kernel NULL pointer dereference on read at 0x00000008
Faulting instruction address: 0xc000000000b30638
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp bonding tls nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet n
CPU: 14 UID: 0 PID: 1973 Comm: rmmod Not tainted 6.18.0-rc1 #63 VOLUNTARY
Hardware name: IBM,9008-22L POWER9 (architected) 0x4e0202 0xf000005
of:IBM,FW950.80 (VL950_131) hv:phyp pSeries
NIP: c000000000b30638 LR: c000000000111d90 CTR: c000000000111d6c
REGS: c00000011f1076e0 TRAP: 0300 Not tainted (6.18.0-rc1)
MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48008228 XER: 200400cf
CFAR: c00000000000d9cc DAR: 0000000000000008 DSISR: 40000000 IRQMASK: 0
GPR00: c000000000111d90 c00000011f107980 c000000001da8100 0000000000000000
GPR04: c0000000bcf535e8 0000000000000000 73efa01ced0dd290 00000000b0734e18
GPR08: 0000000ffb4c0000 c0000000bcf53540 0000000000000000 0000000048008222
GPR12: c000000000111d6c c000000017ff1c80 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24: 0000000000000000 0000000000000001 c0000000b70bd980 c0000000995890c8
GPR28: 0000000000000000 c0000000bcf53590 c00000000e79c800 c0000000995890c8
NIP [c000000000b30638] msi_desc_to_pci_dev+0x8/0x14
LR [c000000000111d90] pseries_msi_ops_teardown+0x24/0x38
Call Trace:
[c00000011f107980] [c0000000995890c8] 0xc0000000995890c8 (unreliable)
[c00000011f1079a0] [c000000000276118] msi_remove_device_irq_domain+0x9c/0x18c
[c00000011f1079e0] [c00000000027623c] msi_device_data_release+0x34/0xa8
[c00000011f107a10] [c000000000c657b8] release_nodes+0xac/0x1f0
[c00000011f107ab0] [c000000000c675e8] devres_release_all+0xc0/0x138
[c00000011f107b20] [c000000000c5bb8c] device_unbind_cleanup+0x2c/0xb0
[c00000011f107b50] [c000000000c5dfc8] device_release_driver_internal+0x2fc/0x34c
[c00000011f107ba0] [c000000000c5e0c4] driver_detach+0x74/0xe0
[c00000011f107bd0] [c000000000c5b3e0] bus_remove_driver+0x94/0x140
[c00000011f107c50] [c000000000c5f1c8] driver_unregister+0x48/0x88
[c00000011f107cc0] [c000000000b228ec] pci_unregister_driver+0x40/0x128
[c00000011f107d10] [c008000004b6834c] nvme_exit+0x20/0x7cd4 [nvme]
[c00000011f107d30] [c0000000002becb8] __do_sys_delete_module.constprop.0+0x1ac/0x3ec
[c00000011f107e10] [c000000000032324] system_call_exception+0x134/0x360
[c00000011f107e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
Analysis:
=========
From tracing the cleanup path, it appears that the crash happens because the
MSI descriptor is freed before the MSI teardown is invoked. Specifically,
during NVMe module unload (rmmod nvme), the call sequence is as follows:
cleanup_module
-> pci_unregister_driver
-> driver_unregister
-> bus_remove_driver
-> driver_detach
-> device_release_driver_internal
-> device_remove
-> pci_device_remove
-> nvme_remove
-> nvme_dev_disable
-> pci_free_irq_vectors
-> pci_disable_msix
-> pci_free_msi_irqs
-> pci_msi_teardown_msi_irqs ==> here we free msi_desc
Later, as the call stack continues to unwind:
-> device_release_driver_internal
-> device_unbind_cleanup
-> devres_release_all
-> release_nodes
-> msi_device_data_release
-> msi_remove_device_irq_domain
-> pseries_msi_ops_teardown ==> here the freed msi_desc is dereferenced,
leading to the crash
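To make the ordering concrete, here is a small user-space C model of the two
steps above. It is only a sketch: every model_* name is hypothetical and none
of this is kernel code. The struct layout (two unsigned ints followed by a
pointer) is an assumption chosen so that the device pointer sits at offset 8,
which would be consistent with the faulting read at 0x8 and the zero in GPR03
in the oops. The model also assumes that the descriptor pointer reads back as
NULL once the descriptors are gone.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
struct model_dev {
	int id;
};

struct model_msi_desc {
	unsigned int irq;
	unsigned int nvec_used;
	struct model_dev *dev;	/* assumed to land at offset 8 */
};

struct model_alloc_info {
	struct model_msi_desc *desc;
};

/* Models the pci_free_irq_vectors() leg: the descriptor goes away. */
void model_free_descs(struct model_alloc_info *info)
{
	free(info->desc);
	info->desc = NULL;
}

/*
 * Models the domain teardown leg, which needs desc->dev.
 * Returns 0 if the descriptor was usable, -1 if it was already gone.
 * The NULL check exists only so the model can report the problem
 * instead of faulting; the oops suggests the reported path has none.
 */
int model_domain_teardown(const struct model_alloc_info *info)
{
	if (info->desc == NULL)
		return -1;
	return info->desc->dev != NULL ? 0 : -1;
}

/*
 * Runs both legs in the chosen order.
 * teardown_first == 0 mirrors the reported sequence (free, then teardown).
 * teardown_first == 1 is the order the teardown leg appears to expect.
 * Returns the teardown result, or -2 if the model's own allocation fails.
 */
int model_unload(int teardown_first)
{
	struct model_dev dev = { .id = 1 };
	struct model_alloc_info info;
	int ret;

	info.desc = calloc(1, sizeof(*info.desc));
	if (info.desc == NULL)
		return -2;
	info.desc->dev = &dev;

	if (teardown_first) {
		ret = model_domain_teardown(&info);
		model_free_descs(&info);
	} else {
		model_free_descs(&info);
		ret = model_domain_teardown(&info);
	}
	return ret;
}
```

In this model, model_unload(0) returns -1 (teardown finds no descriptor) and
model_unload(1) returns 0. It says nothing about where the real fix belongs,
only that the second leg depends on state the first leg has already released.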
Possible Cause:
===============
This looks like a cleanup ordering issue introduced by the recent MSI parent
domain rework. The domain teardown path seems to assume that the MSI
descriptor is still valid when it runs, which no longer appears to hold
in this sequence.
Expected behavior:
==================
The rmmod nvme operation should cleanly unload the module without triggering a
crash or accessing freed MSI descriptors.
Additional notes:
=================
- The crash reproduces consistently on PowerPC (pseries, PHYP).
- It did not occur before the MSI parent domain series was merged.
- Likely to affect other MSI-capable PCI drivers.
Let me know if you need any further details. Also, once you have a fix,
I'd be glad to help validate it on PPC.
Thanks,
--Nilay