flush_workqueue(kfd->ih_wq) and destroy_workqueue(kfd->ih_wq) in kfd_cleanup_nodes drain the pending work items, and the node->interrupts_active check prevents new work items from being enqueued. So after kfd_cleanup_nodes frees the kfd nodes, there are no pending kfd ih_wq work items.
And I agree there is still a potential race here, since the kfd->nodes[i] NULL check is not protected by a lock. I will address it together with the issue Lijo mentioned.

-----Original Message-----
From: Chen, Xiaogang <[email protected]>
Sent: Friday, September 26, 2025 3:11 AM
To: Yang, Philip <[email protected]>; Zhang, Yifan <[email protected]>; [email protected]
Cc: Deucher, Alexander <[email protected]>; Kuehling, Felix <[email protected]>; Yang, Philip <[email protected]>; Lazar, Lijo <[email protected]>
Subject: Re: [PATCH v4 1/2] amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw

On 9/24/2025 5:48 PM, Philip Yang wrote:
>
> On 2025-09-24 11:29, Yifan Zhang wrote:
>> There is a race between amdgpu_amdkfd_device_fini_sw and interrupt
>> handling, if amdgpu_amdkfd_device_fini_sw runs in between
>> kfd_cleanup_nodes and kfree(kfd) and a KGD interrupt is generated.
>>
>> kernel panic log:
>>
>> BUG: kernel NULL pointer dereference, address: 0000000000000098
>> amdgpu 0000:c8:00.0: amdgpu: Requesting 4 partitions through PSP
>> PGD d78c68067 P4D d78c68067
>> kfd kfd: amdgpu: Allocated 3969056 bytes on gart
>> PUD 1465b8067 PMD 0
>> Oops: 0002 [#1] SMP NOPTI
>> kfd kfd: amdgpu: Total number of KFD nodes to be created: 4
>> CPU: 115 PID: 0 Comm: swapper/115 Kdump: loaded Tainted: G S W OE K 5.10.134-010.a1i5000.a18.x86_64 #1
>> RIP: 0010:_raw_spin_lock_irqsave+0x12/0x40
>> Code: 89 e0 41 5c c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f
>> 1f 40 00 0f 1f 44 00 00 41 54 9c 41 5c fa 31 c0 ba 01 00 00 00 <f0>
>> 0f b1 17 75 8a 4c 89 e0 41 5c 89 c6 e8 07 38 5d
>> RSP: 0018:ffffc9001a6b0e28 EFLAGS: 00010046
>> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000018
>> RDX: 0000000000000001 RSI: ffff8883bb623e00 RDI: 0000000000000098
>> RBP: ffff8883bb000000 R08: ffff888100055020 R09: ffff888100055020
>> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
>> R13: ffff888f2b97da00 R14: 0000000000000098 R15: ffff8883babdf000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 0000000000000098 CR3: 0000000e7cae2006 CR4: 0000000002770ce0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>> PKRU: 55555554
>>
>> Call Trace:
>> <IRQ>
>> kgd2kfd_interrupt+0x6b/0x1f0 [amdgpu]
>> ? amdgpu_fence_process+0xa4/0x150 [amdgpu]
>> kfd kfd: amdgpu: Node: 0, interrupt_bitmap: 3
>> amdgpu_irq_dispatch+0x165/0x210 [amdgpu]
>> amdgpu_ih_process+0x80/0x100 [amdgpu]
>> amdgpu: Virtual CRAT table created for GPU
>> amdgpu_irq_handler+0x1f/0x60 [amdgpu]
>> __handle_irq_event_percpu+0x3d/0x170
>> amdgpu: Topology: Add dGPU node [0x74a2:0x1002]
>> handle_irq_event+0x5a/0xc0
>> handle_edge_irq+0x93/0x240
>> kfd kfd: amdgpu: KFD node 1 partition 0 size 49148M
>> asm_call_irq_on_stack+0xf/0x20
>> </IRQ>
>> common_interrupt+0xb3/0x130
>> asm_common_interrupt+0x1e/0x40
>>
>> Signed-off-by: Yifan Zhang <[email protected]>
> Reviewed-by: Philip Yang <[email protected]>
>> ---
>>  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 10 +++++++++-
>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index 349c351e242b..051a00152b08 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -1133,7 +1133,15 @@ void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry)
>>  	}
>>
>>  	for (i = 0; i < kfd->num_nodes; i++) {
>> -		node = kfd->nodes[i];
>> +		/* Race if another thread in b/w
>> +		 * kfd_cleanup_nodes and kfree(kfd),
>> +		 * when kfd->nodes[i] = NULL
>> +		 */
>> +		if (kfd->nodes[i])
>> +			node = kfd->nodes[i];
>> +		else
>> +			return;
>> +

The KFD interrupt is handled later in the wq node->kfd->ih_wq at interrupt_wq, which uses kfd->nodes[i]. Checking "kfd->nodes[i] == NULL" here is not enough, as kfd_cleanup_nodes can free the kfd node at any time.

Regards

Xiaogang

>>  		spin_lock_irqsave(&node->interrupt_lock, flags);
>>  		if (node->interrupts_active
