[Public]

Looks good to me.

Reviewed-by: Kent Russell <[email protected]>



> -----Original Message-----
> From: Mario Limonciello (AMD) <[email protected]>
> Sent: Wednesday, January 7, 2026 4:37 PM
> To: [email protected]
> Cc: Mario Limonciello (AMD) <[email protected]>; Russell, Kent
> <[email protected]>
> Subject: [PATCH] drm/amd: Clean up kfd node on surprise disconnect
>
> When an eGPU is unplugged the KFD topology should also be destroyed
> for that GPU. This never happens because the fini_sw callbacks never
> get to run. Run them manually before calling amdgpu_device_ip_fini_early()
> when a device has already been disconnected.
>
> This location is intentionally chosen to make sure that the kfd locking
> refcount doesn't get incremented unintentionally.
>
> Cc: [email protected]
> Closes: https://community.frame.work/t/amd-egpu-on-linux/8691/33
> Signed-off-by: Mario Limonciello (AMD) <[email protected]>
> ---
> v2:
>  * Move the call earlier in amdgpu_device_fini_hw() to fix locking
>    refcount issues
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 021ecc988ff79..f167ba1b6ffcb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5251,6 +5251,14 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>
>       amdgpu_ttm_set_buffer_funcs_status(adev, false);
>
> +     /*
> +      * device went through surprise hotplug; we need to destroy topology
> +      * before ip_fini_early to prevent kfd locking refcount issues by calling
> +      * amdgpu_amdkfd_suspend()
> +      */
> +     if (drm_dev_is_unplugged(adev_to_drm(adev)))
> +             amdgpu_amdkfd_device_fini_sw(adev);
> +
>       amdgpu_device_ip_fini_early(adev);
>
>       amdgpu_irq_fini_hw(adev);
> --
> 2.43.0
