When a surprise unplug occurs while a process has active KFD queues, userspace never gets a chance to call kfd_ioctl_destroy_queue() to properly clean them up. This leads to a WARN_ON in uninitialize() complaining about active_queue_count or processes_count being non-zero.
The issue is that during surprise unplug:

1. amdgpu_device_fini_hw() checks drm_dev_is_unplugged()
2. It calls amdgpu_amdkfd_device_fini_sw()
3. This leads to kfd_cleanup_nodes() -> device_queue_manager_uninit()
4. uninitialize() has:
   WARN_ON(dqm->active_queue_count > 0 || dqm->processes_count > 0)

The warning triggers because the queues were never destroyed - userspace
had no opportunity to clean them up before the device disappeared.

Fix this by checking for device unplug in kfd_cleanup_nodes() and
calling process_termination for each affected process before
uninitializing the DQM. This mirrors what happens during normal process
shutdown (kfd_process_notifier_release_internal), ensuring queues are
properly cleaned up even during surprise removal.

Cc: Felix Kuehling <[email protected]>
Cc: Kent Russell <[email protected]>
Cc: [email protected]
Signed-off-by: Mario Limonciello <[email protected]>
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c | 32 ++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index e9cfb80bd436..7727b66e6afb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -664,6 +664,38 @@ static void kfd_cleanup_nodes(struct kfd_dev *kfd, unsigned int num_nodes)
 	flush_workqueue(kfd->ih_wq);
 	destroy_workqueue(kfd->ih_wq);
 
+	/*
+	 * For surprise unplugs with running processes, we need to clean up
+	 * queues before uninitializing the DQM to avoid WARN in uninitialize.
+	 * This handles the case where userspace can't destroy queues normally.
+	 */
+	if (drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
+		struct kfd_process *p;
+		unsigned int temp;
+		int idx;
+
+		idx = srcu_read_lock(&kfd_processes_srcu);
+		hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+			int j;
+
+			for (j = 0; j < p->n_pdds; j++) {
+				struct kfd_process_device *pdd = p->pdds[j];
+
+				if (pdd->dev->kfd != kfd)
+					continue;
+
+				dev_info(kfd_device,
+					 "Terminating queues for process %d on unplugged device\n",
+					 p->lead_thread->pid);
+
+				pdd->dev->dqm->ops.process_termination(pdd->dev->dqm,
+								       &pdd->qpd);
+				pdd->already_dequeued = true;
+			}
+		}
+		srcu_read_unlock(&kfd_processes_srcu, idx);
+	}
+
 	for (i = 0; i < num_nodes; i++) {
 		knode = kfd->nodes[i];
 		device_queue_manager_uninit(knode->dqm);
-- 
2.47.1
