When a surprise unplug occurs while a process has active KFD queues,
userspace never gets a chance to call kfd_ioctl_destroy_queue() to
properly clean them up. This leads to a WARN_ON in uninitialize()
complaining about active_queue_count or processes_count being non-zero.

The issue is that during surprise unplug:
1. amdgpu_device_fini_hw() checks drm_dev_is_unplugged()
2. It calls amdgpu_amdkfd_device_fini_sw()
3. This leads to kfd_cleanup_nodes() -> device_queue_manager_uninit()
4. uninitialize() has: WARN_ON(dqm->active_queue_count > 0 || 
   dqm->processes_count > 0)

The warning triggers because the queues were never destroyed - userspace
had no opportunity to clean them up before the device disappeared.

Fix this by checking for device unplug in kfd_cleanup_nodes() and
calling process_termination for each affected process before
uninitializing the DQM. This mirrors what happens during normal process
shutdown (kfd_process_notifier_release_internal), ensuring queues are
properly cleaned up even during surprise removal.

Cc: Felix Kuehling <[email protected]>
Cc: Kent Russell <[email protected]>
Cc: [email protected]
Signed-off-by: Mario Limonciello <[email protected]>
---
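Not part of the commit: a minimal userspace sketch of the ordering
described above, in case it helps review. All model_* names are
hypothetical stand-ins and none of them are the real KFD structures or
ops. The sketch only assumes that process_termination drops a process's
queues from the DQM counters, and that uninitialize() warns when either
counter is still non-zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the DQM counters checked by uninitialize(). */
struct model_dqm {
	int active_queue_count;
	int processes_count;
};

/* Hypothetical stand-in for a per-process, per-device record. */
struct model_pdd {
	struct model_dqm *dqm;
	int queues;		/* queues this process still owns on the device */
	bool already_dequeued;
};

/* Models ops.process_termination(): drop this process's queues from the DQM. */
static void model_process_termination(struct model_pdd *pdd)
{
	assert(pdd->dqm->active_queue_count >= pdd->queues);
	pdd->dqm->active_queue_count -= pdd->queues;
	pdd->dqm->processes_count--;
	pdd->queues = 0;
}

/* Models uninitialize(): returns true when the WARN_ON condition would fire. */
static bool model_uninitialize_warns(const struct model_dqm *dqm)
{
	return dqm->active_queue_count > 0 || dqm->processes_count > 0;
}

/*
 * Models the cleanup path after a surprise unplug. When terminate_first is
 * set, every process is terminated before the DQM is uninitialized, which is
 * the ordering this patch introduces.
 */
static bool model_cleanup_warns(struct model_dqm *dqm, struct model_pdd *pdds,
				size_t n, bool terminate_first)
{
	size_t i;

	if (terminate_first) {
		for (i = 0; i < n; i++) {
			model_process_termination(&pdds[i]);
			pdds[i].already_dequeued = true;
		}
	}
	return model_uninitialize_warns(dqm);
}

/* One scenario: two processes holding three queues when the device vanishes. */
static bool model_surprise_unplug_warns(bool with_fix)
{
	struct model_dqm dqm = { .active_queue_count = 3, .processes_count = 2 };
	struct model_pdd pdds[] = {
		{ .dqm = &dqm, .queues = 2, .already_dequeued = false },
		{ .dqm = &dqm, .queues = 1, .already_dequeued = false },
	};

	return model_cleanup_warns(&dqm, pdds, 2, with_fix);
}
```

In this model, model_surprise_unplug_warns(false) reports the warning
condition and model_surprise_unplug_warns(true) does not. Locking and
the pdd->dev->kfd filter from the real hunk are deliberately left out.
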
 drivers/gpu/drm/amd/amdkfd/kfd_device.c | 32 ++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index e9cfb80bd436..7727b66e6afb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -664,6 +664,38 @@ static void kfd_cleanup_nodes(struct kfd_dev *kfd, unsigned int num_nodes)
        flush_workqueue(kfd->ih_wq);
        destroy_workqueue(kfd->ih_wq);
 
+       /*
+        * For surprise unplugs with running processes, we need to clean up
+        * queues before uninitializing the DQM to avoid WARN in uninitialize.
+        * This handles the case where userspace can't destroy queues normally.
+        */
+       if (drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
+               struct kfd_process *p;
+               unsigned int temp;
+               int idx;
+
+               idx = srcu_read_lock(&kfd_processes_srcu);
+               hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+                       int j;
+
+                       for (j = 0; j < p->n_pdds; j++) {
+                               struct kfd_process_device *pdd = p->pdds[j];
+
+                               if (pdd->dev->kfd != kfd)
+                                       continue;
+
+                               dev_info(kfd_device,
+                                        "Terminating queues for process %d on unplugged device\n",
+                                        p->lead_thread->pid);
+
+                               pdd->dev->dqm->ops.process_termination(pdd->dev->dqm,
+                                                                      &pdd->qpd);
+                               pdd->already_dequeued = true;
+                       }
+               }
+               srcu_read_unlock(&kfd_processes_srcu, idx);
+       }
+
        for (i = 0; i < num_nodes; i++) {
                knode = kfd->nodes[i];
                device_queue_manager_uninit(knode->dqm);
-- 
2.47.1
