Hi Christian,

Can you help review this change?

This issue happens when two jobs on two schedulers time out at the same time,
which leads two threads to enter amdgpu_device_gpu_recover() simultaneously.
The problem is that when the device is not an XGMI node, adev->gmc.xgmi.head
is added to device_list, which is a stack variable.
So the first thread puts the device on its local device list and starts to
iterate over it, while the second thread may steal the device away and add it
to its own device list. This leaves the first thread's iteration in a bad
state.
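
To make the race concrete, here is a minimal userspace sketch of the failure
mode (my own simplified re-implementations of the list helpers, not the driver
code; "node", "list_a" and "list_b" are illustrative stand-ins for
adev->gmc.xgmi.head and the two threads' stack-local device_list):

#include <stdio.h>

/* simplified stand-in for the kernel's struct list_head */
struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h) { h->prev = h->next = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
        n->prev = h->prev;
        n->next = h;
        h->prev->next = n;
        h->prev = n;
}

int main(void)
{
        struct list_head node;           /* plays the role of adev->gmc.xgmi.head */
        struct list_head list_a, list_b; /* each thread's stack-local device_list */

        list_init(&list_a);
        list_init(&list_b);

        list_add_tail(&node, &list_a);   /* thread 1 builds its device list */
        list_add_tail(&node, &list_b);   /* thread 2 steals the same node   */

        /* Thread 1 now iterates list_a: node.next points into list_b, so the
         * walk never returns to list_a's head and wanders into the other
         * thread's stack once list_b goes out of scope. */
        printf("node.next == &list_a ? %s\n",
               node.next == &list_a ? "yes" : "no (iteration escapes into list_b)");
        return 0;
}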

The solution is to lock the device early, before adding it to the local device
list.
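
The patch below takes the per-device lock before publishing the node on the
local list. As a rough sketch of that ordering (userspace stand-ins only:
the atomic_flag "in_reset" plays the role of amdgpu_device_lock_adev(),
"recover_one" is an illustrative name, and the simplified list helpers from
the previous sketch are repeated so it compiles on its own):

#include <stdatomic.h>
#include <stdbool.h>

struct list_head { struct list_head *prev, *next; };

static void list_add_tail(struct list_head *n, struct list_head *h)
{
        n->prev = h->prev;
        n->next = h;
        h->prev->next = n;
        h->prev = n;
}

struct device {
        atomic_flag in_reset;    /* stand-in for amdgpu_device_lock_adev() */
        struct list_head head;   /* stand-in for adev->gmc.xgmi.head       */
};

/* Returns false when another recovery already owns the device, mirroring
 * the "Bailing on TDR" path in the patch. */
static bool recover_one(struct device *dev, struct list_head *device_list)
{
        if (atomic_flag_test_and_set(&dev->in_reset))
                return false;                   /* bail out, do not touch the node */

        list_add_tail(&dev->head, device_list); /* only the owner links the node */
        /* ... block schedulers, reset, then clear the flag ... */
        return true;
}

int main(void)
{
        struct device dev = { .in_reset = ATOMIC_FLAG_INIT };
        struct list_head list1 = { &list1, &list1 };
        struct list_head list2 = { &list2, &list2 };

        /* The first caller wins; the second bails instead of re-linking. */
        bool first  = recover_one(&dev, &list1);   /* true  */
        bool second = recover_one(&dev, &list2);   /* false */
        return (first && !second) ? 0 : 1;
}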

Thanks & Regards,
Horace.

-----Original Message-----
From: Horace Chen <horace.c...@amd.com> 
Sent: Wednesday, January 6, 2021 8:43 PM
To: amd-gfx@lists.freedesktop.org
Cc: Chen, Horace <horace.c...@amd.com>; Tuikov, Luben <luben.tui...@amd.com>; 
Koenig, Christian <christian.koe...@amd.com>; Deucher, Alexander 
<alexander.deuc...@amd.com>; Xiao, Jack <jack.x...@amd.com>; Zhang, Hawking 
<hawking.zh...@amd.com>; Liu, Monk <monk....@amd.com>; Xu, Feifei 
<feifei...@amd.com>; Wang, Kevin(Yang) <kevin1.w...@amd.com>; Xiaojie Yuan 
<xiaojie.y...@amd.com>
Subject: [PATCH] drm/amdgpu: fix issue when 2 ring jobs time out

Fix a race condition when the jobs on 2 rings time out simultaneously.

If 2 rings time out at the same time,
amdgpu_device_gpu_recover() will be re-entered. Then
adev->gmc.xgmi.head will be grabbed by 2 local linked lists,
which may cause a wild pointer issue while iterating.

Lock the device early to prevent the node from being added to 2
different lists.

Signed-off-by: Horace Chen <horace.c...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 25 ++++++++++++++++------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9a3cb98d03be..233dae27c8eb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4620,23 +4620,34 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
        if (adev->gmc.xgmi.num_physical_nodes > 1) {
                if (!hive)
                        return -ENODEV;
+
+               list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
+                       if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
+                               dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
+                                               job ? job->base.id : -1);
+                               r = 0;
+                               goto skip_recovery;
+                       }
+               }
+
                if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
                        list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
                device_list_handle = &hive->device_list;
        } else {
+               /* if current dev is already in reset, skip adding list to prevent race issue */
+               if (!amdgpu_device_lock_adev(adev, hive)) {
+                       dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
+                                       job ? job->base.id : -1);
+                       r = 0;
+                       goto skip_recovery;
+               }
+
                list_add_tail(&adev->gmc.xgmi.head, &device_list);
                device_list_handle = &device_list;
        }
 
        /* block all schedulers and reset given job's ring */
        list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
-               if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
-                       dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
-                                 job ? job->base.id : -1);
-                       r = 0;
-                       goto skip_recovery;
-               }
-
                /*
                 * Try to put the audio codec into suspend state
                 * before gpu reset started.
-- 
2.17.1