Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

Christian König Thu, 06 Aug 2020 01:14:04 -0700

Preventing locking problems during implementation is obviously a goodapproach, but lockdep has proven to be massively useful for finding andfixing problems.

Disabling lockdep splat by annotating lock with separate classes isusually a no-go and only allowed if there is no other potential approach.


In this case here we should really clean things up instead.

Christian.

Am 06.08.20 um 09:44 schrieb Li, Dennis:

[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,
       I agree with your concern. However we shouldn't rely on system to detect 
dead-lock, and should consider this when doing code implementation. In fact, 
dead-lock detection isn't enabled by default.
       For your proposal to remove reset_sem into the hive structure, we can 
open a new topic to discuss it. Currently we couldn't make sure which is the 
best solution. For example, with your proposal, we must wait for all devices 
resuming successfully before resubmit an old task in one device, which will 
effect performance.

Best Regards
Dennis Li
-----Original Message-----
From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Christian 
König
Sent: Thursday, August 6, 2020 3:08 PM
To: Li, Dennis <dennis...@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander 
<alexander.deuc...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com>; Zhang, Hawking 
<hawking.zh...@amd.com>
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

Am 06.08.20 um 09:02 schrieb Dennis Li:

[  584.110304] ============================================
[  584.110590] WARNING: possible recursive locking detected
[  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
[  584.111164] --------------------------------------------
[  584.111456] kworker/38:1/553 is trying to acquire lock:
[  584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at:
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.112112]
                 but task is already holding lock:
[  584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at:
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.113068]
                 other info that might help us debug this:
[  584.113689]  Possible unsafe locking scenario:

[  584.114350]        CPU0
[  584.114685]        ----
[  584.115014]   lock(&adev->reset_sem);
[  584.115349]   lock(&adev->reset_sem);
[  584.115678]
                  *** DEADLOCK ***

[  584.116624]  May be due to missing lock nesting notation

[  584.117284] 4 locks held by kworker/38:1/553:
[  584.117616]  #0: ffff9ad635c1d348 ((wq_completion)events){+.+.},
at: process_one_work+0x21f/0x630 [  584.117967]  #1: ffffac708e1c3e58
((work_completion)(&con->recovery_work)){+.+.}, at:
process_one_work+0x21f/0x630 [  584.118358]  #2: ffffffffc1c2a5d0
(&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [  
584.118786]  #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: 
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.119222]
                 stack backtrace:
[  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G     
      OE     5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
[  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT,
BIOS 3.1 05/23/2019 [  584.121223] Workqueue: events
amdgpu_ras_do_recovery [amdgpu] [  584.121638] Call Trace:
[  584.122050]  dump_stack+0x98/0xd5
[  584.122499]  __lock_acquire+0x1139/0x16e0 [  584.122931]  ?
trace_hardirqs_on+0x3b/0xf0 [  584.123358]  ?
cancel_delayed_work+0xa6/0xc0 [  584.123771]  lock_acquire+0xb8/0x1c0
[  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [
584.124599]  down_write+0x49/0x120 [  584.125032]  ?
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125472]
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125910]  ?
amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [  584.126367]
amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [  584.126789]
process_one_work+0x29e/0x630 [  584.127208]  worker_thread+0x3c/0x3f0
[  584.127621]  ? __kthread_parkme+0x61/0x90 [  584.128014]
kthread+0x12f/0x150 [  584.128402]  ? process_one_work+0x630/0x630 [
584.128790]  ? kthread_park+0x90/0x90 [  584.129174]
ret_from_fork+0x3a/0x50

Each adev has owned lock_class_key to avoid false positive recursive
locking.

NAK, that is not a false positive but a real problem.

The issue here is that we have multiple reset semaphores, one for each device 
in the hive. If those are not acquired in the correct order we deadlock.

The real solution would be to move the reset_sem into the hive structure and 
make sure that we lock it only once.

Christian.

Signed-off-by: Dennis Li <dennis...@amd.com>
Change-Id: I7571efeccbf15483982031d00504a353031a854a

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index e97c088d03b3..766dc8f8c8a0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -967,6 +967,7 @@ struct amdgpu_device {
        atomic_t                        in_gpu_reset;
        enum pp_mp1_state               mp1_state;
        struct rw_semaphore     reset_sem;
+       struct lock_class_key lock_key;
        struct amdgpu_doorbell_index doorbell_index;

struct mutex notifier_lock;

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 6c572db42d92..d78df9312d34 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3037,6 +3037,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
        mutex_init(&adev->virt.vf_errors.lock);
        hash_init(adev->mn_hash);
        init_rwsem(&adev->reset_sem);
+       lockdep_set_class(&adev->reset_sem, &adev->lock_key);
        atomic_set(&adev->in_gpu_reset, 0);
        mutex_init(&adev->psp.mutex);
        mutex_init(&adev->notifier_lock);

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7CDennis.Li%40amd.com%7C56c95f939ddd441bd10408d839d77c9e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637322944985436656&amp;sdata=%2FBoRyEW3iK59Y52ctLWd4y1lOmi2CInb6lpIgAF88i4%3D&amp;reserved=0


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

Reply via email to