RE: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking
[AMD Official Use Only - Internal Distribution Only] Quick grep says it is protected like that ... can you pls paste the full build error without {}? -Daniel [Dennis Li] hi, Daniel, the full build error as the following: make: Entering directory '/home/yajunl/workspace/amd/brahma-staging/BUILD/x86_64/linux' CC [M] drivers/gpu/drm/amd/amdgpu/amdgpu_device.o drivers/gpu/drm/amd/amdgpu/amdgpu_device.c: In function ‘amdgpu_device_lock_adev’: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:4155:2: error: ‘else’ without a previous ‘if’ else ^~~~ scripts/Makefile.build:267: recipe for target 'drivers/gpu/drm/amd/amdgpu/amdgpu_device.o' failed make[1]: *** [drivers/gpu/drm/amd/amdgpu/amdgpu_device.o] Error 1 Makefile:1683: recipe for target 'drivers/gpu/drm/amd/amdgpu' failed make: *** [drivers/gpu/drm/amd/amdgpu] Error 2 > > Let me take a look, > Christian. > > > > > Regards, > > Guchun > > > > -Original Message- > > From: amd-gfx On Behalf Of > > Christian König > > Sent: Tuesday, August 11, 2020 2:53 PM > > To: Li, Dennis ; amd-gfx@lists.freedesktop.org; > > Deucher, Alexander ; Kuehling, Felix > > ; Zhang, Hawking ; > > dan...@ffwll.ch > > Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive > > recursive locking > > > > Am 11.08.20 um 04:12 schrieb Dennis Li: > >> [ 584.110304] > >> [ 584.110590] WARNING: possible recursive locking detected > >> [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G > >> OE > >> [ 584.64] > >> [ 584.111456] kworker/38:1/553 is trying to acquire lock: > >> [ 584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: > >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.112112] > >> but task is already holding lock: > >> [ 584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: > >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.113068] > >> other info that might help us debug this: > >> [ 584.113689] Possible unsafe locking scenario: > >> > >> [ 584.114350]CPU0 > >> [ 584.114685] > >> [ 584.115014] lock(&adev->reset_sem); > >> [ 584.115349] lock(&adev->reset_sem); > >> [ 584.115678] > >> *** DEADLOCK *** > >> > >> [ 584.116624] May be due to missing lock nesting notation > >> > >> [ 584.117284] 4 locks held by kworker/38:1/553: > >> [ 584.117616] #0: 9ad635c1d348 ((wq_completion)events){+.+.}, > >> at: process_one_work+0x21f/0x630 [ 584.117967] #1: > >> ac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: > >> process_one_work+0x21f/0x630 [ 584.118358] #2: c1c2a5d0 > >> (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 > >> [amdgpu] [ 584.118786] #3: 9b1603d247a0 (&adev->reset_sem){}, > >> at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.119222] > >> stack backtrace: > >> [ 584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: > >> G OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 > >> [ 584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, > >> BIOS 3.1 05/23/2019 [ 584.121223] Workqueue: events > >> amdgpu_ras_do_recovery [amdgpu] [ 584.121638] Call Trace: > >> [ 584.122050] dump_stack+0x98/0xd5 [ 584.122499] > >> __lock_acquire+0x1139/0x16e0 [ 584.122931] ? > >> trace_hardirqs_on+0x3b/0xf0 [ 584.123358] ? > >> cancel_delayed_work+0xa6/0xc0 [ 584.123771] > >> lock_acquire+0xb8/0x1c0 [ 584.124197] ? > >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.124599] > >> down_write+0x49/0x120 [ 584.125032] ? > >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125472] > >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125910] ? > >> amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [ 584.126367] > >> amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [ 584.126789] > >> process_one_work+0x29e/0x630 [ 584.127208] > >> worker_thread+0x3c/0x3f0 [ 584.127621] ? > >> __kthread_parkme+0x61/0x90 [ 584.128014] > >> kthread+0x12f/0x150 [ 584.128402] ? process_one_work+0x630/0x630 > >> kthread+[ > >> 584.128790] ? kthread_park+0x90/0x90 [ 584.129174] > >> ret_from_fork+0x3a/0x50 > >>
RE: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking
[AMD Public Use] Hi Christian, Since it's out of amdgpu driver scope, so I just attached my patch in this email thread instead of sending it by git sent-email. Regards, Guchun -Original Message- From: Koenig, Christian Sent: Tuesday, August 11, 2020 3:52 PM To: Chen, Guchun ; Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking ; dan...@ffwll.ch Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking Nice catch, it's not often that we stumble over something like this :) Can you provide a patch to fix this? Problem is that we probably won't be able to push it to the AMD servers, but I can probably merge it through drm-misc-next. Until this is fixed feel free to commit the patch with the {} in place. Thanks, Christian. Am 11.08.20 um 09:44 schrieb Chen, Guchun: > [AMD Public Use] > > # define down_write_nest_lock(sem, nest_lock) \ > do { \ > typecheck(struct lockdep_map *, &(nest_lock)->dep_map); \ > _down_write_nest_lock(sem, &(nest_lock)->dep_map); \ > } while (0); > > Looks the ';' after while (0) is the error point. It should be dropped. > > Regards, > Guchun > > -Original Message- > From: Koenig, Christian > Sent: Tuesday, August 11, 2020 3:41 PM > To: Chen, Guchun ; Li, Dennis > ; amd-gfx@lists.freedesktop.org; Deucher, Alexander > ; Kuehling, Felix ; > Zhang, Hawking ; dan...@ffwll.ch > Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive > recursive locking > > Am 11.08.20 um 09:36 schrieb Chen, Guchun: >> [AMD Public Use] >> >>> - down_write(&adev->reset_sem); >>> + if (hive) { >>> + down_write_nest_lock(&adev->reset_sem, &hive->hive_lock); >>> + } else >>> + down_write(&adev->reset_sem); >> Coding style nit pick: You should drop the {} here. >> >> {} could not be dropped here, as down_write_nest_lock Is one macro with >> multiple lines, otherwise, build error would say 'else' missed one previous >> 'if'. >> Instead of dropping, another {} should be added to else to include >> down_write(&adev->reset_sem), which makes the braces been balanced. > Interesting bug, that is something which should not happen with macros in the > Linux kernel. They protection by "do { ... } while(0)" is mandatory here. > > Let me take a look, > Christian. > >> Regards, >> Guchun >> >> -----Original Message----- >> From: amd-gfx On Behalf Of >> Christian König >> Sent: Tuesday, August 11, 2020 2:53 PM >> To: Li, Dennis ; amd-gfx@lists.freedesktop.org; >> Deucher, Alexander ; Kuehling, Felix >> ; Zhang, Hawking ; >> dan...@ffwll.ch >> Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive >> recursive locking >> >> Am 11.08.20 um 04:12 schrieb Dennis Li: >>> [ 584.110304] >>> [ 584.110590] WARNING: possible recursive locking detected >>> [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE >>> [ 584.64] >>> [ 584.111456] kworker/38:1/553 is trying to acquire lock: >>> [ 584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: >>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.112112] >>> but task is already holding lock: >>> [ 584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: >>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.113068] >>> other info that might help us debug this: >>> [ 584.113689] Possible unsafe locking scenario: >>> >>> [ 584.114350]CPU0 >>> [ 584.114685] >>> [ 584.115014] lock(&adev->reset_sem); >>> [ 584.115349] lock(&adev->reset_sem); >>> [ 584.115678] >>>*** DEADLOCK *** >>> >>> [ 584.116624] May be due to missing lock nesting notation >>> >>> [ 584.117284] 4 locks held by kworker/38:1/553: >>> [ 584.117616] #0: 9ad635c1d348 ((wq_completion)events){+.+.}, >>> at: process_one_work+0x21f/0x630 [ 584.117967] #1: >>> ac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: >>> process_one_work+0x21f/0x630 [ 584.118358] #2: c1c2a5d0 >>> (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] >>> [
Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking
Nice catch, it's not often that we stumble over something like this :) Can you provide a patch to fix this? Problem is that we probably won't be able to push it to the AMD servers, but I can probably merge it through drm-misc-next. Until this is fixed feel free to commit the patch with the {} in place. Thanks, Christian. Am 11.08.20 um 09:44 schrieb Chen, Guchun: [AMD Public Use] # define down_write_nest_lock(sem, nest_lock) \ do {\ typecheck(struct lockdep_map *, &(nest_lock)->dep_map); \ _down_write_nest_lock(sem, &(nest_lock)->dep_map); \ } while (0); Looks the ';' after while (0) is the error point. It should be dropped. Regards, Guchun -Original Message- From: Koenig, Christian Sent: Tuesday, August 11, 2020 3:41 PM To: Chen, Guchun ; Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking ; dan...@ffwll.ch Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking Am 11.08.20 um 09:36 schrieb Chen, Guchun: [AMD Public Use] - down_write(&adev->reset_sem); + if (hive) { + down_write_nest_lock(&adev->reset_sem, &hive->hive_lock); + } else + down_write(&adev->reset_sem); Coding style nit pick: You should drop the {} here. {} could not be dropped here, as down_write_nest_lock Is one macro with multiple lines, otherwise, build error would say 'else' missed one previous 'if'. Instead of dropping, another {} should be added to else to include down_write(&adev->reset_sem), which makes the braces been balanced. Interesting bug, that is something which should not happen with macros in the Linux kernel. They protection by "do { ... } while(0)" is mandatory here. Let me take a look, Christian. Regards, Guchun -Original Message- From: amd-gfx On Behalf Of Christian König Sent: Tuesday, August 11, 2020 2:53 PM To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking ; dan...@ffwll.ch Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking Am 11.08.20 um 04:12 schrieb Dennis Li: [ 584.110304] [ 584.110590] WARNING: possible recursive locking detected [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE [ 584.64] [ 584.111456] kworker/38:1/553 is trying to acquire lock: [ 584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.112112] but task is already holding lock: [ 584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.113068] other info that might help us debug this: [ 584.113689] Possible unsafe locking scenario: [ 584.114350]CPU0 [ 584.114685] [ 584.115014] lock(&adev->reset_sem); [ 584.115349] lock(&adev->reset_sem); [ 584.115678] *** DEADLOCK *** [ 584.116624] May be due to missing lock nesting notation [ 584.117284] 4 locks held by kworker/38:1/553: [ 584.117616] #0: 9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630 [ 584.117967] #1: ac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630 [ 584.118358] #2: c1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [ 584.118786] #3: 9b1603d247a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.119222] stack backtrace: [ 584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 [ 584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019 [ 584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu] [ 584.121638] Call Trace: [ 584.122050] dump_stack+0x98/0xd5 [ 584.122499] __lock_acquire+0x1139/0x16e0 [ 584.122931] ? trace_hardirqs_on+0x3b/0xf0 [ 584.123358] ? cancel_delayed_work+0xa6/0xc0 [ 584.123771] lock_acquire+0xb8/0x1c0 [ 584.124197] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.124599] down_write+0x49/0x120 [ 584.125032] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125472] amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125910] ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [ 584.126367] amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [ 584.126789] process_one_work+0x29e/0x630 [ 584.127208] worker_thread+0x3c/0x3f0 [ 584.127621] ? __kthread_parkme+0x61/0x90 [ 584.128014] kthread+0x12f/0x150 [ 584.128402] ? process_one_work+0x630/0x630
Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking
On Tue, Aug 11, 2020 at 9:40 AM Christian König wrote: > > Am 11.08.20 um 09:36 schrieb Chen, Guchun: > > [AMD Public Use] > > > >> -down_write(&adev->reset_sem); > >> +if (hive) { > >> +down_write_nest_lock(&adev->reset_sem, &hive->hive_lock); > >> +} else > >> +down_write(&adev->reset_sem); > > Coding style nit pick: You should drop the {} here. > > > > {} could not be dropped here, as down_write_nest_lock Is one macro with > > multiple lines, otherwise, build error would say 'else' missed one previous > > 'if'. > > Instead of dropping, another {} should be added to else to include > > down_write(&adev->reset_sem), which makes the braces been balanced. > > Interesting bug, that is something which should not happen with macros > in the Linux kernel. They protection by "do { ... } while(0)" is > mandatory here. Quick grep says it is protected like that ... can you pls paste the full build error without {}? -Daniel > > Let me take a look, > Christian. > > > > > Regards, > > Guchun > > > > -Original Message- > > From: amd-gfx On Behalf Of > > Christian König > > Sent: Tuesday, August 11, 2020 2:53 PM > > To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, > > Alexander ; Kuehling, Felix > > ; Zhang, Hawking ; > > dan...@ffwll.ch > > Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive > > locking > > > > Am 11.08.20 um 04:12 schrieb Dennis Li: > >> [ 584.110304] > >> [ 584.110590] WARNING: possible recursive locking detected > >> [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G > >> OE > >> [ 584.64] > >> [ 584.111456] kworker/38:1/553 is trying to acquire lock: > >> [ 584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: > >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.112112] > >> but task is already holding lock: > >> [ 584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: > >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.113068] > >> other info that might help us debug this: > >> [ 584.113689] Possible unsafe locking scenario: > >> > >> [ 584.114350]CPU0 > >> [ 584.114685] > >> [ 584.115014] lock(&adev->reset_sem); > >> [ 584.115349] lock(&adev->reset_sem); > >> [ 584.115678] > >> *** DEADLOCK *** > >> > >> [ 584.116624] May be due to missing lock nesting notation > >> > >> [ 584.117284] 4 locks held by kworker/38:1/553: > >> [ 584.117616] #0: 9ad635c1d348 ((wq_completion)events){+.+.}, > >> at: process_one_work+0x21f/0x630 [ 584.117967] #1: ac708e1c3e58 > >> ((work_completion)(&con->recovery_work)){+.+.}, at: > >> process_one_work+0x21f/0x630 [ 584.118358] #2: c1c2a5d0 > >> (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 > >> [amdgpu] [ 584.118786] #3: 9b1603d247a0 (&adev->reset_sem){}, > >> at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.119222] > >> stack backtrace: > >> [ 584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: > >> G OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 > >> [ 584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, > >> BIOS 3.1 05/23/2019 [ 584.121223] Workqueue: events > >> amdgpu_ras_do_recovery [amdgpu] [ 584.121638] Call Trace: > >> [ 584.122050] dump_stack+0x98/0xd5 > >> [ 584.122499] __lock_acquire+0x1139/0x16e0 [ 584.122931] ? > >> trace_hardirqs_on+0x3b/0xf0 [ 584.123358] ? > >> cancel_delayed_work+0xa6/0xc0 [ 584.123771] lock_acquire+0xb8/0x1c0 > >> [ 584.124197] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ > >> 584.124599] down_write+0x49/0x120 [ 584.125032] ? > >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125472] > >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125910] ? > >> amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [ 584.126367] > >> amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [ 584.126789] > >> process_one_work+0x29e/0x630 [ 584.127208] worker_thread+0x3c/0x3f0 > >> [ 584.127621] ? __kthread_p
RE: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking
[AMD Public Use] # define down_write_nest_lock(sem, nest_lock) \ do {\ typecheck(struct lockdep_map *, &(nest_lock)->dep_map); \ _down_write_nest_lock(sem, &(nest_lock)->dep_map); \ } while (0); Looks the ';' after while (0) is the error point. It should be dropped. Regards, Guchun -Original Message- From: Koenig, Christian Sent: Tuesday, August 11, 2020 3:41 PM To: Chen, Guchun ; Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking ; dan...@ffwll.ch Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking Am 11.08.20 um 09:36 schrieb Chen, Guchun: > [AMD Public Use] > >> -down_write(&adev->reset_sem); >> +if (hive) { >> +down_write_nest_lock(&adev->reset_sem, &hive->hive_lock); >> +} else >> +down_write(&adev->reset_sem); > Coding style nit pick: You should drop the {} here. > > {} could not be dropped here, as down_write_nest_lock Is one macro with > multiple lines, otherwise, build error would say 'else' missed one previous > 'if'. > Instead of dropping, another {} should be added to else to include > down_write(&adev->reset_sem), which makes the braces been balanced. Interesting bug, that is something which should not happen with macros in the Linux kernel. They protection by "do { ... } while(0)" is mandatory here. Let me take a look, Christian. > > Regards, > Guchun > > -Original Message- > From: amd-gfx On Behalf Of > Christian König > Sent: Tuesday, August 11, 2020 2:53 PM > To: Li, Dennis ; amd-gfx@lists.freedesktop.org; > Deucher, Alexander ; Kuehling, Felix > ; Zhang, Hawking ; > dan...@ffwll.ch > Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive > recursive locking > > Am 11.08.20 um 04:12 schrieb Dennis Li: >> [ 584.110304] >> [ 584.110590] WARNING: possible recursive locking detected >> [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE >> [ 584.64] >> [ 584.111456] kworker/38:1/553 is trying to acquire lock: >> [ 584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.112112] >> but task is already holding lock: >> [ 584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.113068] >> other info that might help us debug this: >> [ 584.113689] Possible unsafe locking scenario: >> >> [ 584.114350]CPU0 >> [ 584.114685] >> [ 584.115014] lock(&adev->reset_sem); >> [ 584.115349] lock(&adev->reset_sem); >> [ 584.115678] >> *** DEADLOCK *** >> >> [ 584.116624] May be due to missing lock nesting notation >> >> [ 584.117284] 4 locks held by kworker/38:1/553: >> [ 584.117616] #0: 9ad635c1d348 ((wq_completion)events){+.+.}, >> at: process_one_work+0x21f/0x630 [ 584.117967] #1: ac708e1c3e58 >> ((work_completion)(&con->recovery_work)){+.+.}, at: >> process_one_work+0x21f/0x630 [ 584.118358] #2: c1c2a5d0 >> (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] >> [ 584.118786] #3: 9b1603d247a0 (&adev->reset_sem){}, at: >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.119222] >> stack backtrace: >> [ 584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G >> OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 >> [ 584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, >> BIOS 3.1 05/23/2019 [ 584.121223] Workqueue: events >> amdgpu_ras_do_recovery [amdgpu] [ 584.121638] Call Trace: >> [ 584.122050] dump_stack+0x98/0xd5 >> [ 584.122499] __lock_acquire+0x1139/0x16e0 [ 584.122931] ? >> trace_hardirqs_on+0x3b/0xf0 [ 584.123358] ? >> cancel_delayed_work+0xa6/0xc0 [ 584.123771] lock_acquire+0xb8/0x1c0 >> [ 584.124197] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ >> 584.124599] down_write+0x49/0x120 [ 584.125032] ? >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125472] >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125910] ? >> amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [ 584.126367] >> amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [ 584.126789] >> process_
Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking
Am 11.08.20 um 09:36 schrieb Chen, Guchun: [AMD Public Use] - down_write(&adev->reset_sem); + if (hive) { + down_write_nest_lock(&adev->reset_sem, &hive->hive_lock); + } else + down_write(&adev->reset_sem); Coding style nit pick: You should drop the {} here. {} could not be dropped here, as down_write_nest_lock Is one macro with multiple lines, otherwise, build error would say 'else' missed one previous 'if'. Instead of dropping, another {} should be added to else to include down_write(&adev->reset_sem), which makes the braces been balanced. Interesting bug, that is something which should not happen with macros in the Linux kernel. They protection by "do { ... } while(0)" is mandatory here. Let me take a look, Christian. Regards, Guchun -Original Message- From: amd-gfx On Behalf Of Christian König Sent: Tuesday, August 11, 2020 2:53 PM To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking ; dan...@ffwll.ch Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking Am 11.08.20 um 04:12 schrieb Dennis Li: [ 584.110304] [ 584.110590] WARNING: possible recursive locking detected [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE [ 584.64] [ 584.111456] kworker/38:1/553 is trying to acquire lock: [ 584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.112112] but task is already holding lock: [ 584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.113068] other info that might help us debug this: [ 584.113689] Possible unsafe locking scenario: [ 584.114350]CPU0 [ 584.114685] [ 584.115014] lock(&adev->reset_sem); [ 584.115349] lock(&adev->reset_sem); [ 584.115678] *** DEADLOCK *** [ 584.116624] May be due to missing lock nesting notation [ 584.117284] 4 locks held by kworker/38:1/553: [ 584.117616] #0: 9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630 [ 584.117967] #1: ac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630 [ 584.118358] #2: c1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [ 584.118786] #3: 9b1603d247a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.119222] stack backtrace: [ 584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 [ 584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019 [ 584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu] [ 584.121638] Call Trace: [ 584.122050] dump_stack+0x98/0xd5 [ 584.122499] __lock_acquire+0x1139/0x16e0 [ 584.122931] ? trace_hardirqs_on+0x3b/0xf0 [ 584.123358] ? cancel_delayed_work+0xa6/0xc0 [ 584.123771] lock_acquire+0xb8/0x1c0 [ 584.124197] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.124599] down_write+0x49/0x120 [ 584.125032] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125472] amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125910] ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [ 584.126367] amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [ 584.126789] process_one_work+0x29e/0x630 [ 584.127208] worker_thread+0x3c/0x3f0 [ 584.127621] ? __kthread_parkme+0x61/0x90 [ 584.128014] kthread+0x12f/0x150 [ 584.128402] ? process_one_work+0x630/0x630 [ 584.128790] ? kthread_park+0x90/0x90 [ 584.129174] ret_from_fork+0x3a/0x50 Each adev has owned lock_class_key to avoid false positive recursive locking. v2: 1. register adev->lock_key into lockdep, otherwise lockdep will report the below warning [ 1216.705820] BUG: key 890183b647d0 has not been registered! [ 1216.705924] [ cut here ] [ 1216.705972] DEBUG_LOCKS_WARN_ON(1) [ 1216.705997] WARNING: CPU: 20 PID: 541 at kernel/locking/lockdep.c:3743 lockdep_init_map+0x150/0x210 v3: change to use down_write_nest_lock to annotate the false dead-lock warning. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 62ecac97fbd2..8a55b0bc044a 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4145,12 +4145,15 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive, return r; } -static bool amdgpu_device_lock_adev(struct amdgpu_device *adev) +static bool amdgpu_device_lock_adev(struct
RE: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking
[AMD Public Use] > - down_write(&adev->reset_sem); > + if (hive) { > + down_write_nest_lock(&adev->reset_sem, &hive->hive_lock); > + } else > + down_write(&adev->reset_sem); Coding style nit pick: You should drop the {} here. {} could not be dropped here, as down_write_nest_lock Is one macro with multiple lines, otherwise, build error would say 'else' missed one previous 'if'. Instead of dropping, another {} should be added to else to include down_write(&adev->reset_sem), which makes the braces been balanced. Regards, Guchun -Original Message- From: amd-gfx On Behalf Of Christian König Sent: Tuesday, August 11, 2020 2:53 PM To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking ; dan...@ffwll.ch Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking Am 11.08.20 um 04:12 schrieb Dennis Li: > [ 584.110304] > [ 584.110590] WARNING: possible recursive locking detected > [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE > [ 584.64] > [ 584.111456] kworker/38:1/553 is trying to acquire lock: > [ 584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: > amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.112112] > but task is already holding lock: > [ 584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: > amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.113068] > other info that might help us debug this: > [ 584.113689] Possible unsafe locking scenario: > > [ 584.114350]CPU0 > [ 584.114685] > [ 584.115014] lock(&adev->reset_sem); > [ 584.115349] lock(&adev->reset_sem); > [ 584.115678] > *** DEADLOCK *** > > [ 584.116624] May be due to missing lock nesting notation > > [ 584.117284] 4 locks held by kworker/38:1/553: > [ 584.117616] #0: 9ad635c1d348 ((wq_completion)events){+.+.}, > at: process_one_work+0x21f/0x630 [ 584.117967] #1: ac708e1c3e58 > ((work_completion)(&con->recovery_work)){+.+.}, at: > process_one_work+0x21f/0x630 [ 584.118358] #2: c1c2a5d0 > (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [ > 584.118786] #3: 9b1603d247a0 (&adev->reset_sem){}, at: > amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.119222] > stack backtrace: > [ 584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G > OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 > [ 584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, > BIOS 3.1 05/23/2019 [ 584.121223] Workqueue: events > amdgpu_ras_do_recovery [amdgpu] [ 584.121638] Call Trace: > [ 584.122050] dump_stack+0x98/0xd5 > [ 584.122499] __lock_acquire+0x1139/0x16e0 [ 584.122931] ? > trace_hardirqs_on+0x3b/0xf0 [ 584.123358] ? > cancel_delayed_work+0xa6/0xc0 [ 584.123771] lock_acquire+0xb8/0x1c0 > [ 584.124197] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ > 584.124599] down_write+0x49/0x120 [ 584.125032] ? > amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125472] > amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125910] ? > amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [ 584.126367] > amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [ 584.126789] > process_one_work+0x29e/0x630 [ 584.127208] worker_thread+0x3c/0x3f0 > [ 584.127621] ? __kthread_parkme+0x61/0x90 [ 584.128014] > kthread+0x12f/0x150 [ 584.128402] ? process_one_work+0x630/0x630 [ > 584.128790] ? kthread_park+0x90/0x90 [ 584.129174] > ret_from_fork+0x3a/0x50 > > Each adev has owned lock_class_key to avoid false positive recursive > locking. > > v2: > 1. register adev->lock_key into lockdep, otherwise lockdep will report > the below warning > > [ 1216.705820] BUG: key 890183b647d0 has not been registered! > [ 1216.705924] [ cut here ] [ 1216.705972] > DEBUG_LOCKS_WARN_ON(1) [ 1216.705997] WARNING: CPU: 20 PID: 541 at > kernel/locking/lockdep.c:3743 lockdep_init_map+0x150/0x210 > > v3: > change to use down_write_nest_lock to annotate the false dead-lock > warning. > > Signed-off-by: Dennis Li > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index 62ecac97fbd2..8a55b0bc044a 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -4145,12 +4145,15 @@ static int amdgpu_do_asic_reset(struct
Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking
On Tue, Aug 11, 2020 at 4:12 AM Dennis Li wrote: > > [ 584.110304] > [ 584.110590] WARNING: possible recursive locking detected > [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE > [ 584.64] > [ 584.111456] kworker/38:1/553 is trying to acquire lock: > [ 584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: > amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] > [ 584.112112] >but task is already holding lock: > [ 584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: > amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] > [ 584.113068] >other info that might help us debug this: > [ 584.113689] Possible unsafe locking scenario: > > [ 584.114350]CPU0 > [ 584.114685] > [ 584.115014] lock(&adev->reset_sem); > [ 584.115349] lock(&adev->reset_sem); > [ 584.115678] > *** DEADLOCK *** > > [ 584.116624] May be due to missing lock nesting notation > > [ 584.117284] 4 locks held by kworker/38:1/553: > [ 584.117616] #0: 9ad635c1d348 ((wq_completion)events){+.+.}, at: > process_one_work+0x21f/0x630 > [ 584.117967] #1: ac708e1c3e58 > ((work_completion)(&con->recovery_work)){+.+.}, at: > process_one_work+0x21f/0x630 > [ 584.118358] #2: c1c2a5d0 (&tmp->hive_lock){+.+.}, at: > amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] > [ 584.118786] #3: 9b1603d247a0 (&adev->reset_sem){}, at: > amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] > [ 584.119222] >stack backtrace: > [ 584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G > OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 > [ 584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 > 05/23/2019 > [ 584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu] > [ 584.121638] Call Trace: > [ 584.122050] dump_stack+0x98/0xd5 > [ 584.122499] __lock_acquire+0x1139/0x16e0 > [ 584.122931] ? trace_hardirqs_on+0x3b/0xf0 > [ 584.123358] ? cancel_delayed_work+0xa6/0xc0 > [ 584.123771] lock_acquire+0xb8/0x1c0 > [ 584.124197] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] > [ 584.124599] down_write+0x49/0x120 > [ 584.125032] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] > [ 584.125472] amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] > [ 584.125910] ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] > [ 584.126367] amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] > [ 584.126789] process_one_work+0x29e/0x630 > [ 584.127208] worker_thread+0x3c/0x3f0 > [ 584.127621] ? __kthread_parkme+0x61/0x90 > [ 584.128014] kthread+0x12f/0x150 > [ 584.128402] ? process_one_work+0x630/0x630 > [ 584.128790] ? kthread_park+0x90/0x90 > [ 584.129174] ret_from_fork+0x3a/0x50 > > Each adev has owned lock_class_key to avoid false positive > recursive locking. > > v2: > 1. register adev->lock_key into lockdep, otherwise lockdep will > report the below warning > > [ 1216.705820] BUG: key 890183b647d0 has not been registered! > [ 1216.705924] [ cut here ] > [ 1216.705972] DEBUG_LOCKS_WARN_ON(1) > [ 1216.705997] WARNING: CPU: 20 PID: 541 at kernel/locking/lockdep.c:3743 > lockdep_init_map+0x150/0x210 > > v3: > change to use down_write_nest_lock to annotate the false dead-lock > warning. > > Signed-off-by: Dennis Li I'd explain a bit more in writing what the actual deadlock possibility is, but up to you folks. Reviewed-by: Daniel Vetter Cheers, Daniel > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index 62ecac97fbd2..8a55b0bc044a 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -4145,12 +4145,15 @@ static int amdgpu_do_asic_reset(struct > amdgpu_hive_info *hive, > return r; > } > > -static bool amdgpu_device_lock_adev(struct amdgpu_device *adev) > +static bool amdgpu_device_lock_adev(struct amdgpu_device *adev, struct > amdgpu_hive_info *hive) > { > if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0) > return false; > > - down_write(&adev->reset_sem); > + if (hive) { > + down_write_nest_lock(&adev->reset_sem, &hive->hive_lock); > + } else > + down_write(&adev->reset_sem); > > atomic_inc(&adev->gpu_reset_counter); > switch (amdgpu_asic_reset_method(adev)) { > @@ -4312,7 +4315,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device > *adev, > > /* block all schedulers and reset given job's ring */ > list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) { > - if (!amdgpu_device_lock_adev(tmp_adev)) { > + if (!amdgpu_device_lock_adev(tmp_adev, hive)) { > DRM_INFO("Bailing on TDR for s_job:%llx, as another > already in progress", >
Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking
Am 11.08.20 um 04:12 schrieb Dennis Li: [ 584.110304] [ 584.110590] WARNING: possible recursive locking detected [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE [ 584.64] [ 584.111456] kworker/38:1/553 is trying to acquire lock: [ 584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.112112] but task is already holding lock: [ 584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.113068] other info that might help us debug this: [ 584.113689] Possible unsafe locking scenario: [ 584.114350]CPU0 [ 584.114685] [ 584.115014] lock(&adev->reset_sem); [ 584.115349] lock(&adev->reset_sem); [ 584.115678] *** DEADLOCK *** [ 584.116624] May be due to missing lock nesting notation [ 584.117284] 4 locks held by kworker/38:1/553: [ 584.117616] #0: 9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630 [ 584.117967] #1: ac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630 [ 584.118358] #2: c1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [ 584.118786] #3: 9b1603d247a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.119222] stack backtrace: [ 584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 [ 584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019 [ 584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu] [ 584.121638] Call Trace: [ 584.122050] dump_stack+0x98/0xd5 [ 584.122499] __lock_acquire+0x1139/0x16e0 [ 584.122931] ? trace_hardirqs_on+0x3b/0xf0 [ 584.123358] ? cancel_delayed_work+0xa6/0xc0 [ 584.123771] lock_acquire+0xb8/0x1c0 [ 584.124197] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.124599] down_write+0x49/0x120 [ 584.125032] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125472] amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125910] ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [ 584.126367] amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [ 584.126789] process_one_work+0x29e/0x630 [ 584.127208] worker_thread+0x3c/0x3f0 [ 584.127621] ? __kthread_parkme+0x61/0x90 [ 584.128014] kthread+0x12f/0x150 [ 584.128402] ? process_one_work+0x630/0x630 [ 584.128790] ? kthread_park+0x90/0x90 [ 584.129174] ret_from_fork+0x3a/0x50 Each adev has owned lock_class_key to avoid false positive recursive locking. v2: 1. register adev->lock_key into lockdep, otherwise lockdep will report the below warning [ 1216.705820] BUG: key 890183b647d0 has not been registered! [ 1216.705924] [ cut here ] [ 1216.705972] DEBUG_LOCKS_WARN_ON(1) [ 1216.705997] WARNING: CPU: 20 PID: 541 at kernel/locking/lockdep.c:3743 lockdep_init_map+0x150/0x210 v3: change to use down_write_nest_lock to annotate the false dead-lock warning. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 62ecac97fbd2..8a55b0bc044a 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4145,12 +4145,15 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive, return r; } -static bool amdgpu_device_lock_adev(struct amdgpu_device *adev) +static bool amdgpu_device_lock_adev(struct amdgpu_device *adev, struct amdgpu_hive_info *hive) { if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0) return false; - down_write(&adev->reset_sem); + if (hive) { + down_write_nest_lock(&adev->reset_sem, &hive->hive_lock); + } else + down_write(&adev->reset_sem); Coding style nit pick: You should drop the {} here. Apart from that the patch is Reviewed-by: Christian König atomic_inc(&adev->gpu_reset_counter); switch (amdgpu_asic_reset_method(adev)) { @@ -4312,7 +4315,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, /* block all schedulers and reset given job's ring */ list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) { - if (!amdgpu_device_lock_adev(tmp_adev)) { + if (!amdgpu_device_lock_adev(tmp_adev, hive)) { DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress", job ? job->base.id : -1); r = 0; ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH v3] drm/amdgpu: annotate a false positive recursive locking
[ 584.110304] [ 584.110590] WARNING: possible recursive locking detected [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE [ 584.64] [ 584.111456] kworker/38:1/553 is trying to acquire lock: [ 584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.112112] but task is already holding lock: [ 584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.113068] other info that might help us debug this: [ 584.113689] Possible unsafe locking scenario: [ 584.114350]CPU0 [ 584.114685] [ 584.115014] lock(&adev->reset_sem); [ 584.115349] lock(&adev->reset_sem); [ 584.115678] *** DEADLOCK *** [ 584.116624] May be due to missing lock nesting notation [ 584.117284] 4 locks held by kworker/38:1/553: [ 584.117616] #0: 9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630 [ 584.117967] #1: ac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630 [ 584.118358] #2: c1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [ 584.118786] #3: 9b1603d247a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.119222] stack backtrace: [ 584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 [ 584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019 [ 584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu] [ 584.121638] Call Trace: [ 584.122050] dump_stack+0x98/0xd5 [ 584.122499] __lock_acquire+0x1139/0x16e0 [ 584.122931] ? trace_hardirqs_on+0x3b/0xf0 [ 584.123358] ? cancel_delayed_work+0xa6/0xc0 [ 584.123771] lock_acquire+0xb8/0x1c0 [ 584.124197] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.124599] down_write+0x49/0x120 [ 584.125032] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125472] amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125910] ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [ 584.126367] amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [ 584.126789] process_one_work+0x29e/0x630 [ 584.127208] worker_thread+0x3c/0x3f0 [ 584.127621] ? __kthread_parkme+0x61/0x90 [ 584.128014] kthread+0x12f/0x150 [ 584.128402] ? process_one_work+0x630/0x630 [ 584.128790] ? kthread_park+0x90/0x90 [ 584.129174] ret_from_fork+0x3a/0x50 Each adev has owned lock_class_key to avoid false positive recursive locking. v2: 1. register adev->lock_key into lockdep, otherwise lockdep will report the below warning [ 1216.705820] BUG: key 890183b647d0 has not been registered! [ 1216.705924] [ cut here ] [ 1216.705972] DEBUG_LOCKS_WARN_ON(1) [ 1216.705997] WARNING: CPU: 20 PID: 541 at kernel/locking/lockdep.c:3743 lockdep_init_map+0x150/0x210 v3: change to use down_write_nest_lock to annotate the false dead-lock warning. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 62ecac97fbd2..8a55b0bc044a 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4145,12 +4145,15 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive, return r; } -static bool amdgpu_device_lock_adev(struct amdgpu_device *adev) +static bool amdgpu_device_lock_adev(struct amdgpu_device *adev, struct amdgpu_hive_info *hive) { if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0) return false; - down_write(&adev->reset_sem); + if (hive) { + down_write_nest_lock(&adev->reset_sem, &hive->hive_lock); + } else + down_write(&adev->reset_sem); atomic_inc(&adev->gpu_reset_counter); switch (amdgpu_asic_reset_method(adev)) { @@ -4312,7 +4315,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, /* block all schedulers and reset given job's ring */ list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) { - if (!amdgpu_device_lock_adev(tmp_adev)) { + if (!amdgpu_device_lock_adev(tmp_adev, hive)) { DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress", job ? job->base.id : -1); r = 0; -- 2.17.1 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx