RE: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking

2020-08-11 Thread Li, Dennis
[AMD Official Use Only - Internal Distribution Only]


Quick grep says it is protected like that ... can you pls paste the full build 
error without {}?
-Daniel

[Dennis Li] hi, Daniel, the full build error as the following: 

make: Entering directory 
'/home/yajunl/workspace/amd/brahma-staging/BUILD/x86_64/linux'
  CC [M]  drivers/gpu/drm/amd/amdgpu/amdgpu_device.o
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c: In function 
‘amdgpu_device_lock_adev’:
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:4155:2: error: ‘else’ without a 
previous ‘if’
  else
  ^~~~
scripts/Makefile.build:267: recipe for target 
'drivers/gpu/drm/amd/amdgpu/amdgpu_device.o' failed
make[1]: *** [drivers/gpu/drm/amd/amdgpu/amdgpu_device.o] Error 1
Makefile:1683: recipe for target 'drivers/gpu/drm/amd/amdgpu' failed
make: *** [drivers/gpu/drm/amd/amdgpu] Error 2



>
> Let me take a look,
> Christian.
>
> >
> > Regards,
> > Guchun
> >
> > -Original Message-
> > From: amd-gfx  On Behalf Of 
> > Christian König
> > Sent: Tuesday, August 11, 2020 2:53 PM
> > To: Li, Dennis ; amd-gfx@lists.freedesktop.org; 
> > Deucher, Alexander ; Kuehling, Felix 
> > ; Zhang, Hawking ; 
> > dan...@ffwll.ch
> > Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive 
> > recursive locking
> >
> > Am 11.08.20 um 04:12 schrieb Dennis Li:
> >> [  584.110304] 
> >> [  584.110590] WARNING: possible recursive locking detected
> >> [  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   
> >> OE
> >> [  584.64] 
> >> [  584.111456] kworker/38:1/553 is trying to acquire lock:
> >> [  584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at:
> >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.112112]
> >>  but task is already holding lock:
> >> [  584.112673] 9b1603d247a0 (&adev->reset_sem){}, at:
> >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.113068]
> >>  other info that might help us debug this:
> >> [  584.113689]  Possible unsafe locking scenario:
> >>
> >> [  584.114350]CPU0
> >> [  584.114685]
> >> [  584.115014]   lock(&adev->reset_sem);
> >> [  584.115349]   lock(&adev->reset_sem);
> >> [  584.115678]
> >>   *** DEADLOCK ***
> >>
> >> [  584.116624]  May be due to missing lock nesting notation
> >>
> >> [  584.117284] 4 locks held by kworker/38:1/553:
> >> [  584.117616]  #0: 9ad635c1d348 ((wq_completion)events){+.+.},
> >> at: process_one_work+0x21f/0x630 [  584.117967]  #1: 
> >> ac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at:
> >> process_one_work+0x21f/0x630 [  584.118358]  #2: c1c2a5d0 
> >> (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 
> >> [amdgpu] [  584.118786]  #3: 9b1603d247a0 (&adev->reset_sem){}, 
> >> at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.119222]
> >>  stack backtrace:
> >> [  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: 
> >> G   OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
> >> [  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, 
> >> BIOS 3.1 05/23/2019 [  584.121223] Workqueue: events 
> >> amdgpu_ras_do_recovery [amdgpu] [  584.121638] Call Trace:
> >> [  584.122050]  dump_stack+0x98/0xd5 [  584.122499]  
> >> __lock_acquire+0x1139/0x16e0 [  584.122931]  ?
> >> trace_hardirqs_on+0x3b/0xf0 [  584.123358]  ?
> >> cancel_delayed_work+0xa6/0xc0 [  584.123771]  
> >> lock_acquire+0xb8/0x1c0 [  584.124197]  ? 
> >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.124599]  
> >> down_write+0x49/0x120 [  584.125032]  ?
> >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125472]
> >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125910]  ?
> >> amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [  584.126367]
> >> amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [  584.126789]
> >> process_one_work+0x29e/0x630 [  584.127208]  
> >> worker_thread+0x3c/0x3f0 [  584.127621]  ? 
> >> __kthread_parkme+0x61/0x90 [  584.128014]
> >> kthread+0x12f/0x150 [  584.128402]  ? process_one_work+0x630/0x630 
> >> kthread+[
> >> 584.128790]  ? kthread_park+0x90/0x90 [  584.129174]
> >> ret_from_fork+0x3a/0x50
> >>

RE: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking

2020-08-11 Thread Chen, Guchun
[AMD Public Use]

Hi Christian,

Since it's out of amdgpu driver scope, so I just attached my patch in this 
email thread instead of sending it by git sent-email.

Regards,
Guchun

-Original Message-
From: Koenig, Christian  
Sent: Tuesday, August 11, 2020 3:52 PM
To: Chen, Guchun ; Li, Dennis ; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander ; 
Kuehling, Felix ; Zhang, Hawking 
; dan...@ffwll.ch
Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking

Nice catch, it's not often that we stumble over something like this :)

Can you provide a patch to fix this? Problem is that we probably won't be able 
to push it to the AMD servers, but I can probably merge it through 
drm-misc-next.

Until this is fixed feel free to commit the patch with the {} in place.

Thanks,
Christian.

Am 11.08.20 um 09:44 schrieb Chen, Guchun:
> [AMD Public Use]
>
> # define down_write_nest_lock(sem, nest_lock) \
> do {  \
>   typecheck(struct lockdep_map *, &(nest_lock)->dep_map); \
>   _down_write_nest_lock(sem, &(nest_lock)->dep_map);  \
> } while (0);
>
> Looks the ';' after while (0) is the error point. It should be dropped.
>
> Regards,
> Guchun
>
> -Original Message-
> From: Koenig, Christian 
> Sent: Tuesday, August 11, 2020 3:41 PM
> To: Chen, Guchun ; Li, Dennis 
> ; amd-gfx@lists.freedesktop.org; Deucher, Alexander 
> ; Kuehling, Felix ; 
> Zhang, Hawking ; dan...@ffwll.ch
> Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive 
> recursive locking
>
> Am 11.08.20 um 09:36 schrieb Chen, Guchun:
>> [AMD Public Use]
>>
>>> -   down_write(&adev->reset_sem);
>>> +   if (hive) {
>>> +   down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
>>> +   } else
>>> +   down_write(&adev->reset_sem);
>> Coding style nit pick: You should drop the {} here.
>>
>> {} could not be dropped here, as down_write_nest_lock Is one macro with 
>> multiple lines, otherwise, build error would say 'else' missed one previous 
>> 'if'.
>> Instead of dropping, another {} should be added to else to include 
>> down_write(&adev->reset_sem), which makes the braces been balanced.
> Interesting bug, that is something which should not happen with macros in the 
> Linux kernel. They protection by "do { ... } while(0)" is mandatory here.
>
> Let me take a look,
> Christian.
>
>> Regards,
>> Guchun
>>
>> -----Original Message-----
>> From: amd-gfx  On Behalf Of 
>> Christian König
>> Sent: Tuesday, August 11, 2020 2:53 PM
>> To: Li, Dennis ; amd-gfx@lists.freedesktop.org; 
>> Deucher, Alexander ; Kuehling, Felix 
>> ; Zhang, Hawking ; 
>> dan...@ffwll.ch
>> Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive 
>> recursive locking
>>
>> Am 11.08.20 um 04:12 schrieb Dennis Li:
>>> [  584.110304] 
>>> [  584.110590] WARNING: possible recursive locking detected
>>> [  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
>>> [  584.64] 
>>> [  584.111456] kworker/38:1/553 is trying to acquire lock:
>>> [  584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at:
>>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.112112]
>>>   but task is already holding lock:
>>> [  584.112673] 9b1603d247a0 (&adev->reset_sem){}, at:
>>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.113068]
>>>   other info that might help us debug this:
>>> [  584.113689]  Possible unsafe locking scenario:
>>>
>>> [  584.114350]CPU0
>>> [  584.114685]
>>> [  584.115014]   lock(&adev->reset_sem);
>>> [  584.115349]   lock(&adev->reset_sem);
>>> [  584.115678]
>>>*** DEADLOCK ***
>>>
>>> [  584.116624]  May be due to missing lock nesting notation
>>>
>>> [  584.117284] 4 locks held by kworker/38:1/553:
>>> [  584.117616]  #0: 9ad635c1d348 ((wq_completion)events){+.+.},
>>> at: process_one_work+0x21f/0x630 [  584.117967]  #1: 
>>> ac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at:
>>> process_one_work+0x21f/0x630 [  584.118358]  #2: c1c2a5d0 
>>> (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] 
>>> [  

Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking

2020-08-11 Thread Christian König

Nice catch, it's not often that we stumble over something like this :)

Can you provide a patch to fix this? Problem is that we probably won't 
be able to push it to the AMD servers, but I can probably merge it 
through drm-misc-next.


Until this is fixed feel free to commit the patch with the {} in place.

Thanks,
Christian.

Am 11.08.20 um 09:44 schrieb Chen, Guchun:

[AMD Public Use]

# define down_write_nest_lock(sem, nest_lock)   \
do {\
typecheck(struct lockdep_map *, &(nest_lock)->dep_map);  \
_down_write_nest_lock(sem, &(nest_lock)->dep_map);   \
} while (0);

Looks the ';' after while (0) is the error point. It should be dropped.

Regards,
Guchun

-Original Message-
From: Koenig, Christian 
Sent: Tuesday, August 11, 2020 3:41 PM
To: Chen, Guchun ; Li, Dennis ; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix 
; Zhang, Hawking ; dan...@ffwll.ch
Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking

Am 11.08.20 um 09:36 schrieb Chen, Guchun:

[AMD Public Use]


-   down_write(&adev->reset_sem);
+   if (hive) {
+   down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
+   } else
+   down_write(&adev->reset_sem);

Coding style nit pick: You should drop the {} here.

{} could not be dropped here, as down_write_nest_lock Is one macro with 
multiple lines, otherwise, build error would say 'else' missed one previous 
'if'.
Instead of dropping, another {} should be added to else to include 
down_write(&adev->reset_sem), which makes the braces been balanced.

Interesting bug, that is something which should not happen with macros in the Linux 
kernel. They protection by "do { ... } while(0)" is mandatory here.

Let me take a look,
Christian.


Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of
Christian König
Sent: Tuesday, August 11, 2020 2:53 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org;
Deucher, Alexander ; Kuehling, Felix
; Zhang, Hawking ;
dan...@ffwll.ch
Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive
recursive locking

Am 11.08.20 um 04:12 schrieb Dennis Li:

[  584.110304] 
[  584.110590] WARNING: possible recursive locking detected
[  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
[  584.64] 
[  584.111456] kworker/38:1/553 is trying to acquire lock:
[  584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at:
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.112112]
  but task is already holding lock:
[  584.112673] 9b1603d247a0 (&adev->reset_sem){}, at:
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.113068]
  other info that might help us debug this:
[  584.113689]  Possible unsafe locking scenario:

[  584.114350]CPU0
[  584.114685]
[  584.115014]   lock(&adev->reset_sem);
[  584.115349]   lock(&adev->reset_sem);
[  584.115678]
   *** DEADLOCK ***

[  584.116624]  May be due to missing lock nesting notation

[  584.117284] 4 locks held by kworker/38:1/553:
[  584.117616]  #0: 9ad635c1d348 ((wq_completion)events){+.+.},
at: process_one_work+0x21f/0x630 [  584.117967]  #1: ac708e1c3e58
((work_completion)(&con->recovery_work)){+.+.}, at:
process_one_work+0x21f/0x630 [  584.118358]  #2: c1c2a5d0
(&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [  
584.118786]  #3: 9b1603d247a0 (&adev->reset_sem){}, at: 
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.119222]
  stack backtrace:
[  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G 
  OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
[  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT,
BIOS 3.1 05/23/2019 [  584.121223] Workqueue: events
amdgpu_ras_do_recovery [amdgpu] [  584.121638] Call Trace:
[  584.122050]  dump_stack+0x98/0xd5
[  584.122499]  __lock_acquire+0x1139/0x16e0 [  584.122931]  ?
trace_hardirqs_on+0x3b/0xf0 [  584.123358]  ?
cancel_delayed_work+0xa6/0xc0 [  584.123771]  lock_acquire+0xb8/0x1c0
[  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [
584.124599]  down_write+0x49/0x120 [  584.125032]  ?
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125472]
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125910]  ?
amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [  584.126367]
amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [  584.126789]
process_one_work+0x29e/0x630 [  584.127208]  worker_thread+0x3c/0x3f0
[  584.127621]  ? __kthread_parkme+0x61/0x90 [  584.128014]
kthread+0x12f/0x150 [  584.128402]  ? process_one_work+0x630/0x630 

Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking

2020-08-11 Thread Daniel Vetter
On Tue, Aug 11, 2020 at 9:40 AM Christian König
 wrote:
>
> Am 11.08.20 um 09:36 schrieb Chen, Guchun:
> > [AMD Public Use]
> >
> >> -down_write(&adev->reset_sem);
> >> +if (hive) {
> >> +down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
> >> +} else
> >> +down_write(&adev->reset_sem);
> > Coding style nit pick: You should drop the {} here.
> >
> > {} could not be dropped here, as down_write_nest_lock Is one macro with 
> > multiple lines, otherwise, build error would say 'else' missed one previous 
> > 'if'.
> > Instead of dropping, another {} should be added to else to include 
> > down_write(&adev->reset_sem), which makes the braces been balanced.
>
> Interesting bug, that is something which should not happen with macros
> in the Linux kernel. They protection by "do { ... } while(0)" is
> mandatory here.

Quick grep says it is protected like that ... can you pls paste the
full build error without {}?
-Daniel

>
> Let me take a look,
> Christian.
>
> >
> > Regards,
> > Guchun
> >
> > -Original Message-
> > From: amd-gfx  On Behalf Of 
> > Christian König
> > Sent: Tuesday, August 11, 2020 2:53 PM
> > To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, 
> > Alexander ; Kuehling, Felix 
> > ; Zhang, Hawking ; 
> > dan...@ffwll.ch
> > Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive 
> > locking
> >
> > Am 11.08.20 um 04:12 schrieb Dennis Li:
> >> [  584.110304] 
> >> [  584.110590] WARNING: possible recursive locking detected
> >> [  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   
> >> OE
> >> [  584.64] 
> >> [  584.111456] kworker/38:1/553 is trying to acquire lock:
> >> [  584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at:
> >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.112112]
> >>  but task is already holding lock:
> >> [  584.112673] 9b1603d247a0 (&adev->reset_sem){}, at:
> >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.113068]
> >>  other info that might help us debug this:
> >> [  584.113689]  Possible unsafe locking scenario:
> >>
> >> [  584.114350]CPU0
> >> [  584.114685]
> >> [  584.115014]   lock(&adev->reset_sem);
> >> [  584.115349]   lock(&adev->reset_sem);
> >> [  584.115678]
> >>   *** DEADLOCK ***
> >>
> >> [  584.116624]  May be due to missing lock nesting notation
> >>
> >> [  584.117284] 4 locks held by kworker/38:1/553:
> >> [  584.117616]  #0: 9ad635c1d348 ((wq_completion)events){+.+.},
> >> at: process_one_work+0x21f/0x630 [  584.117967]  #1: ac708e1c3e58
> >> ((work_completion)(&con->recovery_work)){+.+.}, at:
> >> process_one_work+0x21f/0x630 [  584.118358]  #2: c1c2a5d0
> >> (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 
> >> [amdgpu] [  584.118786]  #3: 9b1603d247a0 (&adev->reset_sem){}, 
> >> at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.119222]
> >>  stack backtrace:
> >> [  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: 
> >> G   OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
> >> [  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT,
> >> BIOS 3.1 05/23/2019 [  584.121223] Workqueue: events
> >> amdgpu_ras_do_recovery [amdgpu] [  584.121638] Call Trace:
> >> [  584.122050]  dump_stack+0x98/0xd5
> >> [  584.122499]  __lock_acquire+0x1139/0x16e0 [  584.122931]  ?
> >> trace_hardirqs_on+0x3b/0xf0 [  584.123358]  ?
> >> cancel_delayed_work+0xa6/0xc0 [  584.123771]  lock_acquire+0xb8/0x1c0
> >> [  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [
> >> 584.124599]  down_write+0x49/0x120 [  584.125032]  ?
> >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125472]
> >> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125910]  ?
> >> amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [  584.126367]
> >> amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [  584.126789]
> >> process_one_work+0x29e/0x630 [  584.127208]  worker_thread+0x3c/0x3f0
> >> [  584.127621]  ? __kthread_p

RE: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking

2020-08-11 Thread Chen, Guchun
[AMD Public Use]

# define down_write_nest_lock(sem, nest_lock)   \
do {\
typecheck(struct lockdep_map *, &(nest_lock)->dep_map); \
_down_write_nest_lock(sem, &(nest_lock)->dep_map);  \
} while (0);

Looks the ';' after while (0) is the error point. It should be dropped.

Regards,
Guchun

-Original Message-
From: Koenig, Christian  
Sent: Tuesday, August 11, 2020 3:41 PM
To: Chen, Guchun ; Li, Dennis ; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander ; 
Kuehling, Felix ; Zhang, Hawking 
; dan...@ffwll.ch
Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking

Am 11.08.20 um 09:36 schrieb Chen, Guchun:
> [AMD Public Use]
>
>> -down_write(&adev->reset_sem);
>> +if (hive) {
>> +down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
>> +} else
>> +down_write(&adev->reset_sem);
> Coding style nit pick: You should drop the {} here.
>
> {} could not be dropped here, as down_write_nest_lock Is one macro with 
> multiple lines, otherwise, build error would say 'else' missed one previous 
> 'if'.
> Instead of dropping, another {} should be added to else to include 
> down_write(&adev->reset_sem), which makes the braces been balanced.

Interesting bug, that is something which should not happen with macros in the 
Linux kernel. They protection by "do { ... } while(0)" is mandatory here.

Let me take a look,
Christian.

>
> Regards,
> Guchun
>
> -Original Message-
> From: amd-gfx  On Behalf Of 
> Christian König
> Sent: Tuesday, August 11, 2020 2:53 PM
> To: Li, Dennis ; amd-gfx@lists.freedesktop.org; 
> Deucher, Alexander ; Kuehling, Felix 
> ; Zhang, Hawking ; 
> dan...@ffwll.ch
> Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive 
> recursive locking
>
> Am 11.08.20 um 04:12 schrieb Dennis Li:
>> [  584.110304] 
>> [  584.110590] WARNING: possible recursive locking detected
>> [  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
>> [  584.64] 
>> [  584.111456] kworker/38:1/553 is trying to acquire lock:
>> [  584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at:
>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.112112]
>>  but task is already holding lock:
>> [  584.112673] 9b1603d247a0 (&adev->reset_sem){}, at:
>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.113068]
>>  other info that might help us debug this:
>> [  584.113689]  Possible unsafe locking scenario:
>>
>> [  584.114350]CPU0
>> [  584.114685]
>> [  584.115014]   lock(&adev->reset_sem);
>> [  584.115349]   lock(&adev->reset_sem);
>> [  584.115678]
>>   *** DEADLOCK ***
>>
>> [  584.116624]  May be due to missing lock nesting notation
>>
>> [  584.117284] 4 locks held by kworker/38:1/553:
>> [  584.117616]  #0: 9ad635c1d348 ((wq_completion)events){+.+.},
>> at: process_one_work+0x21f/0x630 [  584.117967]  #1: ac708e1c3e58 
>> ((work_completion)(&con->recovery_work)){+.+.}, at:
>> process_one_work+0x21f/0x630 [  584.118358]  #2: c1c2a5d0 
>> (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] 
>> [  584.118786]  #3: 9b1603d247a0 (&adev->reset_sem){}, at: 
>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.119222]
>>  stack backtrace:
>> [  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G  
>>  OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
>> [  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, 
>> BIOS 3.1 05/23/2019 [  584.121223] Workqueue: events 
>> amdgpu_ras_do_recovery [amdgpu] [  584.121638] Call Trace:
>> [  584.122050]  dump_stack+0x98/0xd5
>> [  584.122499]  __lock_acquire+0x1139/0x16e0 [  584.122931]  ?
>> trace_hardirqs_on+0x3b/0xf0 [  584.123358]  ?
>> cancel_delayed_work+0xa6/0xc0 [  584.123771]  lock_acquire+0xb8/0x1c0 
>> [  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 
>> 584.124599]  down_write+0x49/0x120 [  584.125032]  ?
>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125472]
>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125910]  ?
>> amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [  584.126367]
>> amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [  584.126789]
>> process_

Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking

2020-08-11 Thread Christian König

Am 11.08.20 um 09:36 schrieb Chen, Guchun:

[AMD Public Use]


-   down_write(&adev->reset_sem);
+   if (hive) {
+   down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
+   } else
+   down_write(&adev->reset_sem);

Coding style nit pick: You should drop the {} here.

{} could not be dropped here, as down_write_nest_lock Is one macro with 
multiple lines, otherwise, build error would say 'else' missed one previous 
'if'.
Instead of dropping, another {} should be added to else to include 
down_write(&adev->reset_sem), which makes the braces been balanced.


Interesting bug, that is something which should not happen with macros 
in the Linux kernel. They protection by "do { ... } while(0)" is 
mandatory here.


Let me take a look,
Christian.



Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Tuesday, August 11, 2020 2:53 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Kuehling, Felix ; Zhang, Hawking 
; dan...@ffwll.ch
Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking

Am 11.08.20 um 04:12 schrieb Dennis Li:

[  584.110304] 
[  584.110590] WARNING: possible recursive locking detected
[  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
[  584.64] 
[  584.111456] kworker/38:1/553 is trying to acquire lock:
[  584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at:
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.112112]
 but task is already holding lock:
[  584.112673] 9b1603d247a0 (&adev->reset_sem){}, at:
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.113068]
 other info that might help us debug this:
[  584.113689]  Possible unsafe locking scenario:

[  584.114350]CPU0
[  584.114685]
[  584.115014]   lock(&adev->reset_sem);
[  584.115349]   lock(&adev->reset_sem);
[  584.115678]
  *** DEADLOCK ***

[  584.116624]  May be due to missing lock nesting notation

[  584.117284] 4 locks held by kworker/38:1/553:
[  584.117616]  #0: 9ad635c1d348 ((wq_completion)events){+.+.},
at: process_one_work+0x21f/0x630 [  584.117967]  #1: ac708e1c3e58
((work_completion)(&con->recovery_work)){+.+.}, at:
process_one_work+0x21f/0x630 [  584.118358]  #2: c1c2a5d0
(&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [  
584.118786]  #3: 9b1603d247a0 (&adev->reset_sem){}, at: 
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.119222]
 stack backtrace:
[  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G 
  OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
[  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT,
BIOS 3.1 05/23/2019 [  584.121223] Workqueue: events
amdgpu_ras_do_recovery [amdgpu] [  584.121638] Call Trace:
[  584.122050]  dump_stack+0x98/0xd5
[  584.122499]  __lock_acquire+0x1139/0x16e0 [  584.122931]  ?
trace_hardirqs_on+0x3b/0xf0 [  584.123358]  ?
cancel_delayed_work+0xa6/0xc0 [  584.123771]  lock_acquire+0xb8/0x1c0
[  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [
584.124599]  down_write+0x49/0x120 [  584.125032]  ?
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125472]
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125910]  ?
amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [  584.126367]
amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [  584.126789]
process_one_work+0x29e/0x630 [  584.127208]  worker_thread+0x3c/0x3f0
[  584.127621]  ? __kthread_parkme+0x61/0x90 [  584.128014]
kthread+0x12f/0x150 [  584.128402]  ? process_one_work+0x630/0x630 [
584.128790]  ? kthread_park+0x90/0x90 [  584.129174]
ret_from_fork+0x3a/0x50

Each adev has owned lock_class_key to avoid false positive recursive
locking.

v2:
1. register adev->lock_key into lockdep, otherwise lockdep will report
the below warning

[ 1216.705820] BUG: key 890183b647d0 has not been registered!
[ 1216.705924] [ cut here ] [ 1216.705972]
DEBUG_LOCKS_WARN_ON(1) [ 1216.705997] WARNING: CPU: 20 PID: 541 at
kernel/locking/lockdep.c:3743 lockdep_init_map+0x150/0x210

v3:
change to use down_write_nest_lock to annotate the false dead-lock
warning.

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 62ecac97fbd2..8a55b0bc044a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4145,12 +4145,15 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info 
*hive,
return r;
   }
   
-static bool amdgpu_device_lock_adev(struct amdgpu_device *adev)

+static bool amdgpu_device_lock_adev(struct

RE: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking

2020-08-11 Thread Chen, Guchun
[AMD Public Use]

> - down_write(&adev->reset_sem);
> + if (hive) {
> + down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
> + } else
> + down_write(&adev->reset_sem);

Coding style nit pick: You should drop the {} here.

{} could not be dropped here, as down_write_nest_lock Is one macro with 
multiple lines, otherwise, build error would say 'else' missed one previous 
'if'.
Instead of dropping, another {} should be added to else to include 
down_write(&adev->reset_sem), which makes the braces been balanced.

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Tuesday, August 11, 2020 2:53 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Kuehling, Felix 
; Zhang, Hawking ; 
dan...@ffwll.ch
Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking

Am 11.08.20 um 04:12 schrieb Dennis Li:
> [  584.110304] 
> [  584.110590] WARNING: possible recursive locking detected
> [  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
> [  584.64] 
> [  584.111456] kworker/38:1/553 is trying to acquire lock:
> [  584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: 
> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.112112]
> but task is already holding lock:
> [  584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: 
> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.113068]
> other info that might help us debug this:
> [  584.113689]  Possible unsafe locking scenario:
>
> [  584.114350]CPU0
> [  584.114685]
> [  584.115014]   lock(&adev->reset_sem);
> [  584.115349]   lock(&adev->reset_sem);
> [  584.115678]
>  *** DEADLOCK ***
>
> [  584.116624]  May be due to missing lock nesting notation
>
> [  584.117284] 4 locks held by kworker/38:1/553:
> [  584.117616]  #0: 9ad635c1d348 ((wq_completion)events){+.+.}, 
> at: process_one_work+0x21f/0x630 [  584.117967]  #1: ac708e1c3e58 
> ((work_completion)(&con->recovery_work)){+.+.}, at: 
> process_one_work+0x21f/0x630 [  584.118358]  #2: c1c2a5d0 
> (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [ 
>  584.118786]  #3: 9b1603d247a0 (&adev->reset_sem){}, at: 
> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.119222]
> stack backtrace:
> [  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G   
> OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
> [  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, 
> BIOS 3.1 05/23/2019 [  584.121223] Workqueue: events 
> amdgpu_ras_do_recovery [amdgpu] [  584.121638] Call Trace:
> [  584.122050]  dump_stack+0x98/0xd5
> [  584.122499]  __lock_acquire+0x1139/0x16e0 [  584.122931]  ? 
> trace_hardirqs_on+0x3b/0xf0 [  584.123358]  ? 
> cancel_delayed_work+0xa6/0xc0 [  584.123771]  lock_acquire+0xb8/0x1c0 
> [  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  
> 584.124599]  down_write+0x49/0x120 [  584.125032]  ? 
> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125472]  
> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125910]  ? 
> amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [  584.126367]  
> amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [  584.126789]  
> process_one_work+0x29e/0x630 [  584.127208]  worker_thread+0x3c/0x3f0 
> [  584.127621]  ? __kthread_parkme+0x61/0x90 [  584.128014]  
> kthread+0x12f/0x150 [  584.128402]  ? process_one_work+0x630/0x630 [  
> 584.128790]  ? kthread_park+0x90/0x90 [  584.129174]  
> ret_from_fork+0x3a/0x50
>
> Each adev has owned lock_class_key to avoid false positive recursive 
> locking.
>
> v2:
> 1. register adev->lock_key into lockdep, otherwise lockdep will report 
> the below warning
>
> [ 1216.705820] BUG: key 890183b647d0 has not been registered!
> [ 1216.705924] [ cut here ] [ 1216.705972] 
> DEBUG_LOCKS_WARN_ON(1) [ 1216.705997] WARNING: CPU: 20 PID: 541 at 
> kernel/locking/lockdep.c:3743 lockdep_init_map+0x150/0x210
>
> v3:
> change to use down_write_nest_lock to annotate the false dead-lock 
> warning.
>
> Signed-off-by: Dennis Li 
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 62ecac97fbd2..8a55b0bc044a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4145,12 +4145,15 @@ static int amdgpu_do_asic_reset(struct 

Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking

2020-08-10 Thread Daniel Vetter
On Tue, Aug 11, 2020 at 4:12 AM Dennis Li  wrote:
>
> [  584.110304] 
> [  584.110590] WARNING: possible recursive locking detected
> [  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
> [  584.64] 
> [  584.111456] kworker/38:1/553 is trying to acquire lock:
> [  584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: 
> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.112112]
>but task is already holding lock:
> [  584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: 
> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.113068]
>other info that might help us debug this:
> [  584.113689]  Possible unsafe locking scenario:
>
> [  584.114350]CPU0
> [  584.114685]
> [  584.115014]   lock(&adev->reset_sem);
> [  584.115349]   lock(&adev->reset_sem);
> [  584.115678]
> *** DEADLOCK ***
>
> [  584.116624]  May be due to missing lock nesting notation
>
> [  584.117284] 4 locks held by kworker/38:1/553:
> [  584.117616]  #0: 9ad635c1d348 ((wq_completion)events){+.+.}, at: 
> process_one_work+0x21f/0x630
> [  584.117967]  #1: ac708e1c3e58 
> ((work_completion)(&con->recovery_work)){+.+.}, at: 
> process_one_work+0x21f/0x630
> [  584.118358]  #2: c1c2a5d0 (&tmp->hive_lock){+.+.}, at: 
> amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
> [  584.118786]  #3: 9b1603d247a0 (&adev->reset_sem){}, at: 
> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.119222]
>stack backtrace:
> [  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G   
> OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
> [  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 
> 05/23/2019
> [  584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
> [  584.121638] Call Trace:
> [  584.122050]  dump_stack+0x98/0xd5
> [  584.122499]  __lock_acquire+0x1139/0x16e0
> [  584.122931]  ? trace_hardirqs_on+0x3b/0xf0
> [  584.123358]  ? cancel_delayed_work+0xa6/0xc0
> [  584.123771]  lock_acquire+0xb8/0x1c0
> [  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.124599]  down_write+0x49/0x120
> [  584.125032]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.125472]  amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.125910]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
> [  584.126367]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
> [  584.126789]  process_one_work+0x29e/0x630
> [  584.127208]  worker_thread+0x3c/0x3f0
> [  584.127621]  ? __kthread_parkme+0x61/0x90
> [  584.128014]  kthread+0x12f/0x150
> [  584.128402]  ? process_one_work+0x630/0x630
> [  584.128790]  ? kthread_park+0x90/0x90
> [  584.129174]  ret_from_fork+0x3a/0x50
>
> Each adev has owned lock_class_key to avoid false positive
> recursive locking.
>
> v2:
> 1. register adev->lock_key into lockdep, otherwise lockdep will
> report the below warning
>
> [ 1216.705820] BUG: key 890183b647d0 has not been registered!
> [ 1216.705924] [ cut here ]
> [ 1216.705972] DEBUG_LOCKS_WARN_ON(1)
> [ 1216.705997] WARNING: CPU: 20 PID: 541 at kernel/locking/lockdep.c:3743 
> lockdep_init_map+0x150/0x210
>
> v3:
> change to use down_write_nest_lock to annotate the false dead-lock
> warning.
>
> Signed-off-by: Dennis Li 

I'd explain a bit more in writing what the actual deadlock possibility
is, but up to you folks.

Reviewed-by: Daniel Vetter 

Cheers, Daniel
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 62ecac97fbd2..8a55b0bc044a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4145,12 +4145,15 @@ static int amdgpu_do_asic_reset(struct 
> amdgpu_hive_info *hive,
> return r;
>  }
>
> -static bool amdgpu_device_lock_adev(struct amdgpu_device *adev)
> +static bool amdgpu_device_lock_adev(struct amdgpu_device *adev, struct 
> amdgpu_hive_info *hive)
>  {
> if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
> return false;
>
> -   down_write(&adev->reset_sem);
> +   if (hive) {
> +   down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
> +   } else
> +   down_write(&adev->reset_sem);
>
> atomic_inc(&adev->gpu_reset_counter);
> switch (amdgpu_asic_reset_method(adev)) {
> @@ -4312,7 +4315,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>
> /* block all schedulers and reset given job's ring */
> list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> -   if (!amdgpu_device_lock_adev(tmp_adev)) {
> +   if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
> DRM_INFO("Bailing on TDR for s_job:%llx, as another 
> already in progress",
>

Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking

2020-08-10 Thread Christian König

Am 11.08.20 um 04:12 schrieb Dennis Li:

[  584.110304] 
[  584.110590] WARNING: possible recursive locking detected
[  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
[  584.64] 
[  584.111456] kworker/38:1/553 is trying to acquire lock:
[  584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: 
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.112112]
but task is already holding lock:
[  584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: 
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.113068]
other info that might help us debug this:
[  584.113689]  Possible unsafe locking scenario:

[  584.114350]CPU0
[  584.114685]
[  584.115014]   lock(&adev->reset_sem);
[  584.115349]   lock(&adev->reset_sem);
[  584.115678]
 *** DEADLOCK ***

[  584.116624]  May be due to missing lock nesting notation

[  584.117284] 4 locks held by kworker/38:1/553:
[  584.117616]  #0: 9ad635c1d348 ((wq_completion)events){+.+.}, at: 
process_one_work+0x21f/0x630
[  584.117967]  #1: ac708e1c3e58 
((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
[  584.118358]  #2: c1c2a5d0 (&tmp->hive_lock){+.+.}, at: 
amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
[  584.118786]  #3: 9b1603d247a0 (&adev->reset_sem){}, at: 
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.119222]
stack backtrace:
[  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G 
  OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
[  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 
05/23/2019
[  584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[  584.121638] Call Trace:
[  584.122050]  dump_stack+0x98/0xd5
[  584.122499]  __lock_acquire+0x1139/0x16e0
[  584.122931]  ? trace_hardirqs_on+0x3b/0xf0
[  584.123358]  ? cancel_delayed_work+0xa6/0xc0
[  584.123771]  lock_acquire+0xb8/0x1c0
[  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.124599]  down_write+0x49/0x120
[  584.125032]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.125472]  amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.125910]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
[  584.126367]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
[  584.126789]  process_one_work+0x29e/0x630
[  584.127208]  worker_thread+0x3c/0x3f0
[  584.127621]  ? __kthread_parkme+0x61/0x90
[  584.128014]  kthread+0x12f/0x150
[  584.128402]  ? process_one_work+0x630/0x630
[  584.128790]  ? kthread_park+0x90/0x90
[  584.129174]  ret_from_fork+0x3a/0x50

Each adev has owned lock_class_key to avoid false positive
recursive locking.

v2:
1. register adev->lock_key into lockdep, otherwise lockdep will
report the below warning

[ 1216.705820] BUG: key 890183b647d0 has not been registered!
[ 1216.705924] [ cut here ]
[ 1216.705972] DEBUG_LOCKS_WARN_ON(1)
[ 1216.705997] WARNING: CPU: 20 PID: 541 at kernel/locking/lockdep.c:3743 
lockdep_init_map+0x150/0x210

v3:
change to use down_write_nest_lock to annotate the false dead-lock
warning.

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 62ecac97fbd2..8a55b0bc044a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4145,12 +4145,15 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info 
*hive,
return r;
  }
  
-static bool amdgpu_device_lock_adev(struct amdgpu_device *adev)

+static bool amdgpu_device_lock_adev(struct amdgpu_device *adev, struct 
amdgpu_hive_info *hive)
  {
if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
return false;
  
-	down_write(&adev->reset_sem);

+   if (hive) {
+   down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
+   } else
+   down_write(&adev->reset_sem);


Coding style nit pick: You should drop the {} here.

Apart from that the patch is Reviewed-by: Christian König 



  
  	atomic_inc(&adev->gpu_reset_counter);

switch (amdgpu_asic_reset_method(adev)) {
@@ -4312,7 +4315,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
  
  	/* block all schedulers and reset given job's ring */

list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
-   if (!amdgpu_device_lock_adev(tmp_adev)) {
+   if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
DRM_INFO("Bailing on TDR for s_job:%llx, as another already 
in progress",
  job ? job->base.id : -1);
r = 0;


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH v3] drm/amdgpu: annotate a false positive recursive locking

2020-08-10 Thread Dennis Li
[  584.110304] 
[  584.110590] WARNING: possible recursive locking detected
[  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
[  584.64] 
[  584.111456] kworker/38:1/553 is trying to acquire lock:
[  584.111721] 9b15ff0a47a0 (&adev->reset_sem){}, at: 
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.112112]
   but task is already holding lock:
[  584.112673] 9b1603d247a0 (&adev->reset_sem){}, at: 
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.113068]
   other info that might help us debug this:
[  584.113689]  Possible unsafe locking scenario:

[  584.114350]CPU0
[  584.114685]
[  584.115014]   lock(&adev->reset_sem);
[  584.115349]   lock(&adev->reset_sem);
[  584.115678]
*** DEADLOCK ***

[  584.116624]  May be due to missing lock nesting notation

[  584.117284] 4 locks held by kworker/38:1/553:
[  584.117616]  #0: 9ad635c1d348 ((wq_completion)events){+.+.}, at: 
process_one_work+0x21f/0x630
[  584.117967]  #1: ac708e1c3e58 
((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
[  584.118358]  #2: c1c2a5d0 (&tmp->hive_lock){+.+.}, at: 
amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
[  584.118786]  #3: 9b1603d247a0 (&adev->reset_sem){}, at: 
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.119222]
   stack backtrace:
[  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G 
  OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
[  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 
05/23/2019
[  584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[  584.121638] Call Trace:
[  584.122050]  dump_stack+0x98/0xd5
[  584.122499]  __lock_acquire+0x1139/0x16e0
[  584.122931]  ? trace_hardirqs_on+0x3b/0xf0
[  584.123358]  ? cancel_delayed_work+0xa6/0xc0
[  584.123771]  lock_acquire+0xb8/0x1c0
[  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.124599]  down_write+0x49/0x120
[  584.125032]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.125472]  amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.125910]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
[  584.126367]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
[  584.126789]  process_one_work+0x29e/0x630
[  584.127208]  worker_thread+0x3c/0x3f0
[  584.127621]  ? __kthread_parkme+0x61/0x90
[  584.128014]  kthread+0x12f/0x150
[  584.128402]  ? process_one_work+0x630/0x630
[  584.128790]  ? kthread_park+0x90/0x90
[  584.129174]  ret_from_fork+0x3a/0x50

Each adev has owned lock_class_key to avoid false positive
recursive locking.

v2:
1. register adev->lock_key into lockdep, otherwise lockdep will
report the below warning

[ 1216.705820] BUG: key 890183b647d0 has not been registered!
[ 1216.705924] [ cut here ]
[ 1216.705972] DEBUG_LOCKS_WARN_ON(1)
[ 1216.705997] WARNING: CPU: 20 PID: 541 at kernel/locking/lockdep.c:3743 
lockdep_init_map+0x150/0x210

v3:
change to use down_write_nest_lock to annotate the false dead-lock
warning.

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 62ecac97fbd2..8a55b0bc044a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4145,12 +4145,15 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info 
*hive,
return r;
 }
 
-static bool amdgpu_device_lock_adev(struct amdgpu_device *adev)
+static bool amdgpu_device_lock_adev(struct amdgpu_device *adev, struct 
amdgpu_hive_info *hive)
 {
if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
return false;
 
-   down_write(&adev->reset_sem);
+   if (hive) {
+   down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
+   } else
+   down_write(&adev->reset_sem);
 
atomic_inc(&adev->gpu_reset_counter);
switch (amdgpu_asic_reset_method(adev)) {
@@ -4312,7 +4315,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
/* block all schedulers and reset given job's ring */
list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
-   if (!amdgpu_device_lock_adev(tmp_adev)) {
+   if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
DRM_INFO("Bailing on TDR for s_job:%llx, as another 
already in progress",
  job ? job->base.id : -1);
r = 0;
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx