Re: [PATCH] drm/amdgpu: hold the reference of finished fence

2020-03-24 Thread Grodzovsky, Andrey
I see now. In that case the change seems good to me.

Andrey

From: Koenig, Christian 
Sent: 24 March 2020 13:58
To: Grodzovsky, Andrey ; Pan, Xinhui 
; Tao, Yintian ; Deucher, Alexander 

Cc: amd-gfx@lists.freedesktop.org 
Subject: Re: [PATCH] drm/amdgpu: hold the reference of finished fence

There is a misunderstanding here:

Did you find out why the zero refcount on the finished fence happens
before the fence was signaled ?

The refcount on the finished fence doesn't become zero before it is signaled; 
it becomes zero while it is being signaled.

CPU 1 calls dma_fence_signal(fence) without holding a reference to the fence. 
CPU 2 at the same time checks if the fence is signaled and frees the last 
reference because it finds the signaled flag set.

The problem is now that dma_fence_signal() wants to set the timestamp after 
setting the signaled flag and now races with freeing the memory.

That's a really hard-to-hit problem, but it does indeed seem to be possible.

Christian.

Am 24.03.20 um 15:52 schrieb Andrey Grodzovsky:

This is only for the guilty job, which was removed from the ring_mirror_list due 
to completion and hence will not be resubmitted by recovery and will not be 
freed by the usual flow in drm_sched_get_cleanup_job (see drm_sched_stop).

Andrey

On 3/24/20 10:45 AM, Pan, Xinhui wrote:

[AMD Official Use Only - Internal Distribution Only]

Does this issue occur during GPU recovery?
I just checked the code: a fence timeout will free the job and put its fence, 
but GPU recovery might resubmit the job.
Correct me if I am wrong.

From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> on behalf of Andrey Grodzovsky <andrey.grodzov...@amd.com>
Sent: Tuesday, March 24, 2020 11:40:06 AM
To: Tao, Yintian <yintian@amd.com>; Koenig, Christian <christian.koe...@amd.com>; Deucher, Alexander <alexander.deuc...@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: hold the reference of finished fence


On 3/23/20 10:22 AM, Yintian Tao wrote:
> There is one corner case at dma_fence_signal_locked
> which will raise the NULL pointer problem just like below.
> ->dma_fence_signal
>  ->dma_fence_signal_locked
>->test_and_set_bit
> here dma_fence_release is triggered because the fence refcount drops to zero.


Did you find out why the zero refcount on the finished fence happens
before the fence was signaled? The finished fence is created with
refcount set to 1 in drm_sched_fence_create->dma_fence_init and then the
refcount is decremented in
drm_sched_main->amdgpu_job_free_cb->drm_sched_job_cleanup. This should
only happen after the fence is already signaled (see
drm_sched_get_cleanup_job). On top of that, the finished fence is
referenced from other places (e.g. entity->last_scheduled, etc.).


>
> ->dma_fence_put
>  ->dma_fence_release
>->drm_sched_fence_release_scheduled
>->call_rcu
> here the union field "cb_list" of the finished fence is overwritten
> because struct rcu_head contains two pointers,
> the same layout as struct list_head cb_list.
>
> Therefore, hold a reference to the finished fence at drm_sched_process_job
> to prevent the NULL pointer dereference during the finished fence's dma_fence_signal.
>
> [  732.912867] BUG: kernel NULL pointer dereference, address: 0008
> [  732.914815] #PF: supervisor write access in kernel mode
> [  732.915731] #PF: error_code(0x0002) - not-present page
> [  732.916621] PGD 0 P4D 0
> [  732.917072] Oops: 0002 [#1] SMP PTI
> [  732.917682] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G   OE 
> 5.4.0-rc7 #1
> [  732.918980] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> [  732.920906] RIP: 0010:dma_fence_signal_locked+0x3e/0x100
> [  732.938569] Call Trace:
> [  732.939003]  
> [  732.939364]  dma_fence_signal+0x29/0x50
> [  732.940036]  drm_sched_fence_finished+0x12/0x20 [gpu_sched]
> [  732.940996]  drm_sched_process_job+0x34/0xa0 [gpu_sched]
> [  732.941910]  dma_fence_signal_locked+0x85/0x100
> [  732.942692]  dma_fence_signal+0x29/0x50
> [  732.943457]  amdgpu_fence_process+0x99/0x120 [amdgpu]
> [  732.944393]  sdma_v4_0_process_trap_irq+0x81/0xa0 [amdgpu]
>
> v2: hold the finished fence at drm_sched_process_job instead of
>  amdgpu_fence_process
> v3: resume the blank line
>
> Signed-off-by: Yintian Tao <yt...@amd.com>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 

Re: [PATCH] drm/amdgpu: hold the reference of finished fence

2020-03-24 Thread Christian König

There is a misunderstanding here:


Did you find out why the zero refcount on the finished fence happens
before the fence was signaled ?


The refcount on the finished fence doesn't become zero before it is 
signaled; it becomes zero while it is being signaled.


CPU 1 calls dma_fence_signal(fence) without holding a reference to the 
fence. CPU 2 at the same time checks if the fence is signaled and frees 
the last reference because it finds the signaled flag set.


The problem is now that dma_fence_signal() wants to set the timestamp 
after setting the signaled flag and now races with freeing the memory.


That's a really hard-to-hit problem, but it does indeed seem to be possible.

Christian.

Am 24.03.20 um 15:52 schrieb Andrey Grodzovsky:


This is only for the guilty job, which was removed from the 
ring_mirror_list due to completion and hence will not be resubmitted 
by recovery and will not be freed by the usual flow in 
drm_sched_get_cleanup_job (see drm_sched_stop).


Andrey

On 3/24/20 10:45 AM, Pan, Xinhui wrote:


[AMD Official Use Only - Internal Distribution Only]


Does this issue occur during GPU recovery?
I just checked the code: a fence timeout will free the job and put its 
fence, but GPU recovery might resubmit the job.

Correct me if I am wrong.

*From:* amd-gfx  on behalf of 
Andrey Grodzovsky 

*Sent:* Tuesday, March 24, 2020 11:40:06 AM
*To:* Tao, Yintian ; Koenig, Christian 
; Deucher, Alexander 


*Cc:* amd-gfx@lists.freedesktop.org 
*Subject:* Re: [PATCH] drm/amdgpu: hold the reference of finished fence

On 3/23/20 10:22 AM, Yintian Tao wrote:
> There is one corner case at dma_fence_signal_locked
> which will raise the NULL pointer problem just like below.
> ->dma_fence_signal
>  ->dma_fence_signal_locked
>    ->test_and_set_bit
> here dma_fence_release is triggered because the fence refcount drops to zero.



Did you find out why the zero refcount on the finished fence happens
before the fence was signaled? The finished fence is created with
refcount set to 1 in drm_sched_fence_create->dma_fence_init and then the
refcount is decremented in
drm_sched_main->amdgpu_job_free_cb->drm_sched_job_cleanup. This should
only happen after the fence is already signaled (see
drm_sched_get_cleanup_job). On top of that, the finished fence is
referenced from other places (e.g. entity->last_scheduled, etc.).


>
> ->dma_fence_put
>  ->dma_fence_release
>    ->drm_sched_fence_release_scheduled
>    ->call_rcu
> here the union field "cb_list" of the finished fence is overwritten
> because struct rcu_head contains two pointers,
> the same layout as struct list_head cb_list.
>
> Therefore, hold a reference to the finished fence at drm_sched_process_job
> to prevent the NULL pointer dereference during the finished fence's dma_fence_signal.
>
> [  732.912867] BUG: kernel NULL pointer dereference, address: 
0008

> [  732.914815] #PF: supervisor write access in kernel mode
> [  732.915731] #PF: error_code(0x0002) - not-present page
> [  732.916621] PGD 0 P4D 0
> [  732.917072] Oops: 0002 [#1] SMP PTI
> [  732.917682] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G   
OE 5.4.0-rc7 #1
> [  732.918980] Hardware name: QEMU Standard PC (i440FX + PIIX, 
1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014

> [  732.920906] RIP: 0010:dma_fence_signal_locked+0x3e/0x100
> [  732.938569] Call Trace:
> [  732.939003]  
> [  732.939364]  dma_fence_signal+0x29/0x50
> [  732.940036] drm_sched_fence_finished+0x12/0x20 [gpu_sched]
> [  732.940996]  drm_sched_process_job+0x34/0xa0 [gpu_sched]
> [  732.941910] dma_fence_signal_locked+0x85/0x100
> [  732.942692]  dma_fence_signal+0x29/0x50
> [  732.943457]  amdgpu_fence_process+0x99/0x120 [amdgpu]
> [  732.944393] sdma_v4_0_process_trap_irq+0x81/0xa0 [amdgpu]
>
> v2: hold the finished fence at drm_sched_process_job instead of
>  amdgpu_fence_process
> v3: resume the blank line
>
> Signed-off-by: Yintian Tao 
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c

> index a18eabf692e4..8e731ed0d9d9 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -651,7 +651,9 @@ static void drm_sched_process_job(struct 
dma_fence *f, struct dma_fence_cb *cb)

>
>    trace_drm_sched_process_job(s_fence);
>
> + dma_fence_get(&s_fence->finished);
>    drm_sched_fence_finished(s_fence);


If the fence was already released during call to
drm_sched_fence_finished->dma_fence_signal->... why is it safe to
reference the s_fence just before that call ? Can't it already be
released by this time ?

Andrey

Re: [PATCH] drm/amdgpu: hold the reference of finished fence

2020-03-24 Thread Andrey Grodzovsky
This is only for the guilty job, which was removed from the 
ring_mirror_list due to completion and hence will not be resubmitted by 
recovery and will not be freed by the usual flow in 
drm_sched_get_cleanup_job (see drm_sched_stop).


Andrey

On 3/24/20 10:45 AM, Pan, Xinhui wrote:


[AMD Official Use Only - Internal Distribution Only]


Does this issue occur during GPU recovery?
I just checked the code: a fence timeout will free the job and put its 
fence, but GPU recovery might resubmit the job.

Correct me if I am wrong.

*From:* amd-gfx  on behalf of 
Andrey Grodzovsky 

*Sent:* Tuesday, March 24, 2020 11:40:06 AM
*To:* Tao, Yintian ; Koenig, Christian 
; Deucher, Alexander 

*Cc:* amd-gfx@lists.freedesktop.org 
*Subject:* Re: [PATCH] drm/amdgpu: hold the reference of finished fence

On 3/23/20 10:22 AM, Yintian Tao wrote:
> There is one corner case at dma_fence_signal_locked
> which will raise the NULL pointer problem just like below.
> ->dma_fence_signal
>  ->dma_fence_signal_locked
>    ->test_and_set_bit
> here dma_fence_release is triggered because the fence refcount drops to zero.


Did you find out why the zero refcount on the finished fence happens
before the fence was signaled? The finished fence is created with
refcount set to 1 in drm_sched_fence_create->dma_fence_init and then the
refcount is decremented in
drm_sched_main->amdgpu_job_free_cb->drm_sched_job_cleanup. This should
only happen after the fence is already signaled (see
drm_sched_get_cleanup_job). On top of that, the finished fence is
referenced from other places (e.g. entity->last_scheduled, etc.).


>
> ->dma_fence_put
>  ->dma_fence_release
>    ->drm_sched_fence_release_scheduled
>    ->call_rcu
> here the union field "cb_list" of the finished fence is overwritten
> because struct rcu_head contains two pointers,
> the same layout as struct list_head cb_list.
>
> Therefore, hold a reference to the finished fence at drm_sched_process_job
> to prevent the NULL pointer dereference during the finished fence's dma_fence_signal.
>
> [  732.912867] BUG: kernel NULL pointer dereference, address: 
0008

> [  732.914815] #PF: supervisor write access in kernel mode
> [  732.915731] #PF: error_code(0x0002) - not-present page
> [  732.916621] PGD 0 P4D 0
> [  732.917072] Oops: 0002 [#1] SMP PTI
> [  732.917682] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G   
OE 5.4.0-rc7 #1
> [  732.918980] Hardware name: QEMU Standard PC (i440FX + PIIX, 
1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014

> [  732.920906] RIP: 0010:dma_fence_signal_locked+0x3e/0x100
> [  732.938569] Call Trace:
> [  732.939003]  
> [  732.939364]  dma_fence_signal+0x29/0x50
> [  732.940036]  drm_sched_fence_finished+0x12/0x20 [gpu_sched]
> [  732.940996]  drm_sched_process_job+0x34/0xa0 [gpu_sched]
> [  732.941910]  dma_fence_signal_locked+0x85/0x100
> [  732.942692]  dma_fence_signal+0x29/0x50
> [  732.943457]  amdgpu_fence_process+0x99/0x120 [amdgpu]
> [  732.944393] sdma_v4_0_process_trap_irq+0x81/0xa0 [amdgpu]
>
> v2: hold the finished fence at drm_sched_process_job instead of
>  amdgpu_fence_process
> v3: resume the blank line
>
> Signed-off-by: Yintian Tao 
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c

> index a18eabf692e4..8e731ed0d9d9 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -651,7 +651,9 @@ static void drm_sched_process_job(struct 
dma_fence *f, struct dma_fence_cb *cb)

>
>    trace_drm_sched_process_job(s_fence);
>
> + dma_fence_get(&s_fence->finished);
>    drm_sched_fence_finished(s_fence);


If the fence was already released during call to
drm_sched_fence_finished->dma_fence_signal->... why is it safe to
reference the s_fence just before that call ? Can't it already be
released by this time ?

Andrey



> + dma_fence_put(&s_fence->finished);
> wake_up_interruptible(&sched->wake_up_worker);
>   }
>
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: hold the reference of finished fence

2020-03-24 Thread Pan, Xinhui
[AMD Official Use Only - Internal Distribution Only]

Does this issue occur during GPU recovery?
I just checked the code: a fence timeout will free the job and put its fence, 
but GPU recovery might resubmit the job.
Correct me if I am wrong.

From: amd-gfx  on behalf of Andrey 
Grodzovsky 
Sent: Tuesday, March 24, 2020 11:40:06 AM
To: Tao, Yintian ; Koenig, Christian 
; Deucher, Alexander 
Cc: amd-gfx@lists.freedesktop.org 
Subject: Re: [PATCH] drm/amdgpu: hold the reference of finished fence


On 3/23/20 10:22 AM, Yintian Tao wrote:
> There is one corner case at dma_fence_signal_locked
> which will raise the NULL pointer problem just like below.
> ->dma_fence_signal
>  ->dma_fence_signal_locked
>->test_and_set_bit
> here dma_fence_release is triggered because the fence refcount drops to zero.


Did you find out why the zero refcount on the finished fence happens
before the fence was signaled? The finished fence is created with
refcount set to 1 in drm_sched_fence_create->dma_fence_init and then the
refcount is decremented in
drm_sched_main->amdgpu_job_free_cb->drm_sched_job_cleanup. This should
only happen after the fence is already signaled (see
drm_sched_get_cleanup_job). On top of that, the finished fence is
referenced from other places (e.g. entity->last_scheduled, etc.).


>
> ->dma_fence_put
>  ->dma_fence_release
>->drm_sched_fence_release_scheduled
>->call_rcu
> here the union field "cb_list" of the finished fence is overwritten
> because struct rcu_head contains two pointers,
> the same layout as struct list_head cb_list.
>
> Therefore, hold a reference to the finished fence at drm_sched_process_job
> to prevent the NULL pointer dereference during the finished fence's dma_fence_signal.
>
> [  732.912867] BUG: kernel NULL pointer dereference, address: 0008
> [  732.914815] #PF: supervisor write access in kernel mode
> [  732.915731] #PF: error_code(0x0002) - not-present page
> [  732.916621] PGD 0 P4D 0
> [  732.917072] Oops: 0002 [#1] SMP PTI
> [  732.917682] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G   OE 
> 5.4.0-rc7 #1
> [  732.918980] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> [  732.920906] RIP: 0010:dma_fence_signal_locked+0x3e/0x100
> [  732.938569] Call Trace:
> [  732.939003]  
> [  732.939364]  dma_fence_signal+0x29/0x50
> [  732.940036]  drm_sched_fence_finished+0x12/0x20 [gpu_sched]
> [  732.940996]  drm_sched_process_job+0x34/0xa0 [gpu_sched]
> [  732.941910]  dma_fence_signal_locked+0x85/0x100
> [  732.942692]  dma_fence_signal+0x29/0x50
> [  732.943457]  amdgpu_fence_process+0x99/0x120 [amdgpu]
> [  732.944393]  sdma_v4_0_process_trap_irq+0x81/0xa0 [amdgpu]
>
> v2: hold the finished fence at drm_sched_process_job instead of
>  amdgpu_fence_process
> v3: resume the blank line
>
> Signed-off-by: Yintian Tao 
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index a18eabf692e4..8e731ed0d9d9 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -651,7 +651,9 @@ static void drm_sched_process_job(struct dma_fence *f, 
> struct dma_fence_cb *cb)
>
>trace_drm_sched_process_job(s_fence);
>
> + dma_fence_get(&s_fence->finished);
>drm_sched_fence_finished(s_fence);


If the fence was already released during call to
drm_sched_fence_finished->dma_fence_signal->... why is it safe to
reference the s_fence just before that call ? Can't it already be
released by this time ?

Andrey



> + dma_fence_put(&s_fence->finished);
>wake_up_interruptible(&sched->wake_up_worker);
>   }
>


Re: [PATCH] drm/amdgpu: hold the reference of finished fence

2020-03-23 Thread Andrey Grodzovsky


On 3/23/20 10:22 AM, Yintian Tao wrote:

There is one corner case at dma_fence_signal_locked
which will raise the NULL pointer problem just like below.
->dma_fence_signal
 ->dma_fence_signal_locked
->test_and_set_bit
here dma_fence_release is triggered because the fence refcount drops to zero.



Did you find out why the zero refcount on the finished fence happens 
before the fence was signaled? The finished fence is created with 
refcount set to 1 in drm_sched_fence_create->dma_fence_init and then the 
refcount is decremented in 
drm_sched_main->amdgpu_job_free_cb->drm_sched_job_cleanup. This should 
only happen after the fence is already signaled (see 
drm_sched_get_cleanup_job). On top of that, the finished fence is 
referenced from other places (e.g. entity->last_scheduled, etc.).





->dma_fence_put
 ->dma_fence_release
->drm_sched_fence_release_scheduled
->call_rcu
here the union field "cb_list" of the finished fence is overwritten
because struct rcu_head contains two pointers,
the same layout as struct list_head cb_list.

Therefore, hold a reference to the finished fence at drm_sched_process_job
to prevent the NULL pointer dereference during the finished fence's dma_fence_signal.

[  732.912867] BUG: kernel NULL pointer dereference, address: 0008
[  732.914815] #PF: supervisor write access in kernel mode
[  732.915731] #PF: error_code(0x0002) - not-present page
[  732.916621] PGD 0 P4D 0
[  732.917072] Oops: 0002 [#1] SMP PTI
[  732.917682] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G   OE 
5.4.0-rc7 #1
[  732.918980] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[  732.920906] RIP: 0010:dma_fence_signal_locked+0x3e/0x100
[  732.938569] Call Trace:
[  732.939003]  
[  732.939364]  dma_fence_signal+0x29/0x50
[  732.940036]  drm_sched_fence_finished+0x12/0x20 [gpu_sched]
[  732.940996]  drm_sched_process_job+0x34/0xa0 [gpu_sched]
[  732.941910]  dma_fence_signal_locked+0x85/0x100
[  732.942692]  dma_fence_signal+0x29/0x50
[  732.943457]  amdgpu_fence_process+0x99/0x120 [amdgpu]
[  732.944393]  sdma_v4_0_process_trap_irq+0x81/0xa0 [amdgpu]

v2: hold the finished fence at drm_sched_process_job instead of
 amdgpu_fence_process
v3: resume the blank line

Signed-off-by: Yintian Tao 
---
  drivers/gpu/drm/scheduler/sched_main.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index a18eabf692e4..8e731ed0d9d9 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -651,7 +651,9 @@ static void drm_sched_process_job(struct dma_fence *f, 
struct dma_fence_cb *cb)
  
  	trace_drm_sched_process_job(s_fence);
  
+	dma_fence_get(&s_fence->finished);

drm_sched_fence_finished(s_fence);



If the fence was already released during call to 
drm_sched_fence_finished->dma_fence_signal->... why is it safe to 
reference the s_fence just before that call ? Can't it already be 
released by this time ?


Andrey




+   dma_fence_put(&s_fence->finished);
wake_up_interruptible(&sched->wake_up_worker);
  }
  



RE: [PATCH] drm/amdgpu: hold the reference of finished fence

2020-03-23 Thread Tao, Yintian
Hi  Christian

The errant fence is the finished fence, not the hw ring fence. Please see the 
call trace below.
[  732.920906] RIP: 0010:dma_fence_signal_locked+0x3e/0x100
[  732.939364]  dma_fence_signal+0x29/0x50  ===> drm sched finished fence
[  732.940036]  drm_sched_fence_finished+0x12/0x20 [gpu_sched]
[  732.940996]  drm_sched_process_job+0x34/0xa0 [gpu_sched]
[  732.941910]  dma_fence_signal_locked+0x85/0x100
[  732.942692]  dma_fence_signal+0x29/0x50  ===> hw fence
[  732.943457]  amdgpu_fence_process+0x99/0x120 [amdgpu]
[  732.944393]  sdma_v4_0_process_trap_irq+0x81/0xa0 [amdgpu]
[  732.945398]  amdgpu_irq_dispatch+0xaf/0x1d0 [amdgpu]
[  732.946317]  amdgpu_ih_process+0x8c/0x110 [amdgpu]
[  732.947206]  amdgpu_irq_handler+0x24/0xa0 [amdgpu]

Best Regards
Yintian Tao
-Original Message-
From: Koenig, Christian  
Sent: 23 March 2020 20:06
To: Tao, Yintian ; Deucher, Alexander 

Cc: amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: hold the reference of finished fence

I've just double checked and your analyses actually can't be correct.

When we call dma_fence_signal() in amdgpu_fence_process() we still have a 
reference to the fence.

See the code here:
>     r = dma_fence_signal(fence);
>     if (!r)
>         DMA_FENCE_TRACE(fence, "signaled from irq context\n");
>     else
>         BUG();
>
>     dma_fence_put(fence);

So I'm not sure how you ran into the crash in the first place, this is most 
likely something else.

Regards,
Christian.

Am 23.03.20 um 12:49 schrieb Yintian Tao:
> There is one corner case at dma_fence_signal_locked which will
> raise the NULL pointer problem just like below.
> ->dma_fence_signal
>  ->dma_fence_signal_locked
>   ->test_and_set_bit
> here dma_fence_release is triggered because the fence refcount drops to zero.
>
> ->dma_fence_put
>  ->dma_fence_release
>   ->drm_sched_fence_release_scheduled
>   ->call_rcu
> here the union field "cb_list" of the finished fence is overwritten
> because struct rcu_head contains two pointers,
> the same layout as struct list_head cb_list.
>
> Therefore, hold a reference to the finished fence at amdgpu_job_run
> to prevent the NULL pointer dereference during dma_fence_signal.
>
> [  732.912867] BUG: kernel NULL pointer dereference, address: 0008
> [  732.914815] #PF: supervisor write access in kernel mode
> [  732.915731] #PF: error_code(0x0002) - not-present page
> [  732.916621] PGD 0 P4D 0
> [  732.917072] Oops: 0002 [#1] SMP PTI
> [  732.917682] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G   OE 5.4.0-rc7 #1
> [  732.918980] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> [  732.920906] RIP: 0010:dma_fence_signal_locked+0x3e/0x100
> [  732.938569] Call Trace:
> [  732.939003]  
> [  732.939364]  dma_fence_signal+0x29/0x50
> [  732.940036]  drm_sched_fence_finished+0x12/0x20 [gpu_sched]
> [  732.940996]  drm_sched_process_job+0x34/0xa0 [gpu_sched]
> [  732.941910]  dma_fence_signal_locked+0x85/0x100
> [  732.942692]  dma_fence_signal+0x29/0x50
> [  732.943457]  amdgpu_fence_process+0x99/0x120 [amdgpu]
> [  732.944393]  sdma_v4_0_process_trap_irq+0x81/0xa0 [amdgpu]
>
> Signed-off-by: Yintian Tao 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 ++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |  2 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h  |  3 +++
>   3 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 7531527067df..03573eff660a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -52,7 +52,7 @@
>   
>   struct amdgpu_fence {
>   struct dma_fence base;
> -
> + struct dma_fence *finished;
>   /* RB, DMA, etc. */
>   struct amdgpu_ring  *ring;
>   };
> @@ -149,6 +149,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, 
> struct dma_fence **f,
>   
>   seq = ++ring->fence_drv.sync_seq;
>   fence->ring = ring;
> + fence->finished = NULL;
>   dma_fence_init(&fence->base, &amdgpu_fence_ops,
>  &ring->fence_drv.lock,
>  adev->fence_context + ring->idx,
> @@ -182,6 +183,21 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f,
>   return 0;
>   }
>   
> +void amdgpu_fence_get_finished(struct dma_fence *base,
> +struct dma_fence *finished)
> +{
> + struct amdgpu_fence *afence = to_amdgpu_fence(base);
> +
> + afence->finished = dma_fence_get(finished);

Re: [PATCH] drm/amdgpu: hold the reference of finished fence

2020-03-23 Thread Christian König

I've just double checked and your analyses actually can't be correct.

When we call dma_fence_signal() in amdgpu_fence_process() we still have 
a reference to the fence.


See the code here:

    r = dma_fence_signal(fence);
    if (!r)
        DMA_FENCE_TRACE(fence, "signaled from irq context\n");
    else
        BUG();

    dma_fence_put(fence);


So I'm not sure how you ran into the crash in the first place, this is 
most likely something else.


Regards,
Christian.

Am 23.03.20 um 12:49 schrieb Yintian Tao:

There is one corner case at dma_fence_signal_locked
which will raise the NULL pointer problem just like below.
->dma_fence_signal
 ->dma_fence_signal_locked
->test_and_set_bit
here dma_fence_release is triggered because the fence refcount drops to zero.

->dma_fence_put
 ->dma_fence_release
->drm_sched_fence_release_scheduled
->call_rcu
here the union field "cb_list" of the finished fence is overwritten
because struct rcu_head contains two pointers,
the same layout as struct list_head cb_list.

Therefore, hold a reference to the finished fence at amdgpu_job_run
to prevent the NULL pointer dereference during dma_fence_signal.

[  732.912867] BUG: kernel NULL pointer dereference, address: 0008
[  732.914815] #PF: supervisor write access in kernel mode
[  732.915731] #PF: error_code(0x0002) - not-present page
[  732.916621] PGD 0 P4D 0
[  732.917072] Oops: 0002 [#1] SMP PTI
[  732.917682] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G   OE 
5.4.0-rc7 #1
[  732.918980] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[  732.920906] RIP: 0010:dma_fence_signal_locked+0x3e/0x100
[  732.938569] Call Trace:
[  732.939003]  
[  732.939364]  dma_fence_signal+0x29/0x50
[  732.940036]  drm_sched_fence_finished+0x12/0x20 [gpu_sched]
[  732.940996]  drm_sched_process_job+0x34/0xa0 [gpu_sched]
[  732.941910]  dma_fence_signal_locked+0x85/0x100
[  732.942692]  dma_fence_signal+0x29/0x50
[  732.943457]  amdgpu_fence_process+0x99/0x120 [amdgpu]
[  732.944393]  sdma_v4_0_process_trap_irq+0x81/0xa0 [amdgpu]

Signed-off-by: Yintian Tao 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |  2 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h  |  3 +++
  3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 7531527067df..03573eff660a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -52,7 +52,7 @@
  
  struct amdgpu_fence {

struct dma_fence base;
-
+   struct dma_fence *finished;
/* RB, DMA, etc. */
struct amdgpu_ring  *ring;
  };
@@ -149,6 +149,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct 
dma_fence **f,
  
  	seq = ++ring->fence_drv.sync_seq;

fence->ring = ring;
+   fence->finished = NULL;
	dma_fence_init(&fence->base, &amdgpu_fence_ops,
		   &ring->fence_drv.lock,
		   adev->fence_context + ring->idx,
@@ -182,6 +183,21 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct 
dma_fence **f,
return 0;
  }
  
+void amdgpu_fence_get_finished(struct dma_fence *base,

+  struct dma_fence *finished)
+{
+   struct amdgpu_fence *afence = to_amdgpu_fence(base);
+
+   afence->finished = dma_fence_get(finished);
+}
+
+void amdgpu_fence_put_finished(struct dma_fence *base)
+{
+   struct amdgpu_fence *afence = to_amdgpu_fence(base);
+
+   dma_fence_put(afence->finished);
+}
+
  /**
   * amdgpu_fence_emit_polling - emit a fence on the requested ring
   *
@@ -276,6 +292,7 @@ bool amdgpu_fence_process(struct amdgpu_ring *ring)
BUG();
  
  		dma_fence_put(fence);

+   amdgpu_fence_put_finished(fence);
pm_runtime_mark_last_busy(adev->ddev->dev);
pm_runtime_put_autosuspend(adev->ddev->dev);
} while (last_seq != seq);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4981e443a884..deb2aeeadfb3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -229,6 +229,8 @@ static struct dma_fence *amdgpu_job_run(struct 
drm_sched_job *sched_job)
   );
if (r)
DRM_ERROR("Error scheduling IBs (%d)\n", r);
+   else
+   amdgpu_fence_get_finished(fence, finished);
}
/* if gpu reset, hw fence will be replaced here */
dma_fence_put(job->fence);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 448c76cbf3ed..fd4da91859aa 100644
---