RE: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails

2022-07-17 Thread Zhu, Jiadong
[AMD Official Use Only - General]

Hi Andrey,

Thank you for your clarifying.
The refcount modified by amdgpu_fence_emit on my side is different.
I update my code and get the patch 'drm/amdgpu: Follow up change to previous 
drm scheduler change.' , the " underflow " disappears.

My patch is not needed.


Thanks,
Jiadong


-Original Message-
From: Grodzovsky, Andrey 
Sent: Friday, July 15, 2022 11:43 PM
To: Zhu, Jiadong ; Christian König 
; amd-gfx@lists.freedesktop.org
Cc: Huang, Ray ; Liu, Aaron 
Subject: Re: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails


On 2022-07-15 05:28, Zhu, Jiadong wrote:
> [AMD Official Use Only - General]
>
> Updated some comments
>
> -Original Message-
> From: Zhu, Jiadong
> Sent: Friday, July 15, 2022 5:13 PM
> To: Christian König ;
> amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey
> 
> Cc: Huang, Ray ; Liu, Aaron 
> Subject: RE: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails
>
> Hi Christian,
>
> The resubmitted job in function amdgpu_ib_preempt_job_recovery returns the 
> same hw fence because of this commit:
>
> static void amdgpu_ib_preempt_job_recovery(struct drm_gpu_scheduler *sched) {
>  struct drm_sched_job *s_job;
>  struct dma_fence *fence;
>
>  spin_lock(>job_list_lock);
>  list_for_each_entry(s_job, >pending_list, list) {
>  fence = sched->ops->run_job(s_job);   //fence returned 
> has the same address with swapped fences
>  dma_fence_put(fence);
>  }
>  spin_unlock(>job_list_lock);
> }
>
>
>
> commit c530b02f39850a639b72d01ebbf7e5d745c60831
> Author: Jack Zhang 
> Date:   Wed May 12 15:06:35 2021 +0800
>
>  drm/amd/amdgpu embed hw_fence into amdgpu_job
>
>  Why: Previously hw fence is alloced separately with job.
>  It caused historical lifetime issues and corner cases.
>  The ideal situation is to take fence to manage both job
>  and fence's lifetime, and simplify the design of gpu-scheduler.
>
>  How:
>  We propose to embed hw_fence into amdgpu_job.
>  1. We cover the normal job submission by this method.
>  2. For ib_test, and submit without a parent job keep the
>  legacy way to create a hw fence separately.
>  v2:
>  use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
>  embedded in a job.
>  v3:
>  remove redundant variable ring in amdgpu_job
>  v4:
>  add tdr sequence support for this feature. Add a job_run_counter to
>  indicate whether this job is a resubmit job.
>  v5
>  add missing handling in amdgpu_fence_enable_signaling
>
>  Signed-off-by: Jingwen Chen 
>  Signed-off-by: Jack Zhang 
>  Reviewed-by: Andrey Grodzovsky 
>  Reviewed by: Monk Liu 
>  Signed-off-by: Alex Deucher 
>
>
> Thus the fence we swapped out is signaled and put twice in the following 2 
> functions and we get " refcount_t: underflow; use-after-free. " errors latter.
>
>  /* wait for jobs finished */
>  amdgpu_fence_wait_empty(ring); //wait on the resubmitted 
> fence which is signaled and put somewhere else. The refcount decreased by 1 
> after amdgpu_fence_wait_empty.
>
>  /* signal the old fences */
>  amdgpu_ib_preempt_signal_fences(fences, length);   //signal 
> and put the previous swapped fence, signal would return -22.
>
> Thanks,
> Jiadong


Did you have 'drm/amdgpu: Follow up change to previous drm scheduler change.' 
this commit in your branch while you encountered this problem ?
I don't see an underflow issue
for the preempted job when inspecting the code with this commit in mind -

amdgpu_fence_emit
 dma_fence_init 1
 dma_fence_get(fence) 2
 rcu_assign_pointer(*ptr, dma_fence_get(fence) 3

drm_sched_main
 s_fence->parent = dma_fence_get(fence); 4
 dma_fence_put(fence); 3

amdgpu_ib_preempt_job_recovery
 amdgpu_fence_emit
 if (job && job->job_run_counter) -> dma_fence_get(fence); 4
 rcu_assign_pointer(*ptr, dma_fence_get(fence)); 5

 dma_fence_put(fence); 4

amdgpu_fence_wait_empty
 dma_fence_get_rcu(fence) 5
 dma_fence_put(fence) 4

amdgpu_process_fence (EOP interrupt for re-submission of preempted job)
 dma_fence_put 3

amdgpu_ib_preempt_signal_fences
 dma_fence_put 2

amdgpu_job_free_cb
 dma_fence_put(>hw_fence) 1

drm_sched_fence_release_scheduled
 dma_fence_put(fence->parent); 0

Also take a look here for reference -
https://drive.google.com/file/d/1yEoeW6OQC9WnwmzFW6NBLhFP_jD0xcHm/view

Andrey





Andrey


>
>
> -Original Message-
> From: Christian König 
> 

Re: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails

2022-07-16 Thread Andrey Grodzovsky



On 2022-07-15 05:28, Zhu, Jiadong wrote:

[AMD Official Use Only - General]

Updated some comments

-Original Message-
From: Zhu, Jiadong
Sent: Friday, July 15, 2022 5:13 PM
To: Christian König ; 
amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 
Cc: Huang, Ray ; Liu, Aaron 
Subject: RE: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails

Hi Christian,

The resubmitted job in function amdgpu_ib_preempt_job_recovery returns the same 
hw fence because of this commit:

static void amdgpu_ib_preempt_job_recovery(struct drm_gpu_scheduler *sched) {
 struct drm_sched_job *s_job;
 struct dma_fence *fence;

 spin_lock(>job_list_lock);
 list_for_each_entry(s_job, >pending_list, list) {
 fence = sched->ops->run_job(s_job);   //fence returned has 
the same address with swapped fences
 dma_fence_put(fence);
 }
 spin_unlock(>job_list_lock);
}



commit c530b02f39850a639b72d01ebbf7e5d745c60831
Author: Jack Zhang 
Date:   Wed May 12 15:06:35 2021 +0800

 drm/amd/amdgpu embed hw_fence into amdgpu_job

 Why: Previously hw fence is alloced separately with job.
 It caused historical lifetime issues and corner cases.
 The ideal situation is to take fence to manage both job
 and fence's lifetime, and simplify the design of gpu-scheduler.

 How:
 We propose to embed hw_fence into amdgpu_job.
 1. We cover the normal job submission by this method.
 2. For ib_test, and submit without a parent job keep the
 legacy way to create a hw fence separately.
 v2:
 use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
 embedded in a job.
 v3:
 remove redundant variable ring in amdgpu_job
 v4:
 add tdr sequence support for this feature. Add a job_run_counter to
 indicate whether this job is a resubmit job.
 v5
 add missing handling in amdgpu_fence_enable_signaling

 Signed-off-by: Jingwen Chen 
 Signed-off-by: Jack Zhang 
 Reviewed-by: Andrey Grodzovsky 
 Reviewed by: Monk Liu 
 Signed-off-by: Alex Deucher 


Thus the fence we swapped out is signaled and put twice in the following 2 functions and 
we get " refcount_t: underflow; use-after-free. " errors latter.

 /* wait for jobs finished */
 amdgpu_fence_wait_empty(ring); //wait on the resubmitted fence 
which is signaled and put somewhere else. The refcount decreased by 1 after 
amdgpu_fence_wait_empty.

 /* signal the old fences */
 amdgpu_ib_preempt_signal_fences(fences, length);   //signal 
and put the previous swapped fence, signal would return -22.

Thanks,
Jiadong



Did you have 'drm/amdgpu: Follow up change to previous drm scheduler 
change.' this commit in your branch while you encountered this problem ? 
I don't see an underflow issue

for the preempted job when inspecting the code with this commit in mind -

amdgpu_fence_emit
    dma_fence_init 1
    dma_fence_get(fence) 2
    rcu_assign_pointer(*ptr, dma_fence_get(fence) 3

drm_sched_main
    s_fence->parent = dma_fence_get(fence); 4
    dma_fence_put(fence); 3

amdgpu_ib_preempt_job_recovery
    amdgpu_fence_emit
        if (job && job->job_run_counter) -> dma_fence_get(fence); 4
        rcu_assign_pointer(*ptr, dma_fence_get(fence)); 5

    dma_fence_put(fence); 4

amdgpu_fence_wait_empty
    dma_fence_get_rcu(fence) 5
    dma_fence_put(fence) 4

amdgpu_process_fence (EOP interrupt for re-submission of preempted job)
    dma_fence_put 3

amdgpu_ib_preempt_signal_fences
    dma_fence_put 2

amdgpu_job_free_cb
    dma_fence_put(>hw_fence) 1

drm_sched_fence_release_scheduled
    dma_fence_put(fence->parent); 0

Also take a look here for reference - 
https://drive.google.com/file/d/1yEoeW6OQC9WnwmzFW6NBLhFP_jD0xcHm/view


Andrey





Andrey





-Original Message-
From: Christian König 
Sent: Friday, July 15, 2022 4:48 PM
To: Zhu, Jiadong ; amd-gfx@lists.freedesktop.org; Grodzovsky, 
Andrey 
Cc: Huang, Ray ; Liu, Aaron 
Subject: Re: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails

[CAUTION: External Email]

Am 15.07.22 um 10:43 schrieb jiadong@amd.com:

From: "Jiadong.Zhu" 

Dma_fence_signal returning non-zero indicates that the fence is
signaled and put somewhere else.
Skip dma_fence_put to make the fence refcount correct.

Well quite a big NAK on this.

Reference counting should be completely independent where a fence signals.

Andrey can you take a look at this as well?

Thanks,
Christian.


Signed-off-by: Jiadong.Zhu 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
   1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index f4ed0785d523..93c1a5e83835 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd

RE: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails

2022-07-15 Thread Zhu, Jiadong
[AMD Official Use Only - General]

Updated some comments

-Original Message-
From: Zhu, Jiadong
Sent: Friday, July 15, 2022 5:13 PM
To: Christian König ; 
amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 
Cc: Huang, Ray ; Liu, Aaron 
Subject: RE: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails

Hi Christian,

The resubmitted job in function amdgpu_ib_preempt_job_recovery returns the same 
hw fence because of this commit:

static void amdgpu_ib_preempt_job_recovery(struct drm_gpu_scheduler *sched) {
struct drm_sched_job *s_job;
struct dma_fence *fence;

spin_lock(>job_list_lock);
list_for_each_entry(s_job, >pending_list, list) {
fence = sched->ops->run_job(s_job);   //fence returned has 
the same address with swapped fences
dma_fence_put(fence);
}
spin_unlock(>job_list_lock);
}



commit c530b02f39850a639b72d01ebbf7e5d745c60831
Author: Jack Zhang 
Date:   Wed May 12 15:06:35 2021 +0800

drm/amd/amdgpu embed hw_fence into amdgpu_job

Why: Previously hw fence is alloced separately with job.
It caused historical lifetime issues and corner cases.
The ideal situation is to take fence to manage both job
and fence's lifetime, and simplify the design of gpu-scheduler.

How:
We propose to embed hw_fence into amdgpu_job.
1. We cover the normal job submission by this method.
2. For ib_test, and submit without a parent job keep the
legacy way to create a hw fence separately.
v2:
use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
embedded in a job.
v3:
remove redundant variable ring in amdgpu_job
v4:
add tdr sequence support for this feature. Add a job_run_counter to
indicate whether this job is a resubmit job.
v5
add missing handling in amdgpu_fence_enable_signaling

Signed-off-by: Jingwen Chen 
Signed-off-by: Jack Zhang 
Reviewed-by: Andrey Grodzovsky 
Reviewed by: Monk Liu 
Signed-off-by: Alex Deucher 


Thus the fence we swapped out is signaled and put twice in the following 2 
functions and we get " refcount_t: underflow; use-after-free. " errors latter.

/* wait for jobs finished */
amdgpu_fence_wait_empty(ring); //wait on the resubmitted fence 
which is signaled and put somewhere else. The refcount decreased by 1 after 
amdgpu_fence_wait_empty.

/* signal the old fences */
amdgpu_ib_preempt_signal_fences(fences, length);   //signal and 
put the previous swapped fence, signal would return -22.

Thanks,
Jiadong


-Original Message-
From: Christian König 
Sent: Friday, July 15, 2022 4:48 PM
To: Zhu, Jiadong ; amd-gfx@lists.freedesktop.org; 
Grodzovsky, Andrey 
Cc: Huang, Ray ; Liu, Aaron 
Subject: Re: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails

[CAUTION: External Email]

Am 15.07.22 um 10:43 schrieb jiadong@amd.com:
> From: "Jiadong.Zhu" 
>
> Dma_fence_signal returning non-zero indicates that the fence is
> signaled and put somewhere else.
> Skip dma_fence_put to make the fence refcount correct.

Well quite a big NAK on this.

Reference counting should be completely independent where a fence signals.

Andrey can you take a look at this as well?

Thanks,
Christian.

>
> Signed-off-by: Jiadong.Zhu 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index f4ed0785d523..93c1a5e83835 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1500,8 +1500,8 @@ static void amdgpu_ib_preempt_signal_fences(struct 
> dma_fence **fences,
>   fence = fences[i];
>   if (!fence)
>   continue;
> - dma_fence_signal(fence);
> - dma_fence_put(fence);
> + if (!dma_fence_signal(fence))
> + dma_fence_put(fence);
>   }
>   }
>



RE: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails

2022-07-15 Thread Zhu, Jiadong
[AMD Official Use Only - General]

Hi Christian,

The resubmitted job in function amdgpu_ib_preempt_job_recovery returns the same 
hw fence because of this commit:

static void amdgpu_ib_preempt_job_recovery(struct drm_gpu_scheduler *sched)
{
struct drm_sched_job *s_job;
struct dma_fence *fence;

spin_lock(>job_list_lock);
list_for_each_entry(s_job, >pending_list, list) {
fence = sched->ops->run_job(s_job);   //fence returned has 
the same address with swapped fences
dma_fence_put(fence);
}
spin_unlock(>job_list_lock);
}



commit c530b02f39850a639b72d01ebbf7e5d745c60831
Author: Jack Zhang 
Date:   Wed May 12 15:06:35 2021 +0800

drm/amd/amdgpu embed hw_fence into amdgpu_job

Why: Previously hw fence is alloced separately with job.
It caused historical lifetime issues and corner cases.
The ideal situation is to take fence to manage both job
and fence's lifetime, and simplify the design of gpu-scheduler.

How:
We propose to embed hw_fence into amdgpu_job.
1. We cover the normal job submission by this method.
2. For ib_test, and submit without a parent job keep the
legacy way to create a hw fence separately.
v2:
use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
embedded in a job.
v3:
remove redundant variable ring in amdgpu_job
v4:
add tdr sequence support for this feature. Add a job_run_counter to
indicate whether this job is a resubmit job.
v5
add missing handling in amdgpu_fence_enable_signaling

Signed-off-by: Jingwen Chen 
Signed-off-by: Jack Zhang 
Reviewed-by: Andrey Grodzovsky 
Reviewed by: Monk Liu 
Signed-off-by: Alex Deucher 


Thus the fence we swapped out is signaled and put twice in the following 2 
functions and we get " refcount_t: underflow; use-after-free. " errors latter.

/* wait for jobs finished */
amdgpu_fence_wait_empty(ring); //signal and put the resubmitted 
jobs fence.

/* signal the old fences */
amdgpu_ib_preempt_signal_fences(fences, length);   //signal and 
put the previous swapped fence, signal would return -22.

Thanks,
Jiadong


-Original Message-
From: Christian König 
Sent: Friday, July 15, 2022 4:48 PM
To: Zhu, Jiadong ; amd-gfx@lists.freedesktop.org; 
Grodzovsky, Andrey 
Cc: Huang, Ray ; Liu, Aaron 
Subject: Re: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails

[CAUTION: External Email]

Am 15.07.22 um 10:43 schrieb jiadong@amd.com:
> From: "Jiadong.Zhu" 
>
> Dma_fence_signal returning non-zero indicates that the fence is
> signaled and put somewhere else.
> Skip dma_fence_put to make the fence refcount correct.

Well quite a big NAK on this.

Reference counting should be completely independent where a fence signals.

Andrey can you take a look at this as well?

Thanks,
Christian.

>
> Signed-off-by: Jiadong.Zhu 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index f4ed0785d523..93c1a5e83835 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1500,8 +1500,8 @@ static void amdgpu_ib_preempt_signal_fences(struct 
> dma_fence **fences,
>   fence = fences[i];
>   if (!fence)
>   continue;
> - dma_fence_signal(fence);
> - dma_fence_put(fence);
> + if (!dma_fence_signal(fence))
> + dma_fence_put(fence);
>   }
>   }
>



Re: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails

2022-07-15 Thread Christian König

Am 15.07.22 um 10:43 schrieb jiadong@amd.com:

From: "Jiadong.Zhu" 

Dma_fence_signal returning non-zero indicates
that the fence is signaled and put somewhere else.
Skip dma_fence_put to make the fence refcount correct.


Well quite a big NAK on this.

Reference counting should be completely independent where a fence signals.

Andrey can you take a look at this as well?

Thanks,
Christian.



Signed-off-by: Jiadong.Zhu 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index f4ed0785d523..93c1a5e83835 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1500,8 +1500,8 @@ static void amdgpu_ib_preempt_signal_fences(struct 
dma_fence **fences,
fence = fences[i];
if (!fence)
continue;
-   dma_fence_signal(fence);
-   dma_fence_put(fence);
+   if (!dma_fence_signal(fence))
+   dma_fence_put(fence);
}
  }