On 2021-10-27 10:43 p.m., JingWen Chen wrote:
On 2021/10/28 3:43 AM, Andrey Grodzovsky wrote:
On 2021-10-25 10:57 p.m., JingWen Chen wrote:
On 2021/10/25 11:18 PM, Andrey Grodzovsky wrote:
On 2021-10-24 10:56 p.m., JingWen Chen wrote:
On 2021/10/23 4:41 AM, Andrey Grodzovsky wrote:
What do you mean by underflow in this case? You mean use-after-free because of an extra dma_fence_put()?
yes
Then maybe update the description, because 'underflow' is very confusing.

will do
On 2021-10-22 4:14 a.m., JingWen Chen wrote:
ping

On 2021/10/22 11:33 AM, Jingwen Chen wrote:
[Why]
In advanced TDR mode, the real bad job will be resubmitted twice, while
in drm_sched_resubmit_jobs_ext there's a dma_fence_put, so the bad job
is put one more time than other jobs.

[How]
Adding a dma_fence_get before resubmitting the job in
amdgpu_device_recheck_guilty_jobs and putting the fence for normal jobs

Signed-off-by: Jingwen Chen <jingwen.ch...@amd.com>
---
     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
     1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 41ce86244144..975f069f6fe8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
               /* clear job's guilty and depend the following step to decide the real one */
             drm_sched_reset_karma(s_job);
+        /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
+         * to make sure fence is balanced */
But that put in drm_sched_resubmit_jobs_ext is for the previous parent fence.
fence = sched->ops->run_job(s_job); returns a new HW fence, and the put drops
the refcount on the old one.

Andrey


Hi Andrey,

If I remember correctly, after we embedded the hw_fence into amdgpu_job, there
will be no fence replacement in amdgpu_job_run.
Right, I forgot that... What about removing line
https://elixir.bootlin.com/linux/v5.15-rc6/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L265 ?
What if you make the dma_fence_get unconditional instead?

Andrey


Hi Andrey,

I have tried this, and it causes normal jobs to never be freed (they lack a
dma_fence_put).

I can't see it - can you point me to where you get an unbalanced refcount in
that case? As far as I can see, for a normal job being run in
amdgpu_device_recheck_guilty_jobs, the refcount on the hw_fence is:

drm_sched_resubmit_jobs_ext->dma_fence_put -> refcount decreases by 1
drm_sched_resubmit_jobs_ext->amdgpu_job_run->dma_fence_get -> refcount increases by 1

In total, the refcount hasn't changed so far.

Next, dma_fence_wait_timeout completes successfully because the job is normal,
and then you delete that job from the pending list and call the free_job
callback, which drops the remaining refcounts on the hw_fence.

I am probably missing some dma_fence_get since you checked it on a device, but
I wonder where my mistake is?

Andrey


Hi Andrey,

The thing is, the put/get is balanced right now for normal jobs in TDR. Changing
this dma_fence_get to unconditional simply adds one dma_fence_get, but there's
no corresponding dma_fence_put for normal jobs.

And in case this is helpful, I have traced all the dma_fence_get/put calls for a
normal job in advanced TDR based on the latest drm-next:

amdgpu_fence_emit -> dma_fence_init    ref_count = 1
amdgpu_fence_emit -> add into rcu    ref_count = 2
amdgpu_job_run -> get after ib_schedule    ref_count = 3
drm_sched_main -> add fence callback get    ref_count = 4
drm_sched_main -> add fence callback put    ref_count = 3
drm_sched_resubmit_jobs_ext    ref_count = 2
amdgpu_fence_emit -> add into rcu    ref_count = 3


Now I see that the put in drm_sched_resubmit_jobs_ext drops the refcount taken by the previous 'amdgpu_fence_emit -> add into rcu' get... This is all very convoluted and confusing. It probably requires some rework to make the code clearer, but for now we need the bug fixed, so with the title changed
the patch is Reviewed-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>

Andrey



amdgpu_fence_process -> put after signal    ref_count = 2
drm_sched_main -> free_job    ref_count = 1
drm_sched_fence_release_finished    ref_count = 0

If we do unconditional get, this sequence will turn into:

amdgpu_fence_emit -> dma_fence_init    ref_count = 1
amdgpu_fence_emit -> add into rcu    ref_count = 2
amdgpu_job_run -> get after ib_schedule    ref_count = 3
drm_sched_main -> add fence callback get    ref_count = 4
drm_sched_main -> add fence callback put    ref_count = 3
drm_sched_resubmit_jobs_ext    ref_count = 2
amdgpu_fence_emit -> add into rcu    ref_count = 3
+  amdgpu_job_run -> get after ib_schedule    ref_count = 4
amdgpu_fence_process -> put after signal    ref_count = 3
drm_sched_main -> free_job    ref_count = 2
drm_sched_fence_release_finished    ref_count = 1
I have figured out all the get/put calls for sched_jobs, and only the bad job
lacks a dma_fence_get; other jobs are just fine.

+        dma_fence_get(s_job->s_fence->parent);
             drm_sched_resubmit_jobs_ext(&ring->sched, 1);
              ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
@@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
               /* got the hw fence, signal finished fence */
             atomic_dec(ring->sched.score);
+        dma_fence_put(s_job->s_fence->parent);
             dma_fence_get(&s_job->s_fence->finished);
             dma_fence_signal(&s_job->s_fence->finished);
             dma_fence_put(&s_job->s_fence->finished);
