Re: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

2021-03-11 Thread Christian König
Alternatively changing all callers to give MAX_INT as parameter when 
they don't care is the preferred variant, but a bit more work.


Christian.

Am 11.03.21 um 04:24 schrieb Grodzovsky, Andrey:
You can just rename drm_sched_resubmit_jobs to 
drm_sched_resubmit_jobs_imp and create a wrapper 
functiondrm_sched_resubmit_jobs which will call that function with 
MAX_INT and so no need for code change in other driver will be needed. 
For amdgpu you call directly drm_sched_resubmit_jobs_imp with 1 and 
MAX_INT.


Andrey


*From:* Zhang, Jack (Jian) 
*Sent:* 10 March 2021 22:05
*To:* Grodzovsky, Andrey ; 
amd-gfx@lists.freedesktop.org ; Koenig, 
Christian ; Liu, Monk ; 
Deng, Emily 

*Subject:* RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode
[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey,

Thank you for your review.

>> This is identical to drm_sched_resubmit_jobs in all but the 'int 
max' handling, can't you reuse drm_sched_resubmit_jobs by passing max 
argument==1 for the find guilty run and then, for the later normal run 
passing max==MAX_INT to avoid break the iteration to early ?


Due to C language doesn't support function overloading, we couldn’t 
define function like the following(compiler error):
void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched, int max 
= INT_MAX);


We have to rewrite every vender's  code where call the 
drm_sched_resubmit_jobs if we defined the function like this.

void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched, int max);

Meanwhile, I also explored the Variable Arguments in C, But it still 
cannot meet the needs very well.


Do you know some other methods that could help us to make this happen?
Otherwise, I'm afraid we have to define a separate ways as patch V6 did.

Thanks,
Jack

-Original Message-
From: Grodzovsky, Andrey 
Sent: Thursday, March 11, 2021 12:19 AM
To: Zhang, Jack (Jian) ; 
amd-gfx@lists.freedesktop.org; Koenig, Christian 
; Liu, Monk ; Deng, Emily 


Subject: Re: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode



On 2021-03-10 8:33 a.m., Jack Zhang wrote:
> [Why]
> Previous tdr design treats the first job in job_timeout as the bad job.
> But sometimes a later bad compute job can block a good gfx job and
> cause an unexpected gfx job timeout because gfx and compute ring share
> internal GC HW mutually.
>
> [How]
> This patch implements an advanced tdr mode.It involves an additinal
> synchronous pre-resubmit step(Step0 Resubmit) before normal resubmit
> step in order to find the real bad job.
>
> 1. At Step0 Resubmit stage, it synchronously submits and pends for the
> first job being signaled. If it gets timeout, we identify it as guilty
> and do hw reset. After that, we would do the normal resubmit step to
> resubmit left jobs.
>
> 2. Re-insert Bailing job to mirror_list, and leave it to be handled by
> the main reset thread.
>
> 3. For whole gpu reset(vram lost), do resubmit as the old way.
>
> Signed-off-by: Jack Zhang 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 72 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  2 +-
>   drivers/gpu/drm/scheduler/sched_main.c | 75 ++
>   include/drm/gpu_scheduler.h    |  2 +
>   4 files changed, 148 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e247c3a2ec08..dac0a242e5f5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4639,6 +4639,7 @@ int amdgpu_device_gpu_recover(struct 
amdgpu_device *adev,

>   int i, r = 0;
>   bool need_emergency_restart = false;
>   bool audio_suspended = false;
> +int tmp_vram_lost_counter;
>
>   /*
>    * Special case: RAS triggered and full reset isn't supported @@
> -4689,9 +4690,14 @@ int amdgpu_device_gpu_recover(struct 
amdgpu_device *adev,
>   dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another 
already in progress",

>   job ? job->base.id : -1);
>
> -/* even we skipped this reset, still need to set the job to guilty */
> -if (job)
> +if (job) {
> +/* re-insert node to avoid memory leak */
> +spin_lock(>base.sched->job_list_lock);
> +list_add(>base.node, >base.sched->pending_list);
> +spin_unlock(>base.sched->job_list_lock);
> +/* even we skipped this reset, still need to set the job to guilty
> +*/
>   drm_sched_increase_karma(>base);
> +}

I think this piece should be in a standalone patch as it comes to fix 
a bug of leaking jobs and not directly related to find actual guilty 
job. Also, this is not the only place where the problem arises - it 
also arises in other drivers whe

RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

2021-03-10 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey,

I just submitted the V7 patch in which I wrapped the increase_karma func and 
resubmit_job func.

Thanks,
Jack
From: amd-gfx  On Behalf Of Zhang, Jack 
(Jian)
Sent: Thursday, March 11, 2021 11:28 AM
To: Grodzovsky, Andrey ; 
amd-gfx@lists.freedesktop.org; Koenig, Christian ; 
Liu, Monk ; Deng, Emily 
Subject: RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode


[AMD Official Use Only - Internal Distribution Only]


[AMD Official Use Only - Internal Distribution Only]

Ok, that will do the trick. Thank you!

Jack
From: Grodzovsky, Andrey 
mailto:andrey.grodzov...@amd.com>>
Sent: Thursday, March 11, 2021 11:24 AM
To: Zhang, Jack (Jian) mailto:jack.zha...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Koenig, 
Christian mailto:christian.koe...@amd.com>>; Liu, 
Monk mailto:monk@amd.com>>; Deng, Emily 
mailto:emily.d...@amd.com>>
Subject: Re: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

You can just rename drm_sched_resubmit_jobs to drm_sched_resubmit_jobs_imp and 
create a wrapper function drm_sched_resubmit_jobs which will call that function 
with MAX_INT and so no need for code change in other driver will be needed. For 
amdgpu you call directly drm_sched_resubmit_jobs_imp with 1 and MAX_INT.

Andrey


From: Zhang, Jack (Jian) mailto:jack.zha...@amd.com>>
Sent: 10 March 2021 22:05
To: Grodzovsky, Andrey 
mailto:andrey.grodzov...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> 
mailto:amd-gfx@lists.freedesktop.org>>; Koenig, 
Christian mailto:christian.koe...@amd.com>>; Liu, 
Monk mailto:monk@amd.com>>; Deng, Emily 
mailto:emily.d...@amd.com>>
Subject: RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey,

Thank you for your review.

>> This is identical to drm_sched_resubmit_jobs in all but the 'int max' 
>> handling, can't you reuse drm_sched_resubmit_jobs by passing max argument==1 
>> for the find guilty run and then, for the later normal run passing 
>> max==MAX_INT to avoid break the iteration to early ?

Due to C language doesn't support function overloading, we couldn’t define 
function like the following(compiler error):
void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched, int max = 
INT_MAX);

We have to rewrite every vender's  code where call the drm_sched_resubmit_jobs 
if we defined the function like this.
void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched, int max);

Meanwhile, I also explored the Variable Arguments in C, But it still cannot 
meet the needs very well.

Do you know some other methods that could help us to make this happen?
Otherwise, I'm afraid we have to define a separate ways as patch V6 did.

Thanks,
Jack

-Original Message-
From: Grodzovsky, Andrey 
mailto:andrey.grodzov...@amd.com>>
Sent: Thursday, March 11, 2021 12:19 AM
To: Zhang, Jack (Jian) mailto:jack.zha...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Koenig, 
Christian mailto:christian.koe...@amd.com>>; Liu, 
Monk mailto:monk@amd.com>>; Deng, Emily 
mailto:emily.d...@amd.com>>
Subject: Re: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode



On 2021-03-10 8:33 a.m., Jack Zhang wrote:
> [Why]
> Previous tdr design treats the first job in job_timeout as the bad job.
> But sometimes a later bad compute job can block a good gfx job and
> cause an unexpected gfx job timeout because gfx and compute ring share
> internal GC HW mutually.
>
> [How]
> This patch implements an advanced tdr mode.It involves an additinal
> synchronous pre-resubmit step(Step0 Resubmit) before normal resubmit
> step in order to find the real bad job.
>
> 1. At Step0 Resubmit stage, it synchronously submits and pends for the
> first job being signaled. If it gets timeout, we identify it as guilty
> and do hw reset. After that, we would do the normal resubmit step to
> resubmit left jobs.
>
> 2. Re-insert Bailing job to mirror_list, and leave it to be handled by
> the main reset thread.
>
> 3. For whole gpu reset(vram lost), do resubmit as the old way.
>
> Signed-off-by: Jack Zhang mailto:jack.zha...@amd.com>>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 72 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
>   drivers/gpu/drm/scheduler/sched_main.c | 75 ++
>   include/drm/gpu_scheduler.h|  2 +
>   4 files changed, 148 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e247c3a2ec08..dac0a242e5f5 100644
> --- a/dri

RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

2021-03-10 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Ok, that will do the trick. Thank you!

Jack
From: Grodzovsky, Andrey 
Sent: Thursday, March 11, 2021 11:24 AM
To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; 
Koenig, Christian ; Liu, Monk ; 
Deng, Emily 
Subject: Re: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

You can just rename drm_sched_resubmit_jobs to drm_sched_resubmit_jobs_imp and 
create a wrapper function drm_sched_resubmit_jobs which will call that function 
with MAX_INT and so no need for code change in other driver will be needed. For 
amdgpu you call directly drm_sched_resubmit_jobs_imp with 1 and MAX_INT.

Andrey


From: Zhang, Jack (Jian) mailto:jack.zha...@amd.com>>
Sent: 10 March 2021 22:05
To: Grodzovsky, Andrey 
mailto:andrey.grodzov...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> 
mailto:amd-gfx@lists.freedesktop.org>>; Koenig, 
Christian mailto:christian.koe...@amd.com>>; Liu, 
Monk mailto:monk@amd.com>>; Deng, Emily 
mailto:emily.d...@amd.com>>
Subject: RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey,

Thank you for your review.

>> This is identical to drm_sched_resubmit_jobs in all but the 'int max' 
>> handling, can't you reuse drm_sched_resubmit_jobs by passing max argument==1 
>> for the find guilty run and then, for the later normal run passing 
>> max==MAX_INT to avoid break the iteration to early ?

Due to C language doesn't support function overloading, we couldn’t define 
function like the following(compiler error):
void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched, int max = 
INT_MAX);

We have to rewrite every vender's  code where call the drm_sched_resubmit_jobs 
if we defined the function like this.
void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched, int max);

Meanwhile, I also explored the Variable Arguments in C, But it still cannot 
meet the needs very well.

Do you know some other methods that could help us to make this happen?
Otherwise, I'm afraid we have to define a separate ways as patch V6 did.

Thanks,
Jack

-Original Message-
From: Grodzovsky, Andrey 
mailto:andrey.grodzov...@amd.com>>
Sent: Thursday, March 11, 2021 12:19 AM
To: Zhang, Jack (Jian) mailto:jack.zha...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Koenig, 
Christian mailto:christian.koe...@amd.com>>; Liu, 
Monk mailto:monk@amd.com>>; Deng, Emily 
mailto:emily.d...@amd.com>>
Subject: Re: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode



On 2021-03-10 8:33 a.m., Jack Zhang wrote:
> [Why]
> Previous tdr design treats the first job in job_timeout as the bad job.
> But sometimes a later bad compute job can block a good gfx job and
> cause an unexpected gfx job timeout because gfx and compute ring share
> internal GC HW mutually.
>
> [How]
> This patch implements an advanced tdr mode.It involves an additinal
> synchronous pre-resubmit step(Step0 Resubmit) before normal resubmit
> step in order to find the real bad job.
>
> 1. At Step0 Resubmit stage, it synchronously submits and pends for the
> first job being signaled. If it gets timeout, we identify it as guilty
> and do hw reset. After that, we would do the normal resubmit step to
> resubmit left jobs.
>
> 2. Re-insert Bailing job to mirror_list, and leave it to be handled by
> the main reset thread.
>
> 3. For whole gpu reset(vram lost), do resubmit as the old way.
>
> Signed-off-by: Jack Zhang mailto:jack.zha...@amd.com>>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 72 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
>   drivers/gpu/drm/scheduler/sched_main.c | 75 ++
>   include/drm/gpu_scheduler.h|  2 +
>   4 files changed, 148 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e247c3a2ec08..dac0a242e5f5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4639,6 +4639,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>   int i, r = 0;
>   bool need_emergency_restart = false;
>   bool audio_suspended = false;
> +int tmp_vram_lost_counter;
>
>   /*
>* Special case: RAS triggered and full reset isn't supported @@
> -4689,9 +4690,14 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another already in 
> progress",
>   job ? job->base.id : -1);
>
> -/* even we skipped this reset, still need to set the job to guilty 

Re: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

2021-03-10 Thread Grodzovsky, Andrey
You can just rename drm_sched_resubmit_jobs to drm_sched_resubmit_jobs_imp and 
create a wrapper function drm_sched_resubmit_jobs which will call that function 
with MAX_INT and so no need for code change in other driver will be needed. For 
amdgpu you call directly drm_sched_resubmit_jobs_imp with 1 and MAX_INT.

Andrey


From: Zhang, Jack (Jian) 
Sent: 10 March 2021 22:05
To: Grodzovsky, Andrey ; 
amd-gfx@lists.freedesktop.org ; Koenig, 
Christian ; Liu, Monk ; Deng, Emily 

Subject: RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey,

Thank you for your review.

>> This is identical to drm_sched_resubmit_jobs in all but the 'int max' 
>> handling, can't you reuse drm_sched_resubmit_jobs by passing max argument==1 
>> for the find guilty run and then, for the later normal run passing 
>> max==MAX_INT to avoid break the iteration to early ?

Due to C language doesn't support function overloading, we couldn’t define 
function like the following(compiler error):
void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched, int max = 
INT_MAX);

We have to rewrite every vender's  code where call the drm_sched_resubmit_jobs 
if we defined the function like this.
void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched, int max);

Meanwhile, I also explored the Variable Arguments in C, But it still cannot 
meet the needs very well.

Do you know some other methods that could help us to make this happen?
Otherwise, I'm afraid we have to define a separate ways as patch V6 did.

Thanks,
Jack

-Original Message-
From: Grodzovsky, Andrey 
Sent: Thursday, March 11, 2021 12:19 AM
To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; 
Koenig, Christian ; Liu, Monk ; 
Deng, Emily 
Subject: Re: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode



On 2021-03-10 8:33 a.m., Jack Zhang wrote:
> [Why]
> Previous tdr design treats the first job in job_timeout as the bad job.
> But sometimes a later bad compute job can block a good gfx job and
> cause an unexpected gfx job timeout because gfx and compute ring share
> internal GC HW mutually.
>
> [How]
> This patch implements an advanced tdr mode.It involves an additinal
> synchronous pre-resubmit step(Step0 Resubmit) before normal resubmit
> step in order to find the real bad job.
>
> 1. At Step0 Resubmit stage, it synchronously submits and pends for the
> first job being signaled. If it gets timeout, we identify it as guilty
> and do hw reset. After that, we would do the normal resubmit step to
> resubmit left jobs.
>
> 2. Re-insert Bailing job to mirror_list, and leave it to be handled by
> the main reset thread.
>
> 3. For whole gpu reset(vram lost), do resubmit as the old way.
>
> Signed-off-by: Jack Zhang 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 72 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
>   drivers/gpu/drm/scheduler/sched_main.c | 75 ++
>   include/drm/gpu_scheduler.h|  2 +
>   4 files changed, 148 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e247c3a2ec08..dac0a242e5f5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4639,6 +4639,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>   int i, r = 0;
>   bool need_emergency_restart = false;
>   bool audio_suspended = false;
> +int tmp_vram_lost_counter;
>
>   /*
>* Special case: RAS triggered and full reset isn't supported @@
> -4689,9 +4690,14 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another already in 
> progress",
>   job ? job->base.id : -1);
>
> -/* even we skipped this reset, still need to set the job to guilty */
> -if (job)
> +if (job) {
> +/* re-insert node to avoid memory leak */
> +spin_lock(>base.sched->job_list_lock);
> +list_add(>base.node, >base.sched->pending_list);
> +spin_unlock(>base.sched->job_list_lock);
> +/* even we skipped this reset, still need to set the job to guilty
> +*/
>   drm_sched_increase_karma(>base);
> +}

I think this piece should be in a standalone patch as it comes to fix a bug of 
leaking jobs and not directly related to find actual guilty job. Also, this is 
not the only place where the problem arises - it also arises in other drivers 
where they check that guilty job's fence already signaled and bail out before 
reinsertion of bad job to mirror list. Seems to me better fix would be to 
handle this within scheduler code by adding specific return value to 
drm_sched_backend_op

RE: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

2021-03-10 Thread Zhang, Jack (Jian)
[AMD Official Use Only - Internal Distribution Only]

Hi, Andrey,

Thank you for your review.

>> This is identical to drm_sched_resubmit_jobs in all but the 'int max' 
>> handling, can't you reuse drm_sched_resubmit_jobs by passing max argument==1 
>> for the find guilty run and then, for the later normal run passing 
>> max==MAX_INT to avoid break the iteration to early ?

Due to C language doesn't support function overloading, we couldn’t define 
function like the following(compiler error):
void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched, int max = 
INT_MAX);

We have to rewrite every vender's  code where call the drm_sched_resubmit_jobs 
if we defined the function like this.
void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched, int max);

Meanwhile, I also explored the Variable Arguments in C, But it still cannot 
meet the needs very well.

Do you know some other methods that could help us to make this happen?
Otherwise, I'm afraid we have to define a separate ways as patch V6 did.

Thanks,
Jack

-Original Message-
From: Grodzovsky, Andrey 
Sent: Thursday, March 11, 2021 12:19 AM
To: Zhang, Jack (Jian) ; amd-gfx@lists.freedesktop.org; 
Koenig, Christian ; Liu, Monk ; 
Deng, Emily 
Subject: Re: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode



On 2021-03-10 8:33 a.m., Jack Zhang wrote:
> [Why]
> Previous tdr design treats the first job in job_timeout as the bad job.
> But sometimes a later bad compute job can block a good gfx job and
> cause an unexpected gfx job timeout because gfx and compute ring share
> internal GC HW mutually.
>
> [How]
> This patch implements an advanced tdr mode.It involves an additinal
> synchronous pre-resubmit step(Step0 Resubmit) before normal resubmit
> step in order to find the real bad job.
>
> 1. At Step0 Resubmit stage, it synchronously submits and pends for the
> first job being signaled. If it gets timeout, we identify it as guilty
> and do hw reset. After that, we would do the normal resubmit step to
> resubmit left jobs.
>
> 2. Re-insert Bailing job to mirror_list, and leave it to be handled by
> the main reset thread.
>
> 3. For whole gpu reset(vram lost), do resubmit as the old way.
>
> Signed-off-by: Jack Zhang 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 72 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
>   drivers/gpu/drm/scheduler/sched_main.c | 75 ++
>   include/drm/gpu_scheduler.h|  2 +
>   4 files changed, 148 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e247c3a2ec08..dac0a242e5f5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4639,6 +4639,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>   int i, r = 0;
>   bool need_emergency_restart = false;
>   bool audio_suspended = false;
> +int tmp_vram_lost_counter;
>
>   /*
>* Special case: RAS triggered and full reset isn't supported @@
> -4689,9 +4690,14 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another already in 
> progress",
>   job ? job->base.id : -1);
>
> -/* even we skipped this reset, still need to set the job to guilty */
> -if (job)
> +if (job) {
> +/* re-insert node to avoid memory leak */
> +spin_lock(>base.sched->job_list_lock);
> +list_add(>base.node, >base.sched->pending_list);
> +spin_unlock(>base.sched->job_list_lock);
> +/* even we skipped this reset, still need to set the job to guilty
> +*/
>   drm_sched_increase_karma(>base);
> +}

I think this piece should be in a standalone patch as it comes to fix a bug of 
leaking jobs and not directly related to find actual guilty job. Also, this is 
not the only place where the problem arises - it also arises in other drivers 
where they check that guilty job's fence already signaled and bail out before 
reinsertion of bad job to mirror list. Seems to me better fix would be to 
handle this within scheduler code by adding specific return value to 
drm_sched_backend_ops.timedout_job marking that the code terminated early - 
before calling drm_sched_stop and so drm_sched_job_timedout would know this and 
then reinsert the job back to mirror_list itself.

>   goto skip_recovery;
>   }
>
> @@ -4788,6 +4794,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>   }
>   }
>
> +tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
>   /* Actual ASIC resets if needed.*/
>   /* TODO Implement XGMI hive reset logic for SRIOV */
>   if (amdgpu_sriov_vf(adev)) {
> @@ -

Re: [PATCH v6] drm/amd/amdgpu implement tdr advanced mode

2021-03-10 Thread Andrey Grodzovsky




On 2021-03-10 8:33 a.m., Jack Zhang wrote:

[Why]
Previous tdr design treats the first job in job_timeout as the bad job.
But sometimes a later bad compute job can block a good gfx job and
cause an unexpected gfx job timeout because gfx and compute ring share
internal GC HW mutually.

[How]
This patch implements an advanced tdr mode.It involves an additinal
synchronous pre-resubmit step(Step0 Resubmit) before normal resubmit
step in order to find the real bad job.

1. At Step0 Resubmit stage, it synchronously submits and pends for the
first job being signaled. If it gets timeout, we identify it as guilty
and do hw reset. After that, we would do the normal resubmit step to
resubmit left jobs.

2. Re-insert Bailing job to mirror_list, and leave it to be handled by
the main reset thread.

3. For whole gpu reset(vram lost), do resubmit as the old way.

Signed-off-by: Jack Zhang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 72 -
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
  drivers/gpu/drm/scheduler/sched_main.c | 75 ++
  include/drm/gpu_scheduler.h|  2 +
  4 files changed, 148 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e247c3a2ec08..dac0a242e5f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4639,6 +4639,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
int i, r = 0;
bool need_emergency_restart = false;
bool audio_suspended = false;
+   int tmp_vram_lost_counter;
  
  	/*

 * Special case: RAS triggered and full reset isn't supported
@@ -4689,9 +4690,14 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another 
already in progress",
job ? job->base.id : -1);
  
-		/* even we skipped this reset, still need to set the job to guilty */

-   if (job)
+   if (job) {
+   /* re-insert node to avoid memory leak */
+   spin_lock(>base.sched->job_list_lock);
+   list_add(>base.node, 
>base.sched->pending_list);
+   spin_unlock(>base.sched->job_list_lock);
+   /* even we skipped this reset, still need to set the 
job to guilty */
drm_sched_increase_karma(>base);
+   }


I think this piece should be in a standalone
patch as it comes to fix a bug of leaking
jobs and not directly related to find actual
guilty job. Also, this is not the only place
where the problem arises - it also arises
in other drivers where they check that guilty
job's fence already signaled and bail out before
reinsertion of bad job to mirror list. Seems to me
better fix would be to handle this within scheduler
code by adding specific return value to
drm_sched_backend_ops.timedout_job marking that the code
terminated early - before calling drm_sched_stop and so 
drm_sched_job_timedout

would know this and then reinsert the job back to mirror_list
itself.


goto skip_recovery;
}
  
@@ -4788,6 +4794,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,

}
}
  
+	tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));

/* Actual ASIC resets if needed.*/
/* TODO Implement XGMI hive reset logic for SRIOV */
if (amdgpu_sriov_vf(adev)) {
@@ -4805,6 +4812,67 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
/* Post ASIC reset for all devs .*/
list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
  
+		if (amdgpu_gpu_recovery == 2 &&

+   !(tmp_vram_lost_counter < 
atomic_read(>vram_lost_counter))) {
+
+   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+   struct amdgpu_ring *ring = tmp_adev->rings[i];
+   int ret = 0;
+   struct drm_sched_job *s_job;
+
+   if (!ring || !ring->sched.thread)
+   continue;
+
+   /* No point to resubmit jobs if we didn't HW 
reset*/
+   if (!tmp_adev->asic_reset_res && !job_signaled) 
{
+
+   s_job = 
list_first_entry_or_null(>sched.pending_list, struct drm_sched_job, list);
+   if (s_job == NULL)
+   continue;
+
+   /* clear job's guilty and depend the 
folowing step to decide the real one */
+   drm_sched_reset_karma(s_job);
+   drm_sched_resubmit_jobs2(>sched, 
1);
+   

[PATCH v6] drm/amd/amdgpu implement tdr advanced mode

2021-03-10 Thread Jack Zhang
[Why]
Previous tdr design treats the first job in job_timeout as the bad job.
But sometimes a later bad compute job can block a good gfx job and
cause an unexpected gfx job timeout because gfx and compute ring share
internal GC HW mutually.

[How]
This patch implements an advanced tdr mode.It involves an additinal
synchronous pre-resubmit step(Step0 Resubmit) before normal resubmit
step in order to find the real bad job.

1. At Step0 Resubmit stage, it synchronously submits and pends for the
first job being signaled. If it gets timeout, we identify it as guilty
and do hw reset. After that, we would do the normal resubmit step to
resubmit left jobs.

2. Re-insert Bailing job to mirror_list, and leave it to be handled by
the main reset thread.

3. For whole gpu reset(vram lost), do resubmit as the old way.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 72 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
 drivers/gpu/drm/scheduler/sched_main.c | 75 ++
 include/drm/gpu_scheduler.h|  2 +
 4 files changed, 148 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e247c3a2ec08..dac0a242e5f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4639,6 +4639,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
int i, r = 0;
bool need_emergency_restart = false;
bool audio_suspended = false;
+   int tmp_vram_lost_counter;
 
/*
 * Special case: RAS triggered and full reset isn't supported
@@ -4689,9 +4690,14 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another 
already in progress",
job ? job->base.id : -1);
 
-   /* even we skipped this reset, still need to set the job to 
guilty */
-   if (job)
+   if (job) {
+   /* re-insert node to avoid memory leak */
+   spin_lock(>base.sched->job_list_lock);
+   list_add(>base.node, 
>base.sched->pending_list);
+   spin_unlock(>base.sched->job_list_lock);
+   /* even we skipped this reset, still need to set the 
job to guilty */
drm_sched_increase_karma(>base);
+   }
goto skip_recovery;
}
 
@@ -4788,6 +4794,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
}
}
 
+   tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
/* Actual ASIC resets if needed.*/
/* TODO Implement XGMI hive reset logic for SRIOV */
if (amdgpu_sriov_vf(adev)) {
@@ -4805,6 +4812,67 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
/* Post ASIC reset for all devs .*/
list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
 
+   if (amdgpu_gpu_recovery == 2 &&
+   !(tmp_vram_lost_counter < 
atomic_read(>vram_lost_counter))) {
+
+   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+   struct amdgpu_ring *ring = tmp_adev->rings[i];
+   int ret = 0;
+   struct drm_sched_job *s_job;
+
+   if (!ring || !ring->sched.thread)
+   continue;
+
+   /* No point to resubmit jobs if we didn't HW 
reset*/
+   if (!tmp_adev->asic_reset_res && !job_signaled) 
{
+
+   s_job = 
list_first_entry_or_null(>sched.pending_list, struct drm_sched_job, list);
+   if (s_job == NULL)
+   continue;
+
+   /* clear job's guilty and depend the 
folowing step to decide the real one */
+   drm_sched_reset_karma(s_job);
+   drm_sched_resubmit_jobs2(>sched, 
1);
+   ret = 
dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
+
+   if (ret == 0) { /* timeout */
+   DRM_ERROR("Found the real bad 
job! ring:%s, job_id:%llx\n", ring->sched.name, s_job->id);
+   /* set guilty */
+   drm_sched_increase_karma(s_job);
+
+   /* do hw reset */
+   if (amdgpu_sriov_vf(adev)) {
+