amdgpu: add completion to wait for ras reset to complete

Chai, Thomas Tue, 18 Jun 2024 20:04:34 -0700

[AMD Official Use Only - AMD Internal Distribution Only]

-----------------
Best Regards,
Thomas


-----Original Message-----
From: Lazar, Lijo <lijo.la...@amd.com>
Sent: Tuesday, June 18, 2024 8:00 PM
To: Chai, Thomas <yipeng.c...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking <hawking.zh...@amd.com>; Zhou1, Tao <tao.zh...@amd.com>; Li, 
Candice <candice...@amd.com>; Wang, Yang(Kevin) <kevinyang.w...@amd.com>; Yang, 
Stanley <stanley.y...@amd.com>
Subject: Re: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to 
complete



On 6/18/2024 4:51 PM, Chai, Thomas wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> -----------------
> Best Regards,
> Thomas
>
> -----Original Message-----
> From: Chai, Thomas
> Sent: Tuesday, June 18, 2024 7:09 PM
> To: Lazar, Lijo <lijo.la...@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking <hawking.zh...@amd.com>; Zhou1, Tao
> <tao.zh...@amd.com>; Li, Candice <candice...@amd.com>; Wang,
> Yang(Kevin) <kevinyang.w...@amd.com>; Yang, Stanley
> <stanley.y...@amd.com>
> Subject: RE: [PATCH 4/5] drm/amdgpu: add completion to wait for ras
> reset to complete
>
>
>
>
> -----------------
> Best Regards,
> Thomas
>
> -----Original Message-----
> From: Lazar, Lijo <lijo.la...@amd.com>
> Sent: Tuesday, June 18, 2024 6:09 PM
> To: Chai, Thomas <yipeng.c...@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking <hawking.zh...@amd.com>; Zhou1, Tao
> <tao.zh...@amd.com>; Li, Candice <candice...@amd.com>; Wang,
> Yang(Kevin) <kevinyang.w...@amd.com>; Yang, Stanley
> <stanley.y...@amd.com>
> Subject: Re: [PATCH 4/5] drm/amdgpu: add completion to wait for ras
> reset to complete
>
>
>
> On 6/18/2024 12:03 PM, YiPeng Chai wrote:
>> Add completion to wait for ras reset to complete.
>>
>> Signed-off-by: YiPeng Chai <yipeng.c...@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 11 +++++++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  1 +
>>  2 files changed, 12 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> index 898889600771..7f8e6ca07957 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> @@ -124,6 +124,8 @@ const char *get_ras_block_str(struct
>> ras_common_if
>> *ras_block)
>>
>>  #define AMDGPU_RAS_RETIRE_PAGE_INTERVAL 100  //ms
>>
>> +#define MAX_RAS_RECOVERY_COMPLETION_TIME  120000 //ms
>> +
>>  enum amdgpu_ras_retire_page_reservation {
>>       AMDGPU_RAS_RETIRE_PAGE_RESERVED,
>>       AMDGPU_RAS_RETIRE_PAGE_PENDING, @@ -2518,6 +2520,8 @@ static
>> void amdgpu_ras_do_recovery(struct work_struct *work)
>>               atomic_set(&hive->ras_recovery, 0);
>>               amdgpu_put_xgmi_hive(hive);
>>       }
>> +
>> +     complete_all(&ras->ras_recovery_completion);
>>  }
>>
>>  /* alloc/realloc bps array */
>> @@ -2911,10 +2915,16 @@ static int
>> amdgpu_ras_poison_consumption_handler(struct amdgpu_device *adev,
>>
>>               flush_delayed_work(&con->page_retirement_dwork);
>>
>> +             reinit_completion(&con->ras_recovery_completion);
>> +
>>               con->gpu_reset_flags |= reset;
>>               amdgpu_ras_reset_gpu(adev);
>>
>>               *gpu_reset = reset;
>> +             if (!wait_for_completion_timeout(&con->ras_recovery_completion,
>> +                             
>> msecs_to_jiffies(MAX_RAS_RECOVERY_COMPLETION_TIME)))
>> +                     dev_err(adev->dev, "Waiting for GPU to complete ras 
>> reset timeout! reset:0x%x\n",
>> +                             reset);
>
>> If a mode-1 reset gets to execute first due to job timeout/hws detect cases 
>> in poison timeout, then the ras handler will never get executed.
>> Why this wait is required?
>
>> Thanks,
>> Lijo
>
> [Thomas]  "[PATCH 5/5] drm/amdgpu: add gpu reset check and exception 
> handling" add the check before ras gpu reset.
>                 Poison ras reset is different from reset triggered by other 
> fatal errors, and all poison RAS resets are triggered from here,
>              in order to distinguish other gpu resets and facilitate 
> subsequent  code processing, so add wait for gpu ras reset here.
>

> Reset mechanism resets the GPU state - whether it's triggered due to poison 
> or fatal errors. As soon as the device is reset successfully, GPU operations 
> can continue.

>So why there needs to be a special wait for poison triggred reset alone?
[Thomas] Different applications may randomly trigger poison errors before gpu 
reset.
                 Since poison gpu reset is triggered asynchronously, new poison 
consumption interrupts may occur in the period after gpu reset request is sent 
and before the GPU reset is actually performed..
                  In order to avoid performing a poison gpu reset again after 
completing the current poison gpu reset,  It need to stay here to wait for gpu 
to complete reset and then clear the cached poison consumption messages.

>Why not wait on the RAS recovery work object  rather than another completion 
>notification?
[Thomas] Yes, "wait on RAS recovery work object" is a good idea,  I will do it.


Thanks,
Lijo

>>       }
>>
>>       return 0;
>> @@ -3041,6 +3051,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device 
>> *adev)
>>               }
>>       }
>>
>> +     init_completion(&con->ras_recovery_completion);
>>       mutex_init(&con->page_rsv_lock);
>>       INIT_KFIFO(con->poison_fifo);
>>       mutex_init(&con->page_retirement_lock);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> index 91daf48be03a..b47f03edac87 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> @@ -537,6 +537,7 @@ struct amdgpu_ras {
>>       DECLARE_KFIFO(poison_fifo, struct ras_poison_msg, 128);
>>       struct ras_ecc_log_info  umc_ecc_log;
>>       struct delayed_work page_retirement_dwork;
>> +     struct completion ras_recovery_completion;
>>
>>       /* Fatal error detected flag */
>>       atomic_t fed;

RE: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to complete

Reply via email to