RE: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to complete

2024-06-18 Thread Chai, Thomas
[AMD Official Use Only - AMD Internal Distribution Only]

-
Best Regards,
Thomas

-Original Message-
From: Lazar, Lijo 
Sent: Tuesday, June 18, 2024 8:00 PM
To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Zhou1, Tao ; Li, 
Candice ; Wang, Yang(Kevin) ; Yang, 
Stanley 
Subject: Re: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to 
complete



On 6/18/2024 4:51 PM, Chai, Thomas wrote:
> -Original Message-
> From: Chai, Thomas
> Sent: Tuesday, June 18, 2024 7:09 PM
> To: Lazar, Lijo ; amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Li, Candice ; Wang,
> Yang(Kevin) ; Yang, Stanley
> 
> Subject: RE: [PATCH 4/5] drm/amdgpu: add completion to wait for ras
> reset to complete
>
> -Original Message-
> From: Lazar, Lijo 
> Sent: Tuesday, June 18, 2024 6:09 PM
> To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Li, Candice ; Wang,
> Yang(Kevin) ; Yang, Stanley
> 
> Subject: Re: [PATCH 4/5] drm/amdgpu: add completion to wait for ras
> reset to complete
>
>
>
> On 6/18/2024 12:03 PM, YiPeng Chai wrote:
>> Add completion to wait for ras reset to complete.
>>
>> Signed-off-by: YiPeng Chai 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 11 +++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  1 +
>>  2 files changed, 12 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> index 898889600771..7f8e6ca07957 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> @@ -124,6 +124,8 @@ const char *get_ras_block_str(struct ras_common_if
>> *ras_block)
>>
>>  #define AMDGPU_RAS_RETIRE_PAGE_INTERVAL 100  //ms
>>
>> +#define MAX_RAS_RECOVERY_COMPLETION_TIME  12 //ms
>> +
>>  enum amdgpu_ras_retire_page_reservation {
>>   AMDGPU_RAS_RETIRE_PAGE_RESERVED,
>>   AMDGPU_RAS_RETIRE_PAGE_PENDING,
>> @@ -2518,6 +2520,8 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
>>   atomic_set(>ras_recovery, 0);
>>   amdgpu_put_xgmi_hive(hive);
>>   }
>> +
>> + complete_all(>ras_recovery_completion);
>>  }
>>
>>  /* alloc/realloc bps array */
>> @@ -2911,10 +2915,16 @@ static int
>> amdgpu_ras_poison_consumption_handler(struct amdgpu_device *adev,
>>
>>   flush_delayed_work(>page_retirement_dwork);
>>
>> + reinit_completion(>ras_recovery_completion);
>> +
>>   con->gpu_reset_flags |= reset;
>>   amdgpu_ras_reset_gpu(adev);
>>
>>   *gpu_reset = reset;
>> + if (!wait_for_completion_timeout(>ras_recovery_completion,
>> + 
>> msecs_to_jiffies(MAX_RAS_RECOVERY_COMPLETION_TIME)))
>> + dev_err(adev->dev, "Waiting for GPU to complete ras 
>> reset timeout! reset:0x%x\n",
>> + reset);
>
>> If a mode-1 reset gets to execute first due to job timeout or HWS detection 
>> cases in poison timeout, then the RAS handler will never get executed.
>> Why is this wait required?
>
>> Thanks,
>> Lijo
>
> [Thomas] "[PATCH 5/5] drm/amdgpu: add gpu reset check and exception 
> handling" adds the check before the RAS GPU reset.
> A poison RAS reset is different from a reset triggered by other fatal 
> errors, and all poison RAS resets are triggered from here. The wait for the 
> RAS GPU reset is added here to distinguish it from other GPU resets and to 
> simplify the subsequent code.
>

> The reset mechanism resets the GPU state, whether it is triggered by poison 
> or by fatal errors. As soon as the device is reset successfully, GPU 
> operations can continue.

>So why does there need to be a special wait for the poison-triggered reset alone?
[Thomas] Different applications may trigger poison errors at arbitrary times 
before the GPU reset runs.
Because the poison GPU reset is triggered asynchronously, new poison 
consumption interrupts may arrive in the window after the GPU reset request is 
sent and before the GPU reset is actually performed.
To avoid performing another poison GPU reset right after the current one 
completes, the handler needs to wait here for the GPU reset to complete and 
then clear the cached poison consumption messages.

>Why not wait on the RAS recovery work object rather than on another completion 
>notification?
[Thomas] Yes, waiting on the RAS recovery work object is a good idea. I will do 
that.
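
For comparison, waiting on the work item itself can be modelled in the same way. Again this is a hedged userspace sketch, not the driver code: `work_model`, `ras_work_model`, `recovery_fn` and `handle_poison_flush` are hypothetical names, and the model assumes a single caller that both schedules and flushes the work. In the kernel, the corresponding call would presumably be `flush_work()` on the recovery work item, which takes no timeout argument.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model of a work item; assumes a single caller. */
struct work_model {
	bool queued;		/* set by schedule, cleared by flush */
	pthread_t thread;
	void *(*fn)(void *);
	void *arg;
};

static void work_model_init(struct work_model *w, void *(*fn)(void *),
			    void *arg)
{
	w->queued = false;
	w->fn = fn;
	w->arg = arg;
}

/* Returns false if the work is already queued or cannot be started. */
static bool work_model_schedule(struct work_model *w)
{
	if (w->queued)
		return false;
	if (pthread_create(&w->thread, NULL, w->fn, w->arg) != 0)
		return false;
	w->queued = true;
	return true;
}

/* Waits for the handler to finish; returns false if nothing was queued. */
static bool work_model_flush(struct work_model *w)
{
	if (!w->queued)
		return false;
	pthread_join(w->thread, NULL);
	w->queued = false;
	return true;
}

struct ras_work_model {
	struct work_model recovery_work;
	int cached_poison_msgs;	/* stand-in for queued poison messages */
	int resets_performed;
};

/* Stand-in for the recovery handler. */
static void *recovery_fn(void *arg)
{
	struct ras_work_model *ras = arg;

	ras->resets_performed++;	/* pretend the GPU reset happens here */
	return NULL;
}

/* Request a reset, wait on the work item, then clear cached messages. */
static void handle_poison_flush(struct ras_work_model *ras)
{
	work_model_schedule(&ras->recovery_work);
	work_model_flush(&ras->recovery_work);
	ras->cached_poison_msgs = 0;
}
```

The work item already tracks whether the handler has finished, so no separate completion object or re-arming step is needed. The trade-off in this sketch is that the wait is unbounded.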


Thanks,
Lijo

>>   }
>>
>>   return 0;
>> @@ -3041,6 +3051,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device 
>> *adev)
>>   }
>>   }
>>
>> + init_completion(>ras_recovery_completion);
>>   mutex_init(>page_rsv_lock);
>>   INIT_KFIFO(con->poison_fifo);
>>   

Re: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to complete

2024-06-18 Thread Lazar, Lijo



On 6/18/2024 4:51 PM, Chai, Thomas wrote:
> -Original Message-
> From: Chai, Thomas
> Sent: Tuesday, June 18, 2024 7:09 PM
> To: Lazar, Lijo ; amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao ; 
> Li, Candice ; Wang, Yang(Kevin) ; 
> Yang, Stanley 
> Subject: RE: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to 
> complete
> 
> -Original Message-
> From: Lazar, Lijo 
> Sent: Tuesday, June 18, 2024 6:09 PM
> To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao ; 
> Li, Candice ; Wang, Yang(Kevin) ; 
> Yang, Stanley 
> Subject: Re: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to 
> complete
> 
> 
> 
> On 6/18/2024 12:03 PM, YiPeng Chai wrote:
>> Add completion to wait for ras reset to complete.
>>
>> Signed-off-by: YiPeng Chai 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 11 +++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  1 +
>>  2 files changed, 12 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> index 898889600771..7f8e6ca07957 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> @@ -124,6 +124,8 @@ const char *get_ras_block_str(struct ras_common_if
>> *ras_block)
>>
>>  #define AMDGPU_RAS_RETIRE_PAGE_INTERVAL 100  //ms
>>
>> +#define MAX_RAS_RECOVERY_COMPLETION_TIME  12 //ms
>> +
>>  enum amdgpu_ras_retire_page_reservation {
>>   AMDGPU_RAS_RETIRE_PAGE_RESERVED,
>>   AMDGPU_RAS_RETIRE_PAGE_PENDING,
>> @@ -2518,6 +2520,8 @@ static void amdgpu_ras_do_recovery(struct work_struct 
>> *work)
>>   atomic_set(>ras_recovery, 0);
>>   amdgpu_put_xgmi_hive(hive);
>>   }
>> +
>> + complete_all(>ras_recovery_completion);
>>  }
>>
>>  /* alloc/realloc bps array */
>> @@ -2911,10 +2915,16 @@ static int
>> amdgpu_ras_poison_consumption_handler(struct amdgpu_device *adev,
>>
>>   flush_delayed_work(>page_retirement_dwork);
>>
>> + reinit_completion(>ras_recovery_completion);
>> +
>>   con->gpu_reset_flags |= reset;
>>   amdgpu_ras_reset_gpu(adev);
>>
>>   *gpu_reset = reset;
>> + if (!wait_for_completion_timeout(>ras_recovery_completion,
>> + 
>> msecs_to_jiffies(MAX_RAS_RECOVERY_COMPLETION_TIME)))
>> + dev_err(adev->dev, "Waiting for GPU to complete ras 
>> reset timeout! reset:0x%x\n",
>> + reset);
> 
>> If a mode-1 reset gets to execute first due to job timeout or HWS detection 
>> cases in poison timeout, then the RAS handler will never get executed.
>> Why is this wait required?
> 
>> Thanks,
>> Lijo
> 
> [Thomas] "[PATCH 5/5] drm/amdgpu: add gpu reset check and exception 
> handling" adds the check before the RAS GPU reset.
> A poison RAS reset is different from a reset triggered by other fatal 
> errors, and all poison RAS resets are triggered from here. The wait for the 
> RAS GPU reset is added here to distinguish it from other GPU resets and to 
> simplify the subsequent code.
> 

The reset mechanism resets the GPU state, whether it is triggered by
poison or by fatal errors. As soon as the device is reset successfully, GPU
operations can continue. So why does there need to be a special wait for
the poison-triggered reset alone? Why not wait on the RAS recovery work
object rather than on another completion notification?

Thanks,
Lijo

>>   }
>>
>>   return 0;
>> @@ -3041,6 +3051,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device 
>> *adev)
>>   }
>>   }
>>
>> + init_completion(>ras_recovery_completion);
>>   mutex_init(>page_rsv_lock);
>>   INIT_KFIFO(con->poison_fifo);
>>   mutex_init(>page_retirement_lock);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> index 91daf48be03a..b47f03edac87 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>> @@ -537,6 +537,7 @@ struct amdgpu_ras {
>>   DECLARE_KFIFO(poison_fifo, struct ras_poison_msg, 128);
>>   struct ras_ecc_log_info  umc_ecc_log;
>>   struct delayed_work page_retirement_dwork;
>> + struct completion ras_recovery_completion;
>>
>>   /* Fatal error detected flag */
>>   atomic_t fed;


RE: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to complete

2024-06-18 Thread Chai, Thomas
[AMD Official Use Only - AMD Internal Distribution Only]

-
Best Regards,
Thomas

-Original Message-
From: Chai, Thomas
Sent: Tuesday, June 18, 2024 7:09 PM
To: Lazar, Lijo ; amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Zhou1, Tao ; Li, 
Candice ; Wang, Yang(Kevin) ; Yang, 
Stanley 
Subject: RE: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to 
complete




-Original Message-
From: Lazar, Lijo 
Sent: Tuesday, June 18, 2024 6:09 PM
To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Zhou1, Tao ; Li, 
Candice ; Wang, Yang(Kevin) ; Yang, 
Stanley 
Subject: Re: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to 
complete



On 6/18/2024 12:03 PM, YiPeng Chai wrote:
> Add completion to wait for ras reset to complete.
>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 11 +++
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  1 +
>  2 files changed, 12 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 898889600771..7f8e6ca07957 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -124,6 +124,8 @@ const char *get_ras_block_str(struct ras_common_if
> *ras_block)
>
>  #define AMDGPU_RAS_RETIRE_PAGE_INTERVAL 100  //ms
>
> +#define MAX_RAS_RECOVERY_COMPLETION_TIME  12 //ms
> +
>  enum amdgpu_ras_retire_page_reservation {
>   AMDGPU_RAS_RETIRE_PAGE_RESERVED,
>   AMDGPU_RAS_RETIRE_PAGE_PENDING,
> @@ -2518,6 +2520,8 @@ static void amdgpu_ras_do_recovery(struct work_struct 
> *work)
>   atomic_set(>ras_recovery, 0);
>   amdgpu_put_xgmi_hive(hive);
>   }
> +
> + complete_all(>ras_recovery_completion);
>  }
>
>  /* alloc/realloc bps array */
> @@ -2911,10 +2915,16 @@ static int
> amdgpu_ras_poison_consumption_handler(struct amdgpu_device *adev,
>
>   flush_delayed_work(>page_retirement_dwork);
>
> + reinit_completion(>ras_recovery_completion);
> +
>   con->gpu_reset_flags |= reset;
>   amdgpu_ras_reset_gpu(adev);
>
>   *gpu_reset = reset;
> + if (!wait_for_completion_timeout(>ras_recovery_completion,
> + 
> msecs_to_jiffies(MAX_RAS_RECOVERY_COMPLETION_TIME)))
> + dev_err(adev->dev, "Waiting for GPU to complete ras 
> reset timeout! reset:0x%x\n",
> + reset);

> If a mode-1 reset gets to execute first due to job timeout or HWS detection 
> cases in poison timeout, then the RAS handler will never get executed.
> Why is this wait required?

>Thanks,
>Lijo

[Thomas] "[PATCH 5/5] drm/amdgpu: add gpu reset check and exception handling" 
adds the check before the RAS GPU reset.
A poison RAS reset is different from a reset triggered by other fatal errors, 
and all poison RAS resets are triggered from here. The wait for the RAS GPU 
reset is added here to distinguish it from other GPU resets and to simplify 
the subsequent code.

>   }
>
>   return 0;
> @@ -3041,6 +3051,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
>   }
>   }
>
> + init_completion(>ras_recovery_completion);
>   mutex_init(>page_rsv_lock);
>   INIT_KFIFO(con->poison_fifo);
>   mutex_init(>page_retirement_lock);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 91daf48be03a..b47f03edac87 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -537,6 +537,7 @@ struct amdgpu_ras {
>   DECLARE_KFIFO(poison_fifo, struct ras_poison_msg, 128);
>   struct ras_ecc_log_info  umc_ecc_log;
>   struct delayed_work page_retirement_dwork;
> + struct completion ras_recovery_completion;
>
>   /* Fatal error detected flag */
>   atomic_t fed;


RE: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to complete

2024-06-18 Thread Chai, Thomas
[AMD Official Use Only - AMD Internal Distribution Only]

-
Best Regards,
Thomas

-Original Message-
From: Lazar, Lijo 
Sent: Tuesday, June 18, 2024 6:09 PM
To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Zhou1, Tao ; Li, 
Candice ; Wang, Yang(Kevin) ; Yang, 
Stanley 
Subject: Re: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to 
complete



On 6/18/2024 12:03 PM, YiPeng Chai wrote:
> Add completion to wait for ras reset to complete.
>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 11 +++
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  1 +
>  2 files changed, 12 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 898889600771..7f8e6ca07957 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -124,6 +124,8 @@ const char *get_ras_block_str(struct ras_common_if
> *ras_block)
>
>  #define AMDGPU_RAS_RETIRE_PAGE_INTERVAL 100  //ms
>
> +#define MAX_RAS_RECOVERY_COMPLETION_TIME  12 //ms
> +
>  enum amdgpu_ras_retire_page_reservation {
>   AMDGPU_RAS_RETIRE_PAGE_RESERVED,
>   AMDGPU_RAS_RETIRE_PAGE_PENDING,
> @@ -2518,6 +2520,8 @@ static void amdgpu_ras_do_recovery(struct work_struct 
> *work)
>   atomic_set(>ras_recovery, 0);
>   amdgpu_put_xgmi_hive(hive);
>   }
> +
> + complete_all(>ras_recovery_completion);
>  }
>
>  /* alloc/realloc bps array */
> @@ -2911,10 +2915,16 @@ static int
> amdgpu_ras_poison_consumption_handler(struct amdgpu_device *adev,
>
>   flush_delayed_work(>page_retirement_dwork);
>
> + reinit_completion(>ras_recovery_completion);
> +
>   con->gpu_reset_flags |= reset;
>   amdgpu_ras_reset_gpu(adev);
>
>   *gpu_reset = reset;
> + if (!wait_for_completion_timeout(>ras_recovery_completion,
> + 
> msecs_to_jiffies(MAX_RAS_RECOVERY_COMPLETION_TIME)))
> + dev_err(adev->dev, "Waiting for GPU to complete ras 
> reset timeout! reset:0x%x\n",
> + reset);

> If a mode-1 reset gets to execute first due to job timeout or HWS detection 
> cases in poison timeout, then the RAS handler will never get executed.
> Why is this wait required?

[Thomas] "[PATCH 5/5] drm/amdgpu: add gpu reset check and exception handling" 
adds the check before the RAS GPU reset.


Thanks,
Lijo

>   }
>
>   return 0;
> @@ -3041,6 +3051,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
>   }
>   }
>
> + init_completion(>ras_recovery_completion);
>   mutex_init(>page_rsv_lock);
>   INIT_KFIFO(con->poison_fifo);
>   mutex_init(>page_retirement_lock);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 91daf48be03a..b47f03edac87 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -537,6 +537,7 @@ struct amdgpu_ras {
>   DECLARE_KFIFO(poison_fifo, struct ras_poison_msg, 128);
>   struct ras_ecc_log_info  umc_ecc_log;
>   struct delayed_work page_retirement_dwork;
> + struct completion ras_recovery_completion;
>
>   /* Fatal error detected flag */
>   atomic_t fed;


Re: [PATCH 4/5] drm/amdgpu: add completion to wait for ras reset to complete

2024-06-18 Thread Lazar, Lijo



On 6/18/2024 12:03 PM, YiPeng Chai wrote:
> Add completion to wait for ras reset to complete.
> 
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 11 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  1 +
>  2 files changed, 12 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 898889600771..7f8e6ca07957 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -124,6 +124,8 @@ const char *get_ras_block_str(struct ras_common_if 
> *ras_block)
>  
>  #define AMDGPU_RAS_RETIRE_PAGE_INTERVAL 100  //ms
>  
> +#define MAX_RAS_RECOVERY_COMPLETION_TIME  12 //ms
> +
>  enum amdgpu_ras_retire_page_reservation {
>   AMDGPU_RAS_RETIRE_PAGE_RESERVED,
>   AMDGPU_RAS_RETIRE_PAGE_PENDING,
> @@ -2518,6 +2520,8 @@ static void amdgpu_ras_do_recovery(struct work_struct 
> *work)
>   atomic_set(>ras_recovery, 0);
>   amdgpu_put_xgmi_hive(hive);
>   }
> +
> + complete_all(>ras_recovery_completion);
>  }
>  
>  /* alloc/realloc bps array */
> @@ -2911,10 +2915,16 @@ static int 
> amdgpu_ras_poison_consumption_handler(struct amdgpu_device *adev,
>  
>   flush_delayed_work(>page_retirement_dwork);
>  
> + reinit_completion(>ras_recovery_completion);
> +
>   con->gpu_reset_flags |= reset;
>   amdgpu_ras_reset_gpu(adev);
>  
>   *gpu_reset = reset;
> + if (!wait_for_completion_timeout(>ras_recovery_completion,
> +   msecs_to_jiffies(MAX_RAS_RECOVERY_COMPLETION_TIME)))
> + dev_err(adev->dev, "Waiting for GPU to complete ras reset timeout! reset:0x%x\n",
> + reset);

If a mode-1 reset gets to execute first due to job timeout or HWS detection
cases in poison timeout, then the RAS handler will never get executed.
Why is this wait required?

Thanks,
Lijo

>   }
>  
>   return 0;
> @@ -3041,6 +3051,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
>   }
>   }
>  
> + init_completion(>ras_recovery_completion);
>   mutex_init(>page_rsv_lock);
>   INIT_KFIFO(con->poison_fifo);
>   mutex_init(>page_retirement_lock);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 91daf48be03a..b47f03edac87 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -537,6 +537,7 @@ struct amdgpu_ras {
>   DECLARE_KFIFO(poison_fifo, struct ras_poison_msg, 128);
>   struct ras_ecc_log_info  umc_ecc_log;
>   struct delayed_work page_retirement_dwork;
> + struct completion ras_recovery_completion;
>  
>   /* Fatal error detected flag */
>   atomic_t fed;