RE: [PATCH Review V2 1/1] drm/amdgpu: Fix eeprom max record count

2024-07-17 Thread Yang, Stanley
[AMD Official Use Only - AMD Internal Distribution Only]

Setting the eeprom table version only at the beginning of
amdgpu_ras_recovery_init is not enough. On a new device that has never loaded
the driver, there is no eeprom info available, so amdgpu_ras_eeprom_init reads
a table version of zero from the device eeprom table and overwrites the value
that was set earlier.
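
To illustrate the ordering problem, here is a minimal standalone sketch. It is
not driver code: the struct, the enum values and every function name below are
hypothetical stand-ins, and the record counts are made-up numbers. It only
models the idea that an init step which zeroes the cached version undoes an
earlier "set version" call, while refreshing the version inside the lookup
does not depend on call order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the driver's definitions. */
enum { VER_UNSET = 0, VER_V1 = 1, VER_V2_1 = 2 };
enum { MAX_RECS_V1 = 100, MAX_RECS_V2_1 = 200 };

struct eeprom_ctl {
	uint32_t version;   /* cached table header version */
	int hw_wants_v2_1;  /* what the ASIC check would select */
};

/* Models amdgpu_ras_set_eeprom_table_version: derive version from hardware. */
static void set_table_version(struct eeprom_ctl *c)
{
	assert(c != NULL);
	c->version = c->hw_wants_v2_1 ? VER_V2_1 : VER_V1;
}

/* Models eeprom init on a blank device: the header read back is all zero. */
static void eeprom_init_blank(struct eeprom_ctl *c)
{
	assert(c != NULL);
	c->version = VER_UNSET;
}

/* Lookup that trusts whatever version is currently cached. */
static uint32_t max_record_count(const struct eeprom_ctl *c)
{
	return c->version == VER_V2_1 ? MAX_RECS_V2_1 : MAX_RECS_V1;
}

/* Lookup as in the patch: refresh the version before reading it. */
static uint32_t max_record_count_fixed(struct eeprom_ctl *c)
{
	set_table_version(c);
	return max_record_count(c);
}
```

Calling set_table_version(), then eeprom_init_blank(), then max_record_count()
falls back to the V1 count in this model even though the hardware wants V2.1,
while max_record_count_fixed() returns the V2.1 count regardless of what ran
before it.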

Regards,
Stanley
> -Original Message-
> From: Zhang, Hawking 
> Sent: Thursday, July 18, 2024 11:52 AM
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org
> Cc: Yang, Stanley 
> Subject: RE: [PATCH Review V2 1/1] drm/amdgpu: Fix eeprom max record count
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Can you please try moving amdgpu_ras_set_eeprom_table_version to the
> beginning of amdgpu_ras_recovery_init?
>
> That way, we don't need to invoke this function from both
> amdgpu_ras_eeprom_max_record_count and amdgpu_ras_eeprom_init.
>
> Regards,
> Hawking
>
> -Original Message-
> From: amd-gfx  On Behalf Of
> Stanley.Yang
> Sent: Thursday, July 18, 2024 11:20
> To: amd-gfx@lists.freedesktop.org
> Cc: Yang, Stanley 
> Subject: [PATCH Review V2 1/1] drm/amdgpu: Fix eeprom max record count
>
> The eeprom table is empty before initialization, so set the eeprom table
> version first, before initializing.
>
> Changed from V1:
> Reuse amdgpu_ras_set_eeprom_table_version function
>
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index eae0a555df3c..aab8077e5098 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -1011,6 +1011,9 @@ int amdgpu_ras_eeprom_read(struct
> amdgpu_ras_eeprom_control *control,
>
>  uint32_t amdgpu_ras_eeprom_max_record_count(struct
> amdgpu_ras_eeprom_control *control)  {
> +   /* get available eeprom table version first before eeprom table init 
> */
> +   amdgpu_ras_set_eeprom_table_version(control);
> +
> if (control->tbl_hdr.version == RAS_TABLE_VER_V2_1)
> return RAS_MAX_RECORD_COUNT_V2_1;
> else
> --
> 2.25.1
>



RE: [PATCH] drm/amdgpu: sysfs node disable query error count during gpu reset

2024-07-01 Thread Yang, Stanley
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: Chai, Thomas 
> Sent: Monday, July 1, 2024 4:11 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Li, Candice ; Wang, Yang(Kevin)
> ; Yang, Stanley ; Chai,
> Thomas 
> Subject: [PATCH] drm/amdgpu: sysfs node disable query error count during gpu
> reset
>
> Disable querying the error count through the sysfs node during gpu reset.
>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/aldebaran.c | 2 --
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c| 3 +++
>  3 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> index d0a8da67dc2a..b0f95a7649bf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> +++ b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> @@ -316,8 +316,6 @@ static int aldebaran_mode2_restore_ip(struct
> amdgpu_device *adev)
>   adev->ip_blocks[i].status.late_initialized = true;
>   }
>
> - amdgpu_ras_set_error_query_ready(adev, true);
> -
>   amdgpu_device_set_cg_state(adev, AMD_CG_STATE_GATE);
>   amdgpu_device_set_pg_state(adev, AMD_PG_STATE_GATE);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e133a9982a77..41689aa24e67 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3157,7 +3157,8 @@ static int amdgpu_device_ip_late_init(struct
> amdgpu_device *adev)
>   return r;
>   }
>
> - amdgpu_ras_set_error_query_ready(adev, true);
> + if (!amdgpu_in_reset(adev))
> + amdgpu_ras_set_error_query_ready(adev, true);
>
>   amdgpu_device_set_cg_state(adev, AMD_CG_STATE_GATE);
>   amdgpu_device_set_pg_state(adev, AMD_PG_STATE_GATE); diff --git
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index ac7ded01dad0..e2abc04112d2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1295,6 +1295,9 @@ ssize_t amdgpu_ras_aca_sysfs_read(struct device
> *dev, struct device_attribute *a
>   .head = obj->head,
>   };
>
> + if (!amdgpu_ras_get_error_query_ready(obj->adev))
> + return sysfs_emit(buf, "Query currently inaccessible\n");
> +
>   if (amdgpu_ras_query_error_status(obj->adev, &info))
>   return -EINVAL;
>
> --
> 2.34.1



RE: [PATCH V2] drm/amdgpu: sysfs node disable query error count during gpu reset

2024-07-01 Thread Yang, Stanley
[AMD Official Use Only - AMD Internal Distribution Only]

Hi Thomas,

I think we can optimize where amdgpu_ras_set_error_query_ready(adev, true) is
called during GPU recovery:
amdgpu_ras_set_error_query_ready(tmp_adev, false) -> recovery start -> recovery
done -> amdgpu_ras_set_error_query_ready(tmp_adev, true).
This sequence prevents the error count from being queried during GPU recovery.
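
The sequence above can be sketched as a small standalone model. This is only
an illustration under stated assumptions: struct fake_dev, recovery_begin(),
recovery_end() and error_count_show() are hypothetical names, not amdgpu
functions, and the real driver would also need whatever locking the recovery
path requires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical device model; field names are illustrative only. */
struct fake_dev {
	bool error_query_ready;
	unsigned long ue_count;
};

static void set_error_query_ready(struct fake_dev *d, bool ready)
{
	assert(d != NULL);
	d->error_query_ready = ready;
}

/* Step 1: close the query window before recovery starts. */
static void recovery_begin(struct fake_dev *d)
{
	set_error_query_ready(d, false);
}

/* Step 2: reopen the window only after recovery is done. */
static void recovery_end(struct fake_dev *d)
{
	set_error_query_ready(d, true);
}

/* Models a sysfs read handler that honours the ready flag. */
static int error_count_show(const struct fake_dev *d, char *buf, size_t len)
{
	if (!d->error_query_ready)
		return snprintf(buf, len, "Query currently inaccessible\n");
	return snprintf(buf, len, "ue: %lu\n", d->ue_count);
}
```

With the flag owned by the recovery path itself, a read between
recovery_begin() and recovery_end() reports "Query currently inaccessible"
instead of touching the hardware, and no other init path has to special-case
the reset.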

Regards,
Stanley
> -Original Message-
> From: Chai, Thomas 
> Sent: Monday, July 1, 2024 11:19 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Li, Candice ; Wang,
> Yang(Kevin) ; Yang, Stanley
> ; Chai, Thomas 
> Subject: [PATCH V2] drm/amdgpu: sysfs node disable query error count
> during gpu reset
>
> Disable querying the error count through the sysfs node during gpu reset.
>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 15 +--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index ac7ded01dad0..a65b5197b0fc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -619,6 +619,7 @@ static const struct file_operations
> amdgpu_ras_debugfs_eeprom_ops = {  static ssize_t
> amdgpu_ras_sysfs_read(struct device *dev,
>   struct device_attribute *attr, char *buf)  {
> + int ret;
>   struct ras_manager *obj = container_of(attr, struct ras_manager,
> sysfs_attr);
>   struct ras_query_if info = {
>   .head = obj->head,
> @@ -627,7 +628,10 @@ static ssize_t amdgpu_ras_sysfs_read(struct device
> *dev,
>   if (!amdgpu_ras_get_error_query_ready(obj->adev))
>   return sysfs_emit(buf, "Query currently inaccessible\n");
>
> - if (amdgpu_ras_query_error_status(obj->adev, &info))
> + ret = amdgpu_ras_query_error_status(obj->adev, &info);
> + if (ret == -EIO) /* gpu reset is ongoing */
> + return sysfs_emit(buf, "Query currently inaccessible\n");
> + else if (ret)
>   return -EINVAL;
>
>   if (amdgpu_ip_version(obj->adev, MP0_HWIP, 0) != IP_VERSION(11,
> 0, 2) && @@ -1290,12 +1294,19 @@ static int
> amdgpu_aca_log_ras_error_data(struct amdgpu_device *adev, enum amdgpu
> ssize_t amdgpu_ras_aca_sysfs_read(struct device *dev, struct device_attribute
> *attr,
> struct aca_handle *handle, char *buf, void
> *data)  {
> + int ret;
>   struct ras_manager *obj = container_of(handle, struct ras_manager,
> aca_handle);
>   struct ras_query_if info = {
>   .head = obj->head,
>   };
>
> - if (amdgpu_ras_query_error_status(obj->adev, &info))
> + if (!amdgpu_ras_get_error_query_ready(obj->adev))
> + return sysfs_emit(buf, "Query currently inaccessible\n");
> +
> + ret = amdgpu_ras_query_error_status(obj->adev, &info);
> + if (ret == -EIO) /* gpu reset is ongoing */
> + return sysfs_emit(buf, "Query currently inaccessible\n");
> + else if (ret)
>   return -EINVAL;
>
>   return sysfs_emit(buf, "%s: %lu\n%s: %lu\n%s: %lu\n", "ue",
> info.ue_count,
> --
> 2.34.1



RE: [PATCH 1/2] drm/amdgpu: add RAS is_rma flag

2024-05-23 Thread Yang, Stanley
[AMD Official Use Only - AMD Internal Distribution Only]

> -Original Message-
> From: amd-gfx  On Behalf Of Tao Zhou
> Sent: Thursday, May 23, 2024 6:02 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao 
> Subject: [PATCH 1/2] drm/amdgpu: add RAS is_rma flag
>
> Set the flag to true if bad page number reaches threshold.
>
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c|  7 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h|  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 10 ++
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h |  3 +--
>  4 files changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index ecce022c657b..934dfb2bf9e5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2940,7 +2940,6 @@ int amdgpu_ras_recovery_init(struct amdgpu_device
> *adev)
>   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>   struct ras_err_handler_data **data;
>   u32  max_eeprom_records_count = 0;
> - bool exc_err_limit = false;
>   int ret;
>
>   if (!con || amdgpu_sriov_vf(adev))
> @@ -2977,12 +2976,12 @@ int amdgpu_ras_recovery_init(struct
> amdgpu_device *adev)
>*/
>   if (adev->gmc.xgmi.pending_reset)
>   return 0;
> - ret = amdgpu_ras_eeprom_init(&con->eeprom_control, &exc_err_limit);
> + ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
>   /*
>* This calling fails when exc_err_limit is true or
>* ret != 0.
>*/
> - if (exc_err_limit || ret)
> + if (con->is_rma || ret)
>   goto free;
>
>   if (con->eeprom_control.ras_num_recs) { @@ -3033,7 +3032,7 @@ int
> amdgpu_ras_recovery_init(struct amdgpu_device *adev)
>* Except error threshold exceeding case, other failure cases in this
>* function would not fail amdgpu driver init.
>*/
> - if (!exc_err_limit)
> + if (!con->is_rma)
>   ret = 0;
>   else
>   ret = -EINVAL;

[Stanley]: Should we stop device service if the device enters the RMA state
at runtime? The amdgpu_ras_recovery_init function is only called while the
driver is loading.

Regards,
Stanley
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index d06c01b978cd..437c58c85639 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -521,6 +521,7 @@ struct amdgpu_ras {
>   bool update_channel_flag;
>   /* Record status of smu mca debug mode */
>   bool is_aca_debug_mode;
> + bool is_rma;
>
>   /* Record special requirements of gpu reset caller */
>   uint32_t  gpu_reset_flags;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 9b789dcc2bd1..eae0a555df3c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -750,6 +750,9 @@ amdgpu_ras_eeprom_update_header(struct
> amdgpu_ras_eeprom_control *control)
>   control->tbl_rai.health_percent = 0;
>   }
>
> + if (amdgpu_bad_page_threshold != -1)
> + ras->is_rma = true;
> +
>   /* ignore the -ENOTSUPP return value */
>   amdgpu_dpm_send_rma_reason(adev);
>   }
> @@ -1321,8 +1324,7 @@ static int __read_table_ras_info(struct
> amdgpu_ras_eeprom_control *control)
>   return res == RAS_TABLE_V2_1_INFO_SIZE ? 0 : res;  }
>
> -int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
> -bool *exceed_err_limit)
> +int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control)
>  {
>   struct amdgpu_device *adev = to_amdgpu_device(control);
>   unsigned char buf[RAS_TABLE_HEADER_SIZE] = { 0 }; @@ -1330,7
> +1332,7 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control
> *control,
>   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
>   int res;
>
> - *exceed_err_limit = false;
> + ras->is_rma = false;
>
>   if (!__is_ras_eeprom_supported(adev))
>   return 0;
> @@ -1422,7 +1424,7 @@ int amdgpu_ras_eeprom_init(struct
> amdgpu_ras_eeprom_control *control,
>   dev_warn(adev->dev, "GPU will be initialized
> due to bad_page_threshold = -1.");
>   res = 0;
>   } else {
> - *exceed_err_limit = true;
> + ras->is_rma = true;
>   dev_err(adev->dev,
>   "RAS records:%d exceed threshold:%d, "
>   "GPU will not be initialized. Replace 
> this
> GPU or increase the threshold", diff --git
> 

RE: [PATCH Review 1/1] drm/amdgpu: Adjust XGMI WAFL ras enable bit

2024-04-25 Thread Yang, Stanley
[AMD Official Use Only - General]

Thanks for the reminder. The XGMI/WAFL caps are set on devices without an XGMI
link; I will notify the PSP firmware team so they can fix it.

Regards,
Stanley
> -Original Message-
> From: Zhang, Hawking 
> Sent: Thursday, April 25, 2024 3:26 PM
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org
> Cc: Yang, Stanley 
> Subject: RE: [PATCH Review 1/1] drm/amdgpu: Adjust XGMI WAFL ras enable bit
>
> [AMD Official Use Only - General]
>
> Hmm... we do expect PSP report the XGMI/WAFL Caps. This is different from
> legacy RAS CAP check through atomfirmware. But if you found the XGMI/WAFL
> bits are not set properly in the new PSP interface, let's reach out to PSP 
> firmware
> team for a fix.
>
> Regards,
> Hawking
>
> -Original Message-
> From: amd-gfx  On Behalf Of
> Stanley.Yang
> Sent: Thursday, April 25, 2024 15:08
> To: amd-gfx@lists.freedesktop.org
> Cc: Yang, Stanley 
> Subject: [PATCH Review 1/1] drm/amdgpu: Adjust XGMI WAFL ras enable bit
>
> The way to get the ras capability has changed for some asics; both ways need
> to check the number of XGMI physical nodes to set the XGMI WAFL ras enable bit.
>
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index b2a883d3e19d..ea77e00cc002 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2918,13 +2918,6 @@ static void
> amdgpu_ras_query_ras_capablity_from_vbios(struct amdgpu_device *adev
> else
> adev->ras_hw_enabled &= ~(1 << AMDGPU_RAS_BLOCK__VCN |
>   1 << 
> AMDGPU_RAS_BLOCK__JPEG);
> -
> -   /*
> -* XGMI RAS is not supported if xgmi num physical nodes
> -* is zero
> -*/
> -   if (!adev->gmc.xgmi.num_physical_nodes)
> -   adev->ras_hw_enabled &= ~(1 <<
> AMDGPU_RAS_BLOCK__XGMI_WAFL);
> } else {
> dev_info(adev->dev, "SRAM ECC is not presented.\n");
> }
> @@ -3002,6 +2995,13 @@ static void amdgpu_ras_check_supported(struct
> amdgpu_device *adev)
> amdgpu_ras_query_poison_mode(adev);
>
>  init_ras_enabled_flag:
> +   /*
> +* XGMI RAS is not supported if xgmi num physical nodes
> +* is zero
> +*/
> +   if (!adev->gmc.xgmi.num_physical_nodes)
> +   adev->ras_hw_enabled &= ~(1 <<
> AMDGPU_RAS_BLOCK__XGMI_WAFL);
> +
> /* hw_supported needs to be aligned with RAS block mask. */
> adev->ras_hw_enabled &= AMDGPU_RAS_BLOCK_MASK;
>
> --
> 2.25.1
>



RE: [PATCH Review 1/1] drm/amdgpu: Support setting reset_method at runtime

2024-04-17 Thread Yang, Stanley
[AMD Official Use Only - General]

> -Original Message-
> From: Christian König 
> Sent: Wednesday, April 17, 2024 8:46 PM
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH Review 1/1] drm/amdgpu: Support setting reset_method
> at runtime
>
> Am 12.04.24 um 08:21 schrieb Stanley.Yang:
> > Signed-off-by: Stanley.Yang 
>
> You are missing a commit message, without it the patch will automatically be
> rejected when you try to push it.

Thank you Chris, will add it before push.

Regards,
Stanley

>
> With that added Reviewed-by: Christian König 
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index 80b9642f2bc4..5f5bf0c26b1f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -915,7 +915,7 @@ module_param_named(freesync_video,
> amdgpu_freesync_vid_mode, uint, 0444);
> >* GPU reset method (-1 = auto (default), 0 = legacy, 1 = mode0, 2 = 
> > mode1,
> 3 = mode2, 4 = baco)
> >*/
> >   MODULE_PARM_DESC(reset_method, "GPU reset method (-1 = auto
> > (default), 0 = legacy, 1 = mode0, 2 = mode1, 3 = mode2, 4 =
> > baco/bamaco)"); -module_param_named(reset_method,
> amdgpu_reset_method,
> > int, 0444);
> > +module_param_named(reset_method, amdgpu_reset_method, int, 0644);
> >
> >   /**
> >* DOC: bad_page_threshold (int) Bad page threshold is specifies the



RE: [PATCH Review 1/1] drm/amdgpu: Support setting reset_method at runtime

2024-04-17 Thread Yang, Stanley
[AMD Official Use Only - General]

Ping...

Regards,
Stanley
> -Original Message-
> From: Yang, Stanley 
> Sent: Friday, April 12, 2024 2:21 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Yang, Stanley 
> Subject: [PATCH Review 1/1] drm/amdgpu: Support setting reset_method at
> runtime
>
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 80b9642f2bc4..5f5bf0c26b1f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -915,7 +915,7 @@ module_param_named(freesync_video,
> amdgpu_freesync_vid_mode, uint, 0444);
>   * GPU reset method (-1 = auto (default), 0 = legacy, 1 = mode0, 2 = mode1, 
> 3 =
> mode2, 4 = baco)
>   */
>  MODULE_PARM_DESC(reset_method, "GPU reset method (-1 = auto (default), 0
> = legacy, 1 = mode0, 2 = mode1, 3 = mode2, 4 = baco/bamaco)"); -
> module_param_named(reset_method, amdgpu_reset_method, int, 0444);
> +module_param_named(reset_method, amdgpu_reset_method, int, 0644);
>
>  /**
>   * DOC: bad_page_threshold (int) Bad page threshold is specifies the
> --
> 2.25.1



RE: [PATCH Review 1/1] drm/amdgpu: Support setting recover method

2024-04-11 Thread Yang, Stanley
[AMD Official Use Only - General]

> -Original Message-
> From: Christian König 
> Sent: Thursday, April 11, 2024 7:17 PM
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH Review 1/1] drm/amdgpu: Support setting recover method
>
> Am 11.04.24 um 13:11 schrieb Stanley.Yang:
> > Don't modify amdgpu gpu recover get operation, add amdgpu gpu recover
> > set operation to select reset method, only support mode1 and mode2
> > currently.
>
> Well I don't think setting this from userspace is valid.
>
> The reset method to use is determined by the hardware and environment (e.g.
> SRIOV, passthrough, whatever) and can't be chosen simply.

[Stanley]: Agreed. The setting has no effect on devices that do not support
selecting a reset method; those devices still reset with the default method.
But it is valid for devices that do support selecting a reset method: the user
can run combined tests, such as a mode1 test followed by a mode2 test, without
reloading the driver with modprobe.

Regards,
Stanley
>
> Regards,
> Christian.
>
> >
> > Signed-off-by: Stanley.Yang 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu.h|  3 ++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 37
> +++---
> >   3 files changed, 37 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index 9c62552bec34..c82976b2b977 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -1151,6 +1151,9 @@ struct amdgpu_device {
> > booldebug_largebar;
> > booldebug_disable_soft_recovery;
> > booldebug_use_vram_fw_buf;
> > +
> > +   /* Used to set gpu reset method */
> > +   int recover_method;
> >   };
> >
> >   static inline uint32_t amdgpu_ip_version(const struct amdgpu_device
> > *adev, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 3204b8f6edeb..8411a793be18 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -3908,6 +3908,7 @@ int amdgpu_device_init(struct amdgpu_device
> *adev,
> > else
> > adev->asic_type = flags & AMD_ASIC_MASK;
> >
> > +   adev->recover_method = AMD_RESET_METHOD_NONE;
> > adev->usec_timeout = AMDGPU_MAX_USEC_TIMEOUT;
> > if (amdgpu_emu_mode == 1)
> > adev->usec_timeout *= 10;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > index 10832b470448..e388a50d11d9 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > @@ -965,9 +965,37 @@ static int gpu_recover_get(void *data, u64 *val)
> > return 0;
> >   }
> >
> > +static int gpu_recover_set(void *data, u64 val) {
> > +   struct amdgpu_device *adev = (struct amdgpu_device *)data;
> > +   struct drm_device *dev = adev_to_drm(adev);
> > +   int r;
> > +
> > +   /* TODO: support mode1 and mode2 currently */
> > +   if (val == AMD_RESET_METHOD_MODE1 ||
> > +   val == AMD_RESET_METHOD_MODE2)
> > +   adev->recover_method = val;
> > +   else
> > +   adev->recover_method = AMD_RESET_METHOD_NONE;
> > +
> > +   r = pm_runtime_get_sync(dev->dev);
> > +   if (r < 0) {
> > +   pm_runtime_put_autosuspend(dev->dev);
> > +   return 0;
> > +   }
> > +
> > +   if (amdgpu_reset_domain_schedule(adev->reset_domain, 
> >reset_work))
> > +   flush_work(>reset_work);
> > +
> > +   pm_runtime_mark_last_busy(dev->dev);
> > +   pm_runtime_put_autosuspend(dev->dev);
> > +
> > +   return 0;
> > +}
> > +
> >   DEFINE_SHOW_ATTRIBUTE(amdgpu_debugfs_fence_info);
> > -DEFINE_DEBUGFS_ATTRIBUTE(amdgpu_debugfs_gpu_recover_fops,
> gpu_recover_get, NULL,
> > -"%lld\n");
> > +DEFINE_DEBUGFS_ATTRIBUTE(amdgpu_debugfs_gpu_recover_fops,
> gpu_recover_get,
> > +gpu_recover_set, "%lld\n");
> >
> >   static void amdgpu_debugfs_reset_work(struct work_struct *work)
> >   {
> > @@ -978,9 +1006,10 @@ static void amdgpu_debugfs_reset_work(struct
> > work_struct *work)
> >
> > memset(_context, 0, sizeof(reset_c

RE: [PATCH 2/2] drm/amdgpu: simplify convert_error_address interface for UMC v12

2024-03-21 Thread Yang, Stanley
[AMD Official Use Only - General]

The series is Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: amd-gfx  On Behalf Of Tao
> Zhou
> Sent: Thursday, March 21, 2024 11:30 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao 
> Subject: [PATCH 2/2] drm/amdgpu: simplify convert_error_address interface
> for UMC v12
>
> Replace separate parameters with struct ta_ras_query_address_input.
>
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/umc_v12_0.c | 57 ++-
> ---
>  1 file changed, 30 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> index 0a9cc87e98d0..d0fcfcb3404f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> @@ -266,26 +266,19 @@ static void umc_v12_0_mca_addr_to_pa(struct
> amdgpu_device *adev,  }
>
>  static void umc_v12_0_convert_error_address(struct amdgpu_device *adev,
> - struct ras_err_data *err_data,
> uint64_t err_addr,
> - uint32_t ch_inst, uint32_t
> umc_inst,
> - uint32_t node_inst, uint32_t
> socket_id)
> + struct ras_err_data *err_data,
> + struct ta_ras_query_address_input
> *addr_in)
>  {
>   uint32_t col, row, row_xor, bank, channel_index;
> - uint64_t soc_pa, retired_page, column;
> - struct ta_ras_query_address_input addr_in;
> + uint64_t soc_pa, retired_page, column, err_addr;
>   struct ta_ras_query_address_output addr_out;
>
> - addr_in.addr_type = TA_RAS_MCA_TO_PA;
> - addr_in.ma.err_addr = err_addr;
> - addr_in.ma.ch_inst = ch_inst;
> - addr_in.ma.umc_inst = umc_inst;
> - addr_in.ma.node_inst = node_inst;
> - addr_in.ma.socket_id = socket_id;
> -
> - if (psp_ras_query_address(&adev->psp, &addr_in, &addr_out))
> + err_addr = addr_in->ma.err_addr;
> + addr_in->addr_type = TA_RAS_MCA_TO_PA;
> + if (psp_ras_query_address(&adev->psp, addr_in, &addr_out))
>   /* fallback to old path if fail to get pa from psp */
> - umc_v12_0_mca_addr_to_pa(adev, err_addr, ch_inst,
> umc_inst,
> - node_inst, &addr_out);
> + umc_v12_0_mca_addr_to_pa(adev, err_addr, addr_in-
> >ma.ch_inst,
> + addr_in->ma.umc_inst, addr_in-
> >ma.node_inst, &addr_out);
>
>   soc_pa = addr_out.pa.pa;
>   bank = addr_out.pa.bank;
> @@ -310,7 +303,7 @@ static void umc_v12_0_convert_error_address(struct
> amdgpu_device *adev,
>   "Error Address(PA):0x%-10llx Row:0x%-4x Col:0x%-2x
> Bank:0x%x Channel:0x%x\n",
>   retired_page, row, col, bank, channel_index);
>   amdgpu_umc_fill_error_record(err_data, err_addr,
> - retired_page, channel_index, umc_inst);
> + retired_page, channel_index, addr_in->ma.umc_inst);
>
>   /* shift R13 bit */
>   retired_page ^= (0x1ULL << UMC_V12_0_PA_R13_BIT); @@ -
> 318,7 +311,7 @@ static void umc_v12_0_convert_error_address(struct
> amdgpu_device *adev,
>   "Error Address(PA):0x%-10llx Row:0x%-4x Col:0x%-2x
> Bank:0x%x Channel:0x%x\n",
>   retired_page, row_xor, col, bank, channel_index);
>   amdgpu_umc_fill_error_record(err_data, err_addr,
> - retired_page, channel_index, umc_inst);
> + retired_page, channel_index, addr_in->ma.umc_inst);
>   }
>  }
>
> @@ -326,13 +319,13 @@ static int umc_v12_0_query_error_address(struct
> amdgpu_device *adev,
>   uint32_t node_inst, uint32_t
> umc_inst,
>   uint32_t ch_inst, void *data)
>  {
> + struct ras_err_data *err_data = (struct ras_err_data *)data;
> + struct ta_ras_query_address_input addr_in;
>   uint64_t mc_umc_status_addr;
>   uint64_t mc_umc_status, err_addr;
>   uint64_t mc_umc_addrt0;
> - struct ras_err_data *err_data = (struct ras_err_data *)data;
>   uint64_t umc_reg_offset =
>   get_umc_v12_0_reg_offset(adev, node_inst, umc_inst,
> ch_inst);
> - uint32_t socket_id = 0;
>
>   mc_umc_status_addr =
>   SOC15_REG_OFFSET(UMC, 0,
> regMCA_UMC_UMC0_MCUMC_STATUST0); @@ -362,10 +355,16 @@ static
> int umc_v12_0_query_error_address(struct amdgpu_device *adev,
>   if (!adev->aid_mask &&
>   adev->smuio.funcs &&
>   adev->smuio.funcs->get_socket_id)
> - socket_id = adev->smuio.funcs->get_socket_id(adev);
> + addr_in.ma.socket_id = adev->smuio.funcs-
> >get_socket_id(adev);
> + else
> + addr_in.ma.socket_id = 0;
> +
> + addr_in.ma.err_addr = err_addr;
> + 

RE: [PATCH Review V2 1/1] drm/amdgpu: Fix ecc irq enable/disable unpaired

2023-12-20 Thread Yang, Stanley
[AMD Official Use Only - General]

Yes, the driver should check the ras cap before calling amdgpu_irq_put on
gmc.ecc_irq, thanks.
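
As a rough standalone sketch of the pairing rule (not driver code; struct
fake_dev, irq_get(), irq_put(), late_init() and hw_fini() are hypothetical
names used only for illustration), the put side is guarded by the same
condition as the get side, so the reference count stays balanced whether or
not ras is supported:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical interrupt source with a simple reference count. */
struct fake_irq {
	int refcount;
};

struct fake_dev {
	bool ras_supported;      /* stands in for the ras cap check */
	struct fake_irq ecc_irq;
};

static void irq_get(struct fake_irq *irq)
{
	assert(irq != NULL);
	irq->refcount++;
}

/* Returns -1 on an unbalanced put, 0 otherwise. */
static int irq_put(struct fake_irq *irq)
{
	assert(irq != NULL);
	if (irq->refcount == 0)
		return -1;
	irq->refcount--;
	return 0;
}

/* Enable path: take the interrupt only when ras is supported. */
static void late_init(struct fake_dev *d)
{
	if (d->ras_supported)
		irq_get(&d->ecc_irq);
}

/* Disable path: release under the same condition used in late_init(). */
static int hw_fini(struct fake_dev *d)
{
	if (d->ras_supported)
		return irq_put(&d->ecc_irq);
	return 0;
}
```

If hw_fini() released the interrupt unconditionally instead, a device without
ras support would hit the unbalanced-put branch in this model.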

Regards,
Stanley
> -Original Message-
> From: Zhang, Hawking 
> Sent: Wednesday, December 20, 2023 4:12 PM
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH Review V2 1/1] drm/amdgpu: Fix ecc irq enable/disable
> unpaired
>
> [AMD Official Use Only - General]
>
> +   if (adev->gmc.ecc_irq.funcs)
> +   amdgpu_irq_put(adev, &adev->gmc.ecc_irq, 0);
> +
>
> This doesn't match the amdgpu_irq_get call for gmc.ecc_irq, where the driver
> checks the ras cap to decide whether to enable the interrupt (see
> amdgpu_umc_ras_late_init). We should do the same check for the amdgpu_irq_put call.
>
> Regards,
> Hawking
>
> -Original Message-
> From: Yang, Stanley 
> Sent: Tuesday, December 19, 2023 20:48
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> 
> Cc: Yang, Stanley 
> Subject: [PATCH Review V2 1/1] drm/amdgpu: Fix ecc irq enable/disable
> unpaired
>
> The ecc_irq is disabled during the GPU mode2 reset suspend process, but is
> not re-enabled during the GPU mode2 reset resume process.
>
> Changed from V1:
> only do sdma/gfx ras_late_init in aldebaran_mode2_restore_ip,
> delete amdgpu_ras_late_resume function.
>
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/aldebaran.c | 28
> +-  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> |  3 +++  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c |  4 
> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  |  3 +++
>  4 files changed, 37 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> index 02f4c6f9d4f6..b60a3c1bd0f2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> +++ b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> @@ -330,6 +330,7 @@ aldebaran_mode2_restore_hwcontext(struct
> amdgpu_reset_control *reset_ctl,  {
> struct list_head *reset_device_list = 
> reset_context->reset_device_list;
> struct amdgpu_device *tmp_adev = NULL;
> +   struct amdgpu_ras *con;
> int r;
>
> if (reset_device_list == NULL)
> @@ -355,7 +356,32 @@ aldebaran_mode2_restore_hwcontext(struct
> amdgpu_reset_control *reset_ctl,
>  */
> amdgpu_register_gpu_instance(tmp_adev);
>
> -   /* Resume RAS */
> +   /* Resume RAS, ecc_irq */
> +   con = amdgpu_ras_get_context(tmp_adev);
> +   if (!amdgpu_sriov_vf(tmp_adev) && con) {
> +   if (tmp_adev->sdma.ras &&
> +   amdgpu_ras_is_supported(tmp_adev,
> AMDGPU_RAS_BLOCK__SDMA) &&
> +   tmp_adev->sdma.ras->ras_block.ras_late_init) {
> +   r = tmp_adev->sdma.ras-
> >ras_block.ras_late_init(tmp_adev,
> +   
> &tmp_adev->sdma.ras->ras_block.ras_comm);
> +   if (r) {
> +   dev_err(tmp_adev->dev, "SDMA failed 
> to execute
> ras_late_init! ret:%d\n", r);
> +   goto end;
> +   }
> +   }
> +
> +   if (tmp_adev->gfx.ras &&
> +   amdgpu_ras_is_supported(tmp_adev,
> AMDGPU_RAS_BLOCK__GFX) &&
> +   tmp_adev->gfx.ras->ras_block.ras_late_init) {
> +   r = 
> tmp_adev->gfx.ras->ras_block.ras_late_init(tmp_adev,
> +   
> &tmp_adev->gfx.ras->ras_block.ras_comm);
> +   if (r) {
> +   dev_err(tmp_adev->dev, "GFX failed to 
> execute
> ras_late_init! ret:%d\n", r);
> +   goto end;
> +   }
> +   }
> +   }
> +
> amdgpu_ras_resume(tmp_adev);
>
> /* Update PSP FW topology after reset */ diff --git
> a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> index 09cbca596bb5..b93a0baeb2d3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> @@ -1043,6 +1043,9 @@ static int gmc_v10_0_hw_fini(void *handle)
>
> amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);
>
> +   if (adev->gmc.ecc_irq.funcs)
> +   amdgpu_irq_put(adev, &adev->gmc.ecc_irq, 0);
> +
> ret

RE: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable unpaired

2023-12-18 Thread Yang, Stanley
[AMD Official Use Only - General]

For mode2 reset, only the SDMA/GFX suspend is called, which disables the
SDMA/GFX ecc_irq, so the driver only needs to enable the SDMA/GFX ecc_irq
during the resume process.
Consider the following scenario on aqua vanjaram: the user modprobes amdgpu
with reset_method=3, and the driver does GPU recovery when an SDMA
uncorrectable error is triggered.
It is difficult to tell whether the gmc ecc_irq, nbio ras_controller_irq and
nbio ras_err_event_athub_irq need to be resumed, since the driver does a full
gpu reset.

Regards,
Stanley
> -Original Message-
> From: Zhang, Hawking 
> Sent: Monday, December 18, 2023 3:03 PM
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org
> Cc: Yang, Stanley 
> Subject: RE: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable
> unpaired
>
> [AMD Official Use Only - General]
>
> Can we put the irq resume in amdgpu_ras_resume?
>
> Regards,
> Hawking
>
> -Original Message-
> From: amd-gfx  On Behalf Of
> Stanley.Yang
> Sent: Saturday, December 16, 2023 00:50
> To: amd-gfx@lists.freedesktop.org
> Cc: Yang, Stanley 
> Subject: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable unpaired
>
> The ecc_irq is disabled during the GPU mode2 reset suspend process, but is
> not re-enabled during the GPU mode2 reset resume process.
>
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/aldebaran.c  |  6 
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 37
> +
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  1 +
> drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c  |  3 ++
> drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  |  4 +++
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c   |  3 ++
>  6 files changed, 54 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> index 02f4c6f9d4f6..ba9238a93064 100644
> --- a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> +++ b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> @@ -358,6 +358,12 @@ aldebaran_mode2_restore_hwcontext(struct
> amdgpu_reset_control *reset_ctl,
> /* Resume RAS */
> amdgpu_ras_resume(tmp_adev);
>
> +   r = amdgpu_ras_late_resume(tmp_adev);
> +   if (r) {
> +   dev_err(tmp_adev->dev, "amdgpu_ras_late_resume
> failed %d\n", r);
> +   goto end;
> +   }
> +
> /* Update PSP FW topology after reset */
> if (reset_context->hive &&
> tmp_adev->gmc.xgmi.num_physical_nodes > 1) diff --git
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 8a04fb6c7c1f..318e77c493f2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -3164,6 +3164,43 @@ int amdgpu_ras_late_init(struct amdgpu_device
> *adev)
> return 0;
>  }
>
> +/* Handle mode 2 reset, resume ecc irq state */ int
> +amdgpu_ras_late_resume(struct amdgpu_device *adev) {
> +   struct amdgpu_ras_block_list *node, *tmp;
> +   struct amdgpu_ras_block_object *obj;
> +   int r;
> +
> +   /* Guest side doesn't need init ras feature */
> +   if (amdgpu_sriov_vf(adev))
> +   return 0;
> +
> +   list_for_each_entry_safe(node, tmp, >ras_list, node) {
> +   if (!node->ras_obj) {
> +   dev_warn(adev->dev, "Warning: abnormal ras list 
> node.\n");
> +   continue;
> +   }
> +
> +   obj = node->ras_obj;
> +
> +   if (!(obj->ras_comm.block == AMDGPU_RAS_BLOCK__SDMA ||
> + obj->ras_comm.block == AMDGPU_RAS_BLOCK__GFX))
> +   continue;
> +
> +   if (obj->ras_late_init) {
> +   r = obj->ras_late_init(adev, >ras_comm);
> +   if (r) {
> +   dev_err(adev->dev, "%s failed to execute 
> ras_late_init!
> ret:%d\n",
> +   obj->ras_comm.name, r);
> +   return r;
> +   }
> +   } else
> +   amdgpu_ras_block_late_init_default(adev, 
> >ras_comm);
> +   }
> +
> +   return 0;
> +}
> +
>  /* do some fini work before IP fini as dependence */  int
> amdgpu_ras_pre_fini(struct amdgpu_device *adev)  { diff --git
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 6a941eb8fb8f..5c1ffc5a6899 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @

RE: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable unpaired

2023-12-18 Thread Yang, Stanley
[AMD Official Use Only - General]

Yes, we can call only the gfx/sdma ras late init in aldebaran_mode2_restore_ip. 
Will update.

Regards,
Stanley
> -Original Message-
> From: Zhang, Hawking 
> Sent: Monday, December 18, 2023 8:37 PM
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable
> unpaired
>
> [AMD Official Use Only - General]
>
> In such case, can we call amdgpu_gfx_ras_late_init and
> amdgpu_sdma_ras_late_init in aldebaran_mode2_restore_ip?
>
> Regards,
> Hawking
>
> -Original Message-
> From: Yang, Stanley 
> Sent: Monday, December 18, 2023 17:30
> To: Zhang, Hawking ; amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable
> unpaired
>
> [AMD Official Use Only - General]
>
> For mode2 reset, the driver only calls SDMA/GFX suspend, which disables the
> SDMA/GFX ecc_irq, so it only needs to re-enable the SDMA/GFX ecc_irq during
> the resume process.
> Consider the following scenario on aqua vanjaram: the user loads amdgpu with
> reset_method=3, and the driver does GPU recovery if an SDMA uncorrectable
> error is triggered. Since the driver does a full gpu reset, it is difficult
> to tell whether the gmc ecc_irq, nbio ras_controller_irq, and nbio
> ras_err_event_athub_irq need to be resumed.
>
> Regards,
> Stanley
> > -Original Message-
> > From: Zhang, Hawking 
> > Sent: Monday, December 18, 2023 3:03 PM
> > To: Yang, Stanley ;
> > amd-gfx@lists.freedesktop.org
> > Cc: Yang, Stanley 
> > Subject: RE: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable
> > unpaired
> >
> > [AMD Official Use Only - General]
> >
> > Can we put the irq resume in amdgpu_ras_resume?
> >
> > Regards,
> > Hawking
> >
> > -Original Message-
> > From: amd-gfx  On Behalf Of
> > Stanley.Yang
> > Sent: Saturday, December 16, 2023 00:50
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Yang, Stanley 
> > Subject: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable
> > unpaired
> >
> > The ecc_irq is disabled during the GPU mode2 reset suspend process, but
> > is not re-enabled during the GPU mode2 reset resume process.
> >
> > Signed-off-by: Stanley.Yang 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/aldebaran.c  |  6 
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 37
> > +
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  1 +
> > drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c  |  3 ++
> > drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  |  4 +++
> >  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c   |  3 ++
> >  6 files changed, 54 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> > b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> > index 02f4c6f9d4f6..ba9238a93064 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> > @@ -358,6 +358,12 @@ aldebaran_mode2_restore_hwcontext(struct
> > amdgpu_reset_control *reset_ctl,
> > /* Resume RAS */
> > amdgpu_ras_resume(tmp_adev);
> >
> > +   r = amdgpu_ras_late_resume(tmp_adev);
> > +   if (r) {
> > +   dev_err(tmp_adev->dev, "amdgpu_ras_late_resume
> > failed %d\n", r);
> > +   goto end;
> > +   }
> > +
> > /* Update PSP FW topology after reset */
> > if (reset_context->hive &&
> > tmp_adev->gmc.xgmi.num_physical_nodes > 1) diff
> > --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index 8a04fb6c7c1f..318e77c493f2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -3164,6 +3164,43 @@ int amdgpu_ras_late_init(struct amdgpu_device
> > *adev)
> > return 0;
> >  }
> >
> > +/* Handle mode 2 reset, resume ecc irq state */ int
> > +amdgpu_ras_late_resume(struct amdgpu_device *adev) {
> > +   struct amdgpu_ras_block_list *node, *tmp;
> > +   struct amdgpu_ras_block_object *obj;
> > +   int r;
> > +
> > +   /* Guest side doesn't need init ras feature */
> > +   if (amdgpu_sriov_vf(adev))
> > +   return 0;
> > +
> > +   list_for_each_entry_safe(node, tmp, >ras_list, node) {
> > +   if (!node->ras_obj) {
> > +   dev_warn(adev->dev, "Warning: abnormal ras list 
> >

RE: [PATCH] drm/amdgpu: Switch to aca bank for xgmi pcs err cnt

2023-12-12 Thread Yang, Stanley
[AMD Official Use Only - General]

Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: Zhang, Hawking 
> Sent: Tuesday, December 12, 2023 10:03 PM
> To: amd-gfx@lists.freedesktop.org; Yang, Stanley ;
> Wang, Yang(Kevin) 
> Cc: Zhang, Hawking 
> Subject: [PATCH] drm/amdgpu: Switch to aca bank for xgmi pcs err cnt
>
> Instead of software managed counters.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h  | 2 ++
>  drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 6 --
>  2 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h
> index e51e8918e667..b399f1b62887 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h
> @@ -46,6 +46,8 @@
>  #define MCA_REG__STATUS__ERRORCODEEXT(x) MCA_REG_FIELD(x,
> 21, 16)
>  #define MCA_REG__STATUS__ERRORCODE(x)MCA_REG_FIELD(x,
> 15, 0)
>
> +#define MCA_REG__MISC0__ERRCNT(x)MCA_REG_FIELD(x,
> 43, 32)
> +
>  #define MCA_REG__SYND__ERRORINFORMATION(x)   MCA_REG_FIELD(x,
> 17, 0)
>
>  enum amdgpu_mca_ip {
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> index ddd782fbee7a..3998c9b31d07 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> @@ -2537,13 +2537,15 @@ static int
> mca_pcs_xgmi_mca_get_err_count(const struct mca_ras_info *mca_ras, st
> uint32_t *count)
>  {
>   u32 ext_error_code;
> + u32 err_cnt;
>
>   ext_error_code = MCA_REG__STATUS__ERRORCODEEXT(entry-
> >regs[MCA_REG_IDX_STATUS]);
> + err_cnt = MCA_REG__MISC0__ERRCNT(entry-
> >regs[MCA_REG_IDX_MISC0]);
>
>   if (type == AMDGPU_MCA_ERROR_TYPE_UE && ext_error_code == 0)
> - *count = 1;
> + *count = err_cnt;
>   else if (type == AMDGPU_MCA_ERROR_TYPE_CE && ext_error_code
> == 6)
> - *count = 1;
> + *count = err_cnt;
>
>   return 0;
>  }
> --
> 2.17.1
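
The patch above reads the error count from bits 43:32 of the MISC0 register 
instead of hard-coding 1. A minimal sketch of that kind of bit-field 
extraction follows. The reg_field() and misc0_err_cnt() helpers are 
hypothetical; the actual MCA_REG_FIELD definition is not shown in this thread.

```c
#include <stdint.h>

/*
 * Hypothetical helper: extract bits [hi:lo] (inclusive) from a 64-bit
 * register value, mirroring how MCA_REG_FIELD(x, 43, 32) is used above.
 */
static uint64_t reg_field(uint64_t val, unsigned int hi, unsigned int lo)
{
    unsigned int width = hi - lo + 1;
    /* Avoid an undefined 64-bit shift when the field spans the whole word. */
    uint64_t mask = (width >= 64) ? ~0ULL : ((1ULL << width) - 1);

    return (val >> lo) & mask;
}

/* Error count as the patch reads it: bits 43:32 of MISC0. */
static uint32_t misc0_err_cnt(uint64_t misc0)
{
    return (uint32_t)reg_field(misc0, 43, 32);
}
```

Bits outside 43:32 do not affect the result, so a 12-bit field can report up 
to 0xfff errors per bank in this model.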



RE: [PATCH 3/3] drm/amdgpu: Update fw version for boot time error query

2023-11-19 Thread Yang, Stanley
[AMD Official Use Only - General]

The series is Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: Zhang, Hawking 
> Sent: Monday, November 20, 2023 10:49 AM
> To: amd-gfx@lists.freedesktop.org; Yang, Stanley ;
> Li, Candice ; Wang, Yang(Kevin)
> 
> Cc: Zhang, Hawking 
> Subject: [PATCH 3/3] drm/amdgpu: Update fw version for boot time error
> query
>
> Boot time error query is not available till a10109
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/psp_v13_0.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
> b/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
> index 3cf4684d0d3f..5f46877f78cf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
> @@ -821,7 +821,7 @@ static int psp_v13_0_query_boot_status(struct
> psp_context *psp)
>   if (amdgpu_ip_version(adev, MP0_HWIP, 0) != IP_VERSION(13, 0, 6))
>   return 0;
>
> - if (RREG32_SOC15(MP0, 0, regMP0_SMN_C2PMSG_59) <
> 0x00a10007)
> + if (RREG32_SOC15(MP0, 0, regMP0_SMN_C2PMSG_59) <
> 0x00a10109)
>   return 0;
>
>   for_each_inst(i, inst_mask) {
> --
> 2.17.1



RE: [PATCH] drm/amdgpu: Don't warn for unsupported set_xgmi_plpd_mode

2023-11-01 Thread Yang, Stanley
[AMD Official Use Only - General]

Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: amd-gfx  On Behalf Of Tao
> Zhou
> Sent: Tuesday, October 31, 2023 3:08 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Lazar, Lijo ; Zhou1, Tao 
> Subject: [PATCH] drm/amdgpu: Don't warn for unsupported
> set_xgmi_plpd_mode
>
> set_xgmi_plpd_mode may be unsupported, which isn't an error, so there is no
> need to print a warning for it.
>
> Suggested-by: lijo.la...@amd.com
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> index 0533f873001b..c9b09bddbcdc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> @@ -1138,7 +1138,8 @@ static int amdgpu_ras_error_inject_xgmi(struct
> amdgpu_device *adev,
>   if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
>   dev_warn(adev->dev, "Failed to disallow df cstate");
>
> - if (amdgpu_dpm_set_xgmi_plpd_mode(adev,
> XGMI_PLPD_DISALLOW))
> + ret = amdgpu_dpm_set_xgmi_plpd_mode(adev,
> XGMI_PLPD_DISALLOW);
> + if (ret && ret != -EOPNOTSUPP)
>   dev_warn(adev->dev, "Failed to disallow XGMI power down");
>
>   ret = psp_ras_trigger_error(>psp, block_info, instance_mask);
> @@ -1146,7 +1147,8 @@ static int amdgpu_ras_error_inject_xgmi(struct
> amdgpu_device *adev,
>   if (amdgpu_ras_intr_triggered())
>   return ret;
>
> - if (amdgpu_dpm_set_xgmi_plpd_mode(adev, XGMI_PLPD_DEFAULT))
> + ret = amdgpu_dpm_set_xgmi_plpd_mode(adev,
> XGMI_PLPD_DEFAULT);
> + if (ret && ret != -EOPNOTSUPP)
>   dev_warn(adev->dev, "Failed to allow XGMI power down");
>
>   if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_ALLOW))
> --
> 2.35.1
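
The check the patch introduces can be reduced to a small predicate: warn on 
any failure except "operation not supported". The sketch below is only an 
illustration; should_warn() is a hypothetical name, not a driver function.

```c
#include <errno.h>

/*
 * Hypothetical predicate: a non-zero return code deserves a warning unless
 * it only says the operation is not supported on this hardware.
 */
static int should_warn(int ret)
{
    return ret && ret != -EOPNOTSUPP;
}
```

A caller would then write something like "if (should_warn(ret)) dev_warn(...)" 
so that unsupported platforms stay quiet while real failures are still logged.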



RE: [PATCH] drm/amdgpu: handle extra UE register entries for gfx v9_4_3

2023-10-31 Thread Yang, Stanley
[AMD Official Use Only - General]

Would it be better to handle the CE and UE lists separately?
Anyway, Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: amd-gfx  On Behalf Of Tao
> Zhou
> Sent: Tuesday, October 31, 2023 3:09 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chai, Thomas ; Zhou1, Tao
> 
> Subject: [PATCH] drm/amdgpu: handle extra UE register entries for gfx v9_4_3
>
> The UE register list is larger than the CE list.
>
> Reported-by: yipeng.c...@amd.com
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 38
> +
>  1 file changed, 38 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index 41bbabd9ad4d..046ae95b366a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -3799,6 +3799,27 @@ static void
> gfx_v9_4_3_inst_query_ras_err_count(struct amdgpu_device *adev,
>   }
>   }
>
> + /* handle extra register entries of UE */
> + for (; i < ARRAY_SIZE(gfx_v9_4_3_ue_reg_list); i++) {
> + for (j = 0; j < gfx_v9_4_3_ue_reg_list[i].se_num; j++) {
> + for (k = 0; k <
> gfx_v9_4_3_ue_reg_list[i].reg_entry.reg_inst; k++) {
> + /* no need to select if instance number is 1 */
> + if (gfx_v9_4_3_ue_reg_list[i].se_num > 1 ||
> +
>   gfx_v9_4_3_ue_reg_list[i].reg_entry.reg_inst > 1)
> + gfx_v9_4_3_xcc_select_se_sh(adev, j,
> 0, k, xcc_id);
> +
> +
>   amdgpu_ras_inst_query_ras_error_count(adev,
> +
>   &(gfx_v9_4_3_ue_reg_list[i].reg_entry),
> + 1,
> +
>   gfx_v9_4_3_ras_mem_list_array[gfx_v9_4_3_ue_reg_list[i].mem_id_t
> ype].mem_id_ent,
> +
>   gfx_v9_4_3_ras_mem_list_array[gfx_v9_4_3_ue_reg_list[i].mem_id_t
> ype].size,
> + GET_INST(GC, xcc_id),
> +
>   AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE,
> + _count);
> + }
> + }
> + }
> +
>   gfx_v9_4_3_xcc_select_se_sh(adev, 0x, 0x, 0x,
>   xcc_id);
>   mutex_unlock(>grbm_idx_mutex);
> @@ -3838,6 +3859,23 @@ static void
> gfx_v9_4_3_inst_reset_ras_err_count(struct amdgpu_device *adev,
>   }
>   }
>
> + /* handle extra register entries of UE */
> + for (; i < ARRAY_SIZE(gfx_v9_4_3_ue_reg_list); i++) {
> + for (j = 0; j < gfx_v9_4_3_ue_reg_list[i].se_num; j++) {
> + for (k = 0; k <
> gfx_v9_4_3_ue_reg_list[i].reg_entry.reg_inst; k++) {
> + /* no need to select if instance number is 1 */
> + if (gfx_v9_4_3_ue_reg_list[i].se_num > 1 ||
> +
>   gfx_v9_4_3_ue_reg_list[i].reg_entry.reg_inst > 1)
> + gfx_v9_4_3_xcc_select_se_sh(adev, j,
> 0, k, xcc_id);
> +
> +
>   amdgpu_ras_inst_reset_ras_error_count(adev,
> +
>   &(gfx_v9_4_3_ue_reg_list[i].reg_entry),
> + 1,
> + GET_INST(GC, xcc_id));
> + }
> + }
> + }
> +
>   gfx_v9_4_3_xcc_select_se_sh(adev, 0x, 0x, 0x,
>   xcc_id);
>   mutex_unlock(>grbm_idx_mutex);
> --
> 2.35.1
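
To make the review question concrete, the sketch below contrasts the two 
shapes: a shared loop over the common prefix followed by a tail loop for the 
extra UE entries (the shape the patch uses), and two independent loops. The 
arrays and their values are made up for illustration and have nothing to do 
with the real register lists.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static const int ce_regs[] = { 1, 2, 3 };             /* made-up CE counts */
static const int ue_regs[] = { 10, 20, 30, 40, 50 };  /* made-up UE counts */

/*
 * Shared loop, then a tail loop for the extra UE entries.
 * This assumes the UE list is at least as long as the CE list.
 */
static void sum_shared_then_tail(int *ce_sum, int *ue_sum)
{
    size_t i;

    *ce_sum = 0;
    *ue_sum = 0;
    for (i = 0; i < ARRAY_SIZE(ce_regs); i++) {
        *ce_sum += ce_regs[i];
        *ue_sum += ue_regs[i];
    }
    /* handle the extra UE entries */
    for (; i < ARRAY_SIZE(ue_regs); i++)
        *ue_sum += ue_regs[i];
}

/* Alternative: walk each list on its own, with no length assumption. */
static void sum_separately(int *ce_sum, int *ue_sum)
{
    size_t i;

    *ce_sum = 0;
    *ue_sum = 0;
    for (i = 0; i < ARRAY_SIZE(ce_regs); i++)
        *ce_sum += ce_regs[i];
    for (i = 0; i < ARRAY_SIZE(ue_regs); i++)
        *ue_sum += ue_regs[i];
}
```

Both produce the same totals here. The separate loops avoid depending on 
which list is longer, at the cost of walking the common prefix twice.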



RE: [PATCH] drm/amdgpu: check recovery status of xgmi hive in ras_reset_error_count

2023-10-31 Thread Yang, Stanley
[AMD Official Use Only - General]

Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: amd-gfx  On Behalf Of Tao
> Zhou
> Sent: Tuesday, October 31, 2023 3:13 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao ; Zhang, Hawking
> 
> Subject: [PATCH] drm/amdgpu: check recovery status of xgmi hive in
> ras_reset_error_count
>
> Handle xgmi hive case.
>
> Suggested-by: Hawking Zhang 
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 753260745554..0093c28f4343 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1226,6 +1226,8 @@ int amdgpu_ras_reset_error_count(struct
> amdgpu_device *adev,
>   struct amdgpu_ras_block_object *block_obj =
> amdgpu_ras_get_ras_block(adev, block, 0);
>   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
>   const struct amdgpu_mca_smu_funcs *mca_funcs = adev-
> >mca.mca_funcs;
> + struct amdgpu_hive_info *hive;
> + int hive_ras_recovery = 0;
>
>   if (!block_obj || !block_obj->hw_ops) {
>   dev_dbg_once(adev->dev, "%s doesn't config RAS
> function\n", @@ -1237,8 +1239,15 @@ int
> amdgpu_ras_reset_error_count(struct amdgpu_device *adev,
>   !amdgpu_ras_get_mca_debug_mode(adev))
>   return -EOPNOTSUPP;
>
> + hive = amdgpu_get_xgmi_hive(adev);
> + if (hive) {
> + hive_ras_recovery = atomic_read(&hive->ras_recovery);
> + amdgpu_put_xgmi_hive(hive);
> + }
> +
>   /* skip ras error reset in gpu reset */
> - if ((amdgpu_in_reset(adev) || atomic_read(&ras->in_recovery)) &&
> + if ((amdgpu_in_reset(adev) || atomic_read(&ras->in_recovery) ||
> + hive_ras_recovery) &&
>   mca_funcs && mca_funcs->mca_set_debug_mode)
>   return -EOPNOTSUPP;
>
> --
> 2.35.1



RE: [PATCH] drm/amdgpu: use mode-2 reset for RAS poison consumption

2023-10-27 Thread Yang, Stanley
[AMD Official Use Only - General]

Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: amd-gfx  On Behalf Of Tao
> Zhou
> Sent: Friday, October 27, 2023 12:04 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao 
> Subject: [PATCH] drm/amdgpu: use mode-2 reset for RAS poison
> consumption
>
> Switch from mode-1 reset to mode-2 for poison consumption.
>
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index f74347cc087a..d65e21914d8c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -166,8 +166,12 @@ static int amdgpu_umc_do_page_retirement(struct
> amdgpu_device *adev,
>   }
>   }
>
> - if (reset)
> + if (reset) {
> + /* use mode-2 reset for poison consumption */
> + if (!entry)
> + con->gpu_reset_flags |=
> AMDGPU_RAS_GPU_RESET_MODE2_RESET;
>   amdgpu_ras_reset_gpu(adev);
> + }
>   }
>
>   kfree(err_data->err_addr);
> --
> 2.35.1



RE: [PATCH] drm/amdgpu: enable RAS poison mode for APU

2023-10-20 Thread Yang, Stanley
[AMD Official Use Only - General]

Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: amd-gfx  On Behalf Of Tao
> Zhou
> Sent: Friday, October 20, 2023 6:26 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao 
> Subject: [PATCH] drm/amdgpu: enable RAS poison mode for APU
>
> Enable it by default on APU platform.
>
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 95c181cd1fea..a41cab0a2f9c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2710,7 +2710,8 @@ static void
> amdgpu_ras_query_poison_mode(struct amdgpu_device *adev)
>   return;
>
>   /* Init poison supported flag, the default value is false */
> - if (adev->gmc.xgmi.connected_to_cpu) {
> + if (adev->gmc.xgmi.connected_to_cpu ||
> + adev->gmc.is_app_apu) {
>   /* enabled by default when GPU is connected to CPU */
>   con->poison_supported = true;
>   } else if (adev->df.funcs &&
> --
> 2.35.1



RE: [PATCH 6/6] drm/amdgpu: drop status reset for GCEA 9.4.3 and MMEA 1.8

2023-10-18 Thread Yang, Stanley
[AMD Official Use Only - General]

PMFW doesn't reset any CE/UE status or count in debug mode. Who is responsible 
for resetting them in debug mode?

Regards,
Stanley
> -Original Message-
> From: Zhou1, Tao 
> Sent: Tuesday, October 17, 2023 8:46 PM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Yang, Stanley ; Li,
> Candice ; Chai, Thomas ;
> Lazar, Lijo ; Wang, Yang(Kevin)
> 
> Cc: Zhou1, Tao 
> Subject: [PATCH 6/6] drm/amdgpu: drop status reset for GCEA 9.4.3 and
> MMEA 1.8
>
> PMFW will be responsible for it.
>
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 22 ---
> drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c | 86 -
>  2 files changed, 108 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index a1c2c952d882..65da72735e52 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -3996,27 +3996,6 @@ static void
> gfx_v9_4_3_inst_reset_utc_err_status(struct amdgpu_device *adev,
>   WREG32_SOC15(GC, GET_INST(GC, xcc_id),
> regVML2_WALKER_MEM_ECC_STATUS, 0x3);  }
>
> -static void gfx_v9_4_3_inst_reset_ea_err_status(struct amdgpu_device
> *adev,
> - int xcc_id)
> -{
> - uint32_t i, j;
> - uint32_t value;
> -
> - mutex_lock(>grbm_idx_mutex);
> - for (i = 0; i < gfx_v9_4_3_ea_err_status_regs.se_num; i++) {
> - for (j = 0; j < gfx_v9_4_3_ea_err_status_regs.instance; j++) {
> - gfx_v9_4_3_xcc_select_se_sh(adev, i, 0, j, xcc_id);
> - value = RREG32_SOC15(GC, GET_INST(GC, xcc_id),
> regGCEA_ERR_STATUS);
> - value = REG_SET_FIELD(value, GCEA_ERR_STATUS,
> - CLEAR_ERROR_STATUS, 0x1);
> - WREG32_SOC15(GC, GET_INST(GC, xcc_id),
> regGCEA_ERR_STATUS, value);
> - }
> - }
> - gfx_v9_4_3_xcc_select_se_sh(adev, 0x, 0x, 0x,
> - xcc_id);
> - mutex_unlock(>grbm_idx_mutex);
> -}
> -
>  static void gfx_v9_4_3_inst_reset_sq_timeout_status(struct amdgpu_device
> *adev,
>   int xcc_id)
>  {
> @@ -4042,7 +4021,6 @@ static void
> gfx_v9_4_3_inst_reset_ras_err_status(struct amdgpu_device *adev,
>   void *ras_error_status, int xcc_id)  {
>   gfx_v9_4_3_inst_reset_utc_err_status(adev, xcc_id);
> - gfx_v9_4_3_inst_reset_ea_err_status(adev, xcc_id);
>   gfx_v9_4_3_inst_reset_sq_timeout_status(adev, xcc_id);  }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
> b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
> index aa00483e7b37..616d75add087 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
> @@ -756,96 +756,10 @@ static void
> mmhub_v1_8_query_ras_error_status(struct amdgpu_device *adev)
>   mmhub_v1_8_inst_query_ras_err_status(adev, i);  }
>
> -static void mmhub_v1_8_inst_reset_ras_err_status(struct amdgpu_device
> *adev,
> -  uint32_t mmhub_inst)
> -{
> - uint32_t mmea_cgtt_clk_cntl_addr_dist;
> - uint32_t mmea_err_status_addr_dist;
> - uint32_t reg_value;
> - uint32_t i;
> -
> - /* reset mmea ras err status */
> - mmea_cgtt_clk_cntl_addr_dist = regMMEA1_CGTT_CLK_CTRL -
> regMMEA0_CGTT_CLK_CTRL;
> - mmea_err_status_addr_dist = regMMEA1_ERR_STATUS -
> regMMEA0_ERR_STATUS;
> - for (i = 0; i < ARRAY_SIZE(mmhub_v1_8_mmea_err_status_reg); i++) {
> - /* force clk branch on for response path
> -  * set MMEA0_CGTT_CLK_CTRL.SOFT_OVERRIDE_RETURN = 1
> -  */
> - reg_value = RREG32_SOC15_OFFSET(MMHUB, mmhub_inst,
> - regMMEA0_CGTT_CLK_CTRL,
> - i *
> mmea_cgtt_clk_cntl_addr_dist);
> - reg_value = REG_SET_FIELD(reg_value,
> MMEA0_CGTT_CLK_CTRL,
> -   SOFT_OVERRIDE_RETURN, 1);
> - WREG32_SOC15_OFFSET(MMHUB, mmhub_inst,
> - regMMEA0_CGTT_CLK_CTRL,
> - i * mmea_cgtt_clk_cntl_addr_dist,
> - reg_value);
> -
> - /* set MMEA0_ERR_STATUS.CLEAR_ERROR_STATUS = 1 */
> - reg_value = RREG32_SOC15_OFFSET(MMHUB, mmhub_inst,
> - regMMEA0_ERR_STATUS,
> - i *
> 

RE: [PATCH 4/6] drm/amdgpu: bypass RAS error reset in some conditions

2023-10-18 Thread Yang, Stanley
[AMD Official Use Only - General]

> -Original Message-
> From: Zhou1, Tao 
> Sent: Tuesday, October 17, 2023 8:46 PM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Yang, Stanley ; Li,
> Candice ; Chai, Thomas ;
> Lazar, Lijo ; Wang, Yang(Kevin)
> 
> Cc: Zhou1, Tao 
> Subject: [PATCH 4/6] drm/amdgpu: bypass RAS error reset in some conditions
>
> PMFW is responsible for RAS error reset in some conditions, so the driver
> can skip the operation.
>
> v2: add check for ras->in_recovery, it's set earlier than amdgpu_in_reset.
>
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 20 ++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 95c7ba889e2d..806c6d4deb63 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1178,11 +1178,19 @@ int amdgpu_ras_reset_error_count(struct
> amdgpu_device *adev,
>   enum amdgpu_ras_block block)
>  {
>   struct amdgpu_ras_block_object *block_obj =
> amdgpu_ras_get_ras_block(adev, block, 0);
> + struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> + const struct amdgpu_mca_smu_funcs *mca_funcs = adev-
> >mca.mca_funcs;
>
>   if (!block_obj || !block_obj->hw_ops)
>   return 0;
>
> - if (!amdgpu_ras_is_supported(adev, block))
> + /* skip ras error reset in gpu reset */
> + if (amdgpu_in_reset(adev) && atomic_read(&ras->in_recovery) &&
> + mca_funcs && mca_funcs->mca_set_debug_mode)

[Stanley]: The condition amdgpu_in_reset(adev) && atomic_read(&ras->in_recovery) 
should be changed to (amdgpu_in_reset(adev) || atomic_read(&ras->in_recovery)).
Also, can we check ras->is_mca_debug_mode directly, since patch#2 and patch#3 
set it?

Regards,
Stanley
> + return 0;
> +
> + if (!amdgpu_ras_is_supported(adev, block) ||
> + !amdgpu_ras_get_mca_debug_mode(adev))
>   return 0;
>
>   if (block_obj->hw_ops->reset_ras_error_count)
> @@ -1195,6 +1203,8 @@ int amdgpu_ras_reset_error_status(struct
> amdgpu_device *adev,
>   enum amdgpu_ras_block block)
>  {
>   struct amdgpu_ras_block_object *block_obj =
> amdgpu_ras_get_ras_block(adev, block, 0);
> + struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> + const struct amdgpu_mca_smu_funcs *mca_funcs = adev-
> >mca.mca_funcs;
>
>   if (!block_obj || !block_obj->hw_ops) {
>   dev_dbg_once(adev->dev, "%s doesn't config RAS
> function\n", @@ -1202,7 +1212,13 @@ int
> amdgpu_ras_reset_error_status(struct amdgpu_device *adev,
>   return 0;
>   }
>
> - if (!amdgpu_ras_is_supported(adev, block))
> + /* skip ras error reset in gpu reset */
> + if (amdgpu_in_reset(adev) && atomic_read(&ras->in_recovery) &&
> + mca_funcs && mca_funcs->mca_set_debug_mode)
[Stanley]: Same as above.

> + return 0;
> +
> + if (!amdgpu_ras_is_supported(adev, block) ||
> + !amdgpu_ras_get_mca_debug_mode(adev))
>   return 0;
>
>   if (block_obj->hw_ops->reset_ras_error_count)
> --
> 2.35.1
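
The inline comment in this message asks for || instead of && between the two 
flags. A small sketch of the difference follows, assuming (as the v2 note 
says) that in_recovery is set earlier than the in-reset state. The struct and 
helpers are hypothetical models, and the mca_funcs part of the condition is 
left out.

```c
/* Hypothetical snapshot of the two flags discussed in the review. */
struct reset_state {
    int in_reset;       /* models amdgpu_in_reset(adev) */
    int in_recovery;    /* models atomic_read() of ras->in_recovery */
};

/* v2 condition as posted: both flags must be set to skip the reset. */
static int skip_with_and(const struct reset_state *s)
{
    return s->in_reset && s->in_recovery;
}

/* Condition suggested in the review: either flag is enough to skip. */
static int skip_with_or(const struct reset_state *s)
{
    return s->in_reset || s->in_recovery;
}
```

In the early window where in_recovery is already set but the device is not 
yet marked as in reset, the && form does not skip while the || form does. The 
|| form also covers resets that were not triggered by RAS.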



RE: [PATCH Review 1/1] drm/amdgpu: Workaround to skip kiq ring test during ras gpu recovery

2023-10-18 Thread Yang, Stanley
[AMD Official Use Only - General]

Thanks, will update.

Regards,
Stanley
> -Original Message-
> From: Zhou1, Tao 
> Sent: Wednesday, October 18, 2023 11:00 AM
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org
> Cc: Yang, Stanley 
> Subject: RE: [PATCH Review 1/1] drm/amdgpu: Workaround to skip kiq ring
> test during ras gpu recovery
>
> [AMD Official Use Only - General]
>
> > -Original Message-
> > From: amd-gfx  On Behalf Of
> > Stanley.Yang
> > Sent: Tuesday, October 17, 2023 10:37 PM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Yang, Stanley 
> > Subject: [PATCH Review 1/1] drm/amdgpu: Workaround to skip kiq ring
> > test during ras gpu recovery
> >
> > This is a workaround: the kiq ring test fails in the suspend stage when
> > doing ras recovery for gfx v9_4_3.
> >
> > Change-Id: I8de9900aa76706f59bc029d4e9e8438c6e1db8e0
> > Signed-off-by: Stanley.Yang 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 21
> +
> >  1 file changed, 21 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> > index 9a158018ae16..902e60203809 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> > @@ -29,6 +29,7 @@
> >  #include "amdgpu_rlc.h"
> >  #include "amdgpu_ras.h"
> >  #include "amdgpu_xcp.h"
> > +#include "amdgpu_xgmi.h"
> >
> >  /* delay 0.1 second to enable gfx off feature */
> >  #define GFX_OFF_DELAY_ENABLE msecs_to_jiffies(100)
> > @@ -501,6 +502,9 @@ int amdgpu_gfx_disable_kcq(struct amdgpu_device
> > *adev, int xcc_id)  {
> >   struct amdgpu_kiq *kiq = >gfx.kiq[xcc_id];
> >   struct amdgpu_ring *kiq_ring = >ring;
> > + struct amdgpu_hive_info *hive;
> > + struct amdgpu_ras *ras;
> > + int hive_ras_recovery;
> >   int i, r = 0;
> >   int j;
> >
> > @@ -521,6 +525,23 @@ int amdgpu_gfx_disable_kcq(struct
> amdgpu_device
> > *adev, int xcc_id)
> >  RESET_QUEUES, 0, 0);
> >   }
> >
> > + /**
> > +  * This is workaround: only skip kiq_ring test
> > +  * during ras recovery in suspend stage for gfx v9_4_3
> > +  */
> > + hive = amdgpu_get_xgmi_hive(adev);
> > + if (hive) {
> [Tao] hive_ras_recovery should have a default value if !hive.
> With that fixed, the patch is:
>
> Reviewed-by: Tao Zhou 
>
> > + hive_ras_recovery = atomic_read(&hive->ras_recovery);
> > + amdgpu_put_xgmi_hive(hive);
> > + }
> > +
> > + ras = amdgpu_ras_get_context(adev);
> > + if ((amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3)) &&
> > + ras && (atomic_read(&ras->in_recovery) ||
> > + hive_ras_recovery))
> > {
> > + spin_unlock(>ring_lock);
> > + return 0;
> > + }
> > +
> >   if (kiq_ring->sched.ready && !adev->job_hang)
> >   r = amdgpu_ring_test_helper(kiq_ring);
> >   spin_unlock(>ring_lock);
> > --
> > 2.25.1
>
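
Tao's review point is that hive_ras_recovery is read even when no hive 
exists, so it needs a defined default. A minimal sketch of the fixed shape 
follows. struct fake_hive and hive_recovery_flag() are hypothetical stand-ins 
for the driver's hive object and the logic in amdgpu_gfx_disable_kcq().

```c
#include <stddef.h>

/* Hypothetical stand-in for the xgmi hive object. */
struct fake_hive {
    int ras_recovery;
};

/*
 * Return the hive recovery flag, defaulting to 0 when there is no hive.
 * Without the initializer, the !hive path would read an indeterminate value.
 */
static int hive_recovery_flag(const struct fake_hive *hive)
{
    int hive_ras_recovery = 0;  /* default when the device has no hive */

    if (hive)
        hive_ras_recovery = hive->ras_recovery;

    return hive_ras_recovery;
}
```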



RE: [PATCH 4/5] drm/amdgpu: bypass RAS error reset in some conditions

2023-10-16 Thread Yang, Stanley
[AMD Official Use Only - General]

The in_gpu_reset flag is set after the reset error count and reset error status 
functions are called, so we can't use amdgpu_in_reset(). Please check the 
ras->in_recovery flag instead.

Regards,
Stanley
From: Zhou1, Tao 
Sent: Friday, October 13, 2023 5:06 PM
To: Zhang, Hawking ; amd-gfx@lists.freedesktop.org; 
Yang, Stanley ; Li, Candice ; Chai, 
Thomas ; Wang, Yang(Kevin) 
Subject: Re: [PATCH 4/5] drm/amdgpu: bypass RAS error reset in some conditions


[AMD Official Use Only - General]

How about this condition:

if ((amdgpu_in_reset(adev) || amdgpu_ras_intr_triggered()) &&
   mca_funcs && mca_funcs->mca_set_debug_mode)

I use amdgpu_in_reset to skip touching it in all gpu resets, not only for the 
resets triggered by ras fatal error.

Regards,
Tao


From: Zhang, Hawking <hawking.zh...@amd.com>
Sent: Thursday, October 12, 2023 9:14 PM
To: Zhou1, Tao <tao.zh...@amd.com>; amd-gfx@lists.freedesktop.org; Yang, 
Stanley <stanley.y...@amd.com>; Li, Candice <candice...@amd.com>; Chai, Thomas 
<yipeng.c...@amd.com>; Wang, Yang(Kevin) <kevinyang.w...@amd.com>
Subject: RE: [PATCH 4/5] drm/amdgpu: bypass RAS error reset in some conditions

[AMD Official Use Only - General]

-   if (!amdgpu_ras_is_supported(adev, block))
+   /* skip ras error reset in gpu reset */
+   if (amdgpu_in_reset(adev) &&
+   mca_funcs && mca_funcs->mca_set_debug_mode)
+   return 0;

We should check the RAS in_recovery flag in this case. The reset domain is 
locked in a relatively late phase, at least *after* error counter harvest. 
Please double check.

Regards,
Hawking
-Original Message-
From: Zhou1, Tao <tao.zh...@amd.com>
Sent: Thursday, October 12, 2023 17:01
To: amd-gfx@lists.freedesktop.org; Yang, Stanley <stanley.y...@amd.com>; Zhang, 
Hawking <hawking.zh...@amd.com>; Li, Candice <candice...@amd.com>; Chai, Thomas 
<yipeng.c...@amd.com>; Wang, Yang(Kevin) <kevinyang.w...@amd.com>
Cc: Zhou1, Tao <tao.zh...@amd.com>
Subject: [PATCH 4/5] drm/amdgpu: bypass RAS error reset in some conditions

PMFW is responsible for RAS error reset in some conditions, driver can skip the 
operation.

Signed-off-by: Tao Zhou <tao.zh...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 91ed4fd96ee1..6dddb0423411 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1105,11 +1105,18 @@ int amdgpu_ras_reset_error_count(struct amdgpu_device 
*adev,
enum amdgpu_ras_block block)
 {
struct amdgpu_ras_block_object *block_obj = 
amdgpu_ras_get_ras_block(adev, block, 0);
+   const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs;

if (!block_obj || !block_obj->hw_ops)
return 0;

-   if (!amdgpu_ras_is_supported(adev, block))
+   /* skip ras error reset in gpu reset */
+   if (amdgpu_in_reset(adev) &&
+   mca_funcs && mca_funcs->mca_set_debug_mode)
+   return 0;
+
+   if (!amdgpu_ras_is_supported(adev, block) ||
+   !amdgpu_ras_get_mca_debug_mode(adev))
return 0;

if (block_obj->hw_ops->reset_ras_error_count)
@@ -1122,6 +1129,7 @@ int amdgpu_ras_reset_error_status(struct amdgpu_device 
*adev,
enum amdgpu_ras_block block)
 {
struct amdgpu_ras_block_object *block_obj = 
amdgpu_ras_get_ras_block(adev, block, 0);
+   const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs;

if (!block_obj || !block_obj->hw_ops) {
dev_dbg_once(adev->dev, "%s doesn't config RAS function\n",
@@ -1129,7 +1137,13 @@ int amdgpu_ras_reset_error_status(struct amdgpu_device 
*adev,
return 0;
}

-   if (!amdgpu_ras_is_supported(adev, block))
+   /* skip ras error reset in gpu reset */
+   if (amdgpu_in_reset(adev) &&
+   mca_funcs && mca_funcs->mca_set_debug_mode)
+   return 0;
+
+   if (!amdgpu_ras_is_supported(adev, block) ||
+   !amdgpu_ras_get_mca_debug_mode(adev))
return 0;

if (block_obj->hw_ops->reset_ras_error_count)
--
2.35.1


RE: [PATCH Review 1/1] drm/amdgpu: Fix false positive error log

2023-09-15 Thread Yang, Stanley
[AMD Official Use Only - General]

Please ignore this patch, will send V2.

> -Original Message-
> From: Stanley.Yang 
> Sent: Friday, September 15, 2023 6:57 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Yang, Stanley 
> Subject: [PATCH Review 1/1] drm/amdgpu: Fix false positive error log
>
> First check whether the block ras obj is set, and return directly if it is
> not set.
>
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 4a6df4e24243..ee62f5fa4456 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1105,10 +1105,13 @@ int amdgpu_ras_reset_error_status(struct
> amdgpu_device *adev,  {
>   struct amdgpu_ras_block_object *block_obj =
> amdgpu_ras_get_ras_block(adev, block, 0);
>
> + if (!block_obj)
> + return 0;
> +
>   if (!amdgpu_ras_is_supported(adev, block))
>   return -EINVAL;
>
> - if (!block_obj || !block_obj->hw_ops)   {
> + if (!block_obj->hw_ops)   {
>   dev_dbg_once(adev->dev, "%s doesn't config RAS
> function\n",
>ras_block_str(block));
>   return -EINVAL;
> --
> 2.25.1
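
For readers following the thread, the reordering proposed in the quoted patch can be sketched in isolation. This is only a minimal model under stated assumptions: `struct block_object`, `struct hw_ops` and `reset_error_status` are hypothetical stand-ins, not the driver's real types, and the `supported` flag stands in for the result of amdgpu_ras_is_supported().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures (assumption). */
struct hw_ops { int placeholder; };
struct block_object { const struct hw_ops *hw_ops; };

/*
 * Mirrors the check order in the quoted patch: a block with no RAS
 * object returns 0 quietly, before any check that reports an error.
 */
static int reset_error_status(const struct block_object *obj, bool supported)
{
    if (!obj)
        return 0;           /* nothing registered: not an error */

    if (!supported)
        return -EINVAL;     /* object exists but RAS is unsupported */

    if (!obj->hw_ops)
        return -EINVAL;     /* object exists but has no RAS callbacks */

    return 0;
}
```

With this order, a block that never registered a RAS object no longer reaches the -EINVAL paths, which appears to be the false positive the patch title refers to.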



RE: [PATCH 2/2] drm/amd/pm: enable smu_v13_0_6 mca debug mode when UMC RAS feature is enabled

2023-09-08 Thread Yang, Stanley
[AMD Official Use Only - General]

> -Original Message-
> From: amd-gfx  On Behalf Of Yang
> Wang
> Sent: Friday, September 8, 2023 2:34 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Wang, Yang(Kevin) ; Zhang, Hawking
> 
> Subject: [PATCH 2/2] drm/amd/pm: enable smu_v13_0_6 mca debug mode
> when UMC RAS feature is enabled
>
> enable smu_v13_0_6 mca debug mode when UMC RAS feature is enabled.
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/pm/swsmu/inc/smu_types.h  |  3 ++-
>   .../drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c  | 26
> +++
>  2 files changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/inc/smu_types.h
> b/drivers/gpu/drm/amd/pm/swsmu/inc/smu_types.h
> index ebc789e7a289..f762c01b98a5 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/inc/smu_types.h
> +++ b/drivers/gpu/drm/amd/pm/swsmu/inc/smu_types.h
> @@ -247,7 +247,8 @@
>   __SMU_DUMMY_MAP(Mode2Reset),\
>   __SMU_DUMMY_MAP(RequestI2cTransaction), \
>   __SMU_DUMMY_MAP(GetMetricsTable), \
> - __SMU_DUMMY_MAP(DALNotPresent),
> + __SMU_DUMMY_MAP(DALNotPresent), \
> + __SMU_DUMMY_MAP(ClearMcaOnRead),
>
>  #undef __SMU_DUMMY_MAP
>  #define __SMU_DUMMY_MAP(type)SMU_MSG_##type
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> index ff58ee14a68f..5ecc90e6af10 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> @@ -133,6 +133,7 @@ static const struct cmn2asic_msg_mapping
> smu_v13_0_6_message_map[SMU_MSG_MAX_COU
>   MSG_MAP(SetSoftMaxGfxClk,
> PPSMC_MSG_SetSoftMaxGfxClk,0),
>   MSG_MAP(PrepareMp1ForUnload,
> PPSMC_MSG_PrepareForDriverUnload,  0),
>   MSG_MAP(GetCTFLimit, PPSMC_MSG_GetCTFLimit,
> 0),
> + MSG_MAP(ClearMcaOnRead,
> PPSMC_MSG_ClearMcaOnRead,  0),
>  };
>
>  static const struct cmn2asic_mapping
> smu_v13_0_6_clk_map[SMU_CLK_COUNT] = {
> @@ -1393,6 +1394,20 @@ static int smu_v13_0_6_notify_unload(struct
> smu_context *smu)
>   return 0;
>  }
>
> +static int smu_v13_0_6_mca_set_debug_mode(struct smu_context *smu,
> bool
> +enable) {
> + uint32_t smu_version;
> +
> + /* NOTE: this ClearMcaOnRead message is only supported for smu
> version 85.72.0 or higher */
> + smu_cmn_get_smc_version(smu, NULL, &smu_version);
> + if (smu_version < 0x554800)
> + return 0;
> +
> + return smu_cmn_send_smc_msg_with_param(smu,
> SMU_MSG_ClearMcaOnRead,
> +enable ? 0 :
> ClearMcaOnRead_UE_FLAG_MASK | ClearMcaOnRead_CE_POLL_MASK,
> +NULL);
> +}
> +
>  static int smu_v13_0_6_system_features_control(struct smu_context *smu,
>  bool enable)
>  {
> @@ -2182,6 +2197,16 @@ static int
> smu_v13_0_6_smu_send_hbm_bad_page_num(struct smu_context *smu,
>   return ret;
>  }
>
> +static int smu_v13_0_6_post_init(struct smu_context *smu) {
> + struct amdgpu_device *adev = smu->adev;
> +
> + if (!amdgpu_sriov_vf(adev) && (adev->ras_enabled &
> BIT(AMDGPU_RAS_BLOCK__UMC)))
[Stanley]: Is there any reason to check only the AMDGPU_RAS_BLOCK__UMC bit? If 
HBM ECC is not active but SRAM ECC is active, the AMDGPU_RAS_BLOCK__UMC bit is 
not set. Is it necessary to set debug mode in this scenario?

Regards,
Stanley
> + return smu_v13_0_6_mca_set_debug_mode(smu, true);
> +
> + return 0;
> +}
> +
>  static const struct pptable_funcs smu_v13_0_6_ppt_funcs = {
>   /* init dpm */
>   .get_allowed_feature_mask =
> smu_v13_0_6_get_allowed_feature_mask,
> @@ -2235,6 +2260,7 @@ static const struct pptable_funcs
> smu_v13_0_6_ppt_funcs = {
>   .i2c_init = smu_v13_0_6_i2c_control_init,
>   .i2c_fini = smu_v13_0_6_i2c_control_fini,
>   .send_hbm_bad_pages_num =
> smu_v13_0_6_smu_send_hbm_bad_page_num,
> + .post_init = smu_v13_0_6_post_init,
>  };
>
>  void smu_v13_0_6_set_ppt_funcs(struct smu_context *smu)
> --
> 2.34.1
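
As a side note on the version check in the quoted patch: the comment says "85.72.0 or higher" and the code compares against 0x554800, which is consistent with packing one byte per version component (0x55 = 85, 0x48 = 72, 0x00 = 0). A minimal sketch of that arithmetic follows. The helper names `pack_smu_version` and `supports_clear_mca_on_read` are hypothetical, and the one-byte-per-field layout is inferred from the comment and the constant, not taken from firmware documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack major.minor.patch, one byte per field. */
static uint32_t pack_smu_version(uint32_t major, uint32_t minor, uint32_t patch)
{
    assert(major <= 0xff && minor <= 0xff && patch <= 0xff);
    return (major << 16) | (minor << 8) | patch;
}

/* Same comparison as the patch, written against the packed form. */
static int supports_clear_mca_on_read(uint32_t smu_version)
{
    return smu_version >= pack_smu_version(85, 72, 0);
}
```

Written this way, the threshold reads as a version triple rather than a bare hex constant, which makes the comment and the check harder to get out of sync.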



RE: [PATCH] drm/amdgpu: Free ras cmd input buffer properly

2023-08-29 Thread Yang, Stanley
[AMD Official Use Only - General]

Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: Zhang, Hawking 
> Sent: Tuesday, August 29, 2023 10:55 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao ;
> Yang, Stanley 
> Cc: Zhang, Hawking 
> Subject: [PATCH] drm/amdgpu: Free ras cmd input buffer properly
>
> Do not access the pointer for the ras input cmd buffer if it is not even
> allocated.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index e47600a8e88e..16c5fe487ea0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -804,13 +804,13 @@ int amdgpu_ras_feature_enable(struct
> amdgpu_device *adev,
>
>   amdgpu_ras_is_poison_mode_supported(adev), ret);
>   goto out;
>   }
> +
> + kfree(info);
>   }
>
>   /* setup the obj */
>   __amdgpu_ras_feature_enable(adev, head, enable);
> -out:
> - if (head->block == AMDGPU_RAS_BLOCK__GFX)
> - kfree(info);
> +
>   return ret;
>  }
>
> --
> 2.17.1



RE: [PATCH] drm/amdgpu: Enable ras for mp0 v13_0_6 sriov

2023-08-15 Thread Yang, Stanley
[AMD Official Use Only - General]

Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: Chai, Thomas 
> Sent: Wednesday, August 16, 2023 10:26 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chai, Thomas ; Zhang, Hawking
> ; Zhou1, Tao ; Li,
> Candice ; Yang, Stanley ;
> Chai, Thomas 
> Subject: [PATCH] drm/amdgpu: Enable ras for mp0 v13_0_6 sriov
>
> Enable ras for mp0 v13_0_6 sriov
>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 7689395e44fd..378478cf9c21 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2399,6 +2399,7 @@ static bool amdgpu_ras_asic_supported(struct
> amdgpu_device *adev)
>   if (amdgpu_sriov_vf(adev)) {
>   switch (adev->ip_versions[MP0_HWIP][0]) {
>   case IP_VERSION(13, 0, 2):
> + case IP_VERSION(13, 0, 6):
>   return true;
>   default:
>   return false;
> --
> 2.34.1



RE: [PATCH Review 1/1] drm/amdgpu: Remove redundant poison consumption handler function

2023-06-19 Thread Yang, Stanley
[AMD Official Use Only - General]

Please ignore this patch, I will send V2.

Regards,
Stanley
> -Original Message-
> From: Stanley.Yang 
> Sent: Monday, June 19, 2023 4:24 PM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Zhou1, Tao ; Chai,
> Thomas 
> Cc: Yang, Stanley 
> Subject: [PATCH Review 1/1] drm/amdgpu: Remove redundant poison
> consumption handler function
>
> The handle_poison_consumption callback and the poison_consumption_handler
> callback handle poison consumption in almost the same way, so remove
> poison_consumption_handler.
>
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c  |  9 -
> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h  |  4 
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c  |  6 ++
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h  |  3 ++-
> drivers/gpu/drm/amd/amdgpu/gfx_v11_0_3.c | 12 +---
>  5 files changed, 13 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index a33d4bc34cee..c15dbdb2e0f9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -840,15 +840,6 @@ int amdgpu_gfx_ras_sw_init(struct amdgpu_device
> *adev)
>   return 0;
>  }
>
> -int amdgpu_gfx_poison_consumption_handler(struct amdgpu_device *adev,
> - struct amdgpu_iv_entry
> *entry)
> -{
> - if (adev->gfx.ras && adev->gfx.ras->poison_consumption_handler)
> - return adev->gfx.ras->poison_consumption_handler(adev,
> entry);
> -
> - return 0;
> -}
> -
>  int amdgpu_gfx_process_ras_data_cb(struct amdgpu_device *adev,
>   void *err_data,
>   struct amdgpu_iv_entry *entry)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> index d0c3f2955821..95b80bc8cdb9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> @@ -273,8 +273,6 @@ struct amdgpu_gfx_ras {
>   int (*rlc_gc_fed_irq)(struct amdgpu_device *adev,
>   struct amdgpu_irq_src *source,
>   struct amdgpu_iv_entry *entry);
> - int (*poison_consumption_handler)(struct amdgpu_device *adev,
> - struct amdgpu_iv_entry
> *entry);
>  };
>
>  struct amdgpu_gfx_shadow_info {
> @@ -538,8 +536,6 @@ int amdgpu_gfx_get_num_kcq(struct amdgpu_device *adev);
>  void amdgpu_gfx_cp_init_microcode(struct amdgpu_device *adev,
> uint32_t ucode_id);
>
>  int amdgpu_gfx_ras_sw_init(struct amdgpu_device *adev);
> -int amdgpu_gfx_poison_consumption_handler(struct amdgpu_device *adev,
> - struct amdgpu_iv_entry *entry);
>
>  bool amdgpu_gfx_is_master_xcc(struct amdgpu_device *adev, int xcc_id);
>  int amdgpu_gfx_sysfs_init(struct amdgpu_device *adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 5b6525d8dace..7be289473034 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1694,15 +1694,13 @@ static void
> amdgpu_ras_interrupt_poison_consumption_handler(struct ras_manager *
>   amdgpu_umc_poison_handler(adev, false);
>
>   if (block_obj->hw_ops && block_obj->hw_ops-
> >handle_poison_consumption)
> - poison_stat = block_obj->hw_ops-
> >handle_poison_consumption(adev);
> + poison_stat = block_obj->hw_ops-
> >handle_poison_consumption(adev,
> +entry);
>
>   /* gpu reset is fallback for failed and default cases */
> - if (poison_stat) {
> + if (poison_stat != true) {
>   dev_info(adev->dev, "GPU reset for %s RAS poison
> consumption is issued!\n",
>   block_obj->ras_comm.name);
>   amdgpu_ras_reset_gpu(adev);
> - } else {
> - amdgpu_gfx_poison_consumption_handler(adev, entry);
>   }
>  }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 46bf1889a9d7..03f3b3774b85 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -564,7 +564,8 @@ struct amdgpu_ras_block_hw_ops {
>   void (*reset_ras_error_count)(struct amdgpu_device *adev);
>   void (*reset_ras_error_status)(struct amdgpu_device *adev);
>   bool (*query_poison_status)(struct amdgpu_device *adev);
> - bool (*handle_poison_consumption)(struct amdgpu_d

RE: [PATCH Review 1/2] drm/amdgpu: Optimze checking ras supported

2023-06-13 Thread Yang, Stanley
[AMD Official Use Only - General]

> -Original Message-
> From: Zhou1, Tao 
> Sent: Tuesday, June 13, 2023 3:08 PM
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org;
> Zhang, Hawking 
> Cc: Yang, Stanley 
> Subject: RE: [PATCH Review 1/2] drm/amdgpu: Optimze checking ras
> supported
>
> [AMD Official Use Only - General]
>
> [Tao] typo in title: Optimze -> Optimize

[Stanley]: Thanks Tao, will update before submitting.

Regards,
Stanley
>
> > -Original Message-
> > From: Stanley.Yang 
> > Sent: Tuesday, June 13, 2023 11:53 AM
> > To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> > ; Zhou1, Tao 
> > Cc: Yang, Stanley 
> > Subject: [PATCH Review 1/2] drm/amdgpu: Optimze checking ras supported
> >
> > Use "is_app_apu" to identify whether the device is in native APU mode or
> > carveout mode.
> >
> > Signed-off-by: Stanley.Yang 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c |  2 +-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c |  8 +++---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 34 ++-
> --
> >  3 files changed, 23 insertions(+), 21 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > index 78bacea951a9..352e958b190a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > @@ -1653,7 +1653,7 @@ int psp_ras_initialize(struct psp_context *psp)
> >
> >   if (amdgpu_ras_is_poison_mode_supported(adev))
> >   ras_cmd->ras_in_message.init_flags.poison_mode_en = 1;
> > - if (!adev->gmc.xgmi.connected_to_cpu)
> > + if (!adev->gmc.xgmi.connected_to_cpu && !adev->gmc.is_app_apu)
> >   ras_cmd->ras_in_message.init_flags.dgpu_mode = 1;
> >   ras_cmd->ras_in_message.init_flags.xcc_mask =
> >   adev->gfx.xcc_mask;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index 7a0924469e4f..56bb0db207b9 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -1689,8 +1689,7 @@ static void
> > amdgpu_ras_interrupt_poison_consumption_handler(struct ras_manager *
> >   }
> >   }
> >
> > - if (!adev->gmc.xgmi.connected_to_cpu)
> > - amdgpu_umc_poison_handler(adev, false);
> > + amdgpu_umc_poison_handler(adev, false);
> >
> >   if (block_obj->hw_ops && block_obj->hw_ops-
> > >handle_poison_consumption)
> >   poison_stat = block_obj->hw_ops-
> > >handle_poison_consumption(adev);
> > @@ -2458,11 +2457,10 @@ static void
> amdgpu_ras_check_supported(struct
> > amdgpu_device *adev)  {
> >   adev->ras_hw_enabled = adev->ras_enabled = 0;
> >
> > - if (!adev->is_atom_fw ||
> > - !amdgpu_ras_asic_supported(adev))
> > + if (!amdgpu_ras_asic_supported(adev))
> >   return;
> >
> > - if (!adev->gmc.xgmi.connected_to_cpu) {
> > + if (!adev->gmc.xgmi.connected_to_cpu && !adev-
>
> [Tao] the tab should be replaced with space.
>
> > >gmc.is_app_apu) {
> >   if (amdgpu_atomfirmware_mem_ecc_supported(adev)) {
> >   dev_info(adev->dev, "MEM ECC is active.\n");
> >   adev->ras_hw_enabled |= (1 <<
> > AMDGPU_RAS_BLOCK__UMC |
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > index 1edf8e6aeb16..db0d94ca4ffc 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > @@ -169,27 +169,31 @@ int amdgpu_umc_poison_handler(struct
> > amdgpu_device *adev, bool reset)  {
> >   int ret = AMDGPU_RAS_SUCCESS;
> >
> > - if (!amdgpu_sriov_vf(adev)) {
> > - if (!adev->gmc.xgmi.connected_to_cpu) {
> > - struct ras_err_data err_data = {0, 0, 0, NULL};
> > - struct ras_common_if head = {
> > - .block = AMDGPU_RAS_BLOCK__UMC,
> > - };
> > - struct ras_manager *obj = amdgpu_ras_find_obj(adev,
> > &head);
> > -
> > - ret = amdgpu_umc_do_page_retirement(adev,
> > &err_data, NULL, reset);
> > -
> > - if (ret == AMDGPU_RAS_SUCCESS && o

RE: [PATCH 2/2] drm/amdgpu: Enable gfx v11_0_3 ras if poison mode is supported

2023-06-12 Thread Yang, Stanley
[AMD Official Use Only - General]

> -Original Message-
> From: Zhang, Hawking 
> Sent: Sunday, June 11, 2023 6:46 PM
> To: amd-gfx@lists.freedesktop.org; Yang, Stanley ;
> Li, Candice ; Chai, Thomas ;
> Zhou1, Tao 
> Cc: Zhang, Hawking 
> Subject: [PATCH 2/2] drm/amdgpu: Enable gfx v11_0_3 ras if poison mode is
> supported
>
> GFX v11_0_3 ras needs to be enabled if poison mode is supported. The driver
> doesn't need to issue a feature enable call in the gfx_v11_0 late init phase.
> The ras late init call is already centralized in amdgpu_ras_late_init.
> In addition, move the poison_mode check out of common helpers like
> amdgpu_ras_is_supported and amdgpu_ras_is_feature_allowed to ensure only
> GFX RAS is enabled when poison mode is supported.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 49 -
> drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c  | 26 -
>  2 files changed, 16 insertions(+), 59 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index dd7cdc234d7e..35e70860d628 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -126,6 +126,7 @@ static bool
> amdgpu_ras_check_bad_page_unlock(struct amdgpu_ras *con,
>   uint64_t addr);
>  static bool amdgpu_ras_check_bad_page(struct amdgpu_device *adev,
>   uint64_t addr);
> +static void amdgpu_ras_query_poison_mode(struct amdgpu_device *adev);
>  #ifdef CONFIG_X86_MCE_AMD
>  static void amdgpu_register_bad_pages_mca_notifier(struct amdgpu_device
> *adev);
>  struct mce_notifier_adev_list {
> @@ -757,16 +758,6 @@ static int __amdgpu_ras_feature_enable(struct
> amdgpu_device *adev,
>   return 0;
>  }
>
> -static int amdgpu_ras_check_feature_allowed(struct amdgpu_device *adev,
> - struct ras_common_if *head)
> -{
> - if (amdgpu_ras_is_feature_allowed(adev, head) ||
> - amdgpu_ras_is_poison_mode_supported(adev))
> - return 1;
> - else
> - return 0;
> -}
> -
>  /* wrapper of psp_ras_enable_features */
>  int amdgpu_ras_feature_enable(struct amdgpu_device *adev,
>   struct ras_common_if *head, bool enable)
> @@ -797,7 +788,7 @@ int amdgpu_ras_feature_enable(struct amdgpu_device *adev,
>   }
>
>   /* Do not enable if it is not allowed. */
> - if (enable && !amdgpu_ras_check_feature_allowed(adev, head))
> + if (enable && !amdgpu_ras_is_feature_allowed(adev, head))
>   goto out;
>
>   /* Only enable ras feature operation handle on host side */
> @@ -2420,9 +2411,9 @@ static bool amdgpu_ras_asic_supported(struct
> amdgpu_device *adev)
>  }
>
>  /*
> - * this is workaround for vega20 workstation sku,
> - * force enable gfx ras, ignore vbios gfx ras flag
> - * due to GC EDC can not write
> + * Common helpers for device or IP specific RAS quirks including
> + * a). Enable gfx ras on D16406 or D36002 board
> + * b). Enable gfx ras in gfx_v11_0_3 if poison mode is supported
>   */
>  static void amdgpu_ras_get_quirks(struct amdgpu_device *adev)
>  {
> @@ -2431,10 +2422,16 @@ static void amdgpu_ras_get_quirks(struct
> amdgpu_device *adev)
>   if (!ctx)
>   return;
>
> + /* Enable gfx ras on specific board */
>   if (strnstr(ctx->vbios_version, "D16406",
>   sizeof(ctx->vbios_version)) ||
> - strnstr(ctx->vbios_version, "D36002",
> - sizeof(ctx->vbios_version)))
> + strnstr(ctx->vbios_version, "D36002",
> + sizeof(ctx->vbios_version)))
> + adev->ras_hw_enabled |= (1 <<
> AMDGPU_RAS_BLOCK__GFX);
> +
> + /* Enable gfx ras on gfx_v11_0_3 if poison mode is supported */
> + if (adev->ip_versions[GC_HWIP][0] == IP_VERSION(11, 0, 3) &&
> + amdgpu_ras_is_poison_mode_supported(adev))
>   adev->ras_hw_enabled |= (1 <<
> AMDGPU_RAS_BLOCK__GFX);
>  }

[Stanley]: For GC 11.0.3, it's better not to expose AMDGPU_RAS_BLOCK__GFX to 
kfd. Maybe the change below is more reasonable.
{
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index 8e4124dcb6e4..84030289ac96 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -1996,9 +1996,10 @@ int kfd_topology_add_device(struct kfd_dev *gpu)
}

/* kfd only concerns sram ecc on GFX and HBM ecc on UMC */
-   dev->node_props.capability |=
-   ((dev->gpu->adev->ras_enabled & BIT(AMD

RE: [PATCH Review V2 2/2] drm/amdgpu: correct ras enabled flag

2023-04-11 Thread Yang, Stanley
[AMD Official Use Only - General]

Thanks, it's a typo; I have fixed it before merging.

Regards,
Stanley
> -Original Message-
> From: Lazar, Lijo 
> Sent: Tuesday, April 11, 2023 6:08 PM
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org;
> Zhang, Hawking ; Zhou1, Tao
> ; Chen, Guchun 
> Cc: Yang, Stanley 
> Subject: RE: [PATCH Review V2 2/2] drm/amdgpu: correct ras enabled flag
> 
> [AMD Official Use Only - General]
> 
> >  if (adev->gmc.xmgi.
> 
> This looks like a typo. Should be gmc.xgmi
> 
> Thanks,
> Lijo
> 
> -Original Message-
> From: amd-gfx  On Behalf Of
> Stanley.Yang
> Sent: Tuesday, April 11, 2023 3:03 PM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Zhou1, Tao ; Chen,
> Guchun 
> Cc: Yang, Stanley 
> Subject: [PATCH Review V2 2/2] drm/amdgpu: correct ras enabled flag
> 
> XGMI RAS should depend on the number of gmc xgmi physical nodes:
> XGMI RAS should not be enabled if xgmi num_physical_nodes is zero.
> 
> Change-Id: Idf3600b30584b10b528e7237d103d84d5097b7e0
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 4069bce9479f..c2c4d978896c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2430,6 +2430,13 @@ static void amdgpu_ras_check_supported(struct
> amdgpu_device *adev)
>   else
>   adev->ras_hw_enabled &= ~(1 <<
> AMDGPU_RAS_BLOCK__VCN |
>   1 <<
> AMDGPU_RAS_BLOCK__JPEG);
> +
> + /*
> +  * XGMI RAS is not supported if xgmi num physical
> nodes
> +  * is zero
> +  */
> + if (adev->gmc.xmgi.num_physical_nodes == 0)
> + adev->ras_hw_enabled &= ~(1 <<
> AMDGPU_RAS_BLOCK__XGMI_WAFL);
>   } else {
>   dev_info(adev->dev, "SRAM ECC is not
> presented.\n");
>   }
> --
> 2.17.1
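
The masking step in the quoted patch can be looked at on its own. The sketch below is illustrative only: the enum values are hypothetical bit positions (the real AMDGPU_RAS_BLOCK__* values are not shown in this thread) and `filter_xgmi_ras` is an invented name. It clears the XGMI/WAFL bit from an enabled-blocks mask when no physical xgmi nodes are present and leaves every other bit untouched.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only (assumption). */
enum ras_block {
    RAS_BLOCK_UMC = 0,
    RAS_BLOCK_GFX = 2,
    RAS_BLOCK_XGMI_WAFL = 8,
};

/* Drop the XGMI/WAFL RAS bit when there are no physical xgmi nodes. */
static uint32_t filter_xgmi_ras(uint32_t hw_enabled, unsigned int num_physical_nodes)
{
    assert(RAS_BLOCK_XGMI_WAFL < 32);   /* bit must fit in the mask */

    if (num_physical_nodes == 0)
        hw_enabled &= ~(1u << RAS_BLOCK_XGMI_WAFL);

    return hw_enabled;
}
```

Using an unsigned literal (`1u`) for the shift avoids relying on signed shifts in the sketch; the quoted patch uses the driver's existing `1 <<` style.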


RE: [PATCH Review 2/2] drm/amdgpu: correct ras enabled flag

2023-04-11 Thread Yang, Stanley
[AMD Official Use Only - General]

Thanks Hawking. I will update.

Regards,
Stanley
From: Zhang, Hawking 
Sent: Tuesday, April 11, 2023 12:05 PM
To: Yang, Stanley ; amd-gfx@lists.freedesktop.org; Zhou1, 
Tao 
Cc: Yang, Stanley 
Subject: Re: [PATCH Review 2/2] drm/amdgpu: correct ras enabled flag


[AMD Official Use Only - General]

Just checking gmc.xmgi.num_physical_nodes == 0 should be good enough for the 
case where we only have a single ALDEBARAN/ARCTURUS available in the system.

In such case, there is no need to expose xgmi_wafl ras node.

Regards,
Hawking

From: Stanley.Yang <stanley.y...@amd.com>
Date: Monday, April 10, 2023 at 19:48
To: amd-gfx@lists.freedesktop.org, Zhang, Hawking <hawking.zh...@amd.com>, 
Zhou1, Tao <tao.zh...@amd.com>
Cc: Yang, Stanley <stanley.y...@amd.com>
Subject: [PATCH Review 2/2] drm/amdgpu: correct ras enabled flag
XGMI RAS should depend on the gmc xgmi supported flag
and the number of xgmi physical nodes.

Change-Id: Idf3600b30584b10b528e7237d103d84d5097b7e0
Signed-off-by: Stanley.Yang <stanley.y...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 4069bce9479f..d26a93272bf2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2430,6 +2430,14 @@ static void amdgpu_ras_check_supported(struct 
amdgpu_device *adev)
 else
 adev->ras_hw_enabled &= ~(1 << 
AMDGPU_RAS_BLOCK__VCN |
 1 << 
AMDGPU_RAS_BLOCK__JPEG);
+
+   /*
+* XGMI RAS is determined by xgmi supported flags
+* and xgmi num physical nodes
+*/
+   if (!adev->gmc.xgmi.supported ||
+   adev->gmc.xmgi.num_physical_nodes == 0)
+   adev->ras_hw_enabled &= ~(1 << 
AMDGPU_RAS_BLOCK__XGMI_WAFL);
 } else {
 dev_info(adev->dev, "SRAM ECC is not presented.\n");
 }
--
2.17.1


RE: [PATCH Review 1/2] drm/admgpu: fix unexpected block id

2023-04-11 Thread Yang, Stanley
[AMD Official Use Only - General]

Hi Guchun,

Thanks, will update.

Regards,
Stanley

> -Original Message-
> From: Chen, Guchun 
> Sent: Tuesday, April 11, 2023 12:48 PM
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org;
> Zhang, Hawking ; Zhou1, Tao
> 
> Cc: Yang, Stanley 
> Subject: RE: [PATCH Review 1/2] drm/admgpu: fix unexpected block id
> 
> A spelling typo in subject, s/admgpu/amdgpu :)
> 
> Also, it may be necessary to add body text to the commit message.
> 
> Regards,
> Guchun
> 
> > -Original Message-
> > From: amd-gfx  On Behalf Of
> > Stanley.Yang
> > Sent: Monday, April 10, 2023 7:48 PM
> > To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> > ; Zhou1, Tao 
> > Cc: Yang, Stanley 
> > Subject: [PATCH Review 1/2] drm/admgpu: fix unexpected block id
> >
> > Change-Id: Icceb43556eec802f11c2077c1c58a1e92c9df599
> > Signed-off-by: Stanley.Yang 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 
> > drivers/gpu/drm/amd/amdgpu/ta_ras_if.h  | 2 ++
> >  2 files changed, 6 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > index ef38f4c93df0..17b3d1992e80 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > @@ -583,6 +583,10 @@ amdgpu_ras_block_to_ta(enum
> amdgpu_ras_block
> > block) {
> > return TA_RAS_BLOCK__FUSE;
> > case AMDGPU_RAS_BLOCK__MCA:
> > return TA_RAS_BLOCK__MCA;
> > +   case AMDGPU_RAS_BLOCK__VCN:
> > +   return TA_RAS_BLOCK__VCN;
> > +   case AMDGPU_RAS_BLOCK__JPEG:
> > +   return TA_RAS_BLOCK__JPEG;
> > default:
> > WARN_ONCE(1, "RAS ERROR: unexpected block id %d\n",
> block);
> > return TA_RAS_BLOCK__UMC;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> > b/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> > index 509d8a1945eb..30d0482ac466 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> > @@ -84,6 +84,8 @@ enum ta_ras_block {
> > TA_RAS_BLOCK__MP1,
> > TA_RAS_BLOCK__FUSE,
> > TA_RAS_BLOCK__MCA,
> > +   TA_RAS_BLOCK__VCN,
> > +   TA_RAS_BLOCK__JPEG,
> > TA_NUM_BLOCK_MAX
> >  };
> >
> > --
> > 2.17.1
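
The reason the missing cases matter can be shown with a reduced model of the mapping function. Everything below is a hypothetical stand-in (shortened enums, invented names), not the driver's real definitions: any driver-side block without its own `case` falls through to `default`, logs an "unexpected block id" warning, and is reported as UMC.

```c
#include <assert.h>
#include <stdio.h>

/* Reduced, hypothetical enums (assumption, not the real definitions). */
enum drv_block { DRV_BLOCK_UMC, DRV_BLOCK_MCA, DRV_BLOCK_VCN, DRV_BLOCK_JPEG, DRV_BLOCK_OTHER };
enum ta_block { TA_BLOCK_UMC, TA_BLOCK_MCA, TA_BLOCK_VCN, TA_BLOCK_JPEG };

static enum ta_block drv_block_to_ta(enum drv_block block)
{
    switch (block) {
    case DRV_BLOCK_UMC:
        return TA_BLOCK_UMC;
    case DRV_BLOCK_MCA:
        return TA_BLOCK_MCA;
    case DRV_BLOCK_VCN:     /* without this case, VCN hits default */
        return TA_BLOCK_VCN;
    case DRV_BLOCK_JPEG:    /* likewise for JPEG */
        return TA_BLOCK_JPEG;
    default:
        fprintf(stderr, "RAS ERROR: unexpected block id %d\n", (int)block);
        return TA_BLOCK_UMC;
    }
}
```

In this model, removing the VCN and JPEG cases makes both blocks map to TA_BLOCK_UMC with a warning, which matches the "unexpected block id" symptom named in the patch subject.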


RE: [PATCH] drm/amdgpu: Enable GFX11 SDMA context empty interrupt

2023-04-10 Thread Yang, Stanley
[AMD Official Use Only - General]

Reviewed-by: Stanley Yang 

Regards,
Stanley
> -Original Message-
> From: Sider, Graham 
> Sent: Wednesday, April 5, 2023 11:42 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Yang, Stanley ; Sider, Graham
> 
> Subject: [PATCH] drm/amdgpu: Enable GFX11 SDMA context empty interrupt
> 
> Enable SDMA queue empty context switching. SDMA context switch due to
> quantum programming is no longer done here (as of sdma v6), so rename
> sdma_v6_0_ctx_switch_enable to sdma_v6_0_ctxempty_int_enable to
> reflect this.
> 
> Also program SDMAx_QUEUEx_SCHEDULE_CNTL for context switch due to
> quantum in KFD. Set to amdgpu_sdma_phase_quantum (defaults to 32 i.e.
> 3200us).
> 
> Signed-off-by: Graham Sider 
> ---
>  drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c| 28 ---
>  .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c  |  4 +++
>  2 files changed, 22 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
> b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
> index 40e6b22daa22..f45f7469dd32 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
> @@ -403,15 +403,26 @@ static void sdma_v6_0_rlc_stop(struct
> amdgpu_device *adev)  }
> 
>  /**
> - * sdma_v6_0_ctx_switch_enable - stop the async dma engines context
> switch
> + * sdma_v6_0_ctxempty_int_enable - enable or disable context empty
> + interrupts
>   *
>   * @adev: amdgpu_device pointer
> - * @enable: enable/disable the DMA MEs context switch.
> + * @enable: enable/disable context switching due to queue empty
> + conditions
>   *
> - * Halt or unhalt the async dma engines context switch.
> + * Enable or disable the async dma engines queue empty context switch.
>   */
> -static void sdma_v6_0_ctx_switch_enable(struct amdgpu_device *adev,
> bool enable)
> +static void sdma_v6_0_ctxempty_int_enable(struct amdgpu_device *adev,
> +bool enable)
>  {
> + u32 f32_cntl;
> + int i;
> +
> + if (!amdgpu_sriov_vf(adev)) {
> + for (i = 0; i < adev->sdma.num_instances; i++) {
> + f32_cntl = RREG32(sdma_v6_0_get_reg_offset(adev,
> i, regSDMA0_CNTL));
> + f32_cntl = REG_SET_FIELD(f32_cntl, SDMA0_CNTL,
> + CTXEMPTY_INT_ENABLE, enable ? 1 :
> 0);
> + WREG32(sdma_v6_0_get_reg_offset(adev, i,
> regSDMA0_CNTL), f32_cntl);
> + }
> + }
>  }
> 
>  /**
> @@ -579,10 +590,8 @@ static int sdma_v6_0_gfx_resume(struct
> amdgpu_device *adev)
> 
>   ring->sched.ready = true;
> 
> - if (amdgpu_sriov_vf(adev)) { /* bare-metal sequence
> doesn't need below to lines */
> - sdma_v6_0_ctx_switch_enable(adev, true);
> + if (amdgpu_sriov_vf(adev))
>   sdma_v6_0_enable(adev, true);
> - }
> 
>   r = amdgpu_ring_test_helper(ring);
>   if (r) {
> @@ -778,7 +787,6 @@ static int sdma_v6_0_start(struct amdgpu_device
> *adev)
>   int r = 0;
> 
>   if (amdgpu_sriov_vf(adev)) {
> - sdma_v6_0_ctx_switch_enable(adev, false);
>   sdma_v6_0_enable(adev, false);
> 
>   /* set RB registers */
> @@ -799,7 +807,7 @@ static int sdma_v6_0_start(struct amdgpu_device
> *adev)
>   /* unhalt the MEs */
>   sdma_v6_0_enable(adev, true);
>   /* enable sdma ring preemption */
> - sdma_v6_0_ctx_switch_enable(adev, true);
> + sdma_v6_0_ctxempty_int_enable(adev, true);
> 
>   /* start the gfx rings and rlc compute queues */
>   r = sdma_v6_0_gfx_resume(adev);
> @@ -1319,7 +1327,7 @@ static int sdma_v6_0_hw_fini(void *handle)
>   return 0;
>   }
> 
> - sdma_v6_0_ctx_switch_enable(adev, false);
> + sdma_v6_0_ctxempty_int_enable(adev, false);
>   sdma_v6_0_enable(adev, false);
> 
>   return 0;
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c
> index 4a9af800b1f1..85d5782eccd2 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c
> @@ -350,6 +350,10 @@ static void update_mqd_sdma(struct mqd_manager
> *mm, void *mqd,
>   m->sdmax_rlcx_doorbell_offset =
>   q->doorbell_off <<
> SDMA0_QUEUE0_DOORBELL_OFFSET__OFFSET__SHIFT;
> 
> + m->sdmax_rlcx_sched_cntl = (amdgpu_sdma_phase_quantum
> + <<
> SDMA0_QUEUE0_SCHEDULE_CNTL__CONTEXT_QUANTUM__SHIFT)
> +  &
> SDMA0_QUEUE0_SCHEDULE_CNTL__CONTEXT_QUANTUM_MASK;
> +
>   m->sdma_engine_id = q->sdma_engine_id;
>   m->sdma_queue_id = q->sdma_queue_id;
>   m->sdmax_rlcx_dummy_reg = SDMA_RLC_DUMMY_DEFAULT;
> --
> 2.25.1
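
The `(value << SHIFT) & MASK` idiom used for sdmax_rlcx_sched_cntl in the quoted patch can be sketched with made-up field parameters. The shift and mask below are hypothetical placeholders, not the real SDMA0_QUEUE0_SCHEDULE_CNTL__CONTEXT_QUANTUM values, and `encode_context_quantum` is an invented name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field layout: an 8-bit field at bits 15:8 (assumption). */
#define CONTEXT_QUANTUM_SHIFT 8
#define CONTEXT_QUANTUM_MASK  0x0000ff00u

/* Place the quantum in its field; bits outside the field are dropped. */
static uint32_t encode_context_quantum(uint32_t quantum)
{
    /* sanity check: the lowest mask bit sits at the shift position */
    assert((CONTEXT_QUANTUM_MASK >> CONTEXT_QUANTUM_SHIFT) & 1u);

    return (quantum << CONTEXT_QUANTUM_SHIFT) & CONTEXT_QUANTUM_MASK;
}
```

The mask means a value too wide for the field is truncated rather than spilling into neighbouring fields; with the thread's default quantum of 32 the value fits this hypothetical 8-bit field without loss.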


RE: [PATCH 3/3] drm/amdgpu: resume ras for gfx v11_0_3 during reset on SRIOV

2023-03-21 Thread Yang, Stanley
[AMD Official Use Only - General]

The series is Reviewed-by: Stanley Yang 

Regards,
Stanley
> -Original Message-
> From: amd-gfx  On Behalf Of
> YiPeng Chai
> Sent: Tuesday, March 21, 2023 10:40 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao ; Zhang, Hawking
> ; Chai, Thomas ; Chai,
> Thomas 
> Subject: [PATCH 3/3] drm/amdgpu: resume ras for gfx v11_0_3 during reset
> on SRIOV
> 
> Gfx v11_0_3 supports ras on SRIOV, so need to resume ras during reset.
> 
> Signed-off-by: YiPeng Chai 
> Reviewed-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index d74d05802566..14d756caf839 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5313,8 +5313,9 @@ int amdgpu_device_gpu_recover(struct
> amdgpu_device *adev,
>   if (r)
>   adev->asic_reset_res = r;
> 
> - /* Aldebaran supports ras in SRIOV, so need resume ras
> during reset */
> - if (adev->ip_versions[GC_HWIP][0] == IP_VERSION(9, 4, 2))
> + /* Aldebaran and gfx_11_0_3 support ras in SRIOV, so need
> resume ras during reset */
> + if (adev->ip_versions[GC_HWIP][0] == IP_VERSION(9, 4, 2)
> ||
> + adev->ip_versions[GC_HWIP][0] == IP_VERSION(11, 0, 3))
>   amdgpu_ras_resume(adev);
>   } else {
>   r = amdgpu_do_asic_reset(device_list_handle,
> reset_context);
> --
> 2.34.1
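
For readers following the version check in the hunk above, here is a minimal, self-contained sketch of how such a comparison behaves. It is not driver code: SKETCH_IP_VERSION is a stand-in that assumes the usual major/minor/revision byte packing, and sketch_sriov_ras_resume_needed is a hypothetical helper introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed packing (stand-in for the kernel macro, named differently on
 * purpose): major in bits 16 and up, minor in bits 8-15, revision in bits 0-7. */
#define SKETCH_IP_VERSION(mj, mn, rv) \
	(((uint32_t)(mj) << 16) | ((uint32_t)(mn) << 8) | (uint32_t)(rv))

/* Compile-time check that the assumed packing gives the expected value. */
static_assert(SKETCH_IP_VERSION(9, 4, 2) == 0x090402u, "unexpected packing");

/* Hypothetical helper: true for the two GC versions that the patch lists
 * as supporting RAS under SRIOV (9.4.2 and 11.0.3). */
static bool sketch_sriov_ras_resume_needed(uint32_t gc_version)
{
	return gc_version == SKETCH_IP_VERSION(9, 4, 2) ||
	       gc_version == SKETCH_IP_VERSION(11, 0, 3);
}
```

Keeping the list in one predicate would mean that a later ASIC only needs one more comparison, instead of another change to the condition in the reset path.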


RE: [PATCH] drm/amdgpu: Initialize umc ras callback

2023-03-20 Thread Yang, Stanley
[AMD Official Use Only - General]

Reviewed-by: Stanley Yang 

Regards,
Stanley
> -Original Message-
> From: amd-gfx  On Behalf Of
> Hawking Zhang
> Sent: Monday, March 20, 2023 5:37 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao 
> Cc: Zhang, Hawking 
> Subject: [PATCH] drm/amdgpu: Initialize umc ras callback
> 
> Fix a coding error which results in a null interrupt handler for umc ras.
> 
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index da68ceaa024c..9e2e97207e53 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -232,7 +232,7 @@ int amdgpu_umc_ras_sw_init(struct amdgpu_device
> *adev)
>   if (!ras->ras_block.ras_late_init)
>   ras->ras_block.ras_late_init = amdgpu_umc_ras_late_init;
> 
> - if (ras->ras_block.ras_cb)
> + if (!ras->ras_block.ras_cb)
>   ras->ras_block.ras_cb = amdgpu_umc_process_ras_data_cb;
> 
>   return 0;
> --
> 2.17.1
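
The one-character fix above is easier to see in isolation. The sketch below reproduces the "install a default only when none is set" pattern with stand-in types. All sketch_* names are hypothetical and only model the shape of the code in the hunk.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*sketch_ras_cb)(void *data);

/* Stand-in for the part of the ras block that holds the callback. */
struct sketch_ras_block {
	sketch_ras_cb ras_cb;
};

/* Hypothetical default handler. */
static int sketch_default_cb(void *data)
{
	(void)data;
	return 0;
}

/* Pre-patch shape: the default is assigned only when a callback already
 * exists, so a block that starts with NULL keeps NULL. */
static void sketch_init_buggy(struct sketch_ras_block *b)
{
	assert(b != NULL);
	if (b->ras_cb)
		b->ras_cb = sketch_default_cb;
}

/* Post-patch shape: the default is assigned only when nothing is set. */
static void sketch_init_fixed(struct sketch_ras_block *b)
{
	assert(b != NULL);
	if (!b->ras_cb)
		b->ras_cb = sketch_default_cb;
}
```

With the inverted test, a block whose callback starts as NULL never receives the default, which matches the null handler described in the commit message.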


RE: [PATCH 09/10] drm/amdgpu: Rework pcie_bif ras sw_init

2023-03-13 Thread Yang, Stanley
[AMD Official Use Only - General]

Apart from the inline comment below, the series looks fine to me.

Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: Zhang, Hawking 
> Sent: Monday, March 13, 2023 9:44 AM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao ;
> Yang, Stanley ; Li, Candice ;
> Chai, Thomas 
> Cc: Zhang, Hawking 
> Subject: [PATCH 09/10] drm/amdgpu: Rework pcie_bif ras sw_init
> 
> pcie_bif ras blocks needs to be initialized as early as possible to handle 
> fatal
> error detected in hw_init phase. also align the pcie_bif ras sw_init with 
> other
> ras blocks
> 
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.c | 23
> +++
> drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.h |  1 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c  | 17 ++---
>  3 files changed, 34 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.c
> index 37d779b8e4a6..a3bc00577a7c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.c
> @@ -22,6 +22,29 @@
>  #include "amdgpu.h"
>  #include "amdgpu_ras.h"
> 
> +int amdgpu_nbio_ras_sw_init(struct amdgpu_device *adev) {
> + int err;
> + struct amdgpu_nbio_ras *ras;
> +
> + if (!adev->nbio.ras)
> + return 0;
> +
> + ras = adev->nbio.ras;
> + err = amdgpu_ras_register_ras_block(adev, &ras->ras_block);
> + if (err) {
> + dev_err(adev->dev, "Failed to register pcie_bif ras block!\n");
> + return err;
> + }
> +
> + strcpy(ras->ras_block.ras_comm.name, "pcie_bif");
> + ras->ras_block.ras_comm.block = AMDGPU_RAS_BLOCK__PCIE_BIF;
> + ras->ras_block.ras_comm.type =
> AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE;
> + adev->nbio.ras_if = &ras->ras_block.ras_comm;
> +
> + return 0;
> +}
> +
>  int amdgpu_nbio_ras_late_init(struct amdgpu_device *adev, struct
> ras_common_if *ras_block)  {
>   int r;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.h
> index a240336bbc6b..c686ff4bcc39 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.h
> @@ -106,5 +106,6 @@ struct amdgpu_nbio {
>   struct amdgpu_nbio_ras  *ras;
>  };
> 
> +int amdgpu_nbio_ras_sw_init(struct amdgpu_device *adev);
>  int amdgpu_nbio_ras_late_init(struct amdgpu_device *adev, struct
> ras_common_if *ras_block);  #endif diff --git
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 63dfcc98152d..834092099bff 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2555,20 +2555,23 @@ int amdgpu_ras_init(struct amdgpu_device
> *adev)
>* ras functions so hardware fatal error interrupt
>* can be enabled as early as possible */
>   switch (adev->asic_type) {

[Stanley]: The switch condition should be changed from adev->asic_type to
ip_versions[][], since the case labels are now IP_VERSION() values.

> - case CHIP_VEGA20:
> - case CHIP_ARCTURUS:
> - case CHIP_ALDEBARAN:
> - if (!adev->gmc.xgmi.connected_to_cpu) {
> + case IP_VERSION(7, 4, 0):
> + case IP_VERSION(7, 4, 1):
> + case IP_VERSION(7, 4, 4):
> + if (!adev->gmc.xgmi.connected_to_cpu)
>   adev->nbio.ras = &nbio_v7_4_ras;
> - amdgpu_ras_register_ras_block(adev,
> &adev->nbio.ras->ras_block);
> - adev->nbio.ras_if = &adev->nbio.ras-
> >ras_block.ras_comm;
> - }
>   break;
>   default:
>   /* nbio ras is not available */
>   break;
>   }
> 
> + /* nbio ras block needs to be enabled ahead of other ras blocks
> +  * to handle fatal error */
> + r = amdgpu_nbio_ras_sw_init(adev);
> + if (r)
> + return r;
> +
>   if (adev->nbio.ras &&
>   adev->nbio.ras->init_ras_controller_interrupt) {
>   r = adev->nbio.ras->init_ras_controller_interrupt(adev);
> --
> 2.17.1
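
To make the inline comment concrete: a switch whose case labels are packed IP versions has to switch on a packed IP version as well. The sketch below shows that shape with stand-in names. SKETCH_IP_VERSION assumes the usual major/minor/revision byte packing, and sketch_nbio_has_ras is a hypothetical helper, not a driver function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed packing; stand-in for the kernel macro. */
#define SKETCH_IP_VERSION(mj, mn, rv) \
	(((uint32_t)(mj) << 16) | ((uint32_t)(mn) << 8) | (uint32_t)(rv))

static_assert(SKETCH_IP_VERSION(7, 4, 0) == 0x070400u, "unexpected packing");

/* Hypothetical helper: decides whether nbio RAS applies, keyed on the
 * packed NBIO version rather than on an ASIC enum. The three versions are
 * the ones listed in the hunk. */
static bool sketch_nbio_has_ras(uint32_t nbio_version, bool connected_to_cpu)
{
	switch (nbio_version) {
	case SKETCH_IP_VERSION(7, 4, 0):
	case SKETCH_IP_VERSION(7, 4, 1):
	case SKETCH_IP_VERSION(7, 4, 4):
		return !connected_to_cpu;
	default:
		/* nbio ras is not available */
		return false;
	}
}
```

If the switch expression were an ASIC enum instead, none of these labels would be expected to match, which is the mismatch the comment points at.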


RE: [PATCH 10/11] drm/amdgpu: Rework pcie_bif ras sw_init

2023-03-05 Thread Yang, Stanley



> -Original Message-
> From: Zhang, Hawking 
> Sent: Monday, March 6, 2023 10:32 AM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao ;
> Yang, Stanley ; Li, Candice ;
> Chai, Thomas 
> Cc: Zhang, Hawking 
> Subject: [PATCH 10/11] drm/amdgpu: Rework pcie_bif ras sw_init
> 
> pcie_bif ras blocks needs to be initialized as early as possible to handle 
> fatal
> error detected in hw_init phase. also align the pcie_bif ras sw_init with 
> other
> ras blocks
> 
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.c | 23
> +++
> drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.h |  1 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c  | 16 
>  3 files changed, 36 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.c
> index 37d779b8e4a6..a3bc00577a7c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.c
> @@ -22,6 +22,29 @@
>  #include "amdgpu.h"
>  #include "amdgpu_ras.h"
> 
> +int amdgpu_nbio_ras_sw_init(struct amdgpu_device *adev) {
> + int err;
> + struct amdgpu_nbio_ras *ras;
> +
> + if (!adev->nbio.ras)
> + return 0;
> +
> + ras = adev->nbio.ras;
> + err = amdgpu_ras_register_ras_block(adev, &ras->ras_block);
> + if (err) {
> + dev_err(adev->dev, "Failed to register pcie_bif ras block!\n");
> + return err;
> + }
> +
> + strcpy(ras->ras_block.ras_comm.name, "pcie_bif");
> + ras->ras_block.ras_comm.block = AMDGPU_RAS_BLOCK__PCIE_BIF;
> + ras->ras_block.ras_comm.type =
> AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE;
> + adev->nbio.ras_if = &ras->ras_block.ras_comm;
> +
> + return 0;
> +}
> +
>  int amdgpu_nbio_ras_late_init(struct amdgpu_device *adev, struct
> ras_common_if *ras_block)  {
>   int r;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.h
> index a240336bbc6b..c686ff4bcc39 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_nbio.h
> @@ -106,5 +106,6 @@ struct amdgpu_nbio {
>   struct amdgpu_nbio_ras  *ras;
>  };
> 
> +int amdgpu_nbio_ras_sw_init(struct amdgpu_device *adev);
>  int amdgpu_nbio_ras_late_init(struct amdgpu_device *adev, struct
> ras_common_if *ras_block);  #endif diff --git
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 63dfcc98152d..f42480b8a8d3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2558,17 +2558,25 @@ int amdgpu_ras_init(struct amdgpu_device
> *adev)
>   case CHIP_VEGA20:
>   case CHIP_ARCTURUS:
>   case CHIP_ALDEBARAN:
> - if (!adev->gmc.xgmi.connected_to_cpu) {
> + if (!adev->gmc.xgmi.connected_to_cpu)

[Stanley]: Same as patch#8 and patch#9.

Regards,
Stanley
>   adev->nbio.ras = &nbio_v7_4_ras;
> - amdgpu_ras_register_ras_block(adev,
> &adev->nbio.ras->ras_block);
> - adev->nbio.ras_if = &adev->nbio.ras-
> >ras_block.ras_comm;
> - }
>   break;
>   default:
>   /* nbio ras is not available */
>   break;
>   }
> 
> + /* nbio ras block needs to be enabled ahead of other ras blocks
> +  * to handle fatal error */
> + if (!adev->gmc.xgmi.connected_to_cpu &&
> + amdgpu_ras_is_supported(adev,
> AMDGPU_RAS_BLOCK__PCIE_BIF)) {

[Stanley]: Do we need to check gmc.xgmi.connected_to_cpu here? According to the
amdgpu_ras_check_supported function, the AMDGPU_RAS_BLOCK__PCIE_BIF bit flag is
not set when xgmi.connected_to_cpu is set.

Regards,
Stanley
> + r = amdgpu_nbio_ras_sw_init(adev);
> + if (r) {
> + dev_err(adev->dev, "Failed to initialize pcie_bif ras
> block!\n");
> + return r;
> + }
> + }
> +
>   if (adev->nbio.ras &&
>   adev->nbio.ras->init_ras_controller_interrupt) {
>   r = adev->nbio.ras->init_ras_controller_interrupt(adev);
> --
> 2.17.1
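
The redundancy raised in the second inline comment can be illustrated with a small model. The sketch below assumes, as that comment states, that the supported mask never carries the PCIE_BIF bit when the device is connected to the CPU over xgmi. The sketch_* names and bit positions are hypothetical, and the real amdgpu_ras_check_supported logic is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical block ids; the values are illustrative only. */
enum sketch_ras_block_id {
	SKETCH_BLOCK_GFX = 0,
	SKETCH_BLOCK_PCIE_BIF = 1,
};

#define SKETCH_BIT(b) (1u << (b))

/* Model of the behaviour described in the review comment: the PCIE_BIF
 * bit is cleared from the mask when connected_to_cpu is set. */
static uint32_t sketch_check_supported(uint32_t hw_mask, bool connected_to_cpu)
{
	if (connected_to_cpu)
		hw_mask &= ~SKETCH_BIT(SKETCH_BLOCK_PCIE_BIF);
	return hw_mask;
}

static bool sketch_is_supported(uint32_t mask, enum sketch_ras_block_id block)
{
	assert((unsigned int)block < 32u);
	return (mask & SKETCH_BIT(block)) != 0;
}
```

Under that assumption, "!connected_to_cpu && is_supported(PCIE_BIF)" and "is_supported(PCIE_BIF)" always evaluate to the same value, so the extra check adds nothing.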



RE: [PATCH 09/11] drm/amdgpu: Rework xgmi_wafl_pcs ras sw_init

2023-03-05 Thread Yang, Stanley



> -Original Message-
> From: Zhang, Hawking 
> Sent: Monday, March 6, 2023 10:32 AM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao ;
> Yang, Stanley ; Li, Candice ;
> Chai, Thomas 
> Cc: Zhang, Hawking 
> Subject: [PATCH 09/11] drm/amdgpu: Rework xgmi_wafl_pcs ras sw_init
> 
> To align with other IP blocks.
> 
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c  |  9 +---
> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 28
> +++-
> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h |  1 +
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c|  7 ++
>  4 files changed, 37 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 524e2c9b3012..d4685d22be60 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -500,9 +500,12 @@ int amdgpu_gmc_ras_sw_init(struct amdgpu_device
> *adev)
> 
>   /* xgmi ras block */
>   if (amdgpu_ras_is_supported(adev,
> AMDGPU_RAS_BLOCK__XGMI_WAFL)) {
> - adev->gmc.xgmi.ras = &xgmi_ras;
> - amdgpu_ras_register_ras_block(adev,
> &adev->gmc.xgmi.ras->ras_block);
> - adev->gmc.xgmi.ras_if = &adev->gmc.xgmi.ras-
> >ras_block.ras_comm;
> + r = amdgpu_xgmi_ras_sw_init(adev);
> + if (r) {
> + dev_err(adev->dev, "Failed to initialize
> xgmi_wafl_pcs ras block!\n");
> + return r;
> + }
> +
>   }
> 
>   return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> index fef1575cd0cf..3fe24348d199 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> @@ -1048,12 +1048,30 @@ struct amdgpu_ras_block_hw_ops
> xgmi_ras_hw_ops = {
> 
>  struct amdgpu_xgmi_ras xgmi_ras = {
>   .ras_block = {
> - .ras_comm = {
> - .name = "xgmi_wafl",
> - .block = AMDGPU_RAS_BLOCK__XGMI_WAFL,
> - .type =
> AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE,
> - },
>   .hw_ops = &xgmi_ras_hw_ops,
>   .ras_late_init = amdgpu_xgmi_ras_late_init,
>   },
>  };
> +
> +int amdgpu_xgmi_ras_sw_init(struct amdgpu_device *adev) {
> + int err;
> + struct amdgpu_xgmi_ras *ras;
> +
> + if (!adev->gmc.xgmi.ras)
> + return 0;
> +
> + ras = adev->gmc.xgmi.ras;
> + err = amdgpu_ras_register_ras_block(adev, &ras->ras_block);
> + if (err) {
> + dev_err(adev->dev, "Failed to register xgmi_wafl_pcs ras
> block!\n");
> + return err;
> + }
> +
> + strcpy(ras->ras_block.ras_comm.name, "xgmi_wafl_pcs");
> + ras->ras_block.ras_comm.block =
> AMDGPU_RAS_BLOCK__XGMI_WAFL;
> + ras->ras_block.ras_comm.type =
> AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE;
> + adev->gmc.xgmi.ras_if = &ras->ras_block.ras_comm;
> +
> + return 0;
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> index 30dcc1681b4e..86fbf56938f4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> @@ -73,5 +73,6 @@ static inline bool amdgpu_xgmi_same_hive(struct
> amdgpu_device *adev,
>   adev->gmc.xgmi.hive_id &&
>   adev->gmc.xgmi.hive_id == bo_adev->gmc.xgmi.hive_id);  }
> +int amdgpu_xgmi_ras_sw_init(struct amdgpu_device *adev);
> 
>  #endif
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 67c2a5186b8a..2a8dc9b52c2d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -1381,6 +1381,12 @@ static void gmc_v9_0_set_mca_ras_funcs(struct
> amdgpu_device *adev)
>   }
>  }
> 
> +static void gmc_v9_0_set_xgmi_ras_funcs(struct amdgpu_device *adev) {
> + if (!adev->gmc.xgmi.connected_to_cpu)

[Stanley]: Can we use if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__XGMI_WAFL))
instead of if (!adev->gmc.xgmi.connected_to_cpu), to keep the IP RAS checks
uniform?

Regards,
Stanley
> + adev->gmc.xgmi.ras = &xgmi_ras;
> +}
> +
>  static int gmc_v9_0_early_init(void *handle)  {
>   struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> @@ -1404,6 +1410,7 @@ static int gmc_v9_0_early_init(void *handle)
>   gmc_v9_0_set_gfxhub_funcs(adev);
>   gmc_v9_0_set_hdp_ras_funcs(adev);
>   gmc_v9_0_set_mca_ras_funcs(adev);
> + gmc_v9_0_set_xgmi_ras_funcs(adev);
> 
>   adev->gmc.shared_aperture_start = 0x2000ULL;
>   adev->gmc.shared_aperture_end =
> --
> 2.17.1



RE: [PATCH 08/11] drm/amdgpu: Rework mca ras sw_init

2023-03-05 Thread Yang, Stanley



> -Original Message-
> From: Zhang, Hawking 
> Sent: Monday, March 6, 2023 10:32 AM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao ;
> Yang, Stanley ; Li, Candice ;
> Chai, Thomas 
> Cc: Zhang, Hawking 
> Subject: [PATCH 08/11] drm/amdgpu: Rework mca ras sw_init
> 
> To align with other IP blocks
> 
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 21 
> drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c | 72
> +
> drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h |  9 ++--
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c   | 15 +++---
>  drivers/gpu/drm/amd/amdgpu/mca_v3_0.c   | 44 ++-
>  drivers/gpu/drm/amd/amdgpu/mca_v3_0.h   |  4 +-
>  6 files changed, 111 insertions(+), 54 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 087a75374610..524e2c9b3012 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -477,6 +477,27 @@ int amdgpu_gmc_ras_sw_init(struct amdgpu_device
> *adev)
>   }
>   }
> 
> + /* mca.x ras block */
> + if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__MCA))
> {
> + r = amdgpu_mca_mp0_ras_sw_init(adev);
> + if (r) {
> + dev_err(adev->dev, "Failed to initialize mca.mp0 ras
> block!\n");
> + return r;
> + }
> +
> + r = amdgpu_mca_mp1_ras_sw_init(adev);
> + if (r) {
> + dev_err(adev->dev, "Failed to initialize mca.mp1 ras
> block!\n");
> + return r;
> + }
> +
> + r = amdgpu_mca_mpio_ras_sw_init(adev);
> + if (r) {
> + dev_err(adev->dev, "Failed to initialize mca.mpio ras
> block!\n");
> + return r;
> + }
> + }
> +
>   /* xgmi ras block */
>   if (amdgpu_ras_is_supported(adev,
> AMDGPU_RAS_BLOCK__XGMI_WAFL)) {
>   adev->gmc.xgmi.ras = &xgmi_ras;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
> index 51c2a82e2fa4..0b545bdcd636 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
> @@ -70,3 +70,75 @@ void amdgpu_mca_query_ras_error_count(struct
> amdgpu_device *adev,
> 
>   amdgpu_mca_reset_error_count(adev, mc_status_addr);  }
> +
> +int amdgpu_mca_mp0_ras_sw_init(struct amdgpu_device *adev) {
> + int err;
> + struct amdgpu_mca_ras_block *ras;
> +
> + if (!adev->mca.mp0.ras)
> + return 0;
> +
> + ras = adev->mca.mp0.ras;
> +
> + err = amdgpu_ras_register_ras_block(adev, &ras->ras_block);
> + if (err) {
> + dev_err(adev->dev, "Failed to register mca.mp0 ras
> block!\n");
> + return err;
> + }
> +
> + strcpy(ras->ras_block.ras_comm.name, "mca.mp0");
> + ras->ras_block.ras_comm.block = AMDGPU_RAS_BLOCK__MCA;
> + ras->ras_block.ras_comm.type =
> AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE;
> + adev->mca.mp0.ras_if = &ras->ras_block.ras_comm;
> +
> + return 0;
> +}
> +
> +int amdgpu_mca_mp1_ras_sw_init(struct amdgpu_device *adev) {
> +int err;
> +struct amdgpu_mca_ras_block *ras;
> +
> +if (!adev->mca.mp1.ras)
> +return 0;
> +
> +ras = adev->mca.mp1.ras;
> +
> +err = amdgpu_ras_register_ras_block(adev, &ras->ras_block);
> +if (err) {
> +dev_err(adev->dev, "Failed to register mca.mp1 ras 
> block!\n");
> +return err;
> +}
> +
> +strcpy(ras->ras_block.ras_comm.name, "mca.mp1");
> +ras->ras_block.ras_comm.block = AMDGPU_RAS_BLOCK__MCA;
> +ras->ras_block.ras_comm.type =
> AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE;
> +adev->mca.mp1.ras_if = &ras->ras_block.ras_comm;
> +
> +return 0;
> +}
> +
> +int amdgpu_mca_mpio_ras_sw_init(struct amdgpu_device *adev) {
> +int err;
> +struct amdgpu_mca_ras_block *ras;
> +
> +if (!adev->mca.mpio.ras)
> +return 0;
> +
> +ras = adev->mca.mpio.ras;
> +
> +err = amdgpu_ras_register_ras_block(adev, &ras->ras_block);
> +if (err) {
> +dev_err(adev->dev, "Failed to register mca.mpio ras 
> block!\n");
> +return err;
> +}
> +
> +strcpy(ras->ras_block.ras_c

RE: [PATCH 03/11] drm/amdgpu: Init gfx ras block when ras is supported

2023-03-05 Thread Yang, Stanley
I have one additional suggestion: amdgpu_gfx_ras_sw_init is declared twice in
amdgpu_gfx.h, so one of the declarations can be removed in this patch.

Regards,
Stanley
> -Original Message-
> From: Zhang, Hawking 
> Sent: Monday, March 6, 2023 10:32 AM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao ;
> Yang, Stanley ; Li, Candice ;
> Chai, Thomas 
> Cc: Zhang, Hawking 
> Subject: [PATCH 03/11] drm/amdgpu: Init gfx ras block when ras is supported
> 
> Initialize gfx ras block only when gfx ip block supports ras features.
> 
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 9 ++---
> drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c  | 9 ++---
>  2 files changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> index 3bf697a80cf2..d7d4847e2644 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> @@ -1409,9 +1409,12 @@ static int gfx_v11_0_sw_init(void *handle)
>   if (r)
>   return r;
> 
> - if (amdgpu_gfx_ras_sw_init(adev)) {
> - dev_err(adev->dev, "Failed to initialize gfx ras block!\n");
> - return -EINVAL;
> + if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX)) {
> + r = amdgpu_gfx_ras_sw_init(adev);
> + if (r) {
> + dev_err(adev->dev, "Failed to initialize gfx ras
> block!\n");
> + return r;
> + }
>   }
> 
>   return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index ae09fc1cfe6b..c9657e89d40e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -2177,9 +2177,12 @@ static int gfx_v9_0_sw_init(void *handle)
>   if (r)
>   return r;
> 
> - if (amdgpu_gfx_ras_sw_init(adev)) {
> - dev_err(adev->dev, "Failed to initialize gfx ras block!\n");
> - return -EINVAL;
> + if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX)) {
> + r = amdgpu_gfx_ras_sw_init(adev);
> + if (r) {
> + dev_err(adev->dev, "Failed to initialize gfx ras
> block!\n");
> + return r;
> + }
>   }
> 
>   return 0;
> --
> 2.17.1
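
Two things change in each hunk above: the init call is guarded by a support check, and the real error code is returned instead of a fixed -EINVAL. The sketch below models both shapes with stand-in functions. All sketch_* names are hypothetical, and the failing init is simulated by an injected error value.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the ras sw_init call: returns the injected error as-is.
 * Kernel-style return values are zero or negative. */
static int sketch_ras_sw_init(int injected_err)
{
	assert(injected_err <= 0);
	return injected_err;
}

/* Pre-patch shape: always calls init and collapses any failure to -EINVAL. */
static int sketch_sw_init_old(int injected_err)
{
	if (sketch_ras_sw_init(injected_err))
		return -EINVAL;
	return 0;
}

/* Post-patch shape: calls init only when supported and propagates the
 * original error code. */
static int sketch_sw_init_new(bool supported, int injected_err)
{
	int r;

	if (supported) {
		r = sketch_ras_sw_init(injected_err);
		if (r)
			return r;
	}
	return 0;
}
```

In this model an -ENOMEM from the init step reaches the caller unchanged in the new shape, while an unsupported block skips the init step entirely.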



RE: [PATCH] drm/amd/pm: Enable ecc_info table support for smu v13_0_10

2023-03-01 Thread Yang, Stanley
[AMD Official Use Only - General]

Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: amd-gfx  On Behalf Of
> Candice Li
> Sent: Wednesday, March 1, 2023 2:10 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Li, Candice 
> Subject: [PATCH] drm/amd/pm: Enable ecc_info table support for smu
> v13_0_10
> 
> Support EccInfoTable which includes umc ras error count and error address.
> 
> Signed-off-by: Candice Li 
> Reviewed-by: Evan Quan 
> ---
>  .../drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c  | 75
> +++
>  1 file changed, 75 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
> index 923a9fb3c8873c..27448ffe60a439 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
> @@ -46,6 +46,7 @@
>  #include "asic_reg/mp/mp_13_0_0_sh_mask.h"
>  #include "smu_cmn.h"
>  #include "amdgpu_ras.h"
> +#include "umc_v8_10.h"
> 
>  /*
>   * DO NOT use these for err/warn/info/debug messages.
> @@ -90,6 +91,12 @@
> 
>  #define DEBUGSMC_MSG_Mode1Reset  2
> 
> +/*
> + * SMU_v13_0_10 supports ECCTABLE since version 80.34.0,
> + * use this to check ECCTABLE feature whether support
> + */
> +#define SUPPORT_ECCTABLE_SMU_13_0_10_VERSION 0x00502200
> +
>  static struct cmn2asic_msg_mapping
> smu_v13_0_0_message_map[SMU_MSG_MAX_COUNT] = {
>   MSG_MAP(TestMessage,
>   PPSMC_MSG_TestMessage, 1),
>   MSG_MAP(GetSmuVersion,
>   PPSMC_MSG_GetSmuVersion,   1),
> @@ -229,6 +236,7 @@ static struct cmn2asic_mapping
> smu_v13_0_0_table_map[SMU_TABLE_COUNT] = {
>   TAB_MAP(ACTIVITY_MONITOR_COEFF),
>   [SMU_TABLE_COMBO_PPTABLE] = {1, TABLE_COMBO_PPTABLE},
>   TAB_MAP(I2C_COMMANDS),
> + TAB_MAP(ECCINFO),
>  };
> 
>  static struct cmn2asic_mapping
> smu_v13_0_0_pwr_src_map[SMU_POWER_SOURCE_COUNT] = { @@ -462,6
> +470,8 @@ static int smu_v13_0_0_tables_init(struct smu_context *smu)
>  AMDGPU_GEM_DOMAIN_VRAM);
>   SMU_TABLE_INIT(tables, SMU_TABLE_COMBO_PPTABLE,
> MP0_MP1_DATA_REGION_SIZE_COMBOPPTABLE,
>   PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> + SMU_TABLE_INIT(tables, SMU_TABLE_ECCINFO,
> sizeof(EccInfoTable_t),
> + PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> 
>   smu_table->metrics_table = kzalloc(sizeof(SmuMetricsExternal_t),
> GFP_KERNEL);
>   if (!smu_table->metrics_table)
> @@ -477,8 +487,14 @@ static int smu_v13_0_0_tables_init(struct
> smu_context *smu)
>   if (!smu_table->watermarks_table)
>   goto err2_out;
> 
> + smu_table->ecc_table = kzalloc(tables[SMU_TABLE_ECCINFO].size,
> GFP_KERNEL);
> + if (!smu_table->ecc_table)
> + goto err3_out;
> +
>   return 0;
> 
> +err3_out:
> + kfree(smu_table->watermarks_table);
>  err2_out:
>   kfree(smu_table->gpu_metrics_table);
>  err1_out:
> @@ -2036,6 +2052,64 @@ static int
> smu_v13_0_0_send_bad_mem_channel_flag(struct smu_context *smu,
>   return ret;
>  }
> 
> +static int smu_v13_0_0_check_ecc_table_support(struct smu_context
> *smu)
> +{
> + struct amdgpu_device *adev = smu->adev;
> + uint32_t if_version = 0xff, smu_version = 0xff;
> + int ret = 0;
> +
> + ret = smu_cmn_get_smc_version(smu, &if_version, &smu_version);
> + if (ret)
> + return -EOPNOTSUPP;
> +
> + if ((adev->ip_versions[MP1_HWIP][0] == IP_VERSION(13, 0, 10)) &&
> + (smu_version >=
> SUPPORT_ECCTABLE_SMU_13_0_10_VERSION))
> + return ret;
> + else
> + return -EOPNOTSUPP;
> +}
> +
> +static ssize_t smu_v13_0_0_get_ecc_info(struct smu_context *smu,
> + void
> *table)
> +{
> + struct smu_table_context *smu_table = &smu->smu_table;
> + struct amdgpu_device *adev = smu->adev;
> + EccInfoTable_t *ecc_table = NULL;
> + struct ecc_info_per_ch *ecc_info_per_channel = NULL;
> + int i, ret = 0;
> + struct umc_ecc_info *eccinfo = (struct umc_ecc_info *)table;
> +
> + ret = smu_v13_0_0_check_ecc_table_support(smu);
> + if (ret)
> + return ret;
> +
> + ret = smu_cmn_update_table(smu,
> + SMU_TABLE_ECCINFO,
> + 0,
> + smu_table->ecc_table,
> + false);
> + if (ret) {
> + dev_info(adev->dev, "Failed to export SMU ecc table!\n");
> + return ret;
> + }
> +
> + ecc_table = (EccInfoTable_t *)smu_table->ecc_table;
> +
> + for (i = 0; i < UMC_V8_10_TOTAL_CHANNEL_NUM(adev); i++) {
> + ecc_info_per_channel = &(eccinfo->ecc[i]);
> + ecc_info_per_channel->ce_count_lo_chip =
> + ecc_table->EccInfo[i].ce_count_lo_chip;
> + ecc_info_per_channel->ce_count_hi_chip =
> 
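
The threshold 0x00502200 in the patch lines up with the "80.34.0" in its comment if the firmware version is packed one component per byte (80 = 0x50, 34 = 0x22, 0 = 0x00). The sketch below works through that arithmetic. The packing is an assumption inferred from those two values, and the sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Threshold value as it appears in the patch. */
#define SKETCH_ECCTABLE_MIN_VERSION 0x00502200u

/* Assumed packing: major in bits 16-23, minor in bits 8-15, patch in bits 0-7. */
#define SKETCH_SMU_VERSION(mj, mn, pt) \
	(((uint32_t)(mj) << 16) | ((uint32_t)(mn) << 8) | (uint32_t)(pt))

/* 80.34.0 packs to exactly the threshold under this assumption. */
static_assert(SKETCH_SMU_VERSION(80, 34, 0) == SKETCH_ECCTABLE_MIN_VERSION,
	      "packing does not match the threshold");

/* Hypothetical predicate mirroring the ">=" comparison in the patch. */
static bool sketch_ecc_table_supported(uint32_t smu_version)
{
	return smu_version >= SKETCH_ECCTABLE_MIN_VERSION;
}
```

Because the components sit in descending byte order, a plain integer comparison orders versions the same way a component-wise comparison would, as long as each component fits in one byte.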

RE: [PATCH 2/2] drm/amdgpu: Add ecc info query interface for umc v8_10

2023-02-21 Thread Yang, Stanley
[AMD Official Use Only - General]

The series is Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: amd-gfx  On Behalf Of
> Candice Li
> Sent: Wednesday, February 22, 2023 12:35 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Li, Candice 
> Subject: [PATCH 2/2] drm/amdgpu: Add ecc info query interface for umc
> v8_10
> 
> Support ecc info query for umc v8_10.
> 
> v2: Simplified by convert_error_address.
> v3: Remove unused variable and invalid checking.
> 
> Signed-off-by: Candice Li 
> Reviewed-by: Tao Zhou 
> Reviewed-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/umc_v8_10.c | 134
> +
>  1 file changed, 134 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v8_10.c
> b/drivers/gpu/drm/amd/amdgpu/umc_v8_10.c
> index 293ba39c8a2fda..66158219f791cb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umc_v8_10.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v8_10.c
> @@ -360,6 +360,138 @@ static bool
> umc_v8_10_query_ras_poison_mode(struct amdgpu_device *adev)
>   return true;
>  }
> 
> +static void umc_v8_10_ecc_info_query_correctable_error_count(struct
> amdgpu_device *adev,
> +   uint32_t node_inst, uint32_t umc_inst,
> uint32_t ch_inst,
> +   unsigned long *error_count)
> +{
> + uint64_t mc_umc_status;
> + uint32_t eccinfo_table_idx;
> + struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> +
> + eccinfo_table_idx = node_inst * adev->umc.umc_inst_num *
> +   adev->umc.channel_inst_num +
> +   umc_inst * adev->umc.channel_inst_num +
> +   ch_inst;
> +
> + /* check the MCUMC_STATUS */
> + mc_umc_status = ras-
> >umc_ecc.ecc[eccinfo_table_idx].mca_umc_status;
> + if (REG_GET_FIELD(mc_umc_status,
> MCA_UMC_UMC0_MCUMC_STATUST0, Val) == 1 &&
> + REG_GET_FIELD(mc_umc_status,
> MCA_UMC_UMC0_MCUMC_STATUST0, CECC) == 1) {
> + *error_count += 1;
> + }
> +}
> +
> +static void umc_v8_10_ecc_info_query_uncorrectable_error_count(struct
> amdgpu_device *adev,
> +   uint32_t node_inst, uint32_t umc_inst,
> uint32_t ch_inst,
> +   unsigned long *error_count)
> +{
> + uint64_t mc_umc_status;
> + uint32_t eccinfo_table_idx;
> + struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> +
> + eccinfo_table_idx = node_inst * adev->umc.umc_inst_num *
> +   adev->umc.channel_inst_num +
> +   umc_inst * adev->umc.channel_inst_num +
> +   ch_inst;
> +
> + /* check the MCUMC_STATUS */
> + mc_umc_status = ras-
> >umc_ecc.ecc[eccinfo_table_idx].mca_umc_status;
> + if ((REG_GET_FIELD(mc_umc_status,
> MCA_UMC_UMC0_MCUMC_STATUST0, Val) == 1) &&
> + (REG_GET_FIELD(mc_umc_status,
> MCA_UMC_UMC0_MCUMC_STATUST0, Deferred) == 1 ||
> + REG_GET_FIELD(mc_umc_status,
> MCA_UMC_UMC0_MCUMC_STATUST0, UECC) == 1 ||
> + REG_GET_FIELD(mc_umc_status,
> MCA_UMC_UMC0_MCUMC_STATUST0, PCC) == 1 ||
> + REG_GET_FIELD(mc_umc_status,
> MCA_UMC_UMC0_MCUMC_STATUST0, UC) == 1 ||
> + REG_GET_FIELD(mc_umc_status,
> MCA_UMC_UMC0_MCUMC_STATUST0, TCC) == 1)) {
> + *error_count += 1;
> + }
> +}
> +
> +static void umc_v8_10_ecc_info_query_ras_error_count(struct
> amdgpu_device *adev,
> + void *ras_error_status)
> +{
> + struct ras_err_data *err_data = (struct ras_err_data
> *)ras_error_status;
> +
> + uint32_t node_inst   = 0;
> + uint32_t umc_inst= 0;
> + uint32_t ch_inst = 0;
> +
> + /* TODO: driver needs to toggle DF Cstate to ensure
> +  * safe access of UMC registers. Will add the protection
> +  */
> + LOOP_UMC_EACH_NODE_INST_AND_CH(node_inst, umc_inst,
> ch_inst) {
> + umc_v8_10_ecc_info_query_correctable_error_count(adev,
> + node_inst, umc_inst,
> ch_inst,
> + &(err_data-
> >ce_count));
> +
>   umc_v8_10_ecc_info_query_uncorrectable_error_count(adev,
> + node_inst, umc_inst,
> ch_inst,
> + &(err_data-
> >ue_count));
> + }
> +}
> +
> +static void umc_v8_10_ecc_info_query_error_address(struct
> amdgpu_device *adev,
> + struct ras_err_data *err_data,
> + uint32_t ch_inst,
> + uint32_t umc_inst,
> + uint32_t node_inst)
> +{
> + uint32_t eccinfo_table_idx, channel_index;
> + uint64_t mc_umc_status, err_addr;
> +
> + struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> +
> + eccinfo_table_idx = node_inst * 
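
The eccinfo_table_idx expression used in the count helpers above is a row-major flattening of (node, umc, channel) into one table index. A minimal sketch of the same arithmetic follows, with the instance counts passed in explicitly. The function name and the sample counts used in the note after it are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Flattens (node_inst, umc_inst, ch_inst) into a single index, using the
 * same formula as the quoted helpers:
 *   node_inst * umc_inst_num * channel_inst_num
 *   + umc_inst * channel_inst_num
 *   + ch_inst
 */
static uint32_t sketch_eccinfo_idx(uint32_t node_inst, uint32_t umc_inst,
				   uint32_t ch_inst, uint32_t umc_inst_num,
				   uint32_t channel_inst_num)
{
	/* The mapping is only unique while the inner indices stay in range. */
	assert(umc_inst < umc_inst_num);
	assert(ch_inst < channel_inst_num);

	return node_inst * umc_inst_num * channel_inst_num +
	       umc_inst * channel_inst_num +
	       ch_inst;
}
```

For example, with 2 umc instances per node and 4 channels per umc (made-up counts), the tuple (1, 1, 3) maps to 1*2*4 + 1*4 + 3 = 15, the last slot of the second node.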

RE: [PATCH 2/2] drm/amdgpu: exclude duplicate pages from UMC RAS UE count

2023-02-21 Thread Yang, Stanley
[AMD Official Use Only - General]

Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: Zhou1, Tao 
> Sent: Wednesday, February 22, 2023 10:53 AM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Yang, Stanley ; Chai,
> Thomas ; Li, Candice ; Lazar,
> Lijo 
> Subject: RE: [PATCH 2/2] drm/amdgpu: exclude duplicate pages from UMC
> RAS UE count
> 
> Ping...
> 
> > -Original Message-
> > From: Zhou1, Tao 
> > Sent: Monday, February 20, 2023 11:17 AM
> > To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> > ; Yang, Stanley ;
> Chai,
> > Thomas ; Li, Candice ;
> Lazar,
> > Lijo 
> > Cc: Zhou1, Tao 
> > Subject: [PATCH 2/2] drm/amdgpu: exclude duplicate pages from UMC RAS
> > UE count
> >
> > If a UMC bad page is reserved but not freed by an application, the
> > application may trigger uncorrectable error repeatedly by accessing the page.
> >
> > v2: add specific function to do the check.
> > v3: remove duplicate pages, calculate new added bad page number.
> > v4: reuse save_bad_pages to calculate new added bad page number.
> >
> > Signed-off-by: Tao Zhou 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 16 +---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  3 ++-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c |  5 +++--
> >  3 files changed, 18 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index 6e543558386d..5c02c6c9f773 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -176,7 +176,7 @@ static int amdgpu_reserve_page_direct(struct
> > amdgpu_device *adev, uint64_t addre
> > if (amdgpu_bad_page_threshold != 0) {
> > amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
> >  err_data.err_addr_cnt);
> > -   amdgpu_ras_save_bad_pages(adev);
> > +   amdgpu_ras_save_bad_pages(adev, NULL);
> > }
> >
> > dev_warn(adev->dev, "WARNING: THIS IS ONLY FOR TEST PURPOSES
> AND
> > WILL CORRUPT RAS EEPROM\n"); @@ -2084,22 +2084,32 @@ int
> > amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
> >  /*
> >   * write error record array to eeprom, the function should be
> >   * protected by recovery_lock
> > + * new_cnt: new added UE count, excluding reserved bad pages, can be
> > + NULL
> >   */
> > -int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
> > +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev,
> > +   unsigned long *new_cnt)
> >  {
> > struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> > struct ras_err_handler_data *data;
> > struct amdgpu_ras_eeprom_control *control;
> > int save_count;
> >
> > -   if (!con || !con->eh_data)
> > +   if (!con || !con->eh_data) {
> > +   if (new_cnt)
> > +   *new_cnt = 0;
> > +
> > return 0;
> > +   }
> >
> > mutex_lock(&con->recovery_lock);
> > control = &con->eeprom_control;
> > data = con->eh_data;
> > save_count = data->count - control->ras_num_recs;
> > mutex_unlock(&con->recovery_lock);
> > +
> > +   if (new_cnt)
> > +   *new_cnt = save_count / adev->umc.retire_unit;
> > +
> > /* only new entries are saved */
> > if (save_count > 0) {
> > if (amdgpu_ras_eeprom_append(control,
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > index f2ad93f6..ef38f4c93df0 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > @@ -547,7 +547,8 @@ int amdgpu_ras_query_error_count(struct
> > amdgpu_device *adev,  int amdgpu_ras_add_bad_pages(struct
> > amdgpu_device *adev,
> > struct eeprom_table_record *bps, int pages);
> >
> > -int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
> > +int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev,
> > +   unsigned long *new_cnt);
> >
> >  static inline enum ta_ras_block
> >  amdgpu_ras_block_to_ta(enum amdgpu_ras_block block) { diff --git
> > a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > index 1c7fcb4f2380..7c6fc3214339 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>

RE: [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in ras_eeprom_check_err

2023-02-21 Thread Yang, Stanley
[AMD Official Use Only - General]

The series is Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: Zhou1, Tao 
> Sent: Wednesday, February 22, 2023 10:52 AM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Yang, Stanley ; Chai,
> Thomas ; Li, Candice ; Lazar,
> Lijo 
> Cc: Zhou1, Tao 
> Subject: [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in
> ras_eeprom_check_err
> 
> bad_page_threshold controls page retirement behavior and it should be also
> checked.
> 
> v2: simplify the condition of bad page handling path.
> 
> Signed-off-by: Tao Zhou 
> ---
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 19 ++-
> 
>  1 file changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 9d370465b08d..2e08fce87521 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -417,7 +417,8 @@ bool
> amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)  {
>   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> 
> - if (!__is_ras_eeprom_supported(adev))
> + if (!__is_ras_eeprom_supported(adev) ||
> + !amdgpu_bad_page_threshold)
>   return false;
> 
>   /* skip check eeprom table for VEGA20 Gaming */ @@ -428,10
> +429,18 @@ bool amdgpu_ras_eeprom_check_err_threshold(struct
> amdgpu_device *adev)
>   return false;
> 
>   if (con->eeprom_control.tbl_hdr.header == RAS_TABLE_HDR_BAD) {
> - dev_warn(adev->dev, "This GPU is in BAD status.");
> - dev_warn(adev->dev, "Please retire it or set a larger "
> -  "threshold value when reloading driver.\n");
> - return true;
> + if (amdgpu_bad_page_threshold == -1) {
> + dev_warn(adev->dev, "RAS records:%d exceed
> threshold:%d",
> + con->eeprom_control.ras_num_recs, con-
> >bad_page_cnt_threshold);
> + dev_warn(adev->dev,
> + "But GPU can be operated due to
> bad_page_threshold = -1.\n");
> + return false;
> + } else {
> + dev_warn(adev->dev, "This GPU is in BAD status.");
> + dev_warn(adev->dev, "Please retire it or set a larger
> "
> +  "threshold value when reloading driver.\n");
> + return true;
> + }
>   }
> 
>   return false;
> --
> 2.35.1


RE: [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in ras_eeprom_check_err

2023-02-21 Thread Yang, Stanley
[AMD Official Use Only - General]



> -Original Message-
> From: Zhou1, Tao 
> Sent: Tuesday, February 21, 2023 4:29 PM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Yang, Stanley ; Chai,
> Thomas ; Li, Candice 
> Cc: Zhou1, Tao 
> Subject: [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in
> ras_eeprom_check_err
> 
> bad_page_threshold controls page retirement behavior and it should be also
> checked.
> 
> Signed-off-by: Tao Zhou 
> ---
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 20 ++-
> 
>  1 file changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 9d370465b08d..c88123896fe8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -417,7 +417,8 @@ bool
> amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)  {
>   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> 
> - if (!__is_ras_eeprom_supported(adev))
> + if (!__is_ras_eeprom_supported(adev) ||
> + !amdgpu_bad_page_threshold)
>   return false;
> 
>   /* skip check eeprom table for VEGA20 Gaming */ @@ -428,10
> +429,19 @@ bool amdgpu_ras_eeprom_check_err_threshold(struct
> amdgpu_device *adev)
>   return false;
> 
>   if (con->eeprom_control.tbl_hdr.header == RAS_TABLE_HDR_BAD) {
> - dev_warn(adev->dev, "This GPU is in BAD status.");
> - dev_warn(adev->dev, "Please retire it or set a larger "
> -  "threshold value when reloading driver.\n");
> - return true;
> + if (amdgpu_bad_page_threshold == -1) {
> + dev_warn(adev->dev, "RAS records:%d exceed
> threshold:%d",
> + con->eeprom_control.ras_num_recs, con-
> >bad_page_cnt_threshold);
> + dev_warn(adev->dev,
> + "But GPU can be operated due to
> bad_page_threshold = -1.\n");
> + return false;
> + } else if (amdgpu_bad_page_threshold > 0 ||
> + amdgpu_bad_page_threshold == -2) {

Stanley: There is no guarantee that the user sets amdgpu_bad_page_threshold to
an expected value (it could be -3, for example). How about simplifying this
condition as below:
else if (amdgpu_bad_page_threshold) {
...
}
With that change, the value -2 in patch#1 isn't needed anymore.

Regards,
Stanley
> + dev_warn(adev->dev, "This GPU is in BAD status.");
> + dev_warn(adev->dev, "Please retire it or set a larger
> "
> +  "threshold value when reloading driver.\n");
> + return true;
> + }
>   }
> 
>   return false;
> --
> 2.35.1
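
The simplified condition suggested in the review comment above can be modelled
outside the driver roughly as follows. This is only a sketch: gpu_is_bad and
table_marked_bad are hypothetical names, threshold stands in for the
amdgpu_bad_page_threshold module parameter, and table_marked_bad stands in for
the "tbl_hdr.header == RAS_TABLE_HDR_BAD" test from the quoted patch.

```c
#include <stdbool.h>

/*
 * Hypothetical standalone model of the check discussed above; it is not
 * the driver function. "threshold" mirrors amdgpu_bad_page_threshold and
 * "table_marked_bad" mirrors the RAS_TABLE_HDR_BAD header test.
 */
bool gpu_is_bad(int threshold, bool table_marked_bad)
{
	/* 0 means the threshold check is skipped entirely. */
	if (!threshold)
		return false;

	/* A table that is not marked bad never blocks the GPU. */
	if (!table_marked_bad)
		return false;

	/* -1: records exceed the threshold, but the GPU may still be operated. */
	if (threshold == -1)
		return false;

	/* Any other non-zero value, including unexpected ones such as -3. */
	return true;
}
```

With this shape an unexpected value such as -3 lands in the final branch and is
handled like any other non-zero threshold, so no separate -2 case is required.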


RE: [PATCH 2/2] drm/amdgpu: exclude duplicate pages from UMC RAS UE count

2023-02-16 Thread Yang, Stanley
[AMD Official Use Only - General]



> -Original Message-
> From: Zhou1, Tao 
> Sent: Friday, February 17, 2023 11:53 AM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Yang, Stanley ; Chai,
> Thomas ; Li, Candice ; Lazar,
> Lijo 
> Cc: Zhou1, Tao 
> Subject: [PATCH 2/2] drm/amdgpu: exclude duplicate pages from UMC RAS
> UE count
> 
> If a UMC bad page is reserved but not freed by an application, the application
> may trigger uncorrectable errors repeatedly by accessing the page.
> 
> v2: add specific function to do the check.
> v3: remove duplicate pages, calculate new added bad page number.
> 
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 23
> +++
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  2 ++
> drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c |  2 ++
>  3 files changed, 27 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 6e543558386d..777f85f3e5eb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2115,6 +2115,29 @@ int amdgpu_ras_save_bad_pages(struct
> amdgpu_device *adev)
>   return 0;
>  }
> 
> +/* Remove duplicate pages, calculate new added bad page number.
> + * Note: the function should be called between
> amdgpu_ras_add_bad_pages
> + * and amdgpu_ras_save_bad_pages.
> + */
> +int amdgpu_ras_umc_new_ue_count(struct amdgpu_device *adev) {
> + struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> + struct ras_err_handler_data *data;
> + struct amdgpu_ras_eeprom_control *control;
> + int save_count;
> +
> + if (!con || !con->eh_data)
> + return 0;
> +
> + mutex_lock(&con->recovery_lock);
> + control = &con->eeprom_control;
> + data = con->eh_data;
> + save_count = data->count - control->ras_num_recs;
> + mutex_unlock(&con->recovery_lock);
> +
> + return (save_count / adev->umc.retire_unit); }

Stanley: It would be better to add a comment describing the return value.
With that addressed, the patch is Reviewed-by: Stanley.Yang 


> +
>  /*
>   * read error record array in eeprom and reserve enough space for
>   * storing new bad pages
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index f2ad93f6..e89c95438a88 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -549,6 +549,8 @@ int amdgpu_ras_add_bad_pages(struct
> amdgpu_device *adev,
> 
>  int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
> 
> +int amdgpu_ras_umc_new_ue_count(struct amdgpu_device *adev);
> +
>  static inline enum ta_ras_block
>  amdgpu_ras_block_to_ta(enum amdgpu_ras_block block) {
>   switch (block) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index 1c7fcb4f2380..45b6be7277dd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -147,6 +147,8 @@ static int amdgpu_umc_do_page_retirement(struct
> amdgpu_device *adev,
>   err_data->err_addr_cnt) {
>   amdgpu_ras_add_bad_pages(adev, err_data-
> >err_addr,
>   err_data->err_addr_cnt);
> + err_data->ue_count =
> amdgpu_ras_umc_new_ue_count(adev);
> +
>   amdgpu_ras_save_bad_pages(adev);
> 
>   amdgpu_dpm_send_hbm_bad_pages_num(adev,
> con->eeprom_control.ras_num_recs);
> --
> 2.35.1


RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page

2023-02-16 Thread Yang, Stanley
[AMD Official Use Only - General]



> -Original Message-
> From: Zhou1, Tao 
> Sent: Thursday, February 16, 2023 3:58 PM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Yang, Stanley ; Chai,
> Thomas ; Li, Candice ; Lazar,
> Lijo 
> Cc: Zhou1, Tao 
> Subject: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new
> bad page
> 
> If a UMC bad page is reserved but not freed by an application, the application
> may trigger uncorrectable errors repeatedly by accessing the page.
> 
> v2: add specific function to do the check.
> 
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 24
> 
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  2 ++
> drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c |  4 
>  3 files changed, 30 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 6e543558386d..5214034e1b16 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2115,6 +2115,30 @@ int amdgpu_ras_save_bad_pages(struct
> amdgpu_device *adev)
>   return 0;
>  }
> 
> +/* Return false if all pages have been reserved before, no new bad page
> + * is found, otherwise return true.
> + * Note: the function should be called between
> amdgpu_ras_add_bad_pages
> + * and amdgpu_ras_save_bad_pages.
> + */
> +bool amdgpu_ras_new_bad_page_is_added(struct amdgpu_device *adev)
> {
> + struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> + struct ras_err_handler_data *data;
> + struct amdgpu_ras_eeprom_control *control;
> + int save_count;
> +
> + if (!con || !con->eh_data)
> + return false;
> +
> + mutex_lock(&con->recovery_lock);
> + control = &con->eeprom_control;
> + data = con->eh_data;
> + save_count = data->count - control->ras_num_recs;
> + mutex_unlock(&con->recovery_lock);
> +
> + return (save_count ? true : false);
> +}
> +
>  /*
>   * read error record array in eeprom and reserve enough space for
>   * storing new bad pages
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index f2ad93f6..606b75c36848 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -549,6 +549,8 @@ int amdgpu_ras_add_bad_pages(struct
> amdgpu_device *adev,
> 
>  int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
> 
> +bool amdgpu_ras_new_bad_page_is_added(struct amdgpu_device *adev);
> +
>  static inline enum ta_ras_block
>  amdgpu_ras_block_to_ta(enum amdgpu_ras_block block) {
>   switch (block) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index 1c7fcb4f2380..1146e65c22be 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -147,6 +147,10 @@ static int amdgpu_umc_do_page_retirement(struct
> amdgpu_device *adev,
>   err_data->err_addr_cnt) {
>   amdgpu_ras_add_bad_pages(adev, err_data-
> >err_addr,
>   err_data->err_addr_cnt);
> + /* if no new bad page is found, no need to increase
> ue count */
> + if (!amdgpu_ras_new_bad_page_is_added(adev))
> + err_data->ue_count = 0;

[Stanley]: Consider this scenario: a UMC bad page is reserved but not freed by
an application, and the application accesses that reserved page as well as a
new bad page. The driver reads a UE count of 2 but saves only one new bad page,
so err_data->ue_count should be set to 1.

> +
>   amdgpu_ras_save_bad_pages(adev);
> 
>   amdgpu_dpm_send_hbm_bad_pages_num(adev,
> con->eeprom_control.ras_num_recs);
> --
> 2.35.1
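
The scenario raised in the review comment above can be checked with a small
standalone calculation. This is a sketch under stated assumptions, not driver
code: new_ue_count is a hypothetical helper, total_recs models data->count
after amdgpu_ras_add_bad_pages(), saved_recs models control->ras_num_recs, and
retire_unit models adev->umc.retire_unit as used in the v3 revision of the
patch.

```c
/*
 * Hypothetical helper, not part of the driver.
 * total_recs:  records tracked after the new bad pages were added
 * saved_recs:  records already stored in the EEPROM table
 * retire_unit: number of records that make up one retired unit
 */
unsigned long new_ue_count(int total_recs, int saved_recs,
			   unsigned int retire_unit)
{
	int new_recs = total_recs - saved_recs;

	/* Nothing new was added, or the unit size is not usable. */
	if (new_recs <= 0 || retire_unit == 0)
		return 0;

	/* Only the newly added records count as new uncorrectable errors. */
	return (unsigned long)new_recs / retire_unit;
}
```

For the case described above (one page already saved, the same page hit again
together with one new page, one record per unit), new_ue_count(2, 1, 1)
yields 1 rather than 0 or 2.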


RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page

2023-02-13 Thread Yang, Stanley
[AMD Official Use Only - General]



> -Original Message-
> From: Zhou1, Tao 
> Sent: Monday, February 13, 2023 4:25 PM
> To: Zhang, Hawking ; amd-
> g...@lists.freedesktop.org; Yang, Stanley ; Chai,
> Thomas ; Li, Candice 
> Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no
> new bad page
> 
> [AMD Official Use Only - General]
> 
> > -Original Message-
> > From: Zhang, Hawking 
> > Sent: Friday, February 10, 2023 11:02 PM
> > To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org;
> > Yang, Stanley ; Chai, Thomas
> > ; Li, Candice 
> > Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no
> > new bad page
> >
> > [AMD Official Use Only - General]
> >
> > +   /* if no new bad page is found, no need to increase 
> > ue count
> */
> > +   if (ret == -EEXIST)
> > +   err_data->ue_count = 0;
> >
> > Returning EEXIST in such case is not reasonable. Might consider return
> > a bool for
> > amdgpu_ras_add_bad_pages: true means it does add some new bad page;
> > false means it doesn't change anything.
> >
> > Regards,
> > Hawking
> 
> [Tao] But it can also return -ENOMEM, and amdgpu_ras_load_bad_pages and
> amdgpu_ras_recovery_init also need to check the return value. I'd like to
> keep the type of the return value unchanged.
> How about -EINVAL?

Stanley: How about returning -EALREADY?

Regards,
Stanley
> 
> >
> > -Original Message-
> > From: Zhou1, Tao 
> > Sent: Friday, February 10, 2023 16:45
> > To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> > ; Yang, Stanley ;
> Chai,
> > Thomas ; Li, Candice 
> > Cc: Zhou1, Tao 
> > Subject: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no
> new
> > bad page
> >
> > If a UMC bad page is reserved but not freed by an application, the
> > application may trigger uncorrectable errors repeatedly by accessing the page.
> >
> > Signed-off-by: Tao Zhou 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 -
> > drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +-
> >  2 files changed, 13 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index e85c4689ce2c..eafe01a24349 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct
> > amdgpu_device *adev,  {
> > struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> > struct ras_err_handler_data *data;
> > -   int ret = 0;
> > +   int ret = 0, old_cnt;
> > uint32_t i;
> >
> > if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6
> > +2060,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device
> *adev,
> > if (!data)
> > goto out;
> >
> > +   old_cnt = data->count;
> > +
> > for (i = 0; i < pages; i++) {
> > if (amdgpu_ras_check_bad_page_unlock(con,
> > bps[i].retired_page << AMDGPU_GPU_PAGE_SHIFT))
> > @@ -2079,6
> > +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device
> *adev,
> > data->count++;
> > data->space_left--;
> > }
> > +
> > +   /* all pages have been reserved before, no new bad page */
> > +   if (old_cnt == data->count)
> > +   ret = -EEXIST;
> > +
> >  out:
> > mutex_unlock(&con->recovery_lock);
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > index 1c7fcb4f2380..772c431e4065 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > @@ -145,8 +145,12 @@ static int
> amdgpu_umc_do_page_retirement(struct
> > amdgpu_device *adev,
> >
> > if ((amdgpu_bad_page_threshold != 0) &&
> > err_data->err_addr_cnt) {
> > -   amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
> > +   ret = amdgpu_ras_add_bad_pages(adev,
> > + err_data->err_addr,
> >
> > err_data->err_addr_cnt);
> > +   /* if no new bad page is found, no need to increase 
> > ue count
> */
> > +   if (ret == -EEXIST)
> > +   err_data->ue_count = 0;
> > +
> > amdgpu_ras_save_bad_pages(adev);
> >
> > amdgpu_dpm_send_hbm_bad_pages_num(adev, con-
> > >eeprom_control.ras_num_recs);
> > --
> > 2.35.1
> >
> 
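
The return-value discussion in this thread (keep an int return so that -ENOMEM
can still be reported, and use a distinct code for "no new bad page") could be
sketched as below. Both functions are hypothetical stand-ins, not the driver's
amdgpu_ras_add_bad_pages() or its caller; -EALREADY is used only because it is
the value proposed in the reply above.

```c
#include <errno.h>

/*
 * Hypothetical model of an "add bad pages" routine:
 *   0          at least one new page was added
 *   -EALREADY  every page had been reserved before
 *   -ENOMEM    growing the record array failed
 */
int sketch_add_bad_pages(int old_count, int new_count, int alloc_failed)
{
	if (alloc_failed)
		return -ENOMEM;

	if (new_count == old_count)
		return -EALREADY;

	return 0;
}

/*
 * Hypothetical caller: only the "nothing new" result clears the UE count,
 * other results leave the count untouched.
 */
unsigned long sketch_ue_count_after_add(int ret, unsigned long ue_count)
{
	return ret == -EALREADY ? 0 : ue_count;
}
```

Keeping the int return type lets existing callers continue to test for
failures such as -ENOMEM while the page retirement path reacts to the
"nothing new" code separately.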


RE: [PATCH] drm/amdgpu: move convert_error_address out of umc_ras

2022-10-14 Thread Yang, Stanley
[AMD Official Use Only - General]

Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: amd-gfx  On Behalf Of
> Hawking Zhang
> Sent: Friday, October 14, 2022 2:19 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao ;
> Yang, Stanley 
> Cc: Russell, Kent ; Zhang, Hawking
> 
> Subject: [PATCH] drm/amdgpu: move convert_error_address out of umc_ras
> 
> RAS error address translation algorithm is common across dGPU and A + A
> platform as long as the SOC integrates the same generation of UMC IP.
> 
> UMC RAS is managed by x86 MCA on A + A platform, umc_ras in GPU driver
> is not initialized at all on A + A platform. In such case, any umc_ras 
> callback
> implemented for dGPU config shouldn't be invoked from A + A specific
> callback.
> 
> The change moves convert_error_address out of dGPU umc_ras structure
> and makes it share between A + A and dGPU config.
> 
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 15 +++
> drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h |  3 ---
>  drivers/gpu/drm/amd/amdgpu/umc_v6_7.c   |  7 +++
>  drivers/gpu/drm/amd/amdgpu/umc_v6_7.h   |  4 +++-
>  4 files changed, 17 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 75f1402101f4..ff92ea99d513 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -36,6 +36,7 @@
>  #include "ivsrcid/nbio/irqsrcs_nbif_7_4.h"
>  #include "atom.h"
>  #include "amdgpu_reset.h"
> +#include "umc_v6_7.h"
> 
>  #ifdef CONFIG_X86_MCE_AMD
>  #include 
> @@ -2885,10 +2886,16 @@ static int amdgpu_bad_page_notifier(struct
> notifier_block *nb,
>   /*
>* Translate UMC channel address to Physical address
>*/
> - if (adev->umc.ras &&
> - adev->umc.ras->convert_ras_error_address)
> - adev->umc.ras->convert_ras_error_address(adev,
> - &err_data, m->addr, ch_inst, umc_inst);
> + switch (adev->ip_versions[UMC_HWIP][0]) {
> + case IP_VERSION(6, 7, 0):
> + umc_v6_7_convert_error_address(adev,
> + &err_data, m->addr, ch_inst, umc_inst);
> + break;
> + default:
> + dev_warn(adev->dev,
> +  "UMC address to Physical address translation is not
> supported\n");
> + return NOTIFY_DONE;
> + }
> 
>   if (amdgpu_bad_page_threshold != 0) {
>   amdgpu_ras_add_bad_pages(adev, err_data.err_addr, diff --
> git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> index e46439274f3a..3629d8f292ef 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> @@ -51,9 +51,6 @@ struct amdgpu_umc_ras {
>   struct amdgpu_ras_block_object ras_block;
>   void (*err_cnt_init)(struct amdgpu_device *adev);
>   bool (*query_ras_poison_mode)(struct amdgpu_device *adev);
> - void (*convert_ras_error_address)(struct amdgpu_device *adev,
> - struct ras_err_data *err_data, uint64_t
> err_addr,
> - uint32_t ch_inst, uint32_t umc_inst);
>   void (*ecc_info_query_ras_error_count)(struct amdgpu_device
> *adev,
> void *ras_error_status);
>   void (*ecc_info_query_ras_error_address)(struct amdgpu_device
> *adev, diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> index 5d5d031c9e7d..72fd963f178b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> @@ -187,9 +187,9 @@ static void
> umc_v6_7_ecc_info_query_ras_error_count(struct amdgpu_device *adev,
>   }
>  }
> 
> -static void umc_v6_7_convert_error_address(struct amdgpu_device *adev,
> - struct ras_err_data *err_data,
> uint64_t err_addr,
> - uint32_t ch_inst, uint32_t umc_inst)
> +void umc_v6_7_convert_error_address(struct amdgpu_device *adev,
> + struct ras_err_data *err_data, uint64_t
> err_addr,
> + uint32_t ch_inst, uint32_t umc_inst)
>  {
>   uint32_t channel_index;
>   uint64_t soc_pa, retired_page, column; @@ -553,5 +553,4 @@ struct
> amdgpu_umc_ras umc_v6_7_ras = {
>   .query_ras_poison_mode = umc_v6_7_query_ras_poison_mode,
>   .ecc_info_query_ras_error_count =
> umc_v6_7_ecc_info_query_ras_error_count,
>   .ecc_
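
The idea in the commit message above (pick the address translation routine
from the UMC IP version instead of going through the umc_ras callback table,
which is not initialized on A + A platforms) might look roughly like this
standalone sketch. SKETCH_IP_VERSION, sketch_v6_7_convert and
sketch_convert_error_address are hypothetical names; the bit packing of the
version and the identity "translation" are assumptions made only so that the
example compiles.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed packing of major/minor/revision into one integer. */
#define SKETCH_IP_VERSION(mj, mn, rv) (((mj) << 16) | ((mn) << 8) | (rv))

/* Placeholder for a version specific translation; returns the input as-is. */
uint64_t sketch_v6_7_convert(uint64_t mca_addr)
{
	return mca_addr;
}

/*
 * Dispatch on the IP version rather than on a per-device callback pointer,
 * so the same routine is reachable whether or not a callback table exists.
 * Returns false when no translation is known for the given version.
 */
bool sketch_convert_error_address(uint32_t umc_ip_version, uint64_t mca_addr,
				  uint64_t *out)
{
	switch (umc_ip_version) {
	case SKETCH_IP_VERSION(6, 7, 0):
		*out = sketch_v6_7_convert(mca_addr);
		return true;
	default:
		/* Unsupported generation: the caller should warn and bail out. */
		return false;
	}
}
```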

RE: [PATCH 1/4] drm/amdgpu: export umc error address translation interface

2022-09-25 Thread Yang, Stanley
[AMD Official Use Only - General]

Hi Tao,

> -Original Message-
> From: Zhou1, Tao 
> Sent: Friday, September 23, 2022 5:21 PM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Yang, Stanley 
> Cc: Zhou1, Tao 
> Subject: [PATCH 1/4] drm/amdgpu: export umc error address translation
> interface
> 
> Make it global so we can convert a specific MCA address.
> 
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h |  6 ++
>  drivers/gpu/drm/amd/amdgpu/umc_v6_7.c   | 11 +--
>  2 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> index 3629d8f292ef..31fbefaaf676 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> @@ -22,6 +22,8 @@
>  #define __AMDGPU_UMC_H__
>  #include "amdgpu_ras.h"
> 
> +#define UMC_INVALID_ADDR 0x1ULL
> +
>  /*
>   * (addr / 256) * 4096, the higher 26 bits in ErrorAddr
>   * is the index of 4KB block
> @@ -51,6 +53,10 @@ struct amdgpu_umc_ras {
>   struct amdgpu_ras_block_object ras_block;
>   void (*err_cnt_init)(struct amdgpu_device *adev);
>   bool (*query_ras_poison_mode)(struct amdgpu_device *adev);
> + void (*query_error_address_per_channel)(struct amdgpu_device
> *adev,
> +  struct ras_err_data
> *err_data,
> +  uint32_t umc_reg_offset,
> uint32_t ch_inst,
> +  uint32_t umc_inst, uint64_t
> mca_addr);
>   void (*ecc_info_query_ras_error_count)(struct amdgpu_device
> *adev,
> void *ras_error_status);
>   void (*ecc_info_query_ras_error_address)(struct amdgpu_device
> *adev, diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> index bf7524f16b66..0f1b215653f3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> @@ -452,9 +452,8 @@ static void umc_v6_7_query_ras_error_count(struct
> amdgpu_device *adev,
> 
>  static void umc_v6_7_query_error_address(struct amdgpu_device *adev,
>struct ras_err_data *err_data,
> -  uint32_t umc_reg_offset,
> -  uint32_t ch_inst,
> -  uint32_t umc_inst)
> +  uint32_t umc_reg_offset, uint32_t
> ch_inst,
> +  uint32_t umc_inst, uint64_t
> mca_addr)
>  {
>   uint32_t mc_umc_status_addr;
>   uint32_t channel_index;
> @@ -540,9 +539,8 @@ static void
> umc_v6_7_query_ras_error_address(struct amdgpu_device *adev,
>ch_inst);
>   umc_v6_7_query_error_address(adev,
>err_data,
> -  umc_reg_offset,
> -  ch_inst,
> -  umc_inst);
> +  umc_reg_offset, ch_inst,
> +  umc_inst, UMC_INVALID_ADDR);
>   }
>  }
> 
> @@ -583,4 +581,5 @@ struct amdgpu_umc_ras umc_v6_7_ras = {
>   .query_ras_poison_mode = umc_v6_7_query_ras_poison_mode,
>   .ecc_info_query_ras_error_count =
> umc_v6_7_ecc_info_query_ras_error_count,
>   .ecc_info_query_ras_error_address =
> umc_v6_7_ecc_info_query_ras_error_address,
> + .query_error_address_per_channel =
> umc_v6_7_query_error_address,

Stanley: According to patch#3, it would be better to rename
query_error_address_per_channel to
convert/query_error_address_at_specific_channel, because the channel instance
and UMC instance are obtained from the mce structure; using per_channel may
cause misunderstanding.

>  };
> --
> 2.35.1


RE: [PATCH 1/2] drm/amdgpu: support RAS error inject for SRIOV

2022-08-31 Thread Yang, Stanley
[AMD Official Use Only - General]

The series looks fine to me; these patches also need to be reviewed by the
virtualization group.

Regards,
Stanley
> -Original Message-
> From: Zhou1, Tao 
> Sent: Wednesday, August 31, 2022 4:39 PM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Yang, Stanley ; Liu,
> Monk ; Skvortsov, Victor
> ; Chang, HaiJun ;
> Chander, Vignesh ; Wan, Gavin
> ; Liu, Shaoyun 
> Cc: Zhou1, Tao 
> Subject: [PATCH 1/2] drm/amdgpu: support RAS error inject for SRIOV
> 
> In SRIOV, RAS error injection request will be sent to PF via mailbox, the
> injection input information should also be transferred to PF.
> 
> Generally, the error injection is operated on PF side directly, but for RAS
> poison test, since workload is launched on VF side, VF has to tell PF about
> the injection information.
> 
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c  | 26 --
> --  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h |  2 ++
>  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c| 24
> ++
>  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h|  9 
>  4 files changed, 53 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index ab9ba5a9c33d..498642eb5fb7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1103,15 +1103,25 @@ int amdgpu_ras_error_inject(struct
> amdgpu_device *adev,
> block_info.address);
>   }
> 
> - if (info->head.block == AMDGPU_RAS_BLOCK__GFX) {
> - if (block_obj->hw_ops->ras_error_inject)
> - ret = block_obj->hw_ops->ras_error_inject(adev,
> info);
> + if (!amdgpu_sriov_vf(adev)) {
> + if (info->head.block == AMDGPU_RAS_BLOCK__GFX) {
> + if (block_obj->hw_ops->ras_error_inject)
> + ret = block_obj->hw_ops-
> >ras_error_inject(adev, info);
> + } else {
> + /* If defined special ras_error_inject(e.g: xgmi),
> implement special ras_error_inject */
> + if (block_obj->hw_ops->ras_error_inject)
> + ret = block_obj->hw_ops->ras_error_inject(adev, &block_info);
> + else  /*If not defined .ras_error_inject, use default
> ras_error_inject*/
> + ret = psp_ras_trigger_error(&adev->psp, &block_info);
> + }
>   } else {
> - /* If defined special ras_error_inject(e.g: xgmi), implement
> special ras_error_inject */
> - if (block_obj->hw_ops->ras_error_inject)
> - ret = block_obj->hw_ops->ras_error_inject(adev, &block_info);
> - else  /*If not defined .ras_error_inject, use default
> ras_error_inject*/
> - ret = psp_ras_trigger_error(&adev->psp, &block_info);
> + if (adev->virt.ops && adev->virt.ops->ras_trigger_error) {
> + adev->virt.ops->ras_trigger_error(adev, &block_info);
> + ret = 0;
> + } else {
> + dev_warn(adev->dev,
> + "No ras_trigger_error interface in SRIOV!\n");
> + }
>   }
> 
>   if (ret)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> index 239f232f9c02..4534e6f70a4b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> @@ -84,6 +84,8 @@ struct amdgpu_virt_ops {
>   int (*reset_gpu)(struct amdgpu_device *adev);
>   int (*wait_reset)(struct amdgpu_device *adev);
>   void (*trans_msg)(struct amdgpu_device *adev, u32 req, u32 data1,
> u32 data2, u32 data3);
> + void (*ras_trigger_error)(struct amdgpu_device *adev,
> + struct ta_ras_trigger_error_input *info);
>  };
> 
>  /*
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> index a2f04b249132..3b4c5162a237 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> @@ -33,6 +33,7 @@
>  #include "mxgpu_ai.h"
> 
>  #include "amdgpu_reset.h"
> +#include "ta_ras_if.h"
> 
>  static void xgpu_ai_mailbox_send_ack(struct amdgpu_device *adev)  { @@ -
> 405,6 +406,28 @@ static int xgpu_ai_request_init_data(struct
> amdgpu_device *adev)
>   return xgpu_ai_send_access_req

RE: [PATCH Review 1/1] drm/amdgpu/pm: adjust EccInfo_t struct

2022-06-15 Thread Yang, Stanley
[AMD Official Use Only - General]

Thanks Hawking. The previous patch already checks the pmfw version. While this
feature was being developed, the EccInfo_t struct was consistent between the
driver side and the pmfw side with pmfw debug version 68.54.136, but it changed
in the official release version 68.55.0, so the driver side has to adjust it.

Regards,
Stanley
> -Original Message-
> From: Zhang, Hawking 
> Sent: Thursday, June 16, 2022 11:50 AM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Zhou1, Tao ; Li, Candice
> ; Quan, Evan 
> Cc: Yang, Stanley 
> Subject: RE: [PATCH Review 1/1] drm/amdgpu/pm: adjust EccInfo_t struct
>
> [AMD Official Use Only - General]
>
> For the structure itself, the change is okay to me. But you'll have to apply a
> pmfw version check in the implementation to make the data match the fw structure.
>
> The patch is
>
> Reviewed-by: Hawking Zhang 
>
> Regards,
> Hawking
>
> -Original Message-
> From: Stanley.Yang 
> Sent: Thursday, June 16, 2022 11:22
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Zhou1, Tao ; Li, Candice
> ; Quan, Evan 
> Cc: Yang, Stanley 
> Subject: [PATCH Review 1/1] drm/amdgpu/pm: adjust EccInfo_t struct
>
> The EccInfo_t struct in driver_if.h is as below in official release version
> 68.55.0:
> typedef struct {
>uint64_t mca_umc_status;
>uint64_t mca_umc_addr;
>
>uint16_t ce_count_lo_chip;
>uint16_t ce_count_hi_chip;
>
>uint32_t eccPadding;
>
>uint64_t mca_ceumc_addr;
>  } EccInfo_t;
> It's different from the debug version used while developing the correctable
> error address printing, so adjust the EccInfo_t struct.
>
> Signed-off-by: Stanley.Yang 
> ---
>  .../drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_aldebaran.h   | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git
> a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_aldebaran.h
> b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_aldebaran.h
> index 6f92038470ec..7a6075daa7b2 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_aldebaran.h
> +++ b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_aldebaran.h
> @@ -521,12 +521,13 @@ typedef struct {
>  typedef struct {
> uint64_t mca_umc_status;
> uint64_t mca_umc_addr;
> -   uint64_t mca_ceumc_addr;
>
> uint16_t ce_count_lo_chip;
> uint16_t ce_count_hi_chip;
>
> uint32_t eccPadding;
> +
> +   uint64_t mca_ceumc_addr;
>  } EccInfo_V2_t;
>
>  typedef struct {
> --
> 2.17.1
>
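
Why the field order matters can be seen by comparing member offsets of the two
layouts from the quoted diff. The sketch below is illustrative only: the type
names EccInfoOld and EccInfoNew and the four accessor functions are
hypothetical, while the members and their order are copied from the patch
(before and after the change).

```c
#include <stddef.h>
#include <stdint.h>

/* Field order on the driver side before the patch. */
typedef struct {
	uint64_t mca_umc_status;
	uint64_t mca_umc_addr;
	uint64_t mca_ceumc_addr;

	uint16_t ce_count_lo_chip;
	uint16_t ce_count_hi_chip;

	uint32_t eccPadding;
} EccInfoOld;

/* Field order after the patch, matching the released firmware struct. */
typedef struct {
	uint64_t mca_umc_status;
	uint64_t mca_umc_addr;

	uint16_t ce_count_lo_chip;
	uint16_t ce_count_hi_chip;

	uint32_t eccPadding;

	uint64_t mca_ceumc_addr;
} EccInfoNew;

/* Byte offset of the correctable error address in each layout. */
size_t ceumc_offset_old(void) { return offsetof(EccInfoOld, mca_ceumc_addr); }
size_t ceumc_offset_new(void) { return offsetof(EccInfoNew, mca_ceumc_addr); }

/* Byte offset of the low correctable error counter in each layout. */
size_t ce_count_offset_old(void) { return offsetof(EccInfoOld, ce_count_lo_chip); }
size_t ce_count_offset_new(void) { return offsetof(EccInfoNew, ce_count_lo_chip); }
```

Both layouts have the same total size, but the correctable error address and
the counters swap places, so a driver that reads the table with the old order
would interpret the counter bytes as an address and vice versa.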



RE: [PATCH Review v3 2/2] drm/amdgpu: print umc correctable error address

2022-05-25 Thread Yang, Stanley
[AMD Official Use Only - General]



From: Lazar, Lijo 
Date: Wednesday, May 25, 2022, 8:38 PM
To: Yang, Stanley; amd-gfx@lists.freedesktop.org; Zhang, Hawking; Zhou1, Tao;
Quan, Evan 
Subject: Re: [PATCH Review v3 2/2] drm/amdgpu: print umc correctable error address


On 5/25/2022 11:40 AM, Stanley.Yang wrote:
> Changed from V1:
>remove unnecessary same row physical address calculation
>
> Changed from V2:
>move record_ce_addr_supported to umc_ecc_info struct
>
> Signed-off-by: Stanley.Yang 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h   |  5 ++
>   drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 50 ++-
>   .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c|  1 +
>   3 files changed, 54 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 28e603243b67..bf5a95104ec1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -333,6 +333,11 @@ struct ecc_info_per_ch {
>
>   struct umc_ecc_info {
>struct ecc_info_per_ch ecc[MAX_UMC_CHANNEL_NUM];
> +
> + /* Determine smu ecctable whether support
> +  * record correctable error address
> +  */
> + int record_ce_addr_supported;
>   };
>
>   struct amdgpu_ras {
> diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c 
> b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> index 606892dbea1c..bf7524f16b66 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> @@ -119,6 +119,24 @@ static void 
> umc_v6_7_ecc_info_query_correctable_error_count(struct amdgpu_device
>*error_count += 1;
>
>umc_v6_7_query_error_status_helper(adev, mc_umc_status, 
> umc_reg_offset);
> +
> + if (ras->umc_ecc.record_ce_addr_supported)  {
> + uint64_t err_addr, soc_pa;
> + uint32_t channel_index =
> + adev->umc.channel_idx_tbl[umc_inst * 
> adev->umc.channel_inst_num + ch_inst];
> +
> + err_addr = 
> ras->umc_ecc.ecc[eccinfo_table_idx].mca_ceumc_addr;
> + err_addr = REG_GET_FIELD(err_addr, 
> MCA_UMC_UMC0_MCUMC_ADDRT0, ErrorAddr);
> + /* translate umc channel address to soc pa, 3 parts are 
> included */
> + soc_pa = ADDR_OF_8KB_BLOCK(err_addr) |
> + ADDR_OF_256B_BLOCK(channel_index) |
> + OFFSET_IN_256B_BLOCK(err_addr);
> +
> + /* The umc channel bits are not original values, they 
> are hashed */
> + SET_CHANNEL_HASH(channel_index, soc_pa);
> +

UMC address to PA conversion is common regardless of UE/CE error
addresses. You may want to pack it in a small function.

Regardless,
Acked-by: Lijo Lazar 

Thanks,
Lijo
Stanley: These lines are indeed redundant. I'll send a patch to simplify them.

Regards,
Stanley
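
A minimal sketch of the small helper Lijo suggests, so that the CE and UE paths could share one conversion. The function name is hypothetical, and the three macro definitions are stand-ins assumed for illustration; the real definitions live in the driver headers and may differ. The channel hashing step (SET_CHANNEL_HASH) is deliberately left out here.

```c
#include <stdint.h>

/* Stand-in definitions, assumed for illustration only. */
#define ADDR_OF_8KB_BLOCK(addr)           (((addr) & ~0xffULL) << 5)
#define ADDR_OF_256B_BLOCK(channel_index) ((uint64_t)(channel_index) << 8)
#define OFFSET_IN_256B_BLOCK(addr)        ((addr) & 0xffULL)

/*
 * Hypothetical helper: combine the three parts named in the patch comment
 * (8KB block address, 256B block selected by the channel index, offset
 * inside the 256B block) into one SoC physical address.
 */
static uint64_t umc_addr_to_soc_pa_sketch(uint64_t err_addr,
					  uint32_t channel_index)
{
	return ADDR_OF_8KB_BLOCK(err_addr) |
	       ADDR_OF_256B_BLOCK(channel_index) |
	       OFFSET_IN_256B_BLOCK(err_addr);
}
```

With such a helper, both query functions in the patch would call it once instead of repeating the three-line expression.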


> + dev_info(adev->dev, "Error Address(PA): 0x%llx\n", 
> soc_pa);
> + }
>}
>   }
>
> @@ -251,7 +269,9 @@ static void 
> umc_v6_7_ecc_info_query_ras_error_address(struct amdgpu_device *adev
>
>   static void umc_v6_7_query_correctable_error_count(struct amdgpu_device 
> *adev,
>   uint32_t umc_reg_offset,
> -unsigned long *error_count)
> +unsigned long *error_count,
> +uint32_t ch_inst,
> +uint32_t umc_inst)
>   {
>uint32_t ecc_err_cnt_sel, ecc_err_cnt_sel_addr;
>uint32_t ecc_err_cnt, ecc_err_cnt_addr;
> @@ -295,6 +315,31 @@ static void 
> umc_v6_7_query_correctable_error_count(struct amdgpu_device *adev,
>*error_count += 1;
>
>umc_v6_7_query_error_status_helper(adev, mc_umc_status, 
> umc_reg_offset);
> +
> + {
> + uint64_t err_addr, soc_pa;
> + uint32_t mc_umc_addrt0;
> + uint32_t channel_index;
> +
> + mc_umc_addrt0 =
> + SOC15_REG_OFFSET(UMC, 0, 
> regMCA_UMC_UMC0_MCUMC_ADDRT0);
> +
> + channel_index =
> + adev->umc.channel_idx_tbl[umc_inst * 
> adev->umc.channel_inst_num + ch_inst];
> +
> + err_addr = RREG64_PCIE((mc_umc_addrt0 + umc_reg_offset) 
> *

RE: [PATCH Review v3 2/2] drm/amdgpu: print umc correctable error address

2022-05-25 Thread Yang, Stanley
[AMD Official Use Only - General]


From: Wang, Yang(Kevin)
Date: Wednesday, May 25, 2022, 2:52 PM
To: Yang, Stanley; amd-gfx@lists.freedesktop.org; Zhang, Hawking; Zhou1, Tao; Quan, Evan; Lazar, Lijo
Subject: Re: [PATCH Review v3 2/2] drm/amdgpu: print umc correctable error address

[AMD Official Use Only - General]

From: amd-gfx  on behalf of Stanley.Yang 

Sent: Wednesday, May 25, 2022 2:10 PM
To: amd-gfx@lists.freedesktop.org ; Zhang, 
Hawking ; Zhou1, Tao ; Quan, Evan 
; Lazar, Lijo 
Cc: Yang, Stanley 
Subject: [PATCH Review v3 2/2] drm/amdgpu: print umc correctable error address

Changed from V1:
remove unnecessary same row physical address calculation

Changed from V2:
move record_ce_addr_supported to umc_ecc_info struct

Signed-off-by: Stanley.Yang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h   |  5 ++
 drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 50 ++-
 .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c|  1 +
 3 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 28e603243b67..bf5a95104ec1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -333,6 +333,11 @@ struct ecc_info_per_ch {

 struct umc_ecc_info {
 struct ecc_info_per_ch ecc[MAX_UMC_CHANNEL_NUM];
+
+   /* Determine smu ecctable whether support
+* record correctable error address
+*/
+   int record_ce_addr_supported;
 };

[kevin]:

  1.  The new field record_ce_addr_supported is not set on the sienna_cichlid
chip.
Stanley: Sienna Cichlid does not support this feature, so
record_ce_addr_supported is not set there.

  2.  It would be better to give this field a different name, e.g.
ecc_table_version, in case the ecc table (pmfw side) is updated again in the
future.

Stanley: Naming it record_ce_addr_supported is more intuitive than using
ecc_table_version or something else.

Best Regards
Kevin
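
For context on the flag-versus-version question, here is a sketch of how a boolean such as record_ce_addr_supported could be derived from a firmware version. The constant is the SUPPORT_ECCTABLE_V2_SMU_VERSION value from patch 1/2 of this series (commented there as 68.55.0). The packing helper, the function names, and the use of a simple ">=" comparison are assumptions made for illustration, inferred from the two version constants quoted in that patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value taken from patch 1/2; its comment gives the version as 68.55.0. */
#define SUPPORT_ECCTABLE_V2_SMU_VERSION 0x00443700

/* Hypothetical helper: pack major.minor.patch one byte each (assumed). */
static uint32_t smu_version_pack_sketch(uint8_t major, uint8_t minor,
					uint8_t patch)
{
	return ((uint32_t)major << 16) | ((uint32_t)minor << 8) | patch;
}

/* Hypothetical check: firmware at or above 68.55.0 records the CE address. */
static bool record_ce_addr_supported_sketch(uint32_t smu_version)
{
	return smu_version >= SUPPORT_ECCTABLE_V2_SMU_VERSION;
}
```

A single flag keeps the consumer code simple; a stored version would only pay off if the table layout changes again.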

 struct amdgpu_ras {
diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c 
b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
index 606892dbea1c..bf7524f16b66 100644
--- a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
+++ b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
@@ -119,6 +119,24 @@ static void 
umc_v6_7_ecc_info_query_correctable_error_count(struct amdgpu_device
 *error_count += 1;

 umc_v6_7_query_error_status_helper(adev, mc_umc_status, 
umc_reg_offset);
+
+   if (ras->umc_ecc.record_ce_addr_supported)  {
+   uint64_t err_addr, soc_pa;
+   uint32_t channel_index =
+   adev->umc.channel_idx_tbl[umc_inst * 
adev->umc.channel_inst_num + ch_inst];
+
+   err_addr = 
ras->umc_ecc.ecc[eccinfo_table_idx].mca_ceumc_addr;
+   err_addr = REG_GET_FIELD(err_addr, 
MCA_UMC_UMC0_MCUMC_ADDRT0, ErrorAddr);
+   /* translate umc channel address to soc pa, 3 parts are 
included */
+   soc_pa = ADDR_OF_8KB_BLOCK(err_addr) |
+   ADDR_OF_256B_BLOCK(channel_index) |
+   OFFSET_IN_256B_BLOCK(err_addr);
+
+   /* The umc channel bits are not original values, they 
are hashed */
+   SET_CHANNEL_HASH(channel_index, soc_pa);
+
+   dev_info(adev->dev, "Error Address(PA): 0x%llx\n", 
soc_pa);
+   }
 }
 }

@@ -251,7 +269,9 @@ static void 
umc_v6_7_ecc_info_query_ras_error_address(struct amdgpu_device *adev

 static void umc_v6_7_query_correctable_error_count(struct amdgpu_device *adev,
uint32_t umc_reg_offset,
-  unsigned long *error_count)
+  unsigned long *error_count,
+  uint32_t ch_inst,
+  uint32_t umc_inst)
 {
 uint32_t ecc_err_cnt_sel, ecc_err_cnt_sel_addr;
 uint32_t ecc_err_cnt, ecc_err_cnt_addr;
@@ -295,6 +315,31 @@ static void umc_v6_7_query_correctable_error_count(struct 
amdgpu_device *adev,
 *error_count += 1;

 umc_v6_7_query_error_status_helper(adev, mc_umc_status, 
umc_reg_offset);
+
+   {
+   uint64_t err_addr, soc_pa;
+   uint32_t mc_umc_addrt0;
+   uint32_t channel_index;
+
+   mc_umc_addrt0 =
+   SOC15_REG_OFFSET(UMC, 0, 
regMCA_UMC_UMC0_MCUMC_ADDRT0);
+
+   channel_index =
+   adev->umc.channel_idx_tbl[umc_inst * 
adev->umc.channel_inst_num + ch_inst];
+
+   err_addr = RREG64_PCIE((mc_umc_addrt0 + umc_reg_offset) 
* 4);
+

RE: [PATCH Review v2 2/2] drm/amdgpu: print umc correctable error address

2022-05-24 Thread Yang, Stanley
[AMD Official Use Only - General]


From: Lazar, Lijo
Date: Wednesday, May 25, 2022, 12:03 AM
To: Yang, Stanley; amd-gfx@lists.freedesktop.org; Zhang, Hawking; Zhou1, Tao; Quan, Evan
Subject: Re: [PATCH Review v2 2/2] drm/amdgpu: print umc correctable error address


On 5/24/2022 8:00 PM, Stanley.Yang wrote:
> Changed from V1:
>remove unnecessary same row physical address calculation
>
> Signed-off-by: Stanley.Yang 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  5 ++
>   drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 52 ++-
>   .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c|  1 +
>   3 files changed, 56 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 3f23f9ad3249..985b8cddb5a1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1108,6 +1108,11 @@ struct amdgpu_device {
>
>boolscpm_enabled;
>uint32_tscpm_status;
> +
> + /* Determine smu ecctable whether support
> +  * record correctable error address
> +  */
> + int record_ce_addr_supported;

Why not keep this in umc_ecc_info passed back from FW?

Thanks,
Lijo

Stanley: Good point, this keeps the RAS ecctable info logic in one place,
thanks Lijo.
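
A rough sketch of what Lijo's suggestion amounts to: the capability flag travels with the ECC data instead of sitting in the device struct. The two struct definitions follow the fields shown in the patches of this series; the MAX_UMC_CHANNEL_NUM value and the lookup function are assumptions added for illustration.

```c
#include <stdint.h>

#define MAX_UMC_CHANNEL_NUM 32 /* assumed value, for illustration only */

struct ecc_info_per_ch {
	uint16_t ce_count_lo_chip;
	uint16_t ce_count_hi_chip;
	uint64_t mca_umc_status;
	uint64_t mca_umc_addr;
	uint64_t mca_ceumc_addr;
};

struct umc_ecc_info {
	struct ecc_info_per_ch ecc[MAX_UMC_CHANNEL_NUM];

	/* Non-zero when the SMU ecctable records the correctable error address. */
	int record_ce_addr_supported;
};

/*
 * Hypothetical consumer: returns 1 and stores the recorded address when the
 * feature is supported and the index is in range, otherwise returns 0.
 */
static int ce_addr_lookup_sketch(const struct umc_ecc_info *info,
				 uint32_t idx, uint64_t *addr)
{
	if (!info->record_ce_addr_supported || idx >= MAX_UMC_CHANNEL_NUM)
		return 0;

	*addr = info->ecc[idx].mca_ceumc_addr;
	return 1;
}
```

A platform that never fills in the flag leaves it at zero, so the address is simply not reported there.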


>   };
>
>   static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c 
> b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> index 606892dbea1c..91bdc5e048c2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> @@ -119,6 +119,24 @@ static void 
> umc_v6_7_ecc_info_query_correctable_error_count(struct amdgpu_device
>*error_count += 1;
>
>umc_v6_7_query_error_status_helper(adev, mc_umc_status, 
> umc_reg_offset);
> +
> + if (adev->record_ce_addr_supported) {
> + uint64_t err_addr, soc_pa;
> + uint32_t channel_index =
> + adev->umc.channel_idx_tbl[umc_inst * 
> adev->umc.channel_inst_num + ch_inst];
> +
> + err_addr = 
> ras->umc_ecc.ecc[eccinfo_table_idx].mca_ceumc_addr;
> + err_addr = REG_GET_FIELD(err_addr, 
> MCA_UMC_UMC0_MCUMC_ADDRT0, ErrorAddr);
> + /* translate umc channel address to soc pa, 3 parts are 
> included */
> + soc_pa = ADDR_OF_8KB_BLOCK(err_addr) |
> + ADDR_OF_256B_BLOCK(channel_index) |
> + OFFSET_IN_256B_BLOCK(err_addr);
> +
> + /* The umc channel bits are not original values, they 
> are hashed */
> + SET_CHANNEL_HASH(channel_index, soc_pa);
> +
> + dev_info(adev->dev, "Error Address(PA): 0x%llx\n", 
> soc_pa);
> + }
>}
>   }
>
> @@ -251,7 +269,9 @@ static void 
> umc_v6_7_ecc_info_query_ras_error_address(struct amdgpu_device *adev
>
>   static void umc_v6_7_query_correctable_error_count(struct amdgpu_device 
> *adev,
>   uint32_t umc_reg_offset,
> -unsigned long *error_count)
> +unsigned long *error_count,
> +uint32_t ch_inst,
> +uint32_t umc_inst)
>   {
>uint32_t ecc_err_cnt_sel, ecc_err_cnt_sel_addr;
>uint32_t ecc_err_cnt, ecc_err_cnt_addr;
> @@ -295,6 +315,33 @@ static void 
> umc_v6_7_query_correctable_error_count(struct amdgpu_device *adev,
>*error_count += 1;
>
>umc_v6_7_query_error_status_helper(adev, mc_umc_status, 
> umc_reg_offset);
> +
> + {
> + uint64_t err_addr, soc_pa;
> + uint32_t mc_umc_addrt0;
> + uint32_t channel_index;
> +
> + mc_umc_addrt0 =
> + SOC15_REG_OFFSET(UMC, 0, 
> regMCA_UMC_UMC0_MCUMC_ADDRT0);
> +
> + channel_index =
> + adev->umc.channel_idx_tbl[umc_inst * 
> adev->umc.channel_inst_num + ch_inst];
> +
> + err_addr = RREG64_PCIE((mc_umc_addrt0 + umc_reg_offset) 
> * 4);
> + err_addr = REG_GET_FIELD(err_addr, 
> MCA_UMC_UMC0_MCUMC_ADDRT0, ErrorAddr);
> +
> + /* translate umc channel address t

RE: [PATCH Review 1/2] drm/amdgpu/pm: support mca_ceumc_addr in ecctable

2022-05-23 Thread Yang, Stanley
[AMD Official Use Only - General]


Hi Kevin,

Please ignore the above mail. Thanks for your suggestion; I will try it.

Regards,
Stanley
From: amd-gfx on behalf of Yang, Stanley
Date: Monday, May 23, 2022, 6:16 PM
To: Wang, Yang(Kevin); amd-gfx@lists.freedesktop.org; Zhang, Hawking; Zhou1, Tao; Quan, Evan; Lazar, Lijo
Subject: RE: [PATCH Review 1/2] drm/amdgpu/pm: support mca_ceumc_addr in ecctable

[AMD Official Use Only - General]


Hi Kevin,

From: Wang, Yang(Kevin)
Sent: Monday, May 23, 2022 4:49 PM
To: Yang, Stanley; amd-gfx@lists.freedesktop.org; Zhang, Hawking; Zhou1, Tao; Quan, Evan; Lazar, Lijo
Subject: Re: [PATCH Review 1/2] drm/amdgpu/pm: support mca_ceumc_addr in ecctable


[AMD Official Use Only - General]



From: amd-gfx on behalf of Stanley.Yang
Sent: Monday, May 23, 2022 4:17 PM
To: amd-gfx@lists.freedesktop.org; Zhang, Hawking; Zhou1, Tao; Quan, Evan; Lazar, Lijo
Cc: Yang, Stanley
Subject: [PATCH Review 1/2] drm/amdgpu/pm: support mca_ceumc_addr in ecctable

SMU add a new variable mca_ceumc_addr to record
umc correctable error address in EccInfo table,
driver side add ecctable_v2 to support this feature

Signed-off-by: Stanley.Yang
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h   |   1 +
 drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h |   2 +
 .../inc/pmfw_if/smu13_driver_if_aldebaran.h   |  15 +++
 .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 101 ++
 .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|   2 +
 5 files changed, 98 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index b9a6fac2b8b2..28e603243b67 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -328,6 +328,7 @@ struct ecc_info_per_ch {
 uint16_t ce_count_hi_chip;
 uint64_t mca_umc_status;
 uint64_t mca_umc_addr;
+   uint64_t mca_ceumc_addr;
 };

 struct umc_ecc_info {
diff --git a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
index a6a7b6c33683..9f7257ada437 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
@@ -322,6 +322,7 @@ enum smu_table_id
 SMU_TABLE_PACE,
 SMU_TABLE_ECCINFO,
 SMU_TABLE_COMBO_PPTABLE,
+   SMU_TABLE_ECCINFO_V2,
 SMU_TABLE_COUNT,
 };

@@ -340,6 +341,7 @@ struct smu_table_context
 void*driver_pptable;
 void*combo_pptable;
 void*ecc_table;
+   void*ecc_table_v2;  // adapt to smu support 
record mca_ceumc_addr
 void*driver_smu_config_table;
 struct smu_tabletables[SMU_TABLE_COUNT];
 /*
diff --git 
a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_aldebaran.h 
b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_aldebaran.h
index 0f67c56c2863..2868604eff49 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_aldebaran.h
+++ b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_aldebaran.h
@@ -522,6 +522,21 @@ typedef struct {
 EccInfo_t  EccInfo[ALDEBARAN_UMC_CHANNEL_NUM];
 } EccInfoTable_t;

+typedef struct {
+   uint64_t mca_umc_status;
+   uint64_t mca_umc_addr;
+   uint64_t mca_ceumc_addr;
+
+   uint16_t ce_count_lo_chip;
+   uint16_t ce_count_hi_chip;
+
+   uint32_t eccPadding;
+} EccInfo_t_v2;
+
+typedef struct {
+   EccInfo_t_v2  EccInfo[ALDEBARAN_UMC_CHANNEL_NUM];
+} EccInfoTable_t_v2;
+
 // These defines are used with the following messages:
 // SMC_MSG_TransferTableDram2Smu
 // SMC_MSG_TransferTableSmu2Dram
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index 38af648cb857..e58df9490cec 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
@@ -82,6 +82,12 @@
  */
 #define SUPPORT_ECCTABLE_SMU_VERSION 0x00442a00

+/*
+ * SMU support mca_ceumc_addr in ECCTABLE since version 68.55.0,
+ * use this to check mca_ceumc_addr record whether support
+ */
+#define SUPPORT_ECCTABLE_V2_SMU_VERSION 0x00443700
+
 /*
  * SMU support BAD CHENNEL info MSG since version 68.51.00,
  * use this to check ECCTALE feature whether support
@@ -239,6 +245,9 @@ static int aldebaran_tables_init(struct smu_context *smu)
 SMU

RE: [PATCH Review 2/2] drm/amdgpu: print umc correctable error address

2022-05-23 Thread Yang, Stanley
Thanks Tao, will update.

> -Original Message-
> From: Zhou1, Tao 
> Sent: Monday, May 23, 2022 6:22 PM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Zhang, Hawking ; Quan,
> Evan ; Lazar, Lijo 
> Cc: Yang, Stanley 
> Subject: RE: [PATCH Review 2/2] drm/amdgpu: print umc correctable error address
> 
> [AMD Official Use Only - General]
> 
> 
> 
> > -Original Message-
> > From: Stanley.Yang 
> > Sent: Monday, May 23, 2022 4:17 PM
> > To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> > ; Zhou1, Tao ; Quan,
> Evan
> > ; Lazar, Lijo 
> > Cc: Yang, Stanley 
> > Subject: [PATCH Review 2/2] drm/amdgpu: print umc correctable error
> > address
> >
> > Signed-off-by: Stanley.Yang 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  5 ++
> >  drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 55
> > ++-
> >  .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c|  2 +
> >  3 files changed, 60 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index 3f23f9ad3249..985b8cddb5a1 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -1108,6 +1108,11 @@ struct amdgpu_device {
> >
> > boolscpm_enabled;
> > uint32_tscpm_status;
> > +
> > +   /* Determine smu ecctable whether support
> > +* record correctable error address
> > +*/
> > +   int record_ce_addr_supported;
> >  };
> >
> >  static inline struct amdgpu_device *drm_to_adev(struct drm_device
> > *ddev) diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > index 606892dbea1c..47bd39e52e9b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > @@ -119,6 +119,27 @@ static void
> > umc_v6_7_ecc_info_query_correctable_error_count(struct amdgpu_device
> > *error_count += 1;
> >
> > umc_v6_7_query_error_status_helper(adev,
> > mc_umc_status, umc_reg_offset);
> > +
> > +   if (adev->record_ce_addr_supported) {
> > +   uint64_t err_addr, soc_pa;
> > +   uint32_t channel_index =
> > +   adev->umc.channel_idx_tbl[umc_inst *
> > adev->umc.channel_inst_num +
> > +ch_inst];
> > +
> > +   err_addr = ras-
> > >umc_ecc.ecc[eccinfo_table_idx].mca_ceumc_addr;
> > +   err_addr = REG_GET_FIELD(err_addr,
> > MCA_UMC_UMC0_MCUMC_ADDRT0, ErrorAddr);
> > +   /* translate umc channel address to soc pa, 3 parts
> > are included */
> > +   soc_pa = ADDR_OF_8KB_BLOCK(err_addr) |
> > +
> > ADDR_OF_256B_BLOCK(channel_index) |
> > +   OFFSET_IN_256B_BLOCK(err_addr);
> > +
> > +   /* The umc channel bits are not original values, they
> > are hashed */
> > +   SET_CHANNEL_HASH(channel_index, soc_pa);
> > +
> > +   /* clear [C4 C3 C2] in soc physical address */
> > +   soc_pa &= ~(0x7ULL << UMC_V6_7_PA_C2_BIT);
> 
> [Tao] This clear is preparation for looping over all column bits in the same
> row. You only need the physical address of one page, so the code can be
> removed.
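
To make the line under discussion concrete, the sketch below shows what the mask does: it zeroes three adjacent bits (C2, C3, C4) of the physical address, starting at the C2 bit position. The function name is hypothetical and the value 17 for UMC_V6_7_PA_C2_BIT is an assumption for illustration; the driver header is the authority on the real bit position.

```c
#include <stdint.h>

#define UMC_V6_7_PA_C2_BIT 17 /* assumed bit position, for illustration */

/* Clear the three column bits [C4 C3 C2] in a SoC physical address. */
static uint64_t clear_c4_c3_c2_sketch(uint64_t soc_pa)
{
	return soc_pa & ~(0x7ULL << UMC_V6_7_PA_C2_BIT);
}
```

As Tao notes, this step only matters when the caller goes on to iterate over every column value in the row, which the print path does not do.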
> 
> > +
> > +   dev_info(adev->dev, "Error Address(PA): 0x%llx\n",
> > soc_pa);
> > +   }
> > }
> >  }
> >
> > @@ -251,7 +272,9 @@ static void
> > umc_v6_7_ecc_info_query_ras_error_address(struct amdgpu_device *adev
> >
> >  static void umc_v6_7_query_correctable_error_count(struct
> > amdgpu_device *adev,
> >uint32_t umc_reg_offset,
> > -  unsigned long
> > *error_count)
> > +  unsigned long
> > *error_count,
> > +  uint32_t ch_inst,
> > +  uint32_t umc_inst)
> >  {
> > uint32_t ecc_err_cnt_sel, ecc_err_cnt_sel_addr;
> > uint32_t ecc_err_cnt, ecc_err_cnt_addr; @@ -295,6 +318,33 @@ static
> > void umc_v6_7_query_correctable_error_count(struct
> > amdgpu_device *adev,
> > *error_count += 1;
> >
> > umc_v6_7_query_error_s

RE: [PATCH Review 1/2] drm/amdgpu/pm: support mca_ceumc_addr in ecctable

2022-05-23 Thread Yang, Stanley
[AMD Official Use Only - General]

Hi Kevin,

From: Wang, Yang(Kevin)
Sent: Monday, May 23, 2022 4:49 PM
To: Yang, Stanley; amd-gfx@lists.freedesktop.org; Zhang, Hawking; Zhou1, Tao; Quan, Evan; Lazar, Lijo
Subject: Re: [PATCH Review 1/2] drm/amdgpu/pm: support mca_ceumc_addr in ecctable


[AMD Official Use Only - General]



From: amd-gfx on behalf of Stanley.Yang
Sent: Monday, May 23, 2022 4:17 PM
To: amd-gfx@lists.freedesktop.org; Zhang, Hawking; Zhou1, Tao; Quan, Evan; Lazar, Lijo
Cc: Yang, Stanley
Subject: [PATCH Review 1/2] drm/amdgpu/pm: support mca_ceumc_addr in ecctable

SMU add a new variable mca_ceumc_addr to record
umc correctable error address in EccInfo table,
driver side add ecctable_v2 to support this feature

Signed-off-by: Stanley.Yang
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h   |   1 +
 drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h |   2 +
 .../inc/pmfw_if/smu13_driver_if_aldebaran.h   |  15 +++
 .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 101 ++
 .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|   2 +
 5 files changed, 98 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index b9a6fac2b8b2..28e603243b67 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -328,6 +328,7 @@ struct ecc_info_per_ch {
 uint16_t ce_count_hi_chip;
 uint64_t mca_umc_status;
 uint64_t mca_umc_addr;
+   uint64_t mca_ceumc_addr;
 };

 struct umc_ecc_info {
diff --git a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
index a6a7b6c33683..9f7257ada437 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
@@ -322,6 +322,7 @@ enum smu_table_id
 SMU_TABLE_PACE,
 SMU_TABLE_ECCINFO,
 SMU_TABLE_COMBO_PPTABLE,
+   SMU_TABLE_ECCINFO_V2,
 SMU_TABLE_COUNT,
 };

@@ -340,6 +341,7 @@ struct smu_table_context
 void*driver_pptable;
 void*combo_pptable;
 void*ecc_table;
+   void*ecc_table_v2;  // adapt to smu support 
record mca_ceumc_addr
 void*driver_smu_config_table;
 struct smu_tabletables[SMU_TABLE_COUNT];
 /*
diff --git 
a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_aldebaran.h 
b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_aldebaran.h
index 0f67c56c2863..2868604eff49 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_aldebaran.h
+++ b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_aldebaran.h
@@ -522,6 +522,21 @@ typedef struct {
 EccInfo_t  EccInfo[ALDEBARAN_UMC_CHANNEL_NUM];
 } EccInfoTable_t;

+typedef struct {
+   uint64_t mca_umc_status;
+   uint64_t mca_umc_addr;
+   uint64_t mca_ceumc_addr;
+
+   uint16_t ce_count_lo_chip;
+   uint16_t ce_count_hi_chip;
+
+   uint32_t eccPadding;
+} EccInfo_t_v2;
+
+typedef struct {
+   EccInfo_t_v2  EccInfo[ALDEBARAN_UMC_CHANNEL_NUM];
+} EccInfoTable_t_v2;
+
 // These defines are used with the following messages:
 // SMC_MSG_TransferTableDram2Smu
 // SMC_MSG_TransferTableSmu2Dram
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index 38af648cb857..e58df9490cec 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
@@ -82,6 +82,12 @@
  */
 #define SUPPORT_ECCTABLE_SMU_VERSION 0x00442a00

+/*
+ * SMU support mca_ceumc_addr in ECCTABLE since version 68.55.0,
+ * use this to check mca_ceumc_addr record whether support
+ */
+#define SUPPORT_ECCTABLE_V2_SMU_VERSION 0x00443700
+
 /*
  * SMU support BAD CHENNEL info MSG since version 68.51.00,
  * use this to check ECCTALE feature whether support
@@ -239,6 +245,9 @@ static int aldebaran_tables_init(struct smu_context *smu)
 SMU_TABLE_INIT(tables, SMU_TABLE_ECCINFO, sizeof(EccInfoTable_t),
PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);

+   SMU_TABLE_INIT(tables, SMU_TABLE_ECCINFO_V2, sizeof(EccInfoTable_t_v2),
+   PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
+
[kevin]:
This table mapping is not needed, for the reason below.

 smu_table->metrics_table = kzalloc(sizeof(SmuMetrics_t), GFP_KERNEL);
 if (!smu_table->metrics_table

RE: [PATCH Review 1/2] drm/amdgpu/pm: support mca_ceumc_addr in ecctable

2022-05-23 Thread Yang, Stanley
[AMD Official Use Only - General]

Hi Lijo,
+@Joo, Maria

> -Original Message-
> From: Lazar, Lijo 
> Sent: Monday, May 23, 2022 5:12 PM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Zhang, Hawking ; Zhou1,
> Tao ; Quan, Evan 
> Subject: Re: [PATCH Review 1/2] drm/amdgpu/pm: support mca_ceumc_addr in
> ecctable
>
>
>
> On 5/23/2022 1:47 PM, Stanley.Yang wrote:
> > SMU add a new variable mca_ceumc_addr to record umc correctable error
> > address in EccInfo table, driver side add ecctable_v2 to support this
> > feature
> >
> > Signed-off-by: Stanley.Yang 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h   |   1 +
> >   drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h |   2 +
> >   .../inc/pmfw_if/smu13_driver_if_aldebaran.h   |  15 +++
> >   .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 101 ++
> >   .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|   2 +
> >   5 files changed, 98 insertions(+), 23 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > index b9a6fac2b8b2..28e603243b67 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > @@ -328,6 +328,7 @@ struct ecc_info_per_ch {
> > uint16_t ce_count_hi_chip;
> > uint64_t mca_umc_status;
> > uint64_t mca_umc_addr;
> > +   uint64_t mca_ceumc_addr;
> >   };
> >
> >   struct umc_ecc_info {
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> > b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> > index a6a7b6c33683..9f7257ada437 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> > @@ -322,6 +322,7 @@ enum smu_table_id
> > SMU_TABLE_PACE,
> > SMU_TABLE_ECCINFO,
> > SMU_TABLE_COMBO_PPTABLE,
> > +   SMU_TABLE_ECCINFO_V2,
>
> Hi Stanley,
>
> This is not the right approach. Need to ask FW team to fix this. There 
> shouldn't
> be any new id with each version of table. You may check Sienna Cichlid smu
> metrics table as an example and ask FW team to follow something similar. I
> don't see 68.55 being released, so it's not late anyway. We don't need to keep
> defining pointers in table context per version of ECC table.
>
> Thanks,
> Lijo
[Yang, Stanley]: There is not enough padding space in ecc_table; see the
struct below.
The newly added variable is of type uint64_t, so I think the SMU can't add it
to ecc_table directly without changing the ecc_table size.
If you have a better approach, we can discuss how to do it.
typedef struct {
    uint64_t mca_umc_status;
    uint64_t mca_umc_addr;
    uint16_t ce_count_lo_chip;
    uint16_t ce_count_hi_chip;

    uint32_t eccPadding;
} EccInfo_t;

Thanks,
Stanley
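
The padding argument can be checked with a short size sketch. The struct names are hypothetical; the field lists come from the EccInfo_t struct quoted above and from the v2 struct in the patch. The only spare room in the original entry is the 4-byte eccPadding, which is smaller than the 8-byte mca_ceumc_addr, so under the usual alignment rules each entry grows from 24 to 32 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name: the original per-channel entry. */
struct ecc_info_v1_sketch {
	uint64_t mca_umc_status;
	uint64_t mca_umc_addr;
	uint16_t ce_count_lo_chip;
	uint16_t ce_count_hi_chip;

	uint32_t eccPadding;
};

/* Hypothetical name: the entry with the new 64-bit field added. */
struct ecc_info_v2_sketch {
	uint64_t mca_umc_status;
	uint64_t mca_umc_addr;
	uint64_t mca_ceumc_addr;

	uint16_t ce_count_lo_chip;
	uint16_t ce_count_hi_chip;

	uint32_t eccPadding;
};

/* The padding field is too small to be reused for a 64-bit address. */
static_assert(sizeof(((struct ecc_info_v1_sketch *)0)->eccPadding) <
	      sizeof(uint64_t), "eccPadding cannot hold mca_ceumc_addr");

/* So the entry, and with it the whole table, has to grow. */
static_assert(sizeof(struct ecc_info_v2_sketch) >
	      sizeof(struct ecc_info_v1_sketch), "v2 entry is larger");
```

This is why the table size changes with the new field, independent of how the driver chooses to identify the table.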
>
> > SMU_TABLE_COUNT,
> >   };
> >
> > @@ -340,6 +341,7 @@ struct smu_table_context
> > void*driver_pptable;
> > void*combo_pptable;
> > void*ecc_table;
> > +   void*ecc_table_v2;  // adapt to smu support 
> > record
> mca_ceumc_addr
> > void*driver_smu_config_table;
> > struct smu_tabletables[SMU_TABLE_COUNT];
> > /*
> > diff --git
> >
> a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_aldebaran.h
> >
> b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_aldebaran.h
> > index 0f67c56c2863..2868604eff49 100644
> > ---
> >
> a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_aldebaran.h
> > +++
> b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu13_driver_if_aldebar
> > +++ an.h
> > @@ -522,6 +522,21 @@ typedef struct {
> > EccInfo_t  EccInfo[ALDEBARAN_UMC_CHANNEL_NUM];
> >   } EccInfoTable_t;
> >
> > +typedef struct {
> > +   uint64_t mca_umc_status;
> > +   uint64_t mca_umc_addr;
> > +   uint64_t mca_ceumc_addr;
> > +
> > +   uint16_t ce_count_lo_chip;
> > +   uint16_t ce_count_hi_chip;
> > +
> > +   uint32_t eccPadding;
> > +} EccInfo_t_v2;
> > +
> > +typedef struct {
> > +   EccInfo_t_v2  EccInfo[ALDEBARAN_UMC_CHANNEL_NUM];
> > +} EccInfoTable_t_v2;
> > +
> >   // These defines are used with the following messages:
> >   // SMC_MSG_TransferTableDram2Smu
> >   // SMC_MSG_TransferTableSmu2Dram
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> > b/drivers/gpu/drm/amd/pm/swsmu/smu1

RE: [PATCH Review 1/1] drm/amdgpu: support ras on SRIOV

2022-05-18 Thread Yang, Stanley
[AMD Official Use Only - General]


Thanks Tao, will update before submitting.

Regards,
Stanley
From: Zhou1, Tao
Date: Thursday, May 19, 2022, 10:30 AM
To: Yang, Stanley; amd-gfx@lists.freedesktop.org; Zhang, Hawking
Cc: Yang, Stanley
Subject: RE: [PATCH Review 1/1] drm/amdgpu: support ras on SRIOV


> -Original Message-
> From: Stanley.Yang 
> Sent: Wednesday, May 18, 2022 11:44 PM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Zhou1, Tao 
> Cc: Yang, Stanley 
> Subject: [PATCH Review 1/1] drm/amdgpu: support ras on SRIOV
>
> support umc/gfx/sdma ras on guest side
>
> Changed from V1:
> move sriov judgment in amdgpu_ras_interrupt_fatal_error_handler
>
> Change-Id: Ic7dda45d8f8cf2d5f1abc7705abc153d558da8a1
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c| 42 --
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c   |  4 +++
>  drivers/gpu/drm/amd/amdgpu/psp_v13_0.c |  9 +++--
>  4 files changed, 45 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b583026dc893..ba7990d0dc0e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5218,6 +5218,10 @@ int amdgpu_device_gpu_recover_imp(struct
> amdgpu_device *adev,
>r = amdgpu_device_reset_sriov(adev, job ? false : true);
>if (r)
>adev->asic_reset_res = r;
> +
> + /* Aldebaran supports ras in SRIOV, so need resume ras during
> reset */
> + if (adev->ip_versions[GC_HWIP][0] == IP_VERSION(9, 4, 2))
> + amdgpu_ras_resume(adev);
>} else {
>r = amdgpu_do_asic_reset(device_list_handle, _context);
>if (r && r == -EAGAIN)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index a653cf3b3d13..2b28210c4994 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -726,7 +726,9 @@ int amdgpu_ras_feature_enable(struct amdgpu_device
> *adev,
>/* Do not enable if it is not allowed. */
>WARN_ON(enable && !amdgpu_ras_is_feature_allowed(adev, head));
>
> - if (!amdgpu_ras_intr_triggered()) {
> + /* Only enable ras feature operation handle on host side */
> + if (!amdgpu_sriov_vf(adev) &&
> + !amdgpu_ras_intr_triggered()) {
>ret = psp_ras_enable_features(>psp, info, enable);
>if (ret) {
>dev_err(adev->dev, "ras %s %s failed poison:%d
> ret:%d\n", @@ -1523,6 +1525,10 @@ static int amdgpu_ras_fs_fini(struct
> amdgpu_device *adev)
>   */
>  void amdgpu_ras_interrupt_fatal_error_handler(struct amdgpu_device *adev)  {
> + /* Fatal error events are handled on host side */
> + if (amdgpu_sriov_vf(adev))
> + return;
> +
>if (!amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__PCIE_BIF))
>return;

[Tao] The two conditions above can be merged, other than that the patch is:

Reviewed-by: Tao Zhou 
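
A sketch of the merge Tao refers to, with the two early returns folded into one condition. The struct and function here are hypothetical stand-ins so the example is self-contained; in the driver the checks are amdgpu_sriov_vf(adev) and amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__PCIE_BIF).

```c
#include <stdbool.h>

/* Hypothetical stand-in for the device state the handler looks at. */
struct adev_sketch {
	bool is_sriov_vf;            /* stands in for amdgpu_sriov_vf(adev) */
	bool pcie_bif_ras_supported; /* stands in for amdgpu_ras_is_supported() */
	int fatal_errors_handled;    /* counts how often the body ran */
};

static void fatal_error_handler_sketch(struct adev_sketch *adev)
{
	/*
	 * Fatal error events are handled on the host side, and only when
	 * PCIE_BIF RAS is supported: one combined early return.
	 */
	if (adev->is_sriov_vf || !adev->pcie_bif_ras_supported)
		return;

	adev->fatal_errors_handled++;
}
```

The behavior is the same as with two separate checks; the merged form just reads as a single precondition.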

>
> @@ -2270,10 +2276,14 @@ static void amdgpu_ras_check_supported(struct
> amdgpu_device *adev)  {
>adev->ras_hw_enabled = adev->ras_enabled = 0;
>
> - if (amdgpu_sriov_vf(adev) || !adev->is_atom_fw ||
> + if (!adev->is_atom_fw ||
>!amdgpu_ras_asic_supported(adev))
>return;
>
> + if (!(amdgpu_sriov_vf(adev) &&
> + (adev->ip_versions[MP1_HWIP][0] == IP_VERSION(13, 0, 2
> + return;
> +
>if (!adev->gmc.xgmi.connected_to_cpu) {
>if (amdgpu_atomfirmware_mem_ecc_supported(adev)) {
>dev_info(adev->dev, "MEM ECC is active.\n"); @@ -
> 2285,15 +2295,21 @@ static void amdgpu_ras_check_supported(struct
> amdgpu_device *adev)
>
>if (amdgpu_atomfirmware_sram_ecc_supported(adev)) {
>dev_info(adev->dev, "SRAM ECC is active.\n");
> - adev->ras_hw_enabled |= ~(1 <<
> AMDGPU_RAS_BLOCK__UMC |
> - 1 <<
> AMDGPU_RAS_BLOCK__DF);
> -
> - if (adev->ip_versions[VCN_HWIP][0] == IP_VERSION(2,
> 6, 0))
> - adev->ras_hw_enabled |= (1 <<
> AMDGPU_RAS_BLOCK__VCN |
> - 1 <<
> AMDGPU_RAS_BLOCK__JPEG

RE: [PATCH 1/3] drm/amdgpu: add poison consumption flag for RAS IH

2022-04-12 Thread Yang, Stanley
The series is Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: Zhou1, Tao 
> Sent: Tuesday, April 12, 2022 11:06 AM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Lazar, Lijo ; Ziya,
> Mohammad zafar ; Zhang, Hawking
> ; Chai, Thomas 
> Subject: RE: [PATCH 1/3] drm/amdgpu: add poison consumption flag for RAS IH
> 
> [AMD Official Use Only]
> 
> Hi Stanley,
> 
> The flag is set by RAS block poison irq handler, such as vcn/jpeg poison irq
> handler. It's not configured in RAS init.
> 
> Regards,
> Tao
> 
> > -Original Message-
> > From: Yang, Stanley 
> > Sent: Monday, April 11, 2022 10:12 PM
> > To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org;
> > Lazar, Lijo ; Ziya, Mohammad zafar
> > ; Zhang, Hawking
> ;
> > Chai, Thomas 
> > Subject: RE: [PATCH 1/3] drm/amdgpu: add poison consumption flag for
> > RAS IH
> >
> > [AMD Official Use Only]
> >
> > Hi Tao,
> >
> > Regarding this patch series, I have one question: is the ras_ih_flag
> > set according to the poison mode configuration? If so, the driver will
> > handle poison as soon as it gets an ecc_irq interrupt, but at that
> > moment there may be no app consuming it, which seems to conflict with
> > the definition of poison consumption.
> >
> > Regards,
> > Stanley
> > > -Original Message-
> > > From: Zhou1, Tao 
> > > Sent: Monday, April 11, 2022 7:08 PM
> > > To: amd-gfx@lists.freedesktop.org; Lazar, Lijo
> > > ; Ziya, Mohammad zafar
> > > ; Zhang, Hawking
> > > ; Yang, Stanley ;
> > > Chai, Thomas 
> > > Cc: Zhou1, Tao 
> > > Subject: [PATCH 1/3] drm/amdgpu: add poison consumption flag for RAS IH
> > >
> > > So we can distinguish RAS poison consumption interrupt from UE
> > > interrupt.
> > >
> > > Signed-off-by: Tao Zhou 
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 7 +++
> > >  1 file changed, 7 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > > index 606df8869b89..380f4c3020c7 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > > @@ -314,6 +314,11 @@ enum amdgpu_ras_ret {
> > >   AMDGPU_RAS_PT,
> > >  };
> > >
> > > +enum amdgpu_ras_ih_flag {
> > > + AMDGPU_RAS_IH_POISON_CONSUMPTION = 0,
> > > + AMDGPU_RAS_IH_LAST,
> > > +};
> > > +
> > >  struct ras_common_if {
> > >   enum amdgpu_ras_block block;
> > >   enum amdgpu_ras_error_type type;
> > > @@ -419,6 +424,8 @@ struct ras_ih_data {
> > >   unsigned int aligned_element_size;
> > >   unsigned int rptr;
> > >   unsigned int wptr;
> > > + /* interrupt type flag */
> > > + unsigned int flag;
> > >  };
> > >
> > >  struct ras_manager {
> > > --
> > > 2.35.1
> >
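
Tao's answer in this thread is that the flag is set by a RAS block's poison irq handler (for example vcn/jpeg), not during RAS init. A minimal sketch of that flow follows. The enum is copied from the patch; everything else is hypothetical, including the assumption that the enum values are used as bit positions in the new flag field and that common interrupt code tests and clears the bit.

```c
#include <stdbool.h>

/* Copied from the patch. */
enum amdgpu_ras_ih_flag {
    AMDGPU_RAS_IH_POISON_CONSUMPTION = 0,
    AMDGPU_RAS_IH_LAST,
};

/* Reduced stand-in for struct ras_ih_data: only the new field. */
struct ih_data_sketch {
    unsigned int flag;   /* interrupt type flag */
};

/* Assumption: a block's poison irq handler marks the event as poison. */
static void mark_poison_consumption(struct ih_data_sketch *data)
{
    data->flag |= 1u << AMDGPU_RAS_IH_POISON_CONSUMPTION;
}

/* Assumption: common IH code tests and clears the bit to pick a path. */
static bool take_poison_consumption(struct ih_data_sketch *data)
{
    unsigned int bit = 1u << AMDGPU_RAS_IH_POISON_CONSUMPTION;
    bool set = (data->flag & bit) != 0;

    data->flag &= ~bit;
    return set;
}

/* Usage helper: returns 1 for the poison path, 0 for the UE path. */
static int dispatch_once(bool from_poison_irq)
{
    struct ih_data_sketch data = { 0 };

    if (from_poison_irq)
        mark_poison_consumption(&data);
    return take_poison_consumption(&data) ? 1 : 0;
}
```

Under these assumptions an interrupt that never passed through a poison irq handler leaves the bit clear and is treated as a UE interrupt, which is the distinction the commit message describes.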


RE: [PATCH 1/3] drm/amdgpu: add poison consumption flag for RAS IH

2022-04-11 Thread Yang, Stanley
[AMD Official Use Only]

Hi Tao,

Regarding this patch series, I have one question: is the ras_ih_flag set
according to the poison mode configuration? If so, the driver will handle poison
as soon as it gets an ecc_irq interrupt, but at that moment there may be no app
consuming it, which seems to conflict with the definition of poison consumption.

Regards,
Stanley
> -Original Message-
> From: Zhou1, Tao 
> Sent: Monday, April 11, 2022 7:08 PM
> To: amd-gfx@lists.freedesktop.org; Lazar, Lijo ;
> Ziya, Mohammad zafar ; Zhang, Hawking
> ; Yang, Stanley ; Chai,
> Thomas 
> Cc: Zhou1, Tao 
> Subject: [PATCH 1/3] drm/amdgpu: add poison consumption flag for RAS IH
>
> So we can distinguish RAS poison consumption interrupt from UE interrupt.
>
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 606df8869b89..380f4c3020c7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -314,6 +314,11 @@ enum amdgpu_ras_ret {
>   AMDGPU_RAS_PT,
>  };
>
> +enum amdgpu_ras_ih_flag {
> + AMDGPU_RAS_IH_POISON_CONSUMPTION = 0,
> + AMDGPU_RAS_IH_LAST,
> +};
> +
>  struct ras_common_if {
>   enum amdgpu_ras_block block;
>   enum amdgpu_ras_error_type type;
> @@ -419,6 +424,8 @@ struct ras_ih_data {
>   unsigned int aligned_element_size;
>   unsigned int rptr;
>   unsigned int wptr;
> + /* interrupt type flag */
> + unsigned int flag;
>  };
>
>  struct ras_manager {
> --
> 2.35.1



RE: [PATCH Review 1/1] drm/amdgpu: print more correctable error info

2022-04-07 Thread Yang, Stanley
Thanks for the suggestion. It is better to centralize the UMC MCA status check in a
helper function; I will update the patch.

Regards,
Stanley
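
A minimal sketch of the kind of shared helper discussed in this thread, so that the correctable and uncorrectable paths log STATUS, IPID, SYND and MISC0 in one place. The snapshot struct, the function names and the FILE-based logging are assumptions made for illustration; the driver reads these registers through its own accessors and logs with dev_info.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical snapshot of the MCA registers printed by the patch. */
struct mca_snapshot {
    uint64_t status;
    uint64_t ipid;
    uint64_t synd;
    uint64_t misc0;
};

/*
 * Shared helper: print every non-zero register once.
 * Returns the number of lines written.
 */
static int umc_print_error_status(FILE *out, const struct mca_snapshot *s,
                                  uint32_t umc_reg_offset)
{
    const struct { const char *name; uint64_t val; } regs[] = {
        { "STATUS", s->status },
        { "IPID",   s->ipid },
        { "SYND",   s->synd },
        { "MISC0",  s->misc0 },
    };
    int printed = 0;

    for (size_t i = 0; i < sizeof(regs) / sizeof(regs[0]); i++) {
        if (!regs[i].val)
            continue;
        fprintf(out, "MCA %s 0x%llx, umc_reg_offset 0x%x\n",
                regs[i].name, (unsigned long long)regs[i].val,
                (unsigned int)umc_reg_offset);
        printed++;
    }
    return printed;
}

/* Correctable-error path: delegates to the shared helper. */
static int log_ce(uint64_t status, uint64_t ipid, uint64_t synd, uint64_t misc0)
{
    struct mca_snapshot s = { status, ipid, synd, misc0 };

    return umc_print_error_status(stdout, &s, 0x10);
}

/* Uncorrectable-error path: same helper, so the output format cannot drift. */
static int log_ue(uint64_t status, uint64_t ipid, uint64_t synd, uint64_t misc0)
{
    struct mca_snapshot s = { status, ipid, synd, misc0 };

    return umc_print_error_status(stdout, &s, 0x10);
}
```

Keeping the register list in one table means adding another register to the log is a one-line change that applies to both error types.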
> -Original Message-
> From: Zhang, Hawking 
> Sent: Friday, April 8, 2022 11:11 AM
> To: Zhou1, Tao ; Yang, Stanley
> ; amd-gfx@lists.freedesktop.org; Li, Candice
> 
> Cc: Yang, Stanley 
> Subject: RE: [PATCH Review 1/1] drm/amdgpu: print more correctable error
> info
> 
> We should consider centralizing the UMC MCA status check in a helper function; at
> least, querying IPID, SYND, and MCA_STATUS should be the same for both UE
> and CE.
> 
> Regards,
> Hawking
> 
> -Original Message-
> From: Zhou1, Tao 
> Sent: Friday, April 8, 2022 10:59
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org;
> Zhang, Hawking ; Li, Candice
> 
> Cc: Yang, Stanley 
> Subject: RE: [PATCH Review 1/1] drm/amdgpu: print more correctable error
> info
> 
> [AMD Official Use Only]
> 
> 
> 
> > -Original Message-
> > From: Stanley.Yang 
> > Sent: Friday, April 8, 2022 10:18 AM
> > To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> > ; Zhou1, Tao ; Li,
> Candice
> > 
> > Cc: Yang, Stanley 
> > Subject: [PATCH Review 1/1] drm/amdgpu: print more correctable error
> > info
> >
> 
> [Tao] It would be better to add a description for the patch.
> 
> > Change-Id: I09a2aae85cde3ab2cb6b042b973da6839ad024ec
> > Signed-off-by: Stanley.Yang 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 62
> > ++-
> >  1 file changed, 60 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > index c45d9c14ecbc..803119f75e39 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > @@ -70,15 +70,46 @@ static void
> > umc_v6_7_ecc_info_query_correctable_error_count(struct amdgpu_device
> {
> > uint64_t mc_umc_status;
> > uint32_t eccinfo_table_idx;
> > +   uint32_t umc_reg_offset;
> > +   uint32_t mc_umc_addr;
> > +   uint64_t reg_value;
> > struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> >
> > +   umc_reg_offset = get_umc_v6_7_reg_offset(adev,
> > +   umc_inst, ch_inst);
> > +
> > eccinfo_table_idx = umc_inst * adev->umc.channel_inst_num +
> ch_inst;
> > /* check for SRAM correctable error
> >   MCUMC_STATUS is a 64 bit register */
> > mc_umc_status = ras-
> > >umc_ecc.ecc[eccinfo_table_idx].mca_umc_status;
> > if (REG_GET_FIELD(mc_umc_status,
> > MCA_UMC_UMC0_MCUMC_STATUST0, Val) == 1 &&
> > -   REG_GET_FIELD(mc_umc_status,
> > MCA_UMC_UMC0_MCUMC_STATUST0, CECC) == 1)
> > +   REG_GET_FIELD(mc_umc_status,
> > MCA_UMC_UMC0_MCUMC_STATUST0, CECC) ==
> > +1) {
> > *error_count += 1;
> > +
> > +   if (mc_umc_status)
> > +   dev_info(adev->dev, "MCA STATUS 0x%llx,
> > umc_reg_offset 0x%x\n",
> > +mc_umc_status, umc_reg_offset);
> > +
> > +   /* print IPID registers value */
> > +   mc_umc_addr =
> > +   SOC15_REG_OFFSET(UMC, 0,
> > regMCA_UMC_UMC0_MCUMC_IPIDT0);
> > +   reg_value = RREG64_PCIE((mc_umc_addr + umc_reg_offset)
> *
> > 4);
> > +   if (reg_value)
> > +   dev_info(adev->dev, "MCA IPID 0x%llx,
> umc_reg_offset
> > 0x%x\n",
> > +reg_value, umc_reg_offset);
> > +
> > +   /* print SYND registers value */
> > +   mc_umc_addr =
> > +   SOC15_REG_OFFSET(UMC, 0,
> > regMCA_UMC_UMC0_MCUMC_SYNDT0);
> > +   reg_value = RREG64_PCIE((mc_umc_addr + umc_reg_offset)
> *
> > 4);
> > +   if (reg_value)
> > +   dev_info(adev->dev, "MCA SYND 0x%llx,
> > umc_reg_offset 0x%x\n",
> > +reg_value, umc_reg_offset);
> > +
> > +   /* print MISC0 registers value */
> > +   mc_umc_addr =
> > +   SOC15_REG_OFFSET(UMC, 0,
> > regMCA_UMC_UMC0_MCUMC_MISC0T0);
> > +   reg_value = RREG64_PCIE((mc_umc_addr + umc_reg_offset)
> *
> > 4);
> > +   if (reg_value)
> > +   dev_info(adev->dev, "MCA MISC0 0x%llx,
> > umc_reg_offset 0x%x\n", reg_value, umc_reg_offset);
> > +   }
> 
> [Tao] can we implement a query_error_status function and:
> 
> 1. call query_error_status in xxx_error_count function, like t

RE: [PATCH Review 1/1] drm/amdgpu: support send bad channel info to smu

2022-03-02 Thread Yang, Stanley


> -Original Message-
> From: Zhou1, Tao 
> Sent: Wednesday, March 2, 2022 3:45 PM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Zhang, Hawking ;
> Joo, Maria 
> Cc: Yang, Stanley 
> Subject: RE: [PATCH Review 1/1] drm/amdgpu: support send bad channel info
> to smu
> 
> [AMD Official Use Only]
> 
> 
> 
> > -Original Message-
> > From: Stanley.Yang 
> > Sent: Tuesday, March 1, 2022 9:30 PM
> > To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> > ; Zhou1, Tao ; Joo,
> Maria
> > 
> > Cc: Yang, Stanley 
> > Subject: [PATCH Review 1/1] drm/amdgpu: support send bad channel info
> > to smu
> >
> > Message SMU bad channel information bitmap to update OOB table
> >
> > Change-Id: I49a79af64d5263c28db059ecb8b8405a471431b4
> > Signed-off-by: Stanley.Yang 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   |  7 +++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h   |  3 ++
> >  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 25 ++-
> >  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h|  4 ++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c   |  5 +++
> >  drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 12 ++
> >  drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |  1 +
> >  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 10 +
> >  drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h |  7 +++
> >  .../pm/swsmu/inc/pmfw_if/aldebaran_ppsmc.h|  3 +-
> >  drivers/gpu/drm/amd/pm/swsmu/inc/smu_types.h  |  3 +-
> >  .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 43
> +++
> >  12 files changed, 119 insertions(+), 4 deletions(-)
> 
> [Tao] It's better to split the patch into two parts, one for amdgpu and one
> for pm.
[Yang, Stanley]: Yes, that makes sense; I will update.
> 
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index d3875618ebf5..f9104f99eb9c 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -2068,6 +2068,7 @@ int amdgpu_ras_recovery_init(struct
> > amdgpu_device
> > *adev)
> > mutex_init(>recovery_lock);
> > INIT_WORK(>recovery_work, amdgpu_ras_do_recovery);
> > atomic_set(>in_recovery, 0);
> > +   con->eeprom_control.bad_channel_bitmap = 0;
> >
> > max_eeprom_records_count =
> > amdgpu_ras_eeprom_max_record_count();
> > amdgpu_ras_validate_threshold(adev,
> max_eeprom_records_count); @@
> > -2092,6 +2093,11 @@ int amdgpu_ras_recovery_init(struct amdgpu_device
> > *adev)
> > goto free;
> >
> > amdgpu_dpm_send_hbm_bad_pages_num(adev, con-
> > >eeprom_control.ras_num_recs);
> > +
> > +   if (con->update_channel_flag == true) {
> [Tao] It can be simplified to "if (con->update_channel_flag)"
[Yang, Stanley]: Yes, both "if (con->update_channel_flag)" and
"if (con->update_channel_flag == true)" work.
> 
> > +   amdgpu_dpm_send_hbm_bad_channel_flag(adev,
> con-
> > >eeprom_control.bad_channel_bitmap);
> 
> [Tao] do we need to check status of the function and stop recovery_init if it
> fails?
[Yang, Stanley]: No, the RAS recovery process is not affected even if the
message to the SMU fails.
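
A minimal sketch of the best-effort behaviour described in this reply: the bad channel bitmap is sent when the flag is pending, and a failure does not abort recovery init. The struct, the stub for amdgpu_dpm_send_hbm_bad_channel_flag() and the stderr log line are hypothetical; the log line in particular is an addition for illustration and is not part of the patch.

```c
#include <stdbool.h>
#include <stdio.h>

/* Reduced stand-in for the two fields the patch adds. */
struct ras_ctx_sketch {
    bool update_channel_flag;
    unsigned int bad_channel_bitmap;
};

/* Hypothetical stub: smu_ok selects success or failure of the SMU message. */
static int send_bad_channel_flag(unsigned int bitmap, bool smu_ok)
{
    (void)bitmap;
    return smu_ok ? 0 : -1;
}

/* Always returns 0: an SMU failure is reported but recovery init goes on. */
static int recovery_init_tail(struct ras_ctx_sketch *con, bool smu_ok)
{
    if (con->update_channel_flag) {
        int r = send_bad_channel_flag(con->bad_channel_bitmap, smu_ok);

        if (r)
            fprintf(stderr, "bad channel update failed (%d), continuing\n", r);
        con->update_channel_flag = false;
    }
    return 0;
}

/* Usage helper: true when init succeeded and the flag ended up cleared. */
static bool tail_ok(bool pending, bool smu_ok)
{
    struct ras_ctx_sketch con = { pending, 0x5 };
    int r = recovery_init_tail(&con, smu_ok);

    return r == 0 && !con.update_channel_flag;
}
```

The sketch also uses the shorter "if (con->update_channel_flag)" form that Tao proposes above.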
> 
> > +   con->update_channel_flag = false;
> > +   }
> > }
> >
> >  #ifdef CONFIG_X86_MCE_AMD
> > @@ -2285,6 +2291,7 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
> > goto release_con;
> > }
> >
> > +   con->update_channel_flag = false;
> > con->features = 0;
> > INIT_LIST_HEAD(>head);
> > /* Might need get this flag from vbios. */ diff --git
> > a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > index 7cddaad90d6d..9314fde81e68 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > @@ -374,6 +374,9 @@ struct amdgpu_ras {
> >
> > /* record umc error info queried from smu */
> > struct umc_ecc_info umc_ecc;
> > +
> > +   /* Indicates smu whether need update bad channel info */
> > +   bool update_channel_flag;
> >  };
> >
> >  struct ras_fs_data {
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index 2b844a5aafdb..ad5d8667756d 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_ee

Recall: [PATCH] drm/amdgpu: Move common initialization operations of each ras block to one function

2022-03-01 Thread Yang, Stanley
Yang, Stanley would like to recall the message, "[PATCH] drm/amdgpu: Move common
initialization operations of each ras block to one function".

RE: [PATCH] drm/amdgpu: Move common initialization operations of each ras block to one function

2022-03-01 Thread Yang, Stanley
[AMD Official Use Only]

Hi yipe,

I have one suggestion for this patch; please see my inline comment.

Regards,
Stanley
> -Original Message-
> From: amd-gfx  On Behalf Of yipechai
> Sent: Tuesday, March 1, 2022 5:46 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao ; Zhang, Hawking
> ; Clements, John ;
> Chai, Thomas ; Chai, Thomas
> 
> Subject: [PATCH] drm/amdgpu: Move common initialization operations of each
> ras block to one function
>
> Define amdgpu_ras_sw_init function to initialize all ras blocks.
>
> Signed-off-by: yipechai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c|   2 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c| 143
> -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h|   1 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c  |  21 ---
>  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c |  16 ---
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  |  28 
>  drivers/gpu/drm/amd/amdgpu/mca_v3_0.c  |   6 -
>  drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c |  17 ---
>  9 files changed, 148 insertions(+), 92 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 6113ddc765a7..72550e9f6058 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2402,6 +2402,12 @@ static int amdgpu_device_ip_init(struct
> amdgpu_device *adev)
>   }
>   }
>
> + r = amdgpu_ras_sw_init(adev);
> + if (r) {
> + DRM_ERROR("amdgpu_ras_early_init failed (%d).\n", r);
> + goto init_failed;
> + }
[Yang, Stanley]: This is the early init of the RAS blocks, so I think it is more
reasonable to move amdgpu_ras_sw_init before the amdgpu_ras_init function.

> +
>   if (amdgpu_sriov_vf(adev))
>   amdgpu_virt_init_data_exchange(adev);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index ab75e189bc0b..544241f357b2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -440,8 +440,6 @@ int amdgpu_gmc_ras_early_init(struct
> amdgpu_device *adev)  {
>   if (!adev->gmc.xgmi.connected_to_cpu) {
>   adev->gmc.xgmi.ras = _ras;
> - amdgpu_ras_register_ras_block(adev, 
> >gmc.xgmi.ras->ras_block);
> - adev->gmc.xgmi.ras_if = >gmc.xgmi.ras-
> >ras_block.ras_comm;
>   }
>
>   return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index d3875618ebf5..89075ab9e82e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2299,8 +2299,6 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
>   case CHIP_ALDEBARAN:
>   if (!adev->gmc.xgmi.connected_to_cpu) {
>   adev->nbio.ras = _v7_4_ras;
> - amdgpu_ras_register_ras_block(adev, 
> >nbio.ras->ras_block);
> - adev->nbio.ras_if = >nbio.ras-
> >ras_block.ras_comm;
>   }
>   break;
>   default:
> @@ -2533,6 +2531,147 @@ void amdgpu_ras_suspend(struct
> amdgpu_device *adev)
>   amdgpu_ras_disable_all_features(adev, 1);  }
>
> +int amdgpu_ras_sw_init(struct amdgpu_device *adev) {
> + int err = 0;
> +
> + if (!amdgpu_ras_asic_supported(adev))
> + return 0;
> +
> + if (adev->nbio.ras) {
> + err = amdgpu_ras_register_ras_block(adev, 
> >nbio.ras->ras_block);
> + if (err) {
> + dev_err(adev->dev, "Failed to register nbio ras
> block!\n");
> + return err;
> + }
> + adev->nbio.ras_if = >nbio.ras->ras_block.ras_comm;
> + }
> +
> + if (adev->gmc.xgmi.ras) {
> + err = amdgpu_ras_register_ras_block(adev, 
> >gmc.xgmi.ras->ras_block);
> + if (err) {
> + dev_err(adev->dev, "Failed to register xgmi ras
> block!\n");
> + return err;
> + }
> + adev->gmc.xgmi.ras_if = >gmc.xgmi.ras-
> >ras_block.ras_comm;
> + }
> +
> + if (adev->gfx.ras) {
> + err = amdgpu_ras_register_ras_block(adev, >gfx.ras-
> >ras_block);
> + if (err) {
> + dev_err(adev->dev, "Failed to register gfx ras
> block!\n");
> + return err;
> + }
> +
> + strc

RE: [PATCH Review 1/1] drm/amdgpu: fix convert bad page retiremt

2022-01-19 Thread Yang, Stanley
[AMD Official Use Only]



> -Original Message-
> From: Zhou1, Tao 
> Sent: Thursday, January 20, 2022 11:09 AM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org
> Cc: Zhang, Hawking ; Clements, John
> ; Ziya, Mohammad zafar
> ; Yang, Stanley 
> Subject: RE: [PATCH Review 1/1] drm/amdgpu: fix convert bad page retiremt
> 
> [AMD Official Use Only]
> 
> 
> 
> > -Original Message-
> > From: Stanley.Yang 
> > Sent: Thursday, January 20, 2022 12:29 AM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Zhang, Hawking ; Zhou1, Tao
> > ; Clements, John ;
> Ziya,
> > Mohammad zafar ; Yang, Stanley
> > 
> > Subject: [PATCH Review 1/1] drm/amdgpu: fix convert bad page retiremt
> >
> > Pmfw read ecc info registers and store values in eccinfo_table in the
> > following order
> >
> > umc0 ch_inst 0, 1, 2 ... 7
> > umc1 ch_inst 0, 1, 2 ... 7
> > ...
> > umc3 ch_inst 0, 1, 2 ... 7
> >
> > Driver should convert eccinfo_table_idx into channel_index according
> > to channel_idx_tbe.
> [Tao]: typo, channel_idx_tbe -> channel_idx_tbl
> 
> The patch is OK for me, do we also need to apply the update to umc_v8_7.c?
[Yang, Stanley]: We need to confirm how the ECC info register addresses are
defined in PMFW; umc_v8_7.c should be updated as well if those register
addresses are defined in the same way.
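
A minimal sketch of the index calculation the patch introduces. The table is laid out UMC by UMC, with the channel instances of each UMC stored consecutively, so the index is row-major over (umc_inst, ch_inst). The constant 8 is taken from the layout listed in the commit message and stands in for adev->umc.channel_inst_num; the function name is hypothetical.

```c
#include <stdint.h>

/* Assumption for the sketch: 8 channel instances per UMC, as in the listing. */
#define CHANNEL_INST_NUM 8u

/* Row-major index into eccinfo_table: all channels of umc0, then umc1, ... */
static uint32_t eccinfo_table_idx(uint32_t umc_inst, uint32_t ch_inst)
{
    return umc_inst * CHANNEL_INST_NUM + ch_inst;
}
```

With four UMC instances this gives indices 0 through 31: umc0 ch_inst 0 maps to 0, umc1 ch_inst 0 maps to 8, and umc3 ch_inst 7 maps to 31.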
> 
> >
> > Change-Id: Icafe93e458912b729d2e30d655fd68be0e12124d
> > Signed-off-by: Stanley.Yang 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 26 ++
> 
> >  1 file changed, 14 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > index 526de1ca9b8d..f5a1ba7db75a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > @@ -58,29 +58,33 @@ static inline uint32_t
> > get_umc_v6_7_channel_index(struct amdgpu_device *adev,  }
> >
> >  static void umc_v6_7_ecc_info_query_correctable_error_count(struct
> > amdgpu_device *adev,
> > -  uint32_t channel_index,
> > +  uint32_t umc_inst, uint32_t
> > ch_inst,
> >unsigned long *error_count)
> >  {
> > uint64_t mc_umc_status;
> > +   uint32_t eccinfo_table_idx;
> > struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> >
> > +   eccinfo_table_idx = umc_inst * adev->umc.channel_inst_num +
> ch_inst;
> > /* check for SRAM correctable error
> >   MCUMC_STATUS is a 64 bit register */
> > -   mc_umc_status = ras-
> >umc_ecc.ecc[channel_index].mca_umc_status;
> > +   mc_umc_status = ras-
> > >umc_ecc.ecc[eccinfo_table_idx].mca_umc_status;
> > if (REG_GET_FIELD(mc_umc_status,
> > MCA_UMC_UMC0_MCUMC_STATUST0, Val) == 1 &&
> > REG_GET_FIELD(mc_umc_status,
> > MCA_UMC_UMC0_MCUMC_STATUST0, CECC) == 1)
> > *error_count += 1;
> >  }
> >
> >  static void umc_v6_7_ecc_info_querry_uncorrectable_error_count(struct
> > amdgpu_device *adev,
> > - uint32_t channel_index,
> > + uint32_t umc_inst,
> > uint32_t ch_inst,
> >   unsigned long
> *error_count) {
> > uint64_t mc_umc_status;
> > +   uint32_t eccinfo_table_idx;
> > struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> >
> > +   eccinfo_table_idx = umc_inst * adev->umc.channel_inst_num +
> ch_inst;
> > /* check the MCUMC_STATUS */
> > -   mc_umc_status = ras-
> >umc_ecc.ecc[channel_index].mca_umc_status;
> > +   mc_umc_status = ras-
> > >umc_ecc.ecc[eccinfo_table_idx].mca_umc_status;
> > if ((REG_GET_FIELD(mc_umc_status,
> > MCA_UMC_UMC0_MCUMC_STATUST0, Val) == 1) &&
> > (REG_GET_FIELD(mc_umc_status,
> > MCA_UMC_UMC0_MCUMC_STATUST0, Deferred) == 1 ||
> > REG_GET_FIELD(mc_umc_status,
> > MCA_UMC_UMC0_MCUMC_STATUST0, UECC) == 1 || @@ -97,19 +101,15
> @@ static
> > void umc_v6_7_ecc_info_query_ras_error_count(struct amdgpu_device
> > *adev,
> >
> > uint32_t umc_inst= 0;
> > uint32_t ch_inst = 0;
> > -   uint32_t channel_index   = 0;
> >
> > /*TODO: driver needs to toggle DF Cstate to ensure
> >  * safe access of UMC registers. Will add the protection */
> > LOOP_UMC_INST_AN

RE: [PATCH Review 1/1] drm/amdgpu: remove unused variable warning

2022-01-19 Thread Yang, Stanley
[AMD Official Use Only]

Thanks Hawking, the fix in umc_v8_7.c is not included in Zafar's patch.

Regards,
Stanley
> -Original Message-
> From: Zhang, Hawking 
> Sent: Wednesday, January 19, 2022 8:10 PM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Ziya, Mohammad zafar
> ; Clements, John
> ; Zhou1, Tao 
> Cc: Yang, Stanley 
> Subject: RE: [PATCH Review 1/1] drm/amdgpu: remove unused variable
> warning
> [AMD Official Use Only]
> 
> The change made in drivers/gpu/drm/amd/amdgpu/umc_v8_7.c looks
> already covered by Zafar's change. Other than that, the patch looks good to
> me.
> 
> Reviewed-by: Hawking Zhang 
> 
> Regards,
> Hawking
> -Original Message-
> From: Stanley.Yang 
> Sent: Wednesday, January 19, 2022 19:31
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Ziya, Mohammad zafar
> ; Clements, John
> ; Zhou1, Tao 
> Cc: Yang, Stanley 
> Subject: [PATCH Review 1/1] drm/amdgpu: remove unused variable warning
> 
> Change-Id: Ic2a488ee253a913d806bd33ee9c90e31a71af320
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 23 ---
> drivers/gpu/drm/amd/amdgpu/umc_v8_7.c |  6 --
>  2 files changed, 29 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> index 6953426f0bed..526de1ca9b8d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> @@ -61,22 +61,9 @@ static void
> umc_v6_7_ecc_info_query_correctable_error_count(struct amdgpu_device
>  uint32_t channel_index,
>  unsigned long *error_count)
>  {
> - uint32_t ecc_err_cnt;
>   uint64_t mc_umc_status;
>   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> 
> - /*
> -  * select the lower chip and check the error count
> -  * skip add error count, calc error counter only from mca_umc_status
> -  */
> - ecc_err_cnt = ras->umc_ecc.ecc[channel_index].ce_count_lo_chip;
> -
> - /*
> -  * select the higher chip and check the err counter
> -  * skip add error count, calc error counter only from mca_umc_status
> -  */
> - ecc_err_cnt = ras->umc_ecc.ecc[channel_index].ce_count_hi_chip;
> -
>   /* check for SRAM correctable error
> MCUMC_STATUS is a 64 bit register */
>   mc_umc_status = ras-
> >umc_ecc.ecc[channel_index].mca_umc_status;
> @@ -110,15 +97,11 @@ static void
> umc_v6_7_ecc_info_query_ras_error_count(struct amdgpu_device *adev,
> 
>   uint32_t umc_inst= 0;
>   uint32_t ch_inst = 0;
> - uint32_t umc_reg_offset  = 0;
>   uint32_t channel_index   = 0;
> 
>   /*TODO: driver needs to toggle DF Cstate to ensure
>* safe access of UMC registers. Will add the protection */
>   LOOP_UMC_INST_AND_CH(umc_inst, ch_inst) {
> - umc_reg_offset = get_umc_v6_7_reg_offset(adev,
> -  umc_inst,
> -  ch_inst);
>   channel_index = get_umc_v6_7_channel_index(adev,
>umc_inst,
>ch_inst);
> @@ -133,7 +116,6 @@ static void
> umc_v6_7_ecc_info_query_ras_error_count(struct amdgpu_device *adev,
> 
>  static void umc_v6_7_ecc_info_query_error_address(struct amdgpu_device
> *adev,
>struct ras_err_data *err_data,
> -  uint32_t umc_reg_offset,
>uint32_t ch_inst,
>uint32_t umc_inst)
>  {
> @@ -192,18 +174,13 @@ static void
> umc_v6_7_ecc_info_query_ras_error_address(struct amdgpu_device *adev
> 
>   uint32_t umc_inst= 0;
>   uint32_t ch_inst = 0;
> - uint32_t umc_reg_offset  = 0;
> 
>   /*TODO: driver needs to toggle DF Cstate to ensure
>* safe access of UMC resgisters. Will add the protection
>* when firmware interface is ready */
>   LOOP_UMC_INST_AND_CH(umc_inst, ch_inst) {
> - umc_reg_offset = get_umc_v6_7_reg_offset(adev,
> -  umc_inst,
> -  ch_inst);
>   umc_v6_7_ecc_info_query_error_address(adev,
>err_data,
> -  umc_reg_offset,
>ch_inst,
>  

RE: [PATCH Review 1/1] drm/amdgpu: handle denied inject error into critical regions v2

2022-01-12 Thread Yang, Stanley
[AMD Official Use Only]

Thanks, I will update it before submitting.

Regards,
Stanley
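
A minimal sketch of the status-to-errno mapping that the quoted patch adds to psp_ras_trigger_error. The two 0xA0xx values are copied from the ta_ras_if.h hunk in the patch; the success value of zero is an assumption based on the patch's "if (ras_cmd->ras_status)" test, and the function name is hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Values from the patch, except SUCCESS, which is assumed to be zero. */
enum ta_ras_status_sketch {
    TA_RAS_STATUS__SUCCESS                     = 0,
    TA_RAS_STATUS__ERROR_UNSUPPORTED_ERROR_INJ = 0xA019,
    TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED     = 0xA01A,
};

/* Access-denied gets its own errno; any other failure stays -EINVAL. */
static int ras_status_to_errno(uint32_t ras_status)
{
    if (ras_status == TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED)
        return -EACCES;
    if (ras_status)
        return -EINVAL;
    return 0;
}
```

The distinct -EACCES only reaches user space because the patch also changes amdgpu_ras_debugfs_ctrl_write to return ret instead of flattening every failure to -EINVAL.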
> -Original Message-
> From: Zhou1, Tao 
> Sent: Thursday, January 13, 2022 11:29 AM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org
> Cc: Zhang, Hawking ; Clements, John
> ; Yang, Stanley 
> Subject: RE: [PATCH Review 1/1] drm/amdgpu: handle denied inject error into
> critical regions v2
> 
> [AMD Official Use Only]
> 
> Since you use dev_warn, "RAS WARNING" is better than "RAS INFO" in the
> print message, with this fixed the patch is:
> 
> Reviewed-by: Tao Zhou 
> 
> > -Original Message-
> > From: Stanley.Yang 
> > Sent: Thursday, January 13, 2022 9:28 AM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Zhang, Hawking ; Clements, John
> > ; Zhou1, Tao ; Yang,
> Stanley
> > 
> > Subject: [PATCH Review 1/1] drm/amdgpu: handle denied inject error
> > into critical regions v2
> >
> > Changed from v1:
> > remove unused brace
> >
> > Signed-off-by: Stanley.Yang 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 9 -
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
> > drivers/gpu/drm/amd/amdgpu/ta_ras_if.h  | 3 ++-
> >  3 files changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > index c742d1aacf5a..144176779f9e 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > @@ -1309,6 +1309,11 @@ static void psp_ras_ta_check_status(struct
> > psp_context *psp)
> > break;
> > case TA_RAS_STATUS__SUCCESS:
> > break;
> > +   case TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED:
> > +   if (ras_cmd->cmd_id ==
> TA_RAS_COMMAND__TRIGGER_ERROR)
> > +   dev_warn(psp->adev->dev,
> > +   "RAS INFO: Inject error to critical
> > region is not allowed\n");
> > +   break;
> > default:
> > dev_warn(psp->adev->dev,
> > "RAS WARNING: ras status = 0x%X\n",
> ras_cmd->ras_status); @@
> > -1521,7 +1526,9 @@ int psp_ras_trigger_error(struct psp_context *psp,
> > if (amdgpu_ras_intr_triggered())
> > return 0;
> >
> > -   if (ras_cmd->ras_status)
> > +   if (ras_cmd->ras_status ==
> > TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED)
> > +   return -EACCES;
> > +   else if (ras_cmd->ras_status)
> > return -EINVAL;
> >
> > return 0;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index e674dbed3615..8bdc2e85cb20 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -449,7 +449,7 @@ static ssize_t
> > amdgpu_ras_debugfs_ctrl_write(struct file *f,
> > }
> >
> > if (ret)
> > -   return -EINVAL;
> > +   return ret;
> >
> > return size;
> >  }
> > diff --git a/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> > b/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> > index 5093826a43d1..509d8a1945eb 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> > @@ -64,7 +64,8 @@ enum ta_ras_status {
> > TA_RAS_STATUS__ERROR_PCS_STATE_ERROR= 0xA016,
> > TA_RAS_STATUS__ERROR_PCS_STATE_HANG = 0xA017,
> > TA_RAS_STATUS__ERROR_PCS_STATE_UNKNOWN  = 0xA018,
> > -   TA_RAS_STATUS__ERROR_UNSUPPORTED_ERROR_INJ  = 0xA019
> > +   TA_RAS_STATUS__ERROR_UNSUPPORTED_ERROR_INJ  = 0xA019,
> > +   TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED  = 0xA01A
> >  };
> >
> >  enum ta_ras_block {
> > --
> > 2.17.1


RE: [PATCH] drm/amdgpu: save error count in RAS poison handler

2021-12-20 Thread Yang, Stanley
[AMD Official Use Only]

> +void amdgpu_umc_ras_fini(struct amdgpu_device *adev) {
> + if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__UMC)
> &&
> + adev->umc.ras_if) {
> + struct ras_common_if *ras_if = adev->umc.ras_if;
> + struct ras_ih_if ih_info = {
> + .head = *ras_if,
> + .cb = amdgpu_umc_process_ras_data_cb,
> + };
> +
> + amdgpu_ras_late_fini(adev, ras_if, _info);
> +     kfree(ras_if);
> + }
> +}
> +
> +
> +
[Yang, Stanley] It would be better to remove the extra blank lines.
>  int amdgpu_umc_process_ecc_irq(struct amdgpu_device *adev,
>   struct amdgpu_irq_src *source,
>   struct amdgpu_iv_entry *entry)

Other than the above, the patch is Reviewed-by: Stanley.Yang 

> -Original Message-
> From: Zhou1, Tao 
> Sent: Monday, December 20, 2021 4:51 PM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Yang, Stanley ; Chai,
> Thomas 
> Cc: Zhou1, Tao 
> Subject: [PATCH] drm/amdgpu: save error count in RAS poison handler
> 
> Otherwise the RAS error count couldn't be queried from sysfs.
> 
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |   2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c| 170 --
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h|   3 +-
>  3 files changed, 99 insertions(+), 76 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 0bf09a94d944..776a947b45df 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -727,7 +727,7 @@ void
> amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device
> *adev, bo
> 
>   /* CPU MCA will handle page retirement if connected_to_cpu is 1 */
>   if (!adev->gmc.xgmi.connected_to_cpu)
> - amdgpu_umc_do_page_retirement(adev, _data, NULL,
> reset);
> + amdgpu_umc_poison_handler(adev, _data, reset);
>   else if (reset)
>   amdgpu_amdkfd_gpu_reset(adev);
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index 0c33f367a4e5..1c2dbd00f647 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -23,79 +23,7 @@
> 
>  #include "amdgpu_ras.h"
> 
> -static int amdgpu_umc_process_ras_data_cb(struct amdgpu_device *adev,
> - void *ras_error_status,
> - struct amdgpu_iv_entry *entry)
> -{
> - return amdgpu_umc_do_page_retirement(adev, ras_error_status,
> entry, true);
> -}
> -
> -int amdgpu_umc_ras_late_init(struct amdgpu_device *adev) -{
> - int r;
> - struct ras_fs_if fs_info = {
> - .sysfs_name = "umc_err_count",
> - };
> - struct ras_ih_if ih_info = {
> - .cb = amdgpu_umc_process_ras_data_cb,
> - };
> -
> - if (!adev->umc.ras_if) {
> - adev->umc.ras_if =
> - kmalloc(sizeof(struct ras_common_if), GFP_KERNEL);
> - if (!adev->umc.ras_if)
> - return -ENOMEM;
> - adev->umc.ras_if->block = AMDGPU_RAS_BLOCK__UMC;
> - adev->umc.ras_if->type =
> AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE;
> - adev->umc.ras_if->sub_block_index = 0;
> - }
> - ih_info.head = fs_info.head = *adev->umc.ras_if;
> -
> - r = amdgpu_ras_late_init(adev, adev->umc.ras_if,
> -  _info, _info);
> - if (r)
> - goto free;
> -
> - if (amdgpu_ras_is_supported(adev, adev->umc.ras_if->block)) {
> - r = amdgpu_irq_get(adev, >gmc.ecc_irq, 0);
> - if (r)
> - goto late_fini;
> - } else {
> - r = 0;
> - goto free;
> - }
> -
> - /* ras init of specific umc version */
> - if (adev->umc.ras_funcs &&
> - adev->umc.ras_funcs->err_cnt_init)
> - adev->umc.ras_funcs->err_cnt_init(adev);
> -
> - return 0;
> -
> -late_fini:
> - amdgpu_ras_late_fini(adev, adev->umc.ras_if, _info);
> -free:
> - kfree(adev->umc.ras_if);
> - adev->umc.ras_if = NULL;
> - return r;
> -}
> -
> -void amdgpu_umc_ras_fini(struct amdgpu_device *adev)
> -{
> - if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__UMC)
> &&
> - adev->umc.ras_if) {
> - struct ras_common_if *ras_if = adev->umc.r

Re: [PATCH Review 1/1] drm/amd/pm: print errorno if get ecc info failed

2021-12-06 Thread Yang, Stanley
Hi Evan,

The error messages in smu_cmn_send_smc_msg_with_param() do not cover every 
failure case: it only prints when the response register reads SMU_RESP_NONE or 
SMU_RESP_BUSY_OTHER, or when the result is -EREMOTEIO. I think it would be 
better to extend the response-status checks so that more failures are reported.
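
As a rough illustration of the point (a standalone sketch, not driver code; 
the helper name format_export_error is hypothetical), including the return 
code in the message lets a reader tell failure cases apart even when the 
lower layer printed nothing:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: builds the message proposed in the patch, with the
 * return code appended so that failures not reported elsewhere can still
 * be told apart in the log.
 */
static int format_export_error(char *buf, size_t len, int ret)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "Failed to export SMU ecc table! ret %d.", ret);
}
```

In the driver itself the same effect comes from passing ret to dev_info(), as 
the quoted patch does.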

Regards,
Stanley
From: Quan, Evan 
Date: Monday, December 6, 2021 9:43 AM
To: Yang, Stanley , amd-gfx@lists.freedesktop.org 
, Zhang, Hawking , 
Clements, John , Zhou1, Tao , Li, 
Candice , Chai, Thomas 
Cc: Yang, Stanley 
Subject: RE: [PATCH Review 1/1] drm/amd/pm: print errorno if get ecc info failed
[AMD Official Use Only]

Hi Stanley,

There are already error messages in smu_cmn_send_smc_msg_with_param(), which is 
used by the API mentioned below.
Can that cover your use case?

BR
Evan
> -Original Message-
> From: Stanley.Yang 
> Sent: Sunday, December 5, 2021 6:02 PM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Clements, John ;
> Zhou1, Tao ; Li, Candice ;
> Chai, Thomas ; Quan, Evan 
> Cc: Yang, Stanley 
> Subject: [PATCH Review 1/1] drm/amd/pm: print errorno if get ecc info failed
>
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> index 6e781cee8bb6..e0a8224e466f 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> @@ -1815,7 +1815,7 @@ static ssize_t aldebaran_get_ecc_info(struct
> smu_context *smu,
>   smu_table->ecc_table,
>   false);
>if (ret) {
> - dev_info(smu->adev->dev, "Failed to export SMU ecc
> table!\n");
> + dev_info(smu->adev->dev, "Failed to export SMU ecc table!
> ret %d.\n", ret);
>return ret;
>}
>
> --
> 2.17.1


Re: [PATCH Review 1/1] drm/amdgpu: adjust ip block add sequence on aldebaran

2021-11-28 Thread Yang, Stanley
[AMD Official Use Only]

Thanks, will update before submitting.

Regards,
Stanley
> -Original Message-
> From: Zhang, Hawking 
> Sent: Monday, November 29, 2021 2:36 PM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Clements, John ;
> Zhou1, Tao 
> Cc: Yang, Stanley 
> Subject: RE: [PATCH Review 1/1] drm/amdgpu: adjust ip block add sequence on
> aldebaran
> 
> [AMD Official Use Only]
> 
> Please fix a typo in code comments smda->sdma. And double check the code
> alignment before commit.
> 
> V2 is
> 
> Reviewed-by: Hawking Zhang 
> 
> Regards,
> Hawking
> -Original Message-
> From: Stanley.Yang 
> Sent: Monday, November 29, 2021 14:27
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Clements, John ;
> Zhou1, Tao 
> Cc: Yang, Stanley 
> Subject: [PATCH Review 1/1] drm/amdgpu: adjust ip block add sequence on
> aldebaran
> 
> Reason:
> {
> [  578.019986] amdgpu :23:00.0: amdgpu: GPU reset begin!
> [  583.245566] amdgpu :23:00.0: amdgpu: Failed to disable smu
> features.
> [  583.245621] amdgpu :23:00.0: amdgpu: Fail to disable dpm features!
> [  583.245639] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]]
> *ERROR* suspend of IP block  failed -62
> [  583.248504] [drm] free PSP TMR buffer
> }
> Adjust the ip block suspend sequence on aldebaran to fix the failure to
> disable smu features.
> 
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> index 4e3669407518..dc1d88a31f91 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> @@ -1309,7 +1309,9 @@ int amdgpu_discovery_set_ip_blocks(struct
> amdgpu_device *adev)
>   }
>   }
> 
> - if (likely(adev->firmware.load_type == AMDGPU_FW_LOAD_PSP)) {
> + /* move add smu block after add smda block for aldebaran */
> + if (likely(adev->firmware.load_type == AMDGPU_FW_LOAD_PSP)
> &&
> + (adev->ip_versions[MP1_HWIP][0] !=
> IP_VERSION(13, 0 ,2))) {
>   r = amdgpu_discovery_set_smu_ip_blocks(adev);
>   if (r)
>   return r;
> @@ -1327,8 +1329,10 @@ int amdgpu_discovery_set_ip_blocks(struct
> amdgpu_device *adev)
>   if (r)
>   return r;
> 
> - if (adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT &&
> - !amdgpu_sriov_vf(adev)) {
> + if ((adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT &&
> + !amdgpu_sriov_vf(adev)) ||
> + ((adev->ip_versions[MP1_HWIP][0] == IP_VERSION(13, 0 ,2))
> &&
> +  likely(adev->firmware.load_type ==
> AMDGPU_FW_LOAD_PSP))) {
>   r = amdgpu_discovery_set_smu_ip_blocks(adev);
>   if (r)
>   return r;
> --
> 2.17.1
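
The reordering in the quoted patch can be modeled in isolation. This is a 
simplified sketch, not the driver code: struct block_list, add_block, 
index_of and set_ip_blocks are hypothetical names, the block list is reduced 
to three entries, and only the PSP firmware load path is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_BLOCKS 8

struct block_list {
	const char *names[MAX_BLOCKS];
	int count;
};

static void add_block(struct block_list *l, const char *name)
{
	assert(l->count < MAX_BLOCKS);
	l->names[l->count++] = name;
}

/* Returns the position of a block in the list, or -1 if it is absent. */
static int index_of(const struct block_list *l, const char *name)
{
	for (int i = 0; i < l->count; i++)
		if (strcmp(l->names[i], name) == 0)
			return i;
	return -1;
}

/*
 * Simplified model of the patched ordering, assuming PSP firmware loading.
 * is_aldebaran stands in for the MP1 IP_VERSION(13, 0, 2) check.
 */
static void set_ip_blocks(struct block_list *l, bool is_aldebaran)
{
	if (!is_aldebaran)
		add_block(l, "smu");	/* default: SMU is added early */
	add_block(l, "gfx");
	add_block(l, "sdma");
	if (is_aldebaran)
		add_block(l, "smu");	/* aldebaran: SMU is added after SDMA */
}
```

With is_aldebaran set, the SMU entry lands after SDMA; otherwise it keeps its 
original position ahead of it.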


Re: [PATCH Review 1/1] drm/amdgpu: fix disable ras feature failed when unload drvier

2021-11-26 Thread Yang, Stanley
[AMD Official Use Only]

Yeah, you are right. I overlooked the ras initialization failure case; will 
update soon, thanks.

Regards,
Stanley
> -Original Message-
> From: Zhang, Hawking 
> Sent: Friday, November 26, 2021 9:11 PM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Clements, John ;
> Zhou1, Tao ; Li, Candice ;
> Chai, Thomas 
> Subject: RE: [PATCH Review 1/1] drm/amdgpu: fix disable ras feature failed
> when unload drvier
> 
> [AMD Official Use Only]
> 
> I suspect it is still needed, especially when amdgpu_ras_fini is used to deal
> with ras initialization failure in psp_ras_initialize.
> 
> Regards,
> Hawking
> 
> -Original Message-
> From: Yang, Stanley 
> Sent: Friday, November 26, 2021 21:08
> To: Zhang, Hawking ; amd-
> g...@lists.freedesktop.org; Clements, John ;
> Zhou1, Tao ; Li, Candice ;
> Chai, Thomas 
> Subject: Re: [PATCH Review 1/1] drm/amdgpu: fix disable ras feature
> failed when unload drvier
> 
> [AMD Official Use Only]
> 
> It's not necessary, because before hw fini, all ras features have been
> disabled and con->features is set to zero.
> 
> Regards,
> Stanley
> > -Original Message-
> > From: Zhang, Hawking 
> > Sent: Friday, November 26, 2021 8:57 PM
> > To: Yang, Stanley ; amd-
> > g...@lists.freedesktop.org; Clements, John ;
> > Zhou1, Tao ; Li, Candice ;
> > Chai, Thomas 
> > Cc: Yang, Stanley 
> > Subject: RE: [PATCH Review 1/1] drm/amdgpu: fix disable ras feature failed
> > when unload drvier
> >
> > [AMD Official Use Only]
> >
> > Good catch. We still need to release ras object in the end. Any reason
> > the sequence was removed?
> >
> > @@ -2564,9 +2563,6 @@ int amdgpu_ras_fini(struct amdgpu_device *adev)
> >
> > WARN(con->features, "Feature mask is not cleared");
> >
> > -   if (con->features)
> > -   amdgpu_ras_disable_all_features(adev, 1);
> > -
> > cancel_delayed_work_sync(&con->ras_counte_delay_work);
> >
> > Regards,
> > Hawking
> >
> > -Original Message-
> > From: Stanley.Yang 
> > Sent: Friday, November 26, 2021 17:48
> > To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> > ; Clements, John
> ;
> > Zhou1, Tao ; Li, Candice ;
> > Chai, Thomas 
> > Cc: Yang, Stanley 
> > Subject: [PATCH Review 1/1] drm/amdgpu: fix disable ras feature failed
> > when unload drvier
> >
> > Function amdgpu_device_fini_hw is called before amdgpu_device_fini_sw,
> > so the ras ta is unloaded before the ras disable command is sent. The ras
> > disable operation must happen before hw fini.
> >
> > Signed-off-by: Stanley.Yang 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++--
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c| 4 
> >  2 files changed, 3 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 73ec46140d68..d5e642e90010 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2838,8 +2838,6 @@ static int amdgpu_device_ip_fini(struct
> > amdgpu_device *adev)
> > if (amdgpu_sriov_vf(adev) && adev->virt.ras_init_done)
> > amdgpu_virt_release_ras_err_handler_data(adev);
> >
> > -   amdgpu_ras_pre_fini(adev);
> > -
> > if (adev->gmc.xgmi.num_physical_nodes > 1)
> > amdgpu_xgmi_remove_device(adev);
> >
> > @@ -3959,6 +3957,9 @@ void amdgpu_device_fini_hw(struct
> amdgpu_device
> > *adev)
> >
> > amdgpu_fbdev_fini(adev);
> >
> > +   /* disable ras feature must before hw fini */
> > +   amdgpu_ras_pre_fini(adev);
> > +
> > amdgpu_device_ip_fini_early(adev);
> >
> > amdgpu_irq_fini_hw(adev);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index 39dfd4d59881..65102d2a0a98 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -2484,7 +2484,6 @@ void amdgpu_ras_late_fini(struct amdgpu_device
> > *adev,
> > amdgpu_ras_sysfs_remove(adev, ras_block);
> > if (ih_info->cb)
> > amdgpu_ras_interrupt_remove_handler(adev, ih_info);
> > -   amdgpu_ras_feature_enable(adev, ras_block, 0);
> >  }
> >
> >  /* do some init work after IP late init as dependence.
> > @@ -2564,9 +2563,6 @@ int amdgpu_ras_fini(struct amdgpu_device *adev)
> >
> > WARN(con->features, "Feature mask is not cleared");
> >
> > -   if (con->features)
> > -   amdgpu_ras_disable_all_features(adev, 1);
> > -
> > cancel_delayed_work_sync(&con->ras_counte_delay_work);
> >
> > amdgpu_ras_set_context(adev, NULL);
> > --
> > 2.17.1


Re: [PATCH Review 1/1] drm/amdgpu: fix disable ras feature failed when unload drvier

2021-11-26 Thread Yang, Stanley
[AMD Official Use Only]

It's not necessary, because before hw fini, all ras features have been disabled 
and con->features is set to zero.

Regards,
Stanley
> -Original Message-
> From: Zhang, Hawking 
> Sent: Friday, November 26, 2021 8:57 PM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Clements, John ;
> Zhou1, Tao ; Li, Candice ;
> Chai, Thomas 
> Cc: Yang, Stanley 
> Subject: RE: [PATCH Review 1/1] drm/amdgpu: fix disable ras feature failed
> when unload drvier
> 
> [AMD Official Use Only]
> 
> Good catch. We still need to release ras object in the end. Any reason the
> sequence was removed?
> 
> @@ -2564,9 +2563,6 @@ int amdgpu_ras_fini(struct amdgpu_device *adev)
> 
>   WARN(con->features, "Feature mask is not cleared");
> 
> - if (con->features)
> - amdgpu_ras_disable_all_features(adev, 1);
> -
>   cancel_delayed_work_sync(&con->ras_counte_delay_work);
> 
> Regards,
> Hawking
> 
> -Original Message-
> From: Stanley.Yang 
> Sent: Friday, November 26, 2021 17:48
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Clements, John ;
> Zhou1, Tao ; Li, Candice ;
> Chai, Thomas 
> Cc: Yang, Stanley 
> Subject: [PATCH Review 1/1] drm/amdgpu: fix disable ras feature failed
> when unload drvier
> 
> Function amdgpu_device_fini_hw is called before amdgpu_device_fini_sw,
> so the ras ta is unloaded before the ras disable command is sent. The ras
> disable operation must happen before hw fini.
> 
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c| 4 
>  2 files changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 73ec46140d68..d5e642e90010 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2838,8 +2838,6 @@ static int amdgpu_device_ip_fini(struct
> amdgpu_device *adev)
>   if (amdgpu_sriov_vf(adev) && adev->virt.ras_init_done)
>   amdgpu_virt_release_ras_err_handler_data(adev);
> 
> - amdgpu_ras_pre_fini(adev);
> -
>   if (adev->gmc.xgmi.num_physical_nodes > 1)
>   amdgpu_xgmi_remove_device(adev);
> 
> @@ -3959,6 +3957,9 @@ void amdgpu_device_fini_hw(struct
> amdgpu_device *adev)
> 
>   amdgpu_fbdev_fini(adev);
> 
> + /* disable ras feature must before hw fini */
> + amdgpu_ras_pre_fini(adev);
> +
>   amdgpu_device_ip_fini_early(adev);
> 
>   amdgpu_irq_fini_hw(adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 39dfd4d59881..65102d2a0a98 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2484,7 +2484,6 @@ void amdgpu_ras_late_fini(struct amdgpu_device
> *adev,
>   amdgpu_ras_sysfs_remove(adev, ras_block);
>   if (ih_info->cb)
>   amdgpu_ras_interrupt_remove_handler(adev, ih_info);
> - amdgpu_ras_feature_enable(adev, ras_block, 0);
>  }
> 
>  /* do some init work after IP late init as dependence.
> @@ -2564,9 +2563,6 @@ int amdgpu_ras_fini(struct amdgpu_device *adev)
> 
>   WARN(con->features, "Feature mask is not cleared");
> 
> - if (con->features)
> - amdgpu_ras_disable_all_features(adev, 1);
> -
>   cancel_delayed_work_sync(&con->ras_counte_delay_work);
> 
>   amdgpu_ras_set_context(adev, NULL);
> --
> 2.17.1


Re: [PATCH Review 3/4] drm/amdgpu: add message smu to get ecc_table v2

2021-11-18 Thread Yang, Stanley
[AMD Official Use Only]



> -Original Message-
> From: Lazar, Lijo 
> Sent: Thursday, November 18, 2021 7:33 PM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Zhang, Hawking ;
> Clements, John ; Quan, Evan
> ; Wang, Yang(Kevin) 
> Subject: Re: [PATCH Review 3/4] drm/amdgpu: add message smu to get
> ecc_table v2
> 
> 
> 
> On 11/18/2021 3:03 PM, Stanley.Yang wrote:
> > Support the ECC TABLE message; this table includes umc ras error counts
> > and error addresses.
> >
> > v2:
> >  add smu version check to query whether support ecctable
> >  call smu_cmn_update_table to get ecctable directly
> >
> > Signed-off-by: Stanley.Yang 
> > ---
> >   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  8 +++
> >   drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 14 
> >   .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 70
> +++
> >   .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|  2 +
> >   4 files changed, 94 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > index 3557f4e7fc30..7a06021a58f0 100644
> > --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > @@ -324,6 +324,7 @@ enum smu_table_id
> > SMU_TABLE_OVERDRIVE,
> > SMU_TABLE_I2C_COMMANDS,
> > SMU_TABLE_PACE,
> > +   SMU_TABLE_ECCINFO,
> > SMU_TABLE_COUNT,
> >   };
> >
> > @@ -340,6 +341,7 @@ struct smu_table_context
> > void*max_sustainable_clocks;
> > struct smu_bios_boot_up_values  boot_values;
> > void*driver_pptable;
> > +   void*ecc_table;
> > struct smu_tabletables[SMU_TABLE_COUNT];
> > /*
> >  * The driver table is just a staging buffer for @@ -1261,6
> > +1263,11 @@ struct pptable_funcs {
> >  *
>   of SMUBUS table.
> >  */
> > int (*send_hbm_bad_pages_num)(struct smu_context *smu,
> uint32_t
> > size);
> > +
> > +   /**
> > +* @get_ecc_table:  message SMU to get ECC INFO table.
> > +*/
> > +   ssize_t (*get_ecc_info)(struct smu_context *smu, void *table);
> >   };
> >
> >   typedef enum {
> > @@ -1397,6 +1404,7 @@ int smu_set_light_sbr(struct smu_context *smu,
> > bool enable);
> >
> >   int smu_wait_for_event(struct amdgpu_device *adev, enum
> smu_event_type event,
> >uint64_t event_arg);
> > +int smu_get_ecc_info(struct smu_context *smu, void *umc_ecc);
> >
> >   #endif
> >   #endif
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > index 01168b8955bf..fd3b6b460b12 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > @@ -3072,6 +3072,20 @@ int smu_set_light_sbr(struct smu_context *smu,
> bool enable)
> > return ret;
> >   }
> >
> > +int smu_get_ecc_info(struct smu_context *smu, void *umc_ecc) {
> > +   int ret = -EOPNOTSUPP;
> > +
> > +   mutex_lock(&smu->mutex);
> > +   if (smu->ppt_funcs &&
> > +   smu->ppt_funcs->get_ecc_info)
> > +   ret = smu->ppt_funcs->get_ecc_info(smu, umc_ecc);
> > +   mutex_unlock(&smu->mutex);
> > +
> > +   return ret;
> > +
> > +}
> > +
> >   static int smu_get_prv_buffer_details(void *handle, void **addr, size_t
> *size)
> >   {
> > struct smu_context *smu = handle;
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> > b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> > index f835d86cc2f5..4c21609ccea5 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> > @@ -78,6 +78,12 @@
> >
> >   #define smnPCIE_ESM_CTRL  0x111003D0
> >
> > +/*
> > + * SMU support ECCTABLE since version 68.42.0,
> > + * use this to check ECCTABLE feature whether support
> > + */
> > +#define SUPPORT_ECCTABLE_SMU_VERSION 0x00442a00
> > +
> >   static const struct smu_temperature_range smu13_thermal_policy[] =
> >   {
> > {-273150,  99000, 99000, -273150, 99000, 99000, -273150, 99000,
> > 99000}, @@ -190,6 +196,7 @@ static const struct cmn2asic_mapping
> aldebaran_table_map[SMU_TABLE_COUNT] = {
> > TAB_MAP(SMU_METRICS),
> > TAB_MAP(DRIVER_SMU_CONFIG),
> > TAB_MAP

Re: [PATCH Review 1/4] drm/amdgpu: Update smu driver interface for aldebaran

2021-11-18 Thread Yang, Stanley
[AMD Official Use Only]

Thanks Evan,

Will update the titles of patches 1 and 3 before submitting.

Regards,
Stanley
> -Original Message-
> From: Quan, Evan 
> Sent: Thursday, November 18, 2021 5:58 PM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Zhang, Hawking ;
> Clements, John ; Lazar, Lijo
> ; Wang, Yang(Kevin) 
> Cc: Yang, Stanley 
> Subject: RE: [PATCH Review 1/4] drm/amdgpu: Update smu driver interface for
> aldebaran
> 
> [AMD Official Use Only]
> 
> Better to update the patch title to "drm/amd/pm: Update smu driver
> interface for aldebaran", like all other power-related patches.
> Please update patch 3 as well.
> Other than the above, patches 1 and 3 are Reviewed-by: Evan Quan
> 
> > -Original Message-
> > From: Stanley.Yang 
> > Sent: Thursday, November 18, 2021 5:34 PM
> > To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> > ; Clements, John
> ; Quan,
> > Evan ; Lazar, Lijo ; Wang,
> > Yang(Kevin) 
> > Cc: Yang, Stanley 
> > Subject: [PATCH Review 1/4] drm/amdgpu: Update smu driver interface
> > for aldebaran
> >
> > Update the smu driver if version to 0x08 to avoid the mismatch log.
> > A version mismatch can still happen with an older FW.
> >
> > Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935
> > Signed-off-by: Stanley.Yang 
> > ---
> >  .../drm/amd/pm/inc/smu13_driver_if_aldebaran.h | 18
> > +-
> >  drivers/gpu/drm/amd/pm/inc/smu_v13_0.h |  2 +-
> >  2 files changed, 18 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/pm/inc/smu13_driver_if_aldebaran.h
> > b/drivers/gpu/drm/amd/pm/inc/smu13_driver_if_aldebaran.h
> > index a017983ff1fa..0f67c56c2863 100644
> > --- a/drivers/gpu/drm/amd/pm/inc/smu13_driver_if_aldebaran.h
> > +++ b/drivers/gpu/drm/amd/pm/inc/smu13_driver_if_aldebaran.h
> > @@ -140,6 +140,8 @@
> >
> >  #define MAX_SW_I2C_COMMANDS24
> >
> > +#define ALDEBARAN_UMC_CHANNEL_NUM32
> > +
> >  typedef enum {
> >I2C_CONTROLLER_PORT_0, //CKSVII2C0
> >I2C_CONTROLLER_PORT_1, //CKSVII2C1
> > @@ -507,6 +509,19 @@ typedef struct {
> >uint32_t MmHubPadding[8]; // SMU internal use
> >  } AvfsDebugTable_t;
> >
> > +typedef struct {
> > +   uint64_t mca_umc_status;
> > +   uint64_t mca_umc_addr;
> > +   uint16_t ce_count_lo_chip;
> > +   uint16_t ce_count_hi_chip;
> > +
> > +   uint32_t eccPadding;
> > +} EccInfo_t;
> > +
> > +typedef struct {
> > +   EccInfo_t  EccInfo[ALDEBARAN_UMC_CHANNEL_NUM];
> > +} EccInfoTable_t;
> > +
> >  // These defines are used with the following messages:
> >  // SMC_MSG_TransferTableDram2Smu
> >  // SMC_MSG_TransferTableSmu2Dram
> > @@ -517,6 +532,7 @@ typedef struct {
> >  #define TABLE_SMU_METRICS 4
> >  #define TABLE_DRIVER_SMU_CONFIG   5
> >  #define TABLE_I2C_COMMANDS6
> > -#define TABLE_COUNT   7
> > +#define TABLE_ECCINFO 7
> > +#define TABLE_COUNT   8
> >
> >  #endif
> > diff --git a/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
> > b/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
> > index bbc608c990b0..44af23ae059e 100644
> > --- a/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
> > +++ b/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
> > @@ -27,7 +27,7 @@
> >
> >  #define SMU13_DRIVER_IF_VERSION_INV 0x
> >  #define SMU13_DRIVER_IF_VERSION_YELLOW_CARP 0x04
> > -#define SMU13_DRIVER_IF_VERSION_ALDE 0x07
> > +#define SMU13_DRIVER_IF_VERSION_ALDE 0x08
> >
> >  #define SMU13_MODE1_RESET_WAIT_TIME_IN_MS 500  //500ms
> >
> > --
> > 2.17.1
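
The EccInfo_t layout quoted in the patch above can be checked on its own. 
The sketch below copies the two typedefs from the patch; the size assertions 
and the helper ecc_info_table_size are additions for illustration and assume 
the usual fixed-width type sizes with no extra compiler padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALDEBARAN_UMC_CHANNEL_NUM 32

typedef struct {
	uint64_t mca_umc_status;
	uint64_t mca_umc_addr;
	uint16_t ce_count_lo_chip;
	uint16_t ce_count_hi_chip;

	uint32_t eccPadding;
} EccInfo_t;

typedef struct {
	EccInfo_t EccInfo[ALDEBARAN_UMC_CHANNEL_NUM];
} EccInfoTable_t;

/* 8 + 8 + 2 + 2 + 4 bytes: the padding word rounds each entry up to 24. */
_Static_assert(sizeof(EccInfo_t) == 24, "unexpected EccInfo_t size");

/* Hypothetical helper: size in bytes of one full table transfer. */
static size_t ecc_info_table_size(void)
{
	return sizeof(EccInfoTable_t);
}
```

With 32 channels the whole table is 768 bytes, which fits comfortably in the 
single page the later patch allocates for it.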


Re: Re: [PATCH Review 3/4] drm/amdgpu: add message smu to get ecc_table

2021-11-17 Thread Yang, Stanley
[AMD Official Use Only]



> -Original Message-
> From: Lazar, Lijo 
> Sent: Thursday, November 18, 2021 12:04 PM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Zhang, Hawking ;
> Clements, John ; Quan, Evan
> ; Wang, Yang(Kevin) 
> Subject: Re: Re: [PATCH Review 3/4] drm/amdgpu: add message smu to get
> ecc_table
> 
> 
> 
> On 11/18/2021 9:07 AM, Yang, Stanley wrote:
> > [AMD Official Use Only]
> >
> >
> >
> >> -Original Message-
> >> From: Lazar, Lijo 
> >> Sent: Wednesday, November 17, 2021 7:24 PM
> >> To: Yang, Stanley ; amd-
> >> g...@lists.freedesktop.org; Zhang, Hawking ;
> >> Clements, John ; Quan, Evan
> >> ; Wang, Yang(Kevin)
> 
> >> Subject: Re: [PATCH Review 3/4] drm/amdgpu: add message smu to get
> >> ecc_table
> >>
> >>
> >>
> >> On 11/17/2021 3:41 PM, Stanley.Yang wrote:
> >>> Support the ECC TABLE message; this table includes umc ras error counts
> >>> and error addresses.
> >>>
> >>> Signed-off-by: Stanley.Yang 
> >>> ---
> >>>drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  7 
> >>>.../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 38
> >> +++
> >>>.../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|  2 +
> >>>drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c| 24 
> >>>drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h|  3 ++
> >>>5 files changed, 74 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>> index 3557f4e7fc30..ea65de0160c3 100644
> >>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>> @@ -324,6 +324,7 @@ enum smu_table_id
> >>>   SMU_TABLE_OVERDRIVE,
> >>>   SMU_TABLE_I2C_COMMANDS,
> >>>   SMU_TABLE_PACE,
> >>> + SMU_TABLE_ECCINFO,
> >>>   SMU_TABLE_COUNT,
> >>>};
> >>>
> >>> @@ -340,6 +341,7 @@ struct smu_table_context
> >>>   void*max_sustainable_clocks;
> >>>   struct smu_bios_boot_up_values  boot_values;
> >>>   void*driver_pptable;
> >>> + void*ecc_table;
> >>>   struct smu_tabletables[SMU_TABLE_COUNT];
> >>>   /*
> >>>* The driver table is just a staging buffer for @@ -1261,6
> >>> +1263,11 @@ struct pptable_funcs {
> >>>*
> >>of SMUBUS table.
> >>>*/
> >>>   int (*send_hbm_bad_pages_num)(struct smu_context *smu,
> >> uint32_t
> >>> size);
> >>> +
> >>> + /**
> >>> +  * @get_ecc_table:  message SMU to get ECC INFO table.
> >>> +  */
> >>> + ssize_t (*get_ecc_info)(struct smu_context *smu, void *table);
> >>>};
> >>>
> >>>typedef enum {
> >>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>> index f835d86cc2f5..5e4ba0e14a91 100644
> >>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>> @@ -190,6 +190,7 @@ static const struct cmn2asic_mapping
> >> aldebaran_table_map[SMU_TABLE_COUNT] = {
> >>>   TAB_MAP(SMU_METRICS),
> >>>   TAB_MAP(DRIVER_SMU_CONFIG),
> >>>   TAB_MAP(I2C_COMMANDS),
> >>> + TAB_MAP(ECCINFO),
> >>>};
> >>>
> >>>static const uint8_t aldebaran_throttler_map[] = { @@ -223,6
> >>> +224,9 @@ static int aldebaran_tables_init(struct smu_context *smu)
> >>>   SMU_TABLE_INIT(tables, SMU_TABLE_I2C_COMMANDS,
> >> sizeof(SwI2cRequest_t),
> >>>  PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> >>>
> >>> + SMU_TABLE_INIT(tables, SMU_TABLE_ECCINFO,
> >> sizeof(EccInfoTable_t),
> >>> +PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> >>> +
> >>>   smu_table->metrics_table = kzalloc(sizeof(SmuMetrics_t),
> >> GFP_KERNEL);
> >>>   if (!smu_table->metrics_table)
> >>>   return 

Re: [PATCH Review 4/4] query umc error info from ecc_table

2021-11-17 Thread Yang, Stanley
[AMD Official Use Only]



> -Original Message-
> From: Lazar, Lijo 
> Sent: Wednesday, November 17, 2021 7:15 PM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Zhang, Hawking ;
> Clements, John ; Quan, Evan
> ; Wang, Yang(Kevin) 
> Subject: Re: [PATCH Review 4/4] query umc error info from ecc_table
> 
> 
> 
> On 11/17/2021 3:41 PM, Stanley.Yang wrote:
> > If the smu supports ECCTABLE, the driver can message the smu to get
> > ecc_table and then query umc error info from ECCTABLE. Apply a pmfw
> > version check to ensure backward compatibility.
> >
> > Signed-off-by: Stanley.Yang 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   | 42 ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h   |  7 ++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c   | 71 +--
> 
> >   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  1 +
> >   drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 12 
> >   .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|  4 ++
> >   6 files changed, 107 insertions(+), 30 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index 90f0db3b4f65..6b0f2ba1e420 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -888,6 +888,38 @@ void amdgpu_ras_mca_query_error_status(struct
> amdgpu_device *adev,
> > }
> >   }
> >
> > +static void amdgpu_ras_get_ecc_info(struct amdgpu_device *adev,
> > +struct ras_err_data *err_data) {
> > +   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> > +
> > +   /*
> > +* choosing right query method according to
> > +* whether smu support query error information
> > +*/
> > +   if ((ras->smu_version >= SUPPORT_ECCTABLE_SMU_VERSION) &&
> > +   !smu_get_ecc_info(>smu, (void *)&(ras-
> >umc_ecc))) {
> > +
> 
> This version check should be in aldebaran_ppt implementation. In general
> the callback will check the FW version that supports ECC table for the
> corresponding ASIC. It may return ENOTSUPP or similar if the FW version
> doesn't support ECC table and that may be checked here. Keeping
> smu_version in ras context is not needed.
[Yang, Stanley] I think checking only for the aldebaran_ppt callback is not 
enough here. Consider an amdgpu driver that provides the get_ecc_info callback 
but runs with an older PMFW that lacks the ecctable feature. For Aldebaran, 
PMFW supports ecctable since version 68.42.0.
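
To make the version argument concrete, here is a minimal standalone sketch of 
such a check, independent of which layer ends up owning it. The threshold 
value is the one defined in the patch; the packing scheme in smu_version_pack 
is an assumption inferred from 68.42.0 mapping to 0x00442a00, and 
get_ecc_info_checked is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* 68.42.0, as defined in the quoted patch */
#define SUPPORT_ECCTABLE_SMU_VERSION 0x00442a00

/* Assumed packing: major in bits 23:16, minor in bits 15:8, patch in bits 7:0. */
static uint32_t smu_version_pack(uint8_t major, uint8_t minor, uint8_t patch)
{
	return ((uint32_t)major << 16) | ((uint32_t)minor << 8) | patch;
}

/*
 * Hypothetical callback body: refuse the query when the firmware predates
 * ECC table support, so the caller can fall back to the legacy path.
 */
static int get_ecc_info_checked(uint32_t fw_version, void *table)
{
	assert(table != NULL);
	if (fw_version < SUPPORT_ECCTABLE_SMU_VERSION)
		return -EOPNOTSUPP;
	/* A real implementation would fetch the table from the SMU here. */
	return 0;
}
```

A caller would treat -EOPNOTSUPP as the signal to use the register-based 
query path instead of the ECC table.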

> 
> > +   if (adev->umc.ras_funcs &&
> > +   adev->umc.ras_funcs-
> >message_smu_query_ras_error_count)
> > +   adev->umc.ras_funcs-
> >message_smu_query_ras_error_count(adev,
> > +err_data);
> > +
> > +   if (adev->umc.ras_funcs &&
> > +   adev->umc.ras_funcs-
> >message_smu_query_ras_error_address)
> > +   adev->umc.ras_funcs-
> >message_smu_query_ras_error_address(adev, err_data);
> > +   } else {
> > +   if (adev->umc.ras_funcs &&
> > +   adev->umc.ras_funcs->query_ras_error_count)
> > +   adev->umc.ras_funcs->query_ras_error_count(adev,
> err_data);
> > +
> > +   /* umc query_ras_error_address is also responsible for
> clearing
> > +* error status
> > +*/
> > +   if (adev->umc.ras_funcs &&
> > +   adev->umc.ras_funcs->query_ras_error_address)
> > +   adev->umc.ras_funcs-
> >query_ras_error_address(adev, err_data);
> > +   }
> > +}
> > +
> >   /* query/inject/cure begin */
> >   int amdgpu_ras_query_error_status(struct amdgpu_device *adev,
> >   struct ras_query_if *info)
> > @@ -901,15 +933,7 @@ int amdgpu_ras_query_error_status(struct
> > amdgpu_device *adev,
> >
> > switch (info->head.block) {
> > case AMDGPU_RAS_BLOCK__UMC:
> > -   if (adev->umc.ras_funcs &&
> > -   adev->umc.ras_funcs->query_ras_error_count)
> > -   adev->umc.ras_funcs->query_ras_error_count(adev,
> _data);
> > -   /* umc query_ras_error_address is also responsible for
> clearing
> > -* error status
> > -*/
> > -   if (adev->umc.ras_funcs &&
> > -   adev->umc.ras_funcs->query_ras_error_address)
> > -   ade

Re: [PATCH Review 3/4] drm/amdgpu: add message smu to get ecc_table

2021-11-17 Thread Yang, Stanley
[AMD Official Use Only]



> -Original Message-
> From: Lazar, Lijo 
> Sent: Wednesday, November 17, 2021 7:24 PM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Zhang, Hawking ;
> Clements, John ; Quan, Evan
> ; Wang, Yang(Kevin) 
> Subject: Re: [PATCH Review 3/4] drm/amdgpu: add message smu to get
> ecc_table
> 
> 
> 
> On 11/17/2021 3:41 PM, Stanley.Yang wrote:
> > Support the ECC TABLE message; this table includes umc ras error counts
> > and error addresses.
> >
> > Signed-off-by: Stanley.Yang 
> > ---
> >   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  7 
> >   .../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 38
> +++
> >   .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c|  2 +
> >   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c| 24 
> >   drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h|  3 ++
> >   5 files changed, 74 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > index 3557f4e7fc30..ea65de0160c3 100644
> > --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > @@ -324,6 +324,7 @@ enum smu_table_id
> > SMU_TABLE_OVERDRIVE,
> > SMU_TABLE_I2C_COMMANDS,
> > SMU_TABLE_PACE,
> > +   SMU_TABLE_ECCINFO,
> > SMU_TABLE_COUNT,
> >   };
> >
> > @@ -340,6 +341,7 @@ struct smu_table_context
> > void*max_sustainable_clocks;
> > struct smu_bios_boot_up_values  boot_values;
> > void*driver_pptable;
> > +   void*ecc_table;
> > struct smu_tabletables[SMU_TABLE_COUNT];
> > /*
> >  * The driver table is just a staging buffer for @@ -1261,6
> > +1263,11 @@ struct pptable_funcs {
> >  *
>   of SMUBUS table.
> >  */
> > int (*send_hbm_bad_pages_num)(struct smu_context *smu,
> uint32_t
> > size);
> > +
> > +   /**
> > +* @get_ecc_table:  message SMU to get ECC INFO table.
> > +*/
> > +   ssize_t (*get_ecc_info)(struct smu_context *smu, void *table);
> >   };
> >
> >   typedef enum {
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> > b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> > index f835d86cc2f5..5e4ba0e14a91 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> > @@ -190,6 +190,7 @@ static const struct cmn2asic_mapping
> aldebaran_table_map[SMU_TABLE_COUNT] = {
> > TAB_MAP(SMU_METRICS),
> > TAB_MAP(DRIVER_SMU_CONFIG),
> > TAB_MAP(I2C_COMMANDS),
> > +   TAB_MAP(ECCINFO),
> >   };
> >
> >   static const uint8_t aldebaran_throttler_map[] = { @@ -223,6 +224,9
> > @@ static int aldebaran_tables_init(struct smu_context *smu)
> > SMU_TABLE_INIT(tables, SMU_TABLE_I2C_COMMANDS,
> sizeof(SwI2cRequest_t),
> >PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> >
> > +   SMU_TABLE_INIT(tables, SMU_TABLE_ECCINFO,
> sizeof(EccInfoTable_t),
> > +  PAGE_SIZE, AMDGPU_GEM_DOMAIN_VRAM);
> > +
> > smu_table->metrics_table = kzalloc(sizeof(SmuMetrics_t),
> GFP_KERNEL);
> > if (!smu_table->metrics_table)
> > return -ENOMEM;
> > @@ -235,6 +239,10 @@ static int aldebaran_tables_init(struct smu_context
> *smu)
> > return -ENOMEM;
> > }
> >
> > +   smu_table->ecc_table = kzalloc(tables[SMU_TABLE_ECCINFO].size,
> GFP_KERNEL);
> > +   if (!smu_table->ecc_table)
> > +   return -ENOMEM;
> > +
> > return 0;
> >   }
> >
> > @@ -1765,6 +1773,35 @@ static ssize_t aldebaran_get_gpu_metrics(struct
> smu_context *smu,
> > return sizeof(struct gpu_metrics_v1_3);
> >   }
> >
> > +static ssize_t aldebaran_get_ecc_info(struct smu_context *smu,
> > +void *table)
> > +{
> > +   struct smu_table_context *smu_table = &smu->smu_table;
> > +   EccInfoTable_t ecc_table;
> > +   struct ecc_info_per_ch *ecc_info_per_channel = NULL;
> > +   int i, ret = 0;
> > +   struct umc_ecc_info *eccinfo = (struct umc_ecc_info *)table;
> > +
> > +   ret = smu_cmn_get_ecc_info_table(smu,
> > +   &ecc_table);
> > +   if (ret)
> > +   return ret;
> > +
> > +   for (i = 0; i < ALDEBARAN_UMC_CHANNEL_NUM; i++) {
> > +   ecc_info_p

RE: [PATCH Review 2/4] drm/amdgpu: add new query interface for umc block

2021-11-17 Thread Yang, Stanley
[AMD Official Use Only]



> -----Original Message-----
> From: Lazar, Lijo 
> Sent: Wednesday, November 17, 2021 7:36 PM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Zhang, Hawking ;
> Clements, John ; Quan, Evan
> ; Wang, Yang(Kevin) 
> Subject: Re: [PATCH Review 2/4] drm/amdgpu: add new query interface for
> umc block
> 
> 
> 
> On 11/17/2021 3:41 PM, Stanley.Yang wrote:
> > add message smu to query error information
> >
> > Signed-off-by: Stanley.Yang 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  16 +++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h |   4 +
> >   drivers/gpu/drm/amd/amdgpu/umc_v6_7.c   | 161
> 
> >   3 files changed, 181 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > index cdd0010a5389..bcbf3264d92f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > @@ -320,6 +320,19 @@ struct ras_common_if {
> > char name[32];
> >   };
> >
> > +#define MAX_UMC_CHANNEL_NUM 32
> > +
> > +struct ecc_info_per_ch {
> > +   uint16_t ce_count_lo_chip;
> > +   uint16_t ce_count_hi_chip;
> > +   uint64_t mca_umc_status;
> > +   uint64_t mca_umc_addr;
> > +};
> > +
> > +struct umc_ecc_info {
> > +   struct ecc_info_per_ch ecc[MAX_UMC_CHANNEL_NUM];
> > +};
> > +
> >   struct amdgpu_ras {
> > /* ras infrastructure */
> > /* for ras itself. */
> > @@ -359,6 +372,9 @@ struct amdgpu_ras {
> > struct delayed_work ras_counte_delay_work;
> > atomic_t ras_ue_count;
> > atomic_t ras_ce_count;
> > +
> > +   /* record umc error info queried from smu */
> > +   struct umc_ecc_info umc_ecc;
> >   };
> >
> >   struct ras_fs_data {
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> > index 1f5fe2315236..7aa9b21eb906 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> > @@ -49,6 +49,10 @@ struct amdgpu_umc_ras_funcs {
> > void (*query_ras_error_address)(struct amdgpu_device *adev,
> > void *ras_error_status);
> > bool (*query_ras_poison_mode)(struct amdgpu_device *adev);
> > +   void (*message_smu_query_ras_error_count)(struct
> amdgpu_device *adev,
> > + void *ras_error_status);
> > +   void (*message_smu_query_ras_error_address)(struct
> amdgpu_device *adev,
> > +   void *ras_error_status);
> 
> Maybe rename message_smu to ecc_info. These methods fetch the error
> from umc_ecc_info table. They don't deal with smu or care about how the
> information gets filled. As long as ecc_info_table is filled, they could
> get the info.
[Yang, Stanley] Yes, renaming message_smu to ecc_info seems better, since 
ecc_table has already been updated before this call.
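
As a rough user-space sketch of the point above (not the patch's code): a reader that only walks the already-filled table does not need to know that SMU produced it. The struct layout is copied from the quoted patch; the function name `ecc_info_sum_ce_counts` and the "sum both counters" rule are assumptions made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_UMC_CHANNEL_NUM 32

/* Layout copied from the quoted patch. */
struct ecc_info_per_ch {
	uint16_t ce_count_lo_chip;
	uint16_t ce_count_hi_chip;
	uint64_t mca_umc_status;
	uint64_t mca_umc_addr;
};

struct umc_ecc_info {
	struct ecc_info_per_ch ecc[MAX_UMC_CHANNEL_NUM];
};

/*
 * Hypothetical reader using the suggested ecc_info prefix.
 * It only walks the table; it neither knows nor cares who filled it.
 * Summing both per-chip counters is a simplification for illustration,
 * not the counting rule used by the patch.
 */
static unsigned long ecc_info_sum_ce_counts(const struct umc_ecc_info *info)
{
	unsigned long total = 0;
	size_t ch;

	for (ch = 0; ch < MAX_UMC_CHANNEL_NUM; ch++) {
		total += info->ecc[ch].ce_count_lo_chip;
		total += info->ecc[ch].ce_count_hi_chip;
	}
	return total;
}
```

Any caller that has a filled `struct umc_ecc_info` can use the reader, which is why a name tied to the table rather than to SMU reads more naturally.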

> 
> Thanks,
> Lijo
> 
> >   };
> >
> >   struct amdgpu_umc_funcs {
> > diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > index f7ec3fe134e5..cd96e8b734cb 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
> > @@ -50,6 +50,165 @@ static inline uint32_t
> get_umc_v6_7_reg_offset(struct amdgpu_device *adev,
> > return adev->umc.channel_offs * ch_inst + UMC_V6_7_INST_DIST *
> umc_inst;
> >   }
> >
> > +static inline uint32_t get_umc_v6_7_channel_index(struct
> amdgpu_device *adev,
> > + uint32_t umc_inst,
> > + uint32_t ch_inst)
> > +{
> > +   return adev->umc.channel_idx_tbl[umc_inst *
> > +adev->umc.channel_inst_num + ch_inst];
> > +}
> > +
> > +static void
> umc_v6_7_message_smu_query_correctable_error_count(struct
> amdgpu_device *adev,
> > +  uint32_t channel_index,
> > +  unsigned long *error_count)
> > +{
> > +   uint32_t ecc_err_cnt;
> > +   uint64_t mc_umc_status;
> > +   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> > +
> > +   /*
> > +* select the lower chip and check the error count
> > +* skip add error count, calc error counter only from mca_umc_status
> > +*/
> > +   ecc_err_cnt = ras->umc_ecc.ecc[channel_index].ce_count_lo_chip;
> > 

RE: [PATCH Review 1/1] drm/amdgpu: fix smu not match warning

2021-11-16 Thread Yang, Stanley
[AMD Official Use Only]

Thanks Lijo, I will update it.

Regards,
Stanley
> -----Original Message-----
> From: Lazar, Lijo 
> Sent: Tuesday, November 16, 2021 3:49 PM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org; Zhang, Hawking ;
> Clements, John ; Zhou1, Tao
> ; Quan, Evan 
> Subject: Re: [PATCH Review 1/1] drm/amdgpu: fix smu not match warning
> 
> 
> 
> On 11/16/2021 1:13 PM, Stanley.Yang wrote:
> > update smu driver if version to avoid mismatch log
> >
> > Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935
> > Signed-off-by: Stanley.Yang 
> > ---
> >   drivers/gpu/drm/amd/pm/inc/smu_v13_0.h | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
> > b/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
> > index e5d3b0d1a032..2e35885c7287 100644
> > --- a/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
> > +++ b/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
> > @@ -27,7 +27,7 @@
> >
> >   #define SMU13_DRIVER_IF_VERSION_INV 0x
> >   #define SMU13_DRIVER_IF_VERSION_YELLOW_CARP 0x04
> > -#define SMU13_DRIVER_IF_VERSION_ALDE 0x07
> > +#define SMU13_DRIVER_IF_VERSION_ALDE 0x08
> >
> 
> This is not an independent change, it should go along with a change in
> interface file. Please post the changes in smu13_driver_if_aldebaran.h along
> with this as one patch.
> 
> Thanks,
> Lijo
> 
> >   /* MP Apertures */
> >   #define MP0_Public0x0380
> >


RE: RE: [PATCH Review 1/1] drm/ttm: fix debugfs node create failed

2021-10-19 Thread Yang, Stanley
[AMD Official Use Only]



> -----Original Message-----
> From: Christian König 
> Sent: Tuesday, October 19, 2021 4:46 PM
> To: Yang, Stanley ; Das, Nirmoy
> ; amd-gfx@lists.freedesktop.org
> Subject: Re: RE: [PATCH Review 1/1] drm/ttm: fix debugfs node create failed
> 
> On 19.10.21 at 10:02, Yang, Stanley wrote:
> > [AMD Official Use Only]
> >
> >
> >> -----Original Message-----
> >> From: amd-gfx  On Behalf Of Das, Nirmoy
> >> Sent: Thursday, October 14, 2021 2:11 AM
> >> To: Christian König ; amd-
> >> g...@lists.freedesktop.org
> >> Subject: Re: [PATCH Review 1/1] drm/ttm: fix debugfs node create failed
> >>
> >>
> >> On 10/13/2021 2:29 PM, Christian König wrote:
> >>> On 12.10.21 at 15:12, Das, Nirmoy wrote:
> >>>> On 10/12/2021 1:58 PM, Stanley.Yang wrote:
> >>>>> Test scenario:
> >>>>>   modprobe amdgpu -> rmmod amdgpu -> modprobe amdgpu
> >>>>> Error log:
> >>>>>   [   54.396807] debugfs: File 'page_pool' in directory 'amdttm' already present!
> >>>>>   [   54.396833] debugfs: File 'page_pool_shrink' in directory 'amdttm' already present!
> >>>>>   [   54.396848] debugfs: File 'buffer_objects' in directory 'amdttm' already present!
> >>>>
> >>>> We should instead add a check if those debugfs files already
> >>>> exist/created in ttm debugfs dir using debugfs_lookup() before creating.
> >>> No, IIRC the Intel guys had fixed that already by adding/removing
> >>> the debugfs file on module load/unload.
> >>
> >> Adding/removing on ttm module load/unload is nicer.
> > The point is that page_pool, page_pool_shrink and buffer_objects are
> > created by amdgpu driver,
> 
> Yeah, but the debugfs files are not created by the driver. Those are global to
> TTM and can trivially be created during module load/unload.
[Yang, Stanley] Thanks Christian. I double-checked the ttm-related code, and 
loading the ttm module does create those debugfs files.

Stanley
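
The failure mode from the test scenario can be reproduced with a small user-space model. This is only a sketch under stated assumptions: `dir_model`, `model_create`, `model_remove_all` and `model_reload_failures` are hypothetical stand-ins, not the debugfs API, and the model only captures "creating a name that still exists fails".

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_NODES 8

/* Toy stand-in for one debugfs directory; not the kernel API. */
struct dir_model {
	const char *names[MODEL_MAX_NODES];
	int count;
};

/* Returns false when the name already exists, like the "already present!" log. */
static bool model_create(struct dir_model *dir, const char *name)
{
	int i;

	for (i = 0; i < dir->count; i++)
		if (strcmp(dir->names[i], name) == 0)
			return false;
	if (dir->count == MODEL_MAX_NODES)
		return false;
	dir->names[dir->count++] = name;
	return true;
}

static void model_remove_all(struct dir_model *dir)
{
	dir->count = 0;
}

/*
 * Simulates load -> unload -> load and returns how many creations fail
 * on the second load. cleanup_on_unload selects whether the unload
 * step removes the nodes it created.
 */
static int model_reload_failures(bool cleanup_on_unload)
{
	static const char *const nodes[] = {
		"page_pool", "page_pool_shrink", "buffer_objects",
	};
	struct dir_model dir = { .count = 0 };
	int failures = 0;
	size_t i;

	for (i = 0; i < 3; i++)
		model_create(&dir, nodes[i]);
	if (cleanup_on_unload)
		model_remove_all(&dir);
	for (i = 0; i < 3; i++)
		if (!model_create(&dir, nodes[i]))
			failures++;
	return failures;
}
```

In this model, a reload without cleanup fails for all three nodes, and a reload with cleanup fails for none. Whichever module creates the nodes has to remove them on the matching unload path.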
> 
> Christian.
> 
> >   I think it is better for the amdgpu module to remove them, since the
> > amdgpu module creates them; otherwise, creating them fails when only the
> > amdgpu module is reloaded.
> >
> > Stanley
> >>
> >> Nirmoy
> >>
> >>>
> >>> Christian.
> >>>
> >>>>
> >>>> Regards,
> >>>>
> >>>> Nirmoy
> >>>>
> >>>>
> >>>>
> >>>>> Reason:
> >>>>>   page_pool, page_pool_shrink and buffer_objects are removed on
> >>>>>   rmmod amdttm. The test scenario above only does rmmod amdgpu, so
> >>>>>   those debugfs nodes are not removed, which makes file creation fail.
> >>>>> Solution:
> >>>>>   Create a ttm_page directory under the ttm_root directory on
> >>>>>   insmod amdgpu, store page_pool, page_pool_shrink and buffer_objects
> >>>>>   in the ttm_page directory, and remove the ttm_page directory on
> >>>>>   rmmod amdgpu. This fixes the issue above.
> >>>>>
> >>>>> Signed-off-by: Stanley.Yang 
> >>>>> ---
> >>>>>    drivers/gpu/drm/ttm/ttm_device.c | 12 +++-
> >>>>>    drivers/gpu/drm/ttm/ttm_module.c |  1 +
> >>>>>    drivers/gpu/drm/ttm/ttm_module.h |  1 +
> >>>>>    drivers/gpu/drm/ttm/ttm_pool.c   |  4 ++--
> >>>>>    4 files changed, 15 insertions(+), 3 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/ttm/ttm_device.c
> >>>>> b/drivers/gpu/drm/ttm/ttm_device.c
> >>>>> index 1de23edbc182..ad170328f0c8 100644
> >>>>> --- a/drivers/gpu/drm/ttm/ttm_device.c
> >>>>> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> >>>>> @@ -55,6 +55,10 @@ static void ttm_global_release(void)
> >>>>>      ttm_pool_mgr_fini();
> >>>>>    +#ifdef CONFIG_DEBUG_FS
> >>>>> +    debugfs_remove(ttm_debugfs_page);
> >>>>> +#endif
> >>>>> +
> >>>>>    __free_page(glob->dummy_read_page);
> >>>>>    memset(glob, 0, sizeof(*glob));
> >>>>>    out:
> >>>>> @@ -85,6 +89,10 @@ static int ttm_global_init(voi

RE: [PATCH Review 1/1] drm/ttm: fix debugfs node create failed

2021-10-19 Thread Yang, Stanley
[AMD Official Use Only]


> -----Original Message-----
> From: amd-gfx  On Behalf Of Das, Nirmoy
> Sent: Thursday, October 14, 2021 2:11 AM
> To: Christian König ; amd-
> g...@lists.freedesktop.org
> Subject: Re: [PATCH Review 1/1] drm/ttm: fix debugfs node create failed
> 
> 
> On 10/13/2021 2:29 PM, Christian König wrote:
> > On 12.10.21 at 15:12, Das, Nirmoy wrote:
> >>
> >> On 10/12/2021 1:58 PM, Stanley.Yang wrote:
> >>> Test scenario:
> >>>  modprobe amdgpu -> rmmod amdgpu -> modprobe amdgpu
> >>> Error log:
> >>>  [   54.396807] debugfs: File 'page_pool' in directory 'amdttm' already present!
> >>>  [   54.396833] debugfs: File 'page_pool_shrink' in directory 'amdttm' already present!
> >>>  [   54.396848] debugfs: File 'buffer_objects' in directory 'amdttm' already present!
> >>
> >>
> >> We should instead add a check if those debugfs files already
> >> exist/created in ttm debugfs dir using debugfs_lookup() before creating.
> >
> > No, IIRC the Intel guys had fixed that already by adding/removing the
> > debugfs file on module load/unload.
> 
> 
> Adding/removing on ttm module load/unload is nicer.
The point is that page_pool, page_pool_shrink and buffer_objects are created by 
the amdgpu driver. I think it is better for the amdgpu module to remove them, 
since the amdgpu module creates them; otherwise, creating them fails when only 
the amdgpu module is reloaded.

Stanley
> 
> 
> Nirmoy
> 
> >
> >
> > Christian.
> >
> >>
> >>
> >> Regards,
> >>
> >> Nirmoy
> >>
> >>
> >>
> >>> Reason:
> >>>  page_pool, page_pool_shrink and buffer_objects are removed on
> >>>  rmmod amdttm. The test scenario above only does rmmod amdgpu, so
> >>>  those debugfs nodes are not removed, which makes file creation fail.
> >>> Solution:
> >>>  Create a ttm_page directory under the ttm_root directory on
> >>>  insmod amdgpu, store page_pool, page_pool_shrink and buffer_objects
> >>>  in the ttm_page directory, and remove the ttm_page directory on
> >>>  rmmod amdgpu. This fixes the issue above.
> >>>
> >>> Signed-off-by: Stanley.Yang 
> >>> ---
> >>>   drivers/gpu/drm/ttm/ttm_device.c | 12 +++-
> >>>   drivers/gpu/drm/ttm/ttm_module.c |  1 +
> >>>   drivers/gpu/drm/ttm/ttm_module.h |  1 +
> >>>   drivers/gpu/drm/ttm/ttm_pool.c   |  4 ++--
> >>>   4 files changed, 15 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/ttm/ttm_device.c
> >>> b/drivers/gpu/drm/ttm/ttm_device.c
> >>> index 1de23edbc182..ad170328f0c8 100644
> >>> --- a/drivers/gpu/drm/ttm/ttm_device.c
> >>> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> >>> @@ -55,6 +55,10 @@ static void ttm_global_release(void)
> >>>     ttm_pool_mgr_fini();
> >>>   +#ifdef CONFIG_DEBUG_FS
> >>> +    debugfs_remove(ttm_debugfs_page);
> >>> +#endif
> >>> +
> >>>   __free_page(glob->dummy_read_page);
> >>>   memset(glob, 0, sizeof(*glob));
> >>>   out:
> >>> @@ -85,6 +89,10 @@ static int ttm_global_init(void)
> >>>   >> PAGE_SHIFT;
> >>>   num_dma32 = min(num_dma32, 2UL << (30 - PAGE_SHIFT));
> >>>   +#ifdef CONFIG_DEBUG_FS
> >>> +    ttm_debugfs_page = debugfs_create_dir("ttm_page",
> >>> ttm_debugfs_root);
> >>> +#endif
> >>> +
> >>>   ttm_pool_mgr_init(num_pages);
> >>>   ttm_tt_mgr_init(num_pages, num_dma32);
> >>>   @@ -98,8 +106,10 @@ static int ttm_global_init(void)
> >>>   INIT_LIST_HEAD(>device_list);
> >>>   atomic_set(>bo_count, 0);
> >>>   -    debugfs_create_atomic_t("buffer_objects", 0444,
> >>> ttm_debugfs_root,
> >>> +#ifdef CONFIG_DEBUG_FS
> >>> +    debugfs_create_atomic_t("buffer_objects", 0444,
> >>> +ttm_debugfs_page,
> >>>   >bo_count);
> >>> +#endif
> >>>   out:
> >>>   mutex_unlock(_global_mutex);
> >>>   return ret;
> >>> diff --git a/drivers/gpu/drm/ttm/ttm_module.c
> >>> b/drivers/gpu/drm/ttm/ttm_module.c
> >>> index 88970a6b8e32..66595e6e7087 100644
> >>> --- a/drivers/gpu/drm/ttm/ttm_module.c
> >>> +++ b/drivers/gpu/drm/ttm/ttm_module.c
> >>> @@ -38,6 +38,7 @@
> >>>   #include "ttm_module.h"
> >>>     struct dentry *ttm_debugfs_root;
> >>> +struct dentry *ttm_debugfs_page;
> >>>     static int __init ttm_init(void)
> >>>   {
> >>> diff --git a/drivers/gpu/drm/ttm/ttm_module.h
> >>> b/drivers/gpu/drm/ttm/ttm_module.h
> >>> index d7cac5d4b835..6007dc66f44e 100644
> >>> --- a/drivers/gpu/drm/ttm/ttm_module.h
> >>> +++ b/drivers/gpu/drm/ttm/ttm_module.h
> >>> @@ -36,5 +36,6 @@
> >>>   struct dentry;
> >>>     extern struct dentry *ttm_debugfs_root;
> >>> +extern struct dentry *ttm_debugfs_page;
> >>>     #endif /* _TTM_MODULE_H_ */
> >>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c
> >>> b/drivers/gpu/drm/ttm/ttm_pool.c
> >>> index 8be7fd7161fd..ecb33daad7b5 100644
> >>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> >>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> >>> @@ -709,9 +709,9 @@ int ttm_pool_mgr_init(unsigned long num_pages)
> >>>   }
> >>>     #ifdef CONFIG_DEBUG_FS
> >>> -    debugfs_create_file("page_pool", 0444, ttm_debugfs_root, 

RE: [PATCH] drm/amdgpu: fix sdma firmware version error in sriov

2021-06-01 Thread Yang, Stanley
[AMD Official Use Only]

Reviewed-by: Stanley.Yang 

Regards,
Stanley
> -----Original Message-----
> From: Wang, Kevin(Yang) 
> Sent: Monday, May 31, 2021 5:33 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: frank@amd.ccom; Yang, Stanley ; Wang,
> Kevin(Yang) 
> Subject: [PATCH] drm/amdgpu: fix sdma firmware version error in sriov
> 
> Re-adjust the function return order to avoid empty sdma version in the sriov
> environment. (read amdgpu_firmware_info)
> 
> Signed-off-by: Kevin Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> index deb907f96090..98059bce692f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> @@ -147,9 +147,6 @@ static int sdma_v5_2_init_microcode(struct
> amdgpu_device *adev)
>   struct amdgpu_firmware_info *info = NULL;
>   const struct common_firmware_header *header = NULL;
> 
> - if (amdgpu_sriov_vf(adev) && (adev->asic_type ==
> CHIP_SIENNA_CICHLID))
> - return 0;
> -
>   DRM_DEBUG("\n");
> 
>   switch (adev->asic_type) {
> @@ -187,6 +184,9 @@ static int sdma_v5_2_init_microcode(struct
> amdgpu_device *adev)
>  (void *)>sdma.instance[0],
>  sizeof(struct amdgpu_sdma_instance));
> 
> + if (amdgpu_sriov_vf(adev) && (adev->asic_type ==
> CHIP_SIENNA_CICHLID))
> + return 0;
> +
>   DRM_DEBUG("psp_load == '%s'\n",
> adev->firmware.load_type == AMDGPU_FW_LOAD_PSP ?
> "true" : "false");
> 
> --
> 2.17.1
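
The effect of moving the early return in the patch above can be sketched with a small model. This is only an illustration under assumptions: `fw_info_model`, `init_microcode_model` and its parameters are hypothetical, and `parsed_version` stands in for whatever the real function reads from the firmware header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the firmware info the driver reports. */
struct fw_info_model {
	uint32_t sdma_fw_version;
};

/*
 * Models the two orderings. With skip_before_parse set, the SR-IOV
 * early return happens before the header is parsed, so the version
 * stays at 0. Otherwise the version is recorded first and only the
 * later load bookkeeping is skipped.
 */
static int init_microcode_model(struct fw_info_model *info, bool is_sriov_vf,
				bool skip_before_parse, uint32_t parsed_version)
{
	if (is_sriov_vf && skip_before_parse)
		return 0;

	info->sdma_fw_version = parsed_version;

	if (is_sriov_vf)
		return 0;

	/* Bare-metal path: firmware load bookkeeping would follow here. */
	return 0;
}
```

With the old ordering the reported version is empty on a VF. With the new ordering the version is filled in before the function returns.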
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 2/2] drm/amdgpu: only harvest gcea/mmea error status in aldebaran

2021-04-16 Thread Yang, Stanley
[AMD Official Use Only - Internal Distribution Only]

Series is Reviewed-by: Stanley.Yang 

Regards,
Stanley
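
The field check added by the quoted patch below can be sketched in user space as follows. The bit positions and masks here are hypothetical placeholders, as is the `SK_` prefix and the helper name `ea_err_should_report`; the real GCEA_ERR_STATUS layout comes from the register headers and is not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. */
#define SK_SDP_RDRSP_STATUS_SHIFT            0
#define SK_SDP_RDRSP_STATUS_MASK             0x0000000fu
#define SK_SDP_WRRSP_STATUS_SHIFT            4
#define SK_SDP_WRRSP_STATUS_MASK             0x000000f0u
#define SK_SDP_RDRSP_DATAPARITY_ERROR_SHIFT  8
#define SK_SDP_RDRSP_DATAPARITY_ERROR_MASK   0x00000100u

/* Extracts one named field from a register value. */
#define SK_GET_FIELD(value, field) \
	(((value) & SK_##field##_MASK) >> SK_##field##_SHIFT)

/*
 * Reports only when one of the three fields of interest is set,
 * instead of warning on any non-zero register value.
 */
static bool ea_err_should_report(uint32_t reg_value)
{
	return SK_GET_FIELD(reg_value, SDP_RDRSP_STATUS) ||
	       SK_GET_FIELD(reg_value, SDP_WRRSP_STATUS) ||
	       SK_GET_FIELD(reg_value, SDP_RDRSP_DATAPARITY_ERROR);
}
```

A register value with only unrelated bits set no longer triggers the warning, which is the behavioural change the commit message describes.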
> -Original Message-
> From: Zhang, Hawking 
> Sent: Friday, April 16, 2021 5:44 PM
> To: amd-gfx@lists.freedesktop.org; Yang, Stanley ;
> John Clements ; Li, Dennis
> 
> Cc: Zhang, Hawking 
> Subject: [PATCH 2/2] drm/amdgpu: only harvest gcea/mmea error status in
> aldebaran
> 
> In aldebaran, driver only needs to harvest SDP RdRspStatus, WrRspStatus
> and first parity error on RdRsp data. Check error type before harvest error
> information.
> 
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c | 21 -
> drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c | 11 +++
>  2 files changed, 19 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
> index 9ca76a3ac38c..91427543aabe 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
> @@ -808,7 +808,7 @@ static struct gfx_v9_4_2_utc_block
> gfx_v9_4_2_utc_blocks[] = {
> REG_SET_FIELD(0, ATC_L2_CACHE_4K_DSM_CNTL,
> WRITE_COUNTERS, 1) },  };
> 
> -static const struct soc15_reg_entry gfx_v9_4_2_rdrsp_status_regs =
> +static const struct soc15_reg_entry gfx_v9_4_2_ea_err_status_regs =
>   { SOC15_REG_ENTRY(GC, 0, regGCEA_ERR_STATUS), 0, 1, 16 };
> 
>  static int gfx_v9_4_2_get_reg_error_count(struct amdgpu_device *adev,
> @@ -1040,11 +1040,11 @@ static void
> gfx_v9_4_2_reset_ea_err_status(struct amdgpu_device *adev)
>   uint32_t i, j;
> 
>   mutex_lock(>grbm_idx_mutex);
> - for (i = 0; i < gfx_v9_4_2_rdrsp_status_regs.se_num; i++) {
> - for (j = 0; j < gfx_v9_4_2_rdrsp_status_regs.instance;
> + for (i = 0; i < gfx_v9_4_2_ea_err_status_regs.se_num; i++) {
> + for (j = 0; j < gfx_v9_4_2_ea_err_status_regs.instance;
>j++) {
>   gfx_v9_4_2_select_se_sh(adev, i, 0, j);
> -
>   WREG32(SOC15_REG_ENTRY_OFFSET(gfx_v9_4_2_rdrsp_status_reg
> s), 0x10);
> +
>   WREG32(SOC15_REG_ENTRY_OFFSET(gfx_v9_4_2_ea_err_status_re
> gs), 0x10);
>   }
>   }
>   gfx_v9_4_2_select_se_sh(adev, 0x, 0x, 0x);
> @@ -1089,17 +1089,20 @@ static void gfx_v9_4_2_query_ea_err_status(struct
> amdgpu_device *adev)
> 
>   mutex_lock(>grbm_idx_mutex);
> 
> - for (i = 0; i < gfx_v9_4_2_rdrsp_status_regs.se_num; i++) {
> - for (j = 0; j < gfx_v9_4_2_rdrsp_status_regs.instance;
> + for (i = 0; i < gfx_v9_4_2_ea_err_status_regs.se_num; i++) {
> + for (j = 0; j < gfx_v9_4_2_ea_err_status_regs.instance;
>j++) {
>   gfx_v9_4_2_select_se_sh(adev, i, 0, j);
>   reg_value = RREG32(SOC15_REG_ENTRY_OFFSET(
> - gfx_v9_4_2_rdrsp_status_regs));
> - if (reg_value)
> + gfx_v9_4_2_ea_err_status_regs));
> + if (REG_GET_FIELD(reg_value, GCEA_ERR_STATUS,
> SDP_RDRSP_STATUS) ||
> + REG_GET_FIELD(reg_value, GCEA_ERR_STATUS,
> SDP_WRRSP_STATUS) ||
> + REG_GET_FIELD(reg_value, GCEA_ERR_STATUS,
> +SDP_RDRSP_DATAPARITY_ERROR)) {
>   dev_warn(adev->dev, "GCEA err detected at
> instance: %d, status: 0x%x!\n",
>   j, reg_value);
> + }
>   /* clear after read */
> -
>   WREG32(SOC15_REG_ENTRY_OFFSET(gfx_v9_4_2_rdrsp_status_reg
> s), 0x10);
> +
>   WREG32(SOC15_REG_ENTRY_OFFSET(gfx_v9_4_2_ea_err_status_re
> gs), 0x10);
>   }
>   }
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c
> b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c
> index d0f41346ea0c..cc69c434d0de 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c
> @@ -1286,7 +1286,7 @@ static void
> mmhub_v1_7_reset_ras_error_count(struct amdgpu_device *adev)
>   }
>  }
> 
> -static const struct soc15_reg_entry mmhub_v1_7_err_status_regs[] = {
> +static const struct soc15_reg_entry mmhub_v1_7_ea_err_status_regs[] = {
>   { SOC15_REG_ENTRY(MMHUB, 0, regMMEA0_ERR_STATUS), 0, 0, 0 },
>   { SOC15_REG_ENTRY(MMHUB, 0, regMMEA1_ERR_STATUS), 0, 0, 0 },
>   { SOC15_REG_ENTRY(MMHUB, 0, regMMEA2_ERR_STATUS), 0, 0, 0 },
> @@ -1303,12 +1303,15 @@ static void
> mmhub_v1_7_query_ras_error_status(struct amdgpu_device *adev)
>   if (!amdgpu_ras_is_supp

Recall: [PATCH] drm/amd/sriov no need to config GECC for sriov

2021-04-14 Thread Yang, Stanley
Yang, Stanley would like to recall the message, "[PATCH] drm/amd/sriov no need 
to config GECC for sriov".


RE: [PATCH] drm/amd/sriov no need to config GECC for sriov

2021-04-14 Thread Yang, Stanley
[AMD Official Use Only - Internal Distribution Only]

Stanley.Yang 

Regards,
Stanley
> -Original Message-
> From: Jack Zhang 
> Sent: Wednesday, April 14, 2021 5:04 PM
> To: amd-gfx@lists.freedesktop.org; Yang, Stanley ;
> Clements, John ; Zhang, Hawking
> 
> Cc: Zhang, Jack (Jian) 
> Subject: [PATCH] drm/amd/sriov no need to config GECC for sriov
> 
> No need to config GECC feature here for sriov. Leave the host driver to do
> the configuration job.
> 
> Signed-off-by: Jack Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index 123ab3156f5a..7bdf93716fbf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -567,7 +567,7 @@ static int psp_boot_config_set(struct amdgpu_device
> *adev)
>   struct psp_context *psp = >psp;
>   struct psp_gfx_cmd_resp *cmd = psp->cmd;
> 
> - if (adev->asic_type != CHIP_SIENNA_CICHLID)
> + if (adev->asic_type != CHIP_SIENNA_CICHLID ||
> amdgpu_sriov_vf(adev))
>   return 0;
> 
>   memset(cmd, 0, sizeof(struct psp_gfx_cmd_resp));
> --
> 2.25.1
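
The patched guard above can be read as a predicate. The sketch below is a user-space illustration only: `asic_model` and `driver_configures_gecc` are hypothetical names, and the enum lists just two ASICs for the sake of the example.

```c
#include <stdbool.h>

/* Hypothetical subset of ASIC types, for illustration only. */
enum asic_model {
	ASIC_MODEL_NAVI12,
	ASIC_MODEL_SIENNA_CICHLID,
};

/*
 * Mirrors the patched guard: the boot-config path runs only when
 * "asic != SIENNA_CICHLID || is_sriov_vf" is false.
 */
static bool driver_configures_gecc(enum asic_model asic, bool is_sriov_vf)
{
	if (asic != ASIC_MODEL_SIENNA_CICHLID || is_sriov_vf)
		return false;
	return true;
}
```

Only the bare-metal SIENNA_CICHLID case reaches the configuration step. A VF on the same ASIC returns early and leaves the job to the host driver.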


RE: [PATCH] drm/amd/sriov no need to config GECC for sriov

2021-04-14 Thread Yang, Stanley
[AMD Official Use Only - Internal Distribution Only]

Reviewed-by: Hawking Zhang 

Regards,
Stanley
> -Original Message-
> From: Jack Zhang 
> Sent: Wednesday, April 14, 2021 5:04 PM
> To: amd-gfx@lists.freedesktop.org; Yang, Stanley ;
> Clements, John ; Zhang, Hawking
> 
> Cc: Zhang, Jack (Jian) 
> Subject: [PATCH] drm/amd/sriov no need to config GECC for sriov
> 
> No need to config GECC feature here for sriov. Leave the host driver to do
> the configuration job.
> 
> Signed-off-by: Jack Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index 123ab3156f5a..7bdf93716fbf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -567,7 +567,7 @@ static int psp_boot_config_set(struct amdgpu_device
> *adev)
>   struct psp_context *psp = >psp;
>   struct psp_gfx_cmd_resp *cmd = psp->cmd;
> 
> - if (adev->asic_type != CHIP_SIENNA_CICHLID)
> + if (adev->asic_type != CHIP_SIENNA_CICHLID ||
> amdgpu_sriov_vf(adev))
>   return 0;
> 
>   memset(cmd, 0, sizeof(struct psp_gfx_cmd_resp));
> --
> 2.25.1


RE: [PATCH 1/1] drm/amdgpu: skip load smu and sdma microcode on sriov for SIENNA_CICHLID

2020-12-13 Thread Yang, Stanley
[AMD Public Use]

Hi Lijo,

Good point, I will modify and send patch version two.

Regards,
Stanley
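
One way to read Lijo's suggestion quoted below is that the generic init path should only dispatch, and the generation-specific code should decide whether to skip the load. The following is a user-space sketch of that shape, not driver code. All names (`smu_model`, `smu_model_funcs`, `smu_v11_model_init_microcode`, `smu_model_sw_init`) are hypothetical, and the skip condition is reduced to a single SR-IOV flag.

```c
#include <stdbool.h>
#include <stddef.h>

struct smu_model;

/* Hypothetical per-generation callback table. */
struct smu_model_funcs {
	int (*init_microcode)(struct smu_model *smu);
};

struct smu_model {
	bool is_sriov_vf;
	bool fw_requested;
	const struct smu_model_funcs *funcs;
};

/* Generation-specific code owns the decision to skip the load. */
static int smu_v11_model_init_microcode(struct smu_model *smu)
{
	if (smu->is_sriov_vf)
		return 0;
	smu->fw_requested = true;
	return 0;
}

static const struct smu_model_funcs smu_v11_model_funcs = {
	.init_microcode = smu_v11_model_init_microcode,
};

/* Generic code stays free of ASIC checks and simply dispatches. */
static int smu_model_sw_init(struct smu_model *smu)
{
	if (!smu->funcs || !smu->funcs->init_microcode)
		return 0;
	return smu->funcs->init_microcode(smu);
}
```

With this split, adding another ASIC to the skip list touches only the generation-specific callback, and the generic `sw_init` condition does not grow.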
> -Original Message-
> From: Lazar, Lijo 
> Sent: Monday, December 14, 2020 12:01 PM
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org
> Cc: Yang, Stanley ; Jian, Jane 
> Subject: RE: [PATCH 1/1] drm/amdgpu: skip load smu and sdma microcode on
> sriov for SIENNA_CICHLID
> 
> [AMD Public Use]
> 
> >-Original Message-
> >From: amd-gfx  On Behalf Of
> >Stanley.Yang
> >Sent: Monday, December 14, 2020 8:41 AM
> >To: amd-gfx@lists.freedesktop.org
> >Cc: Yang, Stanley ; Jian, Jane
> >
> >Subject: [PATCH 1/1] drm/amdgpu: skip load smu and sdma microcode on
> >sriov for SIENNA_CICHLID
> >
> >[CAUTION: External Email]
> >
> >Skip loading smu and sdma fw on sriov, since smc, sos, ta and asd fw are
> >already skipped for SIENNA_CICHLID.
> >
> >Signed-off-by: Stanley.Yang 
> >---
> > drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c| 3 +++
> > drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 4 +++-
> > 2 files changed, 6 insertions(+), 1 deletion(-)
> >
> >diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> >b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> >index 39e17aae655f..87566dee048d 100644
> >--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> >+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> >@@ -153,6 +153,9 @@ static int sdma_v5_2_init_microcode(struct
> >amdgpu_device *adev)
> >struct amdgpu_firmware_info *info = NULL;
> >const struct common_firmware_header *header = NULL;
> >
> >+   if (amdgpu_sriov_vf(adev) && (adev->asic_type ==
> >CHIP_SIENNA_CICHLID))
> >+   return 0;
> >+
> >DRM_DEBUG("\n");
> >
> >switch (adev->asic_type) {
> >diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> >b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> >index cf999b7a2164..31f05d96586c 100644
> >--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> >+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> >@@ -847,7 +847,9 @@ static int smu_sw_init(void *handle)
> >smu->smu_dpm.dpm_level = AMD_DPM_FORCED_LEVEL_AUTO;
> >smu->smu_dpm.requested_dpm_level =
> AMD_DPM_FORCED_LEVEL_AUTO;
> >
> >-   if (!amdgpu_sriov_vf(adev) || (adev->asic_type != CHIP_NAVI12)) {
> >+   if (!amdgpu_sriov_vf(adev) ||
> >+   ((adev->asic_type != CHIP_NAVI12) &&
> >+   (adev->asic_type != CHIP_SIENNA_CICHLID))) {
> >ret = smu_init_microcode(smu);
> >if (ret) {
> >dev_err(adev->dev, "Failed to load smu
> >firmware!\n");
> >--
> 
> It's not good to keep adding ASIC checks in the generic interface code. Move
> this check to smuv11.
> 
> Thanks,
> Lijo
> 
> >2.17.1


RE: [PATCH V2 1/1] drm/amdgpu: only skip smc sdma sos ta and asd fw in SRIOV for navi12

2020-11-24 Thread Yang, Stanley
[AMD Public Use]

Hi Guchun,

Thanks for your review.

Regards,
Stanley

> -Original Message-
> From: Chen, Guchun 
> Sent: Wednesday, November 25, 2020 2:45 PM
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org;
> Chen, JingWen 
> Subject: RE: [PATCH V2 1/1] drm/amdgpu: only skip smc sdma sos ta and asd
> fw in SRIOV for navi12
> 
> [AMD Public Use]
> 
> Okay. With that fixed, the patch is:
> 
> Reviewed-by: Guchun Chen 
> 
> Regards,
> Guchun
> 
> -Original Message-
> From: Yang, Stanley 
> Sent: Tuesday, November 24, 2020 10:37 PM
> To: Chen, Guchun ; amd-
> g...@lists.freedesktop.org; Chen, JingWen 
> Subject: RE: [PATCH V2 1/1] drm/amdgpu: only skip smc sdma sos ta and asd
> fw in SRIOV for navi12
> 
> [AMD Public Use]
> 
> Hi Guchun,
> 
> This is an oversight. I forgot to remove it from the first version of the patch.
> 
> Regards,
> Stanley
> > -----Original Message-
> > From: Chen, Guchun 
> > Sent: Tuesday, November 24, 2020 9:47 PM
> > To: Yang, Stanley ;
> > amd-gfx@lists.freedesktop.org; Chen, JingWen
> 
> > Cc: Yang, Stanley 
> > Subject: RE: [PATCH V2 1/1] drm/amdgpu: only skip smc sdma sos ta and
> > asd fw in SRIOV for navi12
> >
> > [AMD Public Use]
> >
> > --- a/drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c
> > +++ b/drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c
> > @@ -208,14 +208,13 @@ static int vega10_smu_init(struct pp_hwmgr
> > *hwmgr)
> > unsigned long tools_size;
> > int ret;
> > struct cgs_firmware_info info = {0};
> > +   struct amdgpu_device *adev = hwmgr->adev;
> >
> > Why add this local variable? Looks no one is using it.
> >
> > Regards,
> > Guchun
> >
> > -Original Message-
> > From: amd-gfx  On Behalf Of
> > Stanley.Yang
> > Sent: Tuesday, November 24, 2020 5:49 PM
> > To: amd-gfx@lists.freedesktop.org; Chen, JingWen
> > 
> > Cc: Yang, Stanley 
> > Subject: [PATCH V2 1/1] drm/amdgpu: only skip smc sdma sos ta and asd
> > fw in SRIOV for navi12
> >
> > The KFDTopologyTest.BasicTest fails if smc, sdma, sos, ta and asd fw
> > are skipped in SRIOV for vega10, so adjust the fw checks above and skip
> > loading them in SRIOV only for navi12.
> >
> > v2: remove unnecessary asic type check.
> >
> > Signed-off-by: Stanley.Yang 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c  |  3 ---
> >  drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c  |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c  |  3 ---
> >  .../gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c | 13
> ++---
> > 
> >  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c   |  2 +-
> >  5 files changed, 8 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> > b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> > index 16b551f330a4..8309dd95aa48 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> > @@ -593,9 +593,6 @@ static int sdma_v4_0_init_microcode(struct
> > amdgpu_device *adev)
> > struct amdgpu_firmware_info *info = NULL;
> > const struct common_firmware_header *header = NULL;
> >
> > -   if (amdgpu_sriov_vf(adev))
> > -   return 0;
> > -
> > DRM_DEBUG("\n");
> >
> > switch (adev->asic_type) {
> > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> > b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> > index 9c72b95b7463..fad1cc394219 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> > @@ -203,7 +203,7 @@ static int sdma_v5_0_init_microcode(struct
> > amdgpu_device *adev)
> > const struct common_firmware_header *header = NULL;
> > const struct sdma_firmware_header_v1_0 *hdr;
> >
> > -   if (amdgpu_sriov_vf(adev))
> > +   if (amdgpu_sriov_vf(adev) && (adev->asic_type == CHIP_NAVI12))
> > return 0;
> >
> > DRM_DEBUG("\n");
> > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> > b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> > index cb5a6f1437f8..5ea11a0f568f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> > @@ -153,9 +153,6 @@ static int sdma_v5_2_init_microcode(struct
> > amdgpu_device *adev)
> > struct amdgpu_firmware_info *info = NULL;
> > const struct common_firmware_header *header = NULL;
> >
> > -   if 

RE: [PATCH V2 1/1] drm/amdgpu: only skip smc sdma sos ta and asd fw in SRIOV for navi12

2020-11-24 Thread Yang, Stanley
[AMD Public Use]

Hi Guchun,

This is an oversight. I forgot to remove it from the first version of the patch.

Regards,
Stanley
> -Original Message-
> From: Chen, Guchun 
> Sent: Tuesday, November 24, 2020 9:47 PM
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org;
> Chen, JingWen 
> Cc: Yang, Stanley 
> Subject: RE: [PATCH V2 1/1] drm/amdgpu: only skip smc sdma sos ta and asd
> fw in SRIOV for navi12
> 
> [AMD Public Use]
> 
> --- a/drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c
> +++ b/drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c
> @@ -208,14 +208,13 @@ static int vega10_smu_init(struct pp_hwmgr
> *hwmgr)
>   unsigned long tools_size;
>   int ret;
>   struct cgs_firmware_info info = {0};
> + struct amdgpu_device *adev = hwmgr->adev;
> 
> Why add this local variable? It looks like nothing uses it.
> 
> Regards,
> Guchun
> 
> -Original Message-
> From: amd-gfx  On Behalf Of
> Stanley.Yang
> Sent: Tuesday, November 24, 2020 5:49 PM
> To: amd-gfx@lists.freedesktop.org; Chen, JingWen
> 
> Cc: Yang, Stanley 
> Subject: [PATCH V2 1/1] drm/amdgpu: only skip smc sdma sos ta and asd fw
> in SRIOV for navi12
> 
> KFDTopologyTest.BasicTest fails on vega10 under SRIOV if the smc, sdma, sos,
> ta and asd firmware is skipped, so adjust the checks and skip loading this
> firmware under SRIOV only for navi12.
> 
> v2: remove unnecessary asic type check.
> 
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c  |  3 ---
>  drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c  |  2 +-
>  drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c  |  3 ---
>  .../gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c | 13 ++---
> 
>  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c   |  2 +-
>  5 files changed, 8 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> index 16b551f330a4..8309dd95aa48 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> @@ -593,9 +593,6 @@ static int sdma_v4_0_init_microcode(struct
> amdgpu_device *adev)
>   struct amdgpu_firmware_info *info = NULL;
>   const struct common_firmware_header *header = NULL;
> 
> - if (amdgpu_sriov_vf(adev))
> - return 0;
> -
>   DRM_DEBUG("\n");
> 
>   switch (adev->asic_type) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> index 9c72b95b7463..fad1cc394219 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> @@ -203,7 +203,7 @@ static int sdma_v5_0_init_microcode(struct
> amdgpu_device *adev)
>   const struct common_firmware_header *header = NULL;
>   const struct sdma_firmware_header_v1_0 *hdr;
> 
> - if (amdgpu_sriov_vf(adev))
> + if (amdgpu_sriov_vf(adev) && (adev->asic_type == CHIP_NAVI12))
>   return 0;
> 
>   DRM_DEBUG("\n");
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> index cb5a6f1437f8..5ea11a0f568f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> @@ -153,9 +153,6 @@ static int sdma_v5_2_init_microcode(struct
> amdgpu_device *adev)
>   struct amdgpu_firmware_info *info = NULL;
>   const struct common_firmware_header *header = NULL;
> 
> - if (amdgpu_sriov_vf(adev))
> - return 0;
> -
>   DRM_DEBUG("\n");
> 
>   switch (adev->asic_type) {
> diff --git a/drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c
> b/drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c
> index daf122f24f23..e2192d8762a4 100644
> --- a/drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c
> +++ b/drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c
> @@ -208,14 +208,13 @@ static int vega10_smu_init(struct pp_hwmgr
> *hwmgr)
>   unsigned long tools_size;
>   int ret;
>   struct cgs_firmware_info info = {0};
> + struct amdgpu_device *adev = hwmgr->adev;
> 
> - if (!amdgpu_sriov_vf((struct amdgpu_device *)hwmgr->adev)) {
> - ret = cgs_get_firmware_info(hwmgr->device,
> - CGS_UCODE_ID_SMU,
> - );
> - if (ret || !info.kptr)
> - return -EINVAL;
> - }
> + ret = cgs_get_firmware_info(hwmgr->device,
> + CGS_UCODE_ID_SMU,
> + );
> + if (ret || !info.

RE: [PATCH 1/1] drm/amdgpu: only skip smc sdma sos ta and asd fw in SRIOV for navi12

2020-11-23 Thread Yang, Stanley
[AMD Public Use]

Hi Hawking,

Thanks for the reminder. It is unnecessary to check the navi12 asic type in
sdma v4; I will update the patch and resend it.
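
To make the intended condition explicit, here is a minimal stand-alone sketch of the skip logic. It is not driver code: the enum and both helper names are hypothetical stand-ins for adev->asic_type and amdgpu_sriov_vf(). Since Navi12 does not integrate sdma v4, the combined condition can never be true inside an sdma v4 function, which is why the check can simply be dropped there.

```c
#include <stdbool.h>

/* Hypothetical, reduced stand-in for the driver's asic_type values. */
enum asic { ASIC_VEGA10, ASIC_NAVI12 };

/* Firmware loading is skipped only for an SRIOV VF running on Navi12. */
static bool skip_fw_load(bool is_sriov_vf, enum asic type)
{
	return is_sriov_vf && type == ASIC_NAVI12;
}

/*
 * Inverted form, as written in the vega10_smumgr hunk of the first patch.
 * By De Morgan's law this is exactly !skip_fw_load().
 */
static bool must_load_fw(bool is_sriov_vf, enum asic type)
{
	return !is_sriov_vf || type != ASIC_NAVI12;
}
```

With these helpers, skip_fw_load(true, ASIC_VEGA10) is false, so a vega10 VF still loads its firmware, which is the behaviour the test needs.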

Regards,
Stanley

> -Original Message-
> From: Zhang, Hawking 
> Sent: Monday, November 23, 2020 9:31 PM
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org;
> Chen, JingWen 
> Cc: Yang, Stanley 
> Subject: RE: [PATCH 1/1] drm/amdgpu: only skip smc sdma sos ta and asd fw
> in SRIOV for navi12
> 
> [AMD Public Use]
> 
> @@ -593,7 +593,7 @@ static int sdma_v4_0_init_microcode(struct
> amdgpu_device *adev)
>   struct amdgpu_firmware_info *info = NULL;
>   const struct common_firmware_header *header = NULL;
> 
> - if (amdgpu_sriov_vf(adev))
> + if (amdgpu_sriov_vf(adev) && (adev->asic_type == CHIP_NAVI12))
> 
> Navi12 doesn't integrate sdma v4. Why do we need to check for Navi12 in the
> sdma v4 function?
> 
> Regards,
> Hawking
> 
> -Original Message-
> From: amd-gfx  On Behalf Of
> Stanley.Yang
> Sent: Monday, November 23, 2020 21:19
> To: amd-gfx@lists.freedesktop.org; Chen, JingWen
> 
> Cc: Yang, Stanley 
> Subject: [PATCH 1/1] drm/amdgpu: only skip smc sdma sos ta and asd fw in
> SRIOV for navi12
> 
> KFDTopologyTest.BasicTest fails on vega10 under SRIOV if the smc, sdma, sos,
> ta and asd firmware is skipped, so adjust the checks and skip loading this
> firmware under SRIOV only for navi12.
> 
> Signed-off-by: Stanley.Yang 
> Change-Id: Id354be93723d7b5d769d73dc67c596af300305af
> ---
>  drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c  | 2 +-
>  drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c  | 2 +-
>  drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c  | 2 +-
>  drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c | 3 ++-
>  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c   | 2 +-
>  5 files changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> index 16b551f330a4..7e2f063120d8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> @@ -593,7 +593,7 @@ static int sdma_v4_0_init_microcode(struct
> amdgpu_device *adev)
>   struct amdgpu_firmware_info *info = NULL;
>   const struct common_firmware_header *header = NULL;
> 
> - if (amdgpu_sriov_vf(adev))
> + if (amdgpu_sriov_vf(adev) && (adev->asic_type == CHIP_NAVI12))
>   return 0;
> 
>   DRM_DEBUG("\n");
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> index 9c72b95b7463..fad1cc394219 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c
> @@ -203,7 +203,7 @@ static int sdma_v5_0_init_microcode(struct
> amdgpu_device *adev)
>   const struct common_firmware_header *header = NULL;
>   const struct sdma_firmware_header_v1_0 *hdr;
> 
> - if (amdgpu_sriov_vf(adev))
> + if (amdgpu_sriov_vf(adev) && (adev->asic_type == CHIP_NAVI12))
>   return 0;
> 
>   DRM_DEBUG("\n");
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> index cb5a6f1437f8..674bc88c3ec1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> @@ -153,7 +153,7 @@ static int sdma_v5_2_init_microcode(struct
> amdgpu_device *adev)
>   struct amdgpu_firmware_info *info = NULL;
>   const struct common_firmware_header *header = NULL;
> 
> - if (amdgpu_sriov_vf(adev))
> + if (amdgpu_sriov_vf(adev) && (adev->asic_type == CHIP_NAVI12))
>   return 0;
> 
>   DRM_DEBUG("\n");
> diff --git a/drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c
> b/drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c
> index daf122f24f23..192149e94f6c 100644
> --- a/drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c
> +++ b/drivers/gpu/drm/amd/pm/powerplay/smumgr/vega10_smumgr.c
> @@ -208,8 +208,9 @@ static int vega10_smu_init(struct pp_hwmgr *hwmgr)
>   unsigned long tools_size;
>   int ret;
>   struct cgs_firmware_info info = {0};
> + struct amdgpu_device *adev = hwmgr->adev;
> 
> - if (!amdgpu_sriov_vf((struct amdgpu_device *)hwmgr->adev)) {
> + if (!amdgpu_sriov_vf(adev) || (adev->asic_type != CHIP_NAVI12)) {
>   ret = cgs_get_firmware_info(hwmgr->device,
>   CGS_UCODE_ID_SMU,
>   );
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> i

RE: [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras

2020-07-22 Thread Yang, Stanley
[AMD Official Use Only - Internal Distribution Only]

Hi Guchun,

Please see my comment inline.

> -Original Message-
> From: Chen, Guchun 
> Sent: Wednesday, July 22, 2020 11:14 AM
> To: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> ; Zhang, Hawking
> ; Li, Dennis ; Yang,
> Stanley ; Zhou1, Tao ;
> Clements, John 
> Cc: Chen, Guchun 
> Subject: [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras
> 
> The bad page threshold must lie in the range between -1 and the max
> records length of the eeprom. It determines when the GPU should be
> retired.
> 
> Signed-off-by: Guchun Chen 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   | 43
> +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h   |  3 ++
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c|  5 +++
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h|  2 +
>  4 files changed, 53 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 6f06e1214622..e3d67d85c55f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -69,6 +69,9 @@ const char *ras_block_string[] = {
>  /* inject address is 52 bits */
>  #define  RAS_UMC_INJECT_ADDR_LIMIT   (0x1ULL << 52)
> 
> +/* typical ECC bad page rate(1 bad page per 100MB VRAM) */
> +#define RAS_BAD_PAGE_RATE(100 * 1024 * 1024ULL)
> +
>  enum amdgpu_ras_retire_page_reservation {
>   AMDGPU_RAS_RETIRE_PAGE_RESERVED,
>   AMDGPU_RAS_RETIRE_PAGE_PENDING,
> @@ -1700,6 +1703,42 @@ static bool amdgpu_ras_check_bad_page(struct
> amdgpu_device *adev,
>   return ret;
>  }
> 
> +static void amdgpu_ras_validate_threshold(struct amdgpu_device *adev,
> + uint32_t max_length)
> +{
> + struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> + int tmp_threshold = amdgpu_bad_page_threshold;
> + u64 val;
> +
> + /*
> +  * Justification of value bad_page_cnt_threshold in ras structure
> +  *
> +  * 1. -1 <= amdgpu_bad_page_threshold <= max record length in
> eeprom
> +  * 2. if amdgpu_bad_page_threshold = -1,
> +  *bad_page_cnt_threshold = typical value by formula.
> +  * 3. if amdgpu_bad_page_threshold = 0,
> +  *bad_page_cnt_threshold = 0x,
> +  *and disable RMA feature accordingly.
> +  * 4. use the value specified from user when
> (amdgpu_bad_page_threshold
> +  *> 0 && < max record length in eeprom).
> +  */
> +
> + if (tmp_threshold < -1)
> + tmp_threshold = -1;
> + else if (tmp_threshold > max_length)
> +         tmp_threshold = max_length;
> +
> + if (tmp_threshold == -1) {
> + val = adev->gmc.mc_vram_size;
> + do_div(val, RAS_BAD_PAGE_RATE);
> + con->bad_page_cnt_threshold = lower_32_bits(val);
[Yang, Stanley] : It would be better to compare con->bad_page_cnt_threshold
with max_length here; the computed value of bad_page_cnt_threshold should not
exceed max_length.
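
A minimal sketch of the clamp being suggested, written as stand-alone C rather than driver code. The helper name is hypothetical; RAS_BAD_PAGE_RATE is copied from the patch, and plain 64-bit division stands in for do_div().

```c
#include <stdint.h>

/* Copied from the patch: one bad page per 100 MB of VRAM. */
#define RAS_BAD_PAGE_RATE (100 * 1024 * 1024ULL)

/*
 * Hypothetical helper: derive the default threshold from the VRAM size
 * and cap it at the eeprom record capacity (max_length).
 */
static uint32_t default_bad_page_threshold(uint64_t vram_size,
					   uint32_t max_length)
{
	uint64_t val = vram_size / RAS_BAD_PAGE_RATE;

	return val > max_length ? max_length : (uint32_t)val;
}
```

For example, 16 GiB of VRAM gives 163 by the formula, which is kept as long as max_length is larger; 64 GiB gives 655, which would be capped to a max_length of 256.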

> + } else if (tmp_threshold == 0) {
> + con->bad_page_cnt_threshold = 0x;
> + } else {
> + con->bad_page_cnt_threshold = tmp_threshold;
> + }
> +}
> +
>  /* called in gpu recovery/init */
>  int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev)  { @@ -
> 1777,6 +1816,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device
> *adev)  {
>   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>   struct ras_err_handler_data **data;
> + uint32_t max_eeprom_records_len = 0;
>   int ret;
> 
>   if (con)
> @@ -1795,6 +1835,9 @@ int amdgpu_ras_recovery_init(struct
> amdgpu_device *adev)
>   atomic_set(>in_recovery, 0);
>   con->adev = adev;
> 
> + max_eeprom_records_len =
> amdgpu_ras_eeprom_get_record_max_length();
> + amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
> +
>   ret = amdgpu_ras_eeprom_init(>eeprom_control);
>   if (ret)
>   goto free;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index b2667342cf67..4672649a9293 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -336,6 +336,9 @@ struct amdgpu_ras {
>   struct amdgpu_ras_eeprom_control eeprom_control;
> 
>   bool error_query_ready;
> +
> + /* bad page count threshold */
> + uint32_t bad_page_cnt_threshold;
>  };
> 
>  struct ras_fs_data {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c

RE: [PATCH V3] drm/amdgpu: support reserve bad page for virt

2020-06-05 Thread Yang, Stanley
[AMD Official Use Only - Internal Distribution Only]

Please ignore this patch, will resend.

> -Original Message-
> From: Stanley.Yang 
> Sent: Friday, June 5, 2020 1:54 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Chen, Guchun
> ; Liu, Monk ; Clements,
> John ; Zhou1, Tao ; Li,
> Dennis ; Yang, Stanley 
> Subject: [PATCH V3] drm/amdgpu: support reserve bad page for virt
> 
> Changed from V1:
>   rename some functions, only init ras error handler data for
>   supported asic.
> 
> Changed from V2:
>   fix potential memory leak.
> 
> Signed-off-by: Stanley.Yang 
> Change-Id: Ia0ad9453ac3ac929f95c73cbee5b7a8fc42a9816
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   3 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 173
> +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  30 +++-
>  3 files changed, 202 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 1df28b7bf22e..668ad0e35160 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2326,6 +2326,9 @@ static int amdgpu_device_ip_fini(struct
> amdgpu_device *adev)  {
>   int i, r;
> 
> + if (amdgpu_sriov_vf(adev) && adev->virt.ras_init_done)
> + amdgpu_virt_release_ras_err_handler_data(adev);
> +
>   amdgpu_ras_pre_fini(adev);
> 
>   if (adev->gmc.xgmi.num_physical_nodes > 1) diff --git
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index bab9286021a7..0891f27ba166 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -26,6 +26,7 @@
>  #include 
> 
>  #include "amdgpu.h"
> +#include "amdgpu_ras.h"
> 
>  bool amdgpu_virt_mmio_blocked(struct amdgpu_device *adev)  { @@ -
> 255,12 +256,171 @@ int amdgpu_virt_fw_reserve_get_checksum(void *obj,
>   return ret;
>  }
> 
> +static int amdgpu_virt_init_ras_err_handler_data(struct amdgpu_device
> +*adev) {
> + struct amdgpu_virt *virt = >virt;
> + struct amdgpu_virt_ras_err_handler_data **data = 
> >virt_eh_data;
> + /* GPU will be marked bad on host if bp count more then 10,
> +  * so alloc 512 is enough.
> +  */
> + unsigned int align_space = 512;
> + void *bps = NULL;
> + struct amdgpu_bo **bps_bo = NULL;
> +
> + *data = kmalloc(sizeof(struct amdgpu_virt_ras_err_handler_data),
> GFP_KERNEL);
> + if (!*data)
> + return -ENOMEM;
> +
> + bps = kmalloc(align_space * sizeof((*data)->bps), GFP_KERNEL);
> + bps_bo = kmalloc(align_space * sizeof((*data)->bps_bo),
> GFP_KERNEL);
> +
> + if (!bps || !bps_bo) {
> + kfree(bps);
> + kfree(bps_bo);
> + kfree(*data);
> + return -ENOMEM;
> + }
> +
> + (*data)->bps = bps;
> + (*data)->bps_bo = bps_bo;
> + (*data)->count = 0;
> + (*data)->last_reserved = 0;
> +
> + virt->ras_init_done = true;
> +
> + return 0;
> +}
> +
> +static void amdgpu_virt_ras_release_bp(struct amdgpu_device *adev) {
> + struct amdgpu_virt *virt = >virt;
> + struct amdgpu_virt_ras_err_handler_data *data = virt->virt_eh_data;
> + struct amdgpu_bo *bo;
> + int i;
> +
> + if (!data)
> + return;
> +
> + for (i = data->last_reserved - 1; i >= 0; i--) {
> + bo = data->bps_bo[i];
> + amdgpu_bo_free_kernel(, NULL, NULL);
> + data->bps_bo[i] = bo;
> + data->last_reserved = i;
> + }
> +}
> +
> +void amdgpu_virt_release_ras_err_handler_data(struct amdgpu_device
> +*adev) {
> + struct amdgpu_virt *virt = >virt;
> + struct amdgpu_virt_ras_err_handler_data *data = virt->virt_eh_data;
> +
> + virt->ras_init_done = false;
> +
> + if (!data)
> + return;
> +
> + amdgpu_virt_ras_release_bp(adev);
> +
> + kfree(data->bps);
> + kfree(data->bps_bo);
> + kfree(data);
> + virt->virt_eh_data = NULL;
> +}
> +
> +static void amdgpu_virt_ras_add_bps(struct amdgpu_device *adev,
> + struct eeprom_table_record *bps, int pages) {
> + struct amdgpu_virt *virt = >virt;
> + struct amdgpu_virt_ras_err_handler_data *data = virt->virt_eh_data;
> +
> + if (!data)
> + return;
> +
> + memcpy(>bps[data->count], bps, pages * sizeof(*data->bps));
> +

RE: [PATCH V2] drm/amdgpu: support reserve bad page for virt

2020-06-04 Thread Yang, Stanley
[AMD Public Use]

Hi Tao,

Thanks for your suggestion; please see my replies inline.

Regards,
Stanley
> -Original Message-
> From: Zhou1, Tao 
> Sent: Friday, June 5, 2020 11:00 AM
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Chen, Guchun
> ; Liu, Monk ; Clements,
> John ; Li, Dennis ; Yang,
> Stanley 
> Subject: RE: [PATCH V2] drm/amdgpu: support reserve bad page for virt
>
> [AMD Public Use]
>
> > -Original Message-
> > From: Stanley.Yang 
> > Sent: June 4, 2020 20:36
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Zhang, Hawking ; Chen, Guchun
> > ; Liu, Monk ; Clements,
> John
> > ; Zhou1, Tao ; Li,
> Dennis
> > ; Yang, Stanley 
> > Subject: [PATCH V2] drm/amdgpu: support reserve bad page for virt
> >
> > Changed from V1:
> > rename same functions name, only init ras error handler data for
> > supported asic.
> >
> > Signed-off-by: Stanley.Yang 
> > Change-Id: Ia0ad9453ac3ac929f95c73cbee5b7a8fc42a9816
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   3 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 172
> > +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  30 +++-
> >  3 files changed, 201 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 1df28b7bf22e..668ad0e35160 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2326,6 +2326,9 @@ static int amdgpu_device_ip_fini(struct
> > amdgpu_device *adev)  {
> > int i, r;
> >
> > +  if (amdgpu_sriov_vf(adev) && adev->virt.ras_init_done)
> > +amdgpu_virt_release_ras_err_handler_data(adev);
> > +
> > amdgpu_ras_pre_fini(adev);
> >
> > if (adev->gmc.xgmi.num_physical_nodes > 1) diff --git
> > a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> > index bab9286021a7..174fcb8c8b57 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> > @@ -26,6 +26,7 @@
> >  #include 
> >
> >  #include "amdgpu.h"
> > +#include "amdgpu_ras.h"
> >
> >  bool amdgpu_virt_mmio_blocked(struct amdgpu_device *adev)  { @@ -
> > 255,12 +256,169 @@ int amdgpu_virt_fw_reserve_get_checksum(void
> *obj,
> > return ret;
> >  }
> >
> > +static int amdgpu_virt_init_ras_err_handler_data(struct amdgpu_device
> > +*adev) {
> > +  struct amdgpu_virt *virt = >virt;
> > +  struct amdgpu_virt_ras_err_handler_data **data = 
> > >virt_eh_data;
> > +  /* GPU will be marked bad on host if bp count more then 10,
> > +  * so alloc 512 is enough.
> > +  */
> > +  unsigned int align_space = 512;
> > +  void *bps = NULL;
> > +  struct amdgpu_bo **bps_bo = NULL;
> > +
> > +  *data = kmalloc(sizeof(struct amdgpu_virt_ras_err_handler_data),
> > GFP_KERNEL);
> > +  if (!*data)
> > +return -ENOMEM;
> > +
> > +  bps = kmalloc(align_space * sizeof((*data)->bps), GFP_KERNEL);
> > +  bps_bo = kmalloc(align_space * sizeof((*data)->bps_bo),
> > GFP_KERNEL);
> > +
> > +  if (!bps || !bps_bo) {
> > +kfree(bps);
> > +kfree(bps_bo);
> > +return -ENOMEM;
> > +  }
> > +
> > +  (*data)->bps = bps;
> > +  (*data)->bps_bo = bps_bo;
> > +  (*data)->count = 0;
> > +  (*data)->last_reserved = 0;
> > +
> > +  virt->ras_init_done = true;
> > +
> > +  return 0;
> > +}
> > +
> > +static void amdgpu_virt_ras_release_bp(struct amdgpu_device *adev) {
> > +  struct amdgpu_virt *virt = >virt;
> > +  struct amdgpu_virt_ras_err_handler_data *data = virt->virt_eh_data;
> > +  struct amdgpu_bo *bo;
> > +  int i;
> > +
> > +  if (!data)
> > +return;
> > +
> > +  for (i = data->last_reserved - 1; i >= 0; i--) {
> > +bo = data->bps_bo[i];
> > +amdgpu_bo_free_kernel(, NULL, NULL);
> > +data->bps_bo[i] = bo;
> >

RE: [PATCH V2] drm/amdgpu: support reserve bad page for virt

2020-06-04 Thread Yang, Stanley
[AMD Public Use]

Thanks Guchun,

I will fix the potential memory leak and the typo.
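
For reference, the error path in question can be sketched in stand-alone C as below. This is only an illustration under assumptions: the struct and function names are hypothetical, and malloc/free stand in for kmalloc/kfree. The point is that the container allocated first must also be freed when either array allocation fails.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the handler data allocated in the patch. */
struct eh_data {
	void *bps;
	void **bps_bo;
	int count;
	int last_reserved;
};

/*
 * Allocate the container and both arrays. On any failure, free everything
 * that was already allocated, including the container, and leave *out NULL.
 */
static int eh_data_init(struct eh_data **out, size_t entries, size_t bp_size)
{
	struct eh_data *data;

	*out = NULL;
	data = malloc(sizeof(*data));
	if (!data)
		return -ENOMEM;

	data->bps = malloc(entries * bp_size);
	data->bps_bo = malloc(entries * sizeof(*data->bps_bo));
	if (!data->bps || !data->bps_bo) {
		free(data->bps);	/* free(NULL) is a no-op */
		free(data->bps_bo);
		free(data);		/* the release missing in V2 */
		return -ENOMEM;
	}

	data->count = 0;
	data->last_reserved = 0;
	*out = data;
	return 0;
}

/* Release everything and reset the caller's pointer. */
static void eh_data_release(struct eh_data **out)
{
	if (!*out)
		return;
	free((*out)->bps);
	free((*out)->bps_bo);
	free(*out);
	*out = NULL;
}
```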

Regards,
Stanley
> -Original Message-
> From: Chen, Guchun 
> Sent: Friday, June 5, 2020 10:24 AM
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Liu, Monk
> ; Clements, John ; Zhou1,
> Tao ; Li, Dennis ; Yang,
> Stanley 
> Subject: RE: [PATCH V2] drm/amdgpu: support reserve bad page for virt
> 
> [AMD Public Use]
> 
> Please see my comments with prefix [Guchun].
> 
> Regards,
> Guchun
> 
> -Original Message-
> From: Stanley.Yang 
> Sent: Thursday, June 4, 2020 8:36 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Chen, Guchun
> ; Liu, Monk ; Clements,
> John ; Zhou1, Tao ; Li,
> Dennis ; Yang, Stanley 
> Subject: [PATCH V2] drm/amdgpu: support reserve bad page for virt
> 
> Changed from V1:
>   rename same functions name, only init ras error handler data for
>   supported asic.
> [Guchun] Is 'same' a typo? It should be 'some'.
> 
> 
> Signed-off-by: Stanley.Yang 
> Change-Id: Ia0ad9453ac3ac929f95c73cbee5b7a8fc42a9816
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   3 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 172
> +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  30 +++-
>  3 files changed, 201 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 1df28b7bf22e..668ad0e35160 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2326,6 +2326,9 @@ static int amdgpu_device_ip_fini(struct
> amdgpu_device *adev)  {
>   int i, r;
> 
> + if (amdgpu_sriov_vf(adev) && adev->virt.ras_init_done)
> + amdgpu_virt_release_ras_err_handler_data(adev);
> +
>   amdgpu_ras_pre_fini(adev);
> 
>   if (adev->gmc.xgmi.num_physical_nodes > 1) diff --git
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index bab9286021a7..174fcb8c8b57 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -26,6 +26,7 @@
>  #include 
> 
>  #include "amdgpu.h"
> +#include "amdgpu_ras.h"
> 
>  bool amdgpu_virt_mmio_blocked(struct amdgpu_device *adev)  { @@ -
> 255,12 +256,169 @@ int amdgpu_virt_fw_reserve_get_checksum(void *obj,
>   return ret;
>  }
> 
> +static int amdgpu_virt_init_ras_err_handler_data(struct amdgpu_device
> +*adev) {
> + struct amdgpu_virt *virt = >virt;
> + struct amdgpu_virt_ras_err_handler_data **data = 
> >virt_eh_data;
> + /* GPU will be marked bad on host if bp count more then 10,
> +  * so alloc 512 is enough.
> +  */
> + unsigned int align_space = 512;
> + void *bps = NULL;
> + struct amdgpu_bo **bps_bo = NULL;
> +
> + *data = kmalloc(sizeof(struct amdgpu_virt_ras_err_handler_data),
> GFP_KERNEL);
> + if (!*data)
> + return -ENOMEM;
> +
> + bps = kmalloc(align_space * sizeof((*data)->bps), GFP_KERNEL);
> + bps_bo = kmalloc(align_space * sizeof((*data)->bps_bo),
> GFP_KERNEL);
> +
> + if (!bps || !bps_bo) {
> + kfree(bps);
> + kfree(bps_bo);
> [Guchun] Shouldn't *data be released here as well, to prevent a memory leak?
> 
> 
> + return -ENOMEM;
> + }
> +
> + (*data)->bps = bps;
> + (*data)->bps_bo = bps_bo;
> + (*data)->count = 0;
> + (*data)->last_reserved = 0;
> +
> + virt->ras_init_done = true;
> +
> + return 0;
> +}
> +
> +static void amdgpu_virt_ras_release_bp(struct amdgpu_device *adev) {
> + struct amdgpu_virt *virt = >virt;
> + struct amdgpu_virt_ras_err_handler_data *data = virt->virt_eh_data;
> + struct amdgpu_bo *bo;
> + int i;
> +
> + if (!data)
> + return;
> +
> + for (i = data->last_reserved - 1; i >= 0; i--) {
> + bo = data->bps_bo[i];
> + amdgpu_bo_free_kernel(, NULL, NULL);
> + data->bps_bo[i] = bo;
> + data->last_reserved = i;
> + }
> +}
> +
> +void amdgpu_virt_release_ras_err_handler_data(struct amdgpu_device
> +*adev) {
> + struct amdgpu_virt *virt = >virt;
> + struct amdgpu_virt_ras_err_handler_data *data = virt->virt_eh_data;
> +
> + virt->ras_init_done = false;
> +
> + if (!data)
> + return;
> +
> + amdgpu_virt_ras_release_bp(adev);
> +
> + kfree(data->bps);
> + kfree(data->bps_bo);

RE: [PATCH] drm/amdgpu: support reserve bad page for virt

2020-06-04 Thread Yang, Stanley
[AMD Public Use]

Thanks Tao. Calling amdgpu_virt_init_err_handler_data once from
amdgpu_virt_add_bad_page is also an option; I will check whether it carries
any potential risk.
I will also distinguish the message printed when reserving a page fails from
the one used by bare-metal RAS.
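
A rough sketch of that lazy-init alternative, as stand-alone C under stated assumptions: the struct, the init_done flag and both function names are hypothetical, and calloc stands in for the kernel allocator. It does not address concurrent callers, which is one of the risks to check.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, reduced stand-in for the virt state in the driver. */
struct virt_state {
	void *eh_data;
	bool init_done;
	int init_calls;	/* only here to show that init runs once */
};

static int virt_init_eh_data(struct virt_state *virt)
{
	virt->init_calls++;
	virt->eh_data = calloc(1, 64);
	if (!virt->eh_data)
		return -ENOMEM;
	virt->init_done = true;
	return 0;
}

/* Initialise the handler data on first use instead of at device init. */
static int virt_add_bad_page(struct virt_state *virt)
{
	if (!virt->init_done) {
		int ret = virt_init_eh_data(virt);

		if (ret)
			return ret;
	}

	/* ... record and reserve the bad page here ... */
	return 0;
}
```

With this shape, no asic type check is needed at init time, because the data is only allocated on a device that actually reports a bad page.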

Regards,
Stanley

> -Original Message-
> From: Zhou1, Tao 
> Sent: Thursday, June 4, 2020 12:16 PM
> To: Yang, Stanley ; amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Chen, Guchun
> ; Liu, Monk ; Clements,
> John ; Li, Dennis ; Yang,
> Stanley 
> Subject: RE: [PATCH] drm/amdgpu: support reserve bad page for virt
> 
> [AMD Public Use]
> 
> Two comments inline
> 
> > -Original Message-
> > From: Stanley.Yang 
> > Sent: June 3, 2020 22:10
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Zhang, Hawking ; Chen, Guchun
> > ; Liu, Monk ; Clements,
> John
> > ; Zhou1, Tao ; Li,
> Dennis
> > ; Yang, Stanley 
> > Subject: [PATCH] drm/amdgpu: support reserve bad page for virt
> >
> > Signed-off-by: Stanley.Yang 
> > Change-Id: Ia0ad9453ac3ac929f95c73cbee5b7a8fc42a9816
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   7 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 164
> > +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  30 +++-
> >  3 files changed, 196 insertions(+), 5 deletions(-)  mode change
> > 100644 =>
> > 100755 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> >  mode change 100644 => 100755
> > drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index b633171281f8..e8986e007206 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2001,8 +2001,10 @@ static int amdgpu_device_ip_init(struct
> > amdgpu_device *adev)
> > }
> > }
> >
> > -   if (amdgpu_sriov_vf(adev))
> > +   if (amdgpu_sriov_vf(adev)) {
> > +   amdgpu_virt_init_err_handler_data(adev);
> 
> [Tao] It can also be called once in amdgpu_virt_add_bad_page to avoid the
> asic type check, but either way is OK with me.
> 
> > amdgpu_virt_init_data_exchange(adev);
> > +   }
> >
> > r = amdgpu_ib_pool_init(adev);
> > if (r) {
> > @@ -2306,6 +2308,9 @@ static int amdgpu_device_ip_fini(struct
> > amdgpu_device *adev)  {
> > int i, r;
> >
> > +   if (amdgpu_sriov_vf(adev))
> > +   amdgpu_release_virt_err_handler_data(adev);
> > +
> > amdgpu_ras_pre_fini(adev);
> >
> > if (adev->gmc.xgmi.num_physical_nodes > 1) diff --git
> > a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> > old mode 100644
> > new mode 100755
> > index f3b38c9e04ca..c1554562a2ce
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> > @@ -26,6 +26,7 @@
> >  #include 
> >
> >  #include "amdgpu.h"
> > +#include "amdgpu_ras.h"
> >
> >  bool amdgpu_virt_mmio_blocked(struct amdgpu_device *adev)  { @@ -
> > 255,12 +256,164 @@ int amdgpu_virt_fw_reserve_get_checksum(void
> *obj,
> > return ret;
> >  }
> >
> > +int amdgpu_virt_init_err_handler_data(struct amdgpu_device *adev) {
> > +   struct amdgpu_virt *virt = >virt;
> > +   struct virt_ras_err_handler_data **data = >virt_eh_data;
> > +   /* GPU will be marked bad on host if bp count more then 10,
> > +* so alloc 512 is enough.
> > +*/
> > +   unsigned int align_space = 512;
> > +   void *bps = NULL;
> > +   struct amdgpu_bo **bps_bo = NULL;
> > +
> > +   *data = kmalloc(sizeof(struct virt_ras_err_handler_data),
> > GFP_KERNEL);
> > +   if (!*data)
> > +   return -ENOMEM;
> > +
> > +   bps = kmalloc(align_space * sizeof((*data)->bps), GFP_KERNEL);
> > +   bps_bo = kmalloc(align_space * sizeof((*data)->bps_bo),
> > GFP_KERNEL);
> > +
> > +   if (!bps || !bps_bo) {
> > +   kfree(bps);
> > +   kfree(bps_bo);
> > +   return -ENOMEM;
> > +   }
> > +
> > +   (*data)->bps = bps;
> > +   (*data)->bps_bo = bps_bo;
> > +   (*data)->count = 0;
> > +   (*data)->last_reserved = 0;
> > +   return 0;
> > +}
> > +
> > +static void amdgpu_virt_release_bp(struct amdgpu_device *adev) {
> > +   struct amdgpu_virt *virt = >virt;
> > +   struct virt_ras_err_handle
