[PATCH 1/4] drm/amdgpu: add RAS page retirement functions for MCA

2022-10-21 Thread Tao Zhou
Define page retirement functions for MCA platform.

v2: remove page retirement handling from MCA poison handler,
let MCA notifier do page retirement.

v3: remove specific poison handler for MCA to simplify code.

Signed-off-by: Tao Zhou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 53 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h |  2 +
 2 files changed, 55 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index aad3c8b4c810..3c83129f4090 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -22,6 +22,59 @@
  */
 
 #include "amdgpu.h"
+#include "umc_v6_7.h"
+
+static int amdgpu_umc_convert_error_address(struct amdgpu_device *adev,
+   struct ras_err_data *err_data, uint64_t err_addr,
+   uint32_t ch_inst, uint32_t umc_inst)
+{
+   switch (adev->ip_versions[UMC_HWIP][0]) {
+   case IP_VERSION(6, 7, 0):
+   umc_v6_7_convert_error_address(adev,
+   err_data, err_addr, ch_inst, umc_inst);
+   break;
+   default:
+   dev_warn(adev->dev,
+"UMC address to Physical address translation is not supported\n");
+   return AMDGPU_RAS_FAIL;
+   }
+
+   return AMDGPU_RAS_SUCCESS;
+}
+
+int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
+   uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst)
+{
+   struct ras_err_data err_data = {0, 0, 0, NULL};
+   int ret = AMDGPU_RAS_FAIL;
+
+   err_data.err_addr =
+   kcalloc(adev->umc.max_ras_err_cnt_per_query,
+   sizeof(struct eeprom_table_record), GFP_KERNEL);
+   if (!err_data.err_addr) {
+   dev_warn(adev->dev,
+   "Failed to alloc memory for umc error record in MCA notifier!\n");
+   return AMDGPU_RAS_FAIL;
+   }
+
+   /*
+* Translate UMC channel address to Physical address
+*/
+   ret = amdgpu_umc_convert_error_address(adev, &err_data, err_addr,
+   ch_inst, umc_inst);
+   if (ret)
+   goto out;
+
+   if (amdgpu_bad_page_threshold != 0) {
+   amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
+   err_data.err_addr_cnt);
+   amdgpu_ras_save_bad_pages(adev);
+   }
+
+out:
+   kfree(err_data.err_addr);
+   return ret;
+}
 
 static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
void *ras_error_status,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
index 3629d8f292ef..659a10de29c9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
@@ -98,4 +98,6 @@ void amdgpu_umc_fill_error_record(struct ras_err_data *err_data,
 int amdgpu_umc_process_ras_data_cb(struct amdgpu_device *adev,
void *ras_error_status,
struct amdgpu_iv_entry *entry);
+int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
+   uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst);
 #endif
-- 
2.35.1
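
For readers following the control flow rather than the diff, the allocate /
translate / retire / free sequence of amdgpu_umc_page_retirement_mca can be
sketched in plain userspace C. This is only an illustration under stated
assumptions: every sketch_* name, the integer version check and the 4 KiB page
shift are hypothetical stand-ins, and calloc/free take the place of
kcalloc/kfree.

```c
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_OK   0
#define SKETCH_FAIL 1

/* Hypothetical stand-in for struct eeprom_table_record. */
struct sketch_record {
	uint64_t retired_page;
};

/* Hypothetical stand-in for struct ras_err_data. */
struct sketch_err_data {
	struct sketch_record *err_addr;
	unsigned int err_addr_cnt;
	unsigned int capacity;
};

/* Stand-in for the per-IP translation; only "version" 67 is supported. */
static int sketch_convert(int umc_version, struct sketch_err_data *data,
			  uint64_t err_addr)
{
	if (umc_version != 67 || data->err_addr_cnt >= data->capacity)
		return SKETCH_FAIL;

	/* Assumed 4 KiB pages; the real translation is hardware specific. */
	data->err_addr[data->err_addr_cnt++].retired_page = err_addr >> 12;
	return SKETCH_OK;
}

/* Mirrors the shape of the patch: one allocation, one exit label. */
static int sketch_page_retirement(int umc_version, unsigned int max_records,
				  uint64_t err_addr, int bad_page_threshold,
				  unsigned int *retired_total)
{
	struct sketch_err_data data = { NULL, 0, max_records };
	int ret;

	data.err_addr = calloc(max_records, sizeof(*data.err_addr));
	if (!data.err_addr)
		return SKETCH_FAIL;

	ret = sketch_convert(umc_version, &data, err_addr);
	if (ret)
		goto out;

	/* Records are only kept when the threshold is non-zero, as in the patch. */
	if (bad_page_threshold != 0)
		*retired_total += data.err_addr_cnt;

out:
	free(data.err_addr);
	return ret;
}
```

The single "out" label keeps the buffer release in one place, so the
unsupported-version path and the success path both free the record array.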



RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for MCA

2022-10-20 Thread Zhou1, Tao
[AMD Official Use Only - General]



> -Original Message-
> From: Zhang, Hawking 
> Sent: Friday, October 21, 2022 12:15 PM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org; Yang,
> Stanley ; Chai, Thomas ; Li,
> Candice 
> Subject: RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for
> MCA
> 
> [AMD Official Use Only - General]
> 
> Re - whether need to do gpu reset is determined by unmap queue status, so
> reset parameter can't be dropped
> 
> + if (adev->gmc.xgmi.connected_to_cpu) {
> + ret = amdgpu_umc_poison_handler_mca(adev,
> ras_error_status, reset);
> 
> I think amdgpu_umc_poison_handler_mca is fallback handler specific for MCA
> platform, right?
> 
> I noticed there is platform check already in amdgpu_umc_poison_handler with
> the reset flag. so driver already knows whether the reset is needed, right.
> That's why I think "ras_error_status", "reset" are all not necessary. You can
> either call the followings directly by checking connected_to_cpu && reset,
> 
> + kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
> + amdgpu_ras_reset_gpu(adev);
> 
> Or still provide a wrapper like amdgpu_umc_poison_handler_mca for above two
> calls.
> 
> The latter seems redundant as well. I mean, we don't need to maintain a 
> specific
> API for poison handling fallback on MCA platform - Aldebaran is the last SOC
> that supports this A + A RAS design. I can confirm we'll move to a new design
> going forward.
> 
> Regards,
> Hawking

[Tao] I added amdgpu_umc_poison_handler_mca to make later extension easier,
although it only triggers a GPU reset right now. But since the A + A RAS design
will change dramatically, I'll remove amdgpu_umc_poison_handler_mca as you
suggested.

> 
> -Original Message-
> From: Zhou1, Tao 
> Sent: Friday, October 21, 2022 10:54
> To: Zhang, Hawking ; amd-
> g...@lists.freedesktop.org; Yang, Stanley ; Chai,
> Thomas ; Li, Candice 
> Subject: RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for
> MCA
> 
> [AMD Official Use Only - General]
> 
> 
> > -----Original Message-----
> > From: Zhang, Hawking 
> > Sent: Thursday, October 20, 2022 5:13 PM
> > To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org;
> > Yang, Stanley ; Chai, Thomas
> > ; Li, Candice 
> > Subject: RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions
> > for MCA
> >
> > [AMD Official Use Only - General]
> >
> > +static int amdgpu_umc_poison_handler_mca(struct amdgpu_device *adev,
> > +   struct ras_err_data *err_data, bool reset)
> >
> >
> > +   if (adev->gmc.xgmi.connected_to_cpu) {
> > +   ret = amdgpu_umc_poison_handler_mca(adev,
> > ras_error_status, reset);
> >
> > The input parameters "reset" and "err_data" can be dropped since
> > amdgpu_umc_poison_handler_mca is dedicated to MCA platform
> 
> [Tao] whether need to do gpu reset is determined by unmap queue status, so
> reset parameter can't be dropped.
> For "err_data", it can be removed currently, but I'm afraid we may need it on
> other ASICs in the future.
> 
> >
> > Regards,
> > Hawking
> >
> > -Original Message-
> > From: Zhou1, Tao 
> > Sent: Wednesday, October 19, 2022 16:12
> > To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> > ; Yang, Stanley ; Chai,
> > Thomas ; Li, Candice 
> > Cc: Zhou1, Tao 
> > Subject: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for
> > MCA
> >
> > Define page retirement functions for MCA platform.
> >
> > v2: remove page retirement handling from MCA poison handler,
> > let MCA notifier do page retirement.
> >
> > Signed-off-by: Tao Zhou 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 67
> > +
> drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> > |  2 +
> >  2 files changed, 69 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > index aad3c8b4c810..9494fa14db9a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > @@ -22,6 +22,73 @@
> >   */
> >
> >  #include "amdgpu.h"
> > +#include "umc_v6_7.h"
> > +
> > +static int amdgpu_umc_convert_error_address(struct amdgpu_device *adev,
> > +   struct ras_err_data *err_data, uint64_t
> > err_addr,
> > +   

RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for MCA

2022-10-20 Thread Zhang, Hawking
[AMD Official Use Only - General]

Re: whether a GPU reset is needed is determined by the unmap queue status, so
the reset parameter can't be dropped

+   if (adev->gmc.xgmi.connected_to_cpu) {
+   ret = amdgpu_umc_poison_handler_mca(adev, ras_error_status, reset);

I think amdgpu_umc_poison_handler_mca is a fallback handler specific to the MCA
platform, right?

I noticed there is already a platform check in amdgpu_umc_poison_handler, next
to the reset flag, so the driver already knows whether a reset is needed, right?
That's why I think "ras_error_status" and "reset" are both unnecessary. You can
either call the following directly after checking connected_to_cpu && reset,

+   kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
+   amdgpu_ras_reset_gpu(adev);

Or still provide a wrapper like amdgpu_umc_poison_handler_mca for the above two
calls.

The latter seems redundant as well. I mean, we don't need to maintain a
specific API for the poison handling fallback on the MCA platform - Aldebaran
is the last SoC that supports this A + A RAS design. I can confirm we'll move
to a new design going forward.
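
To make the first option concrete, here is a minimal sketch of the inline check
without an MCA-specific wrapper. It is an illustration only: struct sketch_adev
and the sketch_* functions are hypothetical stand-ins for struct amdgpu_device,
kgd2kfd_set_sram_ecc_flag() and amdgpu_ras_reset_gpu(), and the counters exist
just so the behaviour can be observed outside the kernel.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the fields of struct amdgpu_device used here. */
struct sketch_adev {
	bool connected_to_cpu;  /* models adev->gmc.xgmi.connected_to_cpu */
	int ecc_flag_calls;     /* counts the kgd2kfd_set_sram_ecc_flag() stand-in */
	int reset_calls;        /* counts the amdgpu_ras_reset_gpu() stand-in */
};

static void sketch_set_sram_ecc_flag(struct sketch_adev *adev)
{
	adev->ecc_flag_calls++;
}

static void sketch_ras_reset_gpu(struct sketch_adev *adev)
{
	adev->reset_calls++;
}

/* Option one from the mail: no dedicated wrapper, only an inline check. */
static void sketch_poison_fallback(struct sketch_adev *adev, bool reset)
{
	if (adev->connected_to_cpu && reset) {
		sketch_set_sram_ecc_flag(adev);
		sketch_ras_reset_gpu(adev);
	}
}
```

With this shape neither "ras_error_status" nor a separate handler is needed on
the MCA path; the caller's existing reset decision is the only input.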

Regards,
Hawking

-----Original Message-----
From: Zhou1, Tao  
Sent: Friday, October 21, 2022 10:54
To: Zhang, Hawking ; amd-gfx@lists.freedesktop.org; 
Yang, Stanley ; Chai, Thomas ; Li, 
Candice 
Subject: RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for MCA

[AMD Official Use Only - General]


> -Original Message-
> From: Zhang, Hawking 
> Sent: Thursday, October 20, 2022 5:13 PM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org; 
> Yang, Stanley ; Chai, Thomas 
> ; Li, Candice 
> Subject: RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions 
> for MCA
> 
> [AMD Official Use Only - General]
> 
> +static int amdgpu_umc_poison_handler_mca(struct amdgpu_device *adev,
> + struct ras_err_data *err_data, bool reset)
> 
> 
> + if (adev->gmc.xgmi.connected_to_cpu) {
> + ret = amdgpu_umc_poison_handler_mca(adev,
> ras_error_status, reset);
> 
> The input parameters "reset" and "err_data" can be dropped since 
> amdgpu_umc_poison_handler_mca is dedicated to MCA platform

[Tao] Whether a GPU reset is needed is determined by the unmap queue status, so
the reset parameter can't be dropped.
As for "err_data", it can be removed for now, but I'm afraid we may need it on
other ASICs in the future.

> 
> Regards,
> Hawking
> 
> -Original Message-
> From: Zhou1, Tao 
> Sent: Wednesday, October 19, 2022 16:12
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking 
> ; Yang, Stanley ; Chai, 
> Thomas ; Li, Candice 
> Cc: Zhou1, Tao 
> Subject: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for 
> MCA
> 
> Define page retirement functions for MCA platform.
> 
> v2: remove page retirement handling from MCA poison handler,
> let MCA notifier do page retirement.
> 
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 67
> +  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> |  2 +
>  2 files changed, 69 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index aad3c8b4c810..9494fa14db9a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -22,6 +22,73 @@
>   */
> 
>  #include "amdgpu.h"
> +#include "umc_v6_7.h"
> +
> +static int amdgpu_umc_convert_error_address(struct amdgpu_device *adev,
> + struct ras_err_data *err_data, uint64_t
> err_addr,
> + uint32_t ch_inst, uint32_t umc_inst) {
> + switch (adev->ip_versions[UMC_HWIP][0]) {
> + case IP_VERSION(6, 7, 0):
> + umc_v6_7_convert_error_address(adev,
> + err_data, err_addr, ch_inst, umc_inst);
> + break;
> + default:
> + dev_warn(adev->dev,
> +  "UMC address to Physical address translation is not
> supported\n");
> + return AMDGPU_RAS_FAIL;
> + }
> +
> + return AMDGPU_RAS_SUCCESS;
> +}
> +
> +int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
> + uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst)
> {
> + struct ras_err_data err_data = {0, 0, 0, NULL};
> + int ret = AMDGPU_RAS_FAIL;
> +
> + err_data.err_addr =
> + kcalloc(adev->umc.max_ras_err_cnt_per_query,
> + sizeof(struct eeprom_table_record), GFP_KERNEL);
> + if (!err_data.err_addr) {
> + dev_warn(adev->dev,
>

RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for MCA

2022-10-20 Thread Zhou1, Tao
[AMD Official Use Only - General]


> -Original Message-
> From: Zhang, Hawking 
> Sent: Thursday, October 20, 2022 5:13 PM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org; Yang,
> Stanley ; Chai, Thomas ; Li,
> Candice 
> Subject: RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for
> MCA
> 
> [AMD Official Use Only - General]
> 
> +static int amdgpu_umc_poison_handler_mca(struct amdgpu_device *adev,
> + struct ras_err_data *err_data, bool reset)
> 
> 
> + if (adev->gmc.xgmi.connected_to_cpu) {
> + ret = amdgpu_umc_poison_handler_mca(adev,
> ras_error_status, reset);
> 
> The input parameters "reset" and "err_data" can be dropped since
> amdgpu_umc_poison_handler_mca is dedicated to MCA platform

[Tao] Whether a GPU reset is needed is determined by the unmap queue status, so
the reset parameter can't be dropped.
As for "err_data", it can be removed for now, but I'm afraid we may need it on
other ASICs in the future.

> 
> Regards,
> Hawking
> 
> -Original Message-
> From: Zhou1, Tao 
> Sent: Wednesday, October 19, 2022 16:12
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking
> ; Yang, Stanley ; Chai,
> Thomas ; Li, Candice 
> Cc: Zhou1, Tao 
> Subject: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for MCA
> 
> Define page retirement functions for MCA platform.
> 
> v2: remove page retirement handling from MCA poison handler,
> let MCA notifier do page retirement.
> 
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 67
> +  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> |  2 +
>  2 files changed, 69 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index aad3c8b4c810..9494fa14db9a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -22,6 +22,73 @@
>   */
> 
>  #include "amdgpu.h"
> +#include "umc_v6_7.h"
> +
> +static int amdgpu_umc_convert_error_address(struct amdgpu_device *adev,
> + struct ras_err_data *err_data, uint64_t
> err_addr,
> + uint32_t ch_inst, uint32_t umc_inst) {
> + switch (adev->ip_versions[UMC_HWIP][0]) {
> + case IP_VERSION(6, 7, 0):
> + umc_v6_7_convert_error_address(adev,
> + err_data, err_addr, ch_inst, umc_inst);
> + break;
> + default:
> + dev_warn(adev->dev,
> +  "UMC address to Physical address translation is not
> supported\n");
> + return AMDGPU_RAS_FAIL;
> + }
> +
> + return AMDGPU_RAS_SUCCESS;
> +}
> +
> +int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
> + uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst)
> {
> + struct ras_err_data err_data = {0, 0, 0, NULL};
> + int ret = AMDGPU_RAS_FAIL;
> +
> + err_data.err_addr =
> + kcalloc(adev->umc.max_ras_err_cnt_per_query,
> + sizeof(struct eeprom_table_record), GFP_KERNEL);
> + if (!err_data.err_addr) {
> + dev_warn(adev->dev,
> + "Failed to alloc memory for umc error record in MCA
> notifier!\n");
> + return AMDGPU_RAS_FAIL;
> + }
> +
> + /*
> +  * Translate UMC channel address to Physical address
> +  */
> + ret = amdgpu_umc_convert_error_address(adev, &err_data, err_addr,
> + ch_inst, umc_inst);
> + if (ret)
> + goto out;
> +
> + if (amdgpu_bad_page_threshold != 0) {
> + amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
> + err_data.err_addr_cnt);
> + amdgpu_ras_save_bad_pages(adev);
> + }
> +
> +out:
> + kfree(err_data.err_addr);
> + return ret;
> +}
> +
> +static int amdgpu_umc_poison_handler_mca(struct amdgpu_device *adev,
> + struct ras_err_data *err_data, bool reset) {
> + /* MCA poison handler is only responsible for GPU reset,
> +  * let MCA notifier do page retirement.
> +  */
> + if (reset) {
> + kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
> + amdgpu_ras_reset_gpu(adev);
> + }
> +
> + return AMDGPU_RAS_SUCCESS;
> +}
> 
>  static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
>   void *ras_error_status,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h

RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for MCA

2022-10-20 Thread Zhang, Hawking
[AMD Official Use Only - General]

+static int amdgpu_umc_poison_handler_mca(struct amdgpu_device *adev,
+   struct ras_err_data *err_data, bool reset)


+   if (adev->gmc.xgmi.connected_to_cpu) {
+   ret = amdgpu_umc_poison_handler_mca(adev, ras_error_status, reset);

The input parameters "reset" and "err_data" can be dropped since
amdgpu_umc_poison_handler_mca is dedicated to the MCA platform.

Regards,
Hawking

-----Original Message-----
From: Zhou1, Tao  
Sent: Wednesday, October 19, 2022 16:12
To: amd-gfx@lists.freedesktop.org; Zhang, Hawking ; 
Yang, Stanley ; Chai, Thomas ; Li, 
Candice 
Cc: Zhou1, Tao 
Subject: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for MCA

Define page retirement functions for MCA platform.

v2: remove page retirement handling from MCA poison handler,
let MCA notifier do page retirement.

Signed-off-by: Tao Zhou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 67 +  
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h |  2 +
 2 files changed, 69 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index aad3c8b4c810..9494fa14db9a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -22,6 +22,73 @@
  */
 
 #include "amdgpu.h"
+#include "umc_v6_7.h"
+
+static int amdgpu_umc_convert_error_address(struct amdgpu_device *adev,
+   struct ras_err_data *err_data, uint64_t 
err_addr,
+   uint32_t ch_inst, uint32_t umc_inst) {
+   switch (adev->ip_versions[UMC_HWIP][0]) {
+   case IP_VERSION(6, 7, 0):
+   umc_v6_7_convert_error_address(adev,
+   err_data, err_addr, ch_inst, umc_inst);
+   break;
+   default:
+   dev_warn(adev->dev,
+"UMC address to Physical address translation is not 
supported\n");
+   return AMDGPU_RAS_FAIL;
+   }
+
+   return AMDGPU_RAS_SUCCESS;
+}
+
+int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
+   uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst) 
{
+   struct ras_err_data err_data = {0, 0, 0, NULL};
+   int ret = AMDGPU_RAS_FAIL;
+
+   err_data.err_addr =
+   kcalloc(adev->umc.max_ras_err_cnt_per_query,
+   sizeof(struct eeprom_table_record), GFP_KERNEL);
+   if (!err_data.err_addr) {
+   dev_warn(adev->dev,
+   "Failed to alloc memory for umc error record in MCA 
notifier!\n");
+   return AMDGPU_RAS_FAIL;
+   }
+
+   /*
+* Translate UMC channel address to Physical address
+*/
+   ret = amdgpu_umc_convert_error_address(adev, &err_data, err_addr,
+   ch_inst, umc_inst);
+   if (ret)
+   goto out;
+
+   if (amdgpu_bad_page_threshold != 0) {
+   amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
+   err_data.err_addr_cnt);
+   amdgpu_ras_save_bad_pages(adev);
+   }
+
+out:
+   kfree(err_data.err_addr);
+   return ret;
+}
+
+static int amdgpu_umc_poison_handler_mca(struct amdgpu_device *adev,
+   struct ras_err_data *err_data, bool reset) {
+   /* MCA poison handler is only responsible for GPU reset,
+* let MCA notifier do page retirement.
+*/
+   if (reset) {
+   kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
+   amdgpu_ras_reset_gpu(adev);
+   }
+
+   return AMDGPU_RAS_SUCCESS;
+}
 
 static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
void *ras_error_status,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
index 3629d8f292ef..659a10de29c9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
@@ -98,4 +98,6 @@ void amdgpu_umc_fill_error_record(struct ras_err_data 
*err_data,  int amdgpu_umc_process_ras_data_cb(struct amdgpu_device *adev,
void *ras_error_status,
struct amdgpu_iv_entry *entry);
+int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
+   uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst);
 #endif
--
2.35.1


[PATCH 1/4] drm/amdgpu: add RAS page retirement functions for MCA

2022-10-19 Thread Tao Zhou
Define page retirement functions for MCA platform.

v2: remove page retirement handling from MCA poison handler,
let MCA notifier do page retirement.

Signed-off-by: Tao Zhou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 67 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h |  2 +
 2 files changed, 69 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index aad3c8b4c810..9494fa14db9a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -22,6 +22,73 @@
  */
 
 #include "amdgpu.h"
+#include "umc_v6_7.h"
+
+static int amdgpu_umc_convert_error_address(struct amdgpu_device *adev,
+   struct ras_err_data *err_data, uint64_t err_addr,
+   uint32_t ch_inst, uint32_t umc_inst)
+{
+   switch (adev->ip_versions[UMC_HWIP][0]) {
+   case IP_VERSION(6, 7, 0):
+   umc_v6_7_convert_error_address(adev,
+   err_data, err_addr, ch_inst, umc_inst);
+   break;
+   default:
+   dev_warn(adev->dev,
+"UMC address to Physical address translation is not supported\n");
+   return AMDGPU_RAS_FAIL;
+   }
+
+   return AMDGPU_RAS_SUCCESS;
+}
+
+int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
+   uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst)
+{
+   struct ras_err_data err_data = {0, 0, 0, NULL};
+   int ret = AMDGPU_RAS_FAIL;
+
+   err_data.err_addr =
+   kcalloc(adev->umc.max_ras_err_cnt_per_query,
+   sizeof(struct eeprom_table_record), GFP_KERNEL);
+   if (!err_data.err_addr) {
+   dev_warn(adev->dev,
+   "Failed to alloc memory for umc error record in MCA notifier!\n");
+   return AMDGPU_RAS_FAIL;
+   }
+
+   /*
+* Translate UMC channel address to Physical address
+*/
+   ret = amdgpu_umc_convert_error_address(adev, &err_data, err_addr,
+   ch_inst, umc_inst);
+   if (ret)
+   goto out;
+
+   if (amdgpu_bad_page_threshold != 0) {
+   amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
+   err_data.err_addr_cnt);
+   amdgpu_ras_save_bad_pages(adev);
+   }
+
+out:
+   kfree(err_data.err_addr);
+   return ret;
+}
+
+static int amdgpu_umc_poison_handler_mca(struct amdgpu_device *adev,
+   struct ras_err_data *err_data, bool reset)
+{
+   /* MCA poison handler is only responsible for GPU reset,
+* let MCA notifier do page retirement.
+*/
+   if (reset) {
+   kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
+   amdgpu_ras_reset_gpu(adev);
+   }
+
+   return AMDGPU_RAS_SUCCESS;
+}
 
 static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
void *ras_error_status,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
index 3629d8f292ef..659a10de29c9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
@@ -98,4 +98,6 @@ void amdgpu_umc_fill_error_record(struct ras_err_data *err_data,
 int amdgpu_umc_process_ras_data_cb(struct amdgpu_device *adev,
void *ras_error_status,
struct amdgpu_iv_entry *entry);
+int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
+   uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst);
 #endif
-- 
2.35.1



RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for MCA

2022-10-18 Thread Zhang, Hawking
[AMD Official Use Only - General]

Thinking about it more, I'm leaning toward *not* touching MCA_STATUS in the
poison consumption handler on the MCA (A + A) platform.

As discussed earlier, asynchronous access from both the MCA driver and the GPU
driver risks breaking the message queue of bad addresses because of the nature
of MCA.

We shall keep the GPU driver away from any MCA operation. When poison is
consumed, the GPU driver is *only* responsible for unmap_queue, or mode-2
recovery as a fallback, and it only monitors the MCA notifier for bad page
address retirement. I guess we don't want to trigger an MCA reset and let the
MCA notifier report 0.
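
The split of responsibilities described above can be sketched as two
independent entry points. This is an assumption-laden illustration, not driver
code: struct sketch_gpu, both sketch_on_* functions and the fixed-size bad page
array are hypothetical names introduced only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_BAD_PAGES 8

/* Hypothetical per-device state, just enough to observe both paths. */
struct sketch_gpu {
	uint64_t bad_pages[SKETCH_MAX_BAD_PAGES];
	unsigned int bad_page_cnt;
	int mode2_resets;
};

/*
 * Poison consumption path: recover the queue and nothing else.
 * It never reads MCA state and never records a bad page.
 */
static void sketch_on_poison_consumed(struct sketch_gpu *gpu, bool unmap_ok)
{
	if (!unmap_ok)
		gpu->mode2_resets++; /* fallback recovery */
}

/* MCA notifier path: the only place where a bad page gets recorded. */
static bool sketch_on_mca_notify(struct sketch_gpu *gpu, uint64_t bad_page)
{
	if (gpu->bad_page_cnt >= SKETCH_MAX_BAD_PAGES)
		return false;

	gpu->bad_pages[gpu->bad_page_cnt++] = bad_page;
	return true;
}
```

Because only the notifier path writes the bad page list, the two paths cannot
race over who reports an address, which is the concern raised in the mail.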

Regards,
Hawking

-----Original Message-----
From: Zhou1, Tao  
Sent: Tuesday, October 18, 2022 16:32
To: amd-gfx@lists.freedesktop.org; Zhang, Hawking ; 
Yang, Stanley ; Chai, Thomas ; Li, 
Candice 
Cc: Zhou1, Tao 
Subject: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for MCA

Define page retirement functions for MCA platform.

Signed-off-by: Tao Zhou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 112 
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h |   2 +
 drivers/gpu/drm/amd/amdgpu/umc_v6_7.c   |   2 +-
 drivers/gpu/drm/amd/amdgpu/umc_v6_7.h   |   2 +
 4 files changed, 117 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index aad3c8b4c810..e97b1bd343ee 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -22,6 +22,118 @@
  */
 
 #include "amdgpu.h"
+#include "umc_v6_7.h"
+
+static int amdgpu_umc_convert_error_address(struct amdgpu_device *adev,
+   struct ras_err_data *err_data, uint64_t 
err_addr,
+   uint32_t ch_inst, uint32_t umc_inst) {
+   switch (adev->ip_versions[UMC_HWIP][0]) {
+   case IP_VERSION(6, 7, 0):
+   umc_v6_7_convert_error_address(adev,
+   err_data, err_addr, ch_inst, umc_inst);
+   break;
+   default:
+   dev_warn(adev->dev,
+"UMC address to Physical address translation is not 
supported\n");
+   return AMDGPU_RAS_FAIL;
+   }
+
+   return AMDGPU_RAS_SUCCESS;
+}
+
+static int amdgpu_umc_ecc_info_query_error_address(struct amdgpu_device *adev,
+struct ras_err_data *err_data) {
+   switch (adev->ip_versions[UMC_HWIP][0]) {
+   case IP_VERSION(6, 7, 0):
+   umc_v6_7_ecc_info_query_ras_error_address(adev,
+   err_data);
+   break;
+   default:
+   dev_warn(adev->dev,
+"UMC error address query is not supported\n");
+   return AMDGPU_RAS_FAIL;
+   }
+
+   return AMDGPU_RAS_SUCCESS;
+}
+
+int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
+   uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst) 
{
+   struct ras_err_data err_data = {0, 0, 0, NULL};
+   int ret = AMDGPU_RAS_FAIL;
+
+   err_data.err_addr =
+   kcalloc(adev->umc.max_ras_err_cnt_per_query,
+   sizeof(struct eeprom_table_record), GFP_KERNEL);
+   if (!err_data.err_addr) {
+   dev_warn(adev->dev,
+   "Failed to alloc memory for umc error record in MCA 
notifier!\n");
+   return AMDGPU_RAS_FAIL;
+   }
+
+   /*
+* Translate UMC channel address to Physical address
+*/
+   ret = amdgpu_umc_convert_error_address(adev, &err_data, err_addr,
+   ch_inst, umc_inst);
+   if (ret)
+   goto out;
+
+   if (amdgpu_bad_page_threshold != 0) {
+   amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
+   err_data.err_addr_cnt);
+   amdgpu_ras_save_bad_pages(adev);
+   }
+
+out:
+   kfree(err_data.err_addr);
+   return ret;
+}
+
+static int amdgpu_umc_poison_handler_mca(struct amdgpu_device *adev,
+   struct ras_err_data *err_data, bool reset) {
+   int ret = AMDGPU_RAS_FAIL;
+
+   err_data->err_addr =
+   kcalloc(adev->umc.max_ras_err_cnt_per_query,
+   sizeof(struct eeprom_table_record), GFP_KERNEL);
+   if (!err_data->err_addr) {
+   dev_warn(adev->dev,
+   "Failed to alloc memory for MCA RAS poison handler!\n");
+   goto out2;
+   }
+
+   /*
+* Translate UMC channel address to Physical address
+*/
+   ret = amdgpu_umc_ecc_info_query_error_address(adev, err_data);
+   if (ret)
+   goto out1;
+
+   if (amdgpu_bad_page_threshold != 0) {
+   amdgpu_ras_add_ba

[PATCH 1/4] drm/amdgpu: add RAS page retirement functions for MCA

2022-10-18 Thread Tao Zhou
Define page retirement functions for MCA platform.

Signed-off-by: Tao Zhou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 112 
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h |   2 +
 drivers/gpu/drm/amd/amdgpu/umc_v6_7.c   |   2 +-
 drivers/gpu/drm/amd/amdgpu/umc_v6_7.h   |   2 +
 4 files changed, 117 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index aad3c8b4c810..e97b1bd343ee 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -22,6 +22,118 @@
  */
 
 #include "amdgpu.h"
+#include "umc_v6_7.h"
+
+static int amdgpu_umc_convert_error_address(struct amdgpu_device *adev,
+   struct ras_err_data *err_data, uint64_t err_addr,
+   uint32_t ch_inst, uint32_t umc_inst)
+{
+   switch (adev->ip_versions[UMC_HWIP][0]) {
+   case IP_VERSION(6, 7, 0):
+   umc_v6_7_convert_error_address(adev,
+   err_data, err_addr, ch_inst, umc_inst);
+   break;
+   default:
+   dev_warn(adev->dev,
+"UMC address to Physical address translation is not supported\n");
+   return AMDGPU_RAS_FAIL;
+   }
+
+   return AMDGPU_RAS_SUCCESS;
+}
+
+static int amdgpu_umc_ecc_info_query_error_address(struct amdgpu_device *adev,
+struct ras_err_data *err_data)
+{
+   switch (adev->ip_versions[UMC_HWIP][0]) {
+   case IP_VERSION(6, 7, 0):
+   umc_v6_7_ecc_info_query_ras_error_address(adev,
+   err_data);
+   break;
+   default:
+   dev_warn(adev->dev,
+"UMC error address query is not supported\n");
+   return AMDGPU_RAS_FAIL;
+   }
+
+   return AMDGPU_RAS_SUCCESS;
+}
+
+int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
+   uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst)
+{
+   struct ras_err_data err_data = {0, 0, 0, NULL};
+   int ret = AMDGPU_RAS_FAIL;
+
+   err_data.err_addr =
+   kcalloc(adev->umc.max_ras_err_cnt_per_query,
+   sizeof(struct eeprom_table_record), GFP_KERNEL);
+   if (!err_data.err_addr) {
+   dev_warn(adev->dev,
+   "Failed to alloc memory for umc error record in MCA notifier!\n");
+   return AMDGPU_RAS_FAIL;
+   }
+
+   /*
+* Translate UMC channel address to Physical address
+*/
+   ret = amdgpu_umc_convert_error_address(adev, &err_data, err_addr,
+   ch_inst, umc_inst);
+   if (ret)
+   goto out;
+
+   if (amdgpu_bad_page_threshold != 0) {
+   amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
+   err_data.err_addr_cnt);
+   amdgpu_ras_save_bad_pages(adev);
+   }
+
+out:
+   kfree(err_data.err_addr);
+   return ret;
+}
+
+static int amdgpu_umc_poison_handler_mca(struct amdgpu_device *adev,
+   struct ras_err_data *err_data, bool reset)
+{
+   int ret = AMDGPU_RAS_FAIL;
+
+   err_data->err_addr =
+   kcalloc(adev->umc.max_ras_err_cnt_per_query,
+   sizeof(struct eeprom_table_record), GFP_KERNEL);
+   if (!err_data->err_addr) {
+   dev_warn(adev->dev,
+   "Failed to alloc memory for MCA RAS poison handler!\n");
+   goto out2;
+   }
+
+   /*
+* Translate UMC channel address to Physical address
+*/
+   ret = amdgpu_umc_ecc_info_query_error_address(adev, err_data);
+   if (ret)
+   goto out1;
+
+   if (amdgpu_bad_page_threshold != 0) {
+   amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
+   err_data->err_addr_cnt);
+   amdgpu_ras_save_bad_pages(adev);
+   }
+
+out1:
+   kfree(err_data->err_addr);
+out2:
+   /* trigger gpu reset even error count is 0 for CPU MCA RAS,
+* MCA notifier is responsible for page retirement if error
+* count can't be queried in poison handler.
+*/
+   if (reset) {
+   kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
+   amdgpu_ras_reset_gpu(adev);
+   }
+
+   return ret;
+}
 
 static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
void *ras_error_status,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
index 3629d8f292ef..659a10de29c9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
@@ -98,4 +98,6 @@ void amdgpu_umc_fill_error_record(struct ras_err_data *err_data,
 int amdgpu_umc_process_ras_data_cb(struct amdgpu_device