RE: [PATCH] drm/amdgpu: disable GPU RAS bad page feature for specific ASIC

2024-09-10 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

That's not true; the feature on the GPU side is ASIC-specific, even for APUs.

Regards,
Tao

> -----Original Message-----
> From: Lazar, Lijo 
> Sent: Tuesday, September 10, 2024 9:58 PM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking 
> Subject: Re: [PATCH] drm/amdgpu: disable GPU RAS bad page feature for specific
> ASIC
>
>
> On second thought, this could be made more generic by just checking the APU
> flag; that holds true for any APU in general.
>
> Thanks,
> Lijo
>
> On 9/10/2024 7:24 PM, Lazar, Lijo wrote:
> >
> >
> > On 9/10/2024 2:07 PM, Tao Zhou wrote:
> >> The feature is not applicable to specific app platform.
> >>
> >> v2: update the disablement condition and commit description
> >> v3: move the setting to amdgpu_ras_check_supported
> >>
> >> Signed-off-by: Tao Zhou 
> >> Reviewed-by: Hawking Zhang 
> >
> > Reviewed-by: Lijo Lazar 
> >
> > Thanks,
> > Lijo
> >
> >> ---
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 +
> >>  1 file changed, 5 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> index dbfc41ddc3c7..ebe3e8f01fe2 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> >> @@ -3483,6 +3483,11 @@ static void amdgpu_ras_check_supported(struct
> >> amdgpu_device *adev)
> >>
> >>/* aca is disabled by default */
> >>adev->aca.is_enabled = false;
> >> +
> >> +  /* bad page feature is not applicable to specific app platform */
> >> +  if (adev->gmc.is_app_apu &&
> >> +  amdgpu_ip_version(adev, UMC_HWIP, 0) == IP_VERSION(12, 0, 0))
> >> +  amdgpu_bad_page_threshold = 0;
> >>  }
> >>
> >>  static void amdgpu_ras_counte_dw(struct work_struct *work)
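
The disagreement in this message comes down to two candidate conditions. A minimal sketch under a reduced model of the device state: `sketch_adev`, `SKETCH_IP_VERSION` and both helper functions are hypothetical names for illustration, not amdgpu API, and modelling the generic "APU flag" with the same `is_app_apu` field is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for IP_VERSION(); packs major/minor/revision. */
#define SKETCH_IP_VERSION(mj, mn, rv) \
	(((uint32_t)(mj) << 16) | ((uint32_t)(mn) << 8) | (uint32_t)(rv))

/* Hypothetical, reduced model of the device state the check looks at. */
struct sketch_adev {
	bool is_app_apu;      /* models adev->gmc.is_app_apu */
	uint32_t umc_version; /* models amdgpu_ip_version(adev, UMC_HWIP, 0) */
};

/* v3 condition: ASIC-specific, needs both the flag and UMC 12.0.0. */
static bool sketch_disable_bad_page_v3(const struct sketch_adev *adev)
{
	return adev->is_app_apu &&
	       adev->umc_version == SKETCH_IP_VERSION(12, 0, 0);
}

/* Generic alternative raised in review: look at the flag alone. */
static bool sketch_disable_bad_page_generic(const struct sketch_adev *adev)
{
	return adev->is_app_apu;
}
```

The two helpers differ only for a device that sets the flag but has a different UMC version, which is the case the reply above says must keep the feature.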


RE: [PATCH] drm/amdgpu: disable GPU RAS bad page feature for specific ASIC

2024-09-10 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Lazar, Lijo 
> Sent: Tuesday, September 10, 2024 1:21 PM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: disable GPU RAS bad page feature for specific
> ASIC
>
>
>
> On 9/10/2024 9:29 AM, Tao Zhou wrote:
> > The feature is not applicable to specific app platform.
> >
> > v2: update the disablement condition and commit description
> >
> > Signed-off-by: Tao Zhou 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 +
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index dbfc41ddc3c7..08efc9121adc 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -2055,6 +2055,11 @@ static int amdgpu_ras_fs_init(struct amdgpu_device
> *adev)
> > con->event_state_attr = dev_attr_event_state;
> > sysfs_attr_init(attrs[3]);
> >
> > +   /* bad page feature is not applicable to specific app platform */
> > +   if (adev->gmc.is_app_apu &&
> > +   amdgpu_ip_version(adev, UMC_HWIP, 0) == IP_VERSION(12, 0, 0))
> > +   amdgpu_bad_page_threshold = 0;
>
> I think sysfs file creation is not the right place to do this. It should
> probably be done much earlier, at a place where it says what features are
> supported for the SOC.
>
> Thanks,
> Lijo

[Tao] Thanks for your suggestion; I will update it in v3.

>
> > +
> > if (amdgpu_bad_page_threshold != 0) {
> > /* add bad_page_features entry */
> > bin_attr_gpu_vram_bad_pages.private = NULL;


RE: [PATCH] drm/amdkfd: Select reset method for poison handling

2024-09-06 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Tao Zhou 

> -----Original Message-----
> From: Hawking Zhang 
> Sent: Friday, September 6, 2024 4:13 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao 
> Cc: Zhang, Hawking 
> Subject: [PATCH] drm/amdkfd: Select reset method for poison handling
>
> Driver mode-2 is only supported by relatively new SMC firmware.
>
> Signed-off-by: Hawking Zhang 
> ---
>  .../gpu/drm/amd/amdkfd/kfd_int_process_v9.c   | 40 +++
>  1 file changed, 32 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> index fecdbbab9894..d46a13156ee9 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> @@ -167,11 +167,23 @@ static void
> event_interrupt_poison_consumption_v9(struct kfd_node *dev,
>   case SOC15_IH_CLIENTID_SE3SH:
>   case SOC15_IH_CLIENTID_UTCL2:
>   block = AMDGPU_RAS_BLOCK__GFX;
> - if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) ==
> IP_VERSION(9, 4, 3) ||
> - amdgpu_ip_version(dev->adev, GC_HWIP, 0) ==
> IP_VERSION(9, 4, 4))
> - reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> - else
> + if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) ==
> IP_VERSION(9, 4, 3)) {
> + /* driver mode-2 for gfx poison is only supported by
> +  * pmfw 0x00557300 and onwards */
> + if (dev->adev->pm.fw_version < 0x00557300)
> + reset =
> AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> + else
> + reset =
> AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + } else if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) ==
> IP_VERSION(9, 4, 4)) {
> + /* driver mode-2 for gfx poison is only supported by
> +  * pmfw 0x05550C00 and onwards */
> + if (dev->adev->pm.fw_version < 0x05550C00)
> + reset =
> AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> + else
> + reset =
> AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + } else {
>   reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + }
>   break;
>   case SOC15_IH_CLIENTID_VMC:
>   case SOC15_IH_CLIENTID_VMC1:
> @@ -184,11 +196,23 @@ static void
> event_interrupt_poison_consumption_v9(struct kfd_node *dev,
>   case SOC15_IH_CLIENTID_SDMA3:
>   case SOC15_IH_CLIENTID_SDMA4:
>   block = AMDGPU_RAS_BLOCK__SDMA;
> - if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) ==
> IP_VERSION(9, 4, 3) ||
> - amdgpu_ip_version(dev->adev, GC_HWIP, 0) ==
> IP_VERSION(9, 4, 4))
> - reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> - else
> + if (amdgpu_ip_version(dev->adev, SDMA0_HWIP, 0) ==
> IP_VERSION(4, 4, 2)) {
> + /* driver mode-2 for gfx poison is only supported by
> +  * pmfw 0x00557300 and onwards */
> + if (dev->adev->pm.fw_version < 0x00557300)
> + reset =
> AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> + else
> + reset =
> AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + } else if (amdgpu_ip_version(dev->adev, SDMA0_HWIP, 0) ==
> IP_VERSION(4, 4, 5)) {
> + /* driver mode-2 for gfx poison is only supported by
> +  * pmfw 0x05550C00 and onwards */
> + if (dev->adev->pm.fw_version < 0x05550C00)
> + reset =
> AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> + else
> + reset =
> AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + } else {
>   reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + }
>   break;
>   default:
>   dev_warn(dev->adev->dev,
> --
> 2.17.1
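
The selection logic in this patch can be restated as a small lookup: each affected GC version has a minimum PMFW version below which mode-1 reset is used. A self-contained sketch, with the version thresholds taken from the patch; the enum, macro and function names are hypothetical and not driver API.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the driver's reset-mode constants. */
enum sketch_reset { SKETCH_MODE1_RESET, SKETCH_MODE2_RESET };

/* Hypothetical stand-in for IP_VERSION(). */
#define SKETCH_IP_VERSION(mj, mn, rv) \
	(((uint32_t)(mj) << 16) | ((uint32_t)(mn) << 8) | (uint32_t)(rv))

/* Pick the reset mode for a GFX poison event from GC and PMFW versions. */
static enum sketch_reset sketch_select_gfx_reset(uint32_t gc_version,
						 uint32_t pmfw_version)
{
	uint32_t min_fw;

	if (gc_version == SKETCH_IP_VERSION(9, 4, 3))
		min_fw = 0x00557300; /* threshold quoted in the patch */
	else if (gc_version == SKETCH_IP_VERSION(9, 4, 4))
		min_fw = 0x05550C00; /* threshold quoted in the patch */
	else
		return SKETCH_MODE2_RESET; /* all other ASICs use mode-2 */

	/* Older firmware cannot do driver mode-2, so fall back to mode-1. */
	return pmfw_version < min_fw ? SKETCH_MODE1_RESET : SKETCH_MODE2_RESET;
}
```

The SDMA branch of the patch follows the same shape but keys on the SDMA IP version (4.4.2 and 4.4.5) instead of the GC version.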



RE: [PATCH] drm/amdgpu: add list empty check to avoid null pointer issue

2024-08-21 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Tao Zhou 

> -----Original Message-----
> From: Wang, Yang(Kevin) 
> Sent: Wednesday, August 21, 2024 2:57 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> 
> Subject: [PATCH] drm/amdgpu: add list empty check to avoid null pointer issue
>
> Add list empty check to avoid null pointer issues in some corner cases.
> - list_for_each_entry_safe()
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 10 ++
>  1 file changed, 10 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> index 929095a2e088..57bda66e85ef 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> @@ -80,6 +80,9 @@ static void aca_banks_release(struct aca_banks *banks)
>  {
>   struct aca_bank_node *node, *tmp;
>
> + if (list_empty(&banks->list))
> + return;
> +
>   list_for_each_entry_safe(node, tmp, &banks->list, node) {
>   list_del(&node->node);
>   kvfree(node);
> @@ -562,9 +565,13 @@ static void aca_error_fini(struct aca_error *aerr)
>   struct aca_bank_error *bank_error, *tmp;
>
>   mutex_lock(&aerr->lock);
> + if (list_empty(&aerr->list))
> + goto out_unlock;
> +
>   list_for_each_entry_safe(bank_error, tmp, &aerr->list, node)
>   aca_bank_error_remove(aerr, bank_error);
>
> +out_unlock:
>   mutex_destroy(&aerr->lock);
>  }
>
> @@ -680,6 +687,9 @@ static void aca_manager_fini(struct aca_handle_manager *mgr)
>  {
>   struct aca_handle *handle, *tmp;
>
> + if (list_empty(&mgr->list))
> + return;
> +
>   list_for_each_entry_safe(handle, tmp, &mgr->list, node)
>   amdgpu_aca_remove_handle(handle);
>  }
> --
> 2.34.1
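
The guard added in this patch is an early return before a delete-while-iterating loop. A user-space sketch of that pattern follows; it assumes a minimal circular doubly linked list written for this example (`sketch_list_head` and all helpers are hypothetical), not the kernel's `list.h`.

```c
#include <stddef.h>
#include <stdlib.h>

/* Minimal circular doubly linked list, written for this sketch only. */
struct sketch_list_head {
	struct sketch_list_head *next, *prev;
};

static void sketch_list_init(struct sketch_list_head *h)
{
	h->next = h;
	h->prev = h;
}

static int sketch_list_empty(const struct sketch_list_head *h)
{
	return h->next == h;
}

static void sketch_list_add(struct sketch_list_head *n,
			    struct sketch_list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

static void sketch_list_del(struct sketch_list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

struct sketch_bank_node {
	struct sketch_list_head node; /* first member, so a cast recovers the container */
	int payload;
};

/* Free every node on the list; returns how many nodes were freed. */
static int sketch_banks_release(struct sketch_list_head *banks)
{
	struct sketch_list_head *pos, *tmp;
	int freed = 0;

	/* Guard from the patch: nothing to do on an empty list. */
	if (sketch_list_empty(banks))
		return 0;

	/* "Safe" iteration: remember the successor before freeing pos. */
	for (pos = banks->next, tmp = pos->next; pos != banks;
	     pos = tmp, tmp = pos->next) {
		sketch_list_del(pos);
		free((struct sketch_bank_node *)pos);
		freed++;
	}
	return freed;
}
```

Both the guard and the loop rely on the head having been initialized first, which is why the sketch calls `sketch_list_init()` before any other operation.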



RE: [PATCH v2 1/3] drm/amdkfd: Check int source id for utcl2 poison event

2024-08-19 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

The series is:

Reviewed-by: Tao Zhou 

> -----Original Message-----
> From: Hawking Zhang 
> Sent: Tuesday, August 20, 2024 2:05 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao ; Yang,
> Stanley 
> Cc: Zhang, Hawking ; Fan, Shikang
> 
> Subject: [PATCH v2 1/3] drm/amdkfd: Check int source id for utcl2 poison event
>
> Traditional utcl2 fault_status polling does not work in SRIOV environment. The
> polling of fault status register from guest side will be dropped by hardware.
>
> Driver should switch to check utcl2 interrupt source id to identify utcl2
> poison event. It is set to 1 when poisoned data interrupts are signaled.
>
> v2: drop the unused local variable (Tao)
>
> Signed-off-by: Hawking Zhang 
> ---
>  .../gpu/drm/amd/amdkfd/kfd_int_process_v9.c| 18 +-
>  drivers/gpu/drm/amd/amdkfd/soc15_int.h |  1 +
>  2 files changed, 2 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> index a9c3580be8c9..fecdbbab9894 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> @@ -431,25 +431,9 @@ static void event_interrupt_wq_v9(struct kfd_node
> *dev,
>  client_id == SOC15_IH_CLIENTID_UTCL2) {
>   struct kfd_vm_fault_info info = {0};
>   uint16_t ring_id =
> SOC15_RING_ID_FROM_IH_ENTRY(ih_ring_entry);
> - uint32_t node_id =
> SOC15_NODEID_FROM_IH_ENTRY(ih_ring_entry);
> - uint32_t vmid_type =
> SOC15_VMID_TYPE_FROM_IH_ENTRY(ih_ring_entry);
> - int hub_inst = 0;
>   struct kfd_hsa_memory_exception_data exception_data;
>
> - /* gfxhub */
> - if (!vmid_type && dev->adev->gfx.funcs-
> >ih_node_to_logical_xcc) {
> - hub_inst = dev->adev->gfx.funcs-
> >ih_node_to_logical_xcc(dev->adev,
> - node_id);
> - if (hub_inst < 0)
> - hub_inst = 0;
> - }
> -
> - /* mmhub */
> - if (vmid_type && client_id == SOC15_IH_CLIENTID_VMC)
> - hub_inst = node_id / 4;
> -
> - if (amdgpu_amdkfd_ras_query_utcl2_poison_status(dev->adev,
> - hub_inst, vmid_type)) {
> + if (source_id == SOC15_INTSRC_VMC_UTCL2_POISON) {
>   event_interrupt_poison_consumption_v9(dev, pasid,
> client_id);
>   return;
>   }
> diff --git a/drivers/gpu/drm/amd/amdkfd/soc15_int.h
> b/drivers/gpu/drm/amd/amdkfd/soc15_int.h
> index 10138676f27f..e5c0205f2618 100644
> --- a/drivers/gpu/drm/amd/amdkfd/soc15_int.h
> +++ b/drivers/gpu/drm/amd/amdkfd/soc15_int.h
> @@ -29,6 +29,7 @@
>  #define SOC15_INTSRC_CP_BAD_OPCODE      183
>  #define SOC15_INTSRC_SQ_INTERRUPT_MSG   239
>  #define SOC15_INTSRC_VMC_FAULT          0
> +#define SOC15_INTSRC_VMC_UTCL2_POISON   1
>  #define SOC15_INTSRC_SDMA_TRAP          224
>  #define SOC15_INTSRC_SDMA_ECC           220
>  #define SOC21_INTSRC_SDMA_TRAP          49
> --
> 2.17.1
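
After this change the poison decision no longer needs a register read; it depends only on fields already present in the interrupt entry. A reduced sketch of that check, assuming a hypothetical `sketch_client` enum in place of the SOC15 client IDs; only the source ID values 0 and 1 come from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Source IDs as defined in the patch (names shortened for the sketch). */
#define SKETCH_INTSRC_VMC_FAULT        0
#define SKETCH_INTSRC_VMC_UTCL2_POISON 1

/* Hypothetical stand-in for the SOC15_IH_CLIENTID_* values. */
enum sketch_client {
	SKETCH_CLIENT_VMC,
	SKETCH_CLIENT_VMC1,
	SKETCH_CLIENT_UTCL2,
	SKETCH_CLIENT_OTHER,
};

/* True when the interrupt entry describes a utcl2 poison event. */
static bool sketch_is_utcl2_poison(enum sketch_client client_id,
				   uint16_t source_id)
{
	bool vm_client = client_id == SKETCH_CLIENT_VMC ||
			 client_id == SKETCH_CLIENT_VMC1 ||
			 client_id == SKETCH_CLIENT_UTCL2;

	/* No fault-status polling: the source ID alone identifies poison. */
	return vm_client && source_id == SKETCH_INTSRC_VMC_UTCL2_POISON;
}
```

Because the check uses only the interrupt entry, it behaves the same on a guest where register polling is dropped by hardware, which is the motivation given in the commit message.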



RE: [PATCH 1/3] drm/amdkfd: Check int source id for utcl2 poison event

2024-08-19 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Hawking Zhang 
> Sent: Monday, August 19, 2024 11:15 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao ; Yang,
> Stanley 
> Cc: Zhang, Hawking ; Fan, Shikang
> 
> Subject: [PATCH 1/3] drm/amdkfd: Check int source id for utcl2 poison event
>
> Traditional utcl2 fault_status polling does not work in SRIOV environment. The
> polling of fault status register from guest side will be dropped by hardware.
>
> Driver should switch to check utcl2 interrupt source id to identify utcl2
> poison event. It is set to 1 when poisoned data interrupts are signaled.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 3 +--
>  drivers/gpu/drm/amd/amdkfd/soc15_int.h  | 1 +
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> index a9c3580be8c9..1196dccbe6bc 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> @@ -448,8 +448,7 @@ static void event_interrupt_wq_v9(struct kfd_node *dev,
>   if (vmid_type && client_id == SOC15_IH_CLIENTID_VMC)
>   hub_inst = node_id / 4;
>
> - if (amdgpu_amdkfd_ras_query_utcl2_poison_status(dev->adev,
> - hub_inst, vmid_type)) {

[Tao] node_id, vmid_type and hub_inst can also be dropped.

> + if (source_id == SOC15_INTSRC_VMC_UTCL2_POISON) {
>   event_interrupt_poison_consumption_v9(dev, pasid,
> client_id);
>   return;
>   }
> diff --git a/drivers/gpu/drm/amd/amdkfd/soc15_int.h
> b/drivers/gpu/drm/amd/amdkfd/soc15_int.h
> index 10138676f27f..e5c0205f2618 100644
> --- a/drivers/gpu/drm/amd/amdkfd/soc15_int.h
> +++ b/drivers/gpu/drm/amd/amdkfd/soc15_int.h
> @@ -29,6 +29,7 @@
>  #define SOC15_INTSRC_CP_BAD_OPCODE      183
>  #define SOC15_INTSRC_SQ_INTERRUPT_MSG   239
>  #define SOC15_INTSRC_VMC_FAULT          0
> +#define SOC15_INTSRC_VMC_UTCL2_POISON   1
>  #define SOC15_INTSRC_SDMA_TRAP          224
>  #define SOC15_INTSRC_SDMA_ECC           220
>  #define SOC21_INTSRC_SDMA_TRAP          49
> --
> 2.17.1



RE: [PATCH] drm/amdgpu: Add debug option to enable mode2 for poison recovery

2024-08-11 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Tao Zhou 

> -----Original Message-----
> From: Hawking Zhang 
> Sent: Monday, August 12, 2024 11:26 AM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao 
> Cc: Zhang, Hawking 
> Subject: [PATCH] drm/amdgpu: Add debug option to enable mode2 for poison
> recovery
>
> Add debug option to enable mode2 for poison recovery for testing purposes only.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  6 ++
>  drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 16 ++--
>  3 files changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index e6b641cb362a..c34819f947ed 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1201,6 +1201,7 @@ struct amdgpu_device {
>   bool debug_disable_soft_recovery;
>   bool debug_use_vram_fw_buf;
>   bool debug_enable_ras_aca;
> + bool debug_mode2_for_poison_recovery;
>  };
>
>  static inline uint32_t amdgpu_ip_version(const struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index afe3b8bd35a1..be6b920933d6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -133,6 +133,7 @@ enum AMDGPU_DEBUG_MASK {
>   AMDGPU_DEBUG_DISABLE_GPU_SOFT_RECOVERY = BIT(2),
>   AMDGPU_DEBUG_USE_VRAM_FW_BUF = BIT(3),
>   AMDGPU_DEBUG_ENABLE_RAS_ACA = BIT(4),
> + AMDGPU_DEBUG_MODE2_FOR_POISON_RECOVERY = BIT(5),
>  };
>
>  unsigned int amdgpu_vram_limit = UINT_MAX;
> @@ -2229,6 +2230,11 @@ static void amdgpu_init_debug_options(struct amdgpu_device *adev)
>   pr_info("debug: enable RAS ACA\n");
>   adev->debug_enable_ras_aca = true;
>   }
> +
> + if (amdgpu_debug_mask &
> AMDGPU_DEBUG_MODE2_FOR_POISON_RECOVERY) {
> + pr_info("debug: enable mode2 reset for poison consumption
> recovery");
> + adev->debug_mode2_for_poison_recovery = true;
> + }
>  }
>
>  static unsigned long amdgpu_fix_asic_type(struct pci_dev *pdev, unsigned long flags)
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> index 816800555f7f..a355b2bc2214 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> @@ -164,10 +164,12 @@ static void
> event_interrupt_poison_consumption_v9(struct kfd_node *dev,
>   case SOC15_IH_CLIENTID_SE3SH:
>   case SOC15_IH_CLIENTID_UTCL2:
>   block = AMDGPU_RAS_BLOCK__GFX;
> - if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) ==
> IP_VERSION(9, 4, 3))
> - reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> - else
> + if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) ==
> IP_VERSION(9, 4, 3)) {
> + reset = ((dev->adev-
> >debug_mode2_for_poison_recovery) ?
> +  AMDGPU_RAS_GPU_RESET_MODE2_RESET :
> AMDGPU_RAS_GPU_RESET_MODE1_RESET);
> + } else {
>   reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + }
>   break;
>   case SOC15_IH_CLIENTID_VMC:
>   case SOC15_IH_CLIENTID_VMC1:
> @@ -180,10 +182,12 @@ static void
> event_interrupt_poison_consumption_v9(struct kfd_node *dev,
>   case SOC15_IH_CLIENTID_SDMA3:
>   case SOC15_IH_CLIENTID_SDMA4:
>   block = AMDGPU_RAS_BLOCK__SDMA;
> - if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) ==
> IP_VERSION(9, 4, 3))
> - reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> - else
> + if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) ==
> IP_VERSION(9, 4, 3)) {
> + reset = ((dev->adev-
> >debug_mode2_for_poison_recovery) ?
> +  AMDGPU_RAS_GPU_RESET_MODE2_RESET :
> AMDGPU_RAS_GPU_RESET_MODE1_RESET);
> + } else {
>   reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + }
>   break;
>   default:
>   dev_warn(dev->adev->dev,
> --
> 2.17.1
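
The debug option in this patch is a single bit in a mask that flips the reset mode for one GC version. A self-contained sketch of that flow; the bit position (5) and the GC 9.4.3 special case come from the patch, while the names and the boolean parameter are hypothetical simplifications.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's reset-mode constants. */
enum sketch_reset { SKETCH_MODE1_RESET, SKETCH_MODE2_RESET };

#define SKETCH_BIT(n) (1u << (n))

/* Same bit position as AMDGPU_DEBUG_MODE2_FOR_POISON_RECOVERY in the patch. */
#define SKETCH_DEBUG_MODE2_FOR_POISON_RECOVERY SKETCH_BIT(5)

/*
 * Choose the reset mode for a poison event.
 * is_gc_9_4_3 models the amdgpu_ip_version() comparison in the patch.
 */
static enum sketch_reset sketch_poison_reset(unsigned int debug_mask,
					     bool is_gc_9_4_3)
{
	bool force_mode2 =
		(debug_mask & SKETCH_DEBUG_MODE2_FOR_POISON_RECOVERY) != 0;

	if (is_gc_9_4_3)
		return force_mode2 ? SKETCH_MODE2_RESET : SKETCH_MODE1_RESET;

	/* Every other ASIC already uses mode-2 regardless of the mask. */
	return SKETCH_MODE2_RESET;
}
```

Bit 5 corresponds to the mask value 0x20; other bits in the mask do not affect this decision.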



RE: [PATCH] drm/amdgpu: Add debug option to enable mode2 for poison recovery

2024-08-11 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Hawking Zhang 
> Sent: Monday, August 12, 2024 11:26 AM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao 
> Cc: Zhang, Hawking 
> Subject: [PATCH] drm/amdgpu: Add debug option to enable mode2 for poison
> recovery
>
> Add debug option to enable mode2 for poison recovery for testing purposes only.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  6 ++
>  drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 16 ++--
>  3 files changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index e6b641cb362a..c34819f947ed 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1201,6 +1201,7 @@ struct amdgpu_device {
>   bool debug_disable_soft_recovery;
>   bool debug_use_vram_fw_buf;
>   bool debug_enable_ras_aca;
> + bool debug_mode2_for_poison_recovery;
>  };
>
>  static inline uint32_t amdgpu_ip_version(const struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index afe3b8bd35a1..be6b920933d6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -133,6 +133,7 @@ enum AMDGPU_DEBUG_MASK {
>   AMDGPU_DEBUG_DISABLE_GPU_SOFT_RECOVERY = BIT(2),
>   AMDGPU_DEBUG_USE_VRAM_FW_BUF = BIT(3),
>   AMDGPU_DEBUG_ENABLE_RAS_ACA = BIT(4),
> + AMDGPU_DEBUG_MODE2_FOR_POISON_RECOVERY = BIT(5),
>  };
>
>  unsigned int amdgpu_vram_limit = UINT_MAX;
> @@ -2229,6 +2230,11 @@ static void amdgpu_init_debug_options(struct amdgpu_device *adev)
>   pr_info("debug: enable RAS ACA\n");
>   adev->debug_enable_ras_aca = true;
>   }
> +
> + if (amdgpu_debug_mask &
> AMDGPU_DEBUG_MODE2_FOR_POISON_RECOVERY) {
> + pr_info("debug: enable mode2 reset for poison consumption
> recovery");
> + adev->debug_mode2_for_poison_recovery = true;
> + }
>  }
>
>  static unsigned long amdgpu_fix_asic_type(struct pci_dev *pdev, unsigned long flags)
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> index 816800555f7f..a355b2bc2214 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> @@ -164,10 +164,12 @@ static void
> event_interrupt_poison_consumption_v9(struct kfd_node *dev,
>   case SOC15_IH_CLIENTID_SE3SH:
>   case SOC15_IH_CLIENTID_UTCL2:
>   block = AMDGPU_RAS_BLOCK__GFX;
> - if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) ==
> IP_VERSION(9, 4, 3))
> - reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> - else
> + if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) ==
> IP_VERSION(9, 4, 3)) {
> + reset = ((dev->adev-
> >debug_mode2_for_poison_recovery) ?
> +  AMDGPU_RAS_GPU_RESET_MODE2_RESET :
> AMDGPU_RAS_GPU_RESET_MODE1_RESET);

[Tao] Can we apply the debug option to all ASICs?

> + } else {
>   reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + }
>   break;
>   case SOC15_IH_CLIENTID_VMC:
>   case SOC15_IH_CLIENTID_VMC1:
> @@ -180,10 +182,12 @@ static void
> event_interrupt_poison_consumption_v9(struct kfd_node *dev,
>   case SOC15_IH_CLIENTID_SDMA3:
>   case SOC15_IH_CLIENTID_SDMA4:
>   block = AMDGPU_RAS_BLOCK__SDMA;
> - if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) ==
> IP_VERSION(9, 4, 3))
> - reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> - else
> + if (amdgpu_ip_version(dev->adev, GC_HWIP, 0) ==
> IP_VERSION(9, 4, 3)) {
> + reset = ((dev->adev-
> >debug_mode2_for_poison_recovery) ?
> +  AMDGPU_RAS_GPU_RESET_MODE2_RESET :
> AMDGPU_RAS_GPU_RESET_MODE1_RESET);
> + } else {
>   reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + }
>   break;
>   default:
>   dev_warn(dev->adev->dev,
> --
> 2.17.1



RE: [PATCH] drm/amdgpu: report bad status in GPU recovery

2024-08-01 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

We need to perform the GPU reset for the HW and only make the reset flow fail
from the driver's perspective.

Tao

> -----Original Message-----
> From: Lazar, Lijo 
> Sent: Thursday, August 1, 2024 2:41 PM
> To: Zhou1, Tao ; Zhang, Hawking
> ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: report bad status in GPU recovery
>
>
>
> On 8/1/2024 11:28 AM, Zhou1, Tao wrote:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > Yes, the bad status message is printed twice with this patch. I think it's
> > harmless, and the second message is more convenient for the customer.
> >
> > I can add a parameter to amdgpu_ras_eeprom_check_err_threshold to disable
> > the first message if you think printing the message twice is not a good idea.
> >
>
> Instead of doing it this way, can't this be added to amdgpu_ras_do_recovery()
> and stop all recovery actions?
>
> Thanks,
> Lijo
>
> > Tao
> >
> >> -----Original Message-----
> >> From: Zhang, Hawking 
> >> Sent: Thursday, August 1, 2024 1:30 PM
> >> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
> >> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
> >>
> >> [AMD Official Use Only - AMD Internal Distribution Only]
> >>
> >> Right, it's functional. My concern is whether the kernel message in
> >> amdgpu_ras_eeprom_check_err_threshold will be printed twice. This is
> >> the end of gpu recovery (i.e., report gpu reset failed or gpu reset succeed).
> >> Check_err_threshold was already done before reaching here.
> >>
> >> Regards,
> >> Hawking
> >>
> >> -----Original Message-----
> >> From: Zhou1, Tao 
> >> Sent: Thursday, August 1, 2024 11:49
> >> To: Zhang, Hawking ;
> >> amd-gfx@lists.freedesktop.org
> >> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
> >>
> >> [AMD Official Use Only - AMD Internal Distribution Only]
> >>
> >> I think the if condition in amdgpu_ras_eeprom_check_err_threshold is
> >> good enough, no need to update it with is_rma.
> >>
> >> Tao
> >>
> >>> -----Original Message-----
> >>> From: Zhang, Hawking 
> >>> Sent: Thursday, August 1, 2024 11:00 AM
> >>> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
> >>> Cc: Zhou1, Tao 
> >>> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
> >>>
> >>> [AMD Official Use Only - AMD Internal Distribution Only]
> >>>
> >>> Might consider leveraging the is_RMA flag for the same purpose?
> >>>
> >>> Regards,
> >>> Hawking
> >>>
> >>> -----Original Message-----
> >>> From: amd-gfx  On Behalf Of
> >>> Tao Zhou
> >>> Sent: Wednesday, July 31, 2024 18:05
> >>> To: amd-gfx@lists.freedesktop.org
> >>> Cc: Zhou1, Tao 
> >>> Subject: [PATCH] drm/amdgpu: report bad status in GPU recovery
> >>>
> >>> Instead of printing GPU reset failed.
> >>>
> >>> Signed-off-by: Tao Zhou 
> >>> ---
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++--
> >>>  1 file changed, 7 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> index 355c2478c4b6..b7c967779b4b 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct
> >>> amdgpu_device *adev,
> >>> tmp_adev->asic_reset_res = 0;
> >>>
> >>> if (r) {
> >>> -   /* bad news, how to tell it to userspace ? */
> >>> -   dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
> >>> atomic_read(&tmp_adev->gpu_reset_counter));
> >>> +   /* bad news, how to tell it to userspace ?
> >>> +    * for ras error, we should report GPU bad status instead of
> >>> +    * reset failure
> >>> +    */
> >>> +   if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
> >>> +           dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
> >>> +                   atomic_read(&tmp_adev->gpu_reset_counter));
> >>> amdgpu_vf_error_put(tmp_adev,
> >>> AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
> >>> } else {
> >>> dev_info(tmp_adev->dev, "GPU reset(%d)
> >>> succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
> >>> --
> >>> 2.34.1
> >>>
> >>
> >>
> >


RE: [PATCH] drm/amdgpu: Add more types for boot time error reporting

2024-07-31 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Tao Zhou 

> -----Original Message-----
> From: Hawking Zhang 
> Sent: Thursday, August 1, 2024 1:55 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao 
> Cc: Zhang, Hawking 
> Subject: [PATCH] drm/amdgpu: Add more types for boot time error reporting
>
> Data abort exception and unknown errors are supported.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 ++
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  2 ++
>  2 files changed, 12 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 12ab48f26bd5..7aff6150898b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -4769,6 +4769,16 @@ static void
> amdgpu_ras_boot_time_error_reporting(struct amdgpu_device *adev,
>   dev_info(adev->dev,
>"socket: %d, aid: %d, hbm: %d, fw_status: 0x%x, hbm
> bist test failed\n",
>socket_id, aid_id, hbm_id, fw_status);
> +
> + if (AMDGPU_RAS_GPU_ERR_DATA_ABORT(boot_error))
> + dev_info(adev->dev,
> +  "socket: %d, aid: %d, fw_status: 0x%x, data abort
> exception\n",
> +  socket_id, aid_id, fw_status);
> +
> + if (AMDGPU_RAS_GPU_ERR_UNKNOWN(boot_error))
> + dev_info(adev->dev,
> +  "socket: %d, aid: %d, fw_status: 0x%x, unknown boot
> time errors\n",
> +  socket_id, aid_id, fw_status);
>  }
>
>  static bool amdgpu_ras_boot_error_detected(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 7ddd13d5c06b..0d49b74bfe5e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -46,6 +46,8 @@ struct amdgpu_iv_entry;
>  #define AMDGPU_RAS_GPU_ERR_SOCKET_ID(x)      AMDGPU_GET_REG_FIELD(x, 10, 8)
>  #define AMDGPU_RAS_GPU_ERR_AID_ID(x)         AMDGPU_GET_REG_FIELD(x, 12, 11)
>  #define AMDGPU_RAS_GPU_ERR_HBM_ID(x)         AMDGPU_GET_REG_FIELD(x, 14, 13)
> +#define AMDGPU_RAS_GPU_ERR_DATA_ABORT(x)     AMDGPU_GET_REG_FIELD(x, 29, 29)
> +#define AMDGPU_RAS_GPU_ERR_UNKNOWN(x)        AMDGPU_GET_REG_FIELD(x, 30, 30)
>
>  #define AMDGPU_RAS_BOOT_STATUS_POLLING_LIMIT 100
>  #define AMDGPU_RAS_BOOT_STEADY_STATUS        0xBA
> --
> 2.17.1



RE: [PATCH] drm/amdgpu: report bad status in GPU recovery

2024-07-31 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

Yes, the bad status message is printed twice with this patch. I think it's
harmless, and the second message is more convenient for the customer.

I can add a parameter to amdgpu_ras_eeprom_check_err_threshold to disable the
first message if you think printing the message twice is not a good idea.

Tao

> -----Original Message-----
> From: Zhang, Hawking 
> Sent: Thursday, August 1, 2024 1:30 PM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Right, it's functional. My concern is whether the kernel message in
> amdgpu_ras_eeprom_check_err_threshold will be printed twice. This is the end
> of gpu recovery (i.e., report gpu reset failed or gpu reset succeed).
> Check_err_threshold was already done before reaching here.
>
> Regards,
> Hawking
>
> -----Original Message-----
> From: Zhou1, Tao 
> Sent: Thursday, August 1, 2024 11:49
> To: Zhang, Hawking ; amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> I think the if condition in amdgpu_ras_eeprom_check_err_threshold is good
> enough, no need to update it with is_rma.
>
> Tao
>
> > -----Original Message-
> > From: Zhang, Hawking 
> > Sent: Thursday, August 1, 2024 11:00 AM
> > To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
> > Cc: Zhou1, Tao 
> > Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
> >
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > Might consider leveraging the is_RMA flag for the same purpose?
> >
> > Regards,
> > Hawking
> >
> > -----Original Message-----
> > From: amd-gfx  On Behalf Of Tao
> > Zhou
> > Sent: Wednesday, July 31, 2024 18:05
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Zhou1, Tao 
> > Subject: [PATCH] drm/amdgpu: report bad status in GPU recovery
> >
> > Instead of printing GPU reset failed.
> >
> > Signed-off-by: Tao Zhou 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 355c2478c4b6..b7c967779b4b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct
> > amdgpu_device *adev,
> > tmp_adev->asic_reset_res = 0;
> >
> > if (r) {
> > -   /* bad news, how to tell it to userspace ? */
> > -   dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
> > atomic_read(&tmp_adev->gpu_reset_counter));
> > +   /* bad news, how to tell it to userspace ?
> > +    * for ras error, we should report GPU bad status instead of
> > +    * reset failure
> > +    */
> > +   if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
> > +           dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
> > +                   atomic_read(&tmp_adev->gpu_reset_counter));
> > amdgpu_vf_error_put(tmp_adev,
> > AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
> > } else {
> > dev_info(tmp_adev->dev, "GPU reset(%d)
> > succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
> > --
> > 2.34.1
> >
>
>



RE: [PATCH] drm/amdgpu: report bad status in GPU recovery

2024-07-31 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

I think the if condition in amdgpu_ras_eeprom_check_err_threshold is good
enough; there is no need to update it with is_rma.

Tao

> -----Original Message-----
> From: Zhang, Hawking 
> Sent: Thursday, August 1, 2024 11:00 AM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao 
> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Might consider leverage is_RMA flag for the same purpose?
>
> Regards,
> Hawking
>
> -Original Message-
> From: amd-gfx  On Behalf Of Tao Zhou
> Sent: Wednesday, July 31, 2024 18:05
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao 
> Subject: [PATCH] drm/amdgpu: report bad status in GPU recovery
>
> Instead of printing GPU reset failed.
>
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 355c2478c4b6..b7c967779b4b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct
> amdgpu_device *adev,
> tmp_adev->asic_reset_res = 0;
>
> if (r) {
> -   /* bad news, how to tell it to userspace ? */
> -   dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
> atomic_read(&tmp_adev->gpu_reset_counter));
> +   /* bad news, how to tell it to userspace ?
> +    * for ras error, we should report GPU bad status instead of
> +    * reset failure
> +    */
> +   if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
> +           dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
> +                    atomic_read(&tmp_adev->gpu_reset_counter));
> amdgpu_vf_error_put(tmp_adev,
> AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
> } else {
> dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n",
> atomic_read(&tmp_adev->gpu_reset_counter));
> --
> 2.34.1
>



RE: [PATCH] drm/amdgpu: report bad status in GPU recovery

2024-07-31 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

> -Original Message-
> From: Lazar, Lijo 
> Sent: Wednesday, July 31, 2024 9:31 PM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: report bad status in GPU recovery
>
>
>
> On 7/31/2024 3:35 PM, Tao Zhou wrote:
> > Instead of printing GPU reset failed.
> >
> > Signed-off-by: Tao Zhou 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 355c2478c4b6..b7c967779b4b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct
> amdgpu_device *adev,
> > tmp_adev->asic_reset_res = 0;
> >
> > if (r) {
> > -   /* bad news, how to tell it to userspace ? */
> > -   dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
> atomic_read(&tmp_adev->gpu_reset_counter));
> > +   /* bad news, how to tell it to userspace ?
> > +    * for ras error, we should report GPU bad status instead of
> > +    * reset failure
> > +    */
> > +   if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
> > +           dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
> > +                    atomic_read(&tmp_adev->gpu_reset_counter));
>
> Better to check reset_context.src == AMDGPU_RESET_SRC_RAS to confirm that
> the reset is indeed triggered due to ras error.

[Tao] It seems AMDGPU_RESET_SRC_RAS is not set anywhere currently; I will set
it before using the flag.
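
For illustration, a minimal standalone sketch of the check suggested above:
print the generic reset-failure message unless the reset was RAS-triggered and
the bad page threshold was reached. The enum, struct, and function names below
are hypothetical stand-ins, not the driver's definitions; only
reset_context.src and AMDGPU_RESET_SRC_RAS come from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's reset source enum. */
enum reset_src {
	RESET_SRC_UNKNOWN = 0,
	RESET_SRC_RAS,
};

/* Hypothetical stand-in for the reset context; only 'src' matters here. */
struct reset_context {
	enum reset_src src;
};

/*
 * Returns true when the generic "GPU reset failed" message should be
 * printed. It is suppressed only when the reset was triggered by a RAS
 * error and the bad page threshold was reached, because in that case the
 * bad-status report is the more useful message.
 */
static bool should_log_reset_failure(const struct reset_context *ctx,
				     bool err_threshold_reached)
{
	assert(ctx != NULL);

	if (ctx->src == RESET_SRC_RAS && err_threshold_reached)
		return false;

	return true;
}
```

With this shape, a non-RAS reset that fails still prints the failure message
even if the threshold happens to be exceeded.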

>
> Thanks,
> Lijo
>
> > amdgpu_vf_error_put(tmp_adev,
> AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
> > } else {
> > dev_info(tmp_adev->dev, "GPU reset(%d)
> succeeded!\n",
> > atomic_read(&tmp_adev->gpu_reset_counter));


RE: [PATCH 2/3] drm/amdgpu: optimize logging deferred error info

2024-07-17 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

> -Original Message-
> From: Chai, Thomas 
> Sent: Thursday, July 18, 2024 12:34 PM
> To: Chai, Thomas ; Zhou1, Tao ;
> amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Li, Candice
> ; Wang, Yang(Kevin) ; Yang,
> Stanley 
> Subject: RE: [PATCH 2/3] drm/amdgpu: optimize logging deferred error info
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> -
> Best Regards,
> Thomas
>
> -Original Message-
> From: amd-gfx  On Behalf Of Chai,
> Thomas
> Sent: Thursday, July 18, 2024 11:35 AM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Li, Candice
> ; Wang, Yang(Kevin) ; Yang,
> Stanley 
> Subject: RE: [PATCH 2/3] drm/amdgpu: optimize logging deferred error info
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> -
> Best Regards,
> Thomas
>
> -Original Message-
> From: Zhou1, Tao 
> Sent: Thursday, July 18, 2024 10:57 AM
> To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Li, Candice
> ; Wang, Yang(Kevin) ; Yang,
> Stanley 
> Subject: RE: [PATCH 2/3] drm/amdgpu: optimize logging deferred error info
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> > -Original Message-
> > From: Chai, Thomas 
> > Sent: Wednesday, July 17, 2024 4:16 PM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Zhang, Hawking ; Zhou1, Tao
> > ; Li, Candice ; Wang,
> > Yang(Kevin) ; Yang, Stanley
> > ; Chai, Thomas 
> > Subject: [PATCH 2/3] drm/amdgpu: optimize logging deferred error info
> >
> > 1. Use pa_pfn as the radix-tree key index to log
> >deferred error info.
> > 2. Use local array to store expanded bad pages.
> >
> > Signed-off-by: YiPeng Chai 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  2 +-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 14 ++
> > drivers/gpu/drm/amd/amdgpu/umc_v12_0.c  | 65 -
> > drivers/gpu/drm/amd/amdgpu/umc_v12_0.h  |  5 ++
> >  4 files changed, 40 insertions(+), 46 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > index dcf1f3dbb5c4..f607ff620015 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > @@ -476,10 +476,10 @@ struct ras_err_pages {
> >  };
> >
> >  struct ras_ecc_err {
> > - u64 hash_index;
> >   uint64_t status;
> >   uint64_t ipid;
> >   uint64_t addr;
> > + uint64_t pa_pfn;
> >   struct ras_err_pages err_pages;
> >  };
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > index 5d08c03fe543..2fc90799bf8d 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > @@ -523,18 +523,10 @@ int amdgpu_umc_logs_ecc_err(struct
> amdgpu_device
> > *adev,
> >   ecc_log = &con->umc_ecc_log;
> >
> >   mutex_lock(&ecc_log->lock);
> > - ret = radix_tree_insert(ecc_tree, ecc_err->hash_index, ecc_err);
> > - if (!ret) {
> > - struct ras_err_pages *err_pages = &ecc_err->err_pages;
> > - int i;
> > -
> > - /* Reserve memory */
> > - for (i = 0; i < err_pages->count; i++)
> > - amdgpu_ras_reserve_page(adev, err_pages->pfn[i]);
> > -
> > + ret = radix_tree_insert(ecc_tree, ecc_err->pa_pfn, ecc_err);
> > + if (!ret)
> >   radix_tree_tag_set(ecc_tree,
> > - ecc_err->hash_index,
> > UMC_ECC_NEW_DETECTED_TAG);
> > - }
> > + ecc_err->pa_pfn, UMC_ECC_NEW_DETECTED_TAG);
> >   mutex_unlock(&ecc_log->lock);
> >
> >   return ret;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> > b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> > index eca5ac6a0532..f2235c9ead29 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> > @@ -524,9 +524,9 @@ static int umc_v12_0_update_ecc_status(struct
> > amdgpu_device *adev,
> >   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >   uint16_t hwid, mcatype;
> >   uint64_t page_pfn[UMC_V12_0_BAD_PAGE_NUM_PER_CHANNEL];
> > - uint64_t err_addr, hash

RE: [PATCH 2/3] drm/amdgpu: optimize logging deferred error info

2024-07-17 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

> -Original Message-
> From: Chai, Thomas 
> Sent: Wednesday, July 17, 2024 4:16 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Li, Candice ; Wang, Yang(Kevin)
> ; Yang, Stanley ; Chai,
> Thomas 
> Subject: [PATCH 2/3] drm/amdgpu: optimize logging deferred error info
>
> 1. Use pa_pfn as the radix-tree key index to log
>deferred error info.
> 2. Use local array to store expanded bad pages.
>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 14 ++
> drivers/gpu/drm/amd/amdgpu/umc_v12_0.c  | 65 -
> drivers/gpu/drm/amd/amdgpu/umc_v12_0.h  |  5 ++
>  4 files changed, 40 insertions(+), 46 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index dcf1f3dbb5c4..f607ff620015 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -476,10 +476,10 @@ struct ras_err_pages {
>  };
>
>  struct ras_ecc_err {
> - u64 hash_index;
>   uint64_t status;
>   uint64_t ipid;
>   uint64_t addr;
> + uint64_t pa_pfn;
>   struct ras_err_pages err_pages;
>  };
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index 5d08c03fe543..2fc90799bf8d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -523,18 +523,10 @@ int amdgpu_umc_logs_ecc_err(struct amdgpu_device
> *adev,
>   ecc_log = &con->umc_ecc_log;
>
>   mutex_lock(&ecc_log->lock);
> - ret = radix_tree_insert(ecc_tree, ecc_err->hash_index, ecc_err);
> - if (!ret) {
> - struct ras_err_pages *err_pages = &ecc_err->err_pages;
> - int i;
> -
> - /* Reserve memory */
> - for (i = 0; i < err_pages->count; i++)
> - amdgpu_ras_reserve_page(adev, err_pages->pfn[i]);
> -
> + ret = radix_tree_insert(ecc_tree, ecc_err->pa_pfn, ecc_err);
> + if (!ret)
>   radix_tree_tag_set(ecc_tree,
> - ecc_err->hash_index,
> UMC_ECC_NEW_DETECTED_TAG);
> - }
> + ecc_err->pa_pfn, UMC_ECC_NEW_DETECTED_TAG);
>   mutex_unlock(&ecc_log->lock);
>
>   return ret;
> diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> index eca5ac6a0532..f2235c9ead29 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> @@ -524,9 +524,9 @@ static int umc_v12_0_update_ecc_status(struct
> amdgpu_device *adev,
>   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>   uint16_t hwid, mcatype;
>   uint64_t page_pfn[UMC_V12_0_BAD_PAGE_NUM_PER_CHANNEL];
> - uint64_t err_addr, hash_val = 0, pa_addr = 0;
> + uint64_t err_addr, pa_addr = 0;
>   struct ras_ecc_err *ecc_err;
> - int count, ret;
> + int count, ret, i;
>
>   hwid = REG_GET_FIELD(ipid, MCMP1_IPIDT0, HardwareID);
>   mcatype = REG_GET_FIELD(ipid, MCMP1_IPIDT0, McaType);
> @@ -559,39 +559,18 @@ static int umc_v12_0_update_ecc_status(struct amdgpu_device *adev,
>   if (ret)
>   return ret;
>
> - memset(page_pfn, 0, sizeof(page_pfn));
> - count = umc_v12_0_expand_addr_to_bad_pages(adev,
> - pa_addr,
> - page_pfn, ARRAY_SIZE(page_pfn));
> - if (count <= 0) {
> - dev_warn(adev->dev, "Fail to convert error address!
> count:%d\n", count);
> - return 0;
> - }
> -
> - ret = amdgpu_umc_build_pages_hash(adev,
> - page_pfn, count, &hash_val);
> - if (ret) {
> - dev_err(adev->dev, "Fail to build error pages hash\n");
> - return ret;
> - }
> -
>   ecc_err = kzalloc(sizeof(*ecc_err), GFP_KERNEL);
>   if (!ecc_err)
>   return -ENOMEM;
>
> - ecc_err->err_pages.pfn = kcalloc(count, sizeof(*ecc_err->err_pages.pfn),
> GFP_KERNEL);
> - if (!ecc_err->err_pages.pfn) {
> - kfree(ecc_err);
> - return -ENOMEM;
> - }
> -
> - memcpy(ecc_err->err_pages.pfn, page_pfn, count * sizeof(*ecc_err->err_pages.pfn));
> - ecc_err->err_pages.count = count;
> -
> - ecc_err->hash_index = hash_val;
>   ecc_err->status = status;
>   ecc_err->ipid = ipid;
> 

RE: [PATCH 1/3] drm/amdgpu: optimize umc v12 address conversion function

2024-07-17 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

> -Original Message-
> From: Chai, Thomas 
> Sent: Wednesday, July 17, 2024 4:16 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Li, Candice ; Wang, Yang(Kevin)
> ; Yang, Stanley ; Chai,
> Thomas 
> Subject: [PATCH 1/3] drm/amdgpu: optimize umc v12 address conversion function
>
> Split into 3 parts:
> 1. Convert soc physical address via ras ta.
> 2. Expand bad pages from soc physical address.
> 3. Dump bad address info.
>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/umc_v12_0.c | 116 -
>  1 file changed, 77 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> index 9dbb13adb661..eca5ac6a0532 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> @@ -225,26 +225,16 @@ static void umc_v12_0_convert_error_address(struct
> amdgpu_device *adev,
>   }
>  }
>
> -static int umc_v12_0_convert_err_addr(struct amdgpu_device *adev,
> - struct ta_ras_query_address_input *addr_in,
> - uint64_t *pfns, int len)
> +static void umc_v12_0_dump_addr_info(struct amdgpu_device *adev,
> + struct ta_ras_query_address_output *addr_out,
> + uint64_t err_addr)
>  {
>   uint32_t col, row, row_xor, bank, channel_index;
> - uint64_t soc_pa, retired_page, column, err_addr;
> - struct ta_ras_query_address_output addr_out;
> - uint32_t pos = 0;
> -
> - err_addr = addr_in->ma.err_addr;
> - addr_in->addr_type = TA_RAS_MCA_TO_PA;
> - if (psp_ras_query_address(&adev->psp, addr_in, &addr_out)) {
> - dev_warn(adev->dev, "Failed to query RAS physical address for
> 0x%llx",
> - err_addr);
> - return 0;
> - }
> + uint64_t soc_pa, retired_page, column;
>
> - soc_pa = addr_out.pa.pa;
> - bank = addr_out.pa.bank;
> - channel_index = addr_out.pa.channel_idx;
> + soc_pa = addr_out->pa.pa;
> + bank = addr_out->pa.bank;
> + channel_index = addr_out->pa.channel_idx;
>
>   col = (err_addr >> 1) & 0x1fULL;
>   row = (err_addr >> 10) & 0x3fffULL;
> @@ -258,11 +248,6 @@ static int umc_v12_0_convert_err_addr(struct
> amdgpu_device *adev,
>   for (column = 0; column < UMC_V12_0_NA_MAP_PA_NUM; column++) {
>   retired_page = soc_pa | ((column & 0x3) <<
> UMC_V12_0_PA_C2_BIT);
>   retired_page |= (((column & 0x4) >> 2) <<
> UMC_V12_0_PA_C4_BIT);
> -
> - if (pos >= len)
> - return 0;
> - pfns[pos++] = retired_page >> AMDGPU_GPU_PAGE_SHIFT;
> -
>   /* include column bit 0 and 1 */
>   col &= 0x3;
>   col |= (column << 2);
> @@ -270,6 +255,35 @@ static int umc_v12_0_convert_err_addr(struct
> amdgpu_device *adev,
>   "Error Address(PA):0x%-10llx Row:0x%-4x Col:0x%-2x
> Bank:0x%x Channel:0x%x\n",
>   retired_page, row, col, bank, channel_index);
>
> + /* shift R13 bit */
> + retired_page ^= (0x1ULL << UMC_V12_0_PA_R13_BIT);
> + dev_info(adev->dev,
> + "Error Address(PA):0x%-10llx Row:0x%-4x Col:0x%-2x
> Bank:0x%x Channel:0x%x\n",
> + retired_page, row_xor, col, bank, channel_index);
> + }
> +}
> +
> +static int umc_v12_0_expand_addr_to_bad_pages(struct amdgpu_device *adev,
> + uint64_t pa_addr, uint64_t *pfns, int len)
> +{
> + uint64_t soc_pa, retired_page, column;
> + uint32_t pos = 0;
> +
> + soc_pa = pa_addr;
> + /* clear [C3 C2] in soc physical address */
> + soc_pa &= ~(0x3ULL << UMC_V12_0_PA_C2_BIT);
> + /* clear [C4] in soc physical address */
> + soc_pa &= ~(0x1ULL << UMC_V12_0_PA_C4_BIT);

[Tao] These bits are already cleared via UMC_V12_ADDR_MASK_BAD_COLS in patch
#2; is the clearing here redundant?
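
To make the question concrete, here is a small standalone sketch (not the
driver code) of the [C4 C3 C2] expansion loop only. The bit positions and the
page shift are placeholder values assumed for illustration; the real ones live
in the driver headers. It shows that masking the column bits is idempotent, so
a second clear changes nothing when the caller has already masked the address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values, assumed for illustration only. */
#define PA_C2_BIT      15
#define PA_C4_BIT      21
#define NA_MAP_PA_NUM  8
#define PAGE_SHIFT_4K  12

/* Clear [C3 C2] and [C4] in a physical address. */
static uint64_t clear_col_bits(uint64_t pa)
{
	pa &= ~(0x3ULL << PA_C2_BIT);
	pa &= ~(0x1ULL << PA_C4_BIT);
	return pa;
}

/*
 * Expand one physical address into the page frame numbers for every
 * combination of [C4 C3 C2]. Returns the number of entries written, or 0
 * if the output array is too small (mirroring the quoted patch).
 */
static int expand_addr_to_pfns(uint64_t pa, uint64_t *pfns, int len)
{
	uint64_t base = clear_col_bits(pa);
	uint64_t column, page;
	int pos = 0;

	assert(pfns != NULL);

	for (column = 0; column < NA_MAP_PA_NUM; column++) {
		page = base | ((column & 0x3) << PA_C2_BIT);
		page |= ((column & 0x4) >> 2) << PA_C4_BIT;

		if (pos >= len)
			return 0;
		pfns[pos++] = page >> PAGE_SHIFT_4K;
	}

	return pos;
}
```

If every caller passes an already-masked address, the clear inside the helper
does no work; it only matters if some caller might pass an unmasked address.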

> +
> + /* loop for all possibilities of [C4 C3 C2] */
> + for (column = 0; column < UMC_V12_0_NA_MAP_PA_NUM; column++) {
> + retired_page = soc_pa | ((column & 0x3) <<
> UMC_V12_0_PA_C2_BIT);
> + retired_page |= (((column & 0x4) >> 2) <<
> UMC_V12_0_PA_C4_BIT);
> +
> + if (pos >= len)
> + return 0;
> + pfns[po

RE: [PATCH V2 2/2] drm/amdgpu: timely save bad pages to eeprom after gpu ras reset is completed

2024-07-08 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

The series is:

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Chai, Thomas 
> Sent: Tuesday, July 9, 2024 1:56 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Li, Candice ; Wang, Yang(Kevin)
> ; Yang, Stanley ; Chai,
> Thomas 
> Subject: [PATCH V2 2/2] drm/amdgpu: timely save bad pages to eeprom after gpu
> ras reset is completed
>
> The problem case is as follows:
> 1. GPU A triggers a gpu ras reset, and GPU A drives
>GPU B to also perform a gpu ras reset.
> 2. After gpu B ras reset started, gpu B queried a DE
>data. Since the DE data was queried in the ras reset
>thread instead of the page retirement thread, bad
>page retirement work would not be triggered. Then
>even if all gpu resets are completed, the bad pages
>will be cached in RAM until GPU B's bad page retirement
>work is triggered again and then saved to eeprom.
>
> This patch can save the bad pages to eeprom in time after gpu ras reset is
> completed.
>
> v2:
>   1. Add the above description to code comments.
>   2. Reuse existing function.
>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c |  6 +-
> drivers/gpu/drm/amd/amdgpu/umc_v12_0.c  | 18 ++
>  2 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index d923151af752..34226ae010c7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2864,8 +2864,12 @@ static void amdgpu_ras_do_page_retirement(struct
> work_struct *work)
>   struct ras_err_data err_data;
>   unsigned long err_cnt;
>
> - if (amdgpu_in_reset(adev) || amdgpu_ras_in_recovery(adev))
> + /* If gpu reset is ongoing, delay retiring the bad pages */
> + if (amdgpu_in_reset(adev) || amdgpu_ras_in_recovery(adev)) {
> + amdgpu_ras_schedule_retirement_dwork(con,
> + AMDGPU_RAS_RETIRE_PAGE_INTERVAL * 3);
>   return;
> + }
>
>   amdgpu_ras_error_data_init(&err_data);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> index 0faa21d8a7b4..9dbb13adb661 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> @@ -29,6 +29,7 @@
>  #include "mp/mp_13_0_6_sh_mask.h"
>
>  #define MAX_ECC_NUM_PER_RETIREMENT  32
> +#define DELAYED_TIME_FOR_GPU_RESET  1000  //ms
>
>  static inline uint64_t get_umc_v12_0_reg_offset(struct amdgpu_device *adev,
>   uint32_t node_inst,
> @@ -568,6 +569,23 @@ static int umc_v12_0_update_ecc_status(struct
> amdgpu_device *adev,
>
>   con->umc_ecc_log.de_queried_count++;
>
> + /* The problem case is as follows:
> +  * 1. GPU A triggers a gpu ras reset, and GPU A drives
> +  *GPU B to also perform a gpu ras reset.
> +  * 2. After gpu B ras reset started, gpu B queried a DE
> +  *data. Since the DE data was queried in the ras reset
> +  *thread instead of the page retirement thread, bad
> +  *page retirement work would not be triggered. Then
> +  *even if all gpu resets are completed, the bad pages
> +  *will be cached in RAM until GPU B's bad page retirement
> +  *work is triggered again and then saved to eeprom.
> +  * Trigger delayed work to save the bad pages to eeprom in time
> +  * after gpu ras reset is completed.
> +  */
> + if (amdgpu_ras_in_recovery(adev))
> + schedule_delayed_work(&con->page_retirement_dwork,
> + msecs_to_jiffies(DELAYED_TIME_FOR_GPU_RESET));
> +
>   return 0;
>  }
>
> --
> 2.34.1



RE: [PATCH] drm/amdgpu: remove redundant semicolons in RAS_EVENT_LOG

2024-07-03 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Wang, Yang(Kevin) 
> Sent: Thursday, July 4, 2024 1:53 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> 
> Subject: [PATCH] drm/amdgpu: remove redundant semicolons in
> RAS_EVENT_LOG
>
> remove redundant semicolons in RAS_EVENT_LOG to avoid code format check
> warning.
>
> Fixes: 951c09c88fca ("drm/amdgpu: fix compiler 'side-effect' check issue for
> RAS_EVENT_LOG()")
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 9224fc6418e4..518b10f190ec 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -72,7 +72,7 @@ struct amdgpu_iv_entry;
>  #define RAS_EVENT_ID_IS_VALID(x) (!((x) & BIT_ULL(63)))
>
>  #define RAS_EVENT_LOG(adev, id, fmt, ...)\
> - amdgpu_ras_event_log_print((adev), (id), (fmt), ##__VA_ARGS__);
> + amdgpu_ras_event_log_print((adev), (id), (fmt), ##__VA_ARGS__)
>
>  #define amdgpu_ras_mark_ras_event(adev, type)\
>   (amdgpu_ras_mark_ras_event_caller((adev), (type),
> __builtin_return_address(0)))
> --
> 2.34.1
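
Beyond the format-check warning mentioned in the commit message, a trailing
semicolon inside a function-like macro also breaks unbraced if/else callers.
The sketch below is a standalone illustration with hypothetical names
(record_event, EVENT_LOG, log_branch); it is not the driver's macro.

```c
#include <assert.h>

static int last_logged;

/* Hypothetical logging backend: remembers the last id it was given. */
static void record_event(int id)
{
	last_logged = id;
}

/* No trailing semicolon: the caller supplies it, as with a function call. */
#define EVENT_LOG(id) record_event(id)

static int log_branch(int cond)
{
	/*
	 * If the macro body ended in ';', the 'if' arm would expand to
	 * "record_event(1);;" and the 'else' below would fail to compile.
	 */
	if (cond)
		EVENT_LOG(1);
	else
		EVENT_LOG(2);

	return last_logged;
}
```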



RE: [PATCH 4/4] drm/amdgpu: add ras event state device attribute support

2024-07-03 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

The series is:

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Wang, Yang(Kevin) 
> Sent: Wednesday, July 3, 2024 5:03 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> 
> Subject: [PATCH 4/4] drm/amdgpu: add ras event state device attribute support
>
> add amdgpu ras 'event_state' sysfs device attribute support
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 56 +++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  7 +++-
>  2 files changed, 58 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 11f8c37a97ef..d84e4f841ecc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1731,6 +1731,39 @@ static ssize_t amdgpu_ras_sysfs_schema_show(struct device *dev,
>   return sysfs_emit(buf, "schema: 0x%x\n", con->schema);
>  }
>
> +static struct {
> + enum ras_event_type type;
> + const char *name;
> +} dump_event[] = {
> + {RAS_EVENT_TYPE_ISR, "Fault Error"},
> + {RAS_EVENT_TYPE_POISON_CREATION, "Poison Creation"},
> + {RAS_EVENT_TYPE_POISON_CONSUMPTION, "Poison Consumption"},
> +};
> +
> +static ssize_t amdgpu_ras_sysfs_event_state_show(struct device *dev,
> +  struct device_attribute *attr, char *buf)
> +{
> + struct amdgpu_ras *con =
> + container_of(attr, struct amdgpu_ras, event_state_attr);
> + struct ras_event_manager *event_mgr = con->event_mgr;
> + struct ras_event_state *event_state;
> + int i, size = 0;
> +
> + if (!event_mgr)
> + return -EINVAL;
> +
> + size += sysfs_emit_at(buf, size, "current seqno: %llu\n",
> atomic64_read(&event_mgr->seqno));
> + for (i = 0; i <  ARRAY_SIZE(dump_event); i++) {
> + event_state = &event_mgr->event_state[dump_event[i].type];
> + size += sysfs_emit_at(buf, size, "%s : count:%llu,
> last_seqno:%llu\n",
> +   dump_event[i].name,
> +   atomic64_read(&event_state->count),
> +   event_state->last_seqno);
> + }
> +
> + return (ssize_t)size;
> +}
> +
>  static void amdgpu_ras_sysfs_remove_bad_page_node(struct amdgpu_device *adev)
>  {
>   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> @@ -1748,6 +1781,7 @@ static int amdgpu_ras_sysfs_remove_dev_attr_node(struct amdgpu_device *adev)
>   &con->features_attr.attr,
>   &con->version_attr.attr,
>   &con->schema_attr.attr,
> + &con->event_state_attr.attr,
>   NULL
>   };
>   struct attribute_group group = {
> @@ -1980,6 +2014,8 @@ static DEVICE_ATTR(version, 0444,
>   amdgpu_ras_sysfs_version_show, NULL);
>  static DEVICE_ATTR(schema, 0444,
>   amdgpu_ras_sysfs_schema_show, NULL);
> +static DEVICE_ATTR(event_state, 0444,
> +   amdgpu_ras_sysfs_event_state_show, NULL);
>  static int amdgpu_ras_fs_init(struct amdgpu_device *adev)
>  {
>   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> @@ -1990,6 +2026,7 @@ static int amdgpu_ras_fs_init(struct amdgpu_device *adev)
>   &con->features_attr.attr,
>   &con->version_attr.attr,
>   &con->schema_attr.attr,
> + &con->event_state_attr.attr,
>   NULL
>   };
>   struct bin_attribute *bin_attrs[] = {
> @@ -2012,6 +2049,10 @@ static int amdgpu_ras_fs_init(struct amdgpu_device
> *adev)
>   con->schema_attr = dev_attr_schema;
>   sysfs_attr_init(attrs[2]);
>
> + /* add event_state entry */
> + con->event_state_attr = dev_attr_event_state;
> + sysfs_attr_init(attrs[3]);
> +
>   if (amdgpu_bad_page_threshold != 0) {
>   /* add bad_page_features entry */
>   bin_attr_gpu_vram_bad_pages.private = NULL;
> @@ -3440,13 +3481,17 @@ static int amdgpu_get_ras_schema(struct amdgpu_device *adev)
>
>  static void ras_event_mgr_init(struct ras_event_manager *mgr)
>  {
> + struct ras_event_state *event_state;
>   int i;
>
>   memset(mgr, 0, sizeof(*mgr));
>   atomic64_set(&mgr->seqno, 0);
>
> - for (i = 0; i < ARRAY_SIZE(mgr->last_seqno); i++)
> - mgr->last_seqno[i] = RAS_EVENT_INVALID_ID;
> + for (i = 

RE: [PATCH 2/2] drm/amdgpu: timely save bad pages to eeprom after gpu ras reset is complete

2024-07-03 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

> -Original Message-
> From: Chai, Thomas 
> Sent: Wednesday, July 3, 2024 4:41 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Li, Candice ; Wang, Yang(Kevin)
> ; Yang, Stanley ; Chai,
> Thomas 
> Subject: [PATCH 2/2] drm/amdgpu: timely save bad pages to eeprom after gpu ras
> reset is complete
>
> The problem case is as follows:
> 1. GPU A triggers a gpu ras reset, and GPU A drives
>GPU B to also perform a gpu ras reset.
> 2. After gpu B ras reset started, gpu B queried a DE
>data. Since the DE data was queried in the ras reset
>thread instead of the page retirement thread, bad
>page retirement work would not be triggered. Then
>even if all gpu resets are completed, the bad pages
>will be cached in RAM until GPU B's bad page retirement
>work is triggered again and then saved to eeprom.

[Tao] Can we add this description to a code comment?

>
> This patch can save the bad pages to eeprom in time after gpu ras reset is
> complete.
>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 +-
> drivers/gpu/drm/amd/amdgpu/umc_v12_0.c  |  6 ++
>  2 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 1b6f5b26957b..b6e047a354a2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2844,8 +2844,20 @@ static void amdgpu_ras_do_page_retirement(struct
> work_struct *work)
>   struct ras_err_data err_data;
>   unsigned long err_cnt;
>
> - if (amdgpu_in_reset(adev) || amdgpu_ras_in_recovery(adev))
> + if (amdgpu_in_reset(adev) || amdgpu_ras_in_recovery(adev)) {
> + int ret;
> +
> + mutex_lock(&con->umc_ecc_log.lock);
> + ret = radix_tree_tagged(&con->umc_ecc_log.de_page_tree,
> + UMC_ECC_NEW_DETECTED_TAG);
> + mutex_unlock(&con->umc_ecc_log.lock);
> +
> + /* If gpu reset is not completed, schedule delayed work again */
> + if (ret)
> + schedule_delayed_work(&con->page_retirement_dwork,
> + msecs_to_jiffies(AMDGPU_RAS_RETIRE_PAGE_INTERVAL * 3));

[Tao] This section of code can be moved into a helper function so that it is
reusable.
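
As a rough standalone sketch of what such a helper could look like: the
struct, field, and function names below are hypothetical, the interval value
is a placeholder, and the locking around the tree lookup is left out for
brevity. The flag stands in for the radix_tree_tagged() result and the delay
field stands in for schedule_delayed_work().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RETIRE_PAGE_INTERVAL_MS 100 /* assumed placeholder value */

/* Hypothetical model of the state the helper needs. */
struct retire_ctx {
	bool has_new_detected;         /* stand-in for the tagged-entry check */
	unsigned int pending_delay_ms; /* stand-in for queued work; 0 = none */
};

/*
 * Re-queue the page retirement work only when newly detected bad pages
 * are still waiting. Returns true if work was scheduled.
 */
static bool reschedule_retirement_if_pending(struct retire_ctx *ctx,
					     unsigned int delay_ms)
{
	assert(ctx != NULL);

	if (!ctx->has_new_detected)
		return false;

	ctx->pending_delay_ms = delay_ms;
	return true;
}
```

Both the retirement worker and the ECC status update path could then call the
same helper with their own delay.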

>   return;
> + }
>
>   amdgpu_ras_error_data_init(&err_data);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> index 0faa21d8a7b4..7bdba5532adb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> @@ -29,6 +29,7 @@
>  #include "mp/mp_13_0_6_sh_mask.h"
>
>  #define MAX_ECC_NUM_PER_RETIREMENT  32
> +#define DELAYED_TIME_FOR_GPU_RESET  1000  //ms
>
>  static inline uint64_t get_umc_v12_0_reg_offset(struct amdgpu_device *adev,
>   uint32_t node_inst,
> @@ -568,6 +569,11 @@ static int umc_v12_0_update_ecc_status(struct
> amdgpu_device *adev,
>
>   con->umc_ecc_log.de_queried_count++;
>
> + /* Try to retire the bad pages detected after gpu ras reset started */
> + if (amdgpu_ras_in_recovery(adev))
> + schedule_delayed_work(&con->page_retirement_dwork,
> + msecs_to_jiffies(DELAYED_TIME_FOR_GPU_RESET));
> +
>   return 0;
>  }
>
> --
> 2.34.1



RE: [PATCH 3/4] drm/amdgpu: add ras POSION_CONSUMPTION event id support

2024-07-03 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

> -Original Message-
> From: Wang, Yang(Kevin) 
> Sent: Wednesday, July 3, 2024 1:52 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> 
> Subject: [PATCH 3/4] drm/amdgpu: add ras POSION_CONSUMPTION event id
> support
>
> add amdgpu ras POSION_CONSUMPTION event id support.
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 16 +---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  1 +
>  drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 15 ---
>  3 files changed, 26 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 8a98611d2353..11f8c37a97ef 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2076,10 +2076,17 @@ static void
> amdgpu_ras_interrupt_poison_consumption_handler(struct ras_manager *
>   struct amdgpu_ras_block_object *block_obj =
>   amdgpu_ras_get_ras_block(adev, obj->head.block, 0);
>   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> + enum ras_event_type type =
> RAS_EVENT_TYPE_POISON_CONSUMPTION;
> + u64 event_id;
> + int ret;
>
>   if (!block_obj || !con)
>   return;
>
> + ret = amdgpu_ras_mark_ras_event(adev, type);
> + if (ret)

[Tao] Add a warning here? Or you can add it in amdgpu_ras_mark_ras_event.
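
A standalone sketch of the second option, with the warning emitted once inside
the marking function so that callers can simply return on failure. All names
here are hypothetical, and the failure condition (an out-of-range event type)
is an assumption made for the example, not a statement about how
amdgpu_ras_mark_ras_event behaves.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical event types mirroring the shape of the enum in the patch. */
enum ev_type {
	EV_INVALID = 0,
	EV_ISR,
	EV_POISON_CREATION,
	EV_POISON_CONSUMPTION,
	EV_COUNT,
};

static int warn_count;    /* number of warnings printed so far */
static int handled_count; /* number of events that were handled */

/* Mark an event; warn here so that no caller has to repeat the message. */
static int mark_event(enum ev_type type)
{
	if (type <= EV_INVALID || type >= EV_COUNT) {
		fprintf(stderr, "failed to mark ras event (%d)\n", (int)type);
		warn_count++;
		return -EINVAL;
	}

	return 0;
}

/* Caller: bail out quietly, the warning was already printed. */
static int handle_event(enum ev_type type)
{
	if (mark_event(type))
		return 0;

	handled_count++;
	return 1;
}
```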

> + return;
> +
>   /* both query_poison_status and handle_poison_consumption are
> optional,
>* but at least one of them should be implemented if we need poison
>* consumption handler
> @@ -2104,8 +2111,10 @@ static void
> amdgpu_ras_interrupt_poison_consumption_handler(struct ras_manager *
>* For RMA case, amdgpu_umc_poison_handler will handle gpu reset.
>*/
>   if (poison_stat && !con->is_rma) {
> - dev_info(adev->dev, "GPU reset for %s RAS poison consumption
> is issued!\n",
> - block_obj->ras_comm.name);
> + event_id = amdgpu_ras_acquire_event_id(adev, type);
> + RAS_EVENT_LOG(adev, event_id,
> +   "GPU reset for %s RAS poison consumption is
> issued!\n",
> +   block_obj->ras_comm.name);
>   amdgpu_ras_reset_gpu(adev);
>   }
>
> @@ -2498,7 +2507,7 @@ static enum ras_event_type
> amdgpu_ras_get_recovery_event(struct amdgpu_device *a
>   if (amdgpu_ras_intr_triggered())
>   return RAS_EVENT_TYPE_ISR;
>   else
> - return RAS_EVENT_TYPE_INVALID;
> + return RAS_EVENT_TYPE_POISON_CONSUMPTION;
>  }
>
>  static void amdgpu_ras_do_recovery(struct work_struct *work)
> @@ -3975,6 +3984,7 @@ u64 amdgpu_ras_acquire_event_id(struct amdgpu_device *adev, enum ras_event_type
>   switch (type) {
>   case RAS_EVENT_TYPE_ISR:
>   case RAS_EVENT_TYPE_POISON_CREATION:
> + case RAS_EVENT_TYPE_POISON_CONSUMPTION:
>   event_mgr = __get_ras_event_mgr(adev);
>   if (!event_mgr)
>   return RAS_EVENT_INVALID_ID;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 1343cfbc913b..6086da67fa4e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -433,6 +433,7 @@ enum ras_event_type {
>   RAS_EVENT_TYPE_INVALID = 0,
>   RAS_EVENT_TYPE_ISR,
>   RAS_EVENT_TYPE_POISON_CREATION,
> + RAS_EVENT_TYPE_POISON_CONSUMPTION,
>   RAS_EVENT_TYPE_COUNT,
>  };
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> index 816800555f7f..8a10a0e42846 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> @@ -27,6 +27,7 @@
>  #include "soc15_int.h"
>  #include "kfd_device_queue_manager.h"
>  #include "kfd_smi_events.h"
> +#include "amdgpu_ras.h"
>
>  /*
>   * GFX9 SQ Interrupts
> @@ -144,9 +145,11 @@ static void
> event_interrupt_poison_consumption_v9(struct kfd_node *dev,
>   uint16_t pasid, uint16_t client_id)  {
>   enum amdgpu_ras_block block = 0;
> - int old_poison;
>   uint32_t reset = 0;
>   struct kfd_process *p = kfd_lookup_process_by_pasid(pasid);
> + enum ras_event_type type =
> RAS_EVENT_TYPE_POISON_CONSUMPTION;
> + u64 event_id;
> + int old_poison, ret;
>

RE: [PATCH 2/4] drm/amdgpu: add ras POSION_CREATION event id support

2024-07-03 Thread Zhou1, Tao

> -Original Message-
> From: Wang, Yang(Kevin) 
> Sent: Wednesday, July 3, 2024 1:52 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> 
> Subject: [PATCH 2/4] drm/amdgpu: add ras POSION_CREATION event id support
>
> Add amdgpu RAS POISON_CREATION event ID support.
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 17 ++---
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  1 +
>  2 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 45ac82a34d49..8a98611d2353 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2116,8 +2116,17 @@ static void
> amdgpu_ras_interrupt_poison_consumption_handler(struct ras_manager *
> static void amdgpu_ras_interrupt_poison_creation_handler(struct ras_manager
> *obj,
>   struct amdgpu_iv_entry *entry)
>  {
> - dev_info(obj->adev->dev,
> - "Poison is created\n");
> + struct amdgpu_device *adev = obj->adev;
> + enum ras_event_type type = RAS_EVENT_TYPE_POISON_CREATION;
> + u64 event_id;
> + int ret;
> +
> + ret = amdgpu_ras_mark_ras_event(adev, type);
> + if (ret)

[Tao] Do we need to add a warning message here?

> + return;
> +
> + event_id = amdgpu_ras_acquire_event_id(adev, type);
> + RAS_EVENT_LOG(adev, event_id, "Poison is created\n");
>
>   if (amdgpu_ip_version(obj->adev, UMC_HWIP, 0) >= IP_VERSION(12, 0,
> 0)) {
>   struct amdgpu_ras *con = amdgpu_ras_get_context(obj->adev);
> @@ -2889,6 +2898,7 @@ static int amdgpu_ras_poison_creation_handler(struct
> amdgpu_device *adev,
>   uint32_t new_detect_count, total_detect_count;
>   uint32_t need_query_count = poison_creation_count;
>   bool query_data_timeout = false;
> + enum ras_event_type type = RAS_EVENT_TYPE_POISON_CREATION;
>
>   memset(&info, 0, sizeof(info));
>   info.head.block = AMDGPU_RAS_BLOCK__UMC; @@ -2896,7 +2906,7
> @@ static int amdgpu_ras_poison_creation_handler(struct amdgpu_device
> *adev,
>   ecc_log = &ras->umc_ecc_log;
>   total_detect_count = 0;
>   do {
> - ret = amdgpu_ras_query_error_status(adev, &info);
> + ret = amdgpu_ras_query_error_status_with_event(adev, &info,
> type);
>   if (ret)
>   return ret;
>
> @@ -3964,6 +3974,7 @@ u64 amdgpu_ras_acquire_event_id(struct
> amdgpu_device *adev, enum ras_event_type
>
>   switch (type) {
>   case RAS_EVENT_TYPE_ISR:
> + case RAS_EVENT_TYPE_POISON_CREATION:
>   event_mgr = __get_ras_event_mgr(adev);
>   if (!event_mgr)
>   return RAS_EVENT_INVALID_ID;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 88df4be5d122..1343cfbc913b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -432,6 +432,7 @@ struct umc_ecc_info {  enum ras_event_type {
>   RAS_EVENT_TYPE_INVALID = 0,
>   RAS_EVENT_TYPE_ISR,
> + RAS_EVENT_TYPE_POISON_CREATION,
>   RAS_EVENT_TYPE_COUNT,
>  };
>
> --
> 2.34.1



RE: [PATCH] drm/amdgpu: Fix hbm stack id in boot error report

2024-06-28 Thread Zhou1, Tao

How about:

hbm_id = AMDGPU_RAS_GPU_ERR_HBM_ID(boot_error) - 1;

Anyway, the patch is:

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Hawking Zhang 
> Sent: Friday, June 28, 2024 5:04 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao 
> Cc: Zhang, Hawking 
> Subject: [PATCH] drm/amdgpu: Fix hbm stack id in boot error report
>
> To align with firmware, hbm id field 0x1 refers to hbm stack 0, 0x2 refers to
> hbm stack 1.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 4edd8e333d36..6d1f974e2987 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -4565,7 +4565,7 @@ static void
> amdgpu_ras_boot_time_error_reporting(struct amdgpu_device *adev,
>
>   socket_id = AMDGPU_RAS_GPU_ERR_SOCKET_ID(boot_error);
>   aid_id = AMDGPU_RAS_GPU_ERR_AID_ID(boot_error);
> - hbm_id = AMDGPU_RAS_GPU_ERR_HBM_ID(boot_error);
> + hbm_id = ((1 == AMDGPU_RAS_GPU_ERR_HBM_ID(boot_error)) ? 0 : 1);
>
>   if (AMDGPU_RAS_GPU_ERR_MEM_TRAINING(boot_error))
>   dev_info(adev->dev,
> --
> 2.17.1
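
Note on the two mappings discussed above: a minimal sketch, assuming the
raw field value is whatever AMDGPU_RAS_GPU_ERR_HBM_ID(boot_error) returns.
The helper names below are hypothetical and do not exist in the driver; they
only compare the ternary from the patch with the subtraction suggested in
the review.

```c
#include <stdint.h>

/*
 * Hypothetical helpers for illustration only. raw_hbm_id stands for the
 * value extracted by AMDGPU_RAS_GPU_ERR_HBM_ID(boot_error).
 */

/* Mapping used in the patch: 0x1 -> stack 0, any other value -> stack 1. */
static inline uint32_t hbm_stack_from_patch(uint32_t raw_hbm_id)
{
	return (raw_hbm_id == 1) ? 0 : 1;
}

/* Mapping suggested in the review: subtract one from the field value. */
static inline uint32_t hbm_stack_from_review(uint32_t raw_hbm_id)
{
	return raw_hbm_id - 1;
}
```

Both forms agree for the documented values 0x1 and 0x2. They differ for any
other input: the ternary clamps to stack 1, while the subtraction wraps to
UINT32_MAX for a raw value of 0 because the arithmetic is unsigned. The
thread does not say whether firmware can report such values.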



RE: [PATCH] drm/amdgpu: Correct register used to clear fault status

2024-06-28 Thread Zhou1, Tao

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Hawking Zhang 
> Sent: Friday, June 28, 2024 5:04 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao 
> Cc: Zhang, Hawking 
> Subject: [PATCH] drm/amdgpu: Correct register used to clear fault status
>
> Driver should write to fault_cntl registers to do one-shot address/status 
> clear.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
> b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
> index 8d7267a013d2..621761a17ac7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
> @@ -569,7 +569,7 @@ static bool
> mmhub_v1_8_query_utcl2_poison_status(struct amdgpu_device *adev,
>   if (!amdgpu_sriov_vf(adev)) {
>   /* clear page fault status and address */
>   WREG32_P(SOC15_REG_OFFSET(MMHUB, hub_inst,
> -  regVM_L2_PROTECTION_FAULT_STATUS), 1, ~1);
> +  regVM_L2_PROTECTION_FAULT_CNTL), 1, ~1);
>   }
>
>   return fed;
> --
> 2.17.1
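
Note on the "1, ~1" arguments in the hunk above: a minimal stand-alone
sketch, assuming WREG32_P(reg, val, mask) is a read-modify-write that keeps
the bits set in mask and takes the remaining bits from val. The function
name masked_update is hypothetical and models only the bit arithmetic, not
the register access itself.

```c
#include <stdint.h>

/*
 * Model of a masked register update (assumed WREG32_P semantics):
 * bits set in keep_mask are preserved from the old value, all other
 * bits are taken from val.
 */
static inline uint32_t masked_update(uint32_t old, uint32_t val,
				     uint32_t keep_mask)
{
	return (old & keep_mask) | (val & ~keep_mask);
}
```

With val = 1 and keep_mask = ~1, only bit 0 changes and it is set to 1,
which matches the one-shot clear described in the commit message when
applied to the fault_cntl register.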



RE: [PATCH] drm/amdgpu: Fix register access violation

2024-06-20 Thread Zhou1, Tao

Reviewed-by: Tao Zhou 

> -Original Message-
> From: amd-gfx  On Behalf Of Hawking
> Zhang
> Sent: Friday, June 21, 2024 11:30 AM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao 
> Cc: Zhang, Hawking 
> Subject: [PATCH] drm/amdgpu: Fix register access violation
>
> fault_status is a read-only register. fault_cntl is not accessible from the
> guest environment.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c | 8 +---
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c| 3 ++-
>  drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c  | 8 +---
>  3 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
> b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
> index e14acab5cceb..72109abe7c86 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
> @@ -629,9 +629,11 @@ static bool
> gfxhub_v1_2_query_utcl2_poison_status(struct amdgpu_device *adev,
>
>   status = RREG32_SOC15(GC, GET_INST(GC, xcc_id),
> regVM_L2_PROTECTION_FAULT_STATUS);
>   fed = REG_GET_FIELD(status, VM_L2_PROTECTION_FAULT_STATUS,
> FED);
> - /* reset page fault status */
> - WREG32_P(SOC15_REG_OFFSET(GC, GET_INST(GC, xcc_id),
> - regVM_L2_PROTECTION_FAULT_STATUS), 1, ~1);
> + if (!amdgpu_sriov_vf(adev)) {
> + /* clear page fault status and address */
> + WREG32_P(SOC15_REG_OFFSET(GC, GET_INST(GC, xcc_id),
> +  regVM_L2_PROTECTION_FAULT_CNTL), 1, ~1);
> + }
>
>   return fed;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 88b4644f8e96..b73136d390cc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -672,7 +672,8 @@ static int gmc_v9_0_process_interrupt(struct
> amdgpu_device *adev,
>   (amdgpu_ip_version(adev, GC_HWIP, 0) >= IP_VERSION(9, 4, 2)))
>   return 0;
>
> - WREG32_P(hub->vm_l2_pro_fault_cntl, 1, ~1);
> + if (!amdgpu_sriov_vf(adev))
> + WREG32_P(hub->vm_l2_pro_fault_cntl, 1, ~1);
>
>   amdgpu_vm_update_fault_cache(adev, entry->pasid, addr, status,
> vmhub);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
> b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
> index 7a1ff298417a..8d7267a013d2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c
> @@ -566,9 +566,11 @@ static bool
> mmhub_v1_8_query_utcl2_poison_status(struct amdgpu_device *adev,
>
>   status = RREG32_SOC15(MMHUB, hub_inst,
> regVM_L2_PROTECTION_FAULT_STATUS);
>   fed = REG_GET_FIELD(status, VM_L2_PROTECTION_FAULT_STATUS,
> FED);
> - /* reset page fault status */
> - WREG32_P(SOC15_REG_OFFSET(MMHUB, hub_inst,
> - regVM_L2_PROTECTION_FAULT_STATUS), 1, ~1);
> + if (!amdgpu_sriov_vf(adev)) {
> + /* clear page fault status and address */
> + WREG32_P(SOC15_REG_OFFSET(MMHUB, hub_inst,
> +  regVM_L2_PROTECTION_FAULT_STATUS), 1, ~1);
> + }
>
>   return fed;
>  }
> --
> 2.17.1



RE: [PATCH] drm/amdkfd: use mode1 reset for RAS poison consumption

2024-06-13 Thread Zhou1, Tao

> -Original Message-
> From: Lazar, Lijo 
> Sent: Thursday, June 13, 2024 4:07 PM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdkfd: use mode1 reset for RAS poison consumption
>
>
>
> On 6/13/2024 12:27 PM, Tao Zhou wrote:
> > Per FW requirement, replace mode2 with mode1.
> >
> > Signed-off-by: Tao Zhou 
> > ---
> >  drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> > index e1c21d250611..78dde62fb04a 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> > @@ -164,7 +164,7 @@ static void
> event_interrupt_poison_consumption_v9(struct kfd_node *dev,
> > case SOC15_IH_CLIENTID_SE3SH:
> > case SOC15_IH_CLIENTID_UTCL2:
> > block = AMDGPU_RAS_BLOCK__GFX;
> > -   reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> > +   reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> > break;
> > case SOC15_IH_CLIENTID_VMC:
> > case SOC15_IH_CLIENTID_VMC1:
> > @@ -177,7 +177,7 @@ static void
> event_interrupt_poison_consumption_v9(struct kfd_node *dev,
> > case SOC15_IH_CLIENTID_SDMA3:
> > case SOC15_IH_CLIENTID_SDMA4:
> > block = AMDGPU_RAS_BLOCK__SDMA;
> > -   reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> > +   reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> > break;
>
> Does this need 9.4.3 IP version check?

[Tao] It's applicable to all gfx9 ASICs.

>
> Thanks,
> Lijo
> > default:
> > dev_warn(dev->adev->dev,


RE: [PATCH] drm/amdgpu: move some aca/mca init functions into ras_init() stage

2024-06-05 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Wang, Yang(Kevin) 
> Sent: Wednesday, June 5, 2024 5:32 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> 
> Subject: [PATCH] drm/amdgpu: move some aca/mca init functions into ras_init()
> stage
>
> adjust the function position to better match aca/mca fini code in ras_fini().
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 28 ++---
>  1 file changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 8dbfdb767f94..3258feb753ca 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -3428,6 +3428,13 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
>   goto release_con;
>   }
>
> + if (amdgpu_aca_is_enabled(adev))
> + r = amdgpu_aca_init(adev);
> + else
> + r = amdgpu_mca_init(adev);
> + if (r)
> + goto release_con;
> +
>   dev_info(adev->dev, "RAS INFO: ras initialized successfully, "
>"hardware ability[%x] ras_mask[%x]\n",
>adev->ras_hw_enabled, adev->ras_enabled); @@ -3636,25
> +3643,22 @@ int amdgpu_ras_late_init(struct amdgpu_device *adev)
>
>   amdgpu_ras_event_mgr_init(adev);
>
> - if (amdgpu_aca_is_enabled(adev)) {
> - if (!amdgpu_in_reset(adev)) {
> - r = amdgpu_aca_init(adev);
> + if (amdgpu_in_reset(adev)) {
> + if (!amdgpu_aca_is_enabled(adev)) {
> + r = amdgpu_mca_reset(adev);
>   if (r)
>   return r;
>   }
> + }
>
> - if (!amdgpu_sriov_vf(adev))
> - amdgpu_ras_set_aca_debug_mode(adev, false);
> - } else {
> - if (amdgpu_in_reset(adev))
> - r = amdgpu_mca_reset(adev);
> + if (!amdgpu_sriov_vf(adev)) {
> + if (amdgpu_aca_is_enabled(adev))
> + r = amdgpu_ras_set_aca_debug_mode(adev, false);
>   else
> - r = amdgpu_mca_init(adev);
> + r = amdgpu_ras_set_mca_debug_mode(adev, false);
> +
>   if (r)
>   return r;
> -
> - if (!amdgpu_sriov_vf(adev))
> - amdgpu_ras_set_mca_debug_mode(adev, false);
>   }
>
>   /* Guest side doesn't need init ras feature */
> --
> 2.34.1
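
Note on the reworked amdgpu_ras_late_init() branch: a minimal sketch of the
decision logic as it reads after the patch, with the error-return paths
left out. The enum and function names are hypothetical; the three inputs
stand for amdgpu_aca_is_enabled(), amdgpu_in_reset() and amdgpu_sriov_vf().

```c
#include <stdbool.h>

/* Hypothetical flags naming the steps late_init would take. */
enum late_init_step {
	STEP_MCA_RESET     = 1 << 0, /* amdgpu_mca_reset() */
	STEP_ACA_DEBUG_OFF = 1 << 1, /* amdgpu_ras_set_aca_debug_mode(false) */
	STEP_MCA_DEBUG_OFF = 1 << 2, /* amdgpu_ras_set_mca_debug_mode(false) */
};

static unsigned int late_init_steps(bool aca_enabled, bool in_reset,
				    bool sriov_vf)
{
	unsigned int steps = 0;

	/* MCA state is only reset when recovering from a GPU reset. */
	if (in_reset && !aca_enabled)
		steps |= STEP_MCA_RESET;

	/* Debug mode is only touched on bare metal, not in a guest. */
	if (!sriov_vf)
		steps |= aca_enabled ? STEP_ACA_DEBUG_OFF : STEP_MCA_DEBUG_OFF;

	return steps;
}
```

The aca/mca init calls no longer appear here because the patch moves them
into amdgpu_ras_init().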



RE: [PATCH 1/5] drm/amdgpu: add RAS is_rma flag

2024-06-04 Thread Zhou1, Tao

Ping for the series...

> -Original Message-
> From: Zhou1, Tao 
> Sent: Friday, May 31, 2024 6:49 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao 
> Subject: [PATCH 1/5] drm/amdgpu: add RAS is_rma flag
>
> Set the flag to true if the number of bad pages reaches the threshold.
>
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c|  7 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h|  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 10 ++
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h |  3 +--
>  4 files changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 2dc47475b8e9..616dc2387f34 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2940,7 +2940,6 @@ int amdgpu_ras_recovery_init(struct amdgpu_device
> *adev)
>   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>   struct ras_err_handler_data **data;
>   u32  max_eeprom_records_count = 0;
> - bool exc_err_limit = false;
>   int ret;
>
>   if (!con || amdgpu_sriov_vf(adev))
> @@ -2977,12 +2976,12 @@ int amdgpu_ras_recovery_init(struct
> amdgpu_device *adev)
>*/
>   if (adev->gmc.xgmi.pending_reset)
>   return 0;
> - ret = amdgpu_ras_eeprom_init(&con->eeprom_control, &exc_err_limit);
> + ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
>   /*
>* This calling fails when exc_err_limit is true or
>* ret != 0.
>*/
> - if (exc_err_limit || ret)
> + if (con->is_rma || ret)
>   goto free;
>
>   if (con->eeprom_control.ras_num_recs) { @@ -3033,7 +3032,7 @@ int
> amdgpu_ras_recovery_init(struct amdgpu_device *adev)
>* Except error threshold exceeding case, other failure cases in this
>* function would not fail amdgpu driver init.
>*/
> - if (!exc_err_limit)
> + if (!con->is_rma)
>   ret = 0;
>   else
>   ret = -EINVAL;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index d06c01b978cd..437c58c85639 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -521,6 +521,7 @@ struct amdgpu_ras {
>   bool update_channel_flag;
>   /* Record status of smu mca debug mode */
>   bool is_aca_debug_mode;
> + bool is_rma;
>
>   /* Record special requirements of gpu reset caller */
>   uint32_t  gpu_reset_flags;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 9b789dcc2bd1..eae0a555df3c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -750,6 +750,9 @@ amdgpu_ras_eeprom_update_header(struct
> amdgpu_ras_eeprom_control *control)
>   control->tbl_rai.health_percent = 0;
>   }
>
> + if (amdgpu_bad_page_threshold != -1)
> + ras->is_rma = true;
> +
>   /* ignore the -ENOTSUPP return value */
>   amdgpu_dpm_send_rma_reason(adev);
>   }
> @@ -1321,8 +1324,7 @@ static int __read_table_ras_info(struct
> amdgpu_ras_eeprom_control *control)
>   return res == RAS_TABLE_V2_1_INFO_SIZE ? 0 : res;  }
>
> -int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
> -bool *exceed_err_limit)
> +int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control)
>  {
>   struct amdgpu_device *adev = to_amdgpu_device(control);
>   unsigned char buf[RAS_TABLE_HEADER_SIZE] = { 0 }; @@ -1330,7
> +1332,7 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control
> *control,
>   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
>   int res;
>
> - *exceed_err_limit = false;
> + ras->is_rma = false;
>
>   if (!__is_ras_eeprom_supported(adev))
>   return 0;
> @@ -1422,7 +1424,7 @@ int amdgpu_ras_eeprom_init(struct
> amdgpu_ras_eeprom_control *control,
>   dev_warn(adev->dev, "GPU will be initialized
> due to bad_page_threshold = -1.");
>   res = 0;
>   } else {
> - *exceed_err_limit = true;
> + ras->is_rma = true;
>   dev_err(adev->dev,
>
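
Note on how is_rma replaces the old exc_err_limit out-parameter: a minimal
sketch under stated assumptions. The struct and function names are
hypothetical, and table_over_threshold stands in for whatever condition the
truncated hunk checks before reaching the bad_page_threshold test. Only the
flag handling visible in the diff is modelled.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of struct amdgpu_ras. */
struct ras_state {
	bool is_rma;
};

/* Models the flag handling in amdgpu_ras_eeprom_init() after the patch. */
static void eeprom_init_sketch(struct ras_state *ras,
			       bool table_over_threshold,
			       int bad_page_threshold)
{
	/* The flag is cleared on entry, like *exceed_err_limit was. */
	ras->is_rma = false;

	if (!table_over_threshold)
		return;

	/* bad_page_threshold == -1: warn and let the GPU initialize. */
	if (bad_page_threshold == -1)
		return;

	ras->is_rma = true;
}

/* Models the final status selection in amdgpu_ras_recovery_init(). */
static int recovery_init_status(const struct ras_state *ras)
{
	return ras->is_rma ? -EINVAL : 0;
}
```

Keeping the state in the RAS context rather than in a local variable is what
lets other code paths, such as the poison consumption handler, test
con->is_rma later.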

RE: [PATCH] drm/amdgpu: Update programming for boot error reporting

2024-05-30 Thread Zhou1, Tao

Reviewed-by: Tao Zhou 

One more question: do we need to consider compatibility with old FW?

> -Original Message-
> From: Hawking Zhang 
> Sent: Thursday, May 30, 2024 4:41 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao 
> Cc: Zhang, Hawking 
> Subject: [PATCH] drm/amdgpu: Update programming for boot error reporting
>
> AMDGPU_RAS_GPU_ERR_BOOT_STATUS field is no longer valid.
> The polling sequence is also simplified according to the latest firmware
> change.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 99 +++--
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  4 +-
>  2 files changed, 46 insertions(+), 57 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index eedf2b613ac2..2c338d39cd45 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -4416,64 +4416,74 @@ int amdgpu_ras_error_statistic_de_count(struct
> ras_err_data *err_data,
>  #define mmMP0_SMN_C2PMSG_92  0x1609C
>  #define mmMP0_SMN_C2PMSG_126 0x160BE
>  static void amdgpu_ras_boot_time_error_reporting(struct amdgpu_device
> *adev,
> -  u32 instance, u32 boot_error)
> +  u32 instance)
>  {
>   u32 socket_id, aid_id, hbm_id;
> - u32 reg_data;
> + u32 fw_status;
> + u32 boot_error;
>   u64 reg_addr;
>
> - socket_id = AMDGPU_RAS_GPU_ERR_SOCKET_ID(boot_error);
> - aid_id = AMDGPU_RAS_GPU_ERR_AID_ID(boot_error);
> - hbm_id = AMDGPU_RAS_GPU_ERR_HBM_ID(boot_error);
> -
>   /* The pattern for smn addressing in other SOC could be different from
>* the one for aqua_vanjaram. We should revisit the code if the pattern
>* is changed. In such case, replace the aqua_vanjaram implementation
>* with more common helper */
>   reg_addr = (mmMP0_SMN_C2PMSG_92 << 2) +
>  aqua_vanjaram_encode_ext_smn_addressing(instance);
> + fw_status = amdgpu_device_indirect_rreg_ext(adev, reg_addr);
> +
> + reg_addr = (mmMP0_SMN_C2PMSG_126 << 2) +
> +aqua_vanjaram_encode_ext_smn_addressing(instance);
> + boot_error = amdgpu_device_indirect_rreg_ext(adev, reg_addr);
>
> - reg_data = amdgpu_device_indirect_rreg_ext(adev, reg_addr);
> - dev_err(adev->dev, "socket: %d, aid: %d, firmware boot failed, fw status
> is 0x%x\n",
> - socket_id, aid_id, reg_data);
> + socket_id = AMDGPU_RAS_GPU_ERR_SOCKET_ID(boot_error);
> + aid_id = AMDGPU_RAS_GPU_ERR_AID_ID(boot_error);
> + hbm_id = AMDGPU_RAS_GPU_ERR_HBM_ID(boot_error);
>
>   if (AMDGPU_RAS_GPU_ERR_MEM_TRAINING(boot_error))
> - dev_info(adev->dev, "socket: %d, aid: %d, hbm: %d, memory
> training failed\n",
> -  socket_id, aid_id, hbm_id);
> + dev_info(adev->dev,
> +  "socket: %d, aid: %d, hbm: %d, fw_status: 0x%x,
> memory training failed\n",
> +  socket_id, aid_id, hbm_id, fw_status);
>
>   if (AMDGPU_RAS_GPU_ERR_FW_LOAD(boot_error))
> - dev_info(adev->dev, "socket: %d, aid: %d, firmware load failed 
> at
> boot time\n",
> -  socket_id, aid_id);
> + dev_info(adev->dev,
> +  "socket: %d, aid: %d, fw_status: 0x%x, firmware load
> failed at boot time\n",
> +  socket_id, aid_id, fw_status);
>
>   if (AMDGPU_RAS_GPU_ERR_WAFL_LINK_TRAINING(boot_error))
> - dev_info(adev->dev, "socket: %d, aid: %d, wafl link training
> failed\n",
> -  socket_id, aid_id);
> + dev_info(adev->dev,
> +  "socket: %d, aid: %d, fw_status: 0x%x, wafl link 
> training
> failed\n",
> +  socket_id, aid_id, fw_status);
>
>   if (AMDGPU_RAS_GPU_ERR_XGMI_LINK_TRAINING(boot_error))
> - dev_info(adev->dev, "socket: %d, aid: %d, xgmi link training
> failed\n",
> -  socket_id, aid_id);
> + dev_info(adev->dev,
> +  "socket: %d, aid: %d, fw_status: 0x%x, xgmi link 
> training
> failed\n",
> +  socket_id, aid_id, fw_status);
>
>   if (AMDGPU_RAS_GPU_ERR_USR_CP_LINK_TRAINING(boot_error))
> - dev_info(adev->dev, "socket: %d, aid: %d, usr cp link training
> failed\n",
> -

RE: [PATCH] drm/amdgpu: Estimate RAS reservation when report capacity v2

2024-05-27 Thread Zhou1, Tao

I would prefer adding a comment for AMDGPU_RAS_RESERVED_VRAM_SIZE to explain
the value of 16MB. Anyway, the patch is:

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Zhang, Hawking 
> Sent: Tuesday, May 28, 2024 1:57 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao 
> Cc: Zhang, Hawking ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> 
> Subject: [PATCH] drm/amdgpu: Estimate RAS reservation when report capacity v2
>
> Add an estimate of how much vram we need to reserve for RAS when calculating
> the total available vram.
>
> v2: apply the change to MP0 v13_0_2 and v13_0_14
>
> Signed-off-by: Hawking Zhang 
> ---
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  9 +++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   | 20 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h   |  4 
>  3 files changed, 31 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index e98927529f61..ad813772f8a1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -173,6 +173,8 @@ int amdgpu_amdkfd_reserve_mem_limit(struct
> amdgpu_device *adev,  {
>   uint64_t reserved_for_pt =
>   ESTIMATE_PT_SIZE(amdgpu_amdkfd_total_mem_size);
> + struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> + uint64_t reserved_for_ras = (con ? con->reserved_pages_in_bytes : 0);
>   size_t system_mem_needed, ttm_mem_needed, vram_needed;
>   int ret = 0;
>   uint64_t vram_size = 0;
> @@ -221,7 +223,7 @@ int amdgpu_amdkfd_reserve_mem_limit(struct
> amdgpu_device *adev,
>   (kfd_mem_limit.ttm_mem_used + ttm_mem_needed >
>kfd_mem_limit.max_ttm_mem_limit) ||
>   (adev && xcp_id >= 0 && adev->kfd.vram_used[xcp_id] +
> vram_needed >
> -  vram_size - reserved_for_pt - atomic64_read(&adev->vram_pin_size)
> +
> +  vram_size - reserved_for_pt - reserved_for_ras -
> +atomic64_read(&adev->vram_pin_size) +
>atomic64_read(&adev->kfd.vram_pinned))) {
>   ret = -ENOMEM;
>   goto release;
> @@ -1694,6 +1696,8 @@ size_t amdgpu_amdkfd_get_available_memory(struct
> amdgpu_device *adev,  {
>   uint64_t reserved_for_pt =
>   ESTIMATE_PT_SIZE(amdgpu_amdkfd_total_mem_size);
> + struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> + uint64_t reserved_for_ras = (con ? con->reserved_pages_in_bytes : 0);
>   ssize_t available;
>   uint64_t vram_available, system_mem_available, ttm_mem_available;
>
> @@ -1702,7 +1706,8 @@ size_t amdgpu_amdkfd_get_available_memory(struct
> amdgpu_device *adev,
>   - adev->kfd.vram_used_aligned[xcp_id]
>   - atomic64_read(&adev->vram_pin_size)
>   + atomic64_read(&adev->kfd.vram_pinned)
> - - reserved_for_pt;
> + - reserved_for_pt
> + - reserved_for_ras;
>
>   if (adev->gmc.is_app_apu || adev->flags & AMD_IS_APU) {
>   system_mem_available = no_system_mem_limit ?
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index ecce022c657b..f28bf5765380 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -3317,6 +3317,24 @@ static void amdgpu_ras_event_mgr_init(struct
> amdgpu_device *adev)
>   amdgpu_put_xgmi_hive(hive);
>  }
>
> +static void amdgpu_ras_init_reserved_vram_size(struct amdgpu_device
> +*adev) {
> + struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +
> + if (!con || (adev->flags & AMD_IS_APU))
> + return;
> +
> + switch (amdgpu_ip_version(adev, MP0_HWIP, 0)) {
> + case IP_VERSION(13, 0, 2):
> + case IP_VERSION(13, 0, 6):
> + case IP_VERSION(13, 0, 14):
> + con->reserved_pages_in_bytes =
> AMDGPU_RAS_RESERVED_VRAM_SIZE;
> + break;
> + default:
> + break;
> + }
> +}
> +
>  int amdgpu_ras_init(struct amdgpu_device *adev)  {
>   struct amdgpu_ras *con = amdgpu_ras_get_context(adev); @@ -3422,6
> +3440,8 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
>   /* Get RAS schema for particular SOC */
>   con->schema = amdgpu_get_ras_schema(adev);
>
> + amdgpu_ras_init_reserved_vram_size(adev);
> +
>   if (amdgpu_ras_fs_init(adev)) {
>   r = -EINVAL;
>   goto release_con;
> diff --git a/drivers/
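
Note on the capacity arithmetic in the hunks above: a minimal sketch,
assuming the first term of the expression (not visible in the diff context)
is the total VRAM size for the partition. The function and macro names are
hypothetical; the 16MB value is taken from the review comment about
AMDGPU_RAS_RESERVED_VRAM_SIZE.

```c
#include <stdint.h>

/* 16MB, per the review comment; hypothetical name for this sketch. */
#define RAS_RESERVED_VRAM_SIZE_SKETCH (16LL << 20)

/*
 * Models the reported-capacity expression after the patch: the RAS
 * reservation is subtracted alongside the page-table estimate.
 */
static int64_t vram_available_sketch(int64_t vram_size,
				     int64_t vram_used_aligned,
				     int64_t vram_pin_size,
				     int64_t kfd_vram_pinned,
				     int64_t reserved_for_pt,
				     int64_t reserved_for_ras)
{
	return vram_size
		- vram_used_aligned
		- vram_pin_size
		+ kfd_vram_pinned
		- reserved_for_pt
		- reserved_for_ras;
}
```

On ASICs that do not match the MP0 version switch, or on APUs,
reserved_pages_in_bytes stays 0 and the result is unchanged from before the
patch.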

RE: [PATCH] drm/amdgpu: Estimate RAS reservation when report capacity

2024-05-27 Thread Zhou1, Tao

> -Original Message-
> From: amd-gfx  On Behalf Of Hawking
> Zhang
> Sent: Tuesday, May 28, 2024 10:21 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao ; Kuehling, Felix
> ; Kasiviswanathan, Harish
> ; Zhang, Hawking
> 
> Subject: [PATCH] drm/amdgpu: Estimate RAS reservation when report capacity
>
> Add an estimate of how much vram we need to reserve for RAS when calculating
> the total available vram.
>
> Signed-off-by: Hawking Zhang 
> ---
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c   |  9 +++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c| 18 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h|  2 ++
>  3 files changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index e98927529f61..ad813772f8a1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -173,6 +173,8 @@ int amdgpu_amdkfd_reserve_mem_limit(struct
> amdgpu_device *adev,  {
>   uint64_t reserved_for_pt =
>   ESTIMATE_PT_SIZE(amdgpu_amdkfd_total_mem_size);
> + struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> + uint64_t reserved_for_ras = (con ? con->reserved_pages_in_bytes : 0);
>   size_t system_mem_needed, ttm_mem_needed, vram_needed;
>   int ret = 0;
>   uint64_t vram_size = 0;
> @@ -221,7 +223,7 @@ int amdgpu_amdkfd_reserve_mem_limit(struct
> amdgpu_device *adev,
>   (kfd_mem_limit.ttm_mem_used + ttm_mem_needed >
>kfd_mem_limit.max_ttm_mem_limit) ||
>   (adev && xcp_id >= 0 && adev->kfd.vram_used[xcp_id] +
> vram_needed >
> -  vram_size - reserved_for_pt - atomic64_read(&adev->vram_pin_size)
> +
> +  vram_size - reserved_for_pt - reserved_for_ras -
> +atomic64_read(&adev->vram_pin_size) +
>atomic64_read(&adev->kfd.vram_pinned))) {
>   ret = -ENOMEM;
>   goto release;
> @@ -1694,6 +1696,8 @@ size_t amdgpu_amdkfd_get_available_memory(struct
> amdgpu_device *adev,  {
>   uint64_t reserved_for_pt =
>   ESTIMATE_PT_SIZE(amdgpu_amdkfd_total_mem_size);
> + struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> + uint64_t reserved_for_ras = (con ? con->reserved_pages_in_bytes : 0);
>   ssize_t available;
>   uint64_t vram_available, system_mem_available, ttm_mem_available;
>
> @@ -1702,7 +1706,8 @@ size_t amdgpu_amdkfd_get_available_memory(struct
> amdgpu_device *adev,
>   - adev->kfd.vram_used_aligned[xcp_id]
>   - atomic64_read(&adev->vram_pin_size)
>   + atomic64_read(&adev->kfd.vram_pinned)
> - - reserved_for_pt;
> + - reserved_for_pt
> + - reserved_for_ras;
>
>   if (adev->gmc.is_app_apu || adev->flags & AMD_IS_APU) {
>   system_mem_available = no_system_mem_limit ?
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index ecce022c657b..a6334e0e62dc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -3317,6 +3317,22 @@ static void amdgpu_ras_event_mgr_init(struct
> amdgpu_device *adev)
>   amdgpu_put_xgmi_hive(hive);
>  }
>
> +static void amdgpu_ras_init_reserved_vram_size(struct amdgpu_device
> +*adev) {
> + struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +
> + if (!con || (adev->flags & AMD_IS_APU))
> + return;
> +
> + switch (amdgpu_ip_version(adev, MP0_HWIP, 0)) {
> + case IP_VERSION(13, 0, 6):

[Tao] Can we apply the change to all ASICs that support RAS?

> + con->reserved_pages_in_bytes =
> AMDGPU_RAS_RESERVED_VRAM_SIZE;
> + break;
> + default:
> + break;
> + }
> +}
> +
>  int amdgpu_ras_init(struct amdgpu_device *adev)  {
>   struct amdgpu_ras *con = amdgpu_ras_get_context(adev); @@ -3422,6
> +3438,8 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
>   /* Get RAS schema for particular SOC */
>   con->schema = amdgpu_get_ras_schema(adev);
>
> + amdgpu_ras_init_reserved_vram_size(adev);
> +
>   if (amdgpu_ras_fs_init(adev)) {
>   r = -EINVAL;
>   goto release_con;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 6a8c7b1609df..bee622c4268a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
>

RE: [PATCH] drm/amdgpu: fix typo in amdgpu_ras_aca_sysfs_read() function

2024-05-27 Thread Zhou1, Tao

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Wang, Yang(Kevin) 
> Sent: Monday, May 27, 2024 3:47 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Chai, Thomas 
> Subject: [PATCH] drm/amdgpu: fix typo in amdgpu_ras_aca_sysfs_read() function
>
> fix typo "info.ue_count" in amdgpu_ras_aca_sysfs_read() function.
>
> Fixes: edd67b5417f5 ("drm/amdgpu: add aca deferred error type support")
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 8073716bc5ac..db4a811cc0f5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1299,7 +1299,7 @@ ssize_t amdgpu_ras_aca_sysfs_read(struct device
> *dev, struct device_attribute *a
>   return -EINVAL;
>
>   return sysfs_emit(buf, "%s: %lu\n%s: %lu\n%s: %lu\n", "ue",
> info.ue_count,
> -   "ce", info.ce_count, "de", info.ue_count);
> +   "ce", info.ce_count, "de", info.de_count);
>  }
>
>  static int amdgpu_ras_query_error_status_helper(struct amdgpu_device *adev,
> --
> 2.34.1



RE: [PATCH 1/2] drm/amdgpu: add RAS is_rma flag

2024-05-26 Thread Zhou1, Tao

> -Original Message-
> From: Yang, Stanley 
> Sent: Thursday, May 23, 2024 9:57 PM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao 
> Subject: RE: [PATCH 1/2] drm/amdgpu: add RAS is_rma flag
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> > -Original Message-
> > From: amd-gfx  On Behalf Of Tao
> > Zhou
> > Sent: Thursday, May 23, 2024 6:02 PM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Zhou1, Tao 
> > Subject: [PATCH 1/2] drm/amdgpu: add RAS is_rma flag
> >
> > Set the flag to true if the number of bad pages reaches the threshold.
> >
> > Signed-off-by: Tao Zhou 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c|  7 +++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h|  1 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 10 ++
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h |  3 +--
> >  4 files changed, 11 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index ecce022c657b..934dfb2bf9e5 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -2940,7 +2940,6 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
> >   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >   struct ras_err_handler_data **data;
> >   u32  max_eeprom_records_count = 0;
> > - bool exc_err_limit = false;
> >   int ret;
> >
> >   if (!con || amdgpu_sriov_vf(adev))
> > @@ -2977,12 +2976,12 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
> >*/
> >   if (adev->gmc.xgmi.pending_reset)
> >   return 0;
> > - ret = amdgpu_ras_eeprom_init(&con->eeprom_control, &exc_err_limit);
> > + ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
> >   /*
> >* This calling fails when exc_err_limit is true or
> >* ret != 0.
> >*/
> > - if (exc_err_limit || ret)
> > + if (con->is_rma || ret)
> >   goto free;
> >
> >   if (con->eeprom_control.ras_num_recs) {
> > @@ -3033,7 +3032,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
> >* Except error threshold exceeding case, other failure cases in this
> >* function would not fail amdgpu driver init.
> >*/
> > - if (!exc_err_limit)
> > + if (!con->is_rma)
> >   ret = 0;
> >   else
> >   ret = -EINVAL;
>
> [Stanley]: Should the device stop service if it enters the RMA state at run time?
> The amdgpu_ras_recovery_init function is only called while the driver is
> loading.

[Tao] Yes, I plan to stop service in the resume stage after a mode-1 reset if
run-time RMA is reported. I have no environment to verify the design right now,
so this remains a TODO for the time being.

>
> Regards,
> Stanley
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > index d06c01b978cd..437c58c85639 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > @@ -521,6 +521,7 @@ struct amdgpu_ras {
> >   bool update_channel_flag;
> >   /* Record status of smu mca debug mode */
> >   bool is_aca_debug_mode;
> > + bool is_rma;
> >
> >   /* Record special requirements of gpu reset caller */
> >   uint32_t  gpu_reset_flags;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index 9b789dcc2bd1..eae0a555df3c 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > @@ -750,6 +750,9 @@ amdgpu_ras_eeprom_update_header(struct
> > amdgpu_ras_eeprom_control *control)
> >   control->tbl_rai.health_percent = 0;
> >   }
> >
> > + if (amdgpu_bad_page_threshold != -1)
> > + ras->is_rma = true;
> > +
> >   /* ignore the -ENOTSUPP return value */
> >   amdgpu_dpm_send_rma_reason(adev);
> >   }
> > @@ -1321,8 +1324,7 @@ static int __read_table_ras_info(struct
> > amdgpu_ras_eeprom_control *control)
> >   return res == RAS_TABLE_V2_1_INFO_SIZE ? 0 : res;
> >  }
> >
> > -int amdgpu_ras_eeprom_init(struc

RE: [PATCH] drm/amdgpu: correct hbm field in boot status

2024-05-21 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Zhang, Hawking 
> Sent: Tuesday, May 21, 2024 3:12 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao 
> Cc: Zhang, Hawking 
> Subject: [PATCH] drm/amdgpu: correct hbm field in boot status
>
> hbm field takes bit 13 and bit 14 in boot status.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index c8980d5f6540..7021c4a66fb5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -46,7 +46,7 @@ struct amdgpu_iv_entry;
>  #define AMDGPU_RAS_GPU_ERR_HBM_BIST_TEST(x) AMDGPU_GET_REG_FIELD(x, 7, 7)
>  #define AMDGPU_RAS_GPU_ERR_SOCKET_ID(x) AMDGPU_GET_REG_FIELD(x, 10, 8)
>  #define AMDGPU_RAS_GPU_ERR_AID_ID(x) AMDGPU_GET_REG_FIELD(x, 12, 11)
> -#define AMDGPU_RAS_GPU_ERR_HBM_ID(x) AMDGPU_GET_REG_FIELD(x, 13, 13)
> +#define AMDGPU_RAS_GPU_ERR_HBM_ID(x) AMDGPU_GET_REG_FIELD(x, 14, 13)
>  #define AMDGPU_RAS_GPU_ERR_BOOT_STATUS(x) AMDGPU_GET_REG_FIELD(x, 31, 31)
>
>  #define AMDGPU_RAS_BOOT_STATUS_POLLING_LIMIT 1000
> --
> 2.17.1
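
To make the effect of the one-bit change concrete, here is a small stand-alone sketch. The helper get_field and the wrapper boot_status_hbm_id are hypothetical names written for this note; they are not the driver's AMDGPU_GET_REG_FIELD macro, only an illustration of extracting an inclusive bit range. The bit positions (14:13 for the HBM ID) are taken from the patch above.

```c
#include <stdint.h>

/* Hypothetical helper: extract bits [hi:lo] (inclusive) of a 32-bit value. */
static inline uint32_t get_field(uint32_t x, unsigned int hi, unsigned int lo)
{
	uint32_t width = hi - lo + 1;
	uint32_t mask = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);

	return (x >> lo) & mask;
}

/* Field layout from the patch: the HBM ID occupies bits 14:13. */
static inline uint32_t boot_status_hbm_id(uint32_t status)
{
	return get_field(status, 14, 13);
}
```

With the old (13, 13) range only the low bit of the ID is visible, so HBM IDs 2 and 3 would be read as 0 and 1.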



RE: [PATCH] drm/amdgpu: update type of buf size to u32 for eeprom functions

2024-05-19 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]


The limit variable in amdgpu_eeprom_xfer is not 0, so buf_size is always split
into smaller pieces and u16 is enough for __amdgpu_eeprom_xfer.
Still, it is better to use u32 for __amdgpu_eeprom_xfer and make sure that
msgs[1].len stays below U16_MAX. I will create a new patch for that purpose.

Tao
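
The splitting described above can be sketched in user space as follows. This is an illustration under stated assumptions, not the driver code: chunked_xfer, xfer_fn, fake_xfer and struct xfer_stats are hypothetical names, and the stub only records what it was asked to transfer. The point is that a 32-bit request size is safe as long as every piece handed to the low-level transfer fits in 16 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical low-level transfer: accepts at most UINT16_MAX bytes per call. */
typedef int (*xfer_fn)(uint32_t addr, uint8_t *buf, uint16_t len, void *ctx);

/* Hypothetical bookkeeping used by the stub below. */
struct xfer_stats {
	uint32_t calls;		/* number of low-level transfers issued */
	uint32_t total;		/* total bytes requested */
	uint16_t max_len;	/* largest single piece seen */
};

/* Stub standing in for the device transfer; it only records the request. */
static int fake_xfer(uint32_t addr, uint8_t *buf, uint16_t len, void *ctx)
{
	struct xfer_stats *st = ctx;

	(void)addr;
	(void)buf;
	st->calls++;
	st->total += len;
	if (len > st->max_len)
		st->max_len = len;
	return 0;
}

/*
 * Split a request with a 32-bit size into pieces no larger than `limit`.
 * Returns the number of bytes transferred, or a negative value on error.
 */
static int64_t chunked_xfer(xfer_fn xfer, void *ctx, uint32_t addr,
			    uint8_t *buf, uint32_t size, uint16_t limit)
{
	uint32_t done = 0;

	if (!xfer || !buf || limit == 0)
		return -1;

	while (done < size) {
		uint32_t remaining = size - done;
		uint16_t len = remaining > limit ? limit : (uint16_t)remaining;
		int ret = xfer(addr + done, buf + done, len, ctx);

		if (ret < 0)
			return ret;
		done += len;
	}

	return (int64_t)done;
}
```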

  _
  From: Zhang, Hawking 
  Sent: Monday, May 20, 2024 11:23 AM
  To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
  Cc: Zhou1, Tao 
  Subject: RE: [PATCH] drm/amdgpu: update type of buf size to u32 for 
eeprom functions


  [AMD Official Use Only - AMD Internal Distribution Only]



  Hmm... but in __amdgpu_eeprom_xfer, the u32 will still be cut to u16.
  __amdgpu_eeprom_xfer(struct i2c_adapter *i2c_adap, u32 eeprom_addr, u8 
*eeprom_buf, u16 buf_size, bool read)

  Regards,
  Hawking

  -Original Message-
  From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Tao Zhou
  Sent: Monday, May 20, 2024 10:46
  To: amd-gfx@lists.freedesktop.org
  Cc: Zhou1, Tao <tao.zh...@amd.com>
  Subject: [PATCH] drm/amdgpu: update type of buf size to u32 for eeprom functions

  Avoid overflow issue.

  Signed-off-by: Tao Zhou <tao.zh...@amd.com>
  ---
   drivers/gpu/drm/amd/amdgpu/amdgpu_eeprom.c | 6 +++---
   drivers/gpu/drm/amd/amdgpu/amdgpu_eeprom.h | 4 ++--
   2 files changed, 5 insertions(+), 5 deletions(-)

  diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_eeprom.c
  index e71768661ca8..09a34c7258e2 100644
  --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_eeprom.c
  +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_eeprom.c
  @@ -179,7 +179,7 @@ static int __amdgpu_eeprom_xfer(struct i2c_adapter 
*i2c_adap, u32 eeprom_addr,
* Returns the number of bytes read/written; -errno on error.
*/
   static int amdgpu_eeprom_xfer(struct i2c_adapter *i2c_adap, u32 
eeprom_addr,
  -   u8 *eeprom_buf, u16 buf_size, bool read)
  +   u8 *eeprom_buf, u32 buf_size, bool read)
   {
const struct i2c_adapter_quirks *quirks = i2c_adap->quirks;
u16 limit;
  @@ -225,7 +225,7 @@ static int amdgpu_eeprom_xfer(struct i2c_adapter 
*i2c_adap, u32 eeprom_addr,

   int amdgpu_eeprom_read(struct i2c_adapter *i2c_adap,
   u32 eeprom_addr, u8 *eeprom_buf,
  -u16 bytes)
  +u32 bytes)
   {
return amdgpu_eeprom_xfer(i2c_adap, eeprom_addr, eeprom_buf, bytes,
  true);
  @@ -233,7 +233,7 @@ int amdgpu_eeprom_read(struct i2c_adapter *i2c_adap,

   int amdgpu_eeprom_write(struct i2c_adapter *i2c_adap,
u32 eeprom_addr, u8 *eeprom_buf,
  - u16 bytes)
  + u32 bytes)
   {
return amdgpu_eeprom_xfer(i2c_adap, eeprom_addr, eeprom_buf, bytes,
  false);
  diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_eeprom.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_eeprom.h
  index 6935adb2be1f..8083b8253ef4 100644
  --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_eeprom.h
  +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_eeprom.h
  @@ -28,10 +28,10 @@

   int amdgpu_eeprom_read(struct i2c_adapter *i2c_adap,
   u32 eeprom_addr, u8 *eeprom_buf,
  -u16 bytes);
  +u32 bytes);

   int amdgpu_eeprom_write(struct i2c_adapter *i2c_adap,
u32 eeprom_addr, u8 *eeprom_buf,
  - u16 bytes);
  + u32 bytes);

   #endif
  --
  2.34.1



RE: [PATCH 3/3] drm/amdgpu: fix ACA no query result after gpu reset

2024-05-17 Thread Zhou1, Tao
[AMD Official Use Only - AMD Internal Distribution Only]

The series is Reviewed-by: Tao Zhou 

> -Original Message-
> From: Wang, Yang(Kevin) 
> Sent: Friday, May 17, 2024 11:41 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Chai, Thomas 
> Subject: [PATCH 3/3] drm/amdgpu: fix ACA no query result after gpu reset
>
> fix ACA no query result after gpu reset.
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 7 ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h | 1 -
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 -
>  3 files changed, 4 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> index 05062f2581a1..6c6c387e5a06 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> @@ -712,13 +712,6 @@ void amdgpu_aca_fini(struct amdgpu_device *adev)
>   atomic_set(&aca->ue_update_flag, 0);
>  }
>
> -int amdgpu_aca_reset(struct amdgpu_device *adev)
> -{
> - amdgpu_aca_fini(adev);
> -
> - return amdgpu_aca_init(adev);
> -}
> -
>  void amdgpu_aca_set_smu_funcs(struct amdgpu_device *adev, const struct aca_smu_funcs *smu_funcs)
>  {
>   struct amdgpu_aca *aca = &adev->aca;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h
> index ba724c2a997d..4327ce1ceacf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h
> @@ -192,7 +192,6 @@ struct aca_info {
>
>  int amdgpu_aca_init(struct amdgpu_device *adev);
>  void amdgpu_aca_fini(struct amdgpu_device *adev);
> -int amdgpu_aca_reset(struct amdgpu_device *adev);
>  void amdgpu_aca_set_smu_funcs(struct amdgpu_device *adev, const struct aca_smu_funcs *smu_funcs);
>  bool amdgpu_aca_is_enabled(struct amdgpu_device *adev);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 06c5f6e2ef7c..5af813eacfb3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -3617,12 +3617,11 @@ int amdgpu_ras_late_init(struct amdgpu_device
> *adev)
>   amdgpu_ras_event_mgr_init(adev);
>
>   if (amdgpu_aca_is_enabled(adev)) {
> - if (amdgpu_in_reset(adev))
> - r = amdgpu_aca_reset(adev);
> -  else
> + if (!amdgpu_in_reset(adev)) {
>   r = amdgpu_aca_init(adev);
> - if (r)
> - return r;
> + if (r)
> + return r;
> + }
>
>   if (!amdgpu_sriov_vf(adev))
>   amdgpu_ras_set_aca_debug_mode(adev, false);
> --
> 2.34.1



RE: [PATCH] drm/amdgpu: add ACA error query support for umc_v12_0

2024-04-25 Thread Zhou1, Tao
[AMD Official Use Only - General]

> -Original Message-
> From: Wang, Yang(Kevin) 
> Sent: Wednesday, April 17, 2024 11:10 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Chai, Thomas 
> Subject: [PATCH] drm/amdgpu: add ACA error query support for umc_v12_0
>
> add ACA error query support for umc_v12_0.
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c |  6 +++---
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  4 
> drivers/gpu/drm/amd/amdgpu/umc_v12_0.c  | 18 ++
>  3 files changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 352ce16a0963..46b7f0c5cd8a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1268,9 +1268,9 @@ int amdgpu_ras_unbind_aca(struct amdgpu_device
> *adev, enum amdgpu_ras_block blk)
>   return 0;
>  }
>
> -static int amdgpu_aca_log_ras_error_data(struct amdgpu_device *adev, enum
> amdgpu_ras_block blk,
> -  enum aca_error_type type, struct
> ras_err_data *err_data,
> -  struct ras_query_context *qctx)
> +int amdgpu_aca_log_ras_error_data(struct amdgpu_device *adev, enum
> amdgpu_ras_block blk,
> +   enum aca_error_type type, struct ras_err_data
> *err_data,
> +   struct ras_query_context *qctx)
>  {
>   struct ras_manager *obj;
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 8d26989c75c8..487548879c49 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -898,6 +898,10 @@ int amdgpu_ras_unbind_aca(struct amdgpu_device *adev, enum amdgpu_ras_block blk)
>  ssize_t amdgpu_ras_aca_sysfs_read(struct device *dev, struct device_attribute *attr,
>   struct aca_handle *handle, char *buf, void *data);
>
> +int amdgpu_aca_log_ras_error_data(struct amdgpu_device *adev, enum
> amdgpu_ras_block blk,
> +   enum aca_error_type type, struct ras_err_data
> *err_data,
> +   struct ras_query_context *qctx);

[Tao] Is amdgpu_aca_log_ras_error_data() actually used in this patch?

> +
>  void amdgpu_ras_add_mca_err_addr(struct ras_err_info *err_info,
>   struct ras_err_addr *err_addr);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> index f69871902233..9f2c46814a4f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> @@ -317,16 +317,26 @@ static int umc_v12_0_err_cnt_init_per_channel(struct amdgpu_device *adev,
>  static void umc_v12_0_ecc_info_query_ras_error_count(struct amdgpu_device *adev,
>   void *ras_error_status)
>  {
> + struct ras_err_data *err_data = (struct ras_err_data *)ras_error_status;
>   struct ras_query_context qctx;
>
>   memset(&qctx, 0, sizeof(qctx));
>   qctx.event_id = amdgpu_ras_acquire_event_id(adev, amdgpu_ras_intr_triggered() ?
>   RAS_EVENT_TYPE_ISR : RAS_EVENT_TYPE_INVALID);
>
> - amdgpu_mca_smu_log_ras_error(adev,
> - AMDGPU_RAS_BLOCK__UMC, AMDGPU_MCA_ERROR_TYPE_CE, ras_error_status, &qctx);
> - amdgpu_mca_smu_log_ras_error(adev,
> - AMDGPU_RAS_BLOCK__UMC, AMDGPU_MCA_ERROR_TYPE_UE, ras_error_status, &qctx);
> + if (amdgpu_aca_is_enabled(adev)) {
> + amdgpu_aca_get_error_data(adev, AMDGPU_RAS_BLOCK__UMC, ACA_ERROR_TYPE_CE,
> +   err_data, &qctx);
> + amdgpu_aca_get_error_data(adev, AMDGPU_RAS_BLOCK__UMC, ACA_ERROR_TYPE_UE,
> +   err_data, &qctx);
> + amdgpu_aca_get_error_data(adev, AMDGPU_RAS_BLOCK__UMC, ACA_ERROR_TYPE_DEFERRED,
> +   err_data, &qctx);
> + } else {
> + amdgpu_mca_smu_log_ras_error(adev, AMDGPU_RAS_BLOCK__UMC, AMDGPU_MCA_ERROR_TYPE_CE,
> +  err_data, &qctx);
> + amdgpu_mca_smu_log_ras_error(adev, AMDGPU_RAS_BLOCK__UMC, AMDGPU_MCA_ERROR_TYPE_UE,
> +  err_data, &qctx);
> + }
>  }
>
>  static void umc_v12_0_ecc_info_query_ras_error_address(struct
> amdgpu_device *adev,
> --
> 2.34.1



RE: [PATCH] drm/amdgpu: skip to create ras xxx_err_count node when ACA is enabled

2024-04-25 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

[Tao] It would be better to add a comment explaining how to get the error count
when ACA is enabled.

BTW, given this change, do we need to update the ras tool?

> -Original Message-
> From: Wang, Yang(Kevin) 
> Sent: Wednesday, April 24, 2024 10:50 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> 
> Subject: [PATCH] drm/amdgpu: skip to create ras xxx_err_count node when ACA
> is enabled
>
> skip to create 'xxx_err_count' node when ACA is enabled.
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 1e2b866751c3..96a8359b703b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1756,6 +1756,9 @@ int amdgpu_ras_sysfs_create(struct amdgpu_device
> *adev,
>   if (!obj || obj->attr_inuse)
>   return -EINVAL;
>
> + if (amdgpu_aca_is_enabled(adev))
> + return 0;
> +
>   get_obj(obj);
>
>   snprintf(obj->fs_data.sysfs_name, sizeof(obj->fs_data.sysfs_name),
> @@ -1790,6 +1793,9 @@ int amdgpu_ras_sysfs_remove(struct amdgpu_device *adev,
>   if (!obj || !obj->attr_inuse)
>   return -EINVAL;
>
> + if (amdgpu_aca_is_enabled(adev))
> + return 0;
> +
>   if (adev->dev->kobj.sd)
>   sysfs_remove_file_from_group(&adev->dev->kobj,
>   &obj->sysfs_attr.attr,
> --
> 2.34.1



RE: [PATCH 4/4] drm/amdgpu: avoid dump mca bank log muti times during ras ISR

2024-04-25 Thread Zhou1, Tao
[AMD Official Use Only - General]

> -Original Message-
> From: Wang, Yang(Kevin) 
> Sent: Tuesday, April 23, 2024 4:27 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Li, Candice 
> Subject: [PATCH 4/4] drm/amdgpu: avoid dump mca bank log muti times during
> ras ISR
>
> Because the UE valid MCA count is only cleared after a GPU reset, dump the
> MCA log only the first time the MCA bank is fetched after a RAS interrupt is
> received.
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c | 28 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h |  1 +
>  2 files changed, 29 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
> index 264f56fd4f66..b581523fa8d7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
> @@ -229,6 +229,8 @@ int amdgpu_mca_init(struct amdgpu_device *adev)
>   struct mca_bank_cache *mca_cache;
>   int i;
>
> + atomic_set(&mca->ue_update_flag, 0);
> +
>   for (i = 0; i < ARRAY_SIZE(mca->mca_caches); i++) {
>   mca_cache = &mca->mca_caches[i];
>   mutex_init(&mca_cache->lock);
> @@ -244,6 +246,8 @@ void amdgpu_mca_fini(struct amdgpu_device *adev)
>   struct mca_bank_cache *mca_cache;
>   int i;
>
> + atomic_set(&mca->ue_update_flag, 0);
> +
>   for (i = 0; i < ARRAY_SIZE(mca->mca_caches); i++) {
>   mca_cache = &mca->mca_caches[i];
>   amdgpu_mca_bank_set_release(&mca_cache->mca_set);
> @@ -325,6 +329,27 @@ static int amdgpu_mca_smu_get_mca_entry(struct
> amdgpu_device *adev, enum amdgpu_
>   return mca_funcs->mca_get_mca_entry(adev, type, idx, entry);
>  }
>
> +static bool amdgpu_mca_bank_should_update(struct amdgpu_device *adev,
> +       enum amdgpu_mca_error_type type)
> +{
> + struct amdgpu_mca *mca = &adev->mca;
> + bool ret = true;
> +
> + /*
> +  * Because the UE Valid MCA count will only be cleared after reset,
> +  * in order to avoid repeated counting of the error count,
> +  * the aca bank is only updated once during the gpu recovery stage.
> +  */
> + if (type == AMDGPU_MCA_ERROR_TYPE_UE) {
> + if (amdgpu_ras_intr_triggered())
> + ret = atomic_cmpxchg(&mca->ue_update_flag, 0, 1) == 0;
> + else
> + atomic_set(&mca->ue_update_flag, 0);
> + }
> +
> + return ret;
> +}
> +
> +

[Tao] The second blank line here is redundant. With this fixed, the patch is:

Reviewed-by: Tao Zhou 
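
For readers unfamiliar with the compare-and-swap pattern used in amdgpu_mca_bank_should_update, a rough user-space sketch with C11 atomics follows. It is an illustration under assumptions, not the kernel code: bank_should_update and the file-scope ue_update_flag are hypothetical names, atomic_compare_exchange_strong stands in for the kernel's atomic_cmpxchg, and the error-type check from the patch is left out for brevity.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag: 0 = not yet reported for this event, 1 = already reported. */
static atomic_int ue_update_flag;

/*
 * Returns true when the caller should refresh the bank data.
 * While an interrupt is being handled, only the first caller wins the
 * compare-and-swap; outside of interrupt handling the flag is re-armed.
 */
static bool bank_should_update(bool intr_triggered)
{
	int expected = 0;

	if (intr_triggered)
		return atomic_compare_exchange_strong(&ue_update_flag, &expected, 1);

	atomic_store(&ue_update_flag, 0);
	return true;
}
```

The first call during an interrupt returns true, later calls during the same interrupt return false, and any call made outside interrupt handling resets the flag so the next interrupt is reported again.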

>  static int amdgpu_mca_smu_get_mca_set(struct amdgpu_device *adev, enum amdgpu_mca_error_type type, struct mca_bank_set *mca_set,
>   struct ras_query_context *qctx)
>  {
> @@ -335,6 +360,9 @@ static int amdgpu_mca_smu_get_mca_set(struct amdgpu_device *adev, enum amdgpu_mc
>   if (!mca_set)
>   return -EINVAL;
>
> + if (!amdgpu_mca_bank_should_update(adev, type))
> + return 0;
> +
>   ret = amdgpu_mca_smu_get_valid_mca_count(adev, type, &count);
>   if (ret)
>   return ret;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h
> index 9b97cfa28e05..e80323ff90c1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h
> @@ -93,6 +93,7 @@ struct amdgpu_mca {
>   struct amdgpu_mca_ras mpio;
>   const struct amdgpu_mca_smu_funcs *mca_funcs;
>   struct mca_bank_cache mca_caches[AMDGPU_MCA_ERROR_TYPE_DE];
> + atomic_t ue_update_flag;
>  };
>
>  enum mca_reg_idx {
> --
> 2.34.1



RE: [PATCH 15/15] drm/amdgpu: Use new interface to reserve bad page

2024-04-22 Thread Zhou1, Tao
[AMD Official Use Only - General]

With my concern fixed, the series is:

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Chai, Thomas 
> Sent: Thursday, April 18, 2024 5:35 PM
> To: Christian König ; amd-
> g...@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Li, Candice ; Wang, Yang(Kevin)
> ; Yang, Stanley 
> Subject: RE: [PATCH 15/15] drm/amdgpu: Use new interface to reserve bad page
>
> [AMD Official Use Only - General]
>
> -
> Best Regards,
> Thomas
>
> -Original Message-
> From: Christian König 
> Sent: Thursday, April 18, 2024 5:01 PM
> To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
> Cc: Chai, Thomas ; Zhang, Hawking
> ; Zhou1, Tao ; Li, Candice
> ; Wang, Yang(Kevin) ; Yang,
> Stanley 
> Subject: Re: [PATCH 15/15] drm/amdgpu: Use new interface to reserve bad page
>
> On 18.04.24 at 04:58, YiPeng Chai wrote:
> > Use new interface to reserve bad page.
> >
> > Signed-off-by: YiPeng Chai 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +---
> >   1 file changed, 1 insertion(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index d1a2ab944b7d..dee66db10fa2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -2548,9 +2548,7 @@ int amdgpu_ras_add_bad_pages(struct
> amdgpu_device *adev,
> >   goto out;
> >   }
> >
> > - amdgpu_vram_mgr_reserve_range(&adev->mman.vram_mgr,
> > - bps[i].retired_page << AMDGPU_GPU_PAGE_SHIFT,
> > - AMDGPU_GPU_PAGE_SIZE);
>
> > Where has the call to reserve the VRAM range been moved to?
>
> [Thomas] It is now called in amdgpu_ras_reserve_page. For amdgpu_ras_reserve_page,
> refer to "[PATCH 01/15] drm/amdgpu: Add interface to reserve bad page".
>
> Regards,
> Christian.
>
> > + amdgpu_ras_reserve_page(adev, bps[i].retired_page);
> >
> >   memcpy(&data->bps[data->count], &bps[i], sizeof(*data->bps));
> >   data->count++;
>



RE: [PATCH 10/15] drm/amdgpu: retire bad pages for umc v12_0

2024-04-22 Thread Zhou1, Tao
[AMD Official Use Only - General]

> -Original Message-
> From: Chai, Thomas 
> Sent: Thursday, April 18, 2024 10:59 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chai, Thomas ; Zhang, Hawking
> ; Zhou1, Tao ; Li, Candice
> ; Wang, Yang(Kevin) ; Yang,
> Stanley ; Chai, Thomas 
> Subject: [PATCH 10/15] drm/amdgpu: retire bad pages for umc v12_0
>
> Retire bad pages for umc v12_0.
>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/umc_v12_0.c | 57 +-
>  1 file changed, 55 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> index 6c2b61ef5b57..bd917eb6ea24 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> @@ -28,6 +28,8 @@
>  #include "umc/umc_12_0_0_sh_mask.h"
>  #include "mp/mp_13_0_6_sh_mask.h"
>
> +#define MAX_ECC_NUM_PER_RETIREMENT  16

[Tao] We already have UMC_V12_0_BAD_PAGE_NUM_PER_CHANNEL for this purpose.

> +
>  static inline uint64_t get_umc_v12_0_reg_offset(struct amdgpu_device *adev,
>   uint32_t node_inst,
>   uint32_t umc_inst,
> @@ -633,6 +635,58 @@ static int umc_v12_0_update_ecc_status(struct
> amdgpu_device *adev,
>   return 0;
>  }
>
> +static int umc_v12_0_fill_error_record(struct amdgpu_device *adev,
> + struct ras_ecc_err *ecc_err, void *ras_error_status)
> +{
> + struct ras_err_data *err_data = (struct ras_err_data *)ras_error_status;
> + uint32_t i = 0;
> + int ret = 0;
> +
> + if (!err_data || !ecc_err)
> + return -EINVAL;
> +
> + for (i = 0; i < ecc_err->err_pages.count; i++) {
> + ret = amdgpu_umc_fill_error_record(err_data,
> + ecc_err->addr,
> + ecc_err->err_pages.pfn[i] << AMDGPU_GPU_PAGE_SHIFT,
> + MCA_IPID_2_UMC_CH(ecc_err->ipid),
> + MCA_IPID_2_UMC_INST(ecc_err->ipid));
> + if (ret)
> + break;
> + }
> +
> + err_data->de_count++;
> +
> + return ret;
> +}
> +
> +static void umc_v12_0_query_ras_ecc_err_addr(struct amdgpu_device *adev,
> + void *ras_error_status)
> +{
> + struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> + struct ras_ecc_err *entries[MAX_ECC_NUM_PER_RETIREMENT];
> + struct radix_tree_root *ecc_tree;
> + int new_detected, ret, i;
> +
> + ecc_tree = &con->umc_ecc_log.de_page_tree;
> +
> + mutex_lock(&con->umc_ecc_log.lock);
> + new_detected = radix_tree_gang_lookup_tag(ecc_tree, (void **)entries,
> + 0, ARRAY_SIZE(entries), UMC_ECC_NEW_DETECTED_TAG);
> + for (i = 0; i < new_detected; i++) {
> + if (!entries[i])
> + continue;
> +
> + ret = umc_v12_0_fill_error_record(adev, entries[i], ras_error_status);
> + if (ret) {
> + dev_err(adev->dev, "Fail to fill umc error record, ret:%d\n", ret);
> + break;
> + }
> + radix_tree_tag_clear(ecc_tree, entries[i]->hash_index, UMC_ECC_NEW_DETECTED_TAG);
> + }
> + mutex_unlock(&con->umc_ecc_log.lock);
> +}
> +
>  struct amdgpu_umc_ras umc_v12_0_ras = {
>   .ras_block = {
>   .hw_ops = &umc_v12_0_ras_hw_ops,
> @@ -640,8 +694,7 @@ struct amdgpu_umc_ras umc_v12_0_ras = {
>   },
>   .err_cnt_init = umc_v12_0_err_cnt_init,
>   .query_ras_poison_mode = umc_v12_0_query_ras_poison_mode,
> - .ecc_info_query_ras_error_count = umc_v12_0_ecc_info_query_ras_error_count,
> - .ecc_info_query_ras_error_address = umc_v12_0_ecc_info_query_ras_error_address,
> + .ecc_info_query_ras_error_address = umc_v12_0_query_ras_ecc_err_addr,
>   .check_ecc_err_status = umc_v12_0_check_ecc_err_status,
>   .update_ecc_status = umc_v12_0_update_ecc_status,
>  };
> --
> 2.34.1



RE: [PATCH] drm/amdgpu: Use driver mode reset for data poison

2024-04-15 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Hawking Zhang 
> Sent: Tuesday, April 16, 2024 2:16 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao 
> Cc: Zhang, Hawking 
> Subject: [PATCH] drm/amdgpu: Use driver mode reset for data poison
>
> mode-2 reset is the only reliable method that can get GC/SDMA back when
> poison is consumed. mmhub requires
> mode-1 reset.
>
> Signed-off-by: Hawking Zhang 
> ---
>  .../gpu/drm/amd/amdkfd/kfd_int_process_v9.c   | 27 ++-
>  1 file changed, 8 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> index c368c70df3f4a..c3beb872adf8d 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> @@ -144,7 +144,7 @@ static void event_interrupt_poison_consumption_v9(struct kfd_node *dev,
>   uint16_t pasid, uint16_t client_id)
>  {
>   enum amdgpu_ras_block block = 0;
> - int old_poison, ret = -EINVAL;
> + int old_poison;
>   uint32_t reset = 0;
>   struct kfd_process *p = kfd_lookup_process_by_pasid(pasid);
>
> @@ -163,17 +163,13 @@ static void
> event_interrupt_poison_consumption_v9(struct kfd_node *dev,
>   case SOC15_IH_CLIENTID_SE2SH:
>   case SOC15_IH_CLIENTID_SE3SH:
>   case SOC15_IH_CLIENTID_UTCL2:
> - ret = kfd_dqm_evict_pasid(dev->dqm, pasid);
>   block = AMDGPU_RAS_BLOCK__GFX;
> - if (ret)
> - reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
>   break;
>   case SOC15_IH_CLIENTID_VMC:
>   case SOC15_IH_CLIENTID_VMC1:
> - ret = kfd_dqm_evict_pasid(dev->dqm, pasid);
>   block = AMDGPU_RAS_BLOCK__MMHUB;
> - if (ret)
> - reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> + reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
>   break;
>   case SOC15_IH_CLIENTID_SDMA0:
>   case SOC15_IH_CLIENTID_SDMA1:
> @@ -184,22 +180,15 @@ static void
> event_interrupt_poison_consumption_v9(struct kfd_node *dev,
>   reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
>   break;
>   default:
> - break;
> + dev_warn(dev->adev->dev,
> +  "client %d does not support poison consumption\n", client_id);
> + return;
>   }
>
>   kfd_signal_poison_consumed_event(dev, pasid);
>
> - /* resetting queue passes, do page retirement without gpu reset
> -  * resetting queue fails, fallback to gpu reset solution
> -  */
> - if (!ret)
> - dev_warn(dev->adev->dev,
> - "RAS poison consumption, unmap queue flow succeeded: client id %d\n",
> - client_id);
> - else
> - dev_warn(dev->adev->dev,
> - "RAS poison consumption, fall back to gpu reset flow: client id %d\n",
> - client_id);
> + dev_warn(dev->adev->dev,
> +  "poison is consumed by client %d, kick off gpu reset flow\n", client_id);
>
>   amdgpu_amdkfd_ras_poison_consumption_handler(dev->adev, block, reset);
>  }
> --
> 2.17.1



RE: [PATCH] drm/amdgpu: Use driver mode reset for data poison handling

2024-04-15 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Hawking Zhang 
> Sent: Tuesday, April 16, 2024 12:34 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao 
> Cc: Zhang, Hawking 
> Subject: [PATCH] drm/amdgpu: Use driver mode reset for data poison handling
>
> mode-2 reset is the only reliable method that can get GC/SDMA back when
> poison is consumed. mmhub requires
> mode-1 reset.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 8 ++--
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> index c368c70df3f4a..b6caf6eda8a0c 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
> @@ -163,17 +163,13 @@ static void
> event_interrupt_poison_consumption_v9(struct kfd_node *dev,
>   case SOC15_IH_CLIENTID_SE2SH:
>   case SOC15_IH_CLIENTID_SE3SH:
>   case SOC15_IH_CLIENTID_UTCL2:
> - ret = kfd_dqm_evict_pasid(dev->dqm, pasid);
>   block = AMDGPU_RAS_BLOCK__GFX;
> - if (ret)
> - reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
> + reset = AMDGPU_RAS_GPU_RESET_MODE2_RESET;
>   break;
>   case SOC15_IH_CLIENTID_VMC:
>   case SOC15_IH_CLIENTID_VMC1:
> - ret = kfd_dqm_evict_pasid(dev->dqm, pasid);
>   block = AMDGPU_RAS_BLOCK__MMHUB;
> - if (ret)
> - reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
> + reset = AMDGPU_RAS_GPU_RESET_MODE1_RESET;
>   break;
>   case SOC15_IH_CLIENTID_SDMA0:
>   case SOC15_IH_CLIENTID_SDMA1:
> --
> 2.17.1



RE: [PATCH V2] drm/amdgpu: Fix incorrect return value

2024-04-12 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Chai, Thomas 
> Sent: Friday, April 12, 2024 4:56 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chai, Thomas ; Zhang, Hawking
> ; Zhou1, Tao ; Li, Candice
> ; Wang, Yang(Kevin) ; Yang,
> Stanley ; Chai, Thomas 
> Subject: [PATCH V2] drm/amdgpu: Fix incorrect return value
>
> [Why]
>   After calling amdgpu_vram_mgr_reserve_range multiple times with the same
> address, calling amdgpu_vram_mgr_query_page_status will always return
> -EBUSY.
>   From the second call to amdgpu_vram_mgr_reserve_range onward, the same
> address is added to the reservations_pending list again and is never moved to
> the reserved_pages list, because the address has already been reserved.
>
> [How]
>   First check the address status before calling
> amdgpu_vram_mgr_do_reserve. If the address is already reserved, do nothing. If
> the address is already in the reservations_pending list, reserve the memory
> directly. Only add new nodes for addresses that are in neither the
> reserved_pages list nor the reservations_pending list.
>
> V2:
>  Avoid repeated locking/unlocking.
>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 25 +---
>  1 file changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> index 1e36c428d254..a636d3f650b1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> @@ -317,7 +317,6 @@ static void amdgpu_vram_mgr_do_reserve(struct
> ttm_resource_manager *man)
>
>   dev_dbg(adev->dev, "Reservation 0x%llx - %lld, Succeeded\n",
>   rsv->start, rsv->size);
> -
>   vis_usage = amdgpu_vram_mgr_vis_size(adev, block);
>   atomic64_add(vis_usage, &mgr->vis_usage);
>   spin_lock(&man->bdev->lru_lock);
> @@ -340,19 +339,27 @@ int amdgpu_vram_mgr_reserve_range(struct
> amdgpu_vram_mgr *mgr,
> uint64_t start, uint64_t size)
>  {
>   struct amdgpu_vram_reservation *rsv;
> + int ret = 0;
>
> - rsv = kzalloc(sizeof(*rsv), GFP_KERNEL);
> - if (!rsv)
> - return -ENOMEM;
> + ret = amdgpu_vram_mgr_query_page_status(mgr, start);
> + if (!ret)
> + return 0;
>
> - INIT_LIST_HEAD(&rsv->allocated);
> - INIT_LIST_HEAD(&rsv->blocks);
> + if (ret == -ENOENT) {
> + rsv = kzalloc(sizeof(*rsv), GFP_KERNEL);
> + if (!rsv)
> + return -ENOMEM;
>
> - rsv->start = start;
> - rsv->size = size;
> + INIT_LIST_HEAD(&rsv->allocated);
> + INIT_LIST_HEAD(&rsv->blocks);
> +
> + rsv->start = start;
> + rsv->size = size;
> + }
>
>   mutex_lock(&mgr->lock);
> - list_add_tail(&rsv->blocks, &mgr->reservations_pending);
> + if (ret == -ENOENT)
> + list_add_tail(&rsv->blocks, &mgr->reservations_pending);
>   amdgpu_vram_mgr_do_reserve(&mgr->manager);
>   mutex_unlock(&mgr->lock);
>
> --
> 2.34.1
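
Note: the V2 flow above can be modelled outside the kernel. The following is a minimal user-space sketch, not the driver code: every name (model_mgr, model_rsv, model_query_status, model_reserve_range, page_free) is hypothetical, locking is omitted, and the three status values are taken from the commit message and diff (0 when reserved, -EBUSY when still pending, -ENOENT when the address is unknown).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names are the driver's. */
enum { MAX_RSV = 8 };

struct model_rsv {
	uint64_t start;
	bool reserved;	/* false: still pending, true: reserved */
};

struct model_mgr {
	struct model_rsv rsv[MAX_RSV];
	int count;
};

/* 0 = reserved, -EBUSY = pending, -ENOENT = address not known yet. */
static int model_query_status(const struct model_mgr *mgr, uint64_t start)
{
	for (int i = 0; i < mgr->count; i++) {
		if (mgr->rsv[i].start == start)
			return mgr->rsv[i].reserved ? 0 : -EBUSY;
	}
	return -ENOENT;
}

/* page_free stands in for whether the range can be reserved right now. */
static int model_reserve_range(struct model_mgr *mgr, uint64_t start,
			       bool page_free)
{
	int ret = model_query_status(mgr, start);

	if (!ret)
		return 0;	/* already reserved: nothing to do */

	if (ret == -ENOENT) {	/* new address: queue exactly one node */
		if (mgr->count == MAX_RSV)
			return -ENOMEM;
		mgr->rsv[mgr->count].start = start;
		mgr->rsv[mgr->count].reserved = false;
		mgr->count++;
	}

	/* Pending or just queued: retry the reservation. */
	for (int i = 0; i < mgr->count; i++) {
		if (mgr->rsv[i].start == start && page_free)
			mgr->rsv[i].reserved = true;
	}
	return 0;
}
```

In this model a second call with the same address never adds a second node, so the status can move from -EBUSY to 0 once the page becomes free.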



RE: [PATCH] drm/amdgpu: add new aca smu callback func parse_error_code{}

2024-04-11 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: amd-gfx  On Behalf Of Yang
> Wang
> Sent: Friday, April 12, 2024 10:54 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Wang, Yang(Kevin) 
> Subject: [PATCH] drm/amdgpu: add new aca smu callback func
> parse_error_code{}
>
> add new aca smu callback parse_error_code{} to avoid specific asic check in
> amdgpu_aca.c file
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c   | 23 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h   |  1 +
>  .../drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c  | 13 +++
>  3 files changed, 22 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> index cb6a40a042e1..d1059e4d54d0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> @@ -753,23 +753,13 @@ int aca_bank_info_decode(struct aca_bank *bank,
> struct aca_bank_info *info)
>
>  static int aca_bank_get_error_code(struct amdgpu_device *adev, struct
> aca_bank *bank)  {
> - int error_code;
> -
> - switch (amdgpu_ip_version(adev, MP1_HWIP, 0)) {
> - case IP_VERSION(13, 0, 6):
> - if (!(adev->flags & AMD_IS_APU) && adev->pm.fw_version >=
> 0x00555600) {
> - error_code =
> ACA_REG__SYND__ERRORINFORMATION(bank->regs[ACA_REG_IDX_SYND]);
> - return error_code & 0xff;
> - }
> - break;
> - default:
> - break;
> - }
> + struct amdgpu_aca *aca = &adev->aca;
> + const struct aca_smu_funcs *smu_funcs = aca->smu_funcs;
>
> - /* NOTE: the true error code is encoded in status.errorcode[0:7] */
> - error_code = ACA_REG__STATUS__ERRORCODE(bank-
> >regs[ACA_REG_IDX_STATUS]);
> + if (!smu_funcs || !smu_funcs->parse_error_code)
> + return -EOPNOTSUPP;
>
> - return error_code & 0xff;
> + return smu_funcs->parse_error_code(adev, bank);
>  }
>
>  int aca_bank_check_error_codes(struct amdgpu_device *adev, struct aca_bank
> *bank, int *err_codes, int size) @@ -780,6 +770,9 @@ int
> aca_bank_check_error_codes(struct amdgpu_device *adev, struct aca_bank
> *bank
>   return -EINVAL;
>
>   error_code = aca_bank_get_error_code(adev, bank);
> + if (error_code < 0)
> + return error_code;
> +
>   for (i = 0; i < size; i++) {
>   if (err_codes[i] == error_code)
>   return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h
> index 3765843ea648..5ef6b745f222 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h
> @@ -173,6 +173,7 @@ struct aca_smu_funcs {
>   int (*set_debug_mode)(struct amdgpu_device *adev, bool enable);
>   int (*get_valid_aca_count)(struct amdgpu_device *adev, enum
> aca_smu_type type, u32 *count);
>   int (*get_valid_aca_bank)(struct amdgpu_device *adev, enum
> aca_smu_type type, int idx, struct aca_bank *bank);
> + int (*parse_error_code)(struct amdgpu_device *adev, struct aca_bank
> +*bank);
>  };
>
>  struct amdgpu_aca {
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> index d6d5be26e222..59e5c6256ea2 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> @@ -3119,12 +3119,25 @@ static int aca_smu_get_valid_aca_bank(struct
> amdgpu_device *adev,
>   return 0;
>  }
>
> +static int aca_smu_parse_error_code(struct amdgpu_device *adev, struct aca_bank *bank)
> +{
> + int error_code;
> +
> + if (!(adev->flags & AMD_IS_APU) && adev->pm.fw_version >= 0x00555600)
> + error_code = ACA_REG__SYND__ERRORINFORMATION(bank->regs[ACA_REG_IDX_SYND]);
> + else
> + error_code = ACA_REG__STATUS__ERRORCODE(bank->regs[ACA_REG_IDX_STATUS]);
> +
> + return error_code & 0xff;
> +}
> +
>  static const struct aca_smu_funcs smu_v13_0_6_aca_smu_funcs = {
>   .max_ue_bank_count = 12,
>   .max_ce_bank_count = 12,
>   .set_debug_mode = aca_smu_set_debug_mode,
>   .get_valid_aca_count = aca_smu_get_valid_aca_count,
>   .get_valid_aca_bank = aca_smu_get_valid_aca_bank,
> + .parse_error_code = aca_smu_parse_error_code,
>  };
>
>  static int smu_v13_0_6_select_xgmi_plpd_policy(struct smu_context *smu,
> --
> 2.34.1
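
Note: the pattern in this patch (generic code dispatches through an optional per-ASIC callback and reports -EOPNOTSUPP when it is missing) can be shown in isolation. This is a rough user-space sketch under stated assumptions: demo_bank, demo_funcs, and the demo_* functions are hypothetical stand-ins, not the amdgpu types, and the -EINVAL returned on "no match" is a choice made for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver types; only the pattern matters. */
struct demo_bank {
	uint64_t status;
};

struct demo_funcs {
	int (*parse_error_code)(const struct demo_bank *bank);
};

/* Example per-ASIC callback: keep only the low 8 bits of the register. */
static int demo_parse_low_byte(const struct demo_bank *bank)
{
	return (int)(bank->status & 0xff);
}

/* Generic code: no ASIC check, just dispatch or report "not supported". */
static int demo_get_error_code(const struct demo_funcs *funcs,
			       const struct demo_bank *bank)
{
	if (!funcs || !funcs->parse_error_code)
		return -EOPNOTSUPP;

	return funcs->parse_error_code(bank);
}

/* 0 on match, -EINVAL on no match, other negative errno if unsupported. */
static int demo_check_error_codes(const struct demo_funcs *funcs,
				  const struct demo_bank *bank,
				  const int *codes, int size)
{
	int code = demo_get_error_code(funcs, bank);

	if (code < 0)
		return code;	/* propagate instead of comparing a bogus code */

	for (int i = 0; i < size; i++) {
		if (codes[i] == code)
			return 0;
	}
	return -EINVAL;
}
```

The early "code < 0" return matters once the getter can fail: without it, a negative errno would be compared against the list of expected error codes.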



RE: [PATCH] drm/amdgpu: Fix incorrect return value

2024-04-08 Thread Zhou1, Tao
[AMD Official Use Only - General]

> -Original Message-
> From: Chai, Thomas 
> Sent: Wednesday, April 3, 2024 3:07 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chai, Thomas ; Zhang, Hawking
> ; Zhou1, Tao ; Li, Candice
> ; Wang, Yang(Kevin) ; Yang,
> Stanley ; Chai, Thomas 
> Subject: [PATCH] drm/amdgpu: Fix incorrect return value
>
> [Why]
>   After calling amdgpu_vram_mgr_reserve_range multiple times with the same
> address, calling amdgpu_vram_mgr_query_page_status will always return
> -EBUSY.
>   From the second call to amdgpu_vram_mgr_reserve_range, the same address
> will be added to the reservations_pending list again and is never moved to the
> reserved_pages list because the address had been reserved.
>
> [How]
>   First add the address status check before calling
> amdgpu_vram_mgr_do_reserve, if the address is already reserved, do nothing; If
> the address is already in the reservations_pending list, directly reserve
> memory; only add new nodes for the addresses that are not in the
> reserved_pages list and reservations_pending list.
>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 28 +---
>  1 file changed, 19 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> index 1e36c428d254..0bf3f4092900 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> @@ -317,7 +317,6 @@ static void amdgpu_vram_mgr_do_reserve(struct
> ttm_resource_manager *man)
>
>   dev_dbg(adev->dev, "Reservation 0x%llx - %lld, Succeeded\n",
>   rsv->start, rsv->size);
> -
>   vis_usage = amdgpu_vram_mgr_vis_size(adev, block);
>   atomic64_add(vis_usage, &mgr->vis_usage);
>   spin_lock(&man->bdev->lru_lock);
> @@ -340,19 +339,30 @@ int amdgpu_vram_mgr_reserve_range(struct
> amdgpu_vram_mgr *mgr,
> uint64_t start, uint64_t size)
>  {
>   struct amdgpu_vram_reservation *rsv;
> + int ret = 0;
>
> - rsv = kzalloc(sizeof(*rsv), GFP_KERNEL);
> - if (!rsv)
> - return -ENOMEM;
> + ret = amdgpu_vram_mgr_query_page_status(mgr, start);
> + if (!ret)
> + return 0;
> +
> + if (ret == -ENOENT) {
> + rsv = kzalloc(sizeof(*rsv), GFP_KERNEL);
> + if (!rsv)
> + return -ENOMEM;
>
> - INIT_LIST_HEAD(&rsv->allocated);
> - INIT_LIST_HEAD(&rsv->blocks);
> + INIT_LIST_HEAD(&rsv->allocated);
> + INIT_LIST_HEAD(&rsv->blocks);
>
> - rsv->start = start;
> - rsv->size = size;
> + rsv->start = start;
> + rsv->size = size;
> +
> + mutex_lock(&mgr->lock);
> + list_add_tail(&rsv->blocks, &mgr->reservations_pending);
> + mutex_unlock(&mgr->lock);

[Tao] We can drop this mutex_unlock and guard the second mutex_lock with
if (ret != -ENOENT), to avoid unlocking and relocking repeatedly.

> +
> + }
>
>   mutex_lock(&mgr->lock);
> - list_add_tail(&rsv->blocks, &mgr->reservations_pending);
>   amdgpu_vram_mgr_do_reserve(&mgr->manager);
>   mutex_unlock(&mgr->lock);
>
> --
> 2.34.1



RE: [PATCH] drm/amdgpu: Fix incorrect return value

2024-04-08 Thread Zhou1, Tao
[AMD Official Use Only - General]

> -Original Message-
> From: Chai, Thomas 
> Sent: Sunday, April 7, 2024 10:21 AM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Li, Candice
> ; Wang, Yang(Kevin) ; Yang,
> Stanley 
> Subject: RE: [PATCH] drm/amdgpu: Fix incorrect return value
>
> [AMD Official Use Only - General]
>
> -
> Best Regards,
> Thomas
>
> -Original Message-
> From: Zhou1, Tao 
> Sent: Wednesday, April 3, 2024 6:36 PM
> To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Li, Candice
> ; Wang, Yang(Kevin) ; Yang,
> Stanley 
> Subject: RE: [PATCH] drm/amdgpu: Fix incorrect return value
>
> [AMD Official Use Only - General]
>
> > -Original Message-
> > From: Chai, Thomas 
> > Sent: Wednesday, April 3, 2024 3:07 PM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Chai, Thomas ; Zhang, Hawking
> > ; Zhou1, Tao ; Li, Candice
> > ; Wang, Yang(Kevin) ;
> > Yang, Stanley ; Chai, Thomas
> > 
> > Subject: [PATCH] drm/amdgpu: Fix incorrect return value
> >
> > [Why]
> >   After calling amdgpu_vram_mgr_reserve_range multiple times with the
> > same address, calling amdgpu_vram_mgr_query_page_status will always
> > return -EBUSY.
>
> >[Tao] could you explain why we call amdgpu_vram_mgr_reserve_range multiple
> times with the same  address? IIRC, we skip duplicate address before reserve
> memory.
>
> [Thomas]
>    When a poison creation interrupt is received, some poisoned addresses may
> already have been allocated by processes, so reserving that memory will fail.
> The reservation is tried again after the poisoned process has been killed in
> the subsequent poison consumption interrupt handler,
> so amdgpu_vram_mgr_reserve_range needs to be called multiple times with the
> same address.
>
> >   From the second call to amdgpu_vram_mgr_reserve_range, the same
> > address will be added to the reservations_pending list again and is
> > never moved to the reserved_pages list because the address had been
> reserved.

[Tao] But if a page is added to the reservations_pending list, it should also be
put in the data->bps array, and when we call amdgpu_ras_add_bad_pages again,
amdgpu_ras_check_bad_page_unlock can ignore this page.
So, apart from amdgpu_ras_add_bad_pages, do you intend to call
amdgpu_vram_mgr_reserve_range anywhere else?

> >
> > [How]
> >   First add the address status check before calling
> > amdgpu_vram_mgr_do_reserve, if the address is already reserved, do
> > nothing; If the address is already in the reservations_pending list,
> > directly reserve memory; only add new nodes for the addresses that are
> > not in the reserved_pages list and reservations_pending list.
> >
> > Signed-off-by: YiPeng Chai 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 28
> > +---
> >  1 file changed, 19 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> > index 1e36c428d254..0bf3f4092900 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> > @@ -317,7 +317,6 @@ static void amdgpu_vram_mgr_do_reserve(struct
> > ttm_resource_manager *man)
> >
> >   dev_dbg(adev->dev, "Reservation 0x%llx - %lld, Succeeded\n",
> >   rsv->start, rsv->size);
> > -
> >   vis_usage = amdgpu_vram_mgr_vis_size(adev, block);
> >   atomic64_add(vis_usage, &mgr->vis_usage);
> >   spin_lock(&man->bdev->lru_lock); @@ -340,19 +339,30 @@
> > int amdgpu_vram_mgr_reserve_range(struct
> > amdgpu_vram_mgr *mgr,
> > uint64_t start, uint64_t size)  {
> >   struct amdgpu_vram_reservation *rsv;
> > + int ret = 0;
> >
> > - rsv = kzalloc(sizeof(*rsv), GFP_KERNEL);
> > - if (!rsv)
> > - return -ENOMEM;
> > + ret = amdgpu_vram_mgr_query_page_status(mgr, start);
> > + if (!ret)
> > + return 0;
> > +
> > + if (ret == -ENOENT) {
> > + rsv = kzalloc(sizeof(*rsv), GFP_KERNEL);
> > + if (!rsv)
> > + return -ENOMEM;
> >
> > - INIT_LIST_HEAD(&rsv->allocated);
> > - INIT_LIST_HEAD(&rsv->blocks);
> > + INIT_LIST_HEAD(&rsv->allocated);
> > + INIT_LIST_HEAD(&rsv->blocks);
> >
> > - rsv->start = start;
> > - rsv->size = size;
> > + rsv->start = start;
> > + rsv->size = size;
> > +
> > + mutex_lock(&mgr->lock);
> > + list_add_tail(&rsv->blocks, &mgr->reservations_pending);
> > + mutex_unlock(&mgr->lock);
> > +
> > + }
> >
> >   mutex_lock(&mgr->lock);
> > - list_add_tail(&rsv->blocks, &mgr->reservations_pending);
> >   amdgpu_vram_mgr_do_reserve(&mgr->manager);
> >   mutex_unlock(&mgr->lock);
> >
> > --
> > 2.34.1
>
>



RE: [PATCH] drm/amdgpu: Fix incorrect return value

2024-04-03 Thread Zhou1, Tao
[AMD Official Use Only - General]

> -Original Message-
> From: Chai, Thomas 
> Sent: Wednesday, April 3, 2024 3:07 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chai, Thomas ; Zhang, Hawking
> ; Zhou1, Tao ; Li, Candice
> ; Wang, Yang(Kevin) ; Yang,
> Stanley ; Chai, Thomas 
> Subject: [PATCH] drm/amdgpu: Fix incorrect return value
>
> [Why]
>   After calling amdgpu_vram_mgr_reserve_range multiple times with the same
> address, calling amdgpu_vram_mgr_query_page_status will always return
> -EBUSY.

[Tao] Could you explain why we call amdgpu_vram_mgr_reserve_range multiple
times with the same address? IIRC, we skip duplicate addresses before reserving
memory.

>   From the second call to amdgpu_vram_mgr_reserve_range, the same address
> will be added to the reservations_pending list again and is never moved to the
> reserved_pages list because the address had been reserved.
>
> [How]
>   First add the address status check before calling
> amdgpu_vram_mgr_do_reserve, if the address is already reserved, do nothing; If
> the address is already in the reservations_pending list, directly reserve
> memory; only add new nodes for the addresses that are not in the
> reserved_pages list and reservations_pending list.
>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 28 +---
>  1 file changed, 19 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> index 1e36c428d254..0bf3f4092900 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> @@ -317,7 +317,6 @@ static void amdgpu_vram_mgr_do_reserve(struct
> ttm_resource_manager *man)
>
>   dev_dbg(adev->dev, "Reservation 0x%llx - %lld, Succeeded\n",
>   rsv->start, rsv->size);
> -
>   vis_usage = amdgpu_vram_mgr_vis_size(adev, block);
>   atomic64_add(vis_usage, &mgr->vis_usage);
>   spin_lock(&man->bdev->lru_lock);
> @@ -340,19 +339,30 @@ int amdgpu_vram_mgr_reserve_range(struct
> amdgpu_vram_mgr *mgr,
> uint64_t start, uint64_t size)
>  {
>   struct amdgpu_vram_reservation *rsv;
> + int ret = 0;
>
> - rsv = kzalloc(sizeof(*rsv), GFP_KERNEL);
> - if (!rsv)
> - return -ENOMEM;
> + ret = amdgpu_vram_mgr_query_page_status(mgr, start);
> + if (!ret)
> + return 0;
> +
> + if (ret == -ENOENT) {
> + rsv = kzalloc(sizeof(*rsv), GFP_KERNEL);
> + if (!rsv)
> + return -ENOMEM;
>
> - INIT_LIST_HEAD(&rsv->allocated);
> - INIT_LIST_HEAD(&rsv->blocks);
> + INIT_LIST_HEAD(&rsv->allocated);
> + INIT_LIST_HEAD(&rsv->blocks);
>
> - rsv->start = start;
> - rsv->size = size;
> + rsv->start = start;
> + rsv->size = size;
> +
> + mutex_lock(&mgr->lock);
> + list_add_tail(&rsv->blocks, &mgr->reservations_pending);
> + mutex_unlock(&mgr->lock);
> +
> + }
>
>   mutex_lock(&mgr->lock);
> - list_add_tail(&rsv->blocks, &mgr->reservations_pending);
>   amdgpu_vram_mgr_do_reserve(&mgr->manager);
>   mutex_unlock(&mgr->lock);
>
> --
> 2.34.1



RE: [PATCH] drm/amdgpu: Update EEPROM RAS table for mismatched table version

2024-03-29 Thread Zhou1, Tao
[AMD Official Use Only - General]

> -Original Message-
> From: amd-gfx  On Behalf Of Candice Li
> Sent: Wednesday, March 27, 2024 2:16 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Li, Candice 
> Subject: [PATCH] drm/amdgpu: Update EEPROM RAS table for mismatched table
> version
>
> Update table version and restore bad page records to EEPROM RAS table for
> mismatched table version case. Otherwise force to reset the table.
>
> Signed-off-by: Candice Li 
> ---
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 88 ---
>  1 file changed, 78 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 06a62a8a992e9b..42d0ef2f512474 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -1319,6 +1319,37 @@ static int __read_table_ras_info(struct
> amdgpu_ras_eeprom_control *control)
>   return res == RAS_TABLE_V2_1_INFO_SIZE ? 0 : res;  }
>
> +static bool amdgpu_ras_eeprom_table_version_validate(struct
> +amdgpu_ras_eeprom_control *control) {
> + struct amdgpu_device *adev = to_amdgpu_device(control);
> + struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr;
> +
> + switch (amdgpu_ip_version(adev, UMC_HWIP, 0)) {
> + case IP_VERSION(8, 10, 0):
> + case IP_VERSION(12, 0, 0):
> + return hdr->version == RAS_TABLE_VER_V2_1;
> + default:
> + return hdr->version == RAS_TABLE_VER_V1;
> + }
> +}
> +
> +static void amdgpu_ras_update_eeprom_control(struct
> +amdgpu_ras_eeprom_table_header *hdr) {
> + struct amdgpu_ras_eeprom_control *control =
> + container_of(hdr, struct amdgpu_ras_eeprom_control, tbl_hdr);
> +
> + if (hdr->version == RAS_TABLE_VER_V2_1) {
> + control->ras_num_recs = RAS_NUM_RECS_V2_1(hdr);
> + control->ras_record_offset = RAS_RECORD_START_V2_1;
> + control->ras_max_record_count =
> RAS_MAX_RECORD_COUNT_V2_1;
> + } else {
> + control->ras_num_recs = RAS_NUM_RECS(hdr);
> + control->ras_record_offset = RAS_RECORD_START;
> + control->ras_max_record_count = RAS_MAX_RECORD_COUNT;
> + }
> + control->ras_fri = RAS_OFFSET_TO_INDEX(control,
> +hdr->first_rec_offset); }
> +
>  int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
>  bool *exceed_err_limit)
>  {
> @@ -1326,7 +1357,9 @@ int amdgpu_ras_eeprom_init(struct
> amdgpu_ras_eeprom_control *control,
>   unsigned char buf[RAS_TABLE_HEADER_SIZE] = { 0 };
>   struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr;
>   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
> - int res;
> + int res, res1;
> + struct eeprom_table_record *bps;
> + u32 num_recs;
>
>   *exceed_err_limit = false;
>
> @@ -1355,16 +1388,51 @@ int amdgpu_ras_eeprom_init(struct
> amdgpu_ras_eeprom_control *control,
>
>   __decode_table_header_from_buf(hdr, buf);
>
> - if (hdr->version == RAS_TABLE_VER_V2_1) {
> - control->ras_num_recs = RAS_NUM_RECS_V2_1(hdr);
> - control->ras_record_offset = RAS_RECORD_START_V2_1;
> - control->ras_max_record_count =
> RAS_MAX_RECORD_COUNT_V2_1;
> - } else {
> - control->ras_num_recs = RAS_NUM_RECS(hdr);
> - control->ras_record_offset = RAS_RECORD_START;
> - control->ras_max_record_count = RAS_MAX_RECORD_COUNT;
> + amdgpu_ras_update_eeprom_control(hdr);
> +
> + if (!amdgpu_ras_eeprom_table_version_validate(control)) {
> + num_recs = control->ras_num_recs;
> + if (num_recs && amdgpu_bad_page_threshold) {
> + /* Save bad page records existed in EEPROM */
> + bps = kcalloc(num_recs, sizeof(*bps), GFP_KERNEL);
> + if (!bps)
> + return -ENOMEM;
> +
> + res1 = amdgpu_ras_eeprom_read(control, bps,
> num_recs);
> + if (res1)
> + dev_warn(adev->dev, "Fail to load EEPROM
> table, force to reset
> +it.");
> +
> + res = amdgpu_ras_eeprom_reset_table(control);
> + if (res) {
> + dev_err(adev->dev, "Failed to create a new
> EEPROM table.");
> + kfree(bps);
> + return res < 0 ? res : 0;
> + }
> +
> + if (!res1) {
> + /* Update the EEPROM table with correct table
> version and
> +  * original bad page records
> +  */
> + amdgpu_ras_update_eeprom_control(hdr);
> + res = amdgpu_ras_eeprom_append(control, bps,
> num_recs);
> +
> + if (res) {
> +  

RE: [PATCH] drm/amdgpu: refine function signature of amdgpu_aca_get_error_data()

2024-03-27 Thread Zhou1, Tao
[AMD Official Use Only - General]

I think "argument" is a more accurate word than "signature" here. With that
fixed, the patch is:

Reviewed-by: Tao Zhou 

> -Original Message-
> From: amd-gfx  On Behalf Of Yang
> Wang
> Sent: Thursday, March 28, 2024 1:53 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Wang, Yang(Kevin)
> 
> Subject: [PATCH] drm/amdgpu: refine function signature of
> amdgpu_aca_get_error_data()
>
> refine function signature of amdgpu_aca_get_error_data();
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 8 +++-
> drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h | 6 +-
>  2 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> index 920ddbb35c3d..cb6a40a042e1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> @@ -525,10 +525,9 @@ static bool aca_handle_is_valid(struct aca_handle
> *handle)  }
>
>  int amdgpu_aca_get_error_data(struct amdgpu_device *adev, struct aca_handle
> *handle,
> -   enum aca_error_type type, void *data, void *qctx)
> +   enum aca_error_type type, struct ras_err_data
> *err_data,
> +   struct ras_query_context *qctx)
>  {
> - struct ras_err_data *err_data = (struct ras_err_data *)data;
> -
>   if (!handle || !err_data)
>   return -EINVAL;
>
> @@ -538,8 +537,7 @@ int amdgpu_aca_get_error_data(struct amdgpu_device
> *adev, struct aca_handle *han
>   if (!(BIT(type) & handle->mask))
>   return  0;
>
> - return __aca_get_error_data(adev, handle, type, err_data,
> - (struct ras_query_context *)qctx);
> + return __aca_get_error_data(adev, handle, type, err_data, qctx);
>  }
>
>  static void aca_error_init(struct aca_error *aerr, enum aca_error_type type) 
> diff
> --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h
> index 247968d6a925..3765843ea648 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h
> @@ -26,6 +26,9 @@
>
>  #include 
>
> +struct ras_err_data;
> +struct ras_query_context;
> +
>  #define ACA_MAX_REGS_COUNT   (16)
>
>  #define ACA_REG_FIELD(x, h, l)   (((x) & GENMASK_ULL(h, 
> l)) >>
> l)
> @@ -198,7 +201,8 @@ int amdgpu_aca_add_handle(struct amdgpu_device
> *adev, struct aca_handle *handle,
> const char *name, const struct aca_info *aca_info, 
> void
> *data);  void amdgpu_aca_remove_handle(struct aca_handle *handle);  int
> amdgpu_aca_get_error_data(struct amdgpu_device *adev, struct aca_handle
> *handle,
> -   enum aca_error_type type, void *data, void *qctx);
> +   enum aca_error_type type, struct ras_err_data
> *err_data,
> +   struct ras_query_context *qctx);
>  int amdgpu_aca_smu_set_debug_mode(struct amdgpu_device *adev, bool en);
> void amdgpu_aca_smu_debugfs_init(struct amdgpu_device *adev, struct dentry
> *root);  int aca_error_cache_log_bank_error(struct aca_handle *handle, struct
> aca_bank_info *info,
> --
> 2.34.1



RE: [PATCH 3/3] drm/amdgpu: make reset method configurable for RAS poison

2024-03-17 Thread Zhou1, Tao
[AMD Official Use Only - General]

I can remove the support for SOC15_IH_CLIENTID_VMC from v10, but the reset type 
should be changed from bool to uint32 for all versions.

Regards,
Tao
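
Note: one way to read the bool to uint32 change is that the reset argument stops being a yes/no switch and becomes a value that can name a reset method per RAS block. The sketch below only illustrates that reading; the flag names, block names, and the mapping are all hypothetical and are not the driver's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values; the driver's real definitions may differ. */
#define DEMO_RESET_NONE   0u
#define DEMO_RESET_MODE_A (1u << 0)
#define DEMO_RESET_MODE_B (1u << 1)

/* Hypothetical RAS blocks with different reset requirements. */
enum demo_block { DEMO_BLOCK_X, DEMO_BLOCK_Y, DEMO_BLOCK_OTHER };

/* Each block picks its own reset method instead of sharing one bool. */
static uint32_t demo_reset_for_block(enum demo_block block)
{
	switch (block) {
	case DEMO_BLOCK_X:
		return DEMO_RESET_MODE_A;
	case DEMO_BLOCK_Y:
		return DEMO_RESET_MODE_B;
	default:
		return DEMO_RESET_NONE;
	}
}

/* Callers that used to pass false now pass 0, which still means no reset. */
static int demo_needs_reset(uint32_t reset)
{
	return reset != DEMO_RESET_NONE;
}
```

Passing 0 keeps the old "false" behaviour, which matches the call sites in the patch that change false to 0.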

> -Original Message-
> From: Zhang, Hawking 
> Sent: Sunday, March 17, 2024 6:10 PM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao 
> Subject: RE: [PATCH 3/3] drm/amdgpu: make reset method configurable for RAS
> poison
>
> [AMD Official Use Only - General]
>
> Let's not copy kfd interrupt handler and the work queue implementation from v9
> to v10 since the firmware/hardware design are totally different.
>
> We shall have another patch to fix kfd int v10 for poison consumption handling
> and also v11.
>
> Regards,
> Hawking
>
> -Original Message-
> From: amd-gfx  On Behalf Of Tao Zhou
> Sent: Wednesday, March 13, 2024 17:12
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao 
> Subject: [PATCH 3/3] drm/amdgpu: make reset method configurable for RAS
> poison
>
> Each RAS block has different requirement for gpu reset in poison consumption
> handling.
> Add support for mmhub RAS poison consumption handling.
>
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c|  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   |  4 ++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c   | 14 ++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h   |  4 ++--
>  .../gpu/drm/amd/amdkfd/kfd_int_process_v10.c  | 20 ++-
>  .../gpu/drm/amd/amdkfd/kfd_int_process_v9.c   | 20 ++-
>  7 files changed, 42 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 9687650b0fe3..262d20167039 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -760,7 +760,7 @@ bool amdgpu_amdkfd_is_fed(struct amdgpu_device
> *adev)  }
>
>  void amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device
> *adev,
> -   enum amdgpu_ras_block block, bool reset)
> +   enum amdgpu_ras_block block, uint32_t reset)
>  {
> amdgpu_umc_poison_handler(adev, block, reset);  } diff --git
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 03bf20e0e3da..ad50c7bbc326 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -400,7 +400,7 @@ void amdgpu_amdkfd_debug_mem_fence(struct
> amdgpu_device *adev);  int amdgpu_amdkfd_get_tile_config(struct
> amdgpu_device *adev,
> struct tile_config *config);  void
> amdgpu_amdkfd_ras_poison_consumption_handler(struct amdgpu_device
> *adev,
> -   enum amdgpu_ras_block block, bool reset);
> +   enum amdgpu_ras_block block, uint32_t reset);
>  bool amdgpu_amdkfd_is_fed(struct amdgpu_device *adev);  bool
> amdgpu_amdkfd_bo_mapped_to_dev(struct amdgpu_device *adev, struct
> kgd_mem *mem);  void amdgpu_amdkfd_block_mmu_notifications(void *p); diff
> --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index e32a186c2de1..58fe7bebdf1b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2045,7 +2045,7 @@ static void
> amdgpu_ras_interrupt_poison_consumption_handler(struct ras_manager *
> }
> }
>
> -   amdgpu_umc_poison_handler(adev, obj->head.block, false);
> +   amdgpu_umc_poison_handler(adev, obj->head.block, 0);
>
> if (block_obj->hw_ops && block_obj->hw_ops->handle_poison_consumption)
> poison_stat = 
> block_obj->hw_ops->handle_poison_consumption(adev);
> @@ -2698,7 +2698,7 @@ static int amdgpu_ras_page_retirement_thread(void
> *param)
> atomic_dec(&con->page_retirement_req_cnt);
>
> amdgpu_umc_bad_page_polling_timeout(adev,
> -   false, MAX_UMC_POISON_POLLING_TIME_ASYNC);
> +   0, MAX_UMC_POISON_POLLING_TIME_ASYNC);
> }
>
> return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index 20436f81856a..2c02585dcbff 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -186,9 +186,7 @@ static int amdgpu_umc_do_page_retirement(struct
> amdgpu_device *adev,
> amdgpu_umc_handle_bad_pages(adev, ras_error_status);
>
>

RE: [PATCH] drm/amdgpu: add ras event id support for ACA

2024-03-17 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Wang, Yang(Kevin) 
> Sent: Monday, March 18, 2024 10:25 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Wang, Yang(Kevin) 
> Subject: [PATCH] drm/amdgpu: add ras event id support for ACA
>
> add ras event id support for ACA.
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 29 ++---
> drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h |  2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 11 +-
>  3 files changed, 23 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> index 53ad76f590a1..ddcb68e60a73 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
> @@ -116,20 +116,22 @@ static struct aca_regs_dump {
>   {"CONTROL_MASK",ACA_REG_IDX_CTL_MASK},
>  };
>
> -static void aca_smu_bank_dump(struct amdgpu_device *adev, int idx, int total,
> struct aca_bank *bank)
> +static void aca_smu_bank_dump(struct amdgpu_device *adev, int idx, int total,
> struct aca_bank *bank,
> +   struct ras_query_context *qctx)
>  {
> + u64 event_id = qctx ? qctx->event_id: 0ULL;
>   int i;
>
> - dev_info(adev->dev, HW_ERR "Accelerator Check Architecture events
> logged\n");
> + RAS_EVENT_LOG(adev, event_id, HW_ERR "Accelerator Check
> Architecture
> +events logged\n");
>   /* plus 1 for output format, e.g: ACA[08/08]:  */
>   for (i = 0; i < ARRAY_SIZE(aca_regs); i++)
> - dev_info(adev->dev, HW_ERR
> "ACA[%02d/%02d].%s=0x%016llx\n",
> -  idx + 1, total, aca_regs[i].name, bank-
> >regs[aca_regs[i].reg_idx]);
> + RAS_EVENT_LOG(adev, event_id, HW_ERR
> "ACA[%02d/%02d].%s=0x%016llx\n",
> +   idx + 1, total, aca_regs[i].name,
> +bank->regs[aca_regs[i].reg_idx]);
>  }
>
>  static int aca_smu_get_valid_aca_banks(struct amdgpu_device *adev, enum
> aca_smu_type type,
>  int start, int count,
> -struct aca_banks *banks)
> +struct aca_banks *banks, struct
> ras_query_context *qctx)
>  {
>   struct amdgpu_aca *aca = &adev->aca;
>   const struct aca_smu_funcs *smu_funcs = aca->smu_funcs; @@ -165,7
> +167,7 @@ static int aca_smu_get_valid_aca_banks(struct amdgpu_device
> *adev, enum aca_smu_
>
>   bank.type = type;
>
> - aca_smu_bank_dump(adev, i, count, &bank);
> + aca_smu_bank_dump(adev, i, count, &bank, qctx);
>
>   ret = aca_banks_add_bank(banks, &bank);
>   if (ret)
> @@ -390,7 +392,7 @@ static bool aca_bank_should_update(struct
> amdgpu_device *adev, enum aca_smu_type  }
>
>  static int aca_banks_update(struct amdgpu_device *adev, enum aca_smu_type
> type,
> - bank_handler_t handler, void *data)
> + bank_handler_t handler, struct ras_query_context
> *qctx, void
> +*data)
>  {
>   struct amdgpu_aca *aca = &adev->aca;
>   struct aca_banks banks;
> @@ -412,7 +414,7 @@ static int aca_banks_update(struct amdgpu_device
> *adev, enum aca_smu_type type,
>
>   aca_banks_init(&banks);
>
> - ret = aca_smu_get_valid_aca_banks(adev, type, 0, count, &banks);
> + ret = aca_smu_get_valid_aca_banks(adev, type, 0, count, &banks, qctx);
>   if (ret)
>   goto err_release_banks;
>
> @@ -489,7 +491,7 @@ static int aca_log_aca_error(struct aca_handle *handle,
> enum aca_error_type type  }
>
>  static int __aca_get_error_data(struct amdgpu_device *adev, struct aca_handle
> *handle, enum aca_error_type type,
> - struct ras_err_data *err_data)
> + struct ras_err_data *err_data, struct
> ras_query_context *qctx)
>  {
>   enum aca_smu_type smu_type;
>   int ret;
> @@ -507,7 +509,7 @@ static int __aca_get_error_data(struct amdgpu_device
> *adev, struct aca_handle *h
>   }
>
>   /* udpate aca bank to aca source error_cache first */
> - ret = aca_banks_update(adev, smu_type, handler_aca_log_bank_error,
> NULL);
> + ret = aca_banks_update(adev, smu_type, handler_aca_log_bank_error,
> +qctx, NULL);
>   if (ret)
>   return ret;
>
> @@ -523,7 +525,7 @@ static bool aca_handle_is_valid(struct aca_handle
> *handle)  }
>
>  int amdgpu_ac

RE: [PATCH] drm/amdgpu: add ras event id support

2024-03-14 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: amd-gfx  On Behalf Of Yang
> Wang
> Sent: Thursday, March 14, 2024 4:12 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Wang, Yang(Kevin) ; Zhang, Hawking
> 
> Subject: [PATCH] drm/amdgpu: add ras event id support
>
> Add amdgpu RAS event id support to better distinguish different error
> information sources in dmesg logs.
>
> The following logs will be identified by event id:
> {event_id} interrupt to inform RAS event
> {event_id} ACA logs
> {event_id} errors statistic since from current injection/error query
> {event_id} errors statistic since from gpu load
>
> Signed-off-by: Yang Wang 
> Reviewed-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c  |  32 ++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h  |   3 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c  | 203 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h  |  30
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h |   1 +
>  drivers/gpu/drm/amd/amdgpu/umc_v12_0.c   |  10 +-
>  6 files changed, 191 insertions(+), 88 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
> index 24ad4b97177b..0734490347db 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
> @@ -210,22 +210,26 @@ int amdgpu_mca_smu_set_debug_mode(struct
> amdgpu_device *adev, bool enable)
>   return -EOPNOTSUPP;
>  }
>
> -static void amdgpu_mca_smu_mca_bank_dump(struct amdgpu_device *adev,
> int idx, struct mca_bank_entry *entry)
> +static void amdgpu_mca_smu_mca_bank_dump(struct amdgpu_device *adev,
> int idx, struct mca_bank_entry *entry,
> +  struct ras_query_context *qctx)
>  {
> - dev_info(adev->dev, HW_ERR "Accelerator Check Architecture events
> logged\n");
> - dev_info(adev->dev, HW_ERR "aca entry[%02d].STATUS=0x%016llx\n",
> -  idx, entry->regs[MCA_REG_IDX_STATUS]);
> - dev_info(adev->dev, HW_ERR "aca entry[%02d].ADDR=0x%016llx\n",
> -  idx, entry->regs[MCA_REG_IDX_ADDR]);
> - dev_info(adev->dev, HW_ERR "aca entry[%02d].MISC0=0x%016llx\n",
> -  idx, entry->regs[MCA_REG_IDX_MISC0]);
> - dev_info(adev->dev, HW_ERR "aca entry[%02d].IPID=0x%016llx\n",
> -  idx, entry->regs[MCA_REG_IDX_IPID]);
> - dev_info(adev->dev, HW_ERR "aca entry[%02d].SYND=0x%016llx\n",
> -  idx, entry->regs[MCA_REG_IDX_SYND]);
> + u64 event_id = qctx->event_id;
> +
> + RAS_EVENT_LOG(adev, event_id, HW_ERR "Accelerator Check
> Architecture events logged\n");
> + RAS_EVENT_LOG(adev, event_id, HW_ERR "aca
> entry[%02d].STATUS=0x%016llx\n",
> +   idx, entry->regs[MCA_REG_IDX_STATUS]);
> + RAS_EVENT_LOG(adev, event_id, HW_ERR "aca
> entry[%02d].ADDR=0x%016llx\n",
> +   idx, entry->regs[MCA_REG_IDX_ADDR]);
> + RAS_EVENT_LOG(adev, event_id, HW_ERR "aca
> entry[%02d].MISC0=0x%016llx\n",
> +   idx, entry->regs[MCA_REG_IDX_MISC0]);
> + RAS_EVENT_LOG(adev, event_id, HW_ERR "aca
> entry[%02d].IPID=0x%016llx\n",
> +   idx, entry->regs[MCA_REG_IDX_IPID]);
> + RAS_EVENT_LOG(adev, event_id, HW_ERR "aca
> entry[%02d].SYND=0x%016llx\n",
> +   idx, entry->regs[MCA_REG_IDX_SYND]);
>  }
>
> -int amdgpu_mca_smu_log_ras_error(struct amdgpu_device *adev, enum
> amdgpu_ras_block blk, enum amdgpu_mca_error_type type, struct ras_err_data
> *err_data)
> +int amdgpu_mca_smu_log_ras_error(struct amdgpu_device *adev, enum
> amdgpu_ras_block blk, enum amdgpu_mca_error_type type,
> +  struct ras_err_data *err_data, struct
> ras_query_context *qctx)
>  {
>   struct amdgpu_smuio_mcm_config_info mcm_info;
>   struct ras_err_addr err_addr = {0};
> @@ -244,7 +248,7 @@ int amdgpu_mca_smu_log_ras_error(struct
> amdgpu_device *adev, enum amdgpu_ras_blo
>   list_for_each_entry(node, &mca_set.list, node) {
>   entry = &node->entry;
>
> - amdgpu_mca_smu_mca_bank_dump(adev, i++, entry);
> + amdgpu_mca_smu_mca_bank_dump(adev, i++, entry, qctx);
>
>   count = 0;
>   ret = amdgpu_mca_smu_parse_mca_error_count(adev, blk, type,
> entry, &count);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h
> index b964110ed1e0..e5bf07ce3451 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.h
> @@ -169,6 +169,7 @@ void amdgpu_mca_smu_debugfs_init(struct
> amdgpu_device *adev, struct dentry *root  void
> amdgpu_mca_bank_set_init(struct mca_bank_set *mca_set);  int
> amdgpu_mca_bank_set_add_entry(struct mca_bank_set *mca_set, struct
> mca_bank_entry *entry);  void amdgpu_mca_bank_set_release(struct
> mca_bank_set *mca_set); -int amdgpu_mca_smu_log_ras_error(struct
> amdgpu_device *adev, enum amdgpu_ras_block blk, enum
> a

RE: [PATCH 1/5] drm/amdgpu: add new bit definitions for GC 9.0 PROTECTION_FAULT_STATUS

2024-03-10 Thread Zhou1, Tao
[AMD Official Use Only - General]

Ping for the series...

> -Original Message-
> From: Zhou1, Tao 
> Sent: Friday, February 23, 2024 4:24 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao 
> Subject: [PATCH 1/5] drm/amdgpu: add new bit definitions for GC 9.0
> PROTECTION_FAULT_STATUS
>
> Add UCE and FED bit definitions.
>
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/include/asic_reg/gc/gc_9_0_sh_mask.h | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/include/asic_reg/gc/gc_9_0_sh_mask.h
> b/drivers/gpu/drm/amd/include/asic_reg/gc/gc_9_0_sh_mask.h
> index efc16ddf274a..2dfa0e5b1aa3 100644
> --- a/drivers/gpu/drm/amd/include/asic_reg/gc/gc_9_0_sh_mask.h
> +++ b/drivers/gpu/drm/amd/include/asic_reg/gc/gc_9_0_sh_mask.h
> @@ -6822,6 +6822,8 @@
>  #define VM_L2_PROTECTION_FAULT_STATUS__VMID__SHIFT
> 0x14
>  #define VM_L2_PROTECTION_FAULT_STATUS__VF__SHIFT
> 0x18
>  #define VM_L2_PROTECTION_FAULT_STATUS__VFID__SHIFT
> 0x19
> +#define VM_L2_PROTECTION_FAULT_STATUS__UCE__SHIFT
> 0x1d
> +#define VM_L2_PROTECTION_FAULT_STATUS__FED__SHIFT
> 0x1e
>  #define VM_L2_PROTECTION_FAULT_STATUS__MORE_FAULTS_MASK
> 0x0001L
>  #define VM_L2_PROTECTION_FAULT_STATUS__WALKER_ERROR_MASK
> 0x000EL
>  #define VM_L2_PROTECTION_FAULT_STATUS__PERMISSION_FAULTS_MASK
> 0x00F0L
> @@ -6832,6 +6834,8 @@
>  #define VM_L2_PROTECTION_FAULT_STATUS__VMID_MASK
> 0x00F0L
>  #define VM_L2_PROTECTION_FAULT_STATUS__VF_MASK
> 0x0100L
>  #define VM_L2_PROTECTION_FAULT_STATUS__VFID_MASK
> 0x1E00L
> +#define VM_L2_PROTECTION_FAULT_STATUS__UCE_MASK
> 0x2000L
> +#define VM_L2_PROTECTION_FAULT_STATUS__FED_MASK
> 0x4000L
>  //VM_L2_PROTECTION_FAULT_ADDR_LO32
>  #define
> VM_L2_PROTECTION_FAULT_ADDR_LO32__LOGICAL_PAGE_ADDR_LO32__SHIF
> T   0x0
>  #define
> VM_L2_PROTECTION_FAULT_ADDR_LO32__LOGICAL_PAGE_ADDR_LO32_MASK
> 0xL
> --
> 2.34.1
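
For readers following the series, a minimal standalone sketch of how a
(shift, mask) pair like the new UCE/FED definitions is used to pull a field
out of the status register. The shift values (0x1d, 0x1e) are the ones in the
patch; the masks are derived here under the assumption that both fields are a
single bit wide, and the helper names are hypothetical, not driver code.

```c
#include <stdint.h>

/* Shift values as quoted in the patch. */
#define SKETCH_FAULT_STATUS_UCE_SHIFT 0x1d
#define SKETCH_FAULT_STATUS_FED_SHIFT 0x1e

/* Assumption: single-bit fields, so each mask is one bit at its shift. */
#define SKETCH_FAULT_STATUS_UCE_MASK (UINT32_C(1) << SKETCH_FAULT_STATUS_UCE_SHIFT)
#define SKETCH_FAULT_STATUS_FED_MASK (UINT32_C(1) << SKETCH_FAULT_STATUS_FED_SHIFT)

/* Hypothetical helper: isolate the field with the mask, then shift it down. */
static uint32_t sketch_get_field(uint32_t reg, uint32_t mask, unsigned int shift)
{
	return (reg & mask) >> shift;
}

static uint32_t sketch_fault_status_uce(uint32_t status)
{
	return sketch_get_field(status, SKETCH_FAULT_STATUS_UCE_MASK,
				SKETCH_FAULT_STATUS_UCE_SHIFT);
}

static uint32_t sketch_fault_status_fed(uint32_t status)
{
	return sketch_get_field(status, SKETCH_FAULT_STATUS_FED_MASK,
				SKETCH_FAULT_STATUS_FED_SHIFT);
}
```

In the driver the same extraction is done through REG_GET_FIELD(), as patch
5/5 of the series does for FED.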



RE: [PATCH Review 1/1] drm/amdgpu: Fix ineffective ras_mask settings

2024-02-21 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: amd-gfx  On Behalf Of
> Stanley.Yang
> Sent: Wednesday, February 21, 2024 10:27 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Yang, Stanley 
> Subject: [PATCH Review 1/1] drm/amdgpu: Fix ineffective ras_mask settings
>
> Check amdgpu_ras_mask so that the ras_mask setting takes effect on special
> ASICs that support poison mode but do not have SRAM ECC enabled.
>
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index d991b3467c47..b85014e7f96b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -3629,6 +3629,7 @@ int amdgpu_ras_is_supported(struct amdgpu_device
> *adev,
>block == AMDGPU_RAS_BLOCK__SDMA ||
>block == AMDGPU_RAS_BLOCK__VCN ||
>block == AMDGPU_RAS_BLOCK__JPEG) &&
> + (amdgpu_ras_mask & (1 << block)) &&
>   amdgpu_ras_is_poison_mode_supported(adev) &&
>   amdgpu_ras_get_ras_block(adev, block, 0))
>   ret = 1;
> --
> 2.25.1
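
A small sketch of the per-block mask test the patch adds, written as a
standalone helper. The function name is hypothetical and the explicit range
guard is an addition of this sketch (the patch itself uses a plain
"1 << block").

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: report whether bit 'block' is set in 'mask'.
 * The range guard avoids shifting by 32 or more, which is undefined in C.
 */
static bool sketch_ras_block_in_mask(uint32_t mask, unsigned int block)
{
	if (block >= 32)
		return false;

	return (mask & (UINT32_C(1) << block)) != 0;
}
```

With this kind of check in the poison-mode branch, clearing a block's bit in
the mask is enough to make the block report as unsupported, which is the
behaviour the commit message asks for.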



RE: [PATCH 5/5] drm/amdgpu: skip GFX FED error in page fault handling

2024-02-19 Thread Zhou1, Tao
[AMD Official Use Only - General]

> -Original Message-
> From: Lazar, Lijo 
> Sent: Monday, February 19, 2024 8:40 PM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 5/5] drm/amdgpu: skip GFX FED error in page fault handling
>
>
>
> On 2/19/2024 1:45 PM, Tao Zhou wrote:
> > Let kfd interrupt handler process it.
> >
> > Signed-off-by: Tao Zhou 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 10 +-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> > b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> > index 773725a92cf1..70defc394b7b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> > @@ -552,7 +552,7 @@ static int gmc_v9_0_process_interrupt(struct
> > amdgpu_device *adev,  {
> > bool retry_fault = !!(entry->src_data[1] & 0x80);
> > bool write_fault = !!(entry->src_data[1] & 0x20);
> > -   uint32_t status = 0, cid = 0, rw = 0;
> > +   uint32_t status = 0, cid = 0, rw = 0, fed = 0;
> > struct amdgpu_task_info task_info;
> > struct amdgpu_vmhub *hub;
> > const char *mmhub_cid;
> > @@ -663,6 +663,14 @@ static int gmc_v9_0_process_interrupt(struct
> amdgpu_device *adev,
> > status = RREG32(hub->vm_l2_pro_fault_status);
> > cid = REG_GET_FIELD(status, VM_L2_PROTECTION_FAULT_STATUS, CID);
> > rw = REG_GET_FIELD(status, VM_L2_PROTECTION_FAULT_STATUS, RW);
> > +   fed = REG_GET_FIELD(status, VM_L2_PROTECTION_FAULT_STATUS,
> FED);
> > +
> > +   /* for gfx fed error, kfd will handle it, return directly */
> > +   if (fed && amdgpu_ras_is_poison_mode_supported(adev) &&
> > +   amdgpu_ip_version(adev, GC_HWIP, 0) >= IP_VERSION(9, 4, 2) &&
> > +   !strcmp(hub_name, "gfxhub0"))
> > +   return 1;
>
> amdgpu_irq_dispatch() gives the impression that return value of 1 is treated
> as handled, hence won't be passed to kfd. The commit description says it is
> intended to pass to kfd for handling.

[Tao] good catch, it should return 0 here, will update it in v2, thanks.

>
> Also, FED status check may be moved up so that it's not misunderstood as a
> regular page fault with the extra prints coming to dmesg log.
> Otherwise, poison status also needs to be added to dmesg.

[Tao] there is a poison consumption dmesg log in the kfd interrupt handler, so
there is no need to add an extra print here.
My intention is to skip "WREG32_P(hub->vm_l2_pro_fault_cntl, 1, ~1)"; moving
up the check would make the change a little larger, and I think the page fault
log is acceptable.

>
> Thanks,
> Lijo
>
> > +
> > WREG32_P(hub->vm_l2_pro_fault_cntl, 1, ~1);  #ifdef
> > HAVE_STRUCT_XARRAY
> > amdgpu_vm_update_fault_cache(adev, entry->pasid, addr, status,
> vmhub);
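
To make the return-value point above concrete, here is a standalone sketch
that models only the behaviour Lijo describes: a positive return from the
process callback marks the event as handled, and only unhandled events are
passed on to kfd. All names are hypothetical and this is not the dispatcher
code itself.

```c
#include <stdbool.h>

/* Hypothetical callback type: gets the FED flag, returns the handler result. */
typedef int (*sketch_process_fn)(bool fed);

/* v1 of the patch: early "return 1" when a FED fault is seen. */
static int sketch_process_v1(bool fed)
{
	return fed ? 1 : 0;
}

/* Planned v2: early "return 0" so the event is not marked as handled. */
static int sketch_process_v2(bool fed)
{
	(void)fed;
	return 0;
}

/*
 * Returns true when the event would still be forwarded to the second
 * consumer (kfd in this thread). A negative result is treated as
 * "not handled" in this sketch.
 */
static bool sketch_forwarded_to_kfd(sketch_process_fn process, bool fed)
{
	bool handled = false;
	int r = process(fed);

	if (r > 0)
		handled = true;

	return !handled;
}
```

Under this model, v1 swallows the FED event and v2 lets it through, which is
why the early return is being changed to 0.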


RE: [PATCH] drm/amdgpu: Do not enable/disable bif ras irq from guest

2024-02-17 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Zhang, Hawking 
> Sent: Sunday, February 18, 2024 3:31 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao ; Yang,
> Stanley ; Chai, Thomas 
> Cc: Zhang, Hawking 
> Subject: [PATCH] drm/amdgpu: Do not enable/disable bif ras irq from guest
>
> Only do this from host side.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/soc15.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c
> b/drivers/gpu/drm/amd/amdgpu/soc15.c
> index 15033efec2ba..2c8702560090 100644
> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> @@ -1278,7 +1278,8 @@ static int soc15_common_hw_fini(void *handle)
>   if (amdgpu_sriov_vf(adev))
>   xgpu_ai_mailbox_put_irq(adev);
>
> - if (adev->nbio.ras_if &&
> + if ((!amdgpu_sriov_vf(adev)) &&
> + adev->nbio.ras_if &&
>   amdgpu_ras_is_supported(adev, adev->nbio.ras_if->block)) {
>   if (adev->nbio.ras &&
>   adev->nbio.ras->init_ras_controller_interrupt)
> --
> 2.17.1



RE: [PATCH] drm/amd/pm: Retrieve UMC ODECC error count from aca bank

2024-02-03 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: amd-gfx  On Behalf Of Candice Li
> Sent: Friday, February 2, 2024 7:13 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Li, Candice 
> Subject: [PATCH] drm/amd/pm: Retrieve UMC ODECC error count from aca bank
>
> Instead of software managed counters.
>
> Signed-off-by: Candice Li 
> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> index d6e14a5f406e63..03873d784be6d6 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> @@ -2552,8 +2552,12 @@ static int mca_umc_mca_get_err_count(const struct
> mca_ras_info *mca_ras, struct
>enum amdgpu_mca_error_type type, struct
> mca_bank_entry *entry, uint32_t *count)  {
>   uint64_t status0;
> + uint32_t ext_error_code;
> + uint32_t odecc_err_cnt;
>
>   status0 = entry->regs[MCA_REG_IDX_STATUS];
> + ext_error_code = MCA_REG__STATUS__ERRORCODEEXT(status0);
> + odecc_err_cnt =
> +MCA_REG__MISC0__ERRCNT(entry->regs[MCA_REG_IDX_MISC0]);
>
>   if (!REG_GET_FIELD(status0, MCMP1_STATUST0, Val)) {
>   *count = 0;
> @@ -2563,7 +2567,7 @@ static int mca_umc_mca_get_err_count(const struct
> mca_ras_info *mca_ras, struct
>   if (umc_v12_0_is_deferred_error(adev, status0) ||
>   umc_v12_0_is_uncorrectable_error(adev, status0) ||
>   umc_v12_0_is_correctable_error(adev, status0))
> - *count = 1;
> + *count = (ext_error_code == 0) ? odecc_err_cnt : 1;
>
>   return 0;
>  }
> --
> 2.25.1
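
The counting rule in the patch can be sketched as a pure function. The
inputs are assumed to be already decoded from the STATUS and MISC0 registers
(no bit positions are assumed here), and the function and parameter names are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the rule in mca_umc_mca_get_err_count() after the
 * patch:
 *   - invalid status            -> 0
 *   - not a CE/UE/deferred bank -> 0 (the patch leaves *count untouched; the
 *                                  caller in the amdgpu_mca.c hunk sets
 *                                  count = 0 before the call)
 *   - extended error code == 0  -> use the ODECC counter from the bank
 *   - anything else             -> count one error
 */
static uint32_t sketch_umc_err_count(bool status_valid, bool is_umc_error,
				     uint32_t ext_error_code,
				     uint32_t odecc_err_cnt)
{
	if (!status_valid || !is_umc_error)
		return 0;

	return (ext_error_code == 0) ? odecc_err_cnt : 1;
}
```

For example, a valid bank with extended error code 0 and an ODECC counter of
7 contributes 7 errors, while the same bank with a non-zero extended error
code contributes 1.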



RE: [PATCH] drm/amdgpu: skip call ras_late_init if ras block is not supported

2024-01-21 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Wang, Yang(Kevin) 
> Sent: Monday, January 22, 2024 1:29 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Wang, Yang(Kevin) 
> Subject: [PATCH] drm/amdgpu: skip call ras_late_init if ras block is not 
> supported
>
> skip call ras_late_init callback if ras block is not supported.
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index f4fcb008d7ba..61ba7cd8345d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -3346,12 +3346,15 @@ int amdgpu_ras_late_init(struct amdgpu_device
> *adev)
>   amdgpu_ras_set_mca_debug_mode(adev, false);
>
>   list_for_each_entry_safe(node, tmp, &adev->ras_list, node) {
> - if (!node->ras_obj) {
> + obj = node->ras_obj;
> + if (!obj) {
>   dev_warn(adev->dev, "Warning: abnormal ras list
> node.\n");
>   continue;
>   }
>
> - obj = node->ras_obj;
> + if (!amdgpu_ras_is_supported(adev, obj->ras_comm.block))
> + continue;
> +
>   if (obj->ras_late_init) {
>   r = obj->ras_late_init(adev, &obj->ras_comm);
>   if (r) {
> --
> 2.34.1



RE: [PATCH] drm/amdgpu: skip call ras_late_init if ras feature is not enabled

2024-01-18 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Wang, Yang(Kevin) 
> Sent: Thursday, January 18, 2024 3:50 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Wang, Yang(Kevin) 
> Subject: [PATCH] drm/amdgpu: skip call ras_late_init if ras feature is not 
> enabled
>
> skip call ras_late_init callback if ras feature is not enabled.
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 5c817c155d72..5c73d0871220 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -3312,6 +3312,9 @@ int amdgpu_ras_late_init(struct amdgpu_device
> *adev)
>   }
>
>   obj = node->ras_obj;
> + if (!amdgpu_ras_is_feature_enabled(adev, &obj->ras_comm))
> + continue;
> +
>   if (obj->ras_late_init) {
>   r = obj->ras_late_init(adev, &obj->ras_comm);
>   if (r) {
> --
> 2.34.1



RE: [PATCH 3/5] drm/amdgpu: Use asynchronous polling to handle umc_v12_0 poisoning

2024-01-17 Thread Zhou1, Tao
[AMD Official Use Only - General]


  From: Chai, Thomas
  Sent: Thursday, January 18, 2024 11:06 AM
  To: Zhang, Hawking; amd-gfx@lists.freedesktop.org
  Cc: Zhou1, Tao; Li, Candice; Wang, Yang(Kevin); Yang, Stanley
  Subject: RE: [PATCH 3/5] drm/amdgpu: Use asynchronous polling to handle
  umc_v12_0 poisoning


  [AMD Official Use Only - General]






  -
  Best Regards,
  Thomas


  From: Zhang, Hawking
  Sent: Wednesday, January 17, 2024 7:54 PM
  To: Chai, Thomas; amd-gfx@lists.freedesktop.org
  Cc: Zhou1, Tao; Li, Candice; Wang, Yang(Kevin); Yang, Stanley
  Subject: RE: [PATCH 3/5] drm/amdgpu: Use asynchronous polling to handle
  umc_v12_0 poisoning


  [AMD Official Use Only - General]



  Please check my comments inline

  Regards,
  Hawking

  -Original Message-
  From: Chai, Thomas
  Sent: Tuesday, January 16, 2024 16:21
  To: amd-gfx@lists.freedesktop.org
  Cc: Chai, Thomas; Zhang, Hawking; Zhou1, Tao; Li, Candice; Wang, Yang(Kevin);
  Yang, Stanley; Chai, Thomas
  Subject: [PATCH 3/5] drm/amdgpu: Use asynchronous polling to handle
  umc_v12_0 poisoning

  Use asynchronous polling to handle umc_v12_0 poisoning.

  Signed-off-by: YiPeng Chai
  ---
   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c |   5 +
   drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 143 +++-
   drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h |   3 +
   3 files changed, 120 insertions(+), 31 deletions(-)

  diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
  index 856206e95842..44929281840e 100644
  --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
  +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
  @@ -118,6 +118,8 @@ const char *get_ras_block_str(struct ras_common_if 
*ras_block)
   /* typical ECC bad page rate is 1 bad page per 100MB VRAM */
   #define RAS_BAD_PAGE_COVER  (100 * 1024 * 1024ULL)

  +#define MAX_UMC_POISON_POLLING_TIME_ASYNC  100  //ms
  +
   enum amdgpu_ras_retire_page_reservation {
AMDGPU_RAS_RETIRE_PAGE_RESERVED,
AMDGPU_RAS_RETIRE_PAGE_PENDING,
  @@ -2670,6 +2672,9 @@ static int amdgpu_ras_page_retirement_thread(void 
*param)
atomic_read(&con->page_retirement_req_cnt));

atomic_dec(&con->page_retirement_req_cnt);
  +
  + amdgpu_umc_poison_retire_page_polling_timeout(adev,
  + false, MAX_UMC_POISON_POLLING_TIME_ASYNC);
}

return 0;
  diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
  index 9d1cf41cf483..2dde29cb807d 100644
  --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
  +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
  @@ -23,6 +23,7 @@

   #include "amdgpu.h"
   #include "umc_v6_7.h"
  +#define MAX_UMC_POISON_POLLING_TIME_SYNC   20  //ms

   static int amdgpu_umc_convert_error_address(struct amdgpu_device *adev,
struct ras_err_data *err_data, uint64_t 
err_addr, @@ -85,17 +86,14 @@ int amdgpu_umc_page_retirement_mca(struct 
amdgpu_device *adev,
return ret;
   }

  -static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
  - void *ras_error_status,
  - struct amdgpu_iv_entry *entry,
  - bool reset)
  +static void amdgpu_umc_handle_bad_pages(struct amdgpu_device *adev,
  + void *ras_error_status)
   {
struct ras_err_data *err_data = (struct ras_err_data *)ras_error_status;
struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
int ret = 0;
unsigned long err_count;
  -
  - kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
  + mutex_lock(&con->page_retirement_lock);
ret = amdgpu_dpm_get_ecc_info(adev, (void *)&(con->umc_ecc));
if (ret == -EOPNOTSUPP) {
if (adev->umc.ras && adev->umc.ras->ras_block.hw_ops && @@ 
-163,19 +161,86 @@ static int amdgpu_umc_do_page_retir

RE: [PATCH 2/2] update check condition of query for ras page retire

2024-01-17 Thread Zhou1, Tao
[AMD Official Use Only - General]

Sure, I will revert the related patch in the next version.

Regards,
Tao

> -Original Message-
> From: Zhang, Hawking 
> Sent: Wednesday, January 17, 2024 8:09 PM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao 
> Subject: RE: [PATCH 2/2] update check condition of query for ras page retire
>
> [AMD Official Use Only - General]
>
> static ssize_t smu_v13_0_6_get_ecc_info(struct smu_context *smu,
> void *table)
>  {
> -   /* Support ecc info by default */
> -   return 0;
> +   /* we use debug mode flag instead of this interface */
> +   return -EOPNOTSUPP;
>  }
>
> Shall we just drop the callback implementation? smu_get_ecc_info will return
> -EOPNOTSUPP if the callback is not supported.
>
> Regards,
> Hawking
>
> -Original Message-
> From: amd-gfx  On Behalf Of Tao Zhou
> Sent: Wednesday, January 17, 2024 17:15
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao 
> Subject: [PATCH 2/2] update check condition of query for ras page retire
>
> Support page retirement handling in debug mode.
>
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c  | 9 +++--
>  drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 4 ++--
>  2 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index 41139bac7643..6df32f0afd89 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -90,12 +90,16 @@ static void amdgpu_umc_handle_bad_pages(struct
> amdgpu_device *adev,  {
> struct ras_err_data *err_data = (struct ras_err_data 
> *)ras_error_status;
> struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +   unsigned int error_query_mode;
> int ret = 0;
>
> +   amdgpu_ras_get_error_query_mode(adev, &error_query_mode);
> +
> mutex_lock(&con->page_retirement_lock);
>
> ret = amdgpu_dpm_get_ecc_info(adev, (void *)&(con->umc_ecc));
> -   if (ret == -EOPNOTSUPP) {
> +   if (ret == -EOPNOTSUPP &&
> +   error_query_mode == AMDGPU_RAS_DIRECT_ERROR_QUERY) {
> if (adev->umc.ras && adev->umc.ras->ras_block.hw_ops &&
> adev->umc.ras->ras_block.hw_ops->query_ras_error_count)
> 
> adev->umc.ras->ras_block.hw_ops->query_ras_error_count(adev,
> ras_error_status); @@ -119,7 +123,8 @@ static void
> amdgpu_umc_handle_bad_pages(struct amdgpu_device *adev,
>  */
> adev->umc.ras->ras_block.hw_ops-
> >query_ras_error_address(adev, ras_error_status);
> }
> -   } else if (!ret) {
> +   } else if (error_query_mode == AMDGPU_RAS_FIRMWARE_ERROR_QUERY
> ||
> +   (!ret && error_query_mode == AMDGPU_RAS_DIRECT_ERROR_QUERY)) {
> if (adev->umc.ras &&
> adev->umc.ras->ecc_info_query_ras_error_count)
> adev->umc.ras->ecc_info_query_ras_error_count(adev,
> ras_error_status); diff --git
> a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> index c560f4af214d..d86c9e7fc64b 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> @@ -2909,8 +2909,8 @@ static int
> smu_v13_0_6_select_xgmi_plpd_policy(struct smu_context *smu,  static ssize_t
> smu_v13_0_6_get_ecc_info(struct smu_context *smu,
> void *table)
>  {
> -   /* Support ecc info by default */
> -   return 0;
> +   /* we use debug mode flag instead of this interface */
> +   return -EOPNOTSUPP;
>  }
>
>  static const struct pptable_funcs smu_v13_0_6_ppt_funcs = {
> --
> 2.35.1
>
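
Hawking's suggestion relies on the usual optional-callback pattern: if the
ASIC table does not provide the hook, the wrapper itself reports
-EOPNOTSUPP, so a stub that only returns that value is redundant. A
standalone sketch, with hypothetical struct and function names:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, trimmed-down function table with one optional hook. */
struct sketch_ppt_funcs {
	int (*get_ecc_info)(void *table);
};

/*
 * Hypothetical wrapper: a missing table or a missing hook is reported as
 * "not supported"; otherwise the hook's own result is returned.
 */
static int sketch_get_ecc_info(const struct sketch_ppt_funcs *funcs, void *table)
{
	if (!funcs || !funcs->get_ecc_info)
		return -EOPNOTSUPP;

	return funcs->get_ecc_info(table);
}

/* Example hook that reports success. */
static int sketch_ecc_info_ok(void *table)
{
	(void)table;
	return 0;
}
```

With a wrapper like this, leaving .get_ecc_info unset gives the caller the
same -EOPNOTSUPP result as the stub in the patch.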



RE: [PATCH] drm/amdgpu: fix UBSAN array-index-out-of-bounds for ras_block_string[]

2024-01-16 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: amd-gfx  On Behalf Of Yang
> Wang
> Sent: Tuesday, January 16, 2024 7:02 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Wang, Yang(Kevin) ; Zhang, Hawking
> 
> Subject: [PATCH] drm/amdgpu: fix UBSAN array-index-out-of-bounds for
> ras_block_string[]
>
> fix array index out of bounds issue for ras_block_string[] array.
>
> Fixes: 2e3675fe4e3ee ("drm/amdgpu: Align ras block enum with firmware")
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index ff6f84714f68..8004863719d0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -74,6 +74,8 @@ const char *ras_block_string[] = {
>   "mca",
>   "vcn",
>   "jpeg",
> + "ih",
> + "mpio",
>  };
>
>  const char *ras_mca_block_string[] = {
> @@ -95,7 +97,8 @@ const char *get_ras_block_str(struct ras_common_if
> *ras_block)
>   if (!ras_block)
>   return "NULL";
>
> - if (ras_block->block >= AMDGPU_RAS_BLOCK_COUNT)
> + if (ras_block->block >= AMDGPU_RAS_BLOCK_COUNT ||
> + ras_block->block >= ARRAY_SIZE(ras_block_string))
>   return "OUT OF RANGE";
>
>   if (ras_block->block == AMDGPU_RAS_BLOCK__MCA)
> --
> 2.34.1
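
A standalone sketch of the double bound check used in the fix: the index is
compared against both the enum count and the real size of the string table,
so a table that lags behind the enum cannot be read out of bounds. The enum,
table and function names are hypothetical; the table is deliberately one
entry short to show the case the patch guards against.

```c
#include <stddef.h>

#define SKETCH_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

enum sketch_block {
	SKETCH_BLOCK_A,
	SKETCH_BLOCK_B,
	SKETCH_BLOCK_C,
	SKETCH_BLOCK_COUNT,
};

/* Deliberately missing the entry for SKETCH_BLOCK_C. */
static const char *const sketch_block_string[] = {
	"a",
	"b",
};

static const char *sketch_block_str(unsigned int block)
{
	/* Check the enum range and the actual table size. */
	if (block >= SKETCH_BLOCK_COUNT ||
	    block >= SKETCH_ARRAY_SIZE(sketch_block_string))
		return "OUT OF RANGE";

	return sketch_block_string[block];
}
```

The enum-only check would accept SKETCH_BLOCK_C and index past the end of the
table; the extra size check turns that into "OUT OF RANGE".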



RE: [PATCH] drm/amdgpu: Drop unnecessary sentences about CE and deferred error.

2024-01-03 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: amd-gfx  On Behalf Of Candice Li
> Sent: Thursday, January 4, 2024 1:25 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Li, Candice 
> Subject: [PATCH] drm/amdgpu: Drop unnecessary sentences about CE and
> deferred error.
>
> Remove "no user action is needed" for correctable and deferred error to avoid
> confusion.
>
> Signed-off-by: Candice Li 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 +-
>  drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c  |  3 +--
>  drivers/gpu/drm/amd/amdgpu/nbio_v7_9.c  |  3 +--
>  drivers/gpu/drm/amd/amdgpu/umc_v6_7.c   |  2 +-
>  4 files changed, 8 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index b21eadd7c975df..caf00df669bf7e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1069,8 +1069,7 @@ static void amdgpu_ras_error_print_error_data(struct
> amdgpu_device *adev,
>   mcm_info = &err_info->mcm_info;
>   if (err_info->ce_count) {
>   dev_info(adev->dev, "socket: %d, die: %d, "
> -  "%lld new correctable hardware errors
> detected in %s block, "
> -  "no user action is needed\n",
> +  "%lld new correctable hardware errors
> detected in %s block\n",
>mcm_info->socket_id,
>mcm_info->die_id,
>err_info->ce_count,
> @@ -1082,8 +1081,7 @@ static void amdgpu_ras_error_print_error_data(struct
> amdgpu_device *adev,
>   err_info = &err_node->err_info;
>   mcm_info = &err_info->mcm_info;
>   dev_info(adev->dev, "socket: %d, die: %d, "
> -  "%lld correctable hardware errors detected in
> total in %s block, "
> -  "no user action is needed\n",
> +  "%lld correctable hardware errors detected in
> total in %s
> +block\n",
>mcm_info->socket_id, mcm_info->die_id,
> err_info->ce_count, blk_name);
>   }
>   break;
> @@ -1139,16 +1137,14 @@ static void
> amdgpu_ras_error_generate_report(struct amdgpu_device *adev,
>  adev->smuio.funcs->get_die_id) {
>   dev_info(adev->dev, "socket: %d, die: %d "
>"%ld correctable hardware errors "
> -  "detected in %s block, no user "
> -  "action is needed.\n",
> +  "detected in %s block\n",
>adev->smuio.funcs->get_socket_id(adev),
>adev->smuio.funcs->get_die_id(adev),
>ras_mgr->err_data.ce_count,
>blk_name);
>   } else {
>   dev_info(adev->dev, "%ld correctable hardware errors "
> -  "detected in %s block, no user "
> -  "action is needed.\n",
> +  "detected in %s block\n",
>ras_mgr->err_data.ce_count,
>blk_name);
>   }
> @@ -1978,7 +1974,7 @@ static void
> amdgpu_ras_interrupt_poison_creation_handler(struct ras_manager *obj
>   struct amdgpu_iv_entry *entry)
>  {
>   dev_info(obj->adev->dev,
> - "Poison is created, no user action is needed.\n");
> + "Poison is created\n");
>  }
>
>  static void amdgpu_ras_interrupt_umc_handler(struct ras_manager *obj, diff --
> git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> index 6d24c84924cb5d..19986ff6a48d7e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> +++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> @@ -401,8 +401,7 @@ static void
> nbio_v7_4_handle_ras_controller_intr_no_bifring(struct amdgpu_device
>
>   if (err_data.ce_count)
>   dev_info(adev->dev, "%ld correctable hardware
> "
> - "errors detected in %s block, "
> - "no user action is needed.\n",
> + "errors detected in %s
> block\n",
>   obj->err_data.ce_count,
>   get_ras_block_str(adev-
> >nbio.ras_if));
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_9.c
> b/drivers/gpu/drm/amd/amdgpu/nbio_v7_9.c
> index 25a3da83e0fb97..e90f3378080345 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_9.c

RE: [PATCH] drm/amdgpu: Support poison error injection via ras_ctrl debugfs

2024-01-03 Thread Zhou1, Tao
[AMD Official Use Only - General]

Please also update the description of error type in "DOC: AMDGPU RAS debugfs 
control interface" for ras_debugfs_ctrl_write.

With that fixed, the patch is:

Reviewed-by: Tao Zhou 

> -Original Message-
> From: amd-gfx  On Behalf Of Candice Li
> Sent: Thursday, January 4, 2024 1:16 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Li, Candice 
> Subject: [PATCH] drm/amdgpu: Support poison error injection via ras_ctrl 
> debugfs
>
> Support poison error injection.
>
> Signed-off-by: Candice Li 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index caf00df669bf7e..5851c7a80a5a8c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -305,11 +305,13 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct
> file *f,
>   return -EINVAL;
>
>   data->head.block = block_id;
> - /* only ue and ce errors are supported */
> + /* only ue, ce and poison errors are supported */
>   if (!memcmp("ue", err, 2))
>   data->head.type =
> AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE;
>   else if (!memcmp("ce", err, 2))
>   data->head.type =
> AMDGPU_RAS_ERROR__SINGLE_CORRECTABLE;
> + else if (!memcmp("poison", err, 6))
> + data->head.type = AMDGPU_RAS_ERROR__POISON;
>   else
>   return -EINVAL;
>
> --
> 2.25.1
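
A standalone sketch of the error-type parsing after the patch, for
illustration only. It uses strncmp() on a NUL-terminated string instead of
memcmp() on the driver's fixed-size buffer, and the enum and function names
are hypothetical.

```c
#include <string.h>

enum sketch_err_type {
	SKETCH_ERR_INVALID = -1,
	SKETCH_ERR_UE,
	SKETCH_ERR_CE,
	SKETCH_ERR_POISON,
};

/*
 * Hypothetical parser: prefix match on the error keyword, in the same order
 * as the patch ("ue", "ce", then "poison"). strncmp() stops at the NUL
 * terminator, so short inputs are safe.
 */
static enum sketch_err_type sketch_parse_err_type(const char *err)
{
	if (!strncmp(err, "ue", 2))
		return SKETCH_ERR_UE;
	if (!strncmp(err, "ce", 2))
		return SKETCH_ERR_CE;
	if (!strncmp(err, "poison", 6))
		return SKETCH_ERR_POISON;

	return SKETCH_ERR_INVALID;
}
```

Anything other than the three keywords maps to the invalid value, matching
the -EINVAL path in the patch.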



RE: [PATCH 05/14] drm/amdgpu: add amdgpu ras aca query interface

2024-01-03 Thread Zhou1, Tao
[AMD Official Use Only - General]

We currently check debug mode to decide which path is selected. The legacy path
is still needed even if the ACA framework is supported (when debug mode is
enabled).

"it should help us to differentiate aca from legacy ras when implementing other
features": is there a scenario where the aca flag is a must?

Regards,
Tao

> -Original Message-
> From: Zhang, Hawking 
> Sent: Wednesday, January 3, 2024 8:00 PM
> To: Wang, Yang(Kevin) ; amd-
> g...@lists.freedesktop.org
> Cc: Zhou1, Tao ; Chai, Thomas 
> Subject: RE: [PATCH 05/14] drm/amdgpu: add amdgpu ras aca query interface
>
> [AMD Official Use Only - General]
>
> I assume we are leveraging error_query_mode to differentiate aca path from
> legacy ras path, right?
>
> But given in-band error reporting is just the start of transition from legacy
> ras to aca, do we need a flag in amdgpu_aca to indicate whether aca is
> supported or not? Accordingly, we can initialize the flag in
> amdgpu_ras_check_supported. it should help us to differentiate aca from legacy
> ras when implementing other features, thoughts?
>
> Regards,
> Hawking
>
> -Original Message-
> From: Wang, Yang(Kevin) 
> Sent: Wednesday, January 3, 2024 16:02
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Chai, Thomas ; Wang,
> Yang(Kevin) 
> Subject: [PATCH 05/14] drm/amdgpu: add amdgpu ras aca query interface
>
> use new ACA error query interface to instead of legacy MCA query.
>
> Signed-off-by: Yang Wang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 88 -
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 12 +++-
>  2 files changed, 79 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 038bd1b17cef..bbae41f86e00 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1168,6 +1168,53 @@ static void
> amdgpu_rasmgr_error_data_statistic_update(struct ras_manager *obj, s
> }
>  }
>
> +static struct ras_manager *get_ras_manager(struct amdgpu_device *adev,
> +enum amdgpu_ras_block blk) {
> +   struct ras_common_if head;
> +
> +   memset(&head, 0, sizeof(head));
> +   head.block = blk;
> +
> +   return amdgpu_ras_find_obj(adev, &head); }
> +
> +int amdgpu_ras_bind_aca(struct amdgpu_device *adev, enum
> amdgpu_ras_block blk,
> +   const struct aca_info *aca_info, void *data) {
> +   struct ras_manager *obj;
> +
> +   obj = get_ras_manager(adev, blk);
> +   if (!obj)
> +   return -EINVAL;
> +
> +   return amdgpu_aca_add_handle(adev, &obj->aca_handle,
> +ras_block_str(blk), aca_info, data); }
> +
> +int amdgpu_ras_unbind_aca(struct amdgpu_device *adev, enum
> +amdgpu_ras_block blk) {
> +   struct ras_manager *obj;
> +
> +   obj = get_ras_manager(adev, blk);
> +   if (!obj)
> +   return -EINVAL;
> +
> +   amdgpu_aca_remove_handle(&obj->aca_handle);
> +
> +   return 0;
> +}
> +
> +static int amdgpu_aca_log_ras_error_data(struct amdgpu_device *adev, enum
> amdgpu_ras_block blk,
> +enum aca_error_type type, struct 
> ras_err_data *err_data)
> {
> +   struct ras_manager *obj;
> +
> +   obj = get_ras_manager(adev, blk);
> +   if (!obj)
> +   return -EINVAL;
> +
> +   return amdgpu_aca_get_error_data(adev, &obj->aca_handle, type,
> +err_data); }
> +
>  static int amdgpu_ras_query_error_status_helper(struct amdgpu_device *adev,
> struct ras_query_if *info,
> struct ras_err_data 
> *err_data, @@ -1175,6 +1222,7
> @@ static int amdgpu_ras_query_error_status_helper(struct amdgpu_device
> *adev,  {
> enum amdgpu_ras_block blk = info ? info->head.block :
> AMDGPU_RAS_BLOCK_COUNT;
> struct amdgpu_ras_block_object *block_obj = NULL;
> +   int ret;
>
> if (blk == AMDGPU_RAS_BLOCK_COUNT)
> return -EINVAL;
> @@ -1204,9 +1252,13 @@ static int
> amdgpu_ras_query_error_status_helper(struct amdgpu_device *adev,
> }
> }
> } else {
> -   /* FIXME: add code to check return value later */
> -   amdgpu_mca_smu_log_ras_error(adev, blk,
> AMDGPU_MCA_ERROR_TYPE_UE, err_data);
> -   amdgpu_mca_smu_log_ras_error(adev, blk,
> AMDGPU_MCA_ERROR_TYPE_CE, err_data);
> +   ret = amdgpu_aca_log_

RE: [PATCH 3/3] drm/amdgpu: Centralize ras cap query to amdgpu_ras_check_supported

2024-01-02 Thread Zhou1, Tao
[AMD Official Use Only - General]

The series is:

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Hawking Zhang 
> Sent: Tuesday, January 2, 2024 10:16 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao ; Yang,
> Stanley ; Wang, Yang(Kevin)
> ; Chai, Thomas ; Li,
> Candice 
> Cc: Zhang, Hawking ; Deucher, Alexander
> ; Lazar, Lijo ; Ma, Le
> 
> Subject: [PATCH 3/3] drm/amdgpu: Centralize ras cap query to
> amdgpu_ras_check_supported
>
> Move ras capablity check to amdgpu_ras_check_supported.
> Driver will query ras capablity through psp interace, or vbios interface, or 
> specific
> ip callbacks.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 170 +---
>  1 file changed, 93 insertions(+), 77 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index a901b00d4949..2ee82baaf7d6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -39,6 +39,7 @@
>  #include "nbio_v7_9.h"
>  #include "atom.h"
>  #include "amdgpu_reset.h"
> +#include "amdgpu_psp.h"
>
>  #ifdef CONFIG_X86_MCE_AMD
>  #include 
> @@ -2680,6 +2681,87 @@ static void amdgpu_ras_get_quirks(struct
> amdgpu_device *adev)
>   adev->ras_hw_enabled |= (1 << AMDGPU_RAS_BLOCK__GFX);  }
>
> +/* Query ras capablity via atomfirmware interface */ static void
> +amdgpu_ras_query_ras_capablity_from_vbios(struct amdgpu_device *adev) {
> + /* mem_ecc cap */
> + if (amdgpu_atomfirmware_mem_ecc_supported(adev)) {
> + dev_info(adev->dev, "MEM ECC is active.\n");
> + adev->ras_hw_enabled |= (1 << AMDGPU_RAS_BLOCK__UMC |
> +  1 << AMDGPU_RAS_BLOCK__DF);
> + } else {
> + dev_info(adev->dev, "MEM ECC is not presented.\n");
> + }
> +
> + /* sram_ecc cap */
> + if (amdgpu_atomfirmware_sram_ecc_supported(adev)) {
> + dev_info(adev->dev, "SRAM ECC is active.\n");
> + if (!amdgpu_sriov_vf(adev))
> + adev->ras_hw_enabled |= ~(1 <<
> AMDGPU_RAS_BLOCK__UMC |
> +   1 <<
> AMDGPU_RAS_BLOCK__DF);
> + else
> + adev->ras_hw_enabled |= (1 <<
> AMDGPU_RAS_BLOCK__PCIE_BIF |
> +  1 <<
> AMDGPU_RAS_BLOCK__SDMA |
> +  1 <<
> AMDGPU_RAS_BLOCK__GFX);
> +
> + /*
> +  * VCN/JPEG RAS can be supported on both bare metal and
> +  * SRIOV environment
> +  */
> + if (amdgpu_ip_version(adev, VCN_HWIP, 0) == IP_VERSION(2, 6,
> 0) ||
> + amdgpu_ip_version(adev, VCN_HWIP, 0) == IP_VERSION(4, 0,
> 0) ||
> + amdgpu_ip_version(adev, VCN_HWIP, 0) == IP_VERSION(4, 0,
> 3))
> + adev->ras_hw_enabled |= (1 <<
> AMDGPU_RAS_BLOCK__VCN |
> +  1 <<
> AMDGPU_RAS_BLOCK__JPEG);
> + else
> + adev->ras_hw_enabled &= ~(1 <<
> AMDGPU_RAS_BLOCK__VCN |
> +   1 <<
> AMDGPU_RAS_BLOCK__JPEG);
> +
> + /*
> +  * XGMI RAS is not supported if xgmi num physical nodes
> +  * is zero
> +  */
> + if (!adev->gmc.xgmi.num_physical_nodes)
> + adev->ras_hw_enabled &= ~(1 <<
> AMDGPU_RAS_BLOCK__XGMI_WAFL);
> + } else {
> + dev_info(adev->dev, "SRAM ECC is not presented.\n");
> + }
> +}
> +
> +/* Query poison mode from umc/df IP callbacks */ static void
> +amdgpu_ras_query_poison_mode(struct amdgpu_device *adev) {
> + struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> + bool df_poison, umc_poison;
> +
> + /* poison setting is useless on SRIOV guest */
> + if (amdgpu_sriov_vf(adev) || !con)
> + return;
> +
> + /* Init poison supported flag, the default value is false */
> + if (adev->gmc.xgmi.connected_to_cpu ||
> + adev->gmc.is_app_apu) {
> + /* enabled by default when GPU is connected to CPU */
> + con->poison_supported = true;
> + } else if (adev->df.funcs &&
> + adev->df.funcs->query_ras_poison_mode &&
> + adev->umc.ras &&
> +  

RE: [PATCH 3/3] drm/amdgpu: Replace DRM_* with dev_* in amdgpu_psp.c

2024-01-02 Thread Zhou1, Tao
[AMD Official Use Only - General]

The series is:

Reviewed-by: Tao Zhou 

> -Original Message-
> From: amd-gfx  On Behalf Of Hawking
> Zhang
> Sent: Tuesday, January 2, 2024 11:45 AM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao ; Yang,
> Stanley ; Wang, Yang(Kevin)
> ; Chai, Thomas ; Li,
> Candice 
> Cc: Deucher, Alexander ; Ma, Le
> ; Lazar, Lijo ; Zhang, Hawking
> 
> Subject: [PATCH 3/3] drm/amdgpu: Replace DRM_* with dev_* in amdgpu_psp.c
>
> So kernel message has the device pcie bdf information, which helps issue
> debugging especially in multiple GPU system.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 144 
>  1 file changed, 75 insertions(+), 69 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index 8a3847d3041f..0d871479ff34 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -291,21 +291,22 @@ static int psp_memory_training_init(struct psp_context
> *psp)
>   struct psp_memory_training_context *ctx = &psp->mem_train_ctx;
>
>   if (ctx->init != PSP_MEM_TRAIN_RESERVE_SUCCESS) {
> - DRM_DEBUG("memory training is not supported!\n");
> + dev_dbg(psp->adev->dev, "memory training is not
> supported!\n");
>   return 0;
>   }
>
>   ctx->sys_cache = kzalloc(ctx->train_data_size, GFP_KERNEL);
>   if (ctx->sys_cache == NULL) {
> - DRM_ERROR("alloc mem_train_ctx.sys_cache failed!\n");
> + dev_err(psp->adev->dev, "alloc mem_train_ctx.sys_cache
> failed!\n");
>   ret = -ENOMEM;
>   goto Err_out;
>   }
>
> -
>   DRM_DEBUG("train_data_size:%llx,p2c_train_data_offset:%llx,c2p_train
> _data_offset:%llx.\n",
> -   ctx->train_data_size,
> -   ctx->p2c_train_data_offset,
> -   ctx->c2p_train_data_offset);
> + dev_dbg(psp->adev->dev,
> +
>   "train_data_size:%llx,p2c_train_data_offset:%llx,c2p_train_data_offset:
> %llx.\n",
> + ctx->train_data_size,
> + ctx->p2c_train_data_offset,
> + ctx->c2p_train_data_offset);
>   ctx->init = PSP_MEM_TRAIN_INIT_SUCCESS;
>   return 0;
>
> @@ -407,7 +408,7 @@ static int psp_sw_init(void *handle)
>
>   psp->cmd = kzalloc(sizeof(struct psp_gfx_cmd_resp), GFP_KERNEL);
>   if (!psp->cmd) {
> - DRM_ERROR("Failed to allocate memory to command
> buffer!\n");
> + dev_err(adev->dev, "Failed to allocate memory to command
> buffer!\n");
>   ret = -ENOMEM;
>   }
>
> @@ -454,13 +455,13 @@ static int psp_sw_init(void *handle)
>   if (mem_training_ctx->enable_mem_training) {
>   ret = psp_memory_training_init(psp);
>   if (ret) {
> - DRM_ERROR("Failed to initialize memory training!\n");
> + dev_err(adev->dev, "Failed to initialize memory
> training!\n");
>   return ret;
>   }
>
>   ret = psp_mem_training(psp, PSP_MEM_TRAIN_COLD_BOOT);
>   if (ret) {
> - DRM_ERROR("Failed to process memory training!\n");
> + dev_err(adev->dev, "Failed to process memory
> training!\n");
>   return ret;
>   }
>   }
> @@ -675,9 +676,11 @@ psp_cmd_submit_buf(struct psp_context *psp,
>*/
>   if (!skip_unsupport && (psp->cmd_buf_mem->resp.status || !timeout)
> && !ras_intr) {
>   if (ucode)
> - DRM_WARN("failed to load ucode %s(0x%X) ",
> -   amdgpu_ucode_name(ucode->ucode_id),
> ucode->ucode_id);
> - DRM_WARN("psp gfx command %s(0x%X) failed and response
> status is (0x%X)\n",
> + dev_warn(psp->adev->dev,
> +  "failed to load ucode %s(0x%X) ",
> +  amdgpu_ucode_name(ucode->ucode_id),
> ucode->ucode_id);
> + dev_warn(psp->adev->dev,
> +  "psp gfx command %s(0x%X) failed and response status
> is (0x%X)\n",
>psp_gfx_cmd_name(psp->cmd_buf_mem->cmd_id),
> psp->cmd_buf_mem->cmd_id,
>psp->cmd_buf_mem->resp.status);
>   /* 

RE: [PATCH 3/3] drm/amdgpu: Centralize ras cap query to amdgpu_ras_check_supported

2024-01-02 Thread Zhou1, Tao
[AMD Official Use Only - General]

The series is:

Reviewed-by: Tao Zhou 

> -Original Message-
> From: amd-gfx  On Behalf Of Hawking
> Zhang
> Sent: Tuesday, January 2, 2024 11:45 AM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao ; Yang,
> Stanley ; Wang, Yang(Kevin)
> ; Chai, Thomas ; Li,
> Candice 
> Cc: Deucher, Alexander ; Ma, Le
> ; Lazar, Lijo ; Zhang, Hawking
> 
> Subject: [PATCH 3/3] drm/amdgpu: Centralize ras cap query to
> amdgpu_ras_check_supported
>
> Move ras capablity check to amdgpu_ras_check_supported.
> Driver will query ras capablity through psp interace, or vbios interface, or 
> specific
> ip callbacks.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 170 +---
>  1 file changed, 93 insertions(+), 77 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 5f302b7693b3..72b6e41329b0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -39,6 +39,7 @@
>  #include "nbio_v7_9.h"
>  #include "atom.h"
>  #include "amdgpu_reset.h"
> +#include "amdgpu_psp.h"
>
>  #ifdef CONFIG_X86_MCE_AMD
>  #include 
> @@ -2680,6 +2681,87 @@ static void amdgpu_ras_get_quirks(struct
> amdgpu_device *adev)
>   adev->ras_hw_enabled |= (1 << AMDGPU_RAS_BLOCK__GFX);  }
>
> +/* Query ras capablity via atomfirmware interface */ static void
> +amdgpu_ras_query_ras_capablity_from_vbios(struct amdgpu_device *adev) {
> + /* mem_ecc cap */
> + if (amdgpu_atomfirmware_mem_ecc_supported(adev)) {
> + dev_info(adev->dev, "MEM ECC is active.\n");
> + adev->ras_hw_enabled |= (1 << AMDGPU_RAS_BLOCK__UMC |
> +  1 << AMDGPU_RAS_BLOCK__DF);
> + } else {
> + dev_info(adev->dev, "MEM ECC is not presented.\n");
> + }
> +
> + /* sram_ecc cap */
> + if (amdgpu_atomfirmware_sram_ecc_supported(adev)) {
> + dev_info(adev->dev, "SRAM ECC is active.\n");
> + if (!amdgpu_sriov_vf(adev))
> + adev->ras_hw_enabled |= ~(1 <<
> AMDGPU_RAS_BLOCK__UMC |
> +   1 <<
> AMDGPU_RAS_BLOCK__DF);
> + else
> + adev->ras_hw_enabled |= (1 <<
> AMDGPU_RAS_BLOCK__PCIE_BIF |
> +  1 <<
> AMDGPU_RAS_BLOCK__SDMA |
> +  1 <<
> AMDGPU_RAS_BLOCK__GFX);
> +
> + /*
> +  * VCN/JPEG RAS can be supported on both bare metal and
> +  * SRIOV environment
> +  */
> + if (amdgpu_ip_version(adev, VCN_HWIP, 0) == IP_VERSION(2, 6,
> 0) ||
> + amdgpu_ip_version(adev, VCN_HWIP, 0) == IP_VERSION(4, 0,
> 0) ||
> + amdgpu_ip_version(adev, VCN_HWIP, 0) == IP_VERSION(4, 0,
> 3))
> + adev->ras_hw_enabled |= (1 <<
> AMDGPU_RAS_BLOCK__VCN |
> +  1 <<
> AMDGPU_RAS_BLOCK__JPEG);
> + else
> + adev->ras_hw_enabled &= ~(1 <<
> AMDGPU_RAS_BLOCK__VCN |
> +   1 <<
> AMDGPU_RAS_BLOCK__JPEG);
> +
> + /*
> +  * XGMI RAS is not supported if xgmi num physical nodes
> +  * is zero
> +  */
> + if (!adev->gmc.xgmi.num_physical_nodes)
> + adev->ras_hw_enabled &= ~(1 <<
> AMDGPU_RAS_BLOCK__XGMI_WAFL);
> + } else {
> + dev_info(adev->dev, "SRAM ECC is not presented.\n");
> + }
> +}
> +
> +/* Query poison mode from umc/df IP callbacks */ static void
> +amdgpu_ras_query_poison_mode(struct amdgpu_device *adev) {
> + struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> + bool df_poison, umc_poison;
> +
> + /* poison setting is useless on SRIOV guest */
> + if (amdgpu_sriov_vf(adev) || !con)
> + return;
> +
> + /* Init poison supported flag, the default value is false */
> + if (adev->gmc.xgmi.connected_to_cpu ||
> + adev->gmc.is_app_apu) {
> + /* enabled by default when GPU is connected to CPU */
> + con->poison_supported = true;
> + } else if (adev->df.funcs &&
> + adev->df.funcs->query_ras_poison_mode &&
> + adev->umc.ras &

RE: [PATCH 2/3] drm/amdgpu: Query ras capablity from psp

2024-01-02 Thread Zhou1, Tao
[AMD Official Use Only - General]

> -Original Message-
> From: Zhang, Hawking 
> Sent: Tuesday, January 2, 2024 1:38 PM
> To: Wang, Yang(Kevin) ; amd-
> g...@lists.freedesktop.org; Zhou1, Tao ; Yang, Stanley
> ; Chai, Thomas ; Li, Candice
> 
> Cc: Deucher, Alexander ; Lazar, Lijo
> ; Ma, Le 
> Subject: RE: [PATCH 2/3] drm/amdgpu: Query ras capablity from psp
>
> [AMD Official Use Only - General]
>
> The ret gives us a chance to fallback to legacy query approach (from vbios).
>
>
> You might want to see patch #3 of the series for more details, go to the 
> following
> lines in patch #3
>
> +   /* query ras capability from psp */
> +   if (amdgpu_psp_get_ras_capability(&adev->psp))
> +   goto init_ras_enabled_flag;
>
>
> Regards,
> Hawking
>
> -Original Message-
> From: Wang, Yang(Kevin) 
> Sent: Tuesday, January 2, 2024 13:19
> To: Zhang, Hawking ; amd-gfx@lists.freedesktop.org;
> Zhou1, Tao ; Yang, Stanley ;
> Chai, Thomas ; Li, Candice 
> Cc: Zhang, Hawking ; Deucher, Alexander
> ; Lazar, Lijo ; Ma, Le
> 
> Subject: RE: [PATCH 2/3] drm/amdgpu: Query ras capablity from psp
>
> [AMD Official Use Only - General]
>
> -----Original Message-
> From: Hawking Zhang 
> Sent: Tuesday, January 2, 2024 11:45 AM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao ; Yang,
> Stanley ; Wang, Yang(Kevin)
> ; Chai, Thomas ; Li,
> Candice 
> Cc: Zhang, Hawking ; Deucher, Alexander
> ; Lazar, Lijo ; Ma, Le
> 
> Subject: [PATCH 2/3] drm/amdgpu: Query ras capablity from psp
>
> Instead of traditional atomfirmware interfaces for RAS capability, host 
> driver can
> query ras capability from psp starting from psp v13_0_6.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 13 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h |  2 ++
> drivers/gpu/drm/amd/amdgpu/psp_v13_0.c  | 26 +
>  3 files changed, 41 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index 94b536e3cada..8a3847d3041f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -2125,6 +2125,19 @@ int amdgpu_psp_wait_for_bootloader(struct
> amdgpu_device *adev)
> return ret;
>  }
>
> +bool amdgpu_psp_get_ras_capability(struct psp_context *psp) {
> +   bool ret;
> +
> +   if (psp->funcs &&
> +   psp->funcs->get_ras_capability) {
> +   ret = psp->funcs->get_ras_capability(psp);
> +   return ret;

[Tao] I think the code can be simplified as:

return psp->funcs->get_ras_capability(psp);

and drop the ret variable.
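
A rough sketch of the whole helper with that change, for reference. The struct definitions below are cut-down stand-ins that carry only the fields used here, and the function and mock callback names are hypothetical; only the shape of the NULL checks follows the quoted patch.

```c
#include <stdbool.h>
#include <stddef.h>

struct psp_context;

/* Stand-in for the driver's callback table (one member only). */
struct psp_funcs {
	bool (*get_ras_capability)(struct psp_context *psp);
};

/* Stand-in for the driver's psp context (one member only). */
struct psp_context {
	const struct psp_funcs *funcs;
};

/* Return the callback result directly; no temporary is needed. */
static bool psp_get_ras_capability_sketch(struct psp_context *psp)
{
	if (psp->funcs && psp->funcs->get_ras_capability)
		return psp->funcs->get_ras_capability(psp);

	return false;
}

/* Hypothetical callback used only to exercise the helper. */
static bool mock_ras_capability_present(struct psp_context *psp)
{
	(void)psp;
	return true;
}
```

Dropping the else branch as well keeps the fallback (false) on a single exit path.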

> [kevin]:
> This variable 'ret' seems to have no other purpose, can we remove it and 
> return
> directly ?
>
> Best Regards,
> Kevin
> +   } else {
> +   return false;
> +   }
> +}
> +
>  static int psp_hw_start(struct psp_context *psp)  {
> struct amdgpu_device *adev = psp->adev; diff --git
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
> index 09d1f8f72a9c..652b0a01854a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h
> @@ -134,6 +134,7 @@ struct psp_funcs {
> int (*update_spirom)(struct psp_context *psp, uint64_t 
> fw_pri_mc_addr);
> int (*vbflash_stat)(struct psp_context *psp);
> int (*fatal_error_recovery_quirk)(struct psp_context *psp);
> +   bool (*get_ras_capability)(struct psp_context *psp);
>  };
>
>  struct ta_funcs {
> @@ -537,4 +538,5 @@ int psp_spatial_partition(struct psp_context *psp, int
> mode);  int is_psp_fw_valid(struct psp_bin_desc bin);
>
>  int amdgpu_psp_wait_for_bootloader(struct amdgpu_device *adev);
> +bool amdgpu_psp_get_ras_capability(struct psp_context *psp);
>  #endif
> diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
> b/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
> index 676bec2cc157..722b6066ce07 100644
> --- a/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/psp_v13_0.c
> @@ -27,6 +27,7 @@
>  #include "amdgpu_ucode.h"
>  #include "soc15_common.h"
>  #include "psp_v13_0.h"
> +#include "amdgpu_ras.h"
>
>  #include "mp/mp_13_0_2_offset.h"
>  #include "mp/mp_13_0_2_sh_mask.h"
> @@ -770,6 +771,30 @@ static int psp_v13_0_fatal_error_recovery_quirk(struct
> psp_context *psp)
> return 0;
>  }
>
> +static bool psp_v13_0_get_ras_capability(struct psp_context *psp) {
> +   struct

RE: [PATCH Review V3 1/1] drm/amdgpu: Fix ecc irq enable/disable unpaired

2023-12-21 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: amd-gfx  On Behalf Of
> Stanley.Yang
> Sent: Thursday, December 21, 2023 2:05 PM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking 
> Cc: Yang, Stanley 
> Subject: [PATCH Review V3 1/1] drm/amdgpu: Fix ecc irq enable/disable unpaired
>
> The ecc_irq is disabled while GPU mode2 reset suspending process, but not be
> enabled during GPU mode2 reset resume process.
>
> Changed from V1:
>   only do sdma/gfx ras_late_init in aldebaran_mode2_restore_ip
>   delete amdgpu_ras_late_resume function
>
> Changed from V2:
>   check umc ras supported before put ecc_irq
>
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/aldebaran.c | 28 +-
> drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c |  4 
> drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c |  5 +
> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  |  4 
>  4 files changed, 40 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> index 02f4c6f9d4f6..b60a3c1bd0f2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> +++ b/drivers/gpu/drm/amd/amdgpu/aldebaran.c
> @@ -330,6 +330,7 @@ aldebaran_mode2_restore_hwcontext(struct
> amdgpu_reset_control *reset_ctl,  {
>   struct list_head *reset_device_list = reset_context->reset_device_list;
>   struct amdgpu_device *tmp_adev = NULL;
> + struct amdgpu_ras *con;
>   int r;
>
>   if (reset_device_list == NULL)
> @@ -355,7 +356,32 @@ aldebaran_mode2_restore_hwcontext(struct
> amdgpu_reset_control *reset_ctl,
>*/
>   amdgpu_register_gpu_instance(tmp_adev);
>
> - /* Resume RAS */
> + /* Resume RAS, ecc_irq */
> + con = amdgpu_ras_get_context(tmp_adev);
> + if (!amdgpu_sriov_vf(tmp_adev) && con) {
> + if (tmp_adev->sdma.ras &&
> + amdgpu_ras_is_supported(tmp_adev,
> AMDGPU_RAS_BLOCK__SDMA) &&
> + tmp_adev->sdma.ras->ras_block.ras_late_init) {
> + r = tmp_adev->sdma.ras-
> >ras_block.ras_late_init(tmp_adev,
> + &tmp_adev->sdma.ras-
> >ras_block.ras_comm);
> + if (r) {
> + dev_err(tmp_adev->dev, "SDMA failed
> to execute ras_late_init! ret:%d\n", r);
> + goto end;
> + }
> + }
> +
> + if (tmp_adev->gfx.ras &&
> + amdgpu_ras_is_supported(tmp_adev,
> AMDGPU_RAS_BLOCK__GFX) &&
> + tmp_adev->gfx.ras->ras_block.ras_late_init) {
> + r = tmp_adev->gfx.ras-
> >ras_block.ras_late_init(tmp_adev,
> + &tmp_adev->gfx.ras-
> >ras_block.ras_comm);
> + if (r) {
> + dev_err(tmp_adev->dev, "GFX failed to
> execute ras_late_init! ret:%d\n", r);
> + goto end;
> + }
> + }
> + }
> +
>   amdgpu_ras_resume(tmp_adev);
>
>   /* Update PSP FW topology after reset */ diff --git
> a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> index 09cbca596bb5..4048539205cb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> @@ -1043,6 +1043,10 @@ static int gmc_v10_0_hw_fini(void *handle)
>
>   amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);
>
> + if (adev->gmc.ecc_irq.funcs &&
> + amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__UMC))
> + amdgpu_irq_put(adev, &adev->gmc.ecc_irq, 0);
> +
>   return 0;
>  }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> index 416f3e4f0438..e1ca5a599971 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> @@ -941,6 +941,11 @@ static int gmc_v11_0_hw_fini(void *handle)
>   }
>
>   amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);
> +
> + if (adev->gmc.ecc_irq.funcs &&
> + amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__UMC))
> + amdgpu_irq_put(adev, &adev->gmc.ecc_irq, 0);
> +
>   gmc_v11_0_gart_disable(adev);
>
>   return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 205db28a9803..f00e5c8c79b0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -2388,6 +2388,10 @@ static int gmc_v9_0_hw_fini(void *handle)
>
>   amdgpu_irq_put(adev, &adev->gmc.vm_fault, 0);
>
> + if (adev->gmc.ecc_irq.funcs &&
> + amdgpu_ras_is_supported(adev, AMDGPU_RAS_

RE: [PATCH] drm/amdgpu: Drop redundant unsigned >=0 comparision 'amdgpu_gfx_rlc_init_microcode()'

2023-12-21 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: amd-gfx  On Behalf Of Srinivasan
> Shanmugam
> Sent: Wednesday, December 20, 2023 10:40 PM
> To: Deucher, Alexander ; Koenig, Christian
> 
> Cc: SHANMUGAM, SRINIVASAN ; amd-
> g...@lists.freedesktop.org
> Subject: [PATCH] drm/amdgpu: Drop redundant unsigned >=0 comparision
> 'amdgpu_gfx_rlc_init_microcode()'
>
> unsigned int "version_minor" is always >= 0
>
> Fixes the below:
> drivers/gpu/drm/amd/amdgpu/amdgpu_rlc.c:534
> amdgpu_gfx_rlc_init_microcode() warn: always true condition '(version_minor >=
> 0) => (0-u16max >= 0)'
>
> Cc: Christian König 
> Cc: Alex Deucher 
> Signed-off-by: Srinivasan Shanmugam 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_rlc.c | 11 +--
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_rlc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_rlc.c
> index 35e0ae9acadc..2c3675d91614 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_rlc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_rlc.c
> @@ -531,13 +531,12 @@ int amdgpu_gfx_rlc_init_microcode(struct
> amdgpu_device *adev,
>   if (version_major == 2 && version_minor == 1)
>   adev->gfx.rlc.is_rlc_v2_1 = true;
>
> - if (version_minor >= 0) {
> - err = amdgpu_gfx_rlc_init_microcode_v2_0(adev);
> - if (err) {
> - dev_err(adev->dev, "fail to init rlc v2_0 microcode\n");
> - return err;
> - }
> + err = amdgpu_gfx_rlc_init_microcode_v2_0(adev);
> + if (err) {
> + dev_err(adev->dev, "fail to init rlc v2_0 microcode\n");
> + return err;
>   }
> +
>   if (version_minor >= 1)
>   amdgpu_gfx_rlc_init_microcode_v2_1(adev);
>   if (version_minor >= 2)
> --
> 2.34.1

RE: [PATCH] drm/amdgpu: Use kzalloc instead of kmalloc+__GFP_ZERO in amdgpu_ras.c

2023-12-20 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: amd-gfx  On Behalf Of Srinivasan
> Shanmugam
> Sent: Tuesday, December 19, 2023 10:12 PM
> To: Deucher, Alexander ; Koenig, Christian
> 
> Cc: SHANMUGAM, SRINIVASAN ; amd-
> g...@lists.freedesktop.org
> Subject: [PATCH] drm/amdgpu: Use kzalloc instead of kmalloc+__GFP_ZERO in
> amdgpu_ras.c
>
> Fixes the below smatch warnings:
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2543 amdgpu_ras_recovery_init()
> warn: Please consider using kzalloc instead of kmalloc
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2830 amdgpu_ras_init() warn:
> Please consider using kzalloc instead of kmalloc
>
> Cc: Christian König 
> Cc: Alex Deucher 
> Signed-off-by: Srinivasan Shanmugam 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index bad62141f708..e541e6925918 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2540,7 +2540,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device
> *adev)
>   return 0;
>
>   data = &con->eh_data;
> - *data = kmalloc(sizeof(**data), GFP_KERNEL | __GFP_ZERO);
> + *data = kzalloc(sizeof(**data), GFP_KERNEL);
>   if (!*data) {
>   ret = -ENOMEM;
>   goto out;
> @@ -2827,10 +2827,10 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
>   if (con)
>   return 0;
>
> - con = kmalloc(sizeof(struct amdgpu_ras) +
> + con = kzalloc(sizeof(*con) +
>   sizeof(struct ras_manager) *
> AMDGPU_RAS_BLOCK_COUNT +
>   sizeof(struct ras_manager) *
> AMDGPU_RAS_MCA_BLOCK_COUNT,
> - GFP_KERNEL|__GFP_ZERO);
> + GFP_KERNEL);
>   if (!con)
>   return -ENOMEM;
>
> --
> 2.34.1

RE: [PATCH] drm/amdgpu: handle extra UE register entries for gfx v9_4_3

2023-10-31 Thread Zhou1, Tao
[AMD Official Use Only - General]

In fact, the UE list has only one extra entry compared with the CE list.
Handling the CE and UE lists one by one gives a simpler code structure, while the
current approach needs fewer loop iterations; either way is fine with me.
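
To make the trade-off concrete, here is a small stand-alone sketch of the two loop structures. The tables and function names are hypothetical (plain integers instead of register entries), and the first variant assumes the UE table is at least as long as the CE table.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical tables: the UE table has one more entry than the CE table. */
static const unsigned long ce_table[] = { 1, 2, 3 };
static const unsigned long ue_table[] = { 10, 20, 30, 40 };

/* Shared loop over the common prefix, then a tail loop for the extra UE entries. */
static void sum_shared_then_tail(unsigned long *ce, unsigned long *ue)
{
	size_t i;

	*ce = 0;
	*ue = 0;

	for (i = 0; i < ARRAY_SIZE(ce_table); i++) {
		*ce += ce_table[i];
		*ue += ue_table[i];
	}

	/* i carries over, so only the entries beyond the CE table are visited */
	for (; i < ARRAY_SIZE(ue_table); i++)
		*ue += ue_table[i];
}

/* Alternative: walk each table on its own. */
static void sum_separately(unsigned long *ce, unsigned long *ue)
{
	size_t i;

	*ce = 0;
	*ue = 0;

	for (i = 0; i < ARRAY_SIZE(ce_table); i++)
		*ce += ce_table[i];

	for (i = 0; i < ARRAY_SIZE(ue_table); i++)
		*ue += ue_table[i];
}
```

Both variants produce the same totals; the first visits each index once, the second makes no assumption about the relative table lengths.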

Regards,
Tao

> -Original Message-
> From: Yang, Stanley 
> Sent: Tuesday, October 31, 2023 7:02 PM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org
> Cc: Chai, Thomas ; Zhou1, Tao 
> Subject: RE: [PATCH] drm/amdgpu: handle extra UE register entries for gfx
> v9_4_3
>
> [AMD Official Use Only - General]
>
> Is it better to handle CE and UE list separately?
> Anyway Reviewed-by: Stanley.Yang 
>
> Regards,
> Stanley
> > -Original Message-
> > From: amd-gfx  On Behalf Of Tao
> > Zhou
> > Sent: Tuesday, October 31, 2023 3:09 PM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Chai, Thomas ; Zhou1, Tao
> 
> > Subject: [PATCH] drm/amdgpu: handle extra UE register entries for gfx
> > v9_4_3
> >
> > The UE registe list is larger than CE list.
> >
> > Reported-by: yipeng.c...@amd.com
> > Signed-off-by: Tao Zhou 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 38
> > +
> >  1 file changed, 38 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> > b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> > index 41bbabd9ad4d..046ae95b366a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> > @@ -3799,6 +3799,27 @@ static void
> > gfx_v9_4_3_inst_query_ras_err_count(struct amdgpu_device *adev,
> >   }
> >   }
> >
> > + /* handle extra register entries of UE */
> > + for (; i < ARRAY_SIZE(gfx_v9_4_3_ue_reg_list); i++) {
> > + for (j = 0; j < gfx_v9_4_3_ue_reg_list[i].se_num; j++) {
> > + for (k = 0; k <
> > gfx_v9_4_3_ue_reg_list[i].reg_entry.reg_inst; k++) {
> > + /* no need to select if instance number is 1 
> > */
> > + if (gfx_v9_4_3_ue_reg_list[i].se_num > 1
> > + ||
> > +
> >   gfx_v9_4_3_ue_reg_list[i].reg_entry.reg_inst > 1)
> > +
> > + gfx_v9_4_3_xcc_select_se_sh(adev, j,
> > 0, k, xcc_id);
> > +
> > +
> >   amdgpu_ras_inst_query_ras_error_count(adev,
> > +
> >   &(gfx_v9_4_3_ue_reg_list[i].reg_entry),
> > + 1,
> > +
> >   gfx_v9_4_3_ras_mem_list_array[gfx_v9_4_3_ue_reg_list[i].mem_id_t
> > ype].mem_id_ent,
> > +
> >   gfx_v9_4_3_ras_mem_list_array[gfx_v9_4_3_ue_reg_list[i].mem_id_t
> > ype].size,
> > + GET_INST(GC, xcc_id),
> > +
> >   AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE,
> > + &ue_count);
> > + }
> > + }
> > + }
> > +
> >   gfx_v9_4_3_xcc_select_se_sh(adev, 0x, 0x, 0x,
> >   xcc_id);
> >   mutex_unlock(&adev->grbm_idx_mutex);
> > @@ -3838,6 +3859,23 @@ static void
> > gfx_v9_4_3_inst_reset_ras_err_count(struct amdgpu_device *adev,
> >   }
> >   }
> >
> > + /* handle extra register entries of UE */
> > + for (; i < ARRAY_SIZE(gfx_v9_4_3_ue_reg_list); i++) {
> > + for (j = 0; j < gfx_v9_4_3_ue_reg_list[i].se_num; j++) {
> > + for (k = 0; k <
> > gfx_v9_4_3_ue_reg_list[i].reg_entry.reg_inst; k++) {
> > + /* no need to select if instance number is 1 
> > */
> > + if (gfx_v9_4_3_ue_reg_list[i].se_num > 1
> > + ||
> > +
> >   gfx_v9_4_3_ue_reg_list[i].reg_entry.reg_inst > 1)
> > +
> > + gfx_v9_4_3_xcc_select_se_sh(adev, j,
> > 0, k, xcc_id);
> > +
> > +
> >   amdgpu_ras_inst_reset_ras_error_count(adev,
> > +
> >   &(gfx_v9_4_3_ue_reg_list[i].reg_entry),
> > + 1,
> > + GET_INST(GC, xcc_id));
> > + }
> > + }
> > + }
> > +
> >   gfx_v9_4_3_xcc_select_se_sh(adev, 0x, 0x, 0x,
> >   xcc_id);
> >   mutex_unlock(&adev->grbm_idx_mutex);
> > --
> > 2.35.1
>



RE: [PATCH Review 1/1] drm/amdgpu: Enable mca debug mode mode for apu

2023-10-18 Thread Zhou1, Tao
[AMD Official Use Only - General]

> -Original Message-
> From: amd-gfx  On Behalf Of
> Stanley.Yang
> Sent: Wednesday, October 18, 2023 8:22 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Yang, Stanley 
> Subject: [PATCH Review 1/1] drm/amdgpu: Enable mca debug mode mode for apu
[Tao] The word "mode" is duplicated in the subject.

>
> Enable smu_v13_0_6 mca debug mode when GFX RAS feature is enabled on APU.
>
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> index cab5a5569bc6..62589ba3c4a5 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
> @@ -2298,7 +2298,8 @@ static int smu_v13_0_6_post_init(struct smu_context
> *smu)  {
>   struct amdgpu_device *adev = smu->adev;
>
> - if (!amdgpu_sriov_vf(adev) && amdgpu_ras_is_supported(adev,
> AMDGPU_RAS_BLOCK__UMC))
> + if (!amdgpu_sriov_vf(adev) && (amdgpu_ras_is_supported(adev,
> AMDGPU_RAS_BLOCK__UMC) ||
> + (adev->is_app_apu && amdgpu_ras_is_supported(adev,
> +AMDGPU_RAS_BLOCK__GFX
[Tao] Can "adev->is_app_apu" be dropped?
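
For reference, the two forms of the condition can be compared in isolation. This is only a truth-table sketch with hypothetical function names and plain booleans in place of the driver calls; it does not say which form is right for the hardware.

```c
#include <stdbool.h>

/* Condition as posted: GFX RAS only counts on an APP APU. */
static bool debug_mode_cond_as_posted(bool is_vf, bool umc_ras,
				      bool is_app_apu, bool gfx_ras)
{
	return !is_vf && (umc_ras || (is_app_apu && gfx_ras));
}

/* Condition with the APU check dropped. */
static bool debug_mode_cond_without_apu(bool is_vf, bool umc_ras, bool gfx_ras)
{
	return !is_vf && (umc_ras || gfx_ras);
}
```

The two differ in exactly one case: bare metal, no UMC RAS, GFX RAS supported, and not an APP APU. Whether that combination can occur on a part using smu_v13_0_6 is not stated in this thread.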

>   return smu_v13_0_6_mca_set_debug_mode(smu, true);
>
>   return 0;
> --
> 2.25.1



RE: [PATCH Review 1/1] drm/amdgpu: Workaround to skip kiq ring test during ras gpu recovery

2023-10-17 Thread Zhou1, Tao
[AMD Official Use Only - General]

> -Original Message-
> From: amd-gfx  On Behalf Of
> Stanley.Yang
> Sent: Tuesday, October 17, 2023 10:37 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Yang, Stanley 
> Subject: [PATCH Review 1/1] drm/amdgpu: Workaround to skip kiq ring test
> during ras gpu recovery
>
> This is workaround, kiq ring test failed in suspend stage when do ras 
> recovery for
> gfx v9_4_3.
>
> Change-Id: I8de9900aa76706f59bc029d4e9e8438c6e1db8e0
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 21 +
>  1 file changed, 21 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index 9a158018ae16..902e60203809 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -29,6 +29,7 @@
>  #include "amdgpu_rlc.h"
>  #include "amdgpu_ras.h"
>  #include "amdgpu_xcp.h"
> +#include "amdgpu_xgmi.h"
>
>  /* delay 0.1 second to enable gfx off feature */
>  #define GFX_OFF_DELAY_ENABLE msecs_to_jiffies(100)
> @@ -501,6 +502,9 @@ int amdgpu_gfx_disable_kcq(struct amdgpu_device *adev, int xcc_id)
>  {
>   struct amdgpu_kiq *kiq = &adev->gfx.kiq[xcc_id];
>   struct amdgpu_ring *kiq_ring = &kiq->ring;
> + struct amdgpu_hive_info *hive;
> + struct amdgpu_ras *ras;
> + int hive_ras_recovery;
>   int i, r = 0;
>   int j;
>
> @@ -521,6 +525,23 @@ int amdgpu_gfx_disable_kcq(struct amdgpu_device
> *adev, int xcc_id)
>  RESET_QUEUES, 0, 0);
>   }
>
> + /**
> +  * This is workaround: only skip kiq_ring test
> +  * during ras recovery in suspend stage for gfx v9_4_3
> +  */
> + hive = amdgpu_get_xgmi_hive(adev);
> + if (hive) {
[Tao] hive_ras_recovery should have a default value for the !hive case.
With that fixed, the patch is:

Reviewed-by: Tao Zhou 

> + hive_ras_recovery = atomic_read(&hive->ras_recovery);
> + amdgpu_put_xgmi_hive(hive);
> + }
> +
> + ras = amdgpu_ras_get_context(adev);
> + if ((amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3)) &&
> + ras && (atomic_read(&ras->in_recovery) || hive_ras_recovery))
> {
> + spin_unlock(&kiq->ring_lock);
> + return 0;
> + }
> +
>   if (kiq_ring->sched.ready && !adev->job_hang)
>   r = amdgpu_ring_test_helper(kiq_ring);
>   spin_unlock(&kiq->ring_lock);
> --
> 2.25.1
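
A minimal sketch of the default-value fix requested in the review above. The struct and function names (`demo_hive`, `demo_skip_ring_test`) are hypothetical stand-ins, not the real amdgpu types; the point is only that the flag is initialized before the `if (hive)` branch, so the later test never reads an indeterminate value when the device is not part of a hive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the hive object; not the real amdgpu_hive_info. */
struct demo_hive {
	int ras_recovery;
};

/*
 * Returns true when the ring test should be skipped.
 * hive may be NULL, which models a device that is not part of a hive.
 */
static bool demo_skip_ring_test(const struct demo_hive *hive, bool dev_in_recovery)
{
	int hive_ras_recovery = 0;	/* default covers the !hive path */

	if (hive)
		hive_ras_recovery = hive->ras_recovery;

	return dev_in_recovery || hive_ras_recovery != 0;
}
```

With the initializer in place, `demo_skip_ring_test(NULL, false)` is well defined and returns false; without it, the same call would read an uninitialized local.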



Re: [PATCH 4/5] drm/amdgpu: bypass RAS error reset in some conditions

2023-10-13 Thread Zhou1, Tao
[AMD Official Use Only - General]

How about this condition:

if ((amdgpu_in_reset(adev) || amdgpu_ras_intr_triggered()) &&
   mca_funcs && mca_funcs->mca_set_debug_mode)

I use amdgpu_in_reset to skip touching it in all gpu resets, not only for the 
resets triggered by ras fatal error.

Regards,
Tao


From: Zhang, Hawking 
Sent: Thursday, October 12, 2023 9:14 PM
To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org 
; Yang, Stanley ; Li, 
Candice ; Chai, Thomas ; Wang, 
Yang(Kevin) 
Subject: RE: [PATCH 4/5] drm/amdgpu: bypass RAS error reset in some conditions

[AMD Official Use Only - General]

-   if (!amdgpu_ras_is_supported(adev, block))
+   /* skip ras error reset in gpu reset */
+   if (amdgpu_in_reset(adev) &&
+   mca_funcs && mca_funcs->mca_set_debug_mode)
+   return 0;

We should check the RAS in_recovery flag in this case. The reset domain is
locked in a relatively late phase, at least *after* the error counter harvest.
Please double-check.

Regards,
Hawking
-----Original Message-----
From: Zhou1, Tao 
Sent: Thursday, October 12, 2023 17:01
To: amd-gfx@lists.freedesktop.org; Yang, Stanley ; Zhang, 
Hawking ; Li, Candice ; Chai, Thomas 
; Wang, Yang(Kevin) 
Cc: Zhou1, Tao 
Subject: [PATCH 4/5] drm/amdgpu: bypass RAS error reset in some conditions

PMFW is responsible for RAS error reset in some conditions, driver can skip the 
operation.

Signed-off-by: Tao Zhou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 91ed4fd96ee1..6dddb0423411 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1105,11 +1105,18 @@ int amdgpu_ras_reset_error_count(struct amdgpu_device 
*adev,
enum amdgpu_ras_block block)
 {
struct amdgpu_ras_block_object *block_obj = 
amdgpu_ras_get_ras_block(adev, block, 0);
+   const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs;

if (!block_obj || !block_obj->hw_ops)
return 0;

-   if (!amdgpu_ras_is_supported(adev, block))
+   /* skip ras error reset in gpu reset */
+   if (amdgpu_in_reset(adev) &&
+   mca_funcs && mca_funcs->mca_set_debug_mode)
+   return 0;
+
+   if (!amdgpu_ras_is_supported(adev, block) ||
+   !amdgpu_ras_get_mca_debug_mode(adev))
return 0;

if (block_obj->hw_ops->reset_ras_error_count)
@@ -1122,6 +1129,7 @@ int amdgpu_ras_reset_error_status(struct amdgpu_device 
*adev,
enum amdgpu_ras_block block)
 {
struct amdgpu_ras_block_object *block_obj = 
amdgpu_ras_get_ras_block(adev, block, 0);
+   const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs;

if (!block_obj || !block_obj->hw_ops) {
dev_dbg_once(adev->dev, "%s doesn't config RAS function\n",
@@ -1129,7 +1137,13 @@ int amdgpu_ras_reset_error_status(struct amdgpu_device *adev,
return 0;
}

-   if (!amdgpu_ras_is_supported(adev, block))
+   /* skip ras error reset in gpu reset */
+   if (amdgpu_in_reset(adev) &&
+   mca_funcs && mca_funcs->mca_set_debug_mode)
+   return 0;
+
+   if (!amdgpu_ras_is_supported(adev, block) ||
+   !amdgpu_ras_get_mca_debug_mode(adev))
return 0;

if (block_obj->hw_ops->reset_ras_error_count)
--
2.35.1
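
The condition proposed at the top of this thread can be sketched in isolation as below. This is only an illustration under stated assumptions: `demo_mca_funcs`, `demo_set_debug_mode_stub` and `demo_skip_error_reset` are hypothetical names, and the two boolean parameters stand in for the results of amdgpu_in_reset() and amdgpu_ras_intr_triggered().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the MCA callback table. */
struct demo_mca_funcs {
	int (*mca_set_debug_mode)(bool enable);
};

/* Stub callback so that a populated table can be built for checks. */
static int demo_set_debug_mode_stub(bool enable)
{
	(void)enable;
	return 0;
}

/*
 * Skip the driver-side error reset when a reset (of any kind) or a RAS
 * fatal interrupt is in flight AND the firmware-backed callback exists.
 */
static bool demo_skip_error_reset(bool in_reset, bool ras_intr_triggered,
				  const struct demo_mca_funcs *mca_funcs)
{
	return (in_reset || ras_intr_triggered) &&
	       mca_funcs != NULL &&
	       mca_funcs->mca_set_debug_mode != NULL;
}
```

The NULL checks come after the cheap boolean tests, and short-circuit evaluation keeps the callback pointer from being read through a NULL table.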



RE: [PATCH Review 1/1] drm/amdgpu: Fix potential null pointer derefernce

2023-09-27 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -----Original Message-----
> From: amd-gfx  On Behalf Of
> Stanley.Yang
> Sent: Thursday, September 28, 2023 11:46 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Yang, Stanley 
> Subject: [PATCH Review 1/1] drm/amdgpu: Fix potential null pointer derefernce
>
> The amdgpu_ras_get_context may return NULL if device not support ras feature,
> so add check before using.
>
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index cca3faf4dc23..60f8a18592b7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5391,7 +5391,8 @@ int amdgpu_device_gpu_recover(struct
> amdgpu_device *adev,
>* Flush RAM to disk so that after reboot
>* the user can read log and see why the system rebooted.
>*/
> - if (need_emergency_restart && amdgpu_ras_get_context(adev)-
> >reboot) {
> + if (need_emergency_restart && amdgpu_ras_get_context(adev) &&
> + amdgpu_ras_get_context(adev)->reboot) {
>   DRM_WARN("Emergency reboot.");
>
>   ksys_sync_helper();
> --
> 2.25.1
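
The guard added by this patch follows the usual "check the getter's result before dereferencing" pattern. A minimal sketch, with hypothetical names (`demo_ras`, `demo_should_reboot`) rather than the real amdgpu structures, and with the context pointer fetched once by the caller instead of calling the getter twice:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the RAS context. */
struct demo_ras {
	bool reboot;
};

/*
 * ras may be NULL when the device does not support RAS, so it is tested
 * before ->reboot is read. The && chain short-circuits left to right.
 */
static bool demo_should_reboot(bool need_emergency_restart,
			       const struct demo_ras *ras)
{
	return need_emergency_restart && ras != NULL && ras->reboot;
}
```

Passing the pointer in is a design choice for the sketch only; it keeps the NULL test and the dereference on the same value.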



RE: [PATCH Review 1/1] drm/amdgpu: Skip ring test during ras in recovery

2023-09-27 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -----Original Message-----
> From: amd-gfx  On Behalf Of
> Stanley.Yang
> Sent: Thursday, September 28, 2023 11:42 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Yang, Stanley 
> Subject: [PATCH Review 1/1] drm/amdgpu: Skip ring test during ras in recovery
>
> This is workaround due to ring test failed during ras do gpu recovery for aqua
> vanjaram.
>
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index fbfe0a1c4b19..9fff58d073a7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -248,10 +248,16 @@ static int gfx_v9_4_3_ring_test_ring(struct amdgpu_ring *ring)
>  {
>   uint32_t scratch_reg0_offset, xcc_offset;
>   struct amdgpu_device *adev = ring->adev;
> + struct amdgpu_ras *ras;
>   uint32_t tmp = 0;
>   unsigned i;
>   int r;
>
> + /* This is workaround: ring test failed during ras recovery */
> + ras = amdgpu_ras_get_context(adev);
> + if (ras && atomic_read(&ras->in_recovery))
> + return 0;
> +
>   /* Use register offset which is local to XCC in the packet */
>   xcc_offset = SOC15_REG_OFFSET(GC, 0, regSCRATCH_REG0);
>   scratch_reg0_offset = SOC15_REG_OFFSET(GC, GET_INST(GC, ring-
> >xcc_id), regSCRATCH_REG0);
> --
> 2.25.1



RE: [PATCH 3/3] drm/amdgpu: change if condition for bad channel bitmap update

2023-09-19 Thread Zhou1, Tao
[AMD Official Use Only - General]

Thanks for catching it, I will update the patch.

Tao

> -----Original Message-----
> From: Wang, Yang(Kevin) 
> Sent: Tuesday, September 19, 2023 11:34 PM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org; Zhang,
> Hawking ; Yang, Stanley ;
> Li, Candice ; Chai, Thomas 
> Subject: RE: [PATCH 3/3] drm/amdgpu: change if condition for bad channel
> bitmap update
>
> [AMD Official Use Only - General]
>
> Hi Tao,
>
> Based on your description, I think you should use BITS_PER_TYPE() instead of
> sizeof(), right?
>
> Best Regards,
> Kevin
>
> -----Original Message-----
> From: Zhou1, Tao 
> Sent: Tuesday, September 19, 2023 6:10 PM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking ;
> Yang, Stanley ; Li, Candice ;
> Wang, Yang(Kevin) ; Chai, Thomas
> 
> Cc: Zhou1, Tao 
> Subject: [PATCH 3/3] drm/amdgpu: change if condition for bad channel bitmap
> update
>
> The amdgpu_ras_eeprom_control.bad_channel_bitmap is u32 type, but the
> channel index could be larger than 32. For the ASICs whose channel number is
> more than 32, the amdgpu_dpm_send_hbm_bad_channel_flag
> interface is not supported, so we simply bypass channel bitmap update under 
> this
> condition.
>
> Signed-off-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 8ced4be784e0..1c4433f22f4b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -616,7 +616,8 @@ amdgpu_ras_eeprom_append_table(struct
> amdgpu_ras_eeprom_control *control,
> __encode_table_record_to_buf(control, &record[i], pp);
>
> /* update bad channel bitmap */
> -   if (!(control->bad_channel_bitmap & (1 << 
> record[i].mem_channel))) {
> +   if ((record[i].mem_channel < 
> sizeof(control->bad_channel_bitmap)) &&
> +   !(control->bad_channel_bitmap & (1 << 
> record[i].mem_channel))) {
> control->bad_channel_bitmap |= 1 << 
> record[i].mem_channel;
> con->update_channel_flag = true;
> }
> @@ -969,7 +970,8 @@ int amdgpu_ras_eeprom_read(struct
> amdgpu_ras_eeprom_control *control,
> __decode_table_record_from_buf(control, &record[i], pp);
>
> /* update bad channel bitmap */
> -   if (!(control->bad_channel_bitmap & (1 << 
> record[i].mem_channel))) {
> +   if ((record[i].mem_channel < 
> sizeof(control->bad_channel_bitmap)) &&
> +   !(control->bad_channel_bitmap & (1 << 
> record[i].mem_channel))) {
> control->bad_channel_bitmap |= 1 << 
> record[i].mem_channel;
> con->update_channel_flag = true;
> }
> --
> 2.35.1
>
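
Kevin's point can be illustrated outside the kernel: sizeof() on a u32 yields 4 (bytes), whereas the bound needed for a shift is 32 (bits), so the check as posted would only ever accept channels 0 to 3. A minimal user-space sketch follows; `DEMO_BITS_PER_TYPE` is a local stand-in for the kernel's BITS_PER_TYPE() macro, and `demo_mark_bad_channel` is a hypothetical helper, not driver code.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the kernel's BITS_PER_TYPE(): size in bits, not bytes. */
#define DEMO_BITS_PER_TYPE(type) (sizeof(type) * CHAR_BIT)

/*
 * Record `channel` in a 32-bit bitmap.
 * Returns true only when the bitmap actually changed.
 */
static bool demo_mark_bad_channel(uint32_t *bitmap, unsigned int channel)
{
	assert(bitmap != NULL);

	/* Channels that do not fit in the bitmap are bypassed. */
	if (channel >= DEMO_BITS_PER_TYPE(*bitmap))
		return false;

	/* Unsigned constant avoids shifting into the sign bit at channel 31. */
	if (*bitmap & (UINT32_C(1) << channel))
		return false;	/* already recorded */

	*bitmap |= UINT32_C(1) << channel;
	return true;
}
```

Bounding by the bit width also keeps the shift count below the width of the operand, which a byte-count bound does not express correctly.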



RE: [PATCH Review V2 1/1] drm/amdgpu: Fix false positive error log

2023-09-17 Thread Zhou1, Tao
[AMD Official Use Only - General]

The update is fine with me, but since "!block_obj || !block_obj->hw_ops" is not
considered an error status, can we change dev_dbg_once to dev_info_once?
With that fixed, the patch is:

Reviewed-by: Tao Zhou 

> -----Original Message-----
> From: amd-gfx  On Behalf Of
> Stanley.Yang
> Sent: Friday, September 15, 2023 7:07 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Yang, Stanley 
> Subject: [PATCH Review V2 1/1] drm/amdgpu: Fix false positive error log
>
> It should first check block ras obj whether be set, it should return 0 
> directly if
> block ras obj or hw_ops is not set.
>
> Changed from V1:
>   return 0 directly if block ras obj or hw ops is not set
>
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 4a6df4e24243..25514af6cf8f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1105,15 +1105,15 @@ int amdgpu_ras_reset_error_status(struct amdgpu_device *adev,
>  {
>   struct amdgpu_ras_block_object *block_obj =
> amdgpu_ras_get_ras_block(adev, block, 0);
>
> - if (!amdgpu_ras_is_supported(adev, block))
> - return -EINVAL;
> -
> - if (!block_obj || !block_obj->hw_ops)   {
> + if (!block_obj || !block_obj->hw_ops) {
>   dev_dbg_once(adev->dev, "%s doesn't config RAS function\n",
>ras_block_str(block));
> - return -EINVAL;
> + return 0;
>   }
>
> + if (!amdgpu_ras_is_supported(adev, block))
> + return -EINVAL;
> +
>   if (block_obj->hw_ops->reset_ras_error_count)
>   block_obj->hw_ops->reset_ras_error_count(adev);
>
> --
> 2.25.1



RE: [PATCH] drm/amdgpu: Correct se_num and reg_inst for gfx v9_4_3 ras counters

2023-09-06 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -----Original Message-----
> From: amd-gfx  On Behalf Of Hawking
> Zhang
> Sent: Wednesday, September 6, 2023 6:12 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao ; Yang,
> Stanley ; Li, Candice ; Chai,
> Thomas 
> Cc: Zhang, Hawking 
> Subject: [PATCH] drm/amdgpu: Correct se_num and reg_inst for gfx v9_4_3 ras
> counters
>
> gfx_v9_4_3_ue|ce_reg_list is an array per gfx core instance correct the 
> settings of
> se_num and reg_inst for some of gfx ras counters so all the available register
> instances can be polled for ras status.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 40 -
>  1 file changed, 20 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index 0a26a00074a6..a60d1a8405d4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -3653,19 +3653,19 @@ static const struct amdgpu_gfx_ras_reg_entry
> gfx_v9_4_3_ce_reg_list[] = {
>   AMDGPU_GFX_GC_CANE_MEM, 1},
>   {{AMDGPU_RAS_REG_ENTRY(GC, 0, regSPI_CE_ERR_STATUS_LO,
> regSPI_CE_ERR_STATUS_HI),
>   1, (AMDGPU_RAS_ERR_INFO_VALID |
> AMDGPU_RAS_ERR_STATUS_VALID), "SPI"},
> - AMDGPU_GFX_SPI_MEM, 8},
> + AMDGPU_GFX_SPI_MEM, 1},
>   {{AMDGPU_RAS_REG_ENTRY(GC, 0, regSP0_CE_ERR_STATUS_LO,
> regSP0_CE_ERR_STATUS_HI),
>   10, (AMDGPU_RAS_ERR_INFO_VALID |
> AMDGPU_RAS_ERR_STATUS_VALID), "SP0"},
> - AMDGPU_GFX_SP_MEM, 1},
> + AMDGPU_GFX_SP_MEM, 4},
>   {{AMDGPU_RAS_REG_ENTRY(GC, 0, regSP1_CE_ERR_STATUS_LO,
> regSP1_CE_ERR_STATUS_HI),
>   10, (AMDGPU_RAS_ERR_INFO_VALID |
> AMDGPU_RAS_ERR_STATUS_VALID), "SP1"},
> - AMDGPU_GFX_SP_MEM, 1},
> + AMDGPU_GFX_SP_MEM, 4},
>   {{AMDGPU_RAS_REG_ENTRY(GC, 0, regSQ_CE_ERR_STATUS_LO,
> regSQ_CE_ERR_STATUS_HI),
>   10, (AMDGPU_RAS_ERR_INFO_VALID |
> AMDGPU_RAS_ERR_STATUS_VALID), "SQ"},
> - AMDGPU_GFX_SQ_MEM, 8},
> + AMDGPU_GFX_SQ_MEM, 4},
>   {{AMDGPU_RAS_REG_ENTRY(GC, 0, regSQC_CE_EDC_LO,
> regSQC_CE_EDC_HI),
>   5, (AMDGPU_RAS_ERR_INFO_VALID |
> AMDGPU_RAS_ERR_STATUS_VALID), "SQC"},
> - AMDGPU_GFX_SQC_MEM, 8},
> + AMDGPU_GFX_SQC_MEM, 4},
>   {{AMDGPU_RAS_REG_ENTRY(GC, 0, regTCX_CE_ERR_STATUS_LO,
> regTCX_CE_ERR_STATUS_HI),
>   2, (AMDGPU_RAS_ERR_INFO_VALID |
> AMDGPU_RAS_ERR_STATUS_VALID), "TCX"},
>   AMDGPU_GFX_TCX_MEM, 1},
> @@ -3674,22 +3674,22 @@ static const struct amdgpu_gfx_ras_reg_entry
> gfx_v9_4_3_ce_reg_list[] = {
>   AMDGPU_GFX_TCC_MEM, 1},
>   {{AMDGPU_RAS_REG_ENTRY(GC, 0, regTA_CE_EDC_LO,
> regTA_CE_EDC_HI),
>   10, (AMDGPU_RAS_ERR_INFO_VALID |
> AMDGPU_RAS_ERR_STATUS_VALID), "TA"},
> - AMDGPU_GFX_TA_MEM, 8},
> + AMDGPU_GFX_TA_MEM, 4},
>   {{AMDGPU_RAS_REG_ENTRY(GC, 0, regTCI_CE_EDC_LO_REG,
> regTCI_CE_EDC_HI_REG),
> - 31, (AMDGPU_RAS_ERR_INFO_VALID |
> AMDGPU_RAS_ERR_STATUS_VALID), "TCI"},
> + 27, (AMDGPU_RAS_ERR_INFO_VALID |
> AMDGPU_RAS_ERR_STATUS_VALID),
> +"TCI"},
>   AMDGPU_GFX_TCI_MEM, 1},
>   {{AMDGPU_RAS_REG_ENTRY(GC, 0, regTCP_CE_EDC_LO_REG,
> regTCP_CE_EDC_HI_REG),
>   10, (AMDGPU_RAS_ERR_INFO_VALID |
> AMDGPU_RAS_ERR_STATUS_VALID), "TCP"},
> - AMDGPU_GFX_TCP_MEM, 8},
> + AMDGPU_GFX_TCP_MEM, 4},
>   {{AMDGPU_RAS_REG_ENTRY(GC, 0, regTD_CE_EDC_LO,
> regTD_CE_EDC_HI),
>   10, (AMDGPU_RAS_ERR_INFO_VALID |
> AMDGPU_RAS_ERR_STATUS_VALID), "TD"},
> - AMDGPU_GFX_TD_MEM, 8},
> + AMDGPU_GFX_TD_MEM, 4},
>   {{AMDGPU_RAS_REG_ENTRY(GC, 0, regGCEA_CE_ERR_STATUS_LO,
> regGCEA_CE_ERR_STATUS_HI),
>   16, (AMDGPU_RAS_ERR_INFO_VALID |
> AMDGPU_RAS_ERR_STATUS_VALID), "GCEA"},
>   AMDGPU_GFX_GCEA_MEM, 1},
>   {{AMDGPU_RAS_REG_ENTRY(GC, 0, regLDS_CE_ERR_STATUS_LO,
> regLDS_CE_ERR_STATUS_HI),
>   10, (AMDGPU_RAS_ERR_INFO_VALID |
> AMDGPU_RAS_ERR_STATUS_VALID), "LDS"},
> - AMDGPU_GFX_LDS_MEM, 1},
> + AMDGPU_GFX_LDS_MEM, 4},
>  };
>
>  static const struct amdgpu_gfx_ras_reg_entry gfx_v9_4_3_ue_reg_list[] = {
> @@ -3713,19 +3713,19 @@ static const struct amdgpu_gfx_ras_reg_entry gfx_v9_4_3_ue_reg_list[] = {
>   AMDGPU_GFX_GC_CANE_MEM, 1},
>   {{AMDGPU_RAS_REG_ENTRY(GC, 0, regSPI_UE_ERR_STATUS_LO,
> regSPI_UE_ERR_STATUS_HI),
>   1, (AMDGPU_RAS_ERR_INFO_VA

RE: [PATCH 3/3] drm/amdgpu: Add umc v12_0 ras functions

2023-09-04 Thread Zhou1, Tao
[AMD Official Use Only - General]

The series is:

Reviewed-by: Tao Zhou 

> -----Original Message-----
> From: Li, Candice 
> Sent: Monday, September 4, 2023 3:20 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Li, Candice ; Zhou1, Tao 
> Subject: [PATCH 3/3] drm/amdgpu: Add umc v12_0 ras functions
>
> Add umc v12_0 ras error querying.
>
> Signed-off-by: Candice Li 
> Reviewed-by: Tao Zhou 
> ---
>  drivers/gpu/drm/amd/amdgpu/Makefile    |   2 +-
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  |  16 +-
>  drivers/gpu/drm/amd/amdgpu/umc_v12_0.c | 256 +
>  drivers/gpu/drm/amd/amdgpu/umc_v12_0.h |  56 ++
>  4 files changed, 327 insertions(+), 3 deletions(-)
>  create mode 100644 drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
>  create mode 100644 drivers/gpu/drm/amd/amdgpu/umc_v12_0.h
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile
> b/drivers/gpu/drm/amd/amdgpu/Makefile
> index ce0188b329cdeb..adf5470aa81020 100644
> --- a/drivers/gpu/drm/amd/amdgpu/Makefile
> +++ b/drivers/gpu/drm/amd/amdgpu/Makefile
> @@ -121,7 +121,7 @@ amdgpu-y += \
>
>  # add UMC block
>  amdgpu-y += \
> - umc_v6_0.o umc_v6_1.o umc_v6_7.o umc_v8_7.o umc_v8_10.o
> + umc_v6_0.o umc_v6_1.o umc_v6_7.o umc_v8_7.o umc_v8_10.o
> umc_v12_0.o
>
>  # add IH block
>  amdgpu-y += \
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 8447fcada8bb92..41e1759b5f1eaa 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -56,6 +56,7 @@
>  #include "umc_v6_1.h"
>  #include "umc_v6_0.h"
>  #include "umc_v6_7.h"
> +#include "umc_v12_0.h"
>  #include "hdp_v4_0.h"
>  #include "mca_v3_0.h"
>
> @@ -737,7 +738,8 @@ static void gmc_v9_0_set_irq_funcs(struct
> amdgpu_device *adev)
>   adev->gmc.vm_fault.funcs = &gmc_v9_0_irq_funcs;
>
>   if (!amdgpu_sriov_vf(adev) &&
> - !adev->gmc.xgmi.connected_to_cpu) {
> + !adev->gmc.xgmi.connected_to_cpu &&
> + !adev->gmc.is_app_apu) {
>   adev->gmc.ecc_irq.num_types = 1;
>   adev->gmc.ecc_irq.funcs = &gmc_v9_0_ecc_funcs;
>   }
> @@ -1487,6 +1489,15 @@ static void gmc_v9_0_set_umc_funcs(struct
> amdgpu_device *adev)
>   else
>   adev->umc.channel_idx_tbl =
> &umc_v6_7_channel_idx_tbl_second[0][0];
>   break;
> + case IP_VERSION(12, 0, 0):
> + adev->umc.max_ras_err_cnt_per_query =
> UMC_V12_0_TOTAL_CHANNEL_NUM(adev);
> + adev->umc.channel_inst_num =
> UMC_V12_0_CHANNEL_INSTANCE_NUM;
> + adev->umc.umc_inst_num =
> UMC_V12_0_UMC_INSTANCE_NUM;
> + adev->umc.node_inst_num /=
> UMC_V12_0_UMC_INSTANCE_NUM;
> + adev->umc.channel_offs =
> UMC_V12_0_PER_CHANNEL_OFFSET;
> + adev->umc.active_mask = adev->aid_mask;
> + if (!adev->gmc.xgmi.connected_to_cpu && !adev-
> >gmc.is_app_apu)
> + adev->umc.ras = &umc_v12_0_ras;
>   default:
>   break;
>   }
> @@ -2131,7 +2142,8 @@ static int gmc_v9_0_sw_init(void *handle)
>   return r;
>
>   if (!amdgpu_sriov_vf(adev) &&
> - !adev->gmc.xgmi.connected_to_cpu) {
> + !adev->gmc.xgmi.connected_to_cpu &&
> + !adev->gmc.is_app_apu) {
>   /* interrupt sent to DF. */
>   r = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_DF, 0,
> &adev->gmc.ecc_irq);
> diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> new file mode 100644
> index 00..b3d6db14b351f1
> --- /dev/null
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> @@ -0,0 +1,256 @@
> +/*
> + * Copyright 2023 Advanced Micro Devices, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person
> +obtaining a
> + * copy of this software and associated documentation files (the
> +"Software"),
> + * to deal in the Software without restriction, including without
> +limitation
> + * the rights to use, copy, modify, merge, publish, distribute,
> +sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom
> +the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be
> +included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "A

RE: [PATCH] drm/amdgpu: Allow issue disable gfx ras cmd to firmware

2023-08-23 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -----Original Message-----
> From: amd-gfx  On Behalf Of Hawking
> Zhang
> Sent: Thursday, August 24, 2023 9:49 AM
> To: amd-gfx@lists.freedesktop.org; Yang, Stanley ;
> Zhou1, Tao 
> Cc: Zhang, Hawking 
> Subject: [PATCH] drm/amdgpu: Allow issue disable gfx ras cmd to firmware
>
> Disable gfx ras command is needed in some use cases like live migration.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 378478cf9c21..7db6baa16236 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -769,9 +769,10 @@ int amdgpu_ras_feature_enable(struct amdgpu_device
> *adev,
>   if (!con)
>   return -EINVAL;
>
> - /* Do not enable ras feature if it is not allowed */
> - if (enable &&
> - head->block != AMDGPU_RAS_BLOCK__GFX &&
> + /* For non-gfx ip, do not enable ras feature if it is not allowed.
> +  * For gfx ip, regardless of feature support status,
> +  * force issue enable or disable ras feature commands */
> + if (head->block != AMDGPU_RAS_BLOCK__GFX &&
>   !amdgpu_ras_is_feature_allowed(adev, head))
>   goto out;
>
> --
> 2.17.1



RE: [PATCH] drm/amdgpu: Remove unnecessary ras cap check

2023-08-09 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -----Original Message-----
> From: Hawking Zhang 
> Sent: Wednesday, August 9, 2023 7:22 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao 
> Cc: Zhang, Hawking 
> Subject: [PATCH] drm/amdgpu: Remove unnecessary ras cap check
>
> RAS global isr will only be invoked by hardware interrupt. Don't need to 
> query ras
> capability in isr In addition, amdgpu_ras_interrupt_fatal_error_handler
> ensures the isr won't be called from guest linux side by accident. The RAS cap
> check in isr that introduced to fix sriov crash is not needed any more
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 
>  1 file changed, 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 00658c2816dc..c58b31121fd7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2970,10 +2970,6 @@ int amdgpu_ras_fini(struct amdgpu_device *adev)
>
>  void amdgpu_ras_global_ras_isr(struct amdgpu_device *adev)
>  {
> - amdgpu_ras_check_supported(adev);
> - if (!adev->ras_hw_enabled)
> - return;
> -
>   if (atomic_cmpxchg(&amdgpu_ras_in_intr, 0, 1) == 0) {
>   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
>
> --
> 2.17.1



RE: [PATCH Review 1/1] drm/amdgpu: Check APU flag to disable RAS

2023-07-23 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -----Original Message-----
> From: amd-gfx  On Behalf Of
> Stanley.Yang
> Sent: Friday, July 21, 2023 9:18 PM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking ;
> Zhou1, Tao ; Chai, Thomas ; Li,
> Candice 
> Cc: Yang, Stanley 
> Subject: [PATCH Review 1/1] drm/amdgpu: Check APU flag to disable RAS
>
> Only disable RAS by default for aqua vanjaram on APU platform.
>
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 2221460e23e4..00a3863a6017 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2529,7 +2529,8 @@ static void amdgpu_ras_check_supported(struct
> amdgpu_device *adev)
>* Disable ras feature for aqua vanjaram
>* by default on apu platform.
>*/
> - if (adev->ip_versions[MP0_HWIP][0] == IP_VERSION(13, 0, 6))
> + if (adev->ip_versions[MP0_HWIP][0] == IP_VERSION(13, 0, 6) &&
> + adev->gmc.is_app_apu)
>   adev->ras_enabled = amdgpu_ras_enable != 1 ? 0 :
>   adev->ras_hw_enabled & amdgpu_ras_mask;
>   else
> --
> 2.25.1



RE: [PATCH 2/2] drm/amdgpu: not update the same version ras ta

2023-07-20 Thread Zhou1, Tao
[AMD Official Use Only - General]

> -----Original Message-----
> From: Chai, Thomas 
> Sent: Wednesday, July 19, 2023 8:40 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chai, Thomas ; Zhang, Hawking
> ; Zhou1, Tao ; Li, Candice
> ; Yang, Stanley ; Chai, Thomas
> 
> Subject: [PATCH 2/2] drm/amdgpu: not update the same version ras ta
>
> not update the same version ras ta.

[Tao] don't update ras ta with same version

>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c | 20 +++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c
> index 049d34fd5ba0..c27574239fde 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c
> @@ -120,6 +120,7 @@ static const struct file_operations
> ta_invoke_debugfs_fops = {
>   *   Transmit buffer:
>   *- TA type (4bytes)
>   *- TA bin length (4bytes)
> + *- TA bin version (4bytes)

[Tao] the patch is fine with me, but since the bin structure is updated, do we
need to consider backward compatibility?

>   *- TA bin
>   *   Receive buffer:
>   *- TA ID (4bytes)
> @@ -148,6 +149,7 @@ static ssize_t ta_if_load_debugfs_write(struct file *fp,
> const char *buf, size_t
>   uint8_t  *ta_bin= NULL;
>   uint32_t copy_pos   = 0;
>   int  ret= 0;
> + uint32_t ta_version = 0;
>
>   struct amdgpu_device *adev= (struct amdgpu_device *)file_inode(fp)-
> >i_private;
>   struct psp_context   *psp = &adev->psp;
> @@ -168,6 +170,12 @@ static ssize_t ta_if_load_debugfs_write(struct file *fp,
> const char *buf, size_t
>
>   copy_pos += sizeof(uint32_t);
>
> + ret = copy_from_user((void *)&ta_version, &buf[copy_pos],
> sizeof(uint32_t));
> + if (ret)
> + return -EFAULT;
> +
> + copy_pos += sizeof(uint32_t);
> +
>   ta_bin = kzalloc(ta_bin_len, GFP_KERNEL);
>   if (!ta_bin)
>   return -ENOMEM;
> @@ -185,6 +193,16 @@ static ssize_t ta_if_load_debugfs_write(struct file *fp,
> const char *buf, size_t
>   goto err_free_bin;
>   }
>
> + if (ta_version == context->bin_desc.fw_version) {
> + dev_info(adev->dev,
> +"new ta is same as running ta, running ta will not be
> updated!\n");
> + if (copy_to_user((char *)buf, (void *)&context->session_id,
> sizeof(uint32_t)))
> + ret = -EFAULT;
> + else
> + ret = len;
> + goto err_free_bin;
> + }
> +
>   /*
>* Allocate TA shared buf in case shared buf was freed
>* due to loading TA failed before.
> @@ -209,7 +227,7 @@ static ssize_t ta_if_load_debugfs_write(struct file *fp,
> const char *buf, size_t
>
>   /* Prepare TA context for TA initialization */
>   context->ta_type = ta_type;
> - context->bin_desc.fw_version = get_bin_version(ta_bin);
> + context->bin_desc.fw_version = ta_version;
>   context->bin_desc.size_bytes = ta_bin_len;
>   context->bin_desc.start_addr = ta_bin;
>
> --
> 2.34.1
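
On the backward-compatibility question raised above: one possible approach (not something the patch does, and not confirmed anywhere in this thread) is to infer the layout from the total write length, since the legacy header is 8 bytes (type, length) and the new one is 12 bytes (type, length, version). The sketch below is purely illustrative; `demo_layout` and `demo_detect_layout` are hypothetical names, and it assumes the length field counts only the TA binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum demo_layout {
	DEMO_LAYOUT_INVALID,
	DEMO_LAYOUT_LEGACY,	/* type, bin length, bin */
	DEMO_LAYOUT_VERSIONED	/* type, bin length, bin version, bin */
};

#define DEMO_LEGACY_HDR    (2 * sizeof(uint32_t))
#define DEMO_VERSIONED_HDR (3 * sizeof(uint32_t))

/* Classify a transmit buffer by comparing its size with the length field. */
static enum demo_layout demo_detect_layout(const uint8_t *buf, size_t len)
{
	uint32_t bin_len;

	assert(buf != NULL);

	if (len < DEMO_LEGACY_HDR)
		return DEMO_LAYOUT_INVALID;

	/* The length field is the second 32-bit word in both layouts. */
	memcpy(&bin_len, buf + sizeof(uint32_t), sizeof(bin_len));

	if (len - DEMO_LEGACY_HDR == bin_len)
		return DEMO_LAYOUT_LEGACY;

	if (len >= DEMO_VERSIONED_HDR && len - DEMO_VERSIONED_HDR == bin_len)
		return DEMO_LAYOUT_VERSIONED;

	return DEMO_LAYOUT_INVALID;
}
```

The two layouts cannot both match for the same buffer, because they would require the same length field to equal two values that differ by four.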



RE: [PATCH 1/2] drm/amdgpu: add ta initialization failure check condition

2023-07-20 Thread Zhou1, Tao
[AMD Official Use Only - General]

> -----Original Message-----
> From: Chai, Thomas 
> Sent: Wednesday, July 19, 2023 8:40 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chai, Thomas ; Zhang, Hawking
> ; Zhou1, Tao ; Li, Candice
> ; Yang, Stanley ; Chai, Thomas
> 
> Subject: [PATCH 1/2] drm/amdgpu: add ta initialization failure check condition
>
> Add ta initialization failure check condition.

[Tao] better to say "Add condition check for ta initialization failure"

>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c
> index 468a67b302d4..049d34fd5ba0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c
> @@ -220,7 +220,7 @@ static ssize_t ta_if_load_debugfs_write(struct file *fp,
> const char *buf, size_t
>   }
>
>   ret = psp_fn_ta_initialize(psp);
> - if (ret || context->resp_status) {
> + if (ret || context->resp_status || !context->initialized) {
>   dev_err(adev->dev, "Failed to load TA via debugfs (%d) and 
> status
> (0x%X)\n",
>   ret, context->resp_status);
>   if (!ret)
> --
> 2.34.1



RE: [PATCH Review V3 2/2] drm/amdgpu: Disable RAS by default on APU flatform

2023-07-13 Thread Zhou1, Tao
[AMD Official Use Only - General]

The series is:

Reviewed-by: Tao Zhou 

> -----Original Message-----
> From: Stanley.Yang 
> Sent: Friday, July 14, 2023 11:42 AM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking ;
> Zhou1, Tao ; Chai, Thomas ; Li,
> Candice 
> Cc: Yang, Stanley 
> Subject: [PATCH Review V3 2/2] drm/amdgpu: Disable RAS by default on APU
> flatform
>
> Disable RAS feature by default for aqua vanjaram on APU platform.
>
> Changed from V1:
>   Splite Disable RAS by default on APU platform into a
>   separated patch.
>
> Changed from V2:
>   Avoid to modify global variable amdgpu_ras_enable.
>
> Signed-off-by: Stanley.Yang 
> Reviewed-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 13 +++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 8673d9790bb0..c46e0ed9165e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2524,8 +2524,17 @@ static void amdgpu_ras_check_supported(struct
> amdgpu_device *adev)
>   /* hw_supported needs to be aligned with RAS block mask. */
>   adev->ras_hw_enabled &= AMDGPU_RAS_BLOCK_MASK;
>
> - adev->ras_enabled = amdgpu_ras_enable == 0 ? 0 :
> - adev->ras_hw_enabled & amdgpu_ras_mask;
> +
> + /*
> +  * Disable ras feature for aqua vanjaram
> +  * by default on apu platform.
> +  */
> + if (adev->ip_versions[MP0_HWIP][0] == IP_VERSION(13, 0, 6))
> + adev->ras_enabled = amdgpu_ras_enable != 1 ? 0 :
> + adev->ras_hw_enabled & amdgpu_ras_mask;
> + else
> + adev->ras_enabled = amdgpu_ras_enable == 0 ? 0 :
> + adev->ras_hw_enabled & amdgpu_ras_mask;
>  }
>
>  static void amdgpu_ras_counte_dw(struct work_struct *work)
> --
> 2.25.1
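
The difference between the two ternaries in the hunk above is easiest to see as a small truth table. The sketch below assumes the module parameter takes -1 (auto, the default), 0 (disable) or 1 (enable); that assumption is not stated in this thread. `demo_ras_enabled` is a hypothetical helper, and `default_off` stands in for the IP-version check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * default_off == true : RAS stays off unless the parameter is exactly 1.
 * default_off == false: RAS stays on unless the parameter is exactly 0.
 * In both cases an enabled result is the hardware mask ANDed with the
 * user mask.
 */
static uint32_t demo_ras_enabled(int ras_enable_param, bool default_off,
				 uint32_t hw_enabled, uint32_t user_mask)
{
	if (default_off)
		return ras_enable_param != 1 ? 0 : (hw_enabled & user_mask);

	return ras_enable_param == 0 ? 0 : (hw_enabled & user_mask);
}
```

Only the auto value (-1) behaves differently between the two branches: it resolves to disabled on the default-off path and to enabled otherwise. Explicit 0 and 1 give the same result on both paths, and the global parameter itself is never modified.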



RE: [PATCH 3/3] drm/amdgpu: Issue ras enable_feature for gfx ip only

2023-07-03 Thread Zhou1, Tao
[AMD Official Use Only - General]

The series is:

Reviewed-by: Tao Zhou 

> -----Original Message-----
> From: Zhang, Hawking 
> Sent: Monday, July 3, 2023 4:56 PM
> To: amd-gfx@lists.freedesktop.org; Zhou1, Tao ; Yang,
> Stanley ; Chai, Thomas ; Li,
> Candice 
> Cc: Zhang, Hawking 
> Subject: [PATCH 3/3] drm/amdgpu: Issue ras enable_feature for gfx ip only
>
> For non-GFX IP blocks, set up the ras obj if the ras feature is allowed.
> For GFX IP blocks, force issue the ras enable_feature command to firmware
> and only set up the ras obj if the ras feature is allowed.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 30 +
>  1 file changed, 10 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 8524365761b6..2e9154bbec64 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -761,16 +761,6 @@ static int __amdgpu_ras_feature_enable(struct
> amdgpu_device *adev,
>   return 0;
>  }
>
> -static int amdgpu_ras_check_feature_allowed(struct amdgpu_device *adev,
> - struct ras_common_if *head)
> -{
> - if (amdgpu_ras_is_feature_allowed(adev, head) ||
> - amdgpu_ras_is_poison_mode_supported(adev))
> - return 1;
> - else
> - return 0;
> -}
> -
>  /* wrapper of psp_ras_enable_features */
>  int amdgpu_ras_feature_enable(struct amdgpu_device *adev,
>   struct ras_common_if *head, bool enable)
> @@ -782,7 +772,16 @@ int amdgpu_ras_feature_enable(struct amdgpu_device *adev,
>   if (!con)
>   return -EINVAL;
>
> - if (head->block == AMDGPU_RAS_BLOCK__GFX) {
> + /* Do not enable ras feature if it is not allowed */
> + if (enable &&
> + head->block != AMDGPU_RAS_BLOCK__GFX &&
> + !amdgpu_ras_is_feature_allowed(adev, head))
> + goto out;
> +
> + /* Only enable gfx ras feature from host side */
> + if (head->block == AMDGPU_RAS_BLOCK__GFX &&
> + !amdgpu_sriov_vf(adev) &&
> + !amdgpu_ras_intr_triggered()) {
>   info = kzalloc(sizeof(union ta_ras_cmd_input), GFP_KERNEL);
>   if (!info)
>   return -ENOMEM;
> @@ -798,16 +797,7 @@ int amdgpu_ras_feature_enable(struct amdgpu_device
> *adev,
>   .error_type = amdgpu_ras_error_to_ta(head-
> >type),
>   };
>   }
> - }
>
> - /* Do not enable if it is not allowed. */
> - if (enable && !amdgpu_ras_check_feature_allowed(adev, head))
> - goto out;
> -
> - /* Only enable ras feature operation handle on host side */
> - if (head->block == AMDGPU_RAS_BLOCK__GFX &&
> - !amdgpu_sriov_vf(adev) &&
> - !amdgpu_ras_intr_triggered()) {
>   ret = psp_ras_enable_features(&adev->psp, info, enable);
>   if (ret) {
>   dev_err(adev->dev, "ras %s %s failed poison:%d
> ret:%d\n",
> --
> 2.17.1
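
The two checks visible in the hunks above can be written as two independent predicates. This is a rough sketch under stated assumptions: struct toy_ras_request and both helper names are hypothetical, and the boolean fields stand in for amdgpu_ras_is_feature_allowed(), amdgpu_sriov_vf() and amdgpu_ras_intr_triggered().

```c
#include <stdbool.h>

/* Hypothetical block identifiers; only GFX vs. non-GFX matters here. */
enum toy_ras_block { TOY_RAS_BLOCK_UMC, TOY_RAS_BLOCK_GFX };

/* Hypothetical snapshot of the inputs the patch tests. */
struct toy_ras_request {
	enum toy_ras_block block;
	bool enable;
	bool feature_allowed;
	bool is_sriov_vf;
	bool intr_triggered;
};

/* Non-GFX blocks bail out early when enabling a feature that is not allowed. */
static bool toy_ras_skip_request(const struct toy_ras_request *req)
{
	return req->enable &&
	       req->block != TOY_RAS_BLOCK_GFX &&
	       !req->feature_allowed;
}

/* The firmware command is issued for GFX only, from the host side. */
static bool toy_ras_issue_fw_command(const struct toy_ras_request *req)
{
	return req->block == TOY_RAS_BLOCK_GFX &&
	       !req->is_sriov_vf &&
	       !req->intr_triggered;
}
```

Note that in this model a GFX request is never skipped by the first predicate, which matches the "force issue" wording of the commit message.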



RE: [PATCH Review V2 1/1] drm/amdgpu: Remove redundant poison consumption handler function

2023-06-20 Thread Zhou1, Tao
[AMD Official Use Only - General]

> -Original Message-
> From: Stanley.Yang 
> Sent: Monday, June 19, 2023 9:50 PM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking ;
> Zhou1, Tao ; Chai, Thomas 
> Cc: Yang, Stanley 
> Subject: [PATCH Review V2 1/1] drm/amdgpu: Remove redundant poison
> consumption handler function
>
> The callback handle_poison_consumption and the callback
> poison_consumption_handler do almost the same thing when handling poison
> consumption, so remove poison_consumption_handler.
>
> Changed from V1:
>   Add handle poison consumption function for VCN2.6, VCN4.0,
>   JPEG2.6 and JPEG4.0, return false when handling VCN/JPEG poison
>   consumption.
>
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c  |  9 -
> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h  |  4 
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c  |  8 +++-
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h  |  3 ++-
> drivers/gpu/drm/amd/amdgpu/gfx_v11_0_3.c | 12 +---
>  drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c   |  9 +
>  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c   |  9 +
>  drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c|  9 +
>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c|  9 +
>  9 files changed, 50 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index a33d4bc34cee..c15dbdb2e0f9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -840,15 +840,6 @@ int amdgpu_gfx_ras_sw_init(struct amdgpu_device
> *adev)
>   return 0;
>  }
>
> -int amdgpu_gfx_poison_consumption_handler(struct amdgpu_device *adev,
> - struct amdgpu_iv_entry *entry)
> -{
> - if (adev->gfx.ras && adev->gfx.ras->poison_consumption_handler)
> - return adev->gfx.ras->poison_consumption_handler(adev,
> entry);
> -
> - return 0;
> -}
> -
>  int amdgpu_gfx_process_ras_data_cb(struct amdgpu_device *adev,
>   void *err_data,
>   struct amdgpu_iv_entry *entry)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> index d0c3f2955821..95b80bc8cdb9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
> @@ -273,8 +273,6 @@ struct amdgpu_gfx_ras {
>   int (*rlc_gc_fed_irq)(struct amdgpu_device *adev,
>   struct amdgpu_irq_src *source,
>   struct amdgpu_iv_entry *entry);
> - int (*poison_consumption_handler)(struct amdgpu_device *adev,
> - struct amdgpu_iv_entry
> *entry);
>  };
>
>  struct amdgpu_gfx_shadow_info {
> @@ -538,8 +536,6 @@ int amdgpu_gfx_get_num_kcq(struct amdgpu_device
> *adev);  void amdgpu_gfx_cp_init_microcode(struct amdgpu_device *adev,
> uint32_t ucode_id);
>
>  int amdgpu_gfx_ras_sw_init(struct amdgpu_device *adev); -int
> amdgpu_gfx_poison_consumption_handler(struct amdgpu_device *adev,
> - struct amdgpu_iv_entry
> *entry);
>
>  bool amdgpu_gfx_is_master_xcc(struct amdgpu_device *adev, int xcc_id);  int
> amdgpu_gfx_sysfs_init(struct amdgpu_device *adev); diff --git
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 5b6525d8dace..9ce7c7537751 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1668,7 +1668,7 @@ void amdgpu_ras_interrupt_fatal_error_handler(struct
> amdgpu_device *adev)  static void
> amdgpu_ras_interrupt_poison_consumption_handler(struct ras_manager *obj,
>   struct amdgpu_iv_entry *entry)
>  {
> - bool poison_stat = false;
> + bool poison_stat = true;
>   struct amdgpu_device *adev = obj->adev;
>   struct amdgpu_ras_block_object *block_obj =
>   amdgpu_ras_get_ras_block(adev, obj->head.block, 0); @@ -
> 1694,15 +1694,13 @@ static void
> amdgpu_ras_interrupt_poison_consumption_handler(struct ras_manager *
>   amdgpu_umc_poison_handler(adev, false);
>
>   if (block_obj->hw_ops && block_obj->hw_ops-
> >handle_poison_consumption)
> - poison_stat = block_obj->hw_ops-
> >handle_poison_consumption(adev);
> + poison_stat = block_obj->hw_ops-
> >handle_poison_consumption(adev,
> +entry);

[Tao] !block_obj->hw_ops->handle_poison_consumption is allowed, we can add the 
following code to avoid adding handle_poison_consumption for vcn and j

RE: [PATCH Review 2/2] drm/amdgpu: Add checking mc_vram_size

2023-06-13 Thread Zhou1, Tao
[AMD Official Use Only - General]

With my concerns fixed, the series is:

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Stanley.Yang 
> Sent: Tuesday, June 13, 2023 11:53 AM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking ;
> Zhou1, Tao 
> Cc: Yang, Stanley 
> Subject: [PATCH Review 2/2] drm/amdgpu: Add checking mc_vram_size
>
> Do not compare injection address with mc_vram_size if mc_vram_size is zero.
>
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 56bb0db207b9..3c041efcf0c4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -494,7 +494,8 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file
> *f,
>   ret = amdgpu_ras_feature_enable(adev, &data.head, 1);
>   break;
>   case 2:
> - if ((data.inject.address >= adev->gmc.mc_vram_size) ||
> + if ((data.inject.address >= adev->gmc.mc_vram_size &&
> + adev->gmc.mc_vram_size) ||
>   (data.inject.address >= RAS_UMC_INJECT_ADDR_LIMIT)) {
>   dev_warn(adev->dev, "RAS WARN: input address "
>   "0x%llx is invalid.",
> --
> 2.17.1
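
The changed condition can be isolated as a pure function. A minimal sketch, assuming hypothetical names throughout: toy_inject_addr_invalid is not a driver function, and TOY_INJECT_ADDR_LIMIT is a placeholder value standing in for RAS_UMC_INJECT_ADDR_LIMIT, whose real definition is not shown in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder for RAS_UMC_INJECT_ADDR_LIMIT; the value is an assumption. */
#define TOY_INJECT_ADDR_LIMIT (1ULL << 52)

/*
 * Returns true when the injection address should be rejected.
 * A vram_size of zero means "unknown", so the VRAM bound is skipped
 * and only the fixed upper limit applies.
 */
static bool toy_inject_addr_invalid(uint64_t address, uint64_t vram_size)
{
	return (vram_size && address >= vram_size) ||
	       address >= TOY_INJECT_ADDR_LIMIT;
}
```

Before the change, a zero mc_vram_size would have made every address compare as out of range, which is the case the patch avoids.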



RE: [PATCH Review 1/2] drm/amdgpu: Optimze checking ras supported

2023-06-13 Thread Zhou1, Tao
[AMD Official Use Only - General]

[Tao] typo in title: Optimze -> Optimize

> -Original Message-
> From: Stanley.Yang 
> Sent: Tuesday, June 13, 2023 11:53 AM
> To: amd-gfx@lists.freedesktop.org; Zhang, Hawking ;
> Zhou1, Tao 
> Cc: Yang, Stanley 
> Subject: [PATCH Review 1/2] drm/amdgpu: Optimze checking ras supported
>
> Use "is_app_apu" to identify a device in native APU mode or carveout mode.
>
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c |  2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c |  8 +++---
> drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 34 ++---
>  3 files changed, 23 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index 78bacea951a9..352e958b190a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -1653,7 +1653,7 @@ int psp_ras_initialize(struct psp_context *psp)
>
>   if (amdgpu_ras_is_poison_mode_supported(adev))
>   ras_cmd->ras_in_message.init_flags.poison_mode_en = 1;
> - if (!adev->gmc.xgmi.connected_to_cpu)
> + if (!adev->gmc.xgmi.connected_to_cpu && !adev->gmc.is_app_apu)
>   ras_cmd->ras_in_message.init_flags.dgpu_mode = 1;
>   ras_cmd->ras_in_message.init_flags.xcc_mask =
>   adev->gfx.xcc_mask;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 7a0924469e4f..56bb0db207b9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1689,8 +1689,7 @@ static void
> amdgpu_ras_interrupt_poison_consumption_handler(struct ras_manager *
>   }
>   }
>
> - if (!adev->gmc.xgmi.connected_to_cpu)
> - amdgpu_umc_poison_handler(adev, false);
> + amdgpu_umc_poison_handler(adev, false);
>
>   if (block_obj->hw_ops && block_obj->hw_ops-
> >handle_poison_consumption)
>   poison_stat = block_obj->hw_ops-
> >handle_poison_consumption(adev);
> @@ -2458,11 +2457,10 @@ static void amdgpu_ras_check_supported(struct
> amdgpu_device *adev)  {
>   adev->ras_hw_enabled = adev->ras_enabled = 0;
>
> - if (!adev->is_atom_fw ||
> - !amdgpu_ras_asic_supported(adev))
> + if (!amdgpu_ras_asic_supported(adev))
>   return;
>
> - if (!adev->gmc.xgmi.connected_to_cpu) {
> + if (!adev->gmc.xgmi.connected_to_cpu && !adev-

[Tao] the tab should be replaced with space.

> >gmc.is_app_apu) {
>   if (amdgpu_atomfirmware_mem_ecc_supported(adev)) {
>   dev_info(adev->dev, "MEM ECC is active.\n");
>   adev->ras_hw_enabled |= (1 <<
> AMDGPU_RAS_BLOCK__UMC | diff --git
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index 1edf8e6aeb16..db0d94ca4ffc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -169,27 +169,31 @@ int amdgpu_umc_poison_handler(struct
> amdgpu_device *adev, bool reset)  {
>   int ret = AMDGPU_RAS_SUCCESS;
>
> - if (!amdgpu_sriov_vf(adev)) {
> - if (!adev->gmc.xgmi.connected_to_cpu) {
> - struct ras_err_data err_data = {0, 0, 0, NULL};
> - struct ras_common_if head = {
> - .block = AMDGPU_RAS_BLOCK__UMC,
> - };
> - struct ras_manager *obj = amdgpu_ras_find_obj(adev,
> &head);
> -
> - ret = amdgpu_umc_do_page_retirement(adev,
> &err_data, NULL, reset);
> -
> - if (ret == AMDGPU_RAS_SUCCESS && obj) {
> - obj->err_data.ue_count += err_data.ue_count;
> - obj->err_data.ce_count += err_data.ce_count;
> - }
> - } else if (reset) {
> + if (adev->gmc.xgmi.connected_to_cpu ||
> + adev->gmc.is_app_apu) {
> + if (reset) {
>   /* MCA poison handler is only responsible for GPU reset,
>* let MCA notifier do page retirement.
>*/
>   kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
>   amdgpu_ras_reset_gpu(adev);
>   }
> + return ret;
> + }
> +
> + if (!amdgpu_sriov_vf(adev)) {
> + struct ras_err_data err_data = {0, 0, 0, NULL};
> +   

RE: [PATCH 2/2] drm/amdgpu: Enable gfx v11_0_3 ras if poison mode is supported

2023-06-11 Thread Zhou1, Tao
[AMD Official Use Only - General]

The series is:

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Zhang, Hawking 
> Sent: Sunday, June 11, 2023 6:46 PM
> To: amd-gfx@lists.freedesktop.org; Yang, Stanley ; Li,
> Candice ; Chai, Thomas ;
> Zhou1, Tao 
> Cc: Zhang, Hawking 
> Subject: [PATCH 2/2] drm/amdgpu: Enable gfx v11_0_3 ras if poison mode is
> supported
>
> GFX v11_0_3 ras needs to be enabled if poison mode is supported. The driver
> doesn't need to issue a feature enable call in the gfx_v11_0 late init
> phase. The ras late init call is already centralized in
> amdgpu_ras_late_init.
> In addition, move the poison_mode check out of common helpers like
> amdgpu_ras_is_supported and amdgpu_ras_is_feature_allowed to ensure only
> GFX RAS is enabled when poison mode is supported.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 49 -
> drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c  | 26 -
>  2 files changed, 16 insertions(+), 59 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index dd7cdc234d7e..35e70860d628 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -126,6 +126,7 @@ static bool amdgpu_ras_check_bad_page_unlock(struct
> amdgpu_ras *con,
>   uint64_t addr);
>  static bool amdgpu_ras_check_bad_page(struct amdgpu_device *adev,
>   uint64_t addr);
> +static void amdgpu_ras_query_poison_mode(struct amdgpu_device *adev);
>  #ifdef CONFIG_X86_MCE_AMD
>  static void amdgpu_register_bad_pages_mca_notifier(struct amdgpu_device
> *adev);  struct mce_notifier_adev_list { @@ -757,16 +758,6 @@ static int
> __amdgpu_ras_feature_enable(struct amdgpu_device *adev,
>   return 0;
>  }
>
> -static int amdgpu_ras_check_feature_allowed(struct amdgpu_device *adev,
> - struct ras_common_if *head)
> -{
> - if (amdgpu_ras_is_feature_allowed(adev, head) ||
> - amdgpu_ras_is_poison_mode_supported(adev))
> - return 1;
> - else
> - return 0;
> -}
> -
>  /* wrapper of psp_ras_enable_features */
>  int amdgpu_ras_feature_enable(struct amdgpu_device *adev,
>   struct ras_common_if *head, bool enable)
> @@ -797,7 +788,7 @@ int amdgpu_ras_feature_enable(struct amdgpu_device *adev,
>   }
>
>   /* Do not enable if it is not allowed. */
> - if (enable && !amdgpu_ras_check_feature_allowed(adev, head))
> + if (enable && !amdgpu_ras_is_feature_allowed(adev, head))
>   goto out;
>
>   /* Only enable ras feature operation handle on host side */ @@ -2420,9
> +2411,9 @@ static bool amdgpu_ras_asic_supported(struct amdgpu_device
> *adev)  }
>
>  /*
> - * this is workaround for vega20 workstation sku,
> - * force enable gfx ras, ignore vbios gfx ras flag
> - * due to GC EDC can not write
> + * Common helpers for device or IP specific RAS quirks including
> + * a). Enable gfx ras on D16406 or D36002 board
> + * b). Enable gfx ras in gfx_v11_0_3 if poison mode is supported
>   */
>  static void amdgpu_ras_get_quirks(struct amdgpu_device *adev)  { @@ -
> 2431,10 +2422,16 @@ static void amdgpu_ras_get_quirks(struct amdgpu_device
> *adev)
>   if (!ctx)
>   return;
>
> + /* Enable gfx ras on specific board */
>   if (strnstr(ctx->vbios_version, "D16406",
>   sizeof(ctx->vbios_version)) ||
> - strnstr(ctx->vbios_version, "D36002",
> - sizeof(ctx->vbios_version)))
> + strnstr(ctx->vbios_version, "D36002",
> + sizeof(ctx->vbios_version)))
> + adev->ras_hw_enabled |= (1 << AMDGPU_RAS_BLOCK__GFX);
> +
> + /* Enable gfx ras on gfx_v11_0_3 if poison mode is supported */
> + if (adev->ip_versions[GC_HWIP][0] == IP_VERSION(11, 0, 3) &&
> + amdgpu_ras_is_poison_mode_supported(adev))
>   adev->ras_hw_enabled |= (1 << AMDGPU_RAS_BLOCK__GFX);  }
>
> @@ -2502,6 +2499,8 @@ static void amdgpu_ras_check_supported(struct
> amdgpu_device *adev)
>  1 <<
> AMDGPU_RAS_BLOCK__MMHUB);
>   }
>
> + amdgpu_ras_query_poison_mode(adev);
> +
>   amdgpu_ras_get_quirks(adev);
>
>   /* hw_supported needs to be aligned with RAS block mask. */ @@ -
> 2659,8 +2658,6 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
>   goto release_con;
>   }
>
> - amdgpu_ras_query_poison_mode(adev);
> 

RE: [PATCH 1/2] drm/amdgpu: Only create err_count sysfs when hw_op is supported

2023-06-11 Thread Zhou1, Tao
[AMD Official Use Only - General]

> -Original Message-
> From: Zhang, Hawking 
> Sent: Sunday, June 11, 2023 6:46 PM
> To: amd-gfx@lists.freedesktop.org; Yang, Stanley ; Li,
> Candice ; Chai, Thomas ;
> Zhou1, Tao 
> Cc: Zhang, Hawking 
> Subject: [PATCH 1/2] drm/amdgpu: Only create err_count sysfs when hw_op is
> supported
>
> Some IP blocks only support partial ras features and don't have a ras
> counter and/or ras error status register at all.
> The driver should not create the err_count sysfs node for those IP blocks.
>
> Signed-off-by: Hawking Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 31 ++---
>  1 file changed, 18 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index a6c3265cdbc4..dd7cdc234d7e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2757,23 +2757,28 @@ int amdgpu_ras_block_late_init(struct
> amdgpu_device *adev,
>   goto cleanup;
>   }
>
> - r = amdgpu_ras_sysfs_create(adev, ras_block);
> - if (r)
> - goto interrupt;
> + if (ras_obj->hw_ops &&
> + (ras_obj->hw_ops->query_ras_error_count ||
> +  ras_obj->hw_ops->query_ras_error_status)) {

[Tao] the condition can be also changed like this:

    if (!(ras_obj->hw_ops &&
          (ras_obj->hw_ops->query_ras_error_count ||
           ras_obj->hw_ops->query_ras_error_status)))
        return 0;

Either way is fine with me.


> + r = amdgpu_ras_sysfs_create(adev, ras_block);
> + if (r)
> + goto interrupt;
>
> - /* Those are the cached values at init.
> -  */
> - query_info = kzalloc(sizeof(struct ras_query_if), GFP_KERNEL);
> - if (!query_info)
> - return -ENOMEM;
> - memcpy(&query_info->head, ras_block, sizeof(struct ras_common_if));
> + /* Those are the cached values at init.
> +  */
> + query_info = kzalloc(sizeof(struct ras_query_if), GFP_KERNEL);
> + if (!query_info)
> + return -ENOMEM;
> + memcpy(&query_info->head, ras_block, sizeof(struct
> ras_common_if));
>
> - if (amdgpu_ras_query_error_count(adev, &ce_count, &ue_count,
> query_info) == 0) {
> - atomic_set(&con->ras_ce_count, ce_count);
> - atomic_set(&con->ras_ue_count, ue_count);
> + if (amdgpu_ras_query_error_count(adev, &ce_count, &ue_count,
> query_info) == 0) {
> + atomic_set(&con->ras_ce_count, ce_count);
> + atomic_set(&con->ras_ue_count, ue_count);
> + }
> +
> + kfree(query_info);
>   }
>
> - kfree(query_info);
>   return 0;
>
>  interrupt:
> --
> 2.17.1
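
The review comment in this message says the nested condition and the inverted early return are interchangeable. A small sketch of why, using hypothetical toy_* types in place of the driver's ras block object, and a counter in place of the sysfs creation and error count query:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the hw_ops callbacks named in the patch. */
struct toy_hw_ops {
	void (*query_ras_error_count)(void);
	void (*query_ras_error_status)(void);
};

struct toy_ras_obj {
	const struct toy_hw_ops *hw_ops;
};

/* Dummy callback so a test object can advertise a query op. */
static void toy_query(void)
{
}

static bool toy_has_query_op(const struct toy_ras_obj *obj)
{
	return obj->hw_ops &&
	       (obj->hw_ops->query_ras_error_count ||
		obj->hw_ops->query_ras_error_status);
}

/* Form used in the patch: the work is nested inside the condition. */
static int toy_late_init_nested(const struct toy_ras_obj *obj, int *created)
{
	if (toy_has_query_op(obj)) {
		/* stands in for sysfs creation and the cached count query */
		(*created)++;
	}
	return 0;
}

/* Form suggested in review: invert the condition and return early. */
static int toy_late_init_early_return(const struct toy_ras_obj *obj,
				      int *created)
{
	if (!toy_has_query_op(obj))
		return 0;

	(*created)++;
	return 0;
}
```

Both forms do the same work for the same inputs; the early return only saves one level of indentation in the body.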



RE: [PATCH v3 6/6] drm/amdgpu: add RAS POISON interrupt funcs for jpeg_v4_0

2023-05-15 Thread Zhou1, Tao
[AMD Official Use Only - General]

The series is:

Reviewed-by: Tao Zhou 

> -Original Message-
> From: amd-gfx  On Behalf Of Horatio
> Zhang
> Sent: Tuesday, May 16, 2023 1:04 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Liu, HaoPing (Alan) ; Zhou, Bob
> ; Zhang, Horatio ; Xu, Feifei
> ; Zhou1, Tao ; Jiang, Sonny
> ; Limonciello, Mario ;
> Liu, Leo ; Zhang, Hawking 
> Subject: [PATCH v3 6/6] drm/amdgpu: add RAS POISON interrupt funcs for
> jpeg_v4_0
> 
> Add ras_poison_irq and functions. And fix the amdgpu_irq_put call trace in
> jpeg_v4_0_hw_fini.
> 
> [   50.497562] RIP: 0010:amdgpu_irq_put+0xa4/0xc0 [amdgpu]
> [   50.497619] RSP: 0018:aa2400fcfcb0 EFLAGS: 00010246
> [   50.497620] RAX:  RBX: 0001 RCX:
> 
> [   50.497621] RDX:  RSI:  RDI:
> 
> [   50.497621] RBP: aa2400fcfcd0 R08:  R09:
> 
> [   50.497622] R10:  R11:  R12:
> 99b2105242d8
> [   50.497622] R13:  R14: 99b21050 R15:
> 99b21050
> [   50.497623] FS:  () GS:99b51848()
> knlGS:
> [   50.497623] CS:  0010 DS:  ES:  CR0: 80050033
> [   50.497624] CR2: 7f9d32aa91e8 CR3: 0001ba21 CR4:
> 00750ee0
> [   50.497624] PKRU: 5554
> [   50.497625] Call Trace:
> [   50.497625]  
> [   50.497627]  jpeg_v4_0_hw_fini+0x43/0xc0 [amdgpu]
> [   50.497693]  jpeg_v4_0_suspend+0x13/0x30 [amdgpu]
> [   50.497751]  amdgpu_device_ip_suspend_phase2+0x240/0x470 [amdgpu]
> [   50.497802]  amdgpu_device_ip_suspend+0x41/0x80 [amdgpu]
> [   50.497854]  amdgpu_device_pre_asic_reset+0xd9/0x4a0 [amdgpu]
> [   50.497905]  amdgpu_device_gpu_recover.cold+0x548/0xcf1 [amdgpu]
> [   50.498005]  amdgpu_debugfs_reset_work+0x4c/0x80 [amdgpu]
> [   50.498060]  process_one_work+0x21f/0x400
> [   50.498063]  worker_thread+0x200/0x3f0
> [   50.498064]  ? process_one_work+0x400/0x400
> [   50.498065]  kthread+0xee/0x120
> [   50.498067]  ? kthread_complete_and_exit+0x20/0x20
> [   50.498068]  ret_from_fork+0x22/0x30
> 
> Suggested-by: Hawking Zhang 
> Signed-off-by: Horatio Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c | 27 +++---
>  1 file changed, 20 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
> b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
> index 495facb885f4..8690467b3285 100644
> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
> @@ -87,13 +87,13 @@ static int jpeg_v4_0_sw_init(void *handle)
> 
>   /* JPEG DJPEG POISON EVENT */
>   r = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_VCN,
> - VCN_4_0__SRCID_DJPEG0_POISON, &adev->jpeg.inst-
> >irq);
> + VCN_4_0__SRCID_DJPEG0_POISON, &adev->jpeg.inst-
> >ras_poison_irq);
>   if (r)
>   return r;
> 
>   /* JPEG EJPEG POISON EVENT */
>   r = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_VCN,
> - VCN_4_0__SRCID_EJPEG0_POISON, &adev->jpeg.inst-
> >irq);
> + VCN_4_0__SRCID_EJPEG0_POISON, &adev->jpeg.inst-
> >ras_poison_irq);
>   if (r)
>   return r;
> 
> @@ -202,7 +202,8 @@ static int jpeg_v4_0_hw_fini(void *handle)
>   RREG32_SOC15(JPEG, 0, regUVD_JRBC_STATUS))
>   jpeg_v4_0_set_powergating_state(adev,
> AMD_PG_STATE_GATE);
>   }
> - amdgpu_irq_put(adev, &adev->jpeg.inst->irq, 0);
> + if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__JPEG))
> + amdgpu_irq_put(adev, &adev->jpeg.inst->ras_poison_irq, 0);
> 
>   return 0;
>  }
> @@ -670,6 +671,14 @@ static int jpeg_v4_0_set_interrupt_state(struct
> amdgpu_device *adev,
>   return 0;
>  }
> 
> +static int jpeg_v4_0_set_ras_interrupt_state(struct amdgpu_device *adev,
> + struct amdgpu_irq_src *source,
> + unsigned int type,
> + enum amdgpu_interrupt_state state)
> +{
> + return 0;
> +}
> +
>  static int jpeg_v4_0_process_interrupt(struct amdgpu_device *adev,
> struct amdgpu_irq_src *source,
> struct amdgpu_iv_entry *entry) @@ -680,10
> +689,6 @@ static int jpeg_v4_0_process_interrupt(struct amdgpu_device *adev,
>   case VCN_4_0__SRCID__JPEG_DECODE:
>   amdgpu_fence_process(adev->jpeg.inst->ring_dec);
>   br

RE: [PATCH v2 1/2] drm/amdgpu: separate ras irq from vcn instance irq for UVD_POISON

2023-05-14 Thread Zhou1, Tao
[AMD Official Use Only - General]

The code is fine with me, but it would be better to split the patch into three
parts: one for the common vcn code, one for vcn 2.6 and one for vcn 4.0.

Regards,
Tao

> -Original Message-
> From: Horatio Zhang 
> Sent: Monday, May 15, 2023 10:28 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Xu, Feifei ; Liu, Leo
> ; Jiang, Sonny ; Limonciello, Mario
> ; Liu, HaoPing (Alan) ;
> Zhou, Bob ; Zhang, Horatio ;
> Zhang, Hawking 
> Subject: [PATCH v2 1/2] drm/amdgpu: separate ras irq from vcn instance irq for
> UVD_POISON
> 
> Separate RAS poison consumption handling from the instance irq, and register
> dedicated ras_poison_irq src and funcs for UVD_POISON. Fix the amdgpu_irq_put
> call trace in vcn_v4_0_hw_fini.
> 
> [   44.563572] RIP: 0010:amdgpu_irq_put+0xa4/0xc0 [amdgpu]
> [   44.563629] RSP: 0018:b36740edfc90 EFLAGS: 00010246
> [   44.563630] RAX:  RBX: 0001 RCX:
> 
> [   44.563630] RDX:  RSI:  RDI:
> 
> [   44.563631] RBP: b36740edfcb0 R08:  R09:
> 
> [   44.563631] R10:  R11:  R12:
> 954c568e2ea8
> [   44.563631] R13:  R14: 954c568c R15:
> 954c568e2ea8
> [   44.563632] FS:  () GS:954f584c()
> knlGS:
> [   44.563632] CS:  0010 DS:  ES:  CR0: 80050033
> [   44.563633] CR2: 7f028741ba70 CR3: 00026ca1 CR4:
> 00750ee0
> [   44.563633] PKRU: 5554
> [   44.563633] Call Trace:
> [   44.563634]  
> [   44.563634]  vcn_v4_0_hw_fini+0x62/0x160 [amdgpu]
> [   44.563700]  vcn_v4_0_suspend+0x13/0x30 [amdgpu]
> [   44.563755]  amdgpu_device_ip_suspend_phase2+0x240/0x470 [amdgpu]
> [   44.563806]  amdgpu_device_ip_suspend+0x41/0x80 [amdgpu]
> [   44.563858]  amdgpu_device_pre_asic_reset+0xd9/0x4a0 [amdgpu]
> [   44.563909]  amdgpu_device_gpu_recover.cold+0x548/0xcf1 [amdgpu]
> [   44.564006]  amdgpu_debugfs_reset_work+0x4c/0x80 [amdgpu]
> [   44.564061]  process_one_work+0x21f/0x400
> [   44.564062]  worker_thread+0x200/0x3f0
> [   44.564063]  ? process_one_work+0x400/0x400
> [   44.564064]  kthread+0xee/0x120
> [   44.564065]  ? kthread_complete_and_exit+0x20/0x20
> [   44.564066]  ret_from_fork+0x22/0x30
> 
> Suggested-by: Hawking Zhang 
> Signed-off-by: Horatio Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 27 ++-
> drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h |  3 +++
>  drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c   | 24 ++---
>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c   | 35 -
>  4 files changed, 78 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> index e63fcc58e8e0..f53c22db8d25 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
> @@ -1181,6 +1181,31 @@ int amdgpu_vcn_process_poison_irq(struct
> amdgpu_device *adev,
>   return 0;
>  }
> 
> +int amdgpu_vcn_ras_late_init(struct amdgpu_device *adev, struct ras_common_if *ras_block)
> +{
> + int r, i;
> +
> + r = amdgpu_ras_block_late_init(adev, ras_block);
> + if (r)
> + return r;
> +
> + if (amdgpu_ras_is_supported(adev, ras_block->block)) {
> + for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
> + if (adev->vcn.harvest_config & (1 << i))
> + continue;
> +
> + r = amdgpu_irq_get(adev, &adev-
> >vcn.inst[i].ras_poison_irq, 0);
> + if (r)
> + goto late_fini;
> + }
> + }
> + return 0;
> +
> +late_fini:
> + amdgpu_ras_block_late_fini(adev, ras_block);
> + return r;
> +}
> +
>  int amdgpu_vcn_ras_sw_init(struct amdgpu_device *adev)
>  {
>   int err;
> @@ -1202,7 +1227,7 @@ int amdgpu_vcn_ras_sw_init(struct amdgpu_device
> *adev)
>   adev->vcn.ras_if = &ras->ras_block.ras_comm;
> 
>   if (!ras->ras_block.ras_late_init)
> - ras->ras_block.ras_late_init = amdgpu_ras_block_late_init;
> + ras->ras_block.ras_late_init = amdgpu_vcn_ras_late_init;
> 
>   return 0;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
> index c730949ece7d..802d4c2edb41 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.h
> @@ -234

RE: [PATCH 1/2] drm/amdgpu: fix amdgpu_irq_put call trace in jpeg_v4_0_hw_fini

2023-05-08 Thread Zhou1, Tao
[AMD Official Use Only - General]

The series is:

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Horatio Zhang 
> Sent: Monday, May 8, 2023 6:20 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Zhou1, Tao
> ; Xu, Feifei ; Liu, Leo
> ; Jiang, Sonny ; Limonciello,
> Mario ; Liu, HaoPing (Alan)
> ; Zhang, Horatio 
> Subject: [PATCH 1/2] drm/amdgpu: fix amdgpu_irq_put call trace in
> jpeg_v4_0_hw_fini
> 
> During suspend, the jpeg_v4_0_hw_fini function uses amdgpu_irq_put to
> disable the irq of jpeg.inst, but the irq was not enabled during the resume
> process, which resulted in a call trace during the GPU reset process.
> 
> [   50.497562] RIP: 0010:amdgpu_irq_put+0xa4/0xc0 [amdgpu]
> [   50.497619] RSP: 0018:aa2400fcfcb0 EFLAGS: 00010246
> [   50.497620] RAX:  RBX: 0001 RCX:
> 
> [   50.497621] RDX:  RSI:  RDI:
> 
> [   50.497621] RBP: aa2400fcfcd0 R08:  R09:
> 
> [   50.497622] R10:  R11:  R12:
> 99b2105242d8
> [   50.497622] R13:  R14: 99b21050 R15:
> 99b21050
> [   50.497623] FS:  () GS:99b51848()
> knlGS:
> [   50.497623] CS:  0010 DS:  ES:  CR0: 80050033
> [   50.497624] CR2: 7f9d32aa91e8 CR3: 0001ba21 CR4:
> 00750ee0
> [   50.497624] PKRU: 5554
> [   50.497625] Call Trace:
> [   50.497625]  
> [   50.497627]  jpeg_v4_0_hw_fini+0x43/0xc0 [amdgpu]
> [   50.497693]  jpeg_v4_0_suspend+0x13/0x30 [amdgpu]
> [   50.497751]  amdgpu_device_ip_suspend_phase2+0x240/0x470 [amdgpu]
> [   50.497802]  amdgpu_device_ip_suspend+0x41/0x80 [amdgpu]
> [   50.497854]  amdgpu_device_pre_asic_reset+0xd9/0x4a0 [amdgpu]
> [   50.497905]  amdgpu_device_gpu_recover.cold+0x548/0xcf1 [amdgpu]
> [   50.498005]  amdgpu_debugfs_reset_work+0x4c/0x80 [amdgpu]
> [   50.498060]  process_one_work+0x21f/0x400
> [   50.498063]  worker_thread+0x200/0x3f0
> [   50.498064]  ? process_one_work+0x400/0x400
> [   50.498065]  kthread+0xee/0x120
> [   50.498067]  ? kthread_complete_and_exit+0x20/0x20
> [   50.498068]  ret_from_fork+0x22/0x30
> 
> Fixes: 86e8255f941e ("drm/amdgpu: add JPEG 4.0 RAS poison consumption
> handling")
> Signed-off-by: Horatio Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
> b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
> index 77e1e64aa1d1..b5c14a166063 100644
> --- a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c
> @@ -66,6 +66,13 @@ static int jpeg_v4_0_early_init(void *handle)
>   return 0;
>  }
> 
> +static int jpeg_v4_0_late_init(void *handle)
> +{
> + struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> +
> + return amdgpu_irq_get(adev, &adev->jpeg.inst->irq, 0);
> +}
> +
>  /**
>   * jpeg_v4_0_sw_init - sw init for JPEG block
>   *
> @@ -696,7 +703,7 @@ static int jpeg_v4_0_process_interrupt(struct amdgpu_device *adev,
>  static const struct amd_ip_funcs jpeg_v4_0_ip_funcs = {
>   .name = "jpeg_v4_0",
>   .early_init = jpeg_v4_0_early_init,
> - .late_init = NULL,
> + .late_init = jpeg_v4_0_late_init,
>   .sw_init = jpeg_v4_0_sw_init,
>   .sw_fini = jpeg_v4_0_sw_fini,
>   .hw_init = jpeg_v4_0_hw_init,
> --
> 2.34.1


RE: [PATCH 2/2] drm/amdgpu: fix amdgpu_irq_put call trace in vcn_v4_0_hw_fini

2023-05-08 Thread Zhou1, Tao
[AMD Official Use Only - General]



> -Original Message-
> From: amd-gfx  On Behalf Of Horatio
> Zhang
> Sent: Monday, May 8, 2023 6:20 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Liu, HaoPing (Alan) ; Zhang, Horatio
> ; Xu, Feifei ; Zhou1, Tao
> ; Jiang, Sonny ; Limonciello,
> Mario ; Liu, Leo ; Zhang,
> Hawking 
> Subject: [PATCH 2/2] drm/amdgpu: fix amdgpu_irq_put call trace in
> vcn_v4_0_hw_fini
> 
> During suspend, the vcn_v4_0_hw_fini function uses amdgpu_irq_put to
> disable the irq of vcn.inst, but the irq was not enabled during the resume
> process, which resulted in a call trace during the GPU reset process.
> 
> [   44.563572] RIP: 0010:amdgpu_irq_put+0xa4/0xc0 [amdgpu]
> [   44.563629] RSP: 0018:b36740edfc90 EFLAGS: 00010246
> [   44.563630] RAX:  RBX: 0001 RCX:
> 
> [   44.563630] RDX:  RSI:  RDI:
> 
> [   44.563631] RBP: b36740edfcb0 R08:  R09:
> 
> [   44.563631] R10:  R11:  R12:
> 954c568e2ea8
> [   44.563631] R13:  R14: 954c568c R15:
> 954c568e2ea8
> [   44.563632] FS:  () GS:954f584c()
> knlGS:
> [   44.563632] CS:  0010 DS:  ES:  CR0: 80050033
> [   44.563633] CR2: 7f028741ba70 CR3: 00026ca1 CR4:
> 00750ee0
> [   44.563633] PKRU: 5554
> [   44.563633] Call Trace:
> [   44.563634]  
> [   44.563634]  vcn_v4_0_hw_fini+0x62/0x160 [amdgpu]
> [   44.563700]  vcn_v4_0_suspend+0x13/0x30 [amdgpu]
> [   44.563755]  amdgpu_device_ip_suspend_phase2+0x240/0x470 [amdgpu]
> [   44.563806]  amdgpu_device_ip_suspend+0x41/0x80 [amdgpu]
> [   44.563858]  amdgpu_device_pre_asic_reset+0xd9/0x4a0 [amdgpu]
> [   44.563909]  amdgpu_device_gpu_recover.cold+0x548/0xcf1 [amdgpu]
> [   44.564006]  amdgpu_debugfs_reset_work+0x4c/0x80 [amdgpu]
> [   44.564061]  process_one_work+0x21f/0x400
> [   44.564062]  worker_thread+0x200/0x3f0
> [   44.564063]  ? process_one_work+0x400/0x400
> [   44.564064]  kthread+0xee/0x120
> [   44.564065]  ? kthread_complete_and_exit+0x20/0x20
> [   44.564066]  ret_from_fork+0x22/0x30
> 
> Fixes: ea5309de7388 ("drm/amdgpu: add VCN 4.0 RAS poison consumption
> handling")
> Signed-off-by: Horatio Zhang 
> ---
>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c | 17 -
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
> b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
> index bf0674039598..b55eb1bf3e30 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
> @@ -281,6 +281,21 @@ static int vcn_v4_0_hw_init(void *handle)
>   return r;
>  }
> 
> +static int vcn_v4_0_late_init(void *handle)
> +{
> + struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> + int i;
> +
> + for (i = 0; i < adev->vcn.num_vcn_inst; ++i) {
> + if (adev->vcn.harvest_config & (1 << i))
> + continue;
> +
> + amdgpu_irq_get(adev, &adev->vcn.inst[i].irq, 0);

[Tao] We could also check its return value and exit if r is non-zero, but
either way is fine with me.

> + }
> +
> + return 0;
> +}
> +
>  /**
>   * vcn_v4_0_hw_fini - stop the hardware block
>   *
> @@ -2047,7 +2062,7 @@ static void vcn_v4_0_set_irq_funcs(struct amdgpu_device *adev)
>  static const struct amd_ip_funcs vcn_v4_0_ip_funcs = {
>   .name = "vcn_v4_0",
>   .early_init = vcn_v4_0_early_init,
> - .late_init = NULL,
> + .late_init = vcn_v4_0_late_init,
>   .sw_init = vcn_v4_0_sw_init,
>   .sw_fini = vcn_v4_0_sw_fini,
>   .hw_init = vcn_v4_0_hw_init,
> --
> 2.34.1
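
For reference, the return-value check mentioned in the inline comment could
look roughly like the sketch below. It is a standalone mock, not driver code:
the mock_* types and functions are hypothetical stand-ins for amdgpu_device,
amdgpu_irq_src and amdgpu_irq_get, and it only assumes the usual kernel
convention of returning 0 on success and a negative errno on failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MOCK_MAX_INST 4

/* Hypothetical stand-ins for the driver structures (not the real amdgpu types). */
struct mock_irq_src {
	int refcount;	/* how many times the irq has been taken */
	int fail;	/* set to 1 to make mock_irq_get() fail */
};

struct mock_dev {
	unsigned int num_inst;
	unsigned int harvest_config;	/* bit i set: instance i is harvested */
	struct mock_irq_src irq[MOCK_MAX_INST];
};

/* Mock "get": returns 0 on success or a negative errno on failure. */
static int mock_irq_get(struct mock_irq_src *src)
{
	assert(src != NULL);
	if (src->fail)
		return -EINVAL;
	src->refcount++;
	return 0;
}

/* late_init variant that stops at, and returns, the first error. */
static int mock_late_init(struct mock_dev *dev)
{
	unsigned int i;
	int r;

	for (i = 0; i < dev->num_inst; ++i) {
		if (dev->harvest_config & (1u << i))
			continue;	/* skip harvested instances */

		r = mock_irq_get(&dev->irq[i]);
		if (r)
			return r;
	}

	return 0;
}
```

With this shape a failing get is reported to the caller instead of being
silently ignored. Instances that were already taken before the failure are
left as they are in this sketch; whether they should be released again on the
error path is a separate design decision.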


RE: [PATCH] drm/amdgpu/gfx: disable cp_ecc_error_irq only when gfx ras is enabled in suspend

2023-05-07 Thread Zhou1, Tao
[AMD Official Use Only - General]

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Chen, Guchun 
> Sent: Saturday, May 6, 2023 8:16 PM
> To: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> ; Zhang, Hawking ;
> Lazar, Lijo ; Zhou1, Tao ; Koenig,
> Christian 
> Cc: Chen, Guchun 
> Subject: [PATCH] drm/amdgpu/gfx: disable cp_ecc_error_irq only when gfx ras is
> enabled in suspend
> 
> cp_ecc_error_irq is only enabled when gfx ras is enabled.
> So in gfx_v9_0_hw_fini, the interrupt disablement for cp_ecc_error_irq should
> only be executed under that condition; otherwise, an amdgpu_irq_put call trace
> will occur.
> 
> [ 7283.170322] RIP: 0010:amdgpu_irq_put+0x45/0x70 [amdgpu]
> [ 7283.170964] RSP: 0018:9a5fc3967d00 EFLAGS: 00010246
> [ 7283.170967] RAX: 98d88afd3040 RBX: 98d89da2 RCX:
> [ 7283.170969] RDX:  RSI: 98d89da2bef8 RDI: 98d89da2
> [ 7283.170971] RBP: 98d89da2 R08: 98d89da2ca18 R09: 0006
> [ 7283.170973] R10: d5764243c008 R11:  R12: 1050
> [ 7283.170975] R13: 98d89da38978 R14: 999ae15a R15: 98d880130105
> [ 7283.170978] FS:  () GS:98d996f0() knlGS:
> [ 7283.170981] CS:  0010 DS:  ES:  CR0: 80050033
> [ 7283.170983] CR2: f7a9d178 CR3: 0001c42ea000 CR4: 003506e0
> [ 7283.170986] Call Trace:
> [ 7283.170988]
> [ 7283.170989]  gfx_v9_0_hw_fini+0x1c/0x6d0 [amdgpu]
> [ 7283.171655]  amdgpu_device_ip_suspend_phase2+0x101/0x1a0 [amdgpu]
> [ 7283.172245]  amdgpu_device_suspend+0x103/0x180 [amdgpu]
> [ 7283.172823]  amdgpu_pmops_freeze+0x21/0x60 [amdgpu]
> [ 7283.173412]  pci_pm_freeze+0x54/0xc0
> [ 7283.173419]  ? __pfx_pci_pm_freeze+0x10/0x10
> [ 7283.173425]  dpm_run_callback+0x98/0x200
> [ 7283.173430]  __device_suspend+0x164/0x5f0
> 
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2522
> 
> Signed-off-by: Guchun Chen 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 3 ++-
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c  | 3 ++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> index ecf8ceb53311..f6bc62a94099 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> @@ -4442,7 +4442,8 @@ static int gfx_v11_0_hw_fini(void *handle)
>   struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>   int r;
> 
> - amdgpu_irq_put(adev, &adev->gfx.cp_ecc_error_irq, 0);
> + if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX))
> + amdgpu_irq_put(adev, &adev->gfx.cp_ecc_error_irq, 0);
>   amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
>   amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index ae09fc1cfe6b..c54d05bdc2d8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -3751,7 +3751,8 @@ static int gfx_v9_0_hw_fini(void *handle)
>  {
>   struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> 
> - amdgpu_irq_put(adev, &adev->gfx.cp_ecc_error_irq, 0);
> + if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX))
> + amdgpu_irq_put(adev, &adev->gfx.cp_ecc_error_irq, 0);
>   amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
>   amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
> 
> --
> 2.25.1
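
The imbalance described in the commit message can be illustrated with a small
standalone mock. This is only a sketch under stated assumptions: the
gfx_mock_* names are hypothetical, the put is modelled as failing when the
interrupt was never taken (which is what the call trace suggests), and the ECC
interrupt is assumed to be taken only when RAS is supported, as the commit
message describes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real amdgpu structures. */
struct gfx_mock_irq_src {
	int refcount;
};

struct gfx_mock {
	bool ras_supported;
	struct gfx_mock_irq_src cp_ecc_error_irq;
	struct gfx_mock_irq_src priv_reg_irq;
};

static int gfx_mock_irq_get(struct gfx_mock_irq_src *src)
{
	assert(src != NULL);
	src->refcount++;
	return 0;
}

/* A put without a matching get is reported as an error (assumed behaviour). */
static int gfx_mock_irq_put(struct gfx_mock_irq_src *src)
{
	assert(src != NULL);
	if (src->refcount == 0)
		return -EINVAL;
	src->refcount--;
	return 0;
}

/* Init side: the ECC irq is only taken when RAS is supported. */
static void gfx_mock_late_init(struct gfx_mock *gfx)
{
	gfx_mock_irq_get(&gfx->priv_reg_irq);
	if (gfx->ras_supported)
		gfx_mock_irq_get(&gfx->cp_ecc_error_irq);
}

/*
 * Teardown side. With guarded == false the ECC irq is always released (the
 * behaviour before the patch); with guarded == true it is released only when
 * RAS is supported. Returns the number of unbalanced puts.
 */
static int gfx_mock_hw_fini(struct gfx_mock *gfx, bool guarded)
{
	int unbalanced = 0;

	if (!guarded || gfx->ras_supported)
		unbalanced += (gfx_mock_irq_put(&gfx->cp_ecc_error_irq) != 0);
	unbalanced += (gfx_mock_irq_put(&gfx->priv_reg_irq) != 0);

	return unbalanced;
}
```

On a device without RAS support, the unguarded teardown produces one
unbalanced put, while the guarded teardown produces none, because the put then
mirrors the condition under which the get was issued.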

