RE: [PATCH v1 0/2] SDMA v5_2 ip dump support for devcoredump

2024-07-12 Thread Khatri, Sunil
[AMD Official Use Only - AMD Internal Distribution Only]

Ignore Plz

-Original Message-
From: Sunil Khatri 
Sent: Friday, July 12, 2024 5:23 PM
To: Deucher, Alexander ; Koenig, Christian 

Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil 
Subject: [PATCH v1 0/2] SDMA v5_2 ip dump support for devcoredump

Sample output:
IP: sdma_v5_2
num_instances:2

Instance:0
mmSDMA0_STATUS_REG   0x46deed57
mmSDMA0_STATUS1_REG  0x03ff
mmSDMA0_STATUS2_REG  0x3f20
mmSDMA0_STATUS3_REG  0x03f6
mmSDMA0_UCODE_CHECKSUM   0x716360f5
mmSDMA0_RB_RPTR_FETCH_HI 0x
mmSDMA0_RB_RPTR_FETCH0x4980
mmSDMA0_UTCL1_RD_STATUS  0x01891555
mmSDMA0_UTCL1_WR_STATUS  0x51811555
mmSDMA0_UTCL1_RD_XNACK0  0x00155828
mmSDMA0_UTCL1_RD_XNACK1  0x02a6a700
mmSDMA0_UTCL1_WR_XNACK0  0x00111558
mmSDMA0_UTCL1_WR_XNACK1  0x01c1c100
mmSDMA0_GFX_RB_CNTL  0x80871016
mmSDMA0_GFX_RB_RPTR  0x4980
mmSDMA0_GFX_RB_RPTR_HI   0x
mmSDMA0_GFX_RB_WPTR  0x4980
mmSDMA0_GFX_RB_WPTR_HI   0x
mmSDMA0_GFX_IB_OFFSET0x
mmSDMA0_GFX_IB_BASE_LO   0x00928600
mmSDMA0_GFX_IB_BASE_HI   0x
mmSDMA0_GFX_IB_CNTL  0x0100
mmSDMA0_GFX_IB_RPTR  0x01a0
mmSDMA0_GFX_IB_SUB_REMAIN0x
mmSDMA0_GFX_DUMMY_REG0x00af
mmSDMA0_PAGE_RB_CNTL 0x8087
mmSDMA0_PAGE_RB_RPTR 0x
mmSDMA0_PAGE_RB_RPTR_HI  0x
mmSDMA0_PAGE_RB_WPTR 0x
mmSDMA0_PAGE_RB_WPTR_HI  0x
mmSDMA0_PAGE_IB_OFFSET   0x
mmSDMA0_PAGE_IB_BASE_LO  0x
mmSDMA0_PAGE_IB_BASE_HI  0x
mmSDMA0_PAGE_DUMMY_REG   0x000f
mmSDMA0_RLC0_RB_CNTL 0x8007
mmSDMA0_RLC0_RB_RPTR 0x
mmSDMA0_RLC0_RB_RPTR_HI  0x
mmSDMA0_RLC0_RB_WPTR 0x
mmSDMA0_RLC0_RB_WPTR_HI  0x
mmSDMA0_RLC0_IB_OFFSET   0x
mmSDMA0_RLC0_IB_BASE_LO  0x
mmSDMA0_RLC0_IB_BASE_HI  0x
mmSDMA0_RLC0_DUMMY_REG   0x000f
mmSDMA0_INT_STATUS   0x00e0
mmSDMA0_VM_CNTL  0x
mmGRBM_STATUS2   0x5408

Instance:1
mmSDMA0_STATUS_REG   0x46deed57
mmSDMA0_STATUS1_REG  0x03ff
mmSDMA0_STATUS2_REG  0x43ad
mmSDMA0_STATUS3_REG  0x03f6
mmSDMA0_UCODE_CHECKSUM   0x716360f5
mmSDMA0_RB_RPTR_FETCH_HI 0x
mmSDMA0_RB_RPTR_FETCH0x3d00
mmSDMA0_UTCL1_RD_STATUS  0x01891555
mmSDMA0_UTCL1_WR_STATUS  0x51811555
mmSDMA0_UTCL1_RD_XNACK0  0x00155827
mmSDMA0_UTCL1_RD_XNACK1  0x021a1b00
mmSDMA0_UTCL1_WR_XNACK0  0x00111558
mmSDMA0_UTCL1_WR_XNACK1  0x01656500
mmSDMA0_GFX_RB_CNTL  0x80871016
mmSDMA0_GFX_RB_RPTR  0x3d00
mmSDMA0_GFX_RB_RPTR_HI   0x
mmSDMA0_GFX_RB_WPTR  0x3d00
mmSDMA0_GFX_RB_WPTR_HI   0x
mmSDMA0_GFX_IB_OFFSET0x
mmSDMA0_GFX_IB_BASE_LO   0x00927200
mmSDMA0_GFX_IB_BASE_HI   0x
mmSDMA0_GFX_IB_CNTL
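For readers unfamiliar with the devcoredump output format, a rough user-space sketch of the print loop that produces output shaped like the sample above (register names and values adapted from the sample; in the driver the values come from per-instance RREG32 reads, and the names here are just a stand-in table):

#include <stdio.h>
#include <stdint.h>

#define NUM_INSTANCES 2
#define REG_COUNT 3

static const char *sdma_reg_names[REG_COUNT] = {
    "mmSDMA0_STATUS_REG", "mmSDMA0_STATUS1_REG", "mmSDMA0_STATUS2_REG"
};

int main(void)
{
    /* Snapshot buffer: REG_COUNT dwords per SDMA instance; values are
     * illustrative, adapted from the sample output above. */
    uint32_t dump[NUM_INSTANCES][REG_COUNT] = {
        { 0x46deed57, 0x000003ff, 0x00003f20 },
        { 0x46deed57, 0x000003ff, 0x000043ad },
    };
    int i, r;

    printf("IP: sdma_v5_2\nnum_instances:%d\n", NUM_INSTANCES);
    for (i = 0; i < NUM_INSTANCES; i++) {
        printf("\nInstance:%d\n", i);
        for (r = 0; r < REG_COUNT; r++)
            printf("%-25s 0x%08x\n", sdma_reg_names[r], dump[i][r]);
    }
    return 0;
}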

RE: [PATCH v1 3/3] drm/amdgpu: select compute ME engines dynamically

2024-07-09 Thread Khatri, Sunil
[AMD Official Use Only - AMD Internal Distribution Only]

Thanks Alex

-Original Message-
From: Alex Deucher 
Sent: Tuesday, July 9, 2024 7:27 PM
To: Khatri, Sunil 
Cc: Deucher, Alexander ; Koenig, Christian 
; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH v1 3/3] drm/amdgpu: select compute ME engines dynamically

Makes sense, although the pattern elsewhere is to just start at 1 for mec.  Not 
sure if it's worth the effort to fix all of those cases up too.
True, but this will keep a check for gfx13 and onwards; maybe we would have more than one ME for GFX in some chip, and then we would have to take care of it explicitly.

Series is:
Acked-by: Alex Deucher 

On Tue, Jul 9, 2024 at 2:07 AM Sunil Khatri  wrote:
>
> There is only one GFX ME right now, but this could change in future SOCs.
> Use the number of GFX MEs as the starting point for the compute MEs on GFX12.
>
> Signed-off-by: Sunil Khatri 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> index 084b039eb765..f384be0d1800 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> @@ -4946,7 +4946,7 @@ static void gfx_v12_ip_dump(void *handle)
> for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
> for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; 
> k++) {
> /* ME0 is for GFX so start from 1 for CP */
> -   soc24_grbm_select(adev, 1+i, j, k, 0);
> +   soc24_grbm_select(adev,
> + adev->gfx.me.num_me + i, j, k, 0);
> for (reg = 0; reg < reg_count; reg++) {
> 
> adev->gfx.ip_dump_compute_queues[index + reg] =
>
> RREG32(SOC15_REG_ENTRY_OFFSET(
> --
> 2.34.1
>
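As a side note on the indexing discussed above, a small standalone sketch (the struct and counts are simplified stand-ins for the real adev->gfx bookkeeping, not driver code):

#include <stdio.h>

/* Simplified stand-ins for the adev->gfx.me / adev->gfx.mec counts. */
struct gfx_info {
    unsigned int num_me;   /* number of gfx MEs (currently 1) */
    unsigned int num_mec;  /* number of compute MEs */
};

/* Compute MEs sit after the gfx MEs in the GRBM "me" index space, so the
 * i-th compute ME is selected with num_me + i rather than a hardcoded 1 + i. */
static unsigned int compute_me_index(const struct gfx_info *gfx, unsigned int i)
{
    return gfx->num_me + i;
}

int main(void)
{
    struct gfx_info gfx = { .num_me = 1, .num_mec = 2 };
    unsigned int i;

    for (i = 0; i < gfx.num_mec; i++)
        printf("compute ME %u -> GRBM me index %u\n", i, compute_me_index(&gfx, i));
    return 0;
}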


RE: [PATCH v1 1/3] drm/amdgpu: add gfx9 register support in ipdump

2024-05-29 Thread Khatri, Sunil
[AMD Official Use Only - AMD Internal Distribution Only]

-Original Message-
From: Alex Deucher 
Sent: Wednesday, May 29, 2024 7:16 PM
To: Khatri, Sunil 
Cc: Deucher, Alexander ; Koenig, Christian 
; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH v1 1/3] drm/amdgpu: add gfx9 register support in ipdump

On Wed, May 29, 2024 at 5:50 AM Sunil Khatri  wrote:
>
> Add general registers of gfx9 in ipdump for devcoredump support.
>
> Signed-off-by: Sunil Khatri 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 124 +-
>  1 file changed, 123 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 3c8c5abf35ab..528a20393313 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -149,6 +149,94 @@ MODULE_FIRMWARE("amdgpu/aldebaran_sjt_mec2.bin");
>  #define mmGOLDEN_TSC_COUNT_LOWER_Renoir0x0026
>  #define mmGOLDEN_TSC_COUNT_LOWER_Renoir_BASE_IDX   1
>
> +static const struct amdgpu_hwip_reg_entry gc_reg_list_9[] = {
> +   SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS2),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_STALLED_STAT1),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_STALLED_STAT2),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_STALLED_STAT1),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_STALLED_STAT1),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_BUSY_STAT),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_BUSY_STAT),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_BUSY_STAT),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_STATUS),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_ERROR),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_BASE),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_RPTR),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_WPTR),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_BASE),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_RPTR),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_WPTR),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_BASE),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_RPTR),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_WPTR),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB2_BASE),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB2_WPTR),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB2_WPTR),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_CMD_BUFSZ),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_CMD_BUFSZ),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_CMD_BUFSZ),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_CMD_BUFSZ),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_BASE_LO),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_BASE_HI),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_BUFSZ),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_BASE_LO),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_BASE_HI),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_BUFSZ),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_BASE_LO),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_BASE_HI),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_BUFSZ),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_BASE_LO),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_BASE_HI),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_BUFSZ),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCPF_UTCL1_STATUS),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCPC_UTCL1_STATUS),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCPG_UTCL1_STATUS),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmGDS_PROTECTION_FAULT),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmGDS_VM_PROTECTION_FAULT),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmIA_UTCL1_STATUS),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmIA_UTCL1_CNTL),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmPA_CL_CNTL_STATUS),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmRLC_UTCL1_STATUS),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmRMI_UTCL1_STATUS),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmSQC_DCACHE_UTCL1_STATUS),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmSQC_ICACHE_UTCL1_STATUS),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmSQ_UTCL1_STATUS),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmTCP_UTCL1_STATUS),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmWD_UTCL1_STATUS),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmVM_L2_PROTECTION_FAULT_CNTL),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmVM_L2_PROTECTION_FAULT_STATUS),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_DEBUG),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_MEC_CNTL),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_INSTR_PNTR),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_MEC1_INSTR_PNTR),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_MEC2_INSTR_PNTR),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_ME_INSTR_PNTR),
> +   SOC15_REG_ENTRY_STR(GC, 0, mmCP_PFP_INSTR_PNTR),
> +   SOC15_R

Re: [PATCH v3 2/4] drm/amdgpu: Add support to dump gfx10 cp registers

2024-05-15 Thread Khatri, Sunil



On 5/16/2024 1:40 AM, Deucher, Alexander wrote:

[Public]


-Original Message-
From: Sunil Khatri 
Sent: Wednesday, May 15, 2024 8:18 AM
To: Deucher, Alexander ; Koenig, Christian

Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil 
Subject: [PATCH v3 2/4] drm/amdgpu: Add support to dump gfx10 cp
registers

Add support to dump all instances of the CP registers in gfx10.

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h |   1 +
  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 117 +++-
  2 files changed, 114 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index 30d7f9c29478..d96873c154ed 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -436,6 +436,7 @@ struct amdgpu_gfx {

   /* IP reg dump */
   uint32_t*ipdump_core;
+ uint32_t*ipdump_cp;

I'd call this ip_dump_compute or ip_dump_compute_queues to align with what the
registers represent.

Sure


Alex


  };

  struct amdgpu_gfx_ras_reg_entry {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index f6d6a4b9802d..daf9a3571183 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -381,6 +381,49 @@ static const struct amdgpu_hwip_reg_entry gc_reg_list_10_1[] = {
   SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE3)  };

+static const struct amdgpu_hwip_reg_entry gc_cp_reg_list_10[] = {
+ /* compute registers */
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_VMID),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PERSISTENT_STATE),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PIPE_PRIORITY),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_QUEUE_PRIORITY),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_QUANTUM),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_BASE),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_BASE_HI),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_RPTR),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_POLL_ADDR),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_POLL_ADDR_HI),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_DOORBELL_CONTROL),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_CONTROL),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_BASE_ADDR),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_BASE_ADDR_HI),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_RPTR),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_CONTROL),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_REQUEST),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_BASE_ADDR),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_BASE_ADDR_HI),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_CONTROL),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_RPTR),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_WPTR),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_EVENTS),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_BASE_ADDR_LO),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_BASE_ADDR_HI),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_CONTROL),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CNTL_STACK_OFFSET),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CNTL_STACK_SIZE),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_WG_STATE_OFFSET),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_SIZE),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_GDS_RESOURCE_STATE),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_ERROR),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_WPTR_MEM),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_LO),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_HI),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_CNTL_STACK_OFFSET),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_CNTL_STACK_DW_CNT),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_WG_STATE_OFFSET),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_STATUS) };
+
  static const struct soc15_reg_golden golden_settings_gc_10_1[] = {
   SOC15_REG_GOLDEN_VALUE(GC, 0, mmCB_HW_CONTROL_4, 0x, 0x00400014),
   SOC15_REG_GOLDEN_VALUE(GC, 0, mmCGTT_CPF_CLK_CTRL, 0xfcff8fff, 0xf8000100),
@@ -4595,10 +4638,11 @@ static int gfx_v10_0_compute_ring_init(struct amdgpu_device *adev, int ring_id,
hw_prio, NULL);
  }

-static void gfx_v10_0_alloc_dump_mem(struct amdgpu_device *adev)
+static void gfx_v10_0_alloc_ip_dump(struct amdgpu_device *adev)
  {
   uint32_t reg_count = ARRAY_SIZE(gc_reg_list_10_1);
   uint32_t *ptr;
+ uint32_t inst;

   ptr = kcalloc(reg_count, sizeof(uint32_t), GFP_KERNEL);
   if (ptr == NULL) {
@@ -4607,6 +4651,19 @@ static void gfx_v10_0_alloc_dump_mem(struct amdgpu_device *adev)
   } else {
   adev->gfx.ipdump_core = ptr;
   }
+
+ /* Allocate memory for gfx cp registers for all the instances */
+ reg_count = ARRAY_SIZE(gc_cp_reg_list_10);
+ inst = adev->gfx.mec.num_mec * adev->gfx.mec.num_pipe_per_mec *
+ adev->gfx.mec.num_queue_per_pipe;
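As an aside, the sizing arithmetic behind this allocation, as a standalone sketch (all counts are hypothetical; in the driver they come from the mec/me configuration discovered at init):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void)
{
    /* Hypothetical counts, for illustration only. */
    uint32_t num_mec = 2, num_pipe_per_mec = 4, num_queue_per_pipe = 8;
    uint32_t reg_count = 40; /* e.g. ARRAY_SIZE(gc_cp_reg_list_10) */
    uint32_t inst = num_mec * num_pipe_per_mec * num_queue_per_pipe;

    /* One dword per register per (mec, pipe, queue) instance. */
    uint32_t *buf = calloc((size_t)reg_count * inst, sizeof(uint32_t));
    if (!buf)
        return 1;
    printf("allocated %u dwords for %u instances\n", reg_count * inst, inst);
    free(buf);
    return 0;
}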

Re: [PATCH v3 3/4] drm/amdgpu: add support to dump gfx10 queue registers

2024-05-15 Thread Khatri, Sunil



On 5/16/2024 1:42 AM, Deucher, Alexander wrote:

[Public]


-Original Message-
From: Sunil Khatri 
Sent: Wednesday, May 15, 2024 8:18 AM
To: Deucher, Alexander ; Koenig, Christian

Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil 
Subject: [PATCH v3 3/4] drm/amdgpu: add support to dump gfx10 queue
registers

Add gfx queue registers for all instances in the ip dump for gfx10.

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h |  1 +
drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 86 +
  2 files changed, 87 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index d96873c154ed..54232066cd3b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -437,6 +437,7 @@ struct amdgpu_gfx {
   /* IP reg dump */
   uint32_t*ipdump_core;
   uint32_t*ipdump_cp;
+ uint32_t*ipdump_gfx_queue;

I'd call this ip_dump_gfx or ip_dump_gfx_queues to better align with what it
stores.


  };

  struct amdgpu_gfx_ras_reg_entry {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index daf9a3571183..5b8132ecc039 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -424,6 +424,33 @@ static const struct amdgpu_hwip_reg_entry gc_cp_reg_list_10[] = {
   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_STATUS)  };

+static const struct amdgpu_hwip_reg_entry gc_gfx_queue_reg_list_10[] = {
+ /* gfx queue registers */
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_ACTIVE),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_QUEUE_PRIORITY),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_BASE),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_BASE_HI),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_OFFSET),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CSMD_RPTR),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_WPTR),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_WPTR_HI),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_DEQUEUE_REQUEST),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_MAPPED),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_QUE_MGR_CONTROL),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_HQ_CONTROL0),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_HQ_STATUS0),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CE_WPTR_POLL_ADDR_LO),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CE_WPTR_POLL_ADDR_HI),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CE_OFFSET),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CE_CSMD_RPTR),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CE_WPTR),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CE_WPTR_HI),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_MQD_BASE_ADDR),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_MQD_BASE_ADDR_HI),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_WPTR_POLL_ADDR_LO),
+ SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_WPTR_POLL_ADDR_HI) };
+
  static const struct soc15_reg_golden golden_settings_gc_10_1[] = {
   SOC15_REG_GOLDEN_VALUE(GC, 0, mmCB_HW_CONTROL_4,
0x, 0x00400014),
   SOC15_REG_GOLDEN_VALUE(GC, 0, mmCGTT_CPF_CLK_CTRL, 0xfcff8fff, 0xf8000100),
@@ -4664,6 +4691,19 @@ static void gfx_v10_0_alloc_ip_dump(struct amdgpu_device *adev)
   } else {
   adev->gfx.ipdump_cp = ptr;
   }
+
+ /* Allocate memory for gfx cp queue registers for all the instances */
+ reg_count = ARRAY_SIZE(gc_gfx_queue_reg_list_10);
+ inst = adev->gfx.me.num_me * adev->gfx.me.num_pipe_per_me *
+ adev->gfx.me.num_queue_per_pipe;
+
+ ptr = kcalloc(reg_count * inst, sizeof(uint32_t), GFP_KERNEL);
+ if (ptr == NULL) {
+ DRM_ERROR("Failed to allocate memory for GFX CP IP
Dump\n");
+ adev->gfx.ipdump_gfx_queue = NULL;
+ } else {
+ adev->gfx.ipdump_gfx_queue = ptr;
+ }
  }

   static int gfx_v10_0_sw_init(void *handle)
@@ -4874,6 +4914,7 @@ static int gfx_v10_0_sw_fini(void *handle)

   kfree(adev->gfx.ipdump_core);
   kfree(adev->gfx.ipdump_cp);
+ kfree(adev->gfx.ipdump_gfx_queue);

   return 0;
  }
@@ -9368,6 +9409,26 @@ static void gfx_v10_ip_print(void *handle, struct drm_printer *p)
   }
   }
   }
+
+ /* print gfx queue registers for all instances */
+ if (!adev->gfx.ipdump_gfx_queue)
+ return;
+
+ reg_count = ARRAY_SIZE(gc_gfx_queue_reg_list_10);
+
+ for (i = 0; i < adev->gfx.me.num_me; i++) {
+ for (j = 0; j < adev->gfx.me.num_pipe_per_me; j++) {
+ for (k = 0; k < adev->gfx.me.num_queue_per_pipe; k++) {
+ drm_printf(p, "me %d, pipe %d, queue %d\n", i, j, k);
+ for (reg = 0; reg < reg_count; reg++) {
+ drm_

Re: [PATCH v3 1/4] drm/amdgpu: update the ip_dump to ipdump_core

2024-05-15 Thread Khatri, Sunil



On 5/16/2024 1:37 AM, Deucher, Alexander wrote:

[Public]


-Original Message-
From: Sunil Khatri 
Sent: Wednesday, May 15, 2024 8:18 AM
To: Deucher, Alexander ; Koenig, Christian

Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil 
Subject: [PATCH v3 1/4] drm/amdgpu: update the ip_dump to ipdump_core

Update the memory pointer from ip_dump to ipdump_core to make it specific
to the core registers; the remaining registers will be dumped into their own
respective buffers.

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h |  2 +-
drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 14 +++---
  2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index 109f471ff315..30d7f9c29478 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -435,7 +435,7 @@ struct amdgpu_gfx {
   boolmcbp; /* mid command buffer
preemption */

   /* IP reg dump */
- uint32_t*ip_dump;
+ uint32_t*ipdump_core;

I think this looks cleaner as ip_dump_core.


Noted




Alex


  };

  struct amdgpu_gfx_ras_reg_entry {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 953df202953a..f6d6a4b9802d 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -4603,9 +4603,9 @@ static void gfx_v10_0_alloc_dump_mem(struct amdgpu_device *adev)
   ptr = kcalloc(reg_count, sizeof(uint32_t), GFP_KERNEL);
   if (ptr == NULL) {
   DRM_ERROR("Failed to allocate memory for IP Dump\n");
- adev->gfx.ip_dump = NULL;
+ adev->gfx.ipdump_core = NULL;
   } else {
- adev->gfx.ip_dump = ptr;
+ adev->gfx.ipdump_core = ptr;
   }
  }

@@ -4815,7 +4815,7 @@ static int gfx_v10_0_sw_fini(void *handle)

   gfx_v10_0_free_microcode(adev);

- kfree(adev->gfx.ip_dump);
+ kfree(adev->gfx.ipdump_core);

   return 0;
  }
@@ -9283,13 +9283,13 @@ static void gfx_v10_ip_print(void *handle, struct drm_printer *p)
   uint32_t i;
   uint32_t reg_count = ARRAY_SIZE(gc_reg_list_10_1);

- if (!adev->gfx.ip_dump)
+ if (!adev->gfx.ipdump_core)
   return;

   for (i = 0; i < reg_count; i++)
   drm_printf(p, "%-50s \t 0x%08x\n",
  gc_reg_list_10_1[i].reg_name,
-adev->gfx.ip_dump[i]);
+adev->gfx.ipdump_core[i]);
  }

   static void gfx_v10_ip_dump(void *handle)
@@ -9298,12 +9298,12 @@ static void gfx_v10_ip_dump(void *handle)
   uint32_t i;
   uint32_t reg_count = ARRAY_SIZE(gc_reg_list_10_1);

- if (!adev->gfx.ip_dump)
+ if (!adev->gfx.ipdump_core)
   return;

   amdgpu_gfx_off_ctrl(adev, false);
   for (i = 0; i < reg_count; i++)
- adev->gfx.ip_dump[i] = RREG32(SOC15_REG_ENTRY_OFFSET(gc_reg_list_10_1[i]));
+ adev->gfx.ipdump_core[i] =
+RREG32(SOC15_REG_ENTRY_OFFSET(gc_reg_list_10_1[i]));
   amdgpu_gfx_off_ctrl(adev, true);
  }

--
2.34.1


Re: [PATCH v1 3/4] drm/amdgpu: add compute registers in ip dump for gfx10

2024-05-03 Thread Khatri, Sunil



On 5/3/2024 9:52 PM, Alex Deucher wrote:

On Fri, May 3, 2024 at 12:09 PM Khatri, Sunil  wrote:


On 5/3/2024 9:18 PM, Khatri, Sunil wrote:

On 5/3/2024 8:52 PM, Alex Deucher wrote:

On Fri, May 3, 2024 at 4:45 AM Sunil Khatri 
wrote:

Add compute registers to the set of registers dumped during the ip dump
for gfx10.

Signed-off-by: Sunil Khatri 
---
>    drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 42 +-
   1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 953df202953a..00c7a842ea3b 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -378,7 +378,47 @@ static const struct amdgpu_hwip_reg_entry
gc_reg_list_10_1[] = {
  SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE0),
  SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE1),
  SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE2),
-   SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE3)
+   SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE3),
+   /* compute registers */
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_VMID),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PERSISTENT_STATE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PIPE_PRIORITY),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_QUEUE_PRIORITY),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_QUANTUM),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_BASE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_BASE_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_POLL_ADDR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_POLL_ADDR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_DOORBELL_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_BASE_ADDR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_BASE_ADDR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_REQUEST),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_BASE_ADDR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_BASE_ADDR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_WPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_EVENTS),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_BASE_ADDR_LO),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_BASE_ADDR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CNTL_STACK_OFFSET),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CNTL_STACK_SIZE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_WG_STATE_OFFSET),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_SIZE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_GDS_RESOURCE_STATE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_ERROR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_WPTR_MEM),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_LO),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_CNTL_STACK_OFFSET),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_CNTL_STACK_DW_CNT),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_WG_STATE_OFFSET),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_STATUS)

The registers in patches 3 and 4 are multi-instance, so we should
ideally print every instance of them rather than just one.  Use
nv_grbm_select() to select the pipes and queues.  Make sure to protect
access using the adev->srbm_mutex mutex.

E.g., for the compute registers (patch 3):
    mutex_lock(&adev->srbm_mutex);
    for (i = 0; i < adev->gfx.mec.num_mec; ++i) {
            for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
                    for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
                            drm_printf("mec %d, pipe %d, queue %d\n", i, j, k);
                            nv_grbm_select(adev, i, j, k, 0);
                            for (reg = 0; reg < ARRAY_SIZE(compute_regs); reg++)
                                    drm_printf(...RREG(compute_regs[reg]));
                    }
            }
    }
    nv_grbm_select(adev, 0, 0, 0, 0);
    mutex_unlock(&adev->srbm_mutex);

For gfx registers (patch 4):

    mutex_lock(&adev->srbm_mutex);
    for (i = 0; i < adev->gfx.me.num_me; ++i) {
            for (j = 0; j < adev->gfx.me.num_pipe_per_me; j++) {
                    for (k = 0; k < adev->gfx.me.num_queue_per_pipe; k++) {
                            drm_printf("me %d, pipe %d, queue %d\n", i, j, k);
                            nv_grbm_select(adev, i, j, k, 0);
                            for (reg = 0; reg < ARRAY_SIZE(gfx_regs); reg++)
                                    drm_printf(...RREG(gfx_regs[reg]));

Re: [PATCH v1 3/4] drm/amdgpu: add compute registers in ip dump for gfx10

2024-05-03 Thread Khatri, Sunil



On 5/3/2024 9:18 PM, Khatri, Sunil wrote:


On 5/3/2024 8:52 PM, Alex Deucher wrote:
On Fri, May 3, 2024 at 4:45 AM Sunil Khatri  
wrote:

Add compute registers to the set of registers dumped during the ip dump
for gfx10.

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 42 +-

  1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c

index 953df202953a..00c7a842ea3b 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -378,7 +378,47 @@ static const struct amdgpu_hwip_reg_entry 
gc_reg_list_10_1[] = {

 SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE0),
 SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE1),
 SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE2),
-   SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE3)
+   SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE3),
+   /* compute registers */
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_VMID),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PERSISTENT_STATE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PIPE_PRIORITY),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_QUEUE_PRIORITY),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_QUANTUM),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_BASE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_BASE_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_POLL_ADDR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_POLL_ADDR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_DOORBELL_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_BASE_ADDR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_BASE_ADDR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_REQUEST),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_BASE_ADDR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_BASE_ADDR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_WPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_EVENTS),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_BASE_ADDR_LO),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_BASE_ADDR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CNTL_STACK_OFFSET),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CNTL_STACK_SIZE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_WG_STATE_OFFSET),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_SIZE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_GDS_RESOURCE_STATE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_ERROR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_WPTR_MEM),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_LO),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_CNTL_STACK_OFFSET),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_CNTL_STACK_DW_CNT),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_WG_STATE_OFFSET),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_STATUS)

The registers in patches 3 and 4 are multi-instance, so we should
ideally print every instance of them rather than just one.  Use
nv_grbm_select() to select the pipes and queues.  Make sure to protect
access using the adev->srbm_mutex mutex.

E.g., for the compute registers (patch 3):
    mutex_lock(&adev->srbm_mutex);
    for (i = 0; i < adev->gfx.mec.num_mec; ++i) {
            for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
                    for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
                            drm_printf("mec %d, pipe %d, queue %d\n", i, j, k);
                            nv_grbm_select(adev, i, j, k, 0);
                            for (reg = 0; reg < ARRAY_SIZE(compute_regs); reg++)
                                    drm_printf(...RREG(compute_regs[reg]));
                    }
            }
    }
    nv_grbm_select(adev, 0, 0, 0, 0);
    mutex_unlock(&adev->srbm_mutex);

For gfx registers (patch 4):

    mutex_lock(&adev->srbm_mutex);
    for (i = 0; i < adev->gfx.me.num_me; ++i) {
            for (j = 0; j < adev->gfx.me.num_pipe_per_me; j++) {
                    for (k = 0; k < adev->gfx.me.num_queue_per_pipe; k++) {
                            drm_printf("me %d, pipe %d, queue %d\n", i, j, k);
                            nv_grbm_select(adev, i, j, k, 0);
                            for (reg = 0; reg < ARRAY_SIZE(gfx_regs); reg++)
                                    drm_printf(...RREG(gfx_regs[reg]));
I see one problem here, we dump the registers in memory allocated first 
and read before and store and the

Re: [PATCH v1 3/4] drm/amdgpu: add compute registers in ip dump for gfx10

2024-05-03 Thread Khatri, Sunil



On 5/3/2024 8:52 PM, Alex Deucher wrote:

On Fri, May 3, 2024 at 4:45 AM Sunil Khatri  wrote:

Add compute registers to the set of registers dumped during the ip dump
for gfx10.

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 42 +-
  1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 953df202953a..00c7a842ea3b 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -378,7 +378,47 @@ static const struct amdgpu_hwip_reg_entry 
gc_reg_list_10_1[] = {
 SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE0),
 SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE1),
 SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE2),
-   SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE3)
+   SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE3),
+   /* compute registers */
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_VMID),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PERSISTENT_STATE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PIPE_PRIORITY),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_QUEUE_PRIORITY),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_QUANTUM),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_BASE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_BASE_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_POLL_ADDR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_POLL_ADDR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_DOORBELL_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_BASE_ADDR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_BASE_ADDR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_REQUEST),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_BASE_ADDR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_BASE_ADDR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_WPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_EVENTS),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_BASE_ADDR_LO),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_BASE_ADDR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_CONTROL),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CNTL_STACK_OFFSET),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CNTL_STACK_SIZE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_WG_STATE_OFFSET),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_SIZE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_GDS_RESOURCE_STATE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_ERROR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_WPTR_MEM),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_LO),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_CNTL_STACK_OFFSET),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_CNTL_STACK_DW_CNT),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_WG_STATE_OFFSET),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_STATUS)

The registers in patches 3 and 4 are multi-instance, so we should
ideally print every instance of them rather than just one.  Use
nv_grbm_select() to select the pipes and queues.  Make sure to protect
access using the adev->srbm_mutex mutex.

E.g., for the compute registers (patch 3):
    mutex_lock(&adev->srbm_mutex);
    for (i = 0; i < adev->gfx.mec.num_mec; ++i) {
            for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
                    for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
                            drm_printf("mec %d, pipe %d, queue %d\n", i, j, k);
                            nv_grbm_select(adev, i, j, k, 0);
                            for (reg = 0; reg < ARRAY_SIZE(compute_regs); reg++)
                                    drm_printf(...RREG(compute_regs[reg]));
                    }
            }
    }
    nv_grbm_select(adev, 0, 0, 0, 0);
    mutex_unlock(&adev->srbm_mutex);

For gfx registers (patch 4):

    mutex_lock(&adev->srbm_mutex);
    for (i = 0; i < adev->gfx.me.num_me; ++i) {
            for (j = 0; j < adev->gfx.me.num_pipe_per_me; j++) {
                    for (k = 0; k < adev->gfx.me.num_queue_per_pipe; k++) {
                            drm_printf("me %d, pipe %d, queue %d\n", i, j, k);
                            nv_grbm_select(adev, i, j, k, 0);
                            for (reg = 0; reg < ARRAY_SIZE(gfx_regs); reg++)
                                    drm_printf(...RREG(gfx_regs[reg]));
                    }
            }
    }
    nv_grbm_select(adev, 0, 0, 0, 0);
    mutex_unlock(&adev->srbm_mutex);


Thanks for pointing that out and suggesting the sample code of how it
should be. Will take care of this in the next revision.
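One way to lay out the per-instance snapshots Alex describes is a flat buffer indexed by (me, pipe, queue, reg); a standalone sketch of that indexing (counts are illustrative and the helper name is hypothetical, not driver code):

#include <stdio.h>
#include <stdint.h>

/* Flat index of register `reg` for instance (me, pipe, queue), matching the
 * nested loop order in the sketches above. */
static uint32_t dump_index(uint32_t me, uint32_t pipe, uint32_t queue,
                           uint32_t reg, uint32_t num_pipe, uint32_t num_queue,
                           uint32_t reg_count)
{
    return ((me * num_pipe + pipe) * num_queue + queue) * reg_count + reg;
}

int main(void)
{
    uint32_t num_me = 1, num_pipe = 2, num_queue = 2, reg_count = 24;
    uint32_t i, j, k;

    for (i = 0; i < num_me; i++)
        for (j = 0; j < num_pipe; j++)
            for (k = 0; k < num_queue; k++)
                printf("me %u pipe %u queue %u -> base index %u\n", i, j, k,
                       dump_index(i, j, k, 0, num_pipe, num_queue, reg_count));
    return 0;
}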

Re: [PATCH] drm/amdgpu: skip ip dump if devcoredump flag is set

2024-04-25 Thread Khatri, Sunil


On 4/25/2024 7:43 PM, Lazar, Lijo wrote:


On 4/25/2024 3:53 PM, Sunil Khatri wrote:

Do not dump the IP registers during driver reload in a passthrough
environment.

Signed-off-by: Sunil Khatri
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++
  1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 869256394136..b50758482530 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5372,10 +5372,12 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,
amdgpu_reset_reg_dumps(tmp_adev);

Probably not related, can the above step be clubbed with what's being
done below? Or, can we move all such to start with amdgpu_reset_dump_*?

Sure Lijo

I will club both dump_ip_state and amdgpu_reset_reg_dumps under one if
condition in the patch I push.


Regards Sunil

  
  	/* Trigger ip dump before we reset the asic */

-   for (i = 0; i < tmp_adev->num_ip_blocks; i++)
-   if (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
-   tmp_adev->ip_blocks[i].version->funcs->dump_ip_state(
-   (void *)tmp_adev);
+   if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
+   for (i = 0; i < tmp_adev->num_ip_blocks; i++)
+   if (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
+   tmp_adev->ip_blocks[i].version->funcs
+   ->dump_ip_state((void *)tmp_adev);
+   }


Anyway,

Reviewed-by: Lijo Lazar

Thanks,
Lijo
  
  	reset_context->reset_device_list = device_list_handle;

r = amdgpu_reset_perform_reset(tmp_adev, reset_context);
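For reference, a hedged sketch of the "clubbed" guard Sunil describes above (shape only, stitched from the hunks in this thread, not the final upstream code):

	if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
		amdgpu_reset_reg_dumps(tmp_adev);

		/* Trigger ip dump before we reset the asic */
		for (i = 0; i < tmp_adev->num_ip_blocks; i++)
			if (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
				tmp_adev->ip_blocks[i].version->funcs->dump_ip_state(
					(void *)tmp_adev);
	}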

Re: [PATCH v5 2/6] drm/amdgpu: add support of gfx10 register dump

2024-04-17 Thread Khatri, Sunil



On 4/17/2024 10:21 PM, Alex Deucher wrote:

On Wed, Apr 17, 2024 at 12:24 PM Lazar, Lijo  wrote:

[AMD Official Use Only - General]

Yes, right now that API doesn't return anything. What I meant is to add that
check as well, since the coredump API is essentially used in hang situations.

In older generations, accessing registers while in GFXOFF resulted in a system
hang (basically it won't go beyond that point). If that happens, then the purpose of
the patch - to get the context of a device hang - is lost. We may not even get
a proper dmesg log.

Maybe add a call to amdgpu_get_gfx_off_status(), but unfortunately,
it's not implemented on every chip yet.
So we need both things: disable gfx_off, then check the status, then read the
registers, and enable gfx_off again.


    amdgpu_gfx_off_ctrl(adev, false);
    r = amdgpu_get_gfx_off_status(...);
    if (!r) {
            for (i = 0; i < reg_count; i++)
                    adev->gfx.ip_dump[i] =
                            RREG32(SOC15_REG_ENTRY_OFFSET(gc_reg_list_10_1[i]));
    }
    amdgpu_gfx_off_ctrl(adev, true);

Sunil



Alex


Thanks,
Lijo
-Original Message-
From: Khatri, Sunil 
Sent: Wednesday, April 17, 2024 9:42 PM
To: Lazar, Lijo ; Alex Deucher ; Khatri, 
Sunil 
Cc: Deucher, Alexander ; Koenig, Christian 
; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH v5 2/6] drm/amdgpu: add support of gfx10 register dump


On 4/17/2024 9:31 PM, Lazar, Lijo wrote:

On 4/17/2024 9:21 PM, Alex Deucher wrote:

On Wed, Apr 17, 2024 at 5:38 AM Sunil Khatri  wrote:

Adding gfx10 gc registers to be used for register dump via
devcoredump during a gpu reset.

Signed-off-by: Sunil Khatri 

Reviewed-by: Alex Deucher 


---
   drivers/gpu/drm/amd/amdgpu/amdgpu.h   |   8 ++
   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h   |   4 +
   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c| 130 +-
   drivers/gpu/drm/amd/amdgpu/soc15.h|   2 +
   .../include/asic_reg/gc/gc_10_1_0_offset.h|  12 ++
   5 files changed, 155 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index e0d7f4ee7e16..cac0ca64367b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -139,6 +139,14 @@ enum amdgpu_ss {
  AMDGPU_SS_DRV_UNLOAD
   };

+struct amdgpu_hwip_reg_entry {
+   u32 hwip;
+   u32 inst;
+   u32 seg;
+   u32 reg_offset;
+   const char  *reg_name;
+};
+
   struct amdgpu_watchdog_timer {
  bool timeout_fatal_disable;
  uint32_t period; /* maxCycles = (1 << period), the number of cycles before a timeout */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index 04a86dff71e6..64f197bbc866 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -433,6 +433,10 @@ struct amdgpu_gfx {
  uint32_t num_xcc_per_xcp;
  struct mutex partition_mutex;
  bool mcbp; /* mid command buffer preemption */
+
+   /* IP reg dump */
+   uint32_t*ip_dump;
+   uint32_treg_count;
   };

   struct amdgpu_gfx_ras_reg_entry {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index a0bc4196ff8b..4a54161f4837 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -276,6 +276,99 @@ MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec.bin");
   MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec2.bin");
   MODULE_FIRMWARE("amdgpu/gc_10_3_7_rlc.bin");

+static const struct amdgpu_hwip_reg_entry gc_reg_list_10_1[] = {
+   SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS),
+   SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS2),
+   SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS3),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_STALLED_STAT1),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_STALLED_STAT2),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_STALLED_STAT1),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_STALLED_STAT1),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_BUSY_STAT),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_BUSY_STAT),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_BUSY_STAT),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_BUSY_STAT2),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_BUSY_STAT2),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_STATUS),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_ERROR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HPD_STATUS0),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_BASE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_WPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_BASE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_WPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_BASE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP

Re: [PATCH v5 2/6] drm/amdgpu: add support of gfx10 register dump

2024-04-17 Thread Khatri, Sunil



On 4/17/2024 9:31 PM, Lazar, Lijo wrote:


On 4/17/2024 9:21 PM, Alex Deucher wrote:

On Wed, Apr 17, 2024 at 5:38 AM Sunil Khatri  wrote:

Adding gfx10 gc registers to be used for register
dump via devcoredump during a gpu reset.

Signed-off-by: Sunil Khatri 

Reviewed-by: Alex Deucher 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h   |   8 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h   |   4 +
  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c| 130 +-
  drivers/gpu/drm/amd/amdgpu/soc15.h|   2 +
  .../include/asic_reg/gc/gc_10_1_0_offset.h|  12 ++
  5 files changed, 155 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index e0d7f4ee7e16..cac0ca64367b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -139,6 +139,14 @@ enum amdgpu_ss {
 AMDGPU_SS_DRV_UNLOAD
  };

+struct amdgpu_hwip_reg_entry {
+   u32 hwip;
+   u32 inst;
+   u32 seg;
+   u32 reg_offset;
+   const char  *reg_name;
+};
+
  struct amdgpu_watchdog_timer {
 bool timeout_fatal_disable;
  uint32_t period; /* maxCycles = (1 << period), the number of cycles before a timeout */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index 04a86dff71e6..64f197bbc866 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -433,6 +433,10 @@ struct amdgpu_gfx {
  uint32_t num_xcc_per_xcp;
  struct mutex partition_mutex;
  bool mcbp; /* mid command buffer preemption */
+
+   /* IP reg dump */
+   uint32_t*ip_dump;
+   uint32_treg_count;
  };

  struct amdgpu_gfx_ras_reg_entry {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index a0bc4196ff8b..4a54161f4837 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -276,6 +276,99 @@ MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec.bin");
  MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec2.bin");
  MODULE_FIRMWARE("amdgpu/gc_10_3_7_rlc.bin");

+static const struct amdgpu_hwip_reg_entry gc_reg_list_10_1[] = {
+   SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS),
+   SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS2),
+   SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS3),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_STALLED_STAT1),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_STALLED_STAT2),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_STALLED_STAT1),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_STALLED_STAT1),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_BUSY_STAT),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_BUSY_STAT),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_BUSY_STAT),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_BUSY_STAT2),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_BUSY_STAT2),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_STATUS),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_ERROR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HPD_STATUS0),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_BASE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_WPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_BASE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_WPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_BASE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_RPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_WPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB2_BASE),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB2_WPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB2_WPTR),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_CMD_BUFSZ),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_CMD_BUFSZ),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_CMD_BUFSZ),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_CMD_BUFSZ),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_BASE_LO),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_BASE_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_BUFSZ),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_BASE_LO),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_BASE_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_BUFSZ),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_BASE_LO),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_BASE_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_BUFSZ),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_BASE_LO),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_BASE_HI),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_BUFSZ),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCPF_UTCL1_STATUS),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCPC_UTCL1_STATUS),
+   SOC15_REG_ENTRY_STR(GC, 0, mmCPG_UTCL1_STATUS),
+   SOC15_REG_ENTRY_STR(GC, 0, mmGDS_PROTECTION_FAULT),
+   SOC15_REG_ENTRY_STR(GC, 0, 

Re: [PATCH v4 2/6] drm/amdgpu: add support of gfx10 register dump

2024-04-17 Thread Khatri, Sunil



On 4/17/2024 2:15 PM, Christian König wrote:



Am 17.04.24 um 10:18 schrieb Sunil Khatri:

Adding gfx10 gc registers to be used for register
dump via devcoredump during a gpu reset.

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h   |   8 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h   |   4 +
  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c    | 130 +-
  drivers/gpu/drm/amd/amdgpu/soc15.h    |   2 +
  .../include/asic_reg/gc/gc_10_1_0_offset.h    |  12 ++
  5 files changed, 155 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

index e0d7f4ee7e16..210af65a744c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -139,6 +139,14 @@ enum amdgpu_ss {
  AMDGPU_SS_DRV_UNLOAD
  };
  +struct amdgpu_hwip_reg_entry {
+    u32    hwip;
+    u32    inst;
+    u32    seg;
+    u32    reg_offset;



+    char    reg_name[50];


Make that a const char *. Otherwise it bloats up the final binary 
because the compiler has to add zeros at the end.

Noted.



+};
+
  struct amdgpu_watchdog_timer {
  bool timeout_fatal_disable;
  uint32_t period; /* maxCycles = (1 << period), the number of 
cycles before a timeout */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h

index 04a86dff71e6..64f197bbc866 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -433,6 +433,10 @@ struct amdgpu_gfx {
  uint32_t    num_xcc_per_xcp;
  struct mutex    partition_mutex;
  bool    mcbp; /* mid command buffer preemption */
+
+    /* IP reg dump */
+    uint32_t    *ip_dump;
+    uint32_t    reg_count;
  };
    struct amdgpu_gfx_ras_reg_entry {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c

index a0bc4196ff8b..4a54161f4837 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -276,6 +276,99 @@ MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec.bin");
  MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec2.bin");
  MODULE_FIRMWARE("amdgpu/gc_10_3_7_rlc.bin");
  +static const struct amdgpu_hwip_reg_entry gc_reg_list_10_1[] = {
+    SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS),
+    SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS2),
+    SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS3),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_STALLED_STAT1),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_STALLED_STAT2),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_STALLED_STAT1),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_STALLED_STAT1),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_BUSY_STAT),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_BUSY_STAT),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_BUSY_STAT),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_BUSY_STAT2),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_BUSY_STAT2),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_STATUS),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_ERROR),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HPD_STATUS0),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_BASE),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_RPTR),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_WPTR),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_BASE),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_RPTR),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_WPTR),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_BASE),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_RPTR),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_WPTR),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB2_BASE),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB2_WPTR),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB2_WPTR),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_CMD_BUFSZ),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_CMD_BUFSZ),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_CMD_BUFSZ),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_CMD_BUFSZ),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_BASE_LO),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_BASE_HI),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_BUFSZ),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_BASE_LO),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_BASE_HI),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_BUFSZ),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_BASE_LO),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_BASE_HI),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_BUFSZ),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_BASE_LO),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_BASE_HI),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_BUFSZ),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCPF_UTCL1_STATUS),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCPC_UTCL1_STATUS),
+    SOC15_REG_ENTRY_STR(GC, 0, mmCPG_UTCL1_STATUS),
+    SOC15_REG_ENTRY_STR(GC, 0, mmGDS_PROTECTION_FAULT),
+    SOC15_REG_ENTRY_STR(GC, 0, mmGDS_VM_PROTECTION_FAULT),
+    SOC15_REG_ENTRY_STR(GC, 0, mmIA_UTCL1_STATUS),
+    SOC15_REG_ENTRY_STR(GC, 0, mmIA_UTCL1_STATUS_2),
+    SOC15_REG_ENTRY_STR(GC, 0, mmPA_CL_CNTL_STATUS),
+    SOC15_REG_ENTRY_STR(GC, 0, 

Re: [PATCH v2] drm/amdgpu: Skip the coredump collection on reset during driver reload

2024-04-17 Thread Khatri, Sunil



On 4/17/2024 1:19 PM, Lazar, Lijo wrote:


On 4/17/2024 1:14 PM, Khatri, Sunil wrote:

On 4/17/2024 1:06 PM, Khatri, Sunil wrote:

devcoredump is used to debug gpu hangs/resets. So in the normal process,
when there is a hang due to a ring timeout or page fault, we do a hard
reset, as soft reset fails in those cases. How are we making sure that the
devcoredump is triggered and captured in those cases?

Regards
Sunil Khatri

On 4/17/2024 9:43 AM, Ahmad Rehman wrote:

In a passthrough environment, the driver triggers the mode-1 reset on
reload. The reset causes the core dump collection, which is a delayed task
and prevents the driver from unloading until it is completed. Since we do
not need to collect data in the "reset on reload" case, we can skip core
dump collection.

v2: Use the same flag to avoid calling amdgpu_reset_reg_dumps as well.

Signed-off-by: Ahmad Rehman 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +--
   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 1 +
   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 1 +
   3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1b2e177bc2d6..c718982cffa8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5357,7 +5357,9 @@ int amdgpu_do_asic_reset(struct list_head
*device_list_handle,
   /* Try reset handler method first */
   tmp_adev = list_first_entry(device_list_handle, struct
amdgpu_device,
   reset_list);
-    amdgpu_reset_reg_dumps(tmp_adev);
+
+    if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))
+    amdgpu_reset_reg_dumps(tmp_adev);
     reset_context->reset_device_list = device_list_handle;
   r = amdgpu_reset_perform_reset(tmp_adev, reset_context);
@@ -5430,7 +5432,8 @@ int amdgpu_do_asic_reset(struct list_head
*device_list_handle,
     vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
   -    amdgpu_coredump(tmp_adev, vram_lost, reset_context);
+    if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))
+    amdgpu_coredump(tmp_adev, vram_lost, reset_context);
     if (vram_lost) {
   DRM_INFO("VRAM is lost due to GPU reset!\n");
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 6ea893ad9a36..c512f70b8272 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2481,6 +2481,7 @@ static void
amdgpu_drv_delayed_reset_work_handler(struct work_struct *work)
     /* Use a common context, just need to make sure full reset is
done */
   set_bit(AMDGPU_SKIP_HW_RESET, _context.flags);
+    set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);

If this is used for guests only, can we instead key off a check like
amdgpu_sriov_vf() for setting the skip-coredump flag?


A reset is not always triggered just because of a hang. There are other
cases, like doing a reset after a suspend/resume cycle so that the device
starts from a clean state; those are intentionally triggered by the driver.
There are also cases like RAS errors where we reset, and those also don't
really need a core dump. In all such cases this flag is required, and this is
one such case (this patch only addresses passthrough).



Thanks Lijo
Able to verify that in normal hangs dump is working.

Regards
Sunil



Thanks,
Lijo


Regards
Sunil khatri


   r = amdgpu_do_asic_reset(&device_list, &reset_context);
     if (r) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index 66125d43cf21..b11d190ece53 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -32,6 +32,7 @@ enum AMDGPU_RESET_FLAGS {
     AMDGPU_NEED_FULL_RESET = 0,
   AMDGPU_SKIP_HW_RESET = 1,
+    AMDGPU_SKIP_COREDUMP = 2,
   };
     struct amdgpu_reset_context {


Re: [PATCH v2] drm/amdgpu: Skip the coredump collection on reset during driver reload

2024-04-17 Thread Khatri, Sunil



On 4/17/2024 1:06 PM, Khatri, Sunil wrote:
devcoredump is used to debug gpu hangs/resets. So in the normal process,
when there is a hang due to a ring timeout or page fault, we do a hard
reset, as soft reset fails in those cases. How are we making sure that the
devcoredump is triggered and captured in those cases?


Regards
Sunil Khatri

On 4/17/2024 9:43 AM, Ahmad Rehman wrote:

In a passthrough environment, the driver triggers the mode-1 reset on
reload. The reset causes the core dump collection, which is a delayed task
and prevents the driver from unloading until it is completed. Since we do
not need to collect data in the "reset on reload" case, we can skip core
dump collection.

v2: Use the same flag to avoid calling amdgpu_reset_reg_dumps as well.

Signed-off-by: Ahmad Rehman 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +--
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 1 +
  3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 1b2e177bc2d6..c718982cffa8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5357,7 +5357,9 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,

  /* Try reset handler method first */
  tmp_adev = list_first_entry(device_list_handle, struct 
amdgpu_device,

  reset_list);
-    amdgpu_reset_reg_dumps(tmp_adev);
+
+    if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))
+    amdgpu_reset_reg_dumps(tmp_adev);
    reset_context->reset_device_list = device_list_handle;
  r = amdgpu_reset_perform_reset(tmp_adev, reset_context);
@@ -5430,7 +5432,8 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,

    vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
  -    amdgpu_coredump(tmp_adev, vram_lost, reset_context);
+    if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))
+    amdgpu_coredump(tmp_adev, vram_lost, reset_context);

    if (vram_lost) {
  DRM_INFO("VRAM is lost due to GPU reset!\n");
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c

index 6ea893ad9a36..c512f70b8272 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2481,6 +2481,7 @@ static void 
amdgpu_drv_delayed_reset_work_handler(struct work_struct *work)
    /* Use a common context, just need to make sure full reset is 
done */

  set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
+    set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
If this is used only for guests, wouldn't it be better to gate the
skip-coredump flag on a check like amdgpu_sriov_vf()?
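
Something like the below is what I have in mind -- just a sketch, assuming
the skip really is wanted for the VF/passthrough case only, with
amdgpu_sriov_vf() being the existing VF check:

	/* Hedged sketch: gate the skip on the guest check instead of
	 * setting it unconditionally in the delayed reset handler. */
	if (amdgpu_sriov_vf(adev))
		set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);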


Regards
Sunil khatri


  r = amdgpu_do_asic_reset(&device_list, &reset_context);
    if (r) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h

index 66125d43cf21..b11d190ece53 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -32,6 +32,7 @@ enum AMDGPU_RESET_FLAGS {
    AMDGPU_NEED_FULL_RESET = 0,
  AMDGPU_SKIP_HW_RESET = 1,
+    AMDGPU_SKIP_COREDUMP = 2,
  };
    struct amdgpu_reset_context {


Re: [PATCH v2] drm/amdgpu: Skip the coredump collection on reset during driver reload

2024-04-17 Thread Khatri, Sunil
devcoredump is used to debug GPU hangs/resets. In the normal flow, when 
there is a hang due to a ring timeout or a page fault, we do a hard reset 
because soft reset fails in those cases. How are we making sure that the 
devcoredump is still triggered and captured in those cases?


Regards
Sunil Khatri

On 4/17/2024 9:43 AM, Ahmad Rehman wrote:

In passthrough environment, the driver triggers the mode-1 reset on
reload. The reset causes the core dump collection which is delayed task
and prevents driver from unloading until it is completed. Since we do
not need to collect data on "reset on reload" case, we can skip core
dump collection.

v2: Use the same flag to avoid calling amdgpu_reset_reg_dumps as well.

Signed-off-by: Ahmad Rehman 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +--
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 1 +
  3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1b2e177bc2d6..c718982cffa8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5357,7 +5357,9 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,
/* Try reset handler method first */
tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
reset_list);
-   amdgpu_reset_reg_dumps(tmp_adev);
+   
+   if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))
+   amdgpu_reset_reg_dumps(tmp_adev);
  
  	reset_context->reset_device_list = device_list_handle;

r = amdgpu_reset_perform_reset(tmp_adev, reset_context);
@@ -5430,7 +5432,8 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,
  
  vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
  
-amdgpu_coredump(tmp_adev, vram_lost, reset_context);

+   if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))
+   amdgpu_coredump(tmp_adev, vram_lost, reset_context);
  
  if (vram_lost) {

DRM_INFO("VRAM is lost due to GPU 
reset!\n");
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 6ea893ad9a36..c512f70b8272 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2481,6 +2481,7 @@ static void amdgpu_drv_delayed_reset_work_handler(struct 
work_struct *work)
  
  	/* Use a common context, just need to make sure full reset is done */

set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
+   set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
r = amdgpu_do_asic_reset(&device_list, &reset_context);
  
  	if (r) {

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index 66125d43cf21..b11d190ece53 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -32,6 +32,7 @@ enum AMDGPU_RESET_FLAGS {
  
  	AMDGPU_NEED_FULL_RESET = 0,

AMDGPU_SKIP_HW_RESET = 1,
+   AMDGPU_SKIP_COREDUMP = 2,
  };
  
  struct amdgpu_reset_context {


Re: [PATCH 6/6] drm/amdgpu: add ip dump for each ip in devcoredump

2024-04-16 Thread Khatri, Sunil



On 4/16/2024 7:29 PM, Alex Deucher wrote:

On Tue, Apr 16, 2024 at 8:08 AM Sunil Khatri  wrote:

Add ip dump for each ip of the asic in the
devcoredump for all the ips where a callback
is registered for register dump.

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 15 +++
  1 file changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
index 64fe564b8036..70167f63b4f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
@@ -262,6 +262,21 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
 drm_printf(, "Faulty page starting at address: 0x%016llx\n", 
fault_info->addr);
 drm_printf(, "Protection fault status register: 0x%x\n\n", 
fault_info->status);

+   /* dump the ip state for each ip */
+   drm_printf(, "Register Dump\n");
+   for (int i = 0; i < coredump->adev->num_ip_blocks; i++) {
+   if 
(coredump->adev->ip_blocks[i].version->funcs->print_ip_state) {
+   drm_printf(, "IP: %s\n",
+  coredump->adev->ip_blocks[i]
+  .version->funcs->name);
+   drm_printf(, "Offset \t Value\n");

I think we can drop the drm_printf line above if we use register names
rather than offsets in the print functions.  This also allows IPs to
dump stuff besides registers if they want.


Noted

Sunil



Alex


+   coredump->adev->ip_blocks[i]
+   .version->funcs->print_ip_state(
+   (void *)coredump->adev, &p);
+   drm_printf(&p, "\n");
+   }
+   }
+
 /* Add ring buffer information */
 drm_printf(, "Ring buffer information\n");
 for (int i = 0; i < coredump->adev->num_rings; i++) {
--
2.34.1



Re: [PATCH 4/6] drm/amdgpu: add support for gfx v10 print

2024-04-16 Thread Khatri, Sunil


On 4/16/2024 7:27 PM, Alex Deucher wrote:

On Tue, Apr 16, 2024 at 8:08 AM Sunil Khatri  wrote:

Add support to print ip information to be
used to print registers in devcoredump
buffer.

Signed-off-by: Sunil Khatri
---
  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 17 -
  1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 822bee932041..a7c2a3ddd613 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -9268,6 +9268,21 @@ static void gfx_v10_0_emit_mem_sync(struct amdgpu_ring 
*ring)
 amdgpu_ring_write(ring, gcr_cntl); /* GCR_CNTL */
  }

+static void gfx_v10_ip_print(void *handle, struct drm_printer *p)
+{
+   struct amdgpu_device *adev = (struct amdgpu_device *)handle;
+   uint32_t i;
+   uint32_t reg_count = ARRAY_SIZE(gc_reg_list_10_1);
+
+   if (!adev->gfx.ip_dump)
+   return;
+
+   for (i = 0; i < reg_count; i++)
+   drm_printf(p, "0x%04x \t 0x%08x\n",
+  adev->gfx.ip_dump[i].offset,

Print the name of the register rather than the offset here to make it
output easier to read.  See my comments from patch 2.


Is just the register name and value fine, or do we need the offset too?

Also, I am assuming that stringifying the macro is good enough?
e.g.:

#define mmGRBM_STATUS 0x0da4

So printing the register name exactly as mmGRBM_STATUS is acceptable? We 
don't need to strip the mm prefix, as removing it makes things complicated.
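
A rough sketch of what I mean, assuming a hypothetical SOC15_REG_ENTRY_STR
variant of the existing macro that records #reg alongside the offset (the
names here are illustrative, not the final patch):

	/* Hedged sketch: capture the register name as a string at
	 * table-definition time via the preprocessor stringify operator. */
	struct hwip_reg_entry {
		u32 hwip;
		u32 inst;
		u32 seg;
		u32 reg_offset;
		const char *reg_name;	/* e.g. "mmGRBM_STATUS" */
	};

	#define SOC15_REG_ENTRY_STR(ip, inst, reg) \
		{ ip##_HWIP, inst, reg##_BASE_IDX, reg, #reg }

	/* print side, name instead of (or next to) the offset: */
	drm_printf(p, "%-40s \t 0x%08x\n",
		   gc_reg_list_10_1[i].reg_name, adev->gfx.ip_dump[i].value);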



+  adev->gfx.ip_dump[i].value);
+}
+
  static void gfx_v10_ip_dump(void *handle)
  {
 struct amdgpu_device *adev = (struct amdgpu_device *)handle;
@@ -9300,7 +9315,7 @@ static const struct amd_ip_funcs gfx_v10_0_ip_funcs = {
 .set_powergating_state = gfx_v10_0_set_powergating_state,
 .get_clockgating_state = gfx_v10_0_get_clockgating_state,
 .dump_ip_state = gfx_v10_ip_dump,
-   .print_ip_state = NULL,
+   .print_ip_state = gfx_v10_ip_print,
  };

  static const struct amdgpu_ring_funcs gfx_v10_0_ring_funcs_gfx = {
--
2.34.1


Re: [PATCH 2/6] drm/amdgpu: add support of gfx10 register dump

2024-04-16 Thread Khatri, Sunil


On 4/16/2024 7:30 PM, Christian König wrote:

Am 16.04.24 um 15:55 schrieb Alex Deucher:
On Tue, Apr 16, 2024 at 8:08 AM Sunil Khatri  
wrote:

Adding gfx10 gc registers to be used for register
dump via devcoredump during a gpu reset.

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  12 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h   |   4 +
  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c    | 131 
+-

  .../include/asic_reg/gc/gc_10_1_0_offset.h    |  12 ++
  4 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

index e0d7f4ee7e16..e016ac33629d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -139,6 +139,18 @@ enum amdgpu_ss {
 AMDGPU_SS_DRV_UNLOAD
  };

+struct hwip_reg_entry {
+   u32 hwip;
+   u32 inst;
+   u32 seg;
+   u32 reg_offset;
+};
+
+struct reg_pair {
+   u32 offset;
+   u32 value;
+};
+
  struct amdgpu_watchdog_timer {
 bool timeout_fatal_disable;
 uint32_t period; /* maxCycles = (1 << period), the number 
of cycles before a timeout */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h

index 04a86dff71e6..295a2c8d2e48 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -433,6 +433,10 @@ struct amdgpu_gfx {
 uint32_t    num_xcc_per_xcp;
 struct mutex    partition_mutex;
 bool    mcbp; /* mid command buffer 
preemption */

+
+   /* IP reg dump */
+   struct reg_pair *ip_dump;
+   uint32_t    reg_count;
  };

  struct amdgpu_gfx_ras_reg_entry {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c

index a0bc4196ff8b..46e136609ff1 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -276,6 +276,99 @@ MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec.bin");
  MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec2.bin");
  MODULE_FIRMWARE("amdgpu/gc_10_3_7_rlc.bin");

+static const struct hwip_reg_entry gc_reg_list_10_1[] = {
+   { SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS) },
+   { SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS2) },
+   { SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS3) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT1) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT2) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_STALLED_STAT1) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_STALLED_STAT1) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_BUSY_STAT) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_BUSY_STAT) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_BUSY_STAT) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_BUSY_STAT2) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_BUSY_STAT2) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_STATUS) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_GFX_ERROR) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_GFX_HPD_STATUS0) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB_BASE) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB_RPTR) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB_WPTR) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB0_BASE) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB0_RPTR) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB0_WPTR) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB1_BASE) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB1_RPTR) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB1_WPTR) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB2_BASE) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB2_WPTR) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB2_WPTR) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB1_CMD_BUFSZ) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB2_CMD_BUFSZ) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_IB1_CMD_BUFSZ) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_IB2_CMD_BUFSZ) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB1_BASE_LO) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB1_BASE_HI) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB1_BUFSZ) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB2_BASE_LO) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB2_BASE_HI) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB2_BUFSZ) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_IB1_BASE_LO) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_IB1_BASE_HI) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_IB1_BUFSZ) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_IB2_BASE_LO) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_IB2_BASE_HI) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_IB2_BUFSZ) },
+   { SOC15_REG_ENTRY(GC, 0, mmCPF_UTCL1_STATUS) },
+   { SOC15_REG_ENTRY(GC, 0, mmCPC_UTCL1_STATUS) },
+   { SOC15_REG_ENTRY(GC, 0, mmCPG_UTCL1_STATUS) },
+   { SOC15_REG_ENTRY(GC, 0, mmGDS_PROTECTION_FAULT) },
+   { SOC15_REG_ENTRY(GC, 0, mmGDS_VM_PROTECTION_FAULT) },
+   { SOC15_REG_ENTRY(GC, 0, mmIA_UTCL1_STATUS) },
+   { 

Re: [PATCH 2/6] drm/amdgpu: add support of gfx10 register dump

2024-04-16 Thread Khatri, Sunil


On 4/16/2024 7:25 PM, Alex Deucher wrote:

On Tue, Apr 16, 2024 at 8:08 AM Sunil Khatri  wrote:

Adding gfx10 gc registers to be used for register
dump via devcoredump during a gpu reset.

Signed-off-by: Sunil Khatri
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  12 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h   |   4 +
  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c| 131 +-
  .../include/asic_reg/gc/gc_10_1_0_offset.h|  12 ++
  4 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index e0d7f4ee7e16..e016ac33629d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -139,6 +139,18 @@ enum amdgpu_ss {
 AMDGPU_SS_DRV_UNLOAD
  };

+struct hwip_reg_entry {
+   u32 hwip;
+   u32 inst;
+   u32 seg;
+   u32 reg_offset;
+};
+
+struct reg_pair {
+   u32 offset;
+   u32 value;
+};
+
  struct amdgpu_watchdog_timer {
 bool timeout_fatal_disable;
 uint32_t period; /* maxCycles = (1 << period), the number of cycles 
before a timeout */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index 04a86dff71e6..295a2c8d2e48 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -433,6 +433,10 @@ struct amdgpu_gfx {
 uint32_tnum_xcc_per_xcp;
 struct mutexpartition_mutex;
 boolmcbp; /* mid command buffer preemption 
*/
+
+   /* IP reg dump */
+   struct reg_pair *ip_dump;
+   uint32_treg_count;
  };

  struct amdgpu_gfx_ras_reg_entry {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index a0bc4196ff8b..46e136609ff1 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -276,6 +276,99 @@ MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec.bin");
  MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec2.bin");
  MODULE_FIRMWARE("amdgpu/gc_10_3_7_rlc.bin");

+static const struct hwip_reg_entry gc_reg_list_10_1[] = {
+   { SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS) },
+   { SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS2) },
+   { SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS3) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT1) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT2) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_STALLED_STAT1) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_STALLED_STAT1) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_BUSY_STAT) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_BUSY_STAT) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_BUSY_STAT) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_BUSY_STAT2) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_BUSY_STAT2) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_STATUS) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_GFX_ERROR) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_GFX_HPD_STATUS0) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB_BASE) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB_RPTR) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB_WPTR) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB0_BASE) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB0_RPTR) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB0_WPTR) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB1_BASE) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB1_RPTR) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB1_WPTR) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB2_BASE) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB2_WPTR) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_RB2_WPTR) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB1_CMD_BUFSZ) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB2_CMD_BUFSZ) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_IB1_CMD_BUFSZ) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_IB2_CMD_BUFSZ) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB1_BASE_LO) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB1_BASE_HI) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB1_BUFSZ) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB2_BASE_LO) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB2_BASE_HI) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB2_BUFSZ) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_IB1_BASE_LO) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_IB1_BASE_HI) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_IB1_BUFSZ) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_IB2_BASE_LO) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_IB2_BASE_HI) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_IB2_BUFSZ) },
+   { SOC15_REG_ENTRY(GC, 0, mmCPF_UTCL1_STATUS) },
+   { SOC15_REG_ENTRY(GC, 0, mmCPC_UTCL1_STATUS) },
+   { SOC15_REG_ENTRY(GC, 0, mmCPG_UTCL1_STATUS) },
+   { SOC15_REG_ENTRY(GC, 0, mmGDS_PROTECTION_FAULT) },
+   { SOC15_REG_ENTRY(GC, 0, mmGDS_VM_PROTECTION_FAULT) },
+   { SOC15_REG_ENTRY(GC, 0, mmIA_UTCL1_STATUS) },
+   { SOC15_REG_ENTRY(GC, 0, mmIA_UTCL1_STATUS_2) },
+   { 

Re: [PATCH v3 4/5] drm/amdgpu: enable redirection of irq's for IH V6.0

2024-04-16 Thread Khatri, Sunil



On 4/16/2024 7:56 PM, Alex Deucher wrote:

On Tue, Apr 16, 2024 at 9:34 AM Sunil Khatri  wrote:

Enable redirection of page-fault interrupts for specific
clients to avoid ring overflow without dropping interrupts.

So here we redirect these interrupts to another IH ring,
i.e. ring1, where only these interrupts are processed.

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/ih_v6_0.c | 15 +++
  1 file changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c 
b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
index 26dc99232eb6..8869aac03b82 100644
--- a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
@@ -346,6 +346,21 @@ static int ih_v6_0_irq_init(struct amdgpu_device *adev)
 DELAY, 3);
 WREG32_SOC15(OSSSYS, 0, regIH_MSI_STORM_CTRL, tmp);

+   /* Redirect the interrupts to IH RB1 fpr dGPU */

fpr -> for


Sure, I will fix it when pushing the change to the staging branch.

Regards
Sunil khatri



Alex


+   if (adev->irq.ih1.ring_size) {
+   tmp = RREG32_SOC15(OSSSYS, 0, regIH_RING1_CLIENT_CFG_INDEX);
+   tmp = REG_SET_FIELD(tmp, IH_RING1_CLIENT_CFG_INDEX, INDEX, 0);
+   WREG32_SOC15(OSSSYS, 0, regIH_RING1_CLIENT_CFG_INDEX, tmp);
+
+   tmp = RREG32_SOC15(OSSSYS, 0, regIH_RING1_CLIENT_CFG_DATA);
+   tmp = REG_SET_FIELD(tmp, IH_RING1_CLIENT_CFG_DATA, CLIENT_ID, 
0xa);
+   tmp = REG_SET_FIELD(tmp, IH_RING1_CLIENT_CFG_DATA, SOURCE_ID, 
0x0);
+   tmp = REG_SET_FIELD(tmp, IH_RING1_CLIENT_CFG_DATA,
+   SOURCE_ID_MATCH_ENABLE, 0x1);
+
+   WREG32_SOC15(OSSSYS, 0, regIH_RING1_CLIENT_CFG_DATA, tmp);
+   }
+
 pci_set_master(adev->pdev);

 /* enable interrupts */
--
2.34.1



RE: [PATCH v2 2/2] drm/amdgpu: Add support of gfx10 register dump

2024-04-12 Thread Khatri, Sunil
[AMD Official Use Only - General]

-Original Message-
From: Alex Deucher 
Sent: Saturday, April 13, 2024 1:56 AM
To: Khatri, Sunil 
Cc: Khatri, Sunil ; Deucher, Alexander 
; Koenig, Christian ; 
amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH v2 2/2] drm/amdgpu: Add support of gfx10 register dump

On Fri, Apr 12, 2024 at 1:31 PM Khatri, Sunil  wrote:
>
>
> On 4/12/2024 10:42 PM, Alex Deucher wrote:
>
> On Fri, Apr 12, 2024 at 1:05 PM Khatri, Sunil  wrote:
>
> On 4/12/2024 8:50 PM, Alex Deucher wrote:
>
> On Fri, Apr 12, 2024 at 10:00 AM Sunil Khatri  wrote:
>
> Adding initial set of registers for ipdump during devcoredump starting
> with gfx10 gc registers.
>
> ip dump is triggered when gpu reset happens via devcoredump and the
> memory is allocated by each ip and is freed once the dump is complete
> by devcoredump.
>
> Signed-off-by: Sunil Khatri 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  16 +++
>   .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c  |  22 +++
>
> I would split this into two patches, one to add the core
> infrastructure in devcoredump and one to add gfx10 support.  The core
> support could be squashed into patch 1 as well.
>
> Sure
>
>   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c| 127 +-
>   .../include/asic_reg/gc/gc_10_1_0_offset.h|  12 ++
>   4 files changed, 176 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 65c17c59c152..e173ad86a241 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -139,6 +139,18 @@ enum amdgpu_ss {
>  AMDGPU_SS_DRV_UNLOAD
>   };
>
> +struct hwip_reg_entry {
> +   u32 hwip;
> +   u32 inst;
> +   u32 seg;
> +   u32 reg_offset;
> +};
> +
> +struct reg_pair {
> +   u32 offset;
> +   u32 value;
> +};
> +
>   struct amdgpu_watchdog_timer {
>  bool timeout_fatal_disable;
>  uint32_t period; /* maxCycles = (1 << period), the number of cycles before a timeout */
> @@ -1152,6 +1164,10 @@ struct amdgpu_device {
>  booldebug_largebar;
>  booldebug_disable_soft_recovery;
>  booldebug_use_vram_fw_buf;
> +
> +   /* IP register dump */
> +   struct reg_pair *ip_dump;
> +   uint32_tnum_regs;
>   };
>
>   static inline uint32_t amdgpu_ip_version(const struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
> index 1129e5e5fb42..2079f67c9fac 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
> @@ -261,6 +261,18 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
> size_t count,
>  drm_printf(, "Faulty page starting at address: 0x%016llx\n", 
> fault_info->addr);
>  drm_printf(, "Protection fault status register: 0x%x\n\n",
> fault_info->status);
>
> +   /* Add IP dump for each ip */
> +   if (coredump->adev->ip_dump != NULL) {
> +   struct reg_pair *pair;
> +
> +   pair = (struct reg_pair *)coredump->adev->ip_dump;
> +   drm_printf(, "IP register dump\n");
> +   drm_printf(, "Offset \t Value\n");
> +   for (int i = 0; i < coredump->adev->num_regs; i++)
> +   drm_printf(, "0x%04x \t 0x%08x\n", pair[i].offset, 
> pair[i].value);
> +   drm_printf(, "\n");
> +   }
> +
>  /* Add ring buffer information */
>  drm_printf(, "Ring buffer information\n");
>  for (int i = 0; i < coredump->adev->num_rings; i++) { @@
> -299,6 +311,11 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset,
> size_t count,
>
>   static void amdgpu_devcoredump_free(void *data)
>   {
> +   struct amdgpu_coredump_info *temp = data;
> +
> +   kfree(temp->adev->ip_dump);
> +   temp->adev->ip_dump = NULL;
> +   temp->adev->num_regs = 0;
>  kfree(data);
>   }
>
> @@ -337,6 +354,11 @@ void amdgpu_coredump(struct amdgpu_device *adev,
> bool vram_lost,
>
>  coredump->adev = adev;
>
> +   /* Trigger ip dump here to capture the value of registers */
> +   for (int i = 0; i < adev->num_ip_blocks; i++)
> +   if (adev->ip_blocks[i].version->funcs->dump_ip_state)
> +
> + adev-&g

Re: [PATCH v2 2/2] drm/amdgpu: Add support of gfx10 register dump

2024-04-12 Thread Khatri, Sunil


On 4/12/2024 10:42 PM, Alex Deucher wrote:

On Fri, Apr 12, 2024 at 1:05 PM Khatri, Sunil  wrote:


On 4/12/2024 8:50 PM, Alex Deucher wrote:

On Fri, Apr 12, 2024 at 10:00 AM Sunil Khatri  wrote:

Adding initial set of registers for ipdump during
devcoredump starting with gfx10 gc registers.

ip dump is triggered when gpu reset happens via
devcoredump and the memory is allocated by each
ip and is freed once the dump is complete by
devcoredump.

Signed-off-by: Sunil Khatri
---
   drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  16 +++
   .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c  |  22 +++

I would split this into two patches, one to add the core
infrastructure in devcoredump and one to add gfx10 support.  The core
support could be squashed into patch 1 as well.

Sure



   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c| 127 +-
   .../include/asic_reg/gc/gc_10_1_0_offset.h|  12 ++
   4 files changed, 176 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 65c17c59c152..e173ad86a241 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -139,6 +139,18 @@ enum amdgpu_ss {
  AMDGPU_SS_DRV_UNLOAD
   };

+struct hwip_reg_entry {
+   u32 hwip;
+   u32 inst;
+   u32 seg;
+   u32 reg_offset;
+};
+
+struct reg_pair {
+   u32 offset;
+   u32 value;
+};
+
   struct amdgpu_watchdog_timer {
  bool timeout_fatal_disable;
  uint32_t period; /* maxCycles = (1 << period), the number of cycles 
before a timeout */
@@ -1152,6 +1164,10 @@ struct amdgpu_device {
  booldebug_largebar;
  booldebug_disable_soft_recovery;
  booldebug_use_vram_fw_buf;
+
+   /* IP register dump */
+   struct reg_pair *ip_dump;
+   uint32_tnum_regs;
   };

   static inline uint32_t amdgpu_ip_version(const struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
index 1129e5e5fb42..2079f67c9fac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
@@ -261,6 +261,18 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
  drm_printf(, "Faulty page starting at address: 0x%016llx\n", 
fault_info->addr);
  drm_printf(, "Protection fault status register: 0x%x\n\n", 
fault_info->status);

+   /* Add IP dump for each ip */
+   if (coredump->adev->ip_dump != NULL) {
+   struct reg_pair *pair;
+
+   pair = (struct reg_pair *)coredump->adev->ip_dump;
+   drm_printf(, "IP register dump\n");
+   drm_printf(, "Offset \t Value\n");
+   for (int i = 0; i < coredump->adev->num_regs; i++)
+   drm_printf(, "0x%04x \t 0x%08x\n", pair[i].offset, 
pair[i].value);
+   drm_printf(, "\n");
+   }
+
  /* Add ring buffer information */
  drm_printf(, "Ring buffer information\n");
  for (int i = 0; i < coredump->adev->num_rings; i++) {
@@ -299,6 +311,11 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,

   static void amdgpu_devcoredump_free(void *data)
   {
+   struct amdgpu_coredump_info *temp = data;
+
+   kfree(temp->adev->ip_dump);
+   temp->adev->ip_dump = NULL;
+   temp->adev->num_regs = 0;
  kfree(data);
   }

@@ -337,6 +354,11 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool 
vram_lost,

  coredump->adev = adev;

+   /* Trigger ip dump here to capture the value of registers */
+   for (int i = 0; i < adev->num_ip_blocks; i++)
+   if (adev->ip_blocks[i].version->funcs->dump_ip_state)
+   adev->ip_blocks[i].version->funcs->dump_ip_state((void 
*)adev);
+

This seems too complicated. I think it would be easier to

This is how all the other per-IP functions are called. What do you suggest?

  ktime_get_ts64(&coredump->reset_time);

  dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_NOWAIT,
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index a0bc4196ff8b..66e2915a8b4d 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -276,6 +276,99 @@ MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec.bin");
   MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec2.bin");
   MODULE_FIRMWARE("amdgpu/gc_10_3_7_rlc.bin");

+static const struct hwip_reg_entry gc_reg_list_10_1[] = {
+   { SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS) },
+   { SOC15_RE

Re: [PATCH v2 2/2] drm/amdgpu: Add support of gfx10 register dump

2024-04-12 Thread Khatri, Sunil



On 4/12/2024 8:50 PM, Alex Deucher wrote:

On Fri, Apr 12, 2024 at 10:00 AM Sunil Khatri  wrote:

Adding initial set of registers for ipdump during
devcoredump starting with gfx10 gc registers.

ip dump is triggered when gpu reset happens via
devcoredump and the memory is allocated by each
ip and is freed once the dump is complete by
devcoredump.

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  16 +++
  .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c  |  22 +++

I would split this into two patches, one to add the core
infrastructure in devcoredump and one to add gfx10 support.  The core
support could be squashed into patch 1 as well.

Sure




  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c| 127 +-
  .../include/asic_reg/gc/gc_10_1_0_offset.h|  12 ++
  4 files changed, 176 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 65c17c59c152..e173ad86a241 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -139,6 +139,18 @@ enum amdgpu_ss {
 AMDGPU_SS_DRV_UNLOAD
  };

+struct hwip_reg_entry {
+   u32 hwip;
+   u32 inst;
+   u32 seg;
+   u32 reg_offset;
+};
+
+struct reg_pair {
+   u32 offset;
+   u32 value;
+};
+
  struct amdgpu_watchdog_timer {
 bool timeout_fatal_disable;
 uint32_t period; /* maxCycles = (1 << period), the number of cycles 
before a timeout */
@@ -1152,6 +1164,10 @@ struct amdgpu_device {
 booldebug_largebar;
 booldebug_disable_soft_recovery;
 booldebug_use_vram_fw_buf;
+
+   /* IP register dump */
+   struct reg_pair *ip_dump;
+   uint32_tnum_regs;
  };

  static inline uint32_t amdgpu_ip_version(const struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
index 1129e5e5fb42..2079f67c9fac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
@@ -261,6 +261,18 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
 drm_printf(, "Faulty page starting at address: 0x%016llx\n", 
fault_info->addr);
 drm_printf(, "Protection fault status register: 0x%x\n\n", 
fault_info->status);

+   /* Add IP dump for each ip */
+   if (coredump->adev->ip_dump != NULL) {
+   struct reg_pair *pair;
+
+   pair = (struct reg_pair *)coredump->adev->ip_dump;
+   drm_printf(, "IP register dump\n");
+   drm_printf(, "Offset \t Value\n");
+   for (int i = 0; i < coredump->adev->num_regs; i++)
+   drm_printf(, "0x%04x \t 0x%08x\n", pair[i].offset, 
pair[i].value);
+   drm_printf(, "\n");
+   }
+
 /* Add ring buffer information */
 drm_printf(, "Ring buffer information\n");
 for (int i = 0; i < coredump->adev->num_rings; i++) {
@@ -299,6 +311,11 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,

  static void amdgpu_devcoredump_free(void *data)
  {
+   struct amdgpu_coredump_info *temp = data;
+
+   kfree(temp->adev->ip_dump);
+   temp->adev->ip_dump = NULL;
+   temp->adev->num_regs = 0;
 kfree(data);
  }

@@ -337,6 +354,11 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool 
vram_lost,

 coredump->adev = adev;

+   /* Trigger ip dump here to capture the value of registers */
+   for (int i = 0; i < adev->num_ip_blocks; i++)
+   if (adev->ip_blocks[i].version->funcs->dump_ip_state)
+   adev->ip_blocks[i].version->funcs->dump_ip_state((void 
*)adev);
+

This seems too complicated. I think it would be easier to

This is how all the other per-IP functions are called. What do you suggest?



 ktime_get_ts64(&coredump->reset_time);

 dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_NOWAIT,
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index a0bc4196ff8b..66e2915a8b4d 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -276,6 +276,99 @@ MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec.bin");
  MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec2.bin");
  MODULE_FIRMWARE("amdgpu/gc_10_3_7_rlc.bin");

+static const struct hwip_reg_entry gc_reg_list_10_1[] = {
+   { SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS) },
+   { SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS2) },
+   { SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS3) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT1) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT2) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_STALLED_STAT1) },
+   { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_STALLED_STAT1) },
+   { 

Re: [PATCH v2 2/2] drm/amdgpu: Add support of gfx10 register dump

2024-04-12 Thread Khatri, Sunil


On 4/12/2024 8:50 PM, Alex Deucher wrote:

I would split this into two patches, one to add the core
infrastructure in devcoredump and one to add gfx10 support.  The core
support could be squashed into patch 1 as well.



Sure, I will push v3 with the changes.

Regards

Sunil


RE: [PATCH 0/2] First set in IP dump patches

2024-04-12 Thread Khatri, Sunil
[AMD Official Use Only - General]

Ignore the series, it was sent by mistake.


-Original Message-
From: Sunil Khatri 
Sent: Friday, April 12, 2024 2:30 PM
To: Deucher, Alexander ; Koenig, Christian 

Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil 
Subject: [PATCH 0/2] First set in IP dump patches

Adding infrastructure needed for ipdump along with dumping gfx10 registers.

Sunil Khatri (2):
  drm/amdgpu: add prototype to dump ip state
  drm/amdgpu: Add support of gfx10 register dump

 drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  16 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_acp.c   |   1 +
 .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c  |  22 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_umsch_mm.c  |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c  |   1 +
 drivers/gpu/drm/amd/amdgpu/cik.c  |   1 +
 drivers/gpu/drm/amd/amdgpu/cik_ih.c   |   1 +
 drivers/gpu/drm/amd/amdgpu/cik_sdma.c |   1 +
 drivers/gpu/drm/amd/amdgpu/cz_ih.c|   1 +
 drivers/gpu/drm/amd/amdgpu/dce_v10_0.c|   1 +
 drivers/gpu/drm/amd/amdgpu/dce_v11_0.c|   1 +
 drivers/gpu/drm/amd/amdgpu/dce_v6_0.c |   1 +
 drivers/gpu/drm/amd/amdgpu/dce_v8_0.c |   1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c| 142 ++
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c|   1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c |   1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c |   1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c |   1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c |   1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c   |   1 +
 drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c |   1 +
 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c |   1 +
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c |   1 +
 drivers/gpu/drm/amd/amdgpu/iceland_ih.c   |   1 +
 drivers/gpu/drm/amd/amdgpu/ih_v6_0.c  |   1 +
 drivers/gpu/drm/amd/amdgpu/ih_v6_1.c  |   1 +
 drivers/gpu/drm/amd/amdgpu/ih_v7_0.c  |   1 +
 drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c|   1 +
 drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c|   2 +
 drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c|   1 +
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c|   1 +
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c  |   1 +
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_5.c  |   1 +
 drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c  |   1 +
 drivers/gpu/drm/amd/amdgpu/mes_v10_1.c|   1 +
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c|   1 +
 drivers/gpu/drm/amd/amdgpu/navi10_ih.c|   1 +
 drivers/gpu/drm/amd/amdgpu/nv.c   |   1 +
 drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c|   1 +
 drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c|   1 +
 drivers/gpu/drm/amd/amdgpu/si.c   |   1 +
 drivers/gpu/drm/amd/amdgpu/si_dma.c   |   1 +
 drivers/gpu/drm/amd/amdgpu/si_ih.c|   1 +
 drivers/gpu/drm/amd/amdgpu/soc15.c|   1 +
 drivers/gpu/drm/amd/amdgpu/soc21.c|   1 +
 drivers/gpu/drm/amd/amdgpu/tonga_ih.c |   1 +
 drivers/gpu/drm/amd/amdgpu/uvd_v3_1.c |   1 +
 drivers/gpu/drm/amd/amdgpu/uvd_v4_2.c |   1 +
 drivers/gpu/drm/amd/amdgpu/uvd_v5_0.c |   1 +
 drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c |   1 +
 drivers/gpu/drm/amd/amdgpu/vce_v2_0.c |   1 +
 drivers/gpu/drm/amd/amdgpu/vce_v3_0.c |   1 +
 drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c |   1 +
 drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c |   1 +
 drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c |   2 +
 drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c |   1 +
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c |   1 +
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c   |   1 +
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c   |   1 +
 drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c   |   1 +
 drivers/gpu/drm/amd/amdgpu/vi.c   |   1 +
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |   1 +
 drivers/gpu/drm/amd/include/amd_shared.h  |   1 +
 drivers/gpu/drm/amd/pm/legacy-dpm/kv_dpm.c|   1 +
 drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c|   1 +
 .../gpu/drm/amd/pm/powerplay/amd_powerplay.c  |   1 +
 66 files changed, 245 insertions(+)

--
2.34.1



RE: [PATCH 2/2] drm/amdgpu: Add support of gfx10 register dump

2024-04-12 Thread Khatri, Sunil
[AMD Official Use Only - General]

Ignore, this was sent by mistake.

-Original Message-
From: Sunil Khatri 
Sent: Friday, April 12, 2024 2:30 PM
To: Deucher, Alexander ; Koenig, Christian 

Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil 
Subject: [PATCH 2/2] drm/amdgpu: Add support of gfx10 register dump

Adding initial set of registers for ipdump during devcoredump starting with 
gfx10 gc registers.

ip dump is triggered when gpu reset happens via devcoredump and the memory is 
allocated by each ip and is freed once the dump is complete by devcoredump.

Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  16 ++
 .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c  |  22 +++
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c| 143 +-
 3 files changed, 180 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 65c17c59c152..e173ad86a241 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -139,6 +139,18 @@ enum amdgpu_ss {
AMDGPU_SS_DRV_UNLOAD
 };

+struct hwip_reg_entry {
+   u32 hwip;
+   u32 inst;
+   u32 seg;
+   u32 reg_offset;
+};
+
+struct reg_pair {
+   u32 offset;
+   u32 value;
+};
+
 struct amdgpu_watchdog_timer {
bool timeout_fatal_disable;
	uint32_t period; /* maxCycles = (1 << period), the number of cycles before a timeout */
@@ -1152,6 +1164,10 @@ struct amdgpu_device {
booldebug_largebar;
booldebug_disable_soft_recovery;
booldebug_use_vram_fw_buf;
+
+   /* IP register dump */
+   struct reg_pair *ip_dump;
+   uint32_tnum_regs;
 };

 static inline uint32_t amdgpu_ip_version(const struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
index 1129e5e5fb42..2079f67c9fac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
@@ -261,6 +261,18 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
drm_printf(, "Faulty page starting at address: 0x%016llx\n", 
fault_info->addr);
drm_printf(, "Protection fault status register: 0x%x\n\n", 
fault_info->status);

+   /* Add IP dump for each ip */
+   if (coredump->adev->ip_dump != NULL) {
+   struct reg_pair *pair;
+
+   pair = (struct reg_pair *)coredump->adev->ip_dump;
+   drm_printf(, "IP register dump\n");
+   drm_printf(, "Offset \t Value\n");
+   for (int i = 0; i < coredump->adev->num_regs; i++)
+   drm_printf(, "0x%04x \t 0x%08x\n", pair[i].offset, 
pair[i].value);
+   drm_printf(, "\n");
+   }
+
/* Add ring buffer information */
drm_printf(, "Ring buffer information\n");
for (int i = 0; i < coredump->adev->num_rings; i++) { @@ -299,6 +311,11 
@@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,

 static void amdgpu_devcoredump_free(void *data)
 {
+   struct amdgpu_coredump_info *temp = data;
+
+   kfree(temp->adev->ip_dump);
+   temp->adev->ip_dump = NULL;
+   temp->adev->num_regs = 0;
kfree(data);
 }

@@ -337,6 +354,11 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool 
vram_lost,

coredump->adev = adev;

+   /* Trigger ip dump here to capture the value of registers */
+   for (int i = 0; i < adev->num_ip_blocks; i++)
+   if (adev->ip_blocks[i].version->funcs->dump_ip_state)
+   adev->ip_blocks[i].version->funcs->dump_ip_state((void 
*)adev);
+
	ktime_get_ts64(&coredump->reset_time);

	dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_NOWAIT,
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index a0bc4196ff8b..05c4b1d62132 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -47,6 +47,22 @@
 #include "gfx_v10_0.h"
 #include "nbio_v2_3.h"

+/*
+ * Manually adding some of the missing gfx10 registers from spec  */
+#define mmCP_DEBUG_BASE_IDX0
+#define mmCP_DEBUG 0x1e1f
+#define mmCP_MES_DEBUG_INTERRUPT_INSTR_PNTR_BASE_IDX   1
+#define mmCP_MES_DEBUG_INTERRUPT_INSTR_PNTR0x2840
+#define mmRLC_GPM_DEBUG_INST_A_BASE_IDX1
+#define mmRLC_GPM_DEBUG_INST_A 0x4c22
+#define mmRLC_GPM_DEBUG_INST_B_BASE_IDX1
+#define mmRLC_GPM_DEBUG_INST_B 0x

Re: [PATCH] drm/amdgpu: add IP's FW information to devcoredump

2024-03-27 Thread Khatri, Sunil



On 3/28/2024 8:38 AM, Alex Deucher wrote:

On Tue, Mar 26, 2024 at 1:31 PM Sunil Khatri  wrote:

Add FW information of all the IP's in the devcoredump.

Signed-off-by: Sunil Khatri 

Might want to include the vbios version info as well, e.g.,
atom_context->name
atom_context->vbios_pn
atom_context->vbios_ver_str
atom_context->date


Sure, I will add those parameters too.

Regards

Sunil
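
For reference, a rough sketch of the lines I would add; the field names
follow Alex's list above, the exact format strings are illustrative:

	/* Hedged sketch: vbios identification in the fw info dump. */
	struct atom_context *ctx = adev->mode_info.atom_context;

	drm_printf(p, "VBIOS name: %s\n", ctx->name);
	drm_printf(p, "VBIOS pn: %s\n", ctx->vbios_pn);
	drm_printf(p, "VBIOS version: %s\n", ctx->vbios_ver_str);
	drm_printf(p, "VBIOS date: %s\n", ctx->date);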


Either way,
Reviewed-by: Alex Deucher 


---
  .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c  | 122 ++
  1 file changed, 122 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
index 44c5da8aa9ce..d598b6520ec9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
@@ -69,6 +69,124 @@ const char *hw_ip_names[MAX_HWIP] = {
 [PCIE_HWIP] = "PCIE",
  };

+static void amdgpu_devcoredump_fw_info(struct amdgpu_device *adev,
+  struct drm_printer *p)
+{
+   uint32_t version;
+   uint32_t feature;
+   uint8_t smu_program, smu_major, smu_minor, smu_debug;
+
+   drm_printf(p, "VCE feature version: %u, fw version: 0x%08x\n",
+  adev->vce.fb_version, adev->vce.fw_version);
+   drm_printf(p, "UVD feature version: %u, fw version: 0x%08x\n", 0,
+  adev->uvd.fw_version);
+   drm_printf(p, "GMC feature version: %u, fw version: 0x%08x\n", 0,
+  adev->gmc.fw_version);
+   drm_printf(p, "ME feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.me_feature_version, adev->gfx.me_fw_version);
+   drm_printf(p, "PFP feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.pfp_feature_version, adev->gfx.pfp_fw_version);
+   drm_printf(p, "CE feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.ce_feature_version, adev->gfx.ce_fw_version);
+   drm_printf(p, "RLC feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_feature_version, adev->gfx.rlc_fw_version);
+
+   drm_printf(p, "RLC SRLC feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_srlc_feature_version,
+  adev->gfx.rlc_srlc_fw_version);
+   drm_printf(p, "RLC SRLG feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_srlg_feature_version,
+  adev->gfx.rlc_srlg_fw_version);
+   drm_printf(p, "RLC SRLS feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_srls_feature_version,
+  adev->gfx.rlc_srls_fw_version);
+   drm_printf(p, "RLCP feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlcp_ucode_feature_version,
+  adev->gfx.rlcp_ucode_version);
+   drm_printf(p, "RLCV feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlcv_ucode_feature_version,
+  adev->gfx.rlcv_ucode_version);
+   drm_printf(p, "MEC feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.mec_feature_version, adev->gfx.mec_fw_version);
+
+   if (adev->gfx.mec2_fw)
+   drm_printf(p, "MEC2 feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.mec2_feature_version,
+  adev->gfx.mec2_fw_version);
+
+   drm_printf(p, "IMU feature version: %u, fw version: 0x%08x\n", 0,
+  adev->gfx.imu_fw_version);
+   drm_printf(p, "PSP SOS feature version: %u, fw version: 0x%08x\n",
+  adev->psp.sos.feature_version, adev->psp.sos.fw_version);
+   drm_printf(p, "PSP ASD feature version: %u, fw version: 0x%08x\n",
+  adev->psp.asd_context.bin_desc.feature_version,
+  adev->psp.asd_context.bin_desc.fw_version);
+
+   drm_printf(p, "TA XGMI feature version: 0x%08x, fw version: 0x%08x\n",
+  adev->psp.xgmi_context.context.bin_desc.feature_version,
+  adev->psp.xgmi_context.context.bin_desc.fw_version);
+   drm_printf(p, "TA RAS feature version: 0x%08x, fw version: 0x%08x\n",
+  adev->psp.ras_context.context.bin_desc.feature_version,
+  adev->psp.ras_context.context.bin_desc.fw_version);
+   drm_printf(p, "TA HDCP feature version: 0x%08x, fw version: 0x%08x\n",
+  adev->psp.hdcp_context.context.bin_desc.feature_version,
+  adev->psp.hdcp_context.context.bin_desc.fw_version);
+   drm_printf(p, "TA DTM feature version: 0x%08x, fw version: 0x%08x\n",
+  adev->psp.dtm_context.context.bin_desc.feature_version,
+  adev->psp.dtm_context.context.bin_desc.fw_version);
+   drm_printf(p, "TA RAP feature version: 0x%08x, fw version: 0x%08x\n",
+  adev->psp.rap_context.context.bin_desc.feature_version,
+  

Re: [PATCH] drm/amdgpu: add support of bios dump in devcoredump

2024-03-26 Thread Khatri, Sunil



On 3/26/2024 10:23 PM, Alex Deucher wrote:

On Tue, Mar 26, 2024 at 10:38 AM Sunil Khatri  wrote:

dump the bios binary in the devcoredump.

Signed-off-by: Sunil Khatri 
---
  .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c  | 20 +++
  1 file changed, 20 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
index 44c5da8aa9ce..f33963d777eb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
@@ -132,6 +132,26 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
 drm_printf(, "Faulty page starting at address: 0x%016llx\n", 
fault_info->addr);
 drm_printf(, "Protection fault status register: 0x%x\n\n", 
fault_info->status);

+   /* Dump BIOS */
+   if (coredump->adev->bios && coredump->adev->bios_size) {
+   int i = 0;
+
+   drm_printf(, "BIOS Binary dump\n");
+   drm_printf(, "Valid BIOS  Size:%d bytes type:%s\n",
+  coredump->adev->bios_size,
+  coredump->adev->is_atom_fw ?
+  "Atom bios":"Non Atom Bios");
+
+   while (i < coredump->adev->bios_size) {
+   /* Printing 15 bytes in a line */
+   if (i % 15 == 0)
+   drm_printf(, "\n");
+   drm_printf(, "0x%x \t", coredump->adev->bios[i]);
+   i++;
+   }
+   drm_printf(, "\n");
+   }

I don't think it's too useful to dump this as text.  I was hoping it
could be a binary.  I guess, we can just get this from debugfs if we
need it if a binary is not possible.



Yes, this dumps in text format only, and the binary is already available 
in debugfs, so I am discarding the patch.
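
Should we ever want a readable text dump after all, a minimal sketch using
hex_dump_to_buffer() from lib/hexdump.c would look roughly like this (the
placement inside amdgpu_devcoredump_read() is an assumption):

	/* Hedged sketch: canonical 16-byte hex rows instead of one
	 * "0x%x" per byte. */
	char line[16 * 3 + 1];
	size_t i;

	for (i = 0; i < coredump->adev->bios_size; i += 16) {
		hex_dump_to_buffer(coredump->adev->bios + i,
				   min_t(size_t, 16, coredump->adev->bios_size - i),
				   16, 1, line, sizeof(line), false);
		drm_printf(&p, "%08zx: %s\n", i, line);
	}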




Alex



+
 /* Add ring buffer information */
 drm_printf(, "Ring buffer information\n");
 for (int i = 0; i < coredump->adev->num_rings; i++) {
--
2.34.1



RE: [PATCH v2] drm/amdgpu: refactor code to reuse system information

2024-03-19 Thread Khatri, Sunil
[AMD Official Use Only - General]

Ignore this as I have sent v3.

-Original Message-
From: Sunil Khatri 
Sent: Tuesday, March 19, 2024 8:41 PM
To: Deucher, Alexander ; Koenig, Christian 
; Sharma, Shashank 
Cc: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org; 
linux-ker...@vger.kernel.org; Zhang, Hawking ; Kuehling, 
Felix ; Lazar, Lijo ; Khatri, Sunil 

Subject: [PATCH v2] drm/amdgpu: refactor code to reuse system information

Refactor the code so debugfs and devcoredump can reuse the common information 
and avoid unnecessary copy of it.

created a new file which would be the right place to hold functions which will 
be used between ioctl, debugfs and devcoredump.

Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/Makefile  |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c | 146 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.h |  33 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c      | 117 +--
 4 files changed, 182 insertions(+), 116 deletions(-)
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c
 create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.h

diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile 
b/drivers/gpu/drm/amd/amdgpu/Makefile
index 4536c8ad0e11..2c5c800c1ed6 100644
--- a/drivers/gpu/drm/amd/amdgpu/Makefile
+++ b/drivers/gpu/drm/amd/amdgpu/Makefile
@@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o 
amdgpu_kms.o \
amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \
amdgpu_fw_attestation.o amdgpu_securedisplay.o \
amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \
-   amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o
+   amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o amdgpu_coreinfo.o

 amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c
new file mode 100644
index ..597fc9d432ce
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright 2024 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ *
+ */
+
+#include "amdgpu_coreinfo.h"
+#include "amd_pcie.h"
+
+
+void amdgpu_coreinfo_devinfo(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info)
+{
+   int ret;
+   uint64_t vm_size;
+   uint32_t pcie_gen_mask;
+
+   dev_info->device_id = adev->pdev->device;
+   dev_info->chip_rev = adev->rev_id;
+   dev_info->external_rev = adev->external_rev_id;
+   dev_info->pci_rev = adev->pdev->revision;
+   dev_info->family = adev->family;
+   dev_info->num_shader_engines = adev->gfx.config.max_shader_engines;
+   dev_info->num_shader_arrays_per_engine = adev->gfx.config.max_sh_per_se;
+   /* return all clocks in KHz */
+   dev_info->gpu_counter_freq = amdgpu_asic_get_xclk(adev) * 10;
+   if (adev->pm.dpm_enabled) {
+   dev_info->max_engine_clock = amdgpu_dpm_get_sclk(adev, false) * 
10;
+   dev_info->max_memory_clock = amdgpu_dpm_get_mclk(adev, false) * 
10;
+   dev_info->min_engine_clock = amdgpu_dpm_get_sclk(adev, true) * 
10;
+   dev_info->min_memory_clock = amdgpu_dpm_get_mclk(adev, true) * 
10;
+   } else {
+   dev_info->max_engine_clock =
+   dev_info->min_engine_clock =
+   adev->clock.default_sclk * 10;
+   dev_info->max_memory_clock =
+   dev_info->min_memory_clock =
+   adev->clock.default_mclk * 10;
+   }
+   dev_info->enab

Re: [PATCH] drm/amdgpu: refactor code to reuse system information

2024-03-19 Thread Khatri, Sunil

Sent a new patch based on the discussion with Alex.

On 3/19/2024 8:34 PM, Christian König wrote:

Am 19.03.24 um 15:59 schrieb Alex Deucher:

On Tue, Mar 19, 2024 at 10:56 AM Christian König
 wrote:

Am 19.03.24 um 15:26 schrieb Alex Deucher:
On Tue, Mar 19, 2024 at 8:32 AM Sunil Khatri  
wrote:

Refactor the code so debugfs and devcoredump can reuse
the common information and avoid unnecessary copy of it.

created a new file which would be the right place to
hold functions which will be used between sysfs, debugfs
and devcoredump.

Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Sunil Khatri 
---
   drivers/gpu/drm/amd/amdgpu/Makefile |   2 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu.h |   1 +
   drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 


   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +--
   4 files changed, 157 insertions(+), 115 deletions(-)
   create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c

diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile 
b/drivers/gpu/drm/amd/amdgpu/Makefile

index 4536c8ad0e11..05d34f4b18f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/Makefile
+++ b/drivers/gpu/drm/amd/amdgpu/Makefile
@@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o 
amdgpu_doorbell_mgr.o amdgpu_kms.o \
  amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o 
amdgpu_rap.o \

  amdgpu_fw_attestation.o amdgpu_securedisplay.o \
  amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o 
amdgpu_lsdma.o \

-   amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o
+   amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o amdgpu_devinfo.o


   amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

index 9c62552bec34..0267870aa9b1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1609,4 +1609,5 @@ extern const struct attribute_group 
amdgpu_vram_mgr_attr_group;

   extern const struct attribute_group amdgpu_gtt_mgr_attr_group;
   extern const struct attribute_group amdgpu_flash_attr_group;

+int amdgpu_device_info(struct amdgpu_device *adev, struct 
drm_amdgpu_info_device *dev_info);

   #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c

new file mode 100644
index ..d2c15a1dcb0d
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright 2024 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person 
obtaining a
+ * copy of this software and associated documentation files (the 
"Software"),
+ * to deal in the Software without restriction, including without 
limitation
+ * the rights to use, copy, modify, merge, publish, distribute, 
sublicense,
+ * and/or sell copies of the Software, and to permit persons to 
whom the
+ * Software is furnished to do so, subject to the following 
conditions:

+ *
+ * The above copyright notice and this permission notice shall be 
included in

+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY 
KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 
MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO 
EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, 
DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR 
OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE 
USE OR

+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#include "amdgpu.h"
+#include "amd_pcie.h"
+
+#include 
+
+int amdgpu_device_info(struct amdgpu_device *adev, struct 
drm_amdgpu_info_device *dev_info)

We can probably keep this in amdgpu_kms.c unless that file is getting
too big.  I don't think it warrants a new file at this point.  If you
do keep it in amdgpu_kms.c, I'd recommend renaming it to something
like amdgpu_kms_device_info() to keep the naming conventions.

We should not be using this for anything new in the first place.

A whole bunch of the stuff inside the devinfo structure has been
deprecated because we found that putting everything into one structure
was a bad idea.

It's a convenient way to collect a lot of useful information that we
want in the core dump.  Plus it's not going anywhere because we need
to keep compatibility in the IOCTL.


Yeah, and exactly that is what I'm strictly against. The devinfo wasn't 
used for new stuff because we found that it is way too inflexible.


That's why we have multiple separate IOCTLs for the memory and 
firmware information for example.


We should really *not* reuse that for the device core dumping.

Rather just use the same information from the different IPs and 
subsystems directly. E.g. add a function to the VM, GFX etc for 
printing out devcoredump infos.
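
A minimal sketch of that shape, with a hypothetical per-subsystem hook
(the function name and the printed field are illustrative only):

	/* Hedged sketch: each subsystem prints its own state directly,
	 * no shared info structure in between. */
	void amdgpu_vm_print_coredump_info(struct amdgpu_device *adev,
					   struct drm_printer *p)
	{
		drm_printf(p, "VM max pfn: %llu\n", adev->vm_manager.max_pfn);
	}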


I have pushed new v2 based 

Re: [PATCH v2] drm/amdgpu: refactor code to reuse system information

2024-03-19 Thread Khatri, Sunil



On 3/19/2024 8:07 PM, Christian König wrote:

Am 19.03.24 um 15:25 schrieb Sunil Khatri:

Refactor the code so debugfs and devcoredump can reuse
the common information and avoid unnecessary copy of it.

created a new file which would be the right place to
hold functions which will be used between ioctl, debugfs
and devcoredump.


Ok, taking a closer look that is certainly not a good idea.

The devinfo structure was just created because somebody thought that 
mixing all that stuff into one structure would be a good idea.


We have pretty much deprecated that approach and should *really* not 
change anything here any more.
To support the ioctl we are keeping that information the same, without 
changing it. The intent behind adding a new file is that more information 
will be landing there. Next in line is the firmware information, which is 
again a huge function with a lot of information; to share that information 
between devcoredump, the ioctl and sysfs, the new file seemed like the 
right idea after some discussions.
FYI: this will not be just one function in the new file but more to come, 
so code can be reused without copying it.


Regards,
Christian.



Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/Makefile |   2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu.h |   1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 
  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +--
  4 files changed, 157 insertions(+), 115 deletions(-)
  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c

diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile 
b/drivers/gpu/drm/amd/amdgpu/Makefile

index 4536c8ad0e11..05d34f4b18f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/Makefile
+++ b/drivers/gpu/drm/amd/amdgpu/Makefile
@@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o 
amdgpu_kms.o \

  amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \
  amdgpu_fw_attestation.o amdgpu_securedisplay.o \
  amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \
-    amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o
+    amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o 
amdgpu_devinfo.o

    amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o
  diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

index 9c62552bec34..0267870aa9b1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1609,4 +1609,5 @@ extern const struct attribute_group 
amdgpu_vram_mgr_attr_group;

  extern const struct attribute_group amdgpu_gtt_mgr_attr_group;
  extern const struct attribute_group amdgpu_flash_attr_group;
  +int amdgpu_device_info(struct amdgpu_device *adev, struct 
drm_amdgpu_info_device *dev_info);

  #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c

new file mode 100644
index ..fdcbc1984031
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright 2024 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#include "amdgpu.h"
+#include "amd_pcie.h"
+
+#include <drm/amdgpu_drm.h>
+
+int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info)
+{
+    int ret;
+    uint64_t vm_size;
+    uint32_t pcie_gen_mask;
+
+    if (dev_info == NULL)
+    return -EINVAL;
+
+    dev_info->device_id = adev->pdev->device;
+    dev_info->chip_rev = adev->rev_id;
+    dev_info->external_rev = adev->external_rev_id;
+    dev_info->pci_rev = adev->pdev->revision;
+    dev_info->family = adev->family;
+    dev_info->num_shader_engines = adev->gfx.config.max_shader_engines;
+    dev_info->num_shader_arrays_per_engine = 
adev->gfx.config.max_sh_per_se;

+    /* return all clocks in KHz */
+    

Re: [PATCH] drm/amdgpu: refactor code to reuse system information

2024-03-19 Thread Khatri, Sunil


On 3/19/2024 7:43 PM, Lazar, Lijo wrote:


On 3/19/2024 7:27 PM, Khatri, Sunil wrote:

On 3/19/2024 7:19 PM, Lazar, Lijo wrote:

On 3/19/2024 6:02 PM, Sunil Khatri wrote:

Refactor the code so debugfs and devcoredump can reuse
the common information and avoid unnecessary copy of it.

created a new file which would be the right place to
hold functions which will be used between sysfs, debugfs
and devcoredump.

Cc: Christian König
Cc: Alex Deucher
Signed-off-by: Sunil Khatri
---
   drivers/gpu/drm/amd/amdgpu/Makefile |   2 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu.h |   1 +
   drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 
   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +--
   4 files changed, 157 insertions(+), 115 deletions(-)
   create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c

diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile
b/drivers/gpu/drm/amd/amdgpu/Makefile
index 4536c8ad0e11..05d34f4b18f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/Makefile
+++ b/drivers/gpu/drm/amd/amdgpu/Makefile
@@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o
amdgpu_kms.o \
   amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \
   amdgpu_fw_attestation.o amdgpu_securedisplay.o \
   amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \
-    amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o
+    amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o
amdgpu_devinfo.o
     amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o
   diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 9c62552bec34..0267870aa9b1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1609,4 +1609,5 @@ extern const struct attribute_group
amdgpu_vram_mgr_attr_group;
   extern const struct attribute_group amdgpu_gtt_mgr_attr_group;
   extern const struct attribute_group amdgpu_flash_attr_group;
   +int amdgpu_device_info(struct amdgpu_device *adev, struct
drm_amdgpu_info_device *dev_info);
   #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
new file mode 100644
index ..d2c15a1dcb0d
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright 2024 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person
obtaining a
+ * copy of this software and associated documentation files (the
"Software"),
+ * to deal in the Software without restriction, including without
limitation
+ * the rights to use, copy, modify, merge, publish, distribute,
sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom
the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be
included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO
EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM,
DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#include "amdgpu.h"
+#include "amd_pcie.h"
+
+#include <drm/amdgpu_drm.h>
+
+int amdgpu_device_info(struct amdgpu_device *adev, struct
drm_amdgpu_info_device *dev_info)
+{
+    int ret;
+    uint64_t vm_size;
+    uint32_t pcie_gen_mask;
+
+    if (dev_info == NULL)
+    return -EINVAL;
+
+    dev_info->device_id = adev->pdev->device;
+    dev_info->chip_rev = adev->rev_id;
+    dev_info->external_rev = adev->external_rev_id;
+    dev_info->pci_rev = adev->pdev->revision;
+    dev_info->family = adev->family;
+    dev_info->num_shader_engines = adev->gfx.config.max_shader_engines;
+    dev_info->num_shader_arrays_per_engine =
adev->gfx.config.max_sh_per_se;
+    /* return all clocks in KHz */
+    dev_info->gpu_counter_freq = amdgpu_asic_get_xclk(adev) * 10;
+    if (adev->pm.dpm_enabled) {
+    dev_info->max_engine_clock = amdgpu_dpm_get_sclk(adev,
false) * 10;
+    dev_info->max_memory_clock = amdgpu_dpm_get_mclk(adev,
false) * 10;
+    dev_info->min_engine_clock = amdgpu_dpm_get_sclk(adev, true)
* 10;
+    dev_info->min_memory_clock = amdgpu_dpm_get_mclk(adev, true)
* 10;
+    } else {
+    dev_info->max_engine_clock =
+    dev_info->min_engine_clock =
+    adev->clock.default_sclk * 10;
+    dev_info->max_memory_clock =
+    dev_info->min_memory_clock =
+    adev->clock.default_mc

Re: [PATCH] drm/amdgpu: refactor code to reuse system information

2024-03-19 Thread Khatri, Sunil



On 3/19/2024 7:19 PM, Lazar, Lijo wrote:


On 3/19/2024 6:02 PM, Sunil Khatri wrote:

Refactor the code so debugfs and devcoredump can reuse
the common information and avoid unnecessary copy of it.

created a new file which would be the right place to
hold functions which will be used between sysfs, debugfs
and devcoredump.

Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/Makefile |   2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu.h |   1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 
  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +--
  4 files changed, 157 insertions(+), 115 deletions(-)
  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c

diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile 
b/drivers/gpu/drm/amd/amdgpu/Makefile
index 4536c8ad0e11..05d34f4b18f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/Makefile
+++ b/drivers/gpu/drm/amd/amdgpu/Makefile
@@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o 
amdgpu_kms.o \
amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \
amdgpu_fw_attestation.o amdgpu_securedisplay.o \
amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \
-   amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o
+   amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o 
amdgpu_devinfo.o
  
  amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

index 9c62552bec34..0267870aa9b1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1609,4 +1609,5 @@ extern const struct attribute_group 
amdgpu_vram_mgr_attr_group;
  extern const struct attribute_group amdgpu_gtt_mgr_attr_group;
  extern const struct attribute_group amdgpu_flash_attr_group;
  
+int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info);

  #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
new file mode 100644
index ..d2c15a1dcb0d
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright 2024 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#include "amdgpu.h"
+#include "amd_pcie.h"
+
+#include <drm/amdgpu_drm.h>
+
+int amdgpu_device_info(struct amdgpu_device *adev, struct 
drm_amdgpu_info_device *dev_info)
+{
+   int ret;
+   uint64_t vm_size;
+   uint32_t pcie_gen_mask;
+
+   if (dev_info == NULL)
+   return -EINVAL;
+
+   dev_info->device_id = adev->pdev->device;
+   dev_info->chip_rev = adev->rev_id;
+   dev_info->external_rev = adev->external_rev_id;
+   dev_info->pci_rev = adev->pdev->revision;
+   dev_info->family = adev->family;
+   dev_info->num_shader_engines = adev->gfx.config.max_shader_engines;
+   dev_info->num_shader_arrays_per_engine = adev->gfx.config.max_sh_per_se;
+   /* return all clocks in KHz */
+   dev_info->gpu_counter_freq = amdgpu_asic_get_xclk(adev) * 10;
+   if (adev->pm.dpm_enabled) {
+   dev_info->max_engine_clock = amdgpu_dpm_get_sclk(adev, false) * 
10;
+   dev_info->max_memory_clock = amdgpu_dpm_get_mclk(adev, false) * 
10;
+   dev_info->min_engine_clock = amdgpu_dpm_get_sclk(adev, true) * 
10;
+   dev_info->min_memory_clock = amdgpu_dpm_get_mclk(adev, true) * 
10;
+   } else {
+   dev_info->max_engine_clock =
+   dev_info->min_engine_clock =
+   adev->clock.default_sclk * 10;
+   dev_info->max_memory_clock =
+   dev_info->min_memory_clock =
+   adev->clock.default_mclk * 10;
+   }
+   

Re: [PATCH] drm/amdgpu: refactor code to reuse system information

2024-03-19 Thread Khatri, Sunil
Validated the code by using the function the same way the ioctl would use 
it in devcoredump, and I am getting valid values.


Also, this would be the container of the information that we need to 
share between the ioctl, debugfs and devcoredump, and we will keep 
updating it based on the information needed.



On 3/19/2024 6:02 PM, Sunil Khatri wrote:

Refactor the code so debugfs and devcoredump can reuse
the common information and avoid unnecessary copy of it.

created a new file which would be the right place to
hold functions which will be used between sysfs, debugfs
and devcoredump.

Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/Makefile |   2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu.h |   1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 
  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +--
  4 files changed, 157 insertions(+), 115 deletions(-)
  create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c

diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile 
b/drivers/gpu/drm/amd/amdgpu/Makefile
index 4536c8ad0e11..05d34f4b18f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/Makefile
+++ b/drivers/gpu/drm/amd/amdgpu/Makefile
@@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o 
amdgpu_kms.o \
amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \
amdgpu_fw_attestation.o amdgpu_securedisplay.o \
amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \
-   amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o
+   amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o 
amdgpu_devinfo.o
  
  amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

index 9c62552bec34..0267870aa9b1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1609,4 +1609,5 @@ extern const struct attribute_group 
amdgpu_vram_mgr_attr_group;
  extern const struct attribute_group amdgpu_gtt_mgr_attr_group;
  extern const struct attribute_group amdgpu_flash_attr_group;
  
+int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info);

  #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
new file mode 100644
index ..d2c15a1dcb0d
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright 2024 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#include "amdgpu.h"
+#include "amd_pcie.h"
+
+#include <drm/amdgpu_drm.h>
+
+int amdgpu_device_info(struct amdgpu_device *adev, struct 
drm_amdgpu_info_device *dev_info)
+{
+   int ret;
+   uint64_t vm_size;
+   uint32_t pcie_gen_mask;
+
+   if (dev_info == NULL)
+   return -EINVAL;
+
+   dev_info->device_id = adev->pdev->device;
+   dev_info->chip_rev = adev->rev_id;
+   dev_info->external_rev = adev->external_rev_id;
+   dev_info->pci_rev = adev->pdev->revision;
+   dev_info->family = adev->family;
+   dev_info->num_shader_engines = adev->gfx.config.max_shader_engines;
+   dev_info->num_shader_arrays_per_engine = adev->gfx.config.max_sh_per_se;
+   /* return all clocks in KHz */
+   dev_info->gpu_counter_freq = amdgpu_asic_get_xclk(adev) * 10;
+   if (adev->pm.dpm_enabled) {
+   dev_info->max_engine_clock = amdgpu_dpm_get_sclk(adev, false) * 
10;
+   dev_info->max_memory_clock = amdgpu_dpm_get_mclk(adev, false) * 
10;
+   dev_info->min_engine_clock = amdgpu_dpm_get_sclk(adev, true) * 
10;
+   dev_info->min_memory_clock = amdgpu_dpm_get_mclk(adev, true) * 
10;
+   } else {
+   dev_info->max_engine_clock =
+   dev_info->min_engine_clock =
+  

RE: [bug report] drm/amdgpu: add ring buffer information in devcoredump

2024-03-18 Thread Khatri, Sunil
[AMD Official Use Only - General]

Got it. Thanks for reporting that. Sent the patch for review.

Regards
Sunil khatri

-Original Message-
From: Dan Carpenter 
Sent: Saturday, March 16, 2024 2:42 PM
To: Khatri, Sunil 
Cc: Khatri, Sunil ; Koenig, Christian 
; Deucher, Alexander ; 
amd-gfx@lists.freedesktop.org
Subject: Re: [bug report] drm/amdgpu: add ring buffer information in devcoredump

The static checker is just complaining about NULL checking that doesn't make 
sense.  It raises the question, can the pointer be NULL or not?

Based on your comments and from reviewing the code, I do not think it can be 
NULL.  Thus the correct thing is to remove the unnecessary NULL check.

regards,
dan carpenter
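
For illustration (not the posted fix), the shape of the change the
checker is asking for is to drop the misleading conditional and
dereference directly, since the pointer is known to be non-NULL:

	/* before: checked once here ... */
	if (coredump->adev) { ... }
	/* ... but dereferenced unconditionally later */
	for (int i = 0; i < coredump->adev->num_rings; i++) { ... }

	/* after: no check anywhere; coredump->adev is set when the
	 * coredump is created and cannot be NULL in the read callback */
	struct amdgpu_vm_fault_info *fault_info =
		&coredump->adev->vm_manager.fault_info;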



Re: [bug report] drm/amdgpu: add ring buffer information in devcoredump

2024-03-15 Thread Khatri, Sunil

Thanks for pointing these out. I do have some doubts, which I raised inline.

On 3/15/2024 8:46 PM, Dan Carpenter wrote:

Hello Sunil Khatri,

Commit 42742cc541bb ("drm/amdgpu: add ring buffer information in
devcoredump") from Mar 11, 2024 (linux-next), leads to the following
Smatch static checker warning:

drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c:219 amdgpu_devcoredump_read()
error: we previously assumed 'coredump->adev' could be null (see line 
206)

drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
 171 static ssize_t
 172 amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
 173                         void *data, size_t datalen)
 174 {
 175         struct drm_printer p;
 176         struct amdgpu_coredump_info *coredump = data;
 177         struct drm_print_iterator iter;
 178         int i;
 179
 180         iter.data = buffer;
 181         iter.offset = 0;
 182         iter.start = offset;
 183         iter.remain = count;
 184
 185         p = drm_coredump_printer(&iter);
 186
 187         drm_printf(&p, " AMDGPU Device Coredump \n");
 188         drm_printf(&p, "version: " AMDGPU_COREDUMP_VERSION "\n");
 189         drm_printf(&p, "kernel: " UTS_RELEASE "\n");
 190         drm_printf(&p, "module: " KBUILD_MODNAME "\n");
 191         drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec,
 192                    coredump->reset_time.tv_nsec);
 193
 194         if (coredump->reset_task_info.pid)
 195                 drm_printf(&p, "process_name: %s PID: %d\n",
 196                            coredump->reset_task_info.process_name,
 197                            coredump->reset_task_info.pid);
 198
 199         if (coredump->ring) {
 200                 drm_printf(&p, "\nRing timed out details\n");
 201                 drm_printf(&p, "IP Type: %d Ring Name: %s\n",
 202                            coredump->ring->funcs->type,
 203                            coredump->ring->name);
 204         }
 205
 206         if (coredump->adev) {
             ^^
Check for NULL

This is the check for NULL. Is there any issue here ?

 207                 struct amdgpu_vm_fault_info *fault_info =
 208                         &coredump->adev->vm_manager.fault_info;
 209
 210                 drm_printf(&p, "\n[%s] Page fault observed\n",
 211                            fault_info->vmhub ? "mmhub" : "gfxhub");
 212                 drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
 213                            fault_info->addr);
 214                 drm_printf(&p, "Protection fault status register: 0x%x\n\n",
 215                            fault_info->status);
 216         }
 217
 218         drm_printf(&p, "Ring buffer information\n");
--> 219         for (int i = 0; i < coredump->adev->num_rings; i++) {
                                    ^^
Unchecked dereference

Agree

 220                 int j = 0;
 221                 struct amdgpu_ring *ring = coredump->adev->rings[i];
 222
 223                 drm_printf(&p, "ring name: %s\n", ring->name);
 224                 drm_printf(&p, "Rptr: 0x%llx Wptr: 0x%llx RB mask: %x\n",
 225                            amdgpu_ring_get_rptr(ring),
 226                            amdgpu_ring_get_wptr(ring),
 227                            ring->buf_mask);
 228                 drm_printf(&p, "Ring size in dwords: %d\n",
 229                            ring->ring_size / 4);
 230                 drm_printf(&p, "Ring contents\n");
 231                 drm_printf(&p, "Offset \t Value\n");
 232
 233                 while (j < ring->ring_size) {
 234                         drm_printf(&p, "0x%x \t 0x%x\n", j, ring->ring[j/4]);
 235                         j += 4;
 236                 }
 237         }
 238
 239         if (coredump->reset_vram_lost)
 240                 drm_printf(&p, "VRAM is lost due to GPU reset!\n");
 241         if (coredump->adev->reset_info.num_regs) {
             ^^
Here too

Agree.

 242                 drm_printf(&p, "AMDGPU register dumps:\nOffset:     Value:\n");
 243
 244                 for (i = 0; i < coredump->adev->reset_info.num_regs; i++)
 245                         drm_printf(&p, "0x%08x: 0x%08x\n",
 246                                    coredump->adev->reset_info.reset_dump_reg_list[i],
 247                                    coredump->adev->reset_info.reset_dump_reg_value[i]);
 248         }
 249
 250         return count - iter.remain;
 251 }



Although adev is a global structure, it is never checked for NULL 
anywhere in the code, as it won't be NULL until the driver is unloaded. 
I can add a check for adev at the beginning of the function 
amdgpu_devcoredump_read for 

Re: [PATCH] drm/amdgpu: add the hw_ip version of all IP's

2024-03-15 Thread Khatri, Sunil



On 3/15/2024 6:45 PM, Alex Deucher wrote:

On Fri, Mar 15, 2024 at 8:13 AM Sunil Khatri  wrote:

Add all the IP's version information on a SOC to the
devcoredump.

Signed-off-by: Sunil Khatri 

This looks great.

Reviewed-by: Alex Deucher 


Thanks Alex




---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 62 +++
  1 file changed, 62 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index a0dbccad2f53..3d4bfe0a5a7c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -29,6 +29,43 @@
  #include "sienna_cichlid.h"
  #include "smu_v13_0_10.h"

+const char *hw_ip_names[MAX_HWIP] = {
+   [GC_HWIP]   = "GC",
+   [HDP_HWIP]  = "HDP",
+   [SDMA0_HWIP]= "SDMA0",
+   [SDMA1_HWIP]= "SDMA1",
+   [SDMA2_HWIP]= "SDMA2",
+   [SDMA3_HWIP]= "SDMA3",
+   [SDMA4_HWIP]= "SDMA4",
+   [SDMA5_HWIP]= "SDMA5",
+   [SDMA6_HWIP]= "SDMA6",
+   [SDMA7_HWIP]= "SDMA7",
+   [LSDMA_HWIP]= "LSDMA",
+   [MMHUB_HWIP]= "MMHUB",
+   [ATHUB_HWIP]= "ATHUB",
+   [NBIO_HWIP] = "NBIO",
+   [MP0_HWIP]  = "MP0",
+   [MP1_HWIP]  = "MP1",
+   [UVD_HWIP]  = "UVD/JPEG/VCN",
+   [VCN1_HWIP] = "VCN1",
+   [VCE_HWIP]  = "VCE",
+   [VPE_HWIP]  = "VPE",
+   [DF_HWIP]   = "DF",
+   [DCE_HWIP]  = "DCE",
+   [OSSSYS_HWIP]   = "OSSSYS",
+   [SMUIO_HWIP]= "SMUIO",
+   [PWR_HWIP]  = "PWR",
+   [NBIF_HWIP] = "NBIF",
+   [THM_HWIP]  = "THM",
+   [CLK_HWIP]  = "CLK",
+   [UMC_HWIP]  = "UMC",
+   [RSMU_HWIP] = "RSMU",
+   [XGMI_HWIP] = "XGMI",
+   [DCI_HWIP]  = "DCI",
+   [PCIE_HWIP] = "PCIE",
+};
+
+
  int amdgpu_reset_init(struct amdgpu_device *adev)
  {
 int ret = 0;
@@ -196,6 +233,31 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
coredump->reset_task_info.process_name,
coredump->reset_task_info.pid);

+   /* GPU IP's information of the SOC */
+   if (coredump->adev) {
+
+   drm_printf(&p, "\nIP Information\n");
+   drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
+   drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
+   drm_printf(&p, "SOC External Revision id: %d\n",
+  coredump->adev->external_rev_id);
+
+   for (int i = 1; i < MAX_HWIP; i++) {
+   for (int j = 0; j < HWIP_MAX_INSTANCE; j++) {
+   int ver = coredump->adev->ip_versions[i][j];
+
+   if (ver)
+   drm_printf(&p, "HWIP: %s[%d][%d]: v%d.%d.%d.%d.%d\n",
+  hw_ip_names[i], i, j,
+  IP_VERSION_MAJ(ver),
+  IP_VERSION_MIN(ver),
+  IP_VERSION_REV(ver),
+  IP_VERSION_VARIANT(ver),
+  IP_VERSION_SUBREV(ver));
+   }
+   }
+   }
+
 if (coredump->ring) {
 drm_printf(&p, "\nRing timed out details\n");
 drm_printf(&p, "IP Type: %d Ring Name: %s\n",
--
2.34.1



RE: [PATCH] drm/amdgpu: add the hw_ip version of all IP's

2024-03-15 Thread Khatri, Sunil
[AMD Official Use Only - General]

Hello Alex

Added the information directly from the ip_version and also added names for 
each ip so the version information makes more sense to the user.

Below is the output in devcoredump now:
IP Information
SOC Family: 143
SOC Revision id: 0
SOC External Revision id: 50
HWIP: GC[1][0]: v10.3.2.0.0
HWIP: HDP[2][0]: v5.0.3.0.0
HWIP: SDMA0[3][0]: v5.2.2.0.0
HWIP: SDMA1[4][0]: v5.2.2.0.0
HWIP: MMHUB[12][0]: v2.1.0.0.0
HWIP: ATHUB[13][0]: v2.1.0.0.0
HWIP: NBIO[14][0]: v3.3.1.0.0
HWIP: MP0[15][0]: v11.0.11.0.0
HWIP: MP1[16][0]: v11.0.11.0.0
HWIP: UVD/JPEG/VCN[17][0]: v3.0.0.0.0
HWIP: UVD/JPEG/VCN[17][1]: v3.0.1.0.0
HWIP: DF[21][0]: v3.7.3.0.0
HWIP: DCE[22][0]: v3.0.0.0.0
HWIP: OSSSYS[23][0]: v5.0.3.0.0
HWIP: SMUIO[24][0]: v11.0.6.0.0
HWIP: NBIF[26][0]: v3.3.1.0.0
HWIP: THM[27][0]: v11.0.5.0.0
HWIP: CLK[28][0]: v11.0.3.0.0
HWIP: CLK[28][1]: v11.0.3.0.0
HWIP: CLK[28][2]: v11.0.3.0.0
HWIP: CLK[28][3]: v11.0.3.0.0
HWIP: CLK[28][4]: v11.0.3.0.0
HWIP: CLK[28][5]: v11.0.3.0.0
HWIP: CLK[28][6]: v11.0.3.0.0
HWIP: CLK[28][7]: v11.0.3.0.0
HWIP: UMC[29][0]: v8.7.1.0.0
HWIP: UMC[29][1]: v8.7.1.0.0
HWIP: UMC[29][2]: v8.7.1.0.0
HWIP: UMC[29][3]: v8.7.1.0.0
HWIP: UMC[29][4]: v8.7.1.0.0
HWIP: UMC[29][5]: v8.7.1.0.0
HWIP: PCIE[33][0]: v6.5.0.0.0
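
For reference, each v<maj>.<min>.<rev>.<variant>.<subrev> string above is
unpacked from a single packed integer per IP instance. A sketch of the
assumed bit layout behind the IP_VERSION_* macros (see amdgpu.h for the
authoritative definitions):

	/* assumed packing: 8 bits major, 8 minor, 8 rev, 4 variant, 4 subrev */
	#define IP_VERSION_MAJ(ver)     ((ver) >> 24)
	#define IP_VERSION_MIN(ver)     (((ver) >> 16) & 0xFF)
	#define IP_VERSION_REV(ver)     (((ver) >> 8) & 0xFF)
	#define IP_VERSION_VARIANT(ver) (((ver) >> 4) & 0xF)
	#define IP_VERSION_SUBREV(ver)  ((ver) & 0xF)

	/* e.g. GC v10.3.2.0.0 would then be stored as
	 * (10 << 24) | (3 << 16) | (2 << 8) | (0 << 4) | 0 */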


-Original Message-
From: Sunil Khatri 
Sent: Friday, March 15, 2024 5:43 PM
To: Deucher, Alexander ; Koenig, Christian 
; Sharma, Shashank 
Cc: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org; 
linux-ker...@vger.kernel.org; Khatri, Sunil 
Subject: [PATCH] drm/amdgpu: add the hw_ip version of all IP's

Add all the IP's version information on a SOC to the devcoredump.

Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 62 +++
 1 file changed, 62 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index a0dbccad2f53..3d4bfe0a5a7c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -29,6 +29,43 @@
 #include "sienna_cichlid.h"
 #include "smu_v13_0_10.h"

+const char *hw_ip_names[MAX_HWIP] = {
+   [GC_HWIP]   = "GC",
+   [HDP_HWIP]  = "HDP",
+   [SDMA0_HWIP]= "SDMA0",
+   [SDMA1_HWIP]= "SDMA1",
+   [SDMA2_HWIP]= "SDMA2",
+   [SDMA3_HWIP]= "SDMA3",
+   [SDMA4_HWIP]= "SDMA4",
+   [SDMA5_HWIP]= "SDMA5",
+   [SDMA6_HWIP]= "SDMA6",
+   [SDMA7_HWIP]= "SDMA7",
+   [LSDMA_HWIP]= "LSDMA",
+   [MMHUB_HWIP]= "MMHUB",
+   [ATHUB_HWIP]= "ATHUB",
+   [NBIO_HWIP] = "NBIO",
+   [MP0_HWIP]  = "MP0",
+   [MP1_HWIP]  = "MP1",
+   [UVD_HWIP]  = "UVD/JPEG/VCN",
+   [VCN1_HWIP] = "VCN1",
+   [VCE_HWIP]  = "VCE",
+   [VPE_HWIP]  = "VPE",
+   [DF_HWIP]   = "DF",
+   [DCE_HWIP]  = "DCE",
+   [OSSSYS_HWIP]   = "OSSSYS",
+   [SMUIO_HWIP]= "SMUIO",
+   [PWR_HWIP]  = "PWR",
+   [NBIF_HWIP] = "NBIF",
+   [THM_HWIP]  = "THM",
+   [CLK_HWIP]  = "CLK",
+   [UMC_HWIP]  = "UMC",
+   [RSMU_HWIP] = "RSMU",
+   [XGMI_HWIP] = "XGMI",
+   [DCI_HWIP]  = "DCI",
+   [PCIE_HWIP] = "PCIE",
+};
+
+
 int amdgpu_reset_init(struct amdgpu_device *adev)  {
int ret = 0;
@@ -196,6 +233,31 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
   coredump->reset_task_info.process_name,
   coredump->reset_task_info.pid);

+   /* GPU IP's information of the SOC */
+   if (coredump->adev) {
+
+   drm_printf(&p, "\nIP Information\n");
+   drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
+   drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
+   drm_printf(&p, "SOC External Revision id: %d\n",
+  coredump->adev->external_rev_id);
+
+   for (int i = 1; i < MAX_HWIP; i++) {
+   for (int j = 0; j < HWIP_MAX_INSTANCE; j++) {
+   int ver = coredump->adev->ip_versions[i][j];
+
+   if (ver)
+ 

Re: [PATCH 1/2] drm/amdgpu: add the IP information of the soc

2024-03-14 Thread Khatri, Sunil



On 3/14/2024 8:12 PM, Alex Deucher wrote:

On Thu, Mar 14, 2024 at 1:44 AM Khatri, Sunil  wrote:


On 3/14/2024 1:58 AM, Alex Deucher wrote:

On Tue, Mar 12, 2024 at 8:41 AM Sunil Khatri  wrote:

Add all the IP's information on a SOC to the
devcoredump.

Signed-off-by: Sunil Khatri 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++
   1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index a0dbccad2f53..611fdb90a1fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
 coredump->reset_task_info.process_name,
 coredump->reset_task_info.pid);

+   /* GPU IP's information of the SOC */
+   if (coredump->adev) {
+   drm_printf(&p, "\nIP Information\n");
+   drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
+   drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
+
+   for (int i = 0; i < coredump->adev->num_ip_blocks; i++) {
+   struct amdgpu_ip_block *ip =
+   &coredump->adev->ip_blocks[i];
+   drm_printf(&p, "IP type: %d IP name: %s\n",
+  ip->version->type,
+  ip->version->funcs->name);
+   drm_printf(&p, "IP version: (%d,%d,%d)\n\n",
+  ip->version->major,
+  ip->version->minor,
+  ip->version->rev);
+   }
+   }

I think the IP discovery table would be more useful.  Either walk the
adev->ip_versions structure, or just include the IP discovery binary.

I did explore adev->ip_versions, and if I just go through the array
it doesn't give any useful information directly.
There is no way to find the following directly from adev->ip_versions
unless I also reparse the discovery binary, as done in
amdgpu_discovery_reg_base_init, and walk through the headers of the
various IPs using the discovery binary:
a. which IPs are available on the SOC or not;
b. how many instances there are.
Also, I again have to change back to the major, minor and rev convention
for this information to be useful. I am exploring it more; if I find
some other information I will update.

adev->ip_block[] is derived from IP discovery only, for each block which
is present on the SOC, so we are not reading information which isn't
applicable for the SOC. We have the name, type and version number of the
IPs available on the SOC. If you want, I could add the number of instances
of each IP too, if you think that's useful information here. Could you
share what information is missing in this approach so I can include that?

I was hoping to get the actual IP versions for the IPs from IP
discovery rather than the versions from the ip_block array.  The
latter are common so you can end up with the same version used across
a wide variety of chips (e.g., all gfx10.x based chips use the same
gfx 10 IP code even if the actual IP version is different for most of
the chips).

Got it. Let me check how it could rightly be done.



For dumping the IP discovery binary, I don't understand how that
information would be useful directly, since it needs to be decoded like
we are doing in discovery init. Please correct me if my understanding is
wrong here.

It's probably not a high priority, I was just thinking it might be
useful to have in case there ended up being some problem related to
the IP discovery table on some boards.  E.g., we'd know that all
boards with a certain harvest config seem to align with a reported
problem.  Similar for vbios.  It's more for telemetry.  E.g., all the
boards reporting some problem have a particular powerplay config or
whatever.

I got it.
But there are two points of contention here in my understanding. The dump
works only when there is a reset, and I am not sure whether it could be
used very early in development or not. The second point is that
devcoredump is 4096 bytes/4 KB of memory into which we are dumping all the
information. Not sure if that could be increased, but it might not be
enough if we are planning to dump everything into it.


Another point: since we have sysfs/debugfs/info ioctl etc. information
available, we should sort out what is really helpful in debugging a GPU
hang, and add that to devcore.


Regards
Sunil



Alex



Alex


+
  if (coredump->ring) {
  drm_printf(&p, "\nRing timed out details\n");
  drm_printf(&p, "IP Type: %d Ring Name: %s\n",
--
2.34.1



Re: [PATCH 2/2] drm:amdgpu: add firmware information of all IP's

2024-03-14 Thread Khatri, Sunil



On 3/14/2024 11:40 AM, Sharma, Shashank wrote:


On 14/03/2024 06:58, Khatri, Sunil wrote:


On 3/14/2024 2:06 AM, Alex Deucher wrote:
On Tue, Mar 12, 2024 at 8:42 AM Sunil Khatri  
wrote:

Add firmware version information of each
IP and each instance where applicable.


Is there a way we can share some common code with devcoredump,
debugfs, and the info IOCTL?  All three places need to query this
information and the same logic is repeated in each case.


Hello Alex,

Yes, you're absolutely right: the same information is being retrieved 
again as is done in debugfs. I can reorganize the code so the same 
function could be used by debugfs and devcoredump, but this is exactly 
what I tried to avoid here. I did try to use minimum functionality in 
devcoredump without shuffling a lot of code here and there.


Also, our devcoredump is implemented in amdgpu_reset.c, and not all the 
information is available there; we might have to include a lot of headers 
and cross-functions in amdgpu_reset unless we want a dedicated file for 
devcoredump.



I think Alex is suggesting to have one common backend to generate all 
the core debug info, and then different wrapper functions which can 
pack this raw info into the packets aligned with respective infra like 
devcore/debugfs/info IOCTL, which seems like a good idea to me.


If you think you need a new file for this backend it should be fine.
My suggestion was along the same lines: if we want to use the same infra 
to access information from different parts of the code, then we need to 
reorganize. At the same time, since there is quite some data we are 
planning to add to devcoredump, I recommend having a dedicated .c/.h 
instead of using amdgpu_reset.c, so a clean include is easy to maintain.


Once Alex confirms it i can start working on design and what all 
information we need on this.


Regards
Sunil



something like:

amdgpu_debug_core.c::

struct amdgpu_core_debug_info {

/* Superset of all the info you are collecting from HW */

};

- amdgpu_debug_generate_core_info

{

/* This function collects the core debug info from HW and saves in 
amdgpu_core_debug_info,


  we can update this periodically regardless of a request */

}

and then:

devcore_info *amdgpu_debug_pack_for_devcore(core_debug_info)

{

/* convert core debug info into devcore aligned format/data */

}

ioctl_info *amdgpu_debug_pack_for_info_ioctl(core_debug_info)

{

/* convert core debug info into info IOCTL aligned format/data */

}

debugfs_info *amdgpu_debug_pack_for_debugfs(core_debug_info)

{

/* convert core debug info into debugfs aligned format/data */

}

- Shashank
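
For concreteness (all names hypothetical, not from any posted patch), a
minimal C sketch of this collect-once, format-per-consumer idea:

	/* one backend struct holding the superset of debug info */
	struct amdgpu_core_debug_info {
		uint32_t family;
		uint32_t rev_id;
		/* ... everything the consumers need ... */
	};

	/* the only place that touches driver state */
	static void amdgpu_debug_collect(struct amdgpu_device *adev,
					 struct amdgpu_core_debug_info *info)
	{
		info->family = adev->family;
		info->rev_id = adev->rev_id;
	}

	/* thin per-consumer formatter, e.g. for devcoredump */
	static void amdgpu_debug_print_devcore(struct drm_printer *p,
					       const struct amdgpu_core_debug_info *info)
	{
		drm_printf(p, "SOC Family: %u\n", info->family);
		drm_printf(p, "SOC Revision id: %u\n", info->rev_id);
	}

The debugfs and info-IOCTL paths would add their own formatters over the
same struct instead of re-querying the hardware.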





The info IOCTL does have a lot of information which is also in the 
pipeline to be dumped, but if we want to reuse the functionality of the 
IOCTL we need to reorganize a lot of code.


If that is the need of the hour, I could work on that. Please let me 
know.


This is my suggestion if it makes sense:

1. If we want to reuse a lot of functionality, then we need to 
modularize some of the functions further so they can be consumed 
directly by devcoredump.
2. We should also have a dedicated file, devcoredump.c/.h, so it's 
easy to include headers of the needed functionality cleanly and easy to 
expand devcoredump.
3. Based on the priority and importance of this task we can add 
information; otherwise some repetition is a real possibility.




Alex



Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 122 
++

  1 file changed, 122 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c

index 611fdb90a1fc..78ddc58aef67 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -168,6 +168,123 @@ void amdgpu_coredump(struct amdgpu_device 
*adev, bool vram_lost,

  {
  }
  #else
+static void amdgpu_devcoredump_fw_info(struct amdgpu_device *adev, 
struct drm_printer *p)

+{
+   uint32_t version;
+   uint32_t feature;
+   uint8_t smu_program, smu_major, smu_minor, smu_debug;
+
+   drm_printf(p, "VCE feature version: %u, fw version: 0x%08x\n",
+  adev->vce.fb_version, adev->vce.fw_version);
+   drm_printf(p, "UVD feature version: %u, fw version: 0x%08x\n",
+  0, adev->uvd.fw_version);
+   drm_printf(p, "GMC feature version: %u, fw version: 0x%08x\n",
+  0, adev->gmc.fw_version);
+   drm_printf(p, "ME feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.me_feature_version, 
adev->gfx.me_fw_version);

+   drm_printf(p, "PFP feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.pfp_feature_version, 
adev->gfx.pfp_fw_version);

+   drm_printf(p, "CE feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.ce_feature_version, 
adev->gfx.ce_fw_version);

+   

Re: [PATCH 2/2] drm:amdgpu: add firmware information of all IP's

2024-03-13 Thread Khatri, Sunil



On 3/14/2024 2:06 AM, Alex Deucher wrote:

On Tue, Mar 12, 2024 at 8:42 AM Sunil Khatri  wrote:

Add firmware version information of each
IP and each instance where applicable.


Is there a way we can share some common code with devcoredump,
debugfs, and the info IOCTL?  All three places need to query this
information and the same logic is repeated in each case.


Hello Alex,

Yes, you're absolutely right: the same information is being retrieved 
again as is done in debugfs. I can reorganize the code so the same 
function could be used by debugfs and devcoredump, but this is exactly 
what I tried to avoid here. I did try to use minimum functionality in 
devcoredump without shuffling a lot of code here and there.


Also, our devcoredump is implemented in amdgpu_reset.c, and not all the 
information is available there; we might have to include a lot of headers 
and cross-functions in amdgpu_reset unless we want a dedicated file for 
devcoredump.


The info IOCTL does have a lot of information which is also in the 
pipeline to be dumped, but if we want to reuse the functionality of the 
IOCTL we need to reorganize a lot of code.


If that is the need of the hour, I could work on that. Please let me know.

This is my suggestion if it makes sense:

1. If we want to reuse a lot of functionality, then we need to modularize 
some of the functions further so they can be consumed directly by 
devcoredump.
2. We should also have a dedicated file, devcoredump.c/.h, so it's easy 
to include headers of the needed functionality cleanly and easy to expand 
devcoredump.
3. Based on the priority and importance of this task we can add 
information; otherwise some repetition is a real possibility.




Alex



Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 122 ++
  1 file changed, 122 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 611fdb90a1fc..78ddc58aef67 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -168,6 +168,123 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool 
vram_lost,
  {
  }
  #else
+static void amdgpu_devcoredump_fw_info(struct amdgpu_device *adev, struct 
drm_printer *p)
+{
+   uint32_t version;
+   uint32_t feature;
+   uint8_t smu_program, smu_major, smu_minor, smu_debug;
+
+   drm_printf(p, "VCE feature version: %u, fw version: 0x%08x\n",
+  adev->vce.fb_version, adev->vce.fw_version);
+   drm_printf(p, "UVD feature version: %u, fw version: 0x%08x\n",
+  0, adev->uvd.fw_version);
+   drm_printf(p, "GMC feature version: %u, fw version: 0x%08x\n",
+  0, adev->gmc.fw_version);
+   drm_printf(p, "ME feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.me_feature_version, adev->gfx.me_fw_version);
+   drm_printf(p, "PFP feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.pfp_feature_version, adev->gfx.pfp_fw_version);
+   drm_printf(p, "CE feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.ce_feature_version, adev->gfx.ce_fw_version);
+   drm_printf(p, "RLC feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_feature_version, adev->gfx.rlc_fw_version);
+
+   drm_printf(p, "RLC SRLC feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_srlc_feature_version,
+  adev->gfx.rlc_srlc_fw_version);
+   drm_printf(p, "RLC SRLG feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_srlg_feature_version,
+  adev->gfx.rlc_srlg_fw_version);
+   drm_printf(p, "RLC SRLS feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_srls_feature_version,
+  adev->gfx.rlc_srls_fw_version);
+   drm_printf(p, "RLCP feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlcp_ucode_feature_version,
+  adev->gfx.rlcp_ucode_version);
+   drm_printf(p, "RLCV feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlcv_ucode_feature_version,
+  adev->gfx.rlcv_ucode_version);
+   drm_printf(p, "MEC feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.mec_feature_version,
+  adev->gfx.mec_fw_version);
+
+   if (adev->gfx.mec2_fw)
+   drm_printf(p,
+  "MEC2 feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.mec2_feature_version,
+  adev->gfx.mec2_fw_version);
+
+   drm_printf(p, "IMU feature version: %u, fw version: 0x%08x\n",
+  0, adev->gfx.imu_fw_version);
+   drm_printf(p, "PSP SOS feature version: %u, fw version: 0x%08x\n",
+  adev->psp.sos.feature_version,
+  adev->psp.sos.fw_version);
+   drm_printf(p, "PSP ASD 

Re: [PATCH 1/2] drm/amdgpu: add the IP information of the soc

2024-03-13 Thread Khatri, Sunil



On 3/14/2024 1:58 AM, Alex Deucher wrote:

On Tue, Mar 12, 2024 at 8:41 AM Sunil Khatri  wrote:

Add all the IP's information on a SOC to the
devcoredump.

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++
  1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index a0dbccad2f53..611fdb90a1fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
coredump->reset_task_info.process_name,
coredump->reset_task_info.pid);

+   /* GPU IP's information of the SOC */
+   if (coredump->adev) {
+   drm_printf(&p, "\nIP Information\n");
+   drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
+   drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
+
+   for (int i = 0; i < coredump->adev->num_ip_blocks; i++) {
+   struct amdgpu_ip_block *ip =
+   &coredump->adev->ip_blocks[i];
+   drm_printf(&p, "IP type: %d IP name: %s\n",
+  ip->version->type,
+  ip->version->funcs->name);
+   drm_printf(&p, "IP version: (%d,%d,%d)\n\n",
+  ip->version->major,
+  ip->version->minor,
+  ip->version->rev);
+   }
+   }

I think the IP discovery table would be more useful.  Either walk the
adev->ip_versions structure, or just include the IP discovery binary.


I did explore adev->ip_versions, and if I just go through the array
it doesn't give any useful information directly.
There is no way to find the following directly from adev->ip_versions
unless I also reparse the discovery binary, as done in
amdgpu_discovery_reg_base_init, and walk through the headers of the
various IPs using the discovery binary:

a. which IPs are available on the SOC or not;
b. how many instances there are.
Also, I again have to change back to the major, minor and rev convention
for this information to be useful. I am exploring it more; if I find
some other information I will update.


adev->ip_block[] is derived from IP discovery only, for each block which 
is present on the SOC, so we are not reading information which isn't 
applicable for the SOC. We have the name, type and version number of the 
IPs available on the SOC. If you want, I could add the number of instances 
of each IP too, if you think that's useful information here. Could you 
share what information is missing in this approach so I can include that?


For dumping the IP discovery binary, I don't understand how that 
information would be useful directly, since it needs to be decoded like 
we are doing in discovery init. Please correct me if my understanding is 
wrong here.

Alex


+
 if (coredump->ring) {
 drm_printf(&p, "\nRing timed out details\n");
 drm_printf(&p, "IP Type: %d Ring Name: %s\n",
--
2.34.1



Re: [PATCH 2/2] drm:amdgpu: add firmware information of all IP's

2024-03-13 Thread Khatri, Sunil
[AMD Official Use Only - General]

Gentle reminder

Regards
Sunil

Get Outlook for Android<https://aka.ms/AAb9ysg>

From: Sunil Khatri 
Sent: Tuesday, March 12, 2024 6:11:48 PM
To: Deucher, Alexander ; Koenig, Christian 
; Sharma, Shashank 
Cc: amd-gfx@lists.freedesktop.org ; 
dri-de...@lists.freedesktop.org ; 
linux-ker...@vger.kernel.org ; Khatri, Sunil 

Subject: [PATCH 2/2] drm:amdgpu: add firmware information of all IP's

Add firmware version information of each
IP and each instance where applicable.

Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 122 ++
 1 file changed, 122 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 611fdb90a1fc..78ddc58aef67 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -168,6 +168,123 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool 
vram_lost,
 {
 }
 #else
+static void amdgpu_devcoredump_fw_info(struct amdgpu_device *adev, struct 
drm_printer *p)
+{
+   uint32_t version;
+   uint32_t feature;
+   uint8_t smu_program, smu_major, smu_minor, smu_debug;
+
+   drm_printf(p, "VCE feature version: %u, fw version: 0x%08x\n",
+  adev->vce.fb_version, adev->vce.fw_version);
+   drm_printf(p, "UVD feature version: %u, fw version: 0x%08x\n",
+  0, adev->uvd.fw_version);
+   drm_printf(p, "GMC feature version: %u, fw version: 0x%08x\n",
+  0, adev->gmc.fw_version);
+   drm_printf(p, "ME feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.me_feature_version, adev->gfx.me_fw_version);
+   drm_printf(p, "PFP feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.pfp_feature_version, adev->gfx.pfp_fw_version);
+   drm_printf(p, "CE feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.ce_feature_version, adev->gfx.ce_fw_version);
+   drm_printf(p, "RLC feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_feature_version, adev->gfx.rlc_fw_version);
+
+   drm_printf(p, "RLC SRLC feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_srlc_feature_version,
+  adev->gfx.rlc_srlc_fw_version);
+   drm_printf(p, "RLC SRLG feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_srlg_feature_version,
+  adev->gfx.rlc_srlg_fw_version);
+   drm_printf(p, "RLC SRLS feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlc_srls_feature_version,
+  adev->gfx.rlc_srls_fw_version);
+   drm_printf(p, "RLCP feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlcp_ucode_feature_version,
+  adev->gfx.rlcp_ucode_version);
+   drm_printf(p, "RLCV feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.rlcv_ucode_feature_version,
+  adev->gfx.rlcv_ucode_version);
+   drm_printf(p, "MEC feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.mec_feature_version,
+  adev->gfx.mec_fw_version);
+
+   if (adev->gfx.mec2_fw)
+   drm_printf(p,
+  "MEC2 feature version: %u, fw version: 0x%08x\n",
+  adev->gfx.mec2_feature_version,
+  adev->gfx.mec2_fw_version);
+
+   drm_printf(p, "IMU feature version: %u, fw version: 0x%08x\n",
+  0, adev->gfx.imu_fw_version);
+   drm_printf(p, "PSP SOS feature version: %u, fw version: 0x%08x\n",
+  adev->psp.sos.feature_version,
+  adev->psp.sos.fw_version);
+   drm_printf(p, "PSP ASD feature version: %u, fw version: 0x%08x\n",
+  adev->psp.asd_context.bin_desc.feature_version,
+  adev->psp.asd_context.bin_desc.fw_version);
+
+   drm_printf(p, "TA XGMI feature version: 0x%08x, fw version: 0x%08x\n",
+  adev->psp.xgmi_context.context.bin_desc.feature_version,
+  adev->psp.xgmi_context.context.bin_desc.fw_version);
+   drm_printf(p, "TA RAS feature version: 0x%08x, fw version: 0x%08x\n",
+  adev->psp.ras_context.context.bin_desc.feature_version,
+  adev->psp.ras_context.context.bin_desc.fw_version);
+   drm_printf(p, "TA HDCP feature version: 0x%08x, fw version: 0x%08x\n",
+  adev->psp.hdcp_context.context.bin_desc.feature_version,
+  adev->psp.hdcp_context.context.b

Re: [PATCH 1/2] drm/amdgpu: add the IP information of the soc

2024-03-13 Thread Khatri, Sunil
[AMD Official Use Only - General]

Gentle reminder for review.

Regards
Sunil

Get Outlook for Android<https://aka.ms/AAb9ysg>

From: Sunil Khatri 
Sent: Tuesday, March 12, 2024 6:11:47 PM
To: Deucher, Alexander ; Koenig, Christian 
; Sharma, Shashank 
Cc: amd-gfx@lists.freedesktop.org ; 
dri-de...@lists.freedesktop.org ; 
linux-ker...@vger.kernel.org ; Khatri, Sunil 

Subject: [PATCH 1/2] drm/amdgpu: add the IP information of the soc

Add all the IP's information on a SOC to the
devcoredump.

Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index a0dbccad2f53..611fdb90a1fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
coredump->reset_task_info.process_name,
coredump->reset_task_info.pid);

+   /* GPU IP's information of the SOC */
+   if (coredump->adev) {
+   drm_printf(&p, "\nIP Information\n");
+   drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
+   drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
+
+   for (int i = 0; i < coredump->adev->num_ip_blocks; i++) {
+   struct amdgpu_ip_block *ip =
+   &coredump->adev->ip_blocks[i];
+   drm_printf(&p, "IP type: %d IP name: %s\n",
+  ip->version->type,
+  ip->version->funcs->name);
+   drm_printf(&p, "IP version: (%d,%d,%d)\n\n",
+  ip->version->major,
+  ip->version->minor,
+  ip->version->rev);
+   }
+   }
+
 if (coredump->ring) {
 drm_printf(&p, "\nRing timed out details\n");
 drm_printf(&p, "IP Type: %d Ring Name: %s\n",
--
2.34.1



Re: [PATCH 1/2] drm/amdgpu: add the IP information of the soc

2024-03-13 Thread Khatri, Sunil
[AMD Official Use Only - General]

Gentle Reminder for review.

Regards,
Sunil

Get Outlook for Android<https://aka.ms/AAb9ysg>

From: Sunil Khatri 
Sent: Tuesday, March 12, 2024 6:11:47 PM
To: Deucher, Alexander ; Koenig, Christian 
; Sharma, Shashank 
Cc: amd-gfx@lists.freedesktop.org ; 
dri-de...@lists.freedesktop.org ; 
linux-ker...@vger.kernel.org ; Khatri, Sunil 

Subject: [PATCH 1/2] drm/amdgpu: add the IP information of the soc

Add all the IP's information on a SOC to the
devcoredump.

Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index a0dbccad2f53..611fdb90a1fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
coredump->reset_task_info.process_name,
coredump->reset_task_info.pid);

+   /* GPU IP's information of the SOC */
+   if (coredump->adev) {
+   drm_printf(&p, "\nIP Information\n");
+   drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
+   drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
+
+   for (int i = 0; i < coredump->adev->num_ip_blocks; i++) {
+   struct amdgpu_ip_block *ip =
+   &coredump->adev->ip_blocks[i];
+   drm_printf(&p, "IP type: %d IP name: %s\n",
+  ip->version->type,
+  ip->version->funcs->name);
+   drm_printf(&p, "IP version: (%d,%d,%d)\n\n",
+  ip->version->major,
+  ip->version->minor,
+  ip->version->rev);
+   }
+   }
+
 if (coredump->ring) {
 drm_printf(&p, "\nRing timed out details\n");
 drm_printf(&p, "IP Type: %d Ring Name: %s\n",
--
2.34.1



Re: [PATCH] drm/amdgpu: add ring buffer information in devcoredump

2024-03-11 Thread Khatri, Sunil



On 3/11/2024 7:29 PM, Christian König wrote:



On 11.03.24 at 13:22, Sunil Khatri wrote:

Add relevant ringbuffer information such as
rptr, wptr, ring name, ring size and also
the ring contents for each ring on a gpu reset.

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 21 +
  1 file changed, 21 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c

index 6d059f853adc..1992760039da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -215,6 +215,27 @@ amdgpu_devcoredump_read(char *buffer, loff_t 
offset, size_t count,

 fault_info->status);
  }
+    drm_printf(&p, "Ring buffer information\n");
+    for (int i = 0; i < coredump->adev->num_rings; i++) {
+    int j = 0;
+    struct amdgpu_ring *ring = coredump->adev->rings[i];
+
+    drm_printf(&p, "ring name: %s\n", ring->name);
+    drm_printf(&p, "Rptr: 0x%llx Wptr: 0x%llx\n",
+   amdgpu_ring_get_rptr(ring) & ring->buf_mask,
+   amdgpu_ring_get_wptr(ring) & ring->buf_mask);


Don't apply the mask here. We do have some use cases where the rptr 
and wptr are outside the ring buffer.

Sure, I will remove the mask.



+    drm_printf(&p, "Ring size in dwords: %d\n",
+   ring->ring_size / 4);


Rather print the mask as an additional value here.

Does it help to add the mask value?
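
As an illustration of the suggestion (example values, not from the
patch): if the raw, unmasked pointers are printed together with the
mask, a reader of the dump can still recover the ring index offline:

	uint64_t rptr = 0x4980;      /* raw Rptr as printed in the dump */
	uint32_t buf_mask = 0xfff;   /* ring->buf_mask, printed alongside */

	uint32_t ring_index = rptr & buf_mask; /* dword index into the dumped ring */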



+    drm_printf(&p, "Ring contents\n");
+    drm_printf(&p, "Offset \t Value\n");
+
+    while (j < ring->ring_size) {
+    drm_printf(&p, "0x%x \t 0x%x\n", j, ring->ring[j/4]);
+    j += 4;
+    }



+    drm_printf(&p, "Ring dumped\n");


That seems superfluous.


Noted


Regards
Sunil



Regards,
Christian.


+    }
+
  if (coredump->reset_vram_lost)
  drm_printf(&p, "VRAM is lost due to GPU reset!\n");
  if (coredump->adev->reset_info.num_regs) {




RE: [PATCH] drm/amdgpu: add all ringbuffer information in devcoredump

2024-03-11 Thread Khatri, Sunil
Ignore this, as I updated the commit message and subject and am sending a new mail.


-Original Message-
From: Sunil Khatri  
Sent: Monday, March 11, 2024 5:04 PM
To: Deucher, Alexander ; Koenig, Christian 
; Sharma, Shashank 
Cc: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org; 
linux-ker...@vger.kernel.org; Khatri, Sunil 
Subject: [PATCH] drm/amdgpu: add all ringbuffer information in devcoredump

Add ringbuffer information such as:
rptr, wptr, ring name, ring size and also the ring contents for each ring on a 
gpu reset.

Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 6d059f853adc..1992760039da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -215,6 +215,27 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
   fault_info->status);
}
 
+   drm_printf(, "Ring buffer information\n");
+   for (int i = 0; i < coredump->adev->num_rings; i++) {
+   int j = 0;
+   struct amdgpu_ring *ring = coredump->adev->rings[i];
+
+   drm_printf(, "ring name: %s\n", ring->name);
+   drm_printf(, "Rptr: 0x%llx Wptr: 0x%llx\n",
+  amdgpu_ring_get_rptr(ring) & ring->buf_mask,
+  amdgpu_ring_get_wptr(ring) & ring->buf_mask);
+   drm_printf(, "Ring size in dwords: %d\n",
+  ring->ring_size / 4);
+   drm_printf(, "Ring contents\n");
+   drm_printf(, "Offset \t Value\n");
+
+   while (j < ring->ring_size) {
+   drm_printf(, "0x%x \t 0x%x\n", j, ring->ring[j/4]);
+   j += 4;
+   }
+   drm_printf(, "Ring dumped\n");
+   }
+
if (coredump->reset_vram_lost)
drm_printf(, "VRAM is lost due to GPU reset!\n");
if (coredump->adev->reset_info.num_regs) {
--
2.34.1



Re: [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump

2024-03-08 Thread Khatri, Sunil



On 3/8/2024 2:39 PM, Christian König wrote:

Am 07.03.24 um 21:50 schrieb Sunil Khatri:

Add page fault information to the devcoredump.

Output of devcoredump:
**** AMDGPU Device Coredump ****
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

[gfxhub] Page fault observed
Faulty page starting at address: 0x
Protection fault status register: 0x301031

VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +-
  1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c

index 147100c27c2d..8794a3c21176 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t 
offset, size_t count,

 coredump->ring->name);
  }
  +    if (coredump->adev) {
+    struct amdgpu_vm_fault_info *fault_info =
+    &coredump->adev->vm_manager.fault_info;
+
+    drm_printf(&p, "\n[%s] Page fault observed\n",
+   fault_info->vmhub ? "mmhub" : "gfxhub");
+    drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
+   fault_info->addr);
+    drm_printf(&p, "Protection fault status register: 0x%x\n",
+   fault_info->status);
+    }
+
  if (coredump->reset_vram_lost)
-    drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+    drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");


Why this additional new line?
The intent is that the devcoredump has its different sections clearly demarcated
with a new line; otherwise "VRAM is lost due to GPU reset!" seems part of the
page fault information.

[gfxhub] Page fault observed
Faulty page starting at address: 0x
Protection fault status register: 0x301031

VRAM is lost due to GPU reset!

Regards
Sunil



Apart from that looks really good to me.

Regards,
Christian.


  if (coredump->adev->reset_info.num_regs) {
  drm_printf(, "AMDGPU register dumps:\nOffset: 
Value:\n");




Re: [PATCH 2/2] drm/amdgpu: add vm fault information to devcoredump

2024-03-07 Thread Khatri, Sunil



On 3/8/2024 12:44 AM, Alex Deucher wrote:

On Thu, Mar 7, 2024 at 12:00 PM Sunil Khatri  wrote:

Add page fault information to the devcoredump.

Output of devcoredump:
**** AMDGPU Device Coredump ****
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

[gfxhub] Page fault observed
Faulty page starting at address 0x

Do you want a : before the address for consistency?

sure.



Protection fault status register:0x301031

How about a space after the : for consistency?

For parsability, it may make more sense to just have a list of key value pairs:
[GPU page fault]
hub:
addr:
status:
[Ring timeout details]
IP:
ring:
name:

etc.


Sure, I agree, but till now I was capturing the information the way we share it
in dmesg, which is user readable. But surely, once we have enough data, I could
arrange it all in key: value pairs like you suggest in a later patch, if
that works?
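
For illustration, the parsable layout could look roughly like this (a
sketch following Alex's key: value example above; the section and key
names are illustrative only):

/* Sketch: key/value sections for easier parsing */
drm_printf(&p, "[GPU page fault]\n");
drm_printf(&p, "hub: %s\n", fault_info->vmhub ? "mmhub" : "gfxhub");
drm_printf(&p, "addr: 0x%016llx\n", fault_info->addr);
drm_printf(&p, "status: 0x%x\n", fault_info->status);
drm_printf(&p, "[Ring timeout details]\n");
drm_printf(&p, "IP: %d\n", coredump->ring->funcs->type);
drm_printf(&p, "name: %s\n", coredump->ring->name);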





VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +-
  1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 147100c27c2d..dd39e614d907 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, 
size_t count,
coredump->ring->name);
 }

+   if (coredump->adev) {
+   struct amdgpu_vm_fault_info *fault_info =
+   &coredump->adev->vm_manager.fault_info;
+
+   drm_printf(&p, "\n[%s] Page fault observed\n",
+  fault_info->vmhub ? "mmhub" : "gfxhub");
+   drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
+  fault_info->addr);
+   drm_printf(&p, "Protection fault status register:0x%x\n",
+  fault_info->status);
+   }
+
 if (coredump->reset_vram_lost)
-   drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+   drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
 if (coredump->adev->reset_info.num_regs) {
 drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n");

--
2.34.1



Re: [PATCH] drm/amdgpu: add vm fault information to devcoredump

2024-03-07 Thread Khatri, Sunil



On 3/7/2024 6:10 PM, Christian König wrote:

Am 07.03.24 um 09:37 schrieb Khatri, Sunil:


On 3/7/2024 1:47 PM, Christian König wrote:

Am 06.03.24 um 19:19 schrieb Sunil Khatri:

Add page fault information to the devcoredump.

Output of devcoredump:
**** AMDGPU Device Coredump ****
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

[gfxhub] Page fault observed for GPU family:143
Faulty page starting at address 0x
Protection fault status register:0x301031

VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  1 +
  2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c

index 147100c27c2d..d7fea6cdf2f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t 
offset, size_t count,

 coredump->ring->name);
  }
  +    if (coredump->fault_info.status) {
+    struct amdgpu_vm_fault_info *fault_info = &coredump->fault_info;
+
+    drm_printf(&p, "\n[%s] Page fault observed for GPU family:%d\n",
+   fault_info->vmhub ? "mmhub" : "gfxhub",
+   coredump->adev->family);
+    drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
+   fault_info->addr);
+    drm_printf(&p, "Protection fault status register:0x%x\n",
+   fault_info->status);
+    }
+
  if (coredump->reset_vram_lost)
-    drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+    drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
  if (coredump->adev->reset_info.num_regs) {
  drm_printf(&p, "AMDGPU register dumps:\nOffset: 
Value:\n");
  @@ -253,6 +265,7 @@ void amdgpu_coredump(struct amdgpu_device 
*adev, bool vram_lost,

  if (job) {
  s_job = >base;
  coredump->ring = to_amdgpu_ring(s_job->sched);
+    coredump->fault_info = job->vm->fault_info;


That's illegal. The VM pointer might already be stale at this point.

I think you need to add the fault info of the last fault globally in 
the VRAM manager or move this to the process info Shashank is 
working on.
Are you saying that during the reset, or otherwise, a VM which is part 
of this job could have been freed, and we might have a NULL 
dereference or an invalid reference? Till now, based on the resets and 
page faults that I have created using the same app which we are using 
for IH overflow, I have only been able to get a valid VM.


Assuming amdgpu_vm is freed or stale for this job, are you suggesting 
we update this information in adev->vm_manager along with the existing 
per-VM fault_info, or only in vm_manager?


Good question. Having it both in the VM as well as the VM manager 
sounds like the simplest option for now.


Let me update the patch then with information in VM manager.

Regards
Sunil
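
A minimal sketch of that update (assuming struct amdgpu_vm_manager
gains a fault_info member, as the reworked patch elsewhere in this
archive does):

/* Sketch: mirror the cached fault into the global VM manager so the
 * devcoredump path never dereferences a possibly-stale amdgpu_vm.
 */
if (vm && status) {
	vm->fault_info.addr = addr;
	vm->fault_info.status = status;
	adev->vm_manager.fault_info = vm->fault_info;
}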



Regards,
Christian.



Regards,
Christian.


  }
    coredump->adev = adev;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h

index 60522963aaca..3197955264f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -98,6 +98,7 @@ struct amdgpu_coredump_info {
  struct timespec64   reset_time;
  bool    reset_vram_lost;
  struct amdgpu_ring    *ring;
+    struct amdgpu_vm_fault_info    fault_info;
  };
  #endif






Re: [PATCH] drm/amdgpu: add vm fault information to devcoredump

2024-03-07 Thread Khatri, Sunil



On 3/7/2024 1:47 PM, Christian König wrote:

Am 06.03.24 um 19:19 schrieb Sunil Khatri:

Add page fault information to the devcoredump.

Output of devcoredump:
**** AMDGPU Device Coredump ****
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

[gfxhub] Page fault observed for GPU family:143
Faulty page starting at address 0x
Protection fault status register:0x301031

VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  1 +
  2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c

index 147100c27c2d..d7fea6cdf2f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t 
offset, size_t count,

 coredump->ring->name);
  }
  +    if (coredump->fault_info.status) {
+    struct amdgpu_vm_fault_info *fault_info = &coredump->fault_info;
+
+    drm_printf(&p, "\n[%s] Page fault observed for GPU family:%d\n",
+   fault_info->vmhub ? "mmhub" : "gfxhub",
+   coredump->adev->family);
+    drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
+   fault_info->addr);
+    drm_printf(&p, "Protection fault status register:0x%x\n",
+   fault_info->status);
+    }
+
  if (coredump->reset_vram_lost)
-    drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+    drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
  if (coredump->adev->reset_info.num_regs) {
  drm_printf(&p, "AMDGPU register dumps:\nOffset: 
Value:\n");
  @@ -253,6 +265,7 @@ void amdgpu_coredump(struct amdgpu_device 
*adev, bool vram_lost,

  if (job) {
  s_job = >base;
  coredump->ring = to_amdgpu_ring(s_job->sched);
+    coredump->fault_info = job->vm->fault_info;


That's illegal. The VM pointer might already be stale at this point.

I think you need to add the fault info of the last fault globally in 
the VRAM manager or move this to the process info Shashank is working on.
Are you saying that during the reset, or otherwise, a VM which is part 
of this job could have been freed, and we might have a NULL 
dereference or an invalid reference? Till now, based on the resets and 
page faults that I have created using the same app which we are using 
for IH overflow, I have only been able to get a valid VM.


Assuming amdgpu_vm is freed or stale for this job, are you suggesting 
we update this information in adev->vm_manager along with the existing 
per-VM fault_info, or only in vm_manager?


Regards,
Christian.


  }
    coredump->adev = adev;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h

index 60522963aaca..3197955264f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -98,6 +98,7 @@ struct amdgpu_coredump_info {
  struct timespec64   reset_time;
  bool    reset_vram_lost;
  struct amdgpu_ring    *ring;
+    struct amdgpu_vm_fault_info    fault_info;
  };
  #endif




Re: [PATCH] drm/amdgpu: add vm fault information to devcoredump

2024-03-06 Thread Khatri, Sunil



On 3/7/2024 12:51 AM, Deucher, Alexander wrote:

[Public]


-Original Message-
From: Sunil Khatri 
Sent: Wednesday, March 6, 2024 1:20 PM
To: Deucher, Alexander ; Koenig, Christian
; Sharma, Shashank

Cc: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org; linux-
ker...@vger.kernel.org; Joshi, Mukul ; Paneer
Selvam, Arunpravin ; Khatri, Sunil

Subject: [PATCH] drm/amdgpu: add vm fault information to devcoredump

Add page fault information to the devcoredump.

Output of devcoredump:
**** AMDGPU Device Coredump ****
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

[gfxhub] Page fault observed for GPU family:143 Faulty page starting at

I think we should add a separate section for the GPU identification information 
(family, PCI ids, IP versions, etc.).  For this patch, I think it's fine to just 
print the fault address and status.


Noted

Regards
Sunil


Alex


address 0x Protection fault status register:0x301031

VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 ++-
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  1 +
  2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 147100c27c2d..d7fea6cdf2f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t
offset, size_t count,
  coredump->ring->name);
   }

+ if (coredump->fault_info.status) {
+ struct amdgpu_vm_fault_info *fault_info = &coredump->fault_info;
+
+ drm_printf(, "\n[%s] Page fault observed for GPU
family:%d\n",
+fault_info->vmhub ? "mmhub" : "gfxhub",
+coredump->adev->family);
+ drm_printf(, "Faulty page starting at address 0x%016llx\n",
+fault_info->addr);
+ drm_printf(, "Protection fault status register:0x%x\n",
+fault_info->status);
+ }
+
   if (coredump->reset_vram_lost)
- drm_printf(, "VRAM is lost due to GPU reset!\n");
+ drm_printf(, "\nVRAM is lost due to GPU reset!\n");
   if (coredump->adev->reset_info.num_regs) {
   drm_printf(, "AMDGPU register dumps:\nOffset:
Value:\n");

@@ -253,6 +265,7 @@ void amdgpu_coredump(struct amdgpu_device
*adev, bool vram_lost,
   if (job) {
   s_job = >base;
   coredump->ring = to_amdgpu_ring(s_job->sched);
+ coredump->fault_info = job->vm->fault_info;
   }

   coredump->adev = adev;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index 60522963aaca..3197955264f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -98,6 +98,7 @@ struct amdgpu_coredump_info {
   struct timespec64   reset_time;
   bool reset_vram_lost;
   struct amdgpu_ring  *ring;
+ struct amdgpu_vm_fault_info fault_info;
  };
  #endif

--
2.34.1
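
For reference, a separate GPU identification section along the lines
Alex suggests might look like this (a sketch; it only uses fields that
already exist on amdgpu_device and its pci_dev):

/* Sketch: identify the GPU so the fault status can be decoded offline */
drm_printf(&p, "\nGPU identification\n");
drm_printf(&p, "PCI vendor: 0x%04x device: 0x%04x\n",
	   adev->pdev->vendor, adev->pdev->device);
drm_printf(&p, "family: %d rev_id: %d\n",
	   adev->family, adev->rev_id);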


RE: [PATCH] drm/amdgpu: cache in more vm fault information

2024-03-06 Thread Khatri, Sunil
[AMD Official Use Only - General]

Ignore this. Triggered wrongly.

-Original Message-
From: Sunil Khatri 
Sent: Wednesday, March 6, 2024 11:50 PM
To: Deucher, Alexander ; Koenig, Christian 
; Sharma, Shashank 
Cc: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org; 
linux-ker...@vger.kernel.org; Joshi, Mukul ; Paneer 
Selvam, Arunpravin ; Khatri, Sunil 

Subject: [PATCH] drm/amdgpu: cache in more vm fault information

When a page fault interrupt is raised, there is a lot more information that is 
useful for developers to analyse the page fault.

Add all such information to the last cached page fault from the interrupt handler.

Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 9 +++--  
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 7 ++-  
drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 2 +-  
drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 2 +-  
drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c  | 2 +-  
drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c  | 2 +-  
drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 2 +-
 7 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 4299ce386322..b77e8e28769d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2905,7 +2905,7 @@ void amdgpu_debugfs_vm_bo_info(struct amdgpu_vm *vm, 
struct seq_file *m)
  * Cache the fault info for later use by userspace in debugging.
  */
 void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
- unsigned int pasid,
+ struct amdgpu_iv_entry *entry,
  uint64_t addr,
  uint32_t status,
  unsigned int vmhub)
@@ -2915,7 +2915,7 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device 
*adev,

xa_lock_irqsave(&adev->vm_manager.pasids, flags);

-   vm = xa_load(&adev->vm_manager.pasids, pasid);
+   vm = xa_load(&adev->vm_manager.pasids, entry->pasid);
/* Don't update the fault cache if status is 0.  In the multiple
 * fault case, subsequent faults will return a 0 status which is
 * useless for userspace and replaces the useful fault status, so
@@ -2924,6 +2924,11 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
if (vm && status) {
vm->fault_info.addr = addr;
vm->fault_info.status = status;
+   vm->fault_info.client_id = entry->client_id;
+   vm->fault_info.src_id = entry->src_id;
+   vm->fault_info.vmid = entry->vmid;
+   vm->fault_info.pasid = entry->pasid;
+   vm->fault_info.ring_id = entry->ring_id;
if (AMDGPU_IS_GFXHUB(vmhub)) {
vm->fault_info.vmhub = AMDGPU_VMHUB_TYPE_GFX;
vm->fault_info.vmhub |=
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 047ec1930d12..c7782a89bdb5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -286,6 +286,11 @@ struct amdgpu_vm_fault_info {
uint32_t status;
/* which vmhub? gfxhub, mmhub, etc. */
unsigned int vmhub;
+   unsigned int client_id;
+   unsigned int src_id;
+   unsigned int ring_id;
+   unsigned int pasid;
+   unsigned int vmid;
 };

 struct amdgpu_vm {
@@ -605,7 +610,7 @@ static inline void amdgpu_vm_eviction_unlock(struct 
amdgpu_vm *vm)  }

 void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
- unsigned int pasid,
+ struct amdgpu_iv_entry *entry,
  uint64_t addr,
  uint32_t status,
  unsigned int vmhub);
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index d933e19e0cf5..6b177ce8db0e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -150,7 +150,7 @@ static int gmc_v10_0_process_interrupt(struct amdgpu_device 
*adev,
status = RREG32(hub->vm_l2_pro_fault_status);
WREG32_P(hub->vm_l2_pro_fault_cntl, 1, ~1);

-   amdgpu_vm_update_fault_cache(adev, entry->pasid, addr, status,
+   amdgpu_vm_update_fault_cache(adev, entry, addr, status,
 entry->vmid_src ? AMDGPU_MMHUB0(0) 
: AMDGPU_GFXHUB(0));
}

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
index 527dc917e049..bcf254856a3e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -121,7 +121,7 @@ static int gmc_v11_0_process_interrupt(struct amdgpu_

Re: [PATCH] drm/amdgpu: cache in more vm fault information

2024-03-06 Thread Khatri, Sunil
As discussed, we decided that we don't need the client id, src id, pasid, 
etc. in the page fault information dump, so this patch isn't needed anymore. 
I am dropping this patch and will add the new information to the 
devcoredump for page faults, which is all available in existing structures.

As discussed, we just need to provide the faulting address and the fault 
status register, with the GPU family to decode the fault, along with process 
information.


Regards
Sunil Khatri
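
A sketch of that minimal dump, using only what is already cached
(fault_info here stands for the existing amdgpu_vm_fault_info):

drm_printf(&p, "GPU family: %d\n", coredump->adev->family);
drm_printf(&p, "[%s] Page fault observed\n",
	   fault_info->vmhub ? "mmhub" : "gfxhub");
drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
	   fault_info->addr);
drm_printf(&p, "Protection fault status register: 0x%x\n",
	   fault_info->status);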

On 3/6/2024 9:56 PM, Khatri, Sunil wrote:


On 3/6/2024 9:49 PM, Christian König wrote:

Am 06.03.24 um 17:06 schrieb Khatri, Sunil:


On 3/6/2024 9:07 PM, Christian König wrote:

Am 06.03.24 um 16:13 schrieb Khatri, Sunil:


On 3/6/2024 8:34 PM, Christian König wrote:

Am 06.03.24 um 15:29 schrieb Alex Deucher:
On Wed, Mar 6, 2024 at 8:04 AM Khatri, Sunil  
wrote:


On 3/6/2024 6:12 PM, Christian König wrote:

Am 06.03.24 um 11:40 schrieb Khatri, Sunil:

On 3/6/2024 3:37 PM, Christian König wrote:

Am 06.03.24 um 10:04 schrieb Sunil Khatri:

When a page fault interrupt is raised, there
is a lot more information that is useful for
developers to analyse the page fault.

Well, actually that information is not that interesting,
because it is hw generation specific.

You should probably rather use the decoded strings here,
e.g. hub, client, xcc_id, node_id etc...

See gmc_v9_0_process_interrupt() for an example.

I saw that v9 does provide more information than what v10
and v11 provide, like node_id and the fault from which die,
but that's again very specific to IP_VERSION(9, 4, 3); I
don't know why that information is not there in v10 and v11.

I agree with your point, but as of now during a page fault we are
dumping this information, which is useful, like which client
has generated an interrupt and for which src, and other information
like the address. So I think to provide similar information in the
devcoredump.

Currently we do not have all this information from either the job
or the vm being derived from the job during a reset. We surely
could add more relevant information later on as per request, but
this information is useful, as eventually it is developers only who
would use the dump file provided by a customer to debug.

Below is the information that I dump in devcore, and I feel that is
good information, but new information could be added which could be
picked later.


Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 
0x1b (UTCL2)
This is a perfect example of what I mean. What you record in the
patch is the client_id, but this is basically meaningless unless
you have access to the AMD internal hw documentation.

What you really need is the client in decoded form, in this case
UTCL2. You can keep the client_id additionally, but the decoded
client string is mandatory to have, I think.

Sure, I am capturing that information. As I am trying to keep the
memory interaction to a minimum, since we are still in interrupt
context here, I recorded the integer information rather than
decoding and writing strings there itself, postponing that till
we dump.

Like decoding to the gfxhub/mmhub based on vmhub/vmid_src, and the
client string from the client id. So are we good to go with sharing
the above details in the devcoredump using the additional
information from the cached page fault?
I think amdgpu_vm_fault_info() has everything you need already 
(vmhub,

status, and addr).  client_id and src_id are just tokens in the
interrupt cookie so we know which IP to route the interrupt to. We
know what they will be because otherwise we'd be in the interrupt
handler for a different IP.  I don't think ring_id has any useful
information in this context and vmid and pasid are probably not too
useful because they are just tokens to associate the fault with a
process.  It would be better to have the process name.


Just to share context here, Alex: I am preparing this for 
devcoredump; my intention was to replicate the information which 
we are sharing in dmesg from KMD for page faults. If we do not 
add the client id specially, we would not be able to share enough 
information in the devcoredump.
It would be just the address and hub (gfxhub/mmhub), and I think 
that is partial information, as the src id, client id and IP block 
share good information.

For process-related information, we are capturing that as part of 
the dump from existing functionality.

**** AMDGPU Device Coredump ****
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 45.084775181
process_name: soft_recovery_p PID: 1780

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 0x1b 
(UTCL2)

VRAM is lost due to GPU reset!

Regards
Sunil



The decoded client name would be really useful, I think, since the 
fault handler is a catch-all and handles a whole

Re: [PATCH] drm/amdgpu: cache in more vm fault information

2024-03-06 Thread Khatri, Sunil



On 3/6/2024 9:59 PM, Alex Deucher wrote:

On Wed, Mar 6, 2024 at 11:21 AM Khatri, Sunil  wrote:


On 3/6/2024 9:45 PM, Alex Deucher wrote:

On Wed, Mar 6, 2024 at 11:06 AM Khatri, Sunil  wrote:

On 3/6/2024 9:07 PM, Christian König wrote:

Am 06.03.24 um 16:13 schrieb Khatri, Sunil:

On 3/6/2024 8:34 PM, Christian König wrote:

Am 06.03.24 um 15:29 schrieb Alex Deucher:

On Wed, Mar 6, 2024 at 8:04 AM Khatri, Sunil  wrote:

On 3/6/2024 6:12 PM, Christian König wrote:

Am 06.03.24 um 11:40 schrieb Khatri, Sunil:

On 3/6/2024 3:37 PM, Christian König wrote:

Am 06.03.24 um 10:04 schrieb Sunil Khatri:

When a page fault interrupt is raised, there
is a lot more information that is useful for
developers to analyse the page fault.

Well, actually that information is not that interesting, because
it is hw generation specific.

You should probably rather use the decoded strings here, e.g. hub,
client, xcc_id, node_id etc...

See gmc_v9_0_process_interrupt() for an example.
I saw that v9 does provide more information than what v10 and v11
provide, like node_id and the fault from which die, but that's again
very specific to IP_VERSION(9, 4, 3); I don't know why that
information is not there in v10 and v11.

I agree with your point, but as of now during a page fault we are
dumping this information, which is useful, like which client
has generated an interrupt and for which src, and other information
like the address. So I think to provide similar information in the
devcoredump.

Currently we do not have all this information from either the job or
the vm being derived from the job during a reset. We surely could add
more relevant information later on as per request, but this
information is useful, as eventually it is developers only who would
use the dump file provided by a customer to debug.

Below is the information that I dump in devcore, and I feel that is
good information, but new information could be added which could be
picked later.


Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 0x1b
(UTCL2)

This is a perfect example of what I mean. What you record in the
patch is the client_id, but this is basically meaningless unless you
have access to the AMD internal hw documentation.

What you really need is the client in decoded form, in this case
UTCL2. You can keep the client_id additionally, but the decoded
client string is mandatory to have, I think.

Sure, I am capturing that information. As I am trying to keep the
memory interaction to a minimum, since we are still in interrupt
context here, I recorded the integer information rather than decoding
and writing strings there itself, postponing that till we dump.

Like decoding to the gfxhub/mmhub based on vmhub/vmid_src, and the
client string from the client id. So are we good to go with sharing
the above details in the devcoredump using the additional information
from the cached page fault?

I think amdgpu_vm_fault_info() has everything you need already (vmhub,
status, and addr).  client_id and src_id are just tokens in the
interrupt cookie so we know which IP to route the interrupt to. We
know what they will be because otherwise we'd be in the interrupt
handler for a different IP.  I don't think ring_id has any useful
information in this context and vmid and pasid are probably not too
useful because they are just tokens to associate the fault with a
process.  It would be better to have the process name.

Just to share context here, Alex: I am preparing this for devcoredump;
my intention was to replicate the information which we are sharing in
dmesg from KMD for page faults. If we do not add the client id
specially, we would not be able to share enough information in the
devcoredump.
It would be just the address and hub (gfxhub/mmhub), and I think that
is partial information, as the src id, client id and IP block share
good information.

For process-related information, we are capturing that as part of the
dump from existing functionality.
**** AMDGPU Device Coredump ****
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 45.084775181
process_name: soft_recovery_p PID: 1780

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 0x1b (UTCL2)
VRAM is lost due to GPU reset!

Regards
Sunil


The decoded client name would be really useful, I think, since the
fault handler is a catch-all and handles a whole bunch of different
clients.

But that should ideally be passed in as a const string instead of the
hw generation specific client_id.

As long as it's only a pointer we also don't run into the trouble
that we need to allocate memory for it.

I agree, but I prefer adding the client id and decoding it in
devcoredump using soc15_ih_clientid_name[fault_info->client_id]; that
is better, else we have to sprintf this string to fault_info in 

Re: [PATCH] drm/amdgpu: cache in more vm fault information

2024-03-06 Thread Khatri, Sunil



On 3/6/2024 9:49 PM, Christian König wrote:

Am 06.03.24 um 17:06 schrieb Khatri, Sunil:


On 3/6/2024 9:07 PM, Christian König wrote:

Am 06.03.24 um 16:13 schrieb Khatri, Sunil:


On 3/6/2024 8:34 PM, Christian König wrote:

Am 06.03.24 um 15:29 schrieb Alex Deucher:
On Wed, Mar 6, 2024 at 8:04 AM Khatri, Sunil  
wrote:


On 3/6/2024 6:12 PM, Christian König wrote:

Am 06.03.24 um 11:40 schrieb Khatri, Sunil:

On 3/6/2024 3:37 PM, Christian König wrote:

Am 06.03.24 um 10:04 schrieb Sunil Khatri:

When a page fault interrupt is raised, there
is a lot more information that is useful for
developers to analyse the page fault.

Well, actually that information is not that interesting, because
it is hw generation specific.

You should probably rather use the decoded strings here, e.g. 
hub, client, xcc_id, node_id etc...

See gmc_v9_0_process_interrupt() for an example.
I saw that v9 does provide more information than what v10 and 
v11 provide, like node_id and the fault from which die, but 
that's again very specific to IP_VERSION(9, 4, 3); I don't know 
why that information is not there in v10 and v11.

I agree with your point, but as of now during a page fault we are
dumping this information, which is useful, like which client
has generated an interrupt and for which src, and other 
information like the address. So I think to provide similar 
information in the devcoredump.

Currently we do not have all this information from either the job 
or the vm being derived from the job during a reset. We surely 
could add more relevant information later on as per request, but 
this information is useful, as eventually it is developers only 
who would use the dump file provided by a customer to debug.

Below is the information that I dump in devcore, and I feel 
that is good information, but new information could be added 
which could be picked later.


Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 
0x1b (UTCL2)
This is a perfect example of what I mean. What you record in the 
patch is the client_id, but this is basically meaningless unless 
you have access to the AMD internal hw documentation.

What you really need is the client in decoded form, in this case
UTCL2. You can keep the client_id additionally, but the decoded 
client string is mandatory to have, I think.

Sure, I am capturing that information. As I am trying to keep the
memory interaction to a minimum, since we are still in interrupt
context here, I recorded the integer information rather than 
decoding and writing strings there itself, postponing that till 
we dump.

Like decoding to the gfxhub/mmhub based on vmhub/vmid_src, and the 
client string from the client id. So are we good to go with 
sharing the above details in the devcoredump using the additional 
information from the cached page fault?
I think amdgpu_vm_fault_info() has everything you need already 
(vmhub,

status, and addr).  client_id and src_id are just tokens in the
interrupt cookie so we know which IP to route the interrupt to. We
know what they will be because otherwise we'd be in the interrupt
handler for a different IP.  I don't think ring_id has any useful
information in this context and vmid and pasid are probably not too
useful because they are just tokens to associate the fault with a
process.  It would be better to have the process name.


Just to share context here, Alex: I am preparing this for 
devcoredump; my intention was to replicate the information which 
we are sharing in dmesg from KMD for page faults. If we do not 
add the client id specially, we would not be able to share enough 
information in the devcoredump.
It would be just the address and hub (gfxhub/mmhub), and I think 
that is partial information, as the src id, client id and IP 
block share good information.

For process-related information, we are capturing that as part of 
the dump from existing functionality.

**** AMDGPU Device Coredump ****
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 45.084775181
process_name: soft_recovery_p PID: 1780

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 0x1b 
(UTCL2)

VRAM is lost due to GPU reset!

Regards
Sunil



The decoded client name would be really useful, I think, since the 
fault handler is a catch-all and handles a whole bunch of 
different clients.

But that should ideally be passed in as a const string instead of 
the hw generation specific client_id.

As long as it's only a pointer, we also don't run into the trouble 
that we need to allocate memory for it.


I agree, but I prefer adding the client id and decoding it in 
devcoredump using soc15_ih_clientid_name[fault_info->client_id]; that 
is better, else we have to sprintf this string to fault_info in 
irq context, which is writing more bytes to memory, I guess, compared 
to an integer :)

Re: [PATCH] drm/amdgpu: cache in more vm fault information

2024-03-06 Thread Khatri, Sunil



On 3/6/2024 9:45 PM, Alex Deucher wrote:

On Wed, Mar 6, 2024 at 11:06 AM Khatri, Sunil  wrote:


On 3/6/2024 9:07 PM, Christian König wrote:

Am 06.03.24 um 16:13 schrieb Khatri, Sunil:

On 3/6/2024 8:34 PM, Christian König wrote:

Am 06.03.24 um 15:29 schrieb Alex Deucher:

On Wed, Mar 6, 2024 at 8:04 AM Khatri, Sunil  wrote:

On 3/6/2024 6:12 PM, Christian König wrote:

Am 06.03.24 um 11:40 schrieb Khatri, Sunil:

On 3/6/2024 3:37 PM, Christian König wrote:

Am 06.03.24 um 10:04 schrieb Sunil Khatri:

When a page fault interrupt is raised, there
is a lot more information that is useful for
developers to analyse the page fault.

Well, actually that information is not that interesting, because
it is hw generation specific.

You should probably rather use the decoded strings here, e.g. hub,
client, xcc_id, node_id etc...

See gmc_v9_0_process_interrupt() for an example.
I saw that v9 does provide more information than what v10 and v11
provide, like node_id and the fault from which die, but that's again
very specific to IP_VERSION(9, 4, 3); I don't know why that
information is not there in v10 and v11.

I agree with your point, but as of now during a page fault we are
dumping this information, which is useful, like which client
has generated an interrupt and for which src, and other information
like the address. So I think to provide similar information in the
devcoredump.

Currently we do not have all this information from either the job or
the vm being derived from the job during a reset. We surely could add
more relevant information later on as per request, but this
information is useful, as eventually it is developers only who would
use the dump file provided by a customer to debug.

Below is the information that I dump in devcore, and I feel that is
good information, but new information could be added which could be
picked later.


Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 0x1b
(UTCL2)

This is a perfect example of what I mean. What you record in the
patch is the client_id, but this is basically meaningless unless you
have access to the AMD internal hw documentation.

What you really need is the client in decoded form, in this case
UTCL2. You can keep the client_id additionally, but the decoded
client string is mandatory to have, I think.

Sure, I am capturing that information. As I am trying to keep the
memory interaction to a minimum, since we are still in interrupt
context here, I recorded the integer information rather than decoding
and writing strings there itself, postponing that till we dump.

Like decoding to the gfxhub/mmhub based on vmhub/vmid_src, and the
client string from the client id. So are we good to go with sharing
the above details in the devcoredump using the additional information
from the cached page fault?

I think amdgpu_vm_fault_info() has everything you need already (vmhub,
status, and addr).  client_id and src_id are just tokens in the
interrupt cookie so we know which IP to route the interrupt to. We
know what they will be because otherwise we'd be in the interrupt
handler for a different IP.  I don't think ring_id has any useful
information in this context and vmid and pasid are probably not too
useful because they are just tokens to associate the fault with a
process.  It would be better to have the process name.

Just to share context here, Alex: I am preparing this for devcoredump;
my intention was to replicate the information which we are sharing in
dmesg from KMD for page faults. If we do not add the client id
specially, we would not be able to share enough information in the
devcoredump.
It would be just the address and hub (gfxhub/mmhub), and I think that
is partial information, as the src id, client id and IP block share
good information.

For process-related information, we are capturing that as part of the
dump from existing functionality.
**** AMDGPU Device Coredump ****
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 45.084775181
process_name: soft_recovery_p PID: 1780

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 0x1b (UTCL2)
VRAM is lost due to GPU reset!

Regards
Sunil


The decoded client name would be really useful, I think, since the
fault handler is a catch-all and handles a whole bunch of different
clients.

But that should ideally be passed in as a const string instead of the
hw generation specific client_id.

As long as it's only a pointer, we also don't run into the trouble
that we need to allocate memory for it.

I agree, but I prefer adding the client id and decoding it in
devcoredump using soc15_ih_clientid_name[fault_info->client_id]; that
is better, else we have to sprintf this string to fault_info in irq
context, which is writing more bytes to memory, I guess, compared to
an integer :)

Well, I totally agree that we shouldn't fiddle too much in the interrupt handler

Re: [PATCH] drm/amdgpu: cache in more vm fault information

2024-03-06 Thread Khatri, Sunil



On 3/6/2024 9:07 PM, Christian König wrote:

Am 06.03.24 um 16:13 schrieb Khatri, Sunil:


On 3/6/2024 8:34 PM, Christian König wrote:

Am 06.03.24 um 15:29 schrieb Alex Deucher:

On Wed, Mar 6, 2024 at 8:04 AM Khatri, Sunil  wrote:


On 3/6/2024 6:12 PM, Christian König wrote:

Am 06.03.24 um 11:40 schrieb Khatri, Sunil:

On 3/6/2024 3:37 PM, Christian König wrote:

Am 06.03.24 um 10:04 schrieb Sunil Khatri:

When a page fault interrupt is raised, there
is a lot more information that is useful for
developers to analyse the page fault.

Well, actually that information is not that interesting, because
it is hw generation specific.

You should probably rather use the decoded strings here, e.g. hub,
client, xcc_id, node_id etc...

See gmc_v9_0_process_interrupt() for an example.
I saw that v9 does provide more information than what v10 and v11
provide, like node_id and the fault from which die, but that's again
very specific to IP_VERSION(9, 4, 3); I don't know why that
information is not there in v10 and v11.

I agree with your point, but as of now during a page fault we are
dumping this information, which is useful, like which client
has generated an interrupt and for which src, and other information
like the address. So I think to provide similar information in the
devcoredump.

Currently we do not have all this information from either the job or
the vm being derived from the job during a reset. We surely could add
more relevant information later on as per request, but this 
information is useful, as eventually it is developers only who would 
use the dump file provided by a customer to debug.

Below is the information that I dump in devcore, and I feel that is
good information, but new information could be added which could be
picked later.


Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 0x1b 
(UTCL2)
This is a perfect example of what I mean. What you record in the 
patch is the client_id, but this is basically meaningless unless 
you have access to the AMD internal hw documentation.

What you really need is the client in decoded form, in this case
UTCL2. You can keep the client_id additionally, but the decoded 
client string is mandatory to have, I think.

Sure, I am capturing that information. As I am trying to keep the
memory interaction to a minimum, since we are still in interrupt
context here, I recorded the integer information rather than 
decoding and writing strings there itself, postponing that till 
we dump.

Like decoding to the gfxhub/mmhub based on vmhub/vmid_src, and the 
client string from the client id. So are we good to go with sharing 
the above details in the devcoredump using the additional 
information from the cached page fault?

I think amdgpu_vm_fault_info() has everything you need already (vmhub,
status, and addr).  client_id and src_id are just tokens in the
interrupt cookie so we know which IP to route the interrupt to. We
know what they will be because otherwise we'd be in the interrupt
handler for a different IP.  I don't think ring_id has any useful
information in this context and vmid and pasid are probably not too
useful because they are just tokens to associate the fault with a
process.  It would be better to have the process name.


Just to share context here, Alex: I am preparing this for devcoredump; 
my intention was to replicate the information which we are sharing in 
dmesg from KMD for page faults. If we do not add the client id 
specially, we would not be able to share enough information in the 
devcoredump.
It would be just the address and hub (gfxhub/mmhub), and I think that 
is partial information, as the src id, client id and IP block share 
good information.

For process-related information, we are capturing that as part of the 
dump from existing functionality.

**** AMDGPU Device Coredump ****
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 45.084775181
process_name: soft_recovery_p PID: 1780

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 0x1b (UTCL2)
VRAM is lost due to GPU reset!

Regards
Sunil



The decoded client name would be really useful, I think, since the 
fault handler is a catch-all and handles a whole bunch of different 
clients.

But that should ideally be passed in as a const string instead of the 
hw generation specific client_id.

As long as it's only a pointer, we also don't run into the trouble 
that we need to allocate memory for it.


I agree, but I prefer adding the client id and decoding it in 
devcoredump using soc15_ih_clientid_name[fault_info->client_id]; that 
is better, else we have to sprintf this string to fault_info in irq 
context, which is writing more bytes to memory, I guess, compared to 
an integer :)
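
For illustration, decoding at dump time instead of in the interrupt
handler could look like this (a sketch; soc15_ih_clientid_name is the
existing decode table, and the extra fault_info fields are the ones
this patch proposed):

drm_printf(&p, "[%s] page fault (src_id:%u vmid:%u pasid:%u) from client %u (%s)\n",
	   fault_info->vmhub ? "mmhub" : "gfxhub",
	   fault_info->src_id, fault_info->vmid, fault_info->pasid,
	   fault_info->client_id,
	   soc15_ih_clientid_name[fault_info->client_id]);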


Well, I totally agree that we shouldn't fiddle too much in the interrupt 
handler

Re: [PATCH] drm/amdgpu: cache in more vm fault information

2024-03-06 Thread Khatri, Sunil



On 3/6/2024 9:19 PM, Alex Deucher wrote:

On Wed, Mar 6, 2024 at 10:32 AM Alex Deucher  wrote:

On Wed, Mar 6, 2024 at 10:13 AM Khatri, Sunil  wrote:


On 3/6/2024 8:34 PM, Christian König wrote:

Am 06.03.24 um 15:29 schrieb Alex Deucher:

On Wed, Mar 6, 2024 at 8:04 AM Khatri, Sunil  wrote:

On 3/6/2024 6:12 PM, Christian König wrote:

Am 06.03.24 um 11:40 schrieb Khatri, Sunil:

On 3/6/2024 3:37 PM, Christian König wrote:

Am 06.03.24 um 10:04 schrieb Sunil Khatri:

When a page fault interrupt is raised, there
is a lot more information that is useful for
developers to analyse the page fault.

Well, actually that information is not that interesting, because
it is hw generation specific.

You should probably rather use the decoded strings here, e.g. hub,
client, xcc_id, node_id etc...

See gmc_v9_0_process_interrupt() for an example.
I saw that v9 does provide more information than what v10 and v11
provide, like node_id and the fault from which die, but that's again
very specific to IP_VERSION(9, 4, 3); I don't know why that
information is not there in v10 and v11.

I agree with your point, but as of now during a page fault we are
dumping this information, which is useful, like which client
has generated an interrupt and for which src, and other information
like the address. So I think to provide similar information in the
devcoredump.

Currently we do not have all this information from either the job or
the vm being derived from the job during a reset. We surely could add
more relevant information later on as per request, but this
information is useful, as eventually it is developers only who would
use the dump file provided by a customer to debug.

Below is the information that I dump in devcore, and I feel that is
good information, but new information could be added which could be
picked later.


Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 0x1b
(UTCL2)

This is a perfect example of what I mean. What you record in the patch
is the client_id, but this is basically meaningless unless you have
access to the AMD internal hw documentation.

What you really need is the client in decoded form, in this case
UTCL2. You can keep the client_id additionally, but the decoded client
string is mandatory to have, I think.

Sure, I am capturing that information. As I am trying to keep the
memory interaction to a minimum, since we are still in interrupt
context here, I recorded the integer information rather than decoding
and writing strings there itself, postponing that till we dump.

Like decoding to the gfxhub/mmhub based on vmhub/vmid_src, and the
client string from the client id. So are we good to go with sharing
the above details in the devcoredump using the additional information
from the cached page fault?

I think amdgpu_vm_fault_info() has everything you need already (vmhub,
status, and addr).  client_id and src_id are just tokens in the
interrupt cookie so we know which IP to route the interrupt to. We
know what they will be because otherwise we'd be in the interrupt
handler for a different IP.  I don't think ring_id has any useful
information in this context and vmid and pasid are probably not too
useful because they are just tokens to associate the fault with a
process.  It would be better to have the process name.

Just to share context here, Alex: I am preparing this for devcoredump; my
intention was to replicate the information which we are sharing in dmesg
from KMD for page faults. If we do not add the client id specially,
we would not be able to share enough information in the devcoredump.
It would be just the address and hub (gfxhub/mmhub), and I think that is
partial information, as the src id, client id and IP block share good
information.

We also need to include the status register value.  That contains the
important information (type of access, fault type, client, etc.).
Client_id and src_id are only used to route the interrupt to the right
software code.  E.g., a different client_id and src_id would be a
completely different interrupt (e.g., vblank or fence, etc.).  For GPU
page faults the client_id and src_id will always be the same.

The devcoredump should also include information about the GPU itself
as well (e.g., PCI DID/VID, maybe some of the relevant IP versions).

We already have "status" which is register "GCVM_L2_PROTECTION_FAULT_STATUS". 
But the problem here is this all needs to be captured in interrupt context which i want to avoid 
and this is family specific calls.

chip family would also be good.  And also vram size.

If we have a way to identify the chip and we have the vm status
register and vm fault address, we can decode all of the fault
information.
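
For reference, gmc_v10_0_process_interrupt() already decodes the cached
status value along these lines, so the same could be done at dump or
analysis time once the chip is identified (approximate excerpt, not
verbatim):

dev_err(adev->dev, "\t MORE_FAULTS: 0x%lx\n",
	REG_GET_FIELD(status, GCVM_L2_PROTECTION_FAULT_STATUS, MORE_FAULTS));
dev_err(adev->dev, "\t WALKER_ERROR: 0x%lx\n",
	REG_GET_FIELD(status, GCVM_L2_PROTECTION_FAULT_STATUS, WALKER_ERROR));
dev_err(adev->dev, "\t PERMISSION_FAULTS: 0x%lx\n",
	REG_GET_FIELD(status, GCVM_L2_PROTECTION_FAULT_STATUS, PERMISSION_FAULTS));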

In this patch I am focusing on page-fault-specific information only [taking one 
at a time]. But eventually I will be adding more information as per the 
devcoredump JIRA plan, and I will keep in the todo too the other information that 
you 

Re: [PATCH] drm/amdgpu: cache in more vm fault information

2024-03-06 Thread Khatri, Sunil



On 3/6/2024 8:34 PM, Christian König wrote:

Am 06.03.24 um 15:29 schrieb Alex Deucher:

On Wed, Mar 6, 2024 at 8:04 AM Khatri, Sunil  wrote:


On 3/6/2024 6:12 PM, Christian König wrote:

Am 06.03.24 um 11:40 schrieb Khatri, Sunil:

On 3/6/2024 3:37 PM, Christian König wrote:

Am 06.03.24 um 10:04 schrieb Sunil Khatri:

When a page fault interrupt is raised, there
is a lot more information that is useful for
developers to analyse the page fault.

Well, actually that information is not that interesting, because
it is hw generation specific.

You should probably rather use the decoded strings here, e.g. hub,
client, xcc_id, node_id etc...

See gmc_v9_0_process_interrupt() for an example.
I saw that v9 does provide more information than what v10 and v11
provide, like node_id and the fault from which die, but that's again
very specific to IP_VERSION(9, 4, 3); I don't know why that
information is not there in v10 and v11.

I agree with your point, but as of now during a page fault we are
dumping this information, which is useful, like which client
has generated an interrupt and for which src, and other information
like the address. So I think to provide similar information in the
devcoredump.

Currently we do not have all this information from either the job or
the vm being derived from the job during a reset. We surely could add
more relevant information later on as per request, but this
information is useful, as eventually it is developers only who would
use the dump file provided by a customer to debug.

Below is the information that I dump in devcore, and I feel that is
good information, but new information could be added which could be
picked later.


Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 0x1b 
(UTCL2)

This is a perfect example of what I mean. What you record in the patch
is the client_id, but this is basically meaningless unless you have
access to the AMD internal hw documentation.

What you really need is the client in decoded form, in this case
UTCL2. You can keep the client_id additionally, but the decoded client
string is mandatory to have, I think.

Sure, I am capturing that information. As I am trying to keep the
memory interaction to a minimum, since we are still in interrupt
context here, I recorded the integer information rather than decoding
and writing strings there itself, postponing that till we dump.

Like decoding to the gfxhub/mmhub based on vmhub/vmid_src, and the
client string from the client id. So are we good to go with sharing
the above details in the devcoredump using the additional information
from the cached page fault?

I think amdgpu_vm_fault_info() has everything you need already (vmhub,
status, and addr).  client_id and src_id are just tokens in the
interrupt cookie so we know which IP to route the interrupt to. We
know what they will be because otherwise we'd be in the interrupt
handler for a different IP.  I don't think ring_id has any useful
information in this context and vmid and pasid are probably not too
useful because they are just tokens to associate the fault with a
process.  It would be better to have the process name.


Just to share context here, Alex: I am preparing this for devcoredump; my 
intention was to replicate the information which we are sharing in dmesg 
from KMD for page faults. If we do not add the client id specially, 
we would not be able to share enough information in the devcoredump.
It would be just the address and hub (gfxhub/mmhub), and I think that is 
partial information, as the src id, client id and IP block share good 
information.

For process-related information, we are capturing that as part of the 
dump from existing functionality.

**** AMDGPU Device Coredump ****
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 45.084775181
process_name: soft_recovery_p PID: 1780

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 0x1b (UTCL2)
VRAM is lost due to GPU reset!

Regards
Sunil



The decoded client name would be really useful, I think, since the fault 
handler is a catch-all and handles a whole bunch of different clients.

But that should ideally be passed in as a const string instead of the hw 
generation specific client_id.

As long as it's only a pointer, we also don't run into the trouble that 
we need to allocate memory for it.


I agree, but I prefer adding the client id and decoding it in devcoredump 
using soc15_ih_clientid_name[fault_info->client_id]; that is better, else 
we have to sprintf this string to fault_info in irq context, which is 
writing more bytes to memory, I guess, compared to an integer :)

We can argue on values like pasid, vmid and ring id being taken off 
if they are totally not useful.


Regards
Sunil



Christian.



Alex


regards
sunil


Regards,
Christian.

Re: [PATCH] drm/amdgpu: cache in more vm fault information

2024-03-06 Thread Khatri, Sunil



On 3/6/2024 6:12 PM, Christian König wrote:

Am 06.03.24 um 11:40 schrieb Khatri, Sunil:


On 3/6/2024 3:37 PM, Christian König wrote:

Am 06.03.24 um 10:04 schrieb Sunil Khatri:

When a page fault interrupt is raised, there
is a lot more information that is useful for
developers to analyse the page fault.

Well, actually that information is not that interesting, because it 
is hw generation specific.

You should probably rather use the decoded strings here, e.g. hub, 
client, xcc_id, node_id etc...

See gmc_v9_0_process_interrupt() for an example.
I saw that v9 does provide more information than what v10 and v11 
provide, like node_id and the fault from which die, but that's again 
very specific to IP_VERSION(9, 4, 3); I don't know why that information 
is not there in v10 and v11.


I agree with your point, but as of now during a page fault we are 
dumping this information, which is useful, like which client
has generated an interrupt and for which src, and other information 
like the address. So I think to provide similar information in the 
devcoredump.

Currently we do not have all this information from either the job or 
the vm being derived from the job during a reset. We surely could add 
more relevant information later on as per request, but this information 
is useful, as eventually it is developers only who would use the dump 
file provided by a customer to debug.

Below is the information that I dump in devcore, and I feel that is 
good information, but new information could be added which could be 
picked later.



Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 0x1b (UTCL2)


This is a perfect example what I mean. You record in the patch is the 
client_id, but this is is basically meaningless unless you have access 
to the AMD internal hw documentation.


What you really need is the client in decoded form, in this case 
UTCL2. You can keep the client_id additionally, but the decoded client 
string is mandatory to have I think.


Sure, I am capturing that information. I am trying to keep the memory
interaction to a minimum since we are still in interrupt context here; that
is why I recorded the integer values rather than decoding and writing the
strings there and then, postponing the decode until we dump.

For example, decoding gfxhub/mmhub from vmhub/vmid_src, and the client
string from the client id. So are we good to go with sharing the above
details in the devcoredump, using the additional information cached from the
pagefault?
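
For illustration, such a dump-time decode could look like the following
sketch (AMDGPU_VMHUB_TYPE_GFX is from the patch above; the mask macro is an
assumption):

    /* decode the cached vmhub value into gfxhub/mmhub for the dump */
    const char *hub_name =
        (vm->fault_info.vmhub & AMDGPU_VMHUB_TYPE_MASK) ==
        AMDGPU_VMHUB_TYPE_GFX ? "gfxhub" : "mmhub";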


regards
sunil



Regards,
Christian.



Regards
Sunil Khatri



Regards,
Christian.



Add all such information in the last cached
pagefault from an interrupt handler.

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 9 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 7 ++-
  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 2 +-
  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 2 +-
  drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c  | 2 +-
  drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c  | 2 +-
  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 2 +-
  7 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c

index 4299ce386322..b77e8e28769d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2905,7 +2905,7 @@ void amdgpu_debugfs_vm_bo_info(struct 
amdgpu_vm *vm, struct seq_file *m)

   * Cache the fault info for later use by userspace in debugging.
   */
  void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
-  unsigned int pasid,
+  struct amdgpu_iv_entry *entry,
    uint64_t addr,
    uint32_t status,
    unsigned int vmhub)
@@ -2915,7 +2915,7 @@ void amdgpu_vm_update_fault_cache(struct 
amdgpu_device *adev,

    xa_lock_irqsave(&adev->vm_manager.pasids, flags);
  -    vm = xa_load(&adev->vm_manager.pasids, pasid);
+    vm = xa_load(&adev->vm_manager.pasids, entry->pasid);
  /* Don't update the fault cache if status is 0.  In the multiple
   * fault case, subsequent faults will return a 0 status which is
   * useless for userspace and replaces the useful fault 
status, so
@@ -2924,6 +2924,11 @@ void amdgpu_vm_update_fault_cache(struct 
amdgpu_device *adev,

  if (vm && status) {
  vm->fault_info.addr = addr;
  vm->fault_info.status = status;
+    vm->fault_info.client_id = entry->client_id;
+    vm->fault_info.src_id = entry->src_id;
+    vm->fault_info.vmid = entry->vmid;
+    vm->fault_info.pasid = entry->pasid;
+    vm->fault_info.ring_id = entry->ring_id;
  if (AMDGPU_IS_GFXHUB(vmhub)) {
  vm->fault_info.vmhub = AMDGPU_VMHUB_TYPE_GFX;
  vm->fault_info.vmhub |=
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
b/drivers/gp

Re: [PATCH] drm/amdgpu: cache in more vm fault information

2024-03-06 Thread Khatri, Sunil



On 3/6/2024 3:37 PM, Christian König wrote:

On 06.03.24 at 10:04, Sunil Khatri wrote:

When a page fault interrupt is raised there
is a lot more information that is useful for
developers to analyse the pagefault.


Well, actually that information is not that interesting because it is hw
generation specific.


You should probably rather use the decoded strings here, e.g. hub, 
client, xcc_id, node_id etc...


See gmc_v9_0_process_interrupt() for an example.
I saw that v9 does provide more information than v10 and v11, like node_id
and which die the fault came from, but that again is very specific to
IP_VERSION(9, 4, 3); I don't know why that information is not there in v10
and v11.


I agree with your point, but as of now during a pagefault we already dump
useful information, such as which client generated the interrupt, for which
src, and other details like the faulting address. So I think we should
provide similar information in the devcoredump.


Currently we do not have all this information available from either the job
or the vm derived from the job during a reset. We could surely add more
relevant information later on request, but this information is useful
because eventually it is developers who will use the dump file provided by a
customer to debug.


Below is the information that I dump in the devcoredump; I feel that is good
information, and new information could be added and picked up later.



Page fault information
[gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
in page starting at address 0x from client 0x1b (UTCL2)


Regards
Sunil Khatri



Regards,
Christian.



Add all such information in the last cached
pagefault from an interrupt handler.

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 9 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 7 ++-
  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 2 +-
  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 2 +-
  drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c  | 2 +-
  drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c  | 2 +-
  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 2 +-
  7 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c

index 4299ce386322..b77e8e28769d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2905,7 +2905,7 @@ void amdgpu_debugfs_vm_bo_info(struct amdgpu_vm 
*vm, struct seq_file *m)

   * Cache the fault info for later use by userspace in debugging.
   */
  void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
-  unsigned int pasid,
+  struct amdgpu_iv_entry *entry,
    uint64_t addr,
    uint32_t status,
    unsigned int vmhub)
@@ -2915,7 +2915,7 @@ void amdgpu_vm_update_fault_cache(struct 
amdgpu_device *adev,

    xa_lock_irqsave(&adev->vm_manager.pasids, flags);
  -    vm = xa_load(&adev->vm_manager.pasids, pasid);
+    vm = xa_load(&adev->vm_manager.pasids, entry->pasid);
  /* Don't update the fault cache if status is 0.  In the multiple
   * fault case, subsequent faults will return a 0 status which is
   * useless for userspace and replaces the useful fault status, so
@@ -2924,6 +2924,11 @@ void amdgpu_vm_update_fault_cache(struct 
amdgpu_device *adev,

  if (vm && status) {
  vm->fault_info.addr = addr;
  vm->fault_info.status = status;
+    vm->fault_info.client_id = entry->client_id;
+    vm->fault_info.src_id = entry->src_id;
+    vm->fault_info.vmid = entry->vmid;
+    vm->fault_info.pasid = entry->pasid;
+    vm->fault_info.ring_id = entry->ring_id;
  if (AMDGPU_IS_GFXHUB(vmhub)) {
  vm->fault_info.vmhub = AMDGPU_VMHUB_TYPE_GFX;
  vm->fault_info.vmhub |=
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h

index 047ec1930d12..c7782a89bdb5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -286,6 +286,11 @@ struct amdgpu_vm_fault_info {
  uint32_t    status;
  /* which vmhub? gfxhub, mmhub, etc. */
  unsigned int    vmhub;
+    unsigned int    client_id;
+    unsigned int    src_id;
+    unsigned int    ring_id;
+    unsigned int    pasid;
+    unsigned int    vmid;
  };
    struct amdgpu_vm {
@@ -605,7 +610,7 @@ static inline void 
amdgpu_vm_eviction_unlock(struct amdgpu_vm *vm)

  }
    void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
-  unsigned int pasid,
+  struct amdgpu_iv_entry *entry,
    uint64_t addr,
    uint32_t status,
    unsigned int vmhub);
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c

index d933e19e0cf5..6b177ce8db0e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -150,7 +150,7 

Re: [PATCH v2] drm/amdgpu: add ring timeout information in devcoredump

2024-03-05 Thread Khatri, Sunil



On 3/5/2024 6:40 PM, Christian König wrote:

On 05.03.24 at 12:58, Sunil Khatri wrote:

Add ring timeout related information in the amdgpu
devcoredump file for debugging purposes.

During the gpu recovery process the registered call
is triggered and add the debug information in data
file created by devcoredump framework under the
directory /sys/class/devcoredump/devcdx/

Signed-off-by: Sunil Khatri 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  2 ++
  2 files changed, 17 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c

index a59364e9b6ed..aa7fed59a0d5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -196,6 +196,13 @@ amdgpu_devcoredump_read(char *buffer, loff_t 
offset, size_t count,

 coredump->reset_task_info.process_name,
 coredump->reset_task_info.pid);
  +    if (coredump->ring_timeout) {
+    drm_printf(&p, "\nRing timed out details\n");
+    drm_printf(&p, "IP Type: %d Ring Name: %s \n",
+    coredump->ring->funcs->type,
+    coredump->ring->name);
+    }
+
  if (coredump->reset_vram_lost)
  drm_printf(&p, "VRAM is lost due to GPU reset!\n");
  if (coredump->adev->reset_info.num_regs) {
@@ -220,6 +227,8 @@ void amdgpu_coredump(struct amdgpu_device *adev, 
bool vram_lost,

  {
  struct amdgpu_coredump_info *coredump;
  struct drm_device *dev = adev_to_drm(adev);
+    struct amdgpu_job *job = reset_context->job;
+    struct drm_sched_job *s_job;
    coredump = kzalloc(sizeof(*coredump), GFP_NOWAIT);
  @@ -228,6 +237,12 @@ void amdgpu_coredump(struct amdgpu_device 
*adev, bool vram_lost,

  return;
  }
  +    if (job) {
+    s_job = &job->base;
+    coredump->ring = to_amdgpu_ring(s_job->sched);
+    coredump->ring_timeout = TRUE;
+    }
+
  coredump->reset_vram_lost = vram_lost;
    if (reset_context->job && reset_context->job->vm) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h

index 19899f6b9b2b..6d67001a1057 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -97,6 +97,8 @@ struct amdgpu_coredump_info {
  struct amdgpu_task_info reset_task_info;
  struct timespec64   reset_time;
  bool    reset_vram_lost;
+    struct amdgpu_ring  *ring;
+    bool    ring_timeout;


I think you can drop ring_timeout, just having ring as optional 
information should be enough.


Apart from that looks pretty good I think.
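
That suggestion would reduce the dump side to something like this sketch,
with coredump->ring left NULL unless a job actually timed out:

    if (coredump->ring) {
        drm_printf(&p, "\nRing timed out details\n");
        drm_printf(&p, "IP Type: %d Ring Name: %s\n",
                   coredump->ring->funcs->type,
                   coredump->ring->name);
    }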

- A GPU reset can happen for at least two reasons: 1. via debugfs, i.e. cat
/sys/kernel/debug/dri/0/amdgpu_gpu_recover, where there is no timeout or
page fault involved. In this case we need to know whether a ring timeout
happened or not, else the dump would print empty ring information. The same
goes for a pagefault: there too we need to check whether recovery ran
because of a pagefault and only then add that information.


So to cover all use cases I added this parameter.
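
For example (hypothetical session), the manual trigger path produces a dump
with no hung ring at all:

    # trigger a recovery by hand; no ring timeout is involved
    cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
    # read the resulting dump (the devcdX index varies per dump)
    cat /sys/class/devcoredump/devcd1/data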


Thanks
Sunil


Regards,
Christian.


  };
  #endif




Re: [PATCH] drm/amdgpu: add ring timeout information in devcoredump

2024-03-05 Thread Khatri, Sunil

On 3/5/2024 2:53 PM, Christian König wrote:
> On 01.03.24 at 13:43, Sunil Khatri wrote:
>> Add ring timeout related information in the amdgpu
>> devcoredump file for debugging purposes.
>>
>> During the gpu recovery process the registered call
>> is triggered and add the debug information in data
>> file created by devcoredump framework under the
>> directory /sys/class/devcoredump/devcdx/
>>
>> Signed-off-by: Sunil Khatri 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h   | 15 +++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   | 11 +++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 12 +++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  1 +
>>   4 files changed, 38 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 9246bca0a008..9f57c7795c47 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -816,6 +816,17 @@ struct amdgpu_reset_info {
>>   #endif
>>   };
>>   +/*
>> + *  IP and Queue information during timeout
>> + */
>> +struct amdgpu_ring_timeout_info {
>> +    bool timeout;
>
> What should that be good for?
In case of a page fault, or a gpu reset due to other reasons, there is no
timeout. In that case we are not adding any information, and we are
using this flag while dumping information.
>
>> +    char ring_name[32];
>> +    enum amdgpu_ring_type ip_type;
>
> That information should already be available in the core dump.
Will update.
>
>> +    bool soft_recovery;
>
> That doesn't make sense since we don't do a core dump in case of a 
> soft recovery.
Noted, this can be removed.
>
>> +};
>> +
>> +
>>   /*
>>    * Non-zero (true) if the GPU has VRAM. Zero (false) otherwise.
>>    */
>> @@ -1150,6 +1161,10 @@ struct amdgpu_device {
>>   bool    debug_largebar;
>>   bool debug_disable_soft_recovery;
>>   bool    debug_use_vram_fw_buf;
>> +
>> +#ifdef CONFIG_DEV_COREDUMP
>> +    struct amdgpu_ring_timeout_info ring_timeout_info;
>> +#endif
>
> Please never store core dump related information in the amdgpu_device 
> structure.

Let me see to it. Point taken.

Thanks
Sunil

>
> Regards,
> Christian.
>
>>   };
>>     static inline uint32_t amdgpu_ip_version(const struct 
>> amdgpu_device *adev,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 71a5cf37b472..e36b7352b2de 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -51,8 +51,19 @@ static enum drm_gpu_sched_stat 
>> amdgpu_job_timedout(struct drm_sched_job *s_job)
>>   memset(&ti, 0, sizeof(struct amdgpu_task_info));
>>   adev->job_hang = true;
>>   +#ifdef CONFIG_DEV_COREDUMP
>> +    /* Update the ring timeout info for coredump*/
>> +    adev->ring_timeout_info.timeout = TRUE;
>> +    sprintf(adev->ring_timeout_info.ring_name, s_job->sched->name);
>> +    adev->ring_timeout_info.ip_type = ring->funcs->type;
>> +    adev->ring_timeout_info.soft_recovery = FALSE;
>> +#endif
>> +
>>   if (amdgpu_gpu_recovery &&
>>   amdgpu_ring_soft_recovery(ring, job->vmid, 
>> s_job->s_fence->parent)) {
>> +#ifdef CONFIG_DEV_COREDUMP
>> +    adev->ring_timeout_info.soft_recovery = TRUE;
>> +#endif
>>   DRM_ERROR("ring %s timeout, but soft recovered\n",
>>     s_job->sched->name);
>>   goto exit;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> index 4baa300121d8..d4f892ed105f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> @@ -196,8 +196,16 @@ amdgpu_devcoredump_read(char *buffer, loff_t 
>> offset, size_t count,
>>  coredump->reset_task_info.process_name,
>>  coredump->reset_task_info.pid);
>>   +    if (coredump->timeout_info.timeout) {
>> +    drm_printf(&p, "\nRing timed out details\n");
>> +    drm_printf(&p, "IP Type: %d Ring Name: %s Soft Recovery: %s\n",
>> +    coredump->timeout_info.ip_type,
>> +    coredump->timeout_info.ring_name,
>> +    coredump->timeout_info.soft_recovery ? 
>> "Successful":"Failed");
>> +    }
>> +
>>   if (coredump->reset_vram_lost)
>> -    drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>> +    drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
>>   if (coredump->adev->reset_info.num_regs) {
>>   drm_printf(&p, "AMDGPU register dumps:\nOffset: 
>> Value:\n");
>>   @@ -228,6 +236,7 @@ void amdgpu_coredump(struct amdgpu_device 
>> *adev, bool vram_lost,
>>   return;
>>   }
>>   +    coredump->timeout_info = adev->ring_timeout_info;
>>   coredump->reset_vram_lost = vram_lost;
>>     if (reset_context->job && reset_context->job->vm)
>> @@ -236,6 +245,7 @@ void amdgpu_coredump(struct amdgpu_device 

RE: [PATCH] drm/amdgpu/gmc11: implement get_vbios_fb_size()

2023-05-12 Thread Khatri, Sunil
[AMD Official Use Only - General]

Acked-by: Sunil Khatri 

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Thursday, May 11, 2023 8:13 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: [PATCH] drm/amdgpu/gmc11: implement get_vbios_fb_size()

Implement get_vbios_fb_size() so we can properly reserve the vbios splash 
screen to avoid potential artifacts on the screen during the transition from 
the pre-OS console to the OS console.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
index f73c238f3145..2f570fb5febe 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -31,6 +31,8 @@
 #include "umc_v8_10.h"
 #include "athub/athub_3_0_0_sh_mask.h"
 #include "athub/athub_3_0_0_offset.h"
+#include "dcn/dcn_3_2_0_offset.h"
+#include "dcn/dcn_3_2_0_sh_mask.h"
 #include "oss/osssys_6_0_0_offset.h"
 #include "ivsrcid/vmc/irqsrcs_vmc_1_0.h"
 #include "navi10_enum.h"
@@ -546,7 +548,24 @@ static void gmc_v11_0_get_vm_pte(struct amdgpu_device 
*adev,
 
 static unsigned gmc_v11_0_get_vbios_fb_size(struct amdgpu_device *adev)  {
-   return 0;
+   u32 d1vga_control = RREG32_SOC15(DCE, 0, regD1VGA_CONTROL);
+   unsigned size;
+
+   if (REG_GET_FIELD(d1vga_control, D1VGA_CONTROL, D1VGA_MODE_ENABLE)) {
+   size = AMDGPU_VBIOS_VGA_ALLOCATION;
+   } else {
+   u32 viewport;
+   u32 pitch;
+
+   viewport = RREG32_SOC15(DCE, 0, 
regHUBP0_DCSURF_PRI_VIEWPORT_DIMENSION);
+   pitch = RREG32_SOC15(DCE, 0, regHUBPREQ0_DCSURF_SURFACE_PITCH);
+   size = (REG_GET_FIELD(viewport,
+   HUBP0_DCSURF_PRI_VIEWPORT_DIMENSION, 
PRI_VIEWPORT_HEIGHT) *
+   REG_GET_FIELD(pitch, 
HUBPREQ0_DCSURF_SURFACE_PITCH, PITCH) *
+   4);
+   }
+
+   return size;
 }
 
 static const struct amdgpu_gmc_funcs gmc_v11_0_gmc_funcs = {
--
2.40.1
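
A worked example with hypothetical numbers: with VGA mode disabled, a pre-OS
surface with PRI_VIEWPORT_HEIGHT = 1080 and PITCH = 2048 pixels at 4 bytes
per pixel reserves

    size = 1080 * 2048 * 4 = 8847360 bytes (~8.4 MiB)

of VRAM, while with VGA mode enabled the fixed AMDGPU_VBIOS_VGA_ALLOCATION
is reserved instead.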


RE: Help debug amdgpu faults

2022-11-22 Thread Khatri, Sunil
[AMD Official Use Only - General]

Hello Alex, Robert

I too am facing similar issues on Chrome. Are there any tools in the Linux
environment that can help debug such issues, like page faults or kernel
panics caused by invalid pointer accesses?

I have used tools like a ramdump parser, which works on the ramdump taken
after a crash and lets you check a lot of static data in memory; even the
page tables can be inspected by walking through them manually. We used to
load the kernel symbols along with the ramdump to go through it line by
line.

I would appreciate it if you could point to some documents or tools already
used by the Linux graphics teams, for either UMD or KMD drivers, so the
Chrome team can also use them to debug issues.

Regards
Sunil Khatri 

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Tuesday, November 22, 2022 7:42 PM
To: Robert Beckett 
Cc: Dmitrii Osipenko ; Adrián Martínez Larumbe 
; Koenig, Christian ; 
amd-gfx@lists.freedesktop.org; Daniel Stone 
Subject: Re: Help debug amdgpu faults

On Tue, Nov 22, 2022 at 6:53 AM Robert Beckett  
wrote:
>
> Hi,
>
>
> does anyone know of any documentation, or can anyone provide advice on
> debugging amdgpu fault reports?

This is a GPU page fault, so it refers to the GPU virtual address space of
the application. Each process (well, fd really) gets its own GPU virtual
address space into which system memory, system mmio space, or vram can be
mapped. The user mode drivers control their GPU virtual address space.

>
>
> e.g:
>
> Nov 21 19:17:06 fedora kernel: amdgpu :01:00.0: amdgpu: [gfxhub] 
> page fault (src_id:0 ring:8 vmid:1 pasid:32769, for process vkcube pid 
> 999 thread vkcube pid 999)

This is the process that caused the fault.

> Nov 21 19:17:06 fedora kernel: amdgpu :01:00.0: amdgpu:   in page 
> starting at address 0x80010070 from client 0x1b (UTCL2)

This is the virtual address that faulted.

> Nov 21 19:17:06 fedora kernel: amdgpu :01:00.0: amdgpu: 
> GCVM_L2_PROTECTION_FAULT_STATUS:0x00101A10
> Nov 21 19:17:06 fedora kernel: amdgpu :01:00.0: amdgpu:  Faulty 
> UTCL2 client ID: SDMA0 (0xd)

The fault came from the SDMA engine.

> Nov 21 19:17:06 fedora kernel: amdgpu :01:00.0: amdgpu:  
> MORE_FAULTS: 0x0
> Nov 21 19:17:06 fedora kernel: amdgpu :01:00.0: amdgpu:  
> WALKER_ERROR: 0x0
> Nov 21 19:17:06 fedora kernel: amdgpu :01:00.0: amdgpu:  
> PERMISSION_FAULTS: 0x1

The page was not marked as valid in the GPU page table.

> Nov 21 19:17:06 fedora kernel: amdgpu :01:00.0: amdgpu:  
> MAPPING_ERROR: 0x0
> Nov 21 19:17:06 fedora kernel: amdgpu :01:00.0: amdgpu:  RW: 0x0

SDMA attempted to read an invalid page.
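
For reference, a sketch of how the decoded lines above are derived from the
raw status value, modelled on the gmc_v10 interrupt handler (shown only to
illustrate the REG_GET_FIELD decode, not as the exact driver code):

    u32 status = 0x00101A10; /* GCVM_L2_PROTECTION_FAULT_STATUS from the log */
    /* client id: 0xd here, printed as "SDMA0" via the client-id name table */
    u32 cid = REG_GET_FIELD(status, GCVM_L2_PROTECTION_FAULT_STATUS, CID);
    /* 0 means a read access faulted, 1 would mean a write */
    u32 rw = REG_GET_FIELD(status, GCVM_L2_PROTECTION_FAULT_STATUS, RW);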

>
>
>
> see https://gitlab.freedesktop.org/drm/amd/-/issues/2267 for more context.
>
> We have a complicated setup involving rendering then blitting to virtio-gpu 
> exported dmabufs, with plenty of hacks in the mesa and xwayland stacks, so we 
> are considering this our issue to debug, and not an issue with the driver at 
> this point.
> However, having debugged all the interesting parts leading to these faults, I 
> am unable to decode the fault report to correlate to a buffer.
>
> in the fault report, what address space is the address from?
> given that the fault handler shifts the reported address up by 12, I assume
> it is a 4K pfn, which makes me assume a physical address; is this correct?
> if so, is that a vram pa or a host system memory pa?
> is there any documentation for the other fields reported like the fault 
> status etc?

See the comments above.  There is some kernel doc as well:
https://docs.kernel.org/gpu/amdgpu/driver-core.html#amdgpu-virtual-memory

>
> I'd appreciate any advice you could give to help us debug further.

Some operation you are doing in the user mode driver is reading an invalid 
page.  Possibly reading past the end of a buffer or something mis-aligned.  
Compare the faulting GPU address to the GPU virtual address space in the 
application and you should be able to track down what is happening.

Alex

>
> Thanks
>
> Bob
>

RE: [PATCH] drm/amdgpu: enable tmz by default for skyrim

2022-05-30 Thread Khatri, Sunil
[AMD Official Use Only - General]

@Ernst Sjöstrand
Makes sense. Thanks for the review. Pushed another patch without any such names.

Regards
Sunil khatri

From: Ernst Sjöstrand 
Sent: Tuesday, May 31, 2022 1:47 AM
To: Khatri, Sunil 
Cc: Deucher, Alexander ; amd-gfx mailing list 

Subject: Re: [PATCH] drm/amdgpu: enable tmz by default for skyrim

Skyrim is maybe not the best code name ever for a GPU; perhaps don't include
it upstream if it's not official?

Regards
//Ernst

On Mon 30 May 2022 at 20:03, Sunil Khatri wrote:
Enable the tmz feature by default for skyrim,
i.e. IP GC 10.3.7

Signed-off-by: Sunil Khatri 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 798c56214a23..aebc384531ac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -518,6 +518,8 @@ void amdgpu_gmc_tmz_set(struct amdgpu_device *adev)
case IP_VERSION(9, 1, 0):
/* RENOIR looks like RAVEN */
case IP_VERSION(9, 3, 0):
+   /* GC 10.3.7 */
+   case IP_VERSION(10, 3, 7):
if (amdgpu_tmz == 0) {
adev->gmc.tmz_enabled = false;
dev_info(adev->dev,
@@ -540,8 +542,6 @@ void amdgpu_gmc_tmz_set(struct amdgpu_device *adev)
case IP_VERSION(10, 3, 1):
/* YELLOW_CARP*/
case IP_VERSION(10, 3, 3):
-   /* GC 10.3.7 */
-   case IP_VERSION(10, 3, 7):
/* Don't enable it by default yet.
 */
if (amdgpu_tmz < 1) {
--
2.25.1
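
Note that the default chosen here can still be overridden through the
existing amdgpu.tmz module parameter, e.g. on the kernel command line:

    amdgpu.tmz=0    # force TMZ off, overriding the new GC 10.3.7 default
    amdgpu.tmz=1    # force TMZ on
    amdgpu.tmz=-1   # keep the per-ASIC default from amdgpu_gmc_tmz_set()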