RE: [PATCH v1 0/2] SDMA v5_2 ip dump support for devcoredump
[AMD Official Use Only - AMD Internal Distribution Only]

Ignore Plz

-----Original Message-----
From: Sunil Khatri
Sent: Friday, July 12, 2024 5:23 PM
To: Deucher, Alexander; Koenig, Christian
Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil
Subject: [PATCH v1 0/2] SDMA v5_2 ip dump support for devcoredump

Sample output:

IP: sdma_v5_2 num_instances:2
Instance:0
mmSDMA0_STATUS_REG            0x46deed57
mmSDMA0_STATUS1_REG           0x03ff
mmSDMA0_STATUS2_REG           0x3f20
mmSDMA0_STATUS3_REG           0x03f6
mmSDMA0_UCODE_CHECKSUM        0x716360f5
mmSDMA0_RB_RPTR_FETCH_HI      0x
mmSDMA0_RB_RPTR_FETCH         0x4980
mmSDMA0_UTCL1_RD_STATUS       0x01891555
mmSDMA0_UTCL1_WR_STATUS       0x51811555
mmSDMA0_UTCL1_RD_XNACK0       0x00155828
mmSDMA0_UTCL1_RD_XNACK1       0x02a6a700
mmSDMA0_UTCL1_WR_XNACK0       0x00111558
mmSDMA0_UTCL1_WR_XNACK1       0x01c1c100
mmSDMA0_GFX_RB_CNTL           0x80871016
mmSDMA0_GFX_RB_RPTR           0x4980
mmSDMA0_GFX_RB_RPTR_HI        0x
mmSDMA0_GFX_RB_WPTR           0x4980
mmSDMA0_GFX_RB_WPTR_HI        0x
mmSDMA0_GFX_IB_OFFSET         0x
mmSDMA0_GFX_IB_BASE_LO        0x00928600
mmSDMA0_GFX_IB_BASE_HI        0x
mmSDMA0_GFX_IB_CNTL           0x0100
mmSDMA0_GFX_IB_RPTR           0x01a0
mmSDMA0_GFX_IB_SUB_REMAIN     0x
mmSDMA0_GFX_DUMMY_REG         0x00af
mmSDMA0_PAGE_RB_CNTL          0x8087
mmSDMA0_PAGE_RB_RPTR          0x
mmSDMA0_PAGE_RB_RPTR_HI       0x
mmSDMA0_PAGE_RB_WPTR          0x
mmSDMA0_PAGE_RB_WPTR_HI       0x
mmSDMA0_PAGE_IB_OFFSET        0x
mmSDMA0_PAGE_IB_BASE_LO       0x
mmSDMA0_PAGE_IB_BASE_HI       0x
mmSDMA0_PAGE_DUMMY_REG        0x000f
mmSDMA0_RLC0_RB_CNTL          0x8007
mmSDMA0_RLC0_RB_RPTR          0x
mmSDMA0_RLC0_RB_RPTR_HI       0x
mmSDMA0_RLC0_RB_WPTR          0x
mmSDMA0_RLC0_RB_WPTR_HI       0x
mmSDMA0_RLC0_IB_OFFSET        0x
mmSDMA0_RLC0_IB_BASE_LO       0x
mmSDMA0_RLC0_IB_BASE_HI       0x
mmSDMA0_RLC0_DUMMY_REG        0x000f
mmSDMA0_INT_STATUS            0x00e0
mmSDMA0_VM_CNTL               0x
mmGRBM_STATUS2                0x5408
Instance:1
mmSDMA0_STATUS_REG            0x46deed57
mmSDMA0_STATUS1_REG           0x03ff
mmSDMA0_STATUS2_REG           0x43ad
mmSDMA0_STATUS3_REG           0x03f6
mmSDMA0_UCODE_CHECKSUM        0x716360f5
mmSDMA0_RB_RPTR_FETCH_HI      0x
mmSDMA0_RB_RPTR_FETCH         0x3d00
mmSDMA0_UTCL1_RD_STATUS       0x01891555
mmSDMA0_UTCL1_WR_STATUS       0x51811555
mmSDMA0_UTCL1_RD_XNACK0       0x00155827
mmSDMA0_UTCL1_RD_XNACK1       0x021a1b00
mmSDMA0_UTCL1_WR_XNACK0       0x00111558
mmSDMA0_UTCL1_WR_XNACK1       0x01656500
mmSDMA0_GFX_RB_CNTL           0x80871016
mmSDMA0_GFX_RB_RPTR           0x3d00
mmSDMA0_GFX_RB_RPTR_HI        0x
mmSDMA0_GFX_RB_WPTR           0x3d00
mmSDMA0_GFX_RB_WPTR_HI        0x
mmSDMA0_GFX_IB_OFFSET         0x
mmSDMA0_GFX_IB_BASE_LO        0x00927200
mmSDMA0_GFX_IB_BASE_HI        0x
mmSDMA0_GFX_IB_CNTL
RE: [PATCH v1 3/3] drm/amdgpu: select compute ME engines dynamically
Thanks Alex

-----Original Message-----
From: Alex Deucher
Sent: Tuesday, July 9, 2024 7:27 PM
To: Khatri, Sunil
Cc: Deucher, Alexander; Koenig, Christian; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH v1 3/3] drm/amdgpu: select compute ME engines dynamically

Makes sense, although the pattern elsewhere is to just start at 1 for mec. Not sure if it's worth the effort to fix all of those cases up too.

True, but we will keep a check on gfx13 onwards; maybe we will have more than one ME for gfx in some chip, and then we will have to take care of it explicitly.

Series is:
Acked-by: Alex Deucher

On Tue, Jul 9, 2024 at 2:07 AM Sunil Khatri wrote:
>
> GFX ME right now is one, but this could change in future SoCs. Use the
> number of GFX MEs as the start point for compute MEs for GFX12.
>
> Signed-off-by: Sunil Khatri
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> index 084b039eb765..f384be0d1800 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> @@ -4946,7 +4946,7 @@ static void gfx_v12_ip_dump(void *handle)
>         for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
>                 for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
>                         /* ME0 is for GFX so start from 1 for CP */
> -                       soc24_grbm_select(adev, 1 + i, j, k, 0);
> +                       soc24_grbm_select(adev, adev->gfx.me.num_me + i, j, k, 0);
>                         for (reg = 0; reg < reg_count; reg++) {
>                                 adev->gfx.ip_dump_compute_queues[index + reg] =
>                                         RREG32(SOC15_REG_ENTRY_OFFSET(
> --
> 2.34.1
>
RE: [PATCH v1 1/3] drm/amdgpu: add gfx9 register support in ipdump
-----Original Message-----
From: Alex Deucher
Sent: Wednesday, May 29, 2024 7:16 PM
To: Khatri, Sunil
Cc: Deucher, Alexander; Koenig, Christian; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH v1 1/3] drm/amdgpu: add gfx9 register support in ipdump

On Wed, May 29, 2024 at 5:50 AM Sunil Khatri wrote:
>
> Add general registers of gfx9 in ipdump for devcoredump support.
>
> Signed-off-by: Sunil Khatri
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 124 +-
>  1 file changed, 123 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 3c8c5abf35ab..528a20393313 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -149,6 +149,94 @@ MODULE_FIRMWARE("amdgpu/aldebaran_sjt_mec2.bin");
>  #define mmGOLDEN_TSC_COUNT_LOWER_Renoir                0x0026
>  #define mmGOLDEN_TSC_COUNT_LOWER_Renoir_BASE_IDX       1
>
> +static const struct amdgpu_hwip_reg_entry gc_reg_list_9[] = {
> +       SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS2),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_STALLED_STAT1),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_STALLED_STAT2),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_STALLED_STAT1),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_STALLED_STAT1),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_BUSY_STAT),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_BUSY_STAT),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_BUSY_STAT),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_STATUS),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_ERROR),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_BASE),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_RPTR),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_WPTR),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_BASE),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_RPTR),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_WPTR),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_BASE),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_RPTR),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_WPTR),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB2_BASE),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB2_RPTR),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB2_WPTR),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_CMD_BUFSZ),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_CMD_BUFSZ),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_CMD_BUFSZ),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_CMD_BUFSZ),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_BASE_LO),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_BASE_HI),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_BUFSZ),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_BASE_LO),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_BASE_HI),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_BUFSZ),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_BASE_LO),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_BASE_HI),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_BUFSZ),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_BASE_LO),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_BASE_HI),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_BUFSZ),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCPF_UTCL1_STATUS),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCPC_UTCL1_STATUS),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCPG_UTCL1_STATUS),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmGDS_PROTECTION_FAULT),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmGDS_VM_PROTECTION_FAULT),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmIA_UTCL1_STATUS),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmIA_UTCL1_CNTL),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmPA_CL_CNTL_STATUS),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmRLC_UTCL1_STATUS),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmRMI_UTCL1_STATUS),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmSQC_DCACHE_UTCL1_STATUS),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmSQC_ICACHE_UTCL1_STATUS),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmSQ_UTCL1_STATUS),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmTCP_UTCL1_STATUS),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmWD_UTCL1_STATUS),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmVM_L2_PROTECTION_FAULT_CNTL),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmVM_L2_PROTECTION_FAULT_STATUS),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_DEBUG),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_MEC_CNTL),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_INSTR_PNTR),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_MEC1_INSTR_PNTR),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_MEC2_INSTR_PNTR),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_ME_INSTR_PNTR),
> +       SOC15_REG_ENTRY_STR(GC, 0, mmCP_PFP_INSTR_PNTR),
> +       SOC15_R
Re: [PATCH v3 2/4] drm/amdgpu: Add support to dump gfx10 cp registers
On 5/16/2024 1:40 AM, Deucher, Alexander wrote:

[Public]

-----Original Message-----
From: Sunil Khatri
Sent: Wednesday, May 15, 2024 8:18 AM
To: Deucher, Alexander; Koenig, Christian
Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil
Subject: [PATCH v3 2/4] drm/amdgpu: Add support to dump gfx10 cp registers

Add support to dump registers of all instances of cp registers in gfx10.

Signed-off-by: Sunil Khatri
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h |   1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 117 +++-
 2 files changed, 114 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index 30d7f9c29478..d96873c154ed 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -436,6 +436,7 @@ struct amdgpu_gfx {
 	/* IP reg dump */
 	uint32_t		*ipdump_core;
+	uint32_t		*ipdump_cp;

I'd call this ip_dump_compute or ip_dump_compute_queues to align with what the registers represent.

Sure
Alex

 };

 struct amdgpu_gfx_ras_reg_entry {

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index f6d6a4b9802d..daf9a3571183 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -381,6 +381,49 @@ static const struct amdgpu_hwip_reg_entry gc_reg_list_10_1[] = {
 	SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE3)
 };

+static const struct amdgpu_hwip_reg_entry gc_cp_reg_list_10[] = {
+	/* compute registers */
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_VMID),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PERSISTENT_STATE),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PIPE_PRIORITY),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_QUEUE_PRIORITY),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_QUANTUM),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_BASE),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_BASE_HI),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_RPTR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_POLL_ADDR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_POLL_ADDR_HI),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_DOORBELL_CONTROL),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_CONTROL),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_BASE_ADDR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_BASE_ADDR_HI),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_RPTR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_CONTROL),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_REQUEST),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_BASE_ADDR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_BASE_ADDR_HI),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_CONTROL),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_RPTR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_WPTR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_EVENTS),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_BASE_ADDR_LO),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_BASE_ADDR_HI),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_CONTROL),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CNTL_STACK_OFFSET),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CNTL_STACK_SIZE),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_WG_STATE_OFFSET),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_SIZE),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_GDS_RESOURCE_STATE),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_ERROR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_WPTR_MEM),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_LO),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_HI),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_CNTL_STACK_OFFSET),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_CNTL_STACK_DW_CNT),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_WG_STATE_OFFSET),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_STATUS)
+};
+
 static const struct soc15_reg_golden golden_settings_gc_10_1[] = {
 	SOC15_REG_GOLDEN_VALUE(GC, 0, mmCB_HW_CONTROL_4, 0x, 0x00400014),
 	SOC15_REG_GOLDEN_VALUE(GC, 0, mmCGTT_CPF_CLK_CTRL, 0xfcff8fff, 0xf8000100),

@@ -4595,10 +4638,11 @@ static int gfx_v10_0_compute_ring_init(struct amdgpu_device *adev, int ring_id,
 					     hw_prio, NULL);
 }

-static void gfx_v10_0_alloc_dump_mem(struct amdgpu_device *adev)
+static void gfx_v10_0_alloc_ip_dump(struct amdgpu_device *adev)
 {
 	uint32_t reg_count = ARRAY_SIZE(gc_reg_list_10_1);
 	uint32_t *ptr;
+	uint32_t inst;

 	ptr = kcalloc(reg_count, sizeof(uint32_t), GFP_KERNEL);
 	if (ptr == NULL) {
@@ -4607,6 +4651,19 @@ static void gfx_v10_0_alloc_dump_mem(struct amdgpu_device *adev)
 	} else {
 		adev->gfx.ipdump_core = ptr;
 	}
+
+	/* Allocate memory for gfx cp registers for all the instances */
+	reg_count = ARRAY_SIZE(gc_cp_reg_list_10);
+	inst = adev->gfx.mec.num_mec * adev->gfx.mec.num_pipe_per_mec *
+
Re: [PATCH v3 3/4] drm/amdgpu: add support to dump gfx10 queue registers
On 5/16/2024 1:42 AM, Deucher, Alexander wrote:

-----Original Message-----
From: Sunil Khatri
Sent: Wednesday, May 15, 2024 8:18 AM
To: Deucher, Alexander; Koenig, Christian
Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil
Subject: [PATCH v3 3/4] drm/amdgpu: add support to dump gfx10 queue registers

Add gfx queue registers for all instances in ip dump for gfx10.

Signed-off-by: Sunil Khatri
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h |  1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 86 +
 2 files changed, 87 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index d96873c154ed..54232066cd3b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -437,6 +437,7 @@ struct amdgpu_gfx {
 	/* IP reg dump */
 	uint32_t		*ipdump_core;
 	uint32_t		*ipdump_cp;
+	uint32_t		*ipdump_gfx_queue;

I'd call this ip_dump_gfx or ip_dump_gfx_queues to better align with what it stores.

 };

 struct amdgpu_gfx_ras_reg_entry {

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index daf9a3571183..5b8132ecc039 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -424,6 +424,33 @@ static const struct amdgpu_hwip_reg_entry gc_cp_reg_list_10[] = {
 	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_STATUS)
 };

+static const struct amdgpu_hwip_reg_entry gc_gfx_queue_reg_list_10[] = {
+	/* gfx queue registers */
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_ACTIVE),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_QUEUE_PRIORITY),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_BASE),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_BASE_HI),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_OFFSET),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CSMD_RPTR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_WPTR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_WPTR_HI),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_DEQUEUE_REQUEST),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_MAPPED),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_QUE_MGR_CONTROL),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_HQ_CONTROL0),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_HQ_STATUS0),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CE_WPTR_POLL_ADDR_LO),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CE_WPTR_POLL_ADDR_HI),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CE_OFFSET),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CE_CSMD_RPTR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CE_WPTR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HQD_CE_WPTR_HI),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_MQD_BASE_ADDR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_MQD_BASE_ADDR_HI),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_WPTR_POLL_ADDR_LO),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_WPTR_POLL_ADDR_HI)
+};
+
 static const struct soc15_reg_golden golden_settings_gc_10_1[] = {
 	SOC15_REG_GOLDEN_VALUE(GC, 0, mmCB_HW_CONTROL_4, 0x, 0x00400014),
 	SOC15_REG_GOLDEN_VALUE(GC, 0, mmCGTT_CPF_CLK_CTRL, 0xfcff8fff, 0xf8000100),

@@ -4664,6 +4691,19 @@ static void gfx_v10_0_alloc_ip_dump(struct amdgpu_device *adev)
 	} else {
 		adev->gfx.ipdump_cp = ptr;
 	}
+
+	/* Allocate memory for gfx cp queue registers for all the instances */
+	reg_count = ARRAY_SIZE(gc_gfx_queue_reg_list_10);
+	inst = adev->gfx.me.num_me * adev->gfx.me.num_pipe_per_me *
+		adev->gfx.me.num_queue_per_pipe;
+
+	ptr = kcalloc(reg_count * inst, sizeof(uint32_t), GFP_KERNEL);
+	if (ptr == NULL) {
+		DRM_ERROR("Failed to allocate memory for GFX CP IP Dump\n");
+		adev->gfx.ipdump_gfx_queue = NULL;
+	} else {
+		adev->gfx.ipdump_gfx_queue = ptr;
+	}
 }

 static int gfx_v10_0_sw_init(void *handle)
@@ -4874,6 +4914,7 @@ static int gfx_v10_0_sw_fini(void *handle)

 	kfree(adev->gfx.ipdump_core);
 	kfree(adev->gfx.ipdump_cp);
+	kfree(adev->gfx.ipdump_gfx_queue);

 	return 0;
 }
@@ -9368,6 +9409,26 @@ static void gfx_v10_ip_print(void *handle, struct drm_printer *p)
 			}
 		}
 	}
+
+	/* print gfx queue registers for all instances */
+	if (!adev->gfx.ipdump_gfx_queue)
+		return;
+
+	reg_count = ARRAY_SIZE(gc_gfx_queue_reg_list_10);
+
+	for (i = 0; i < adev->gfx.me.num_me; i++) {
+		for (j = 0; j < adev->gfx.me.num_pipe_per_me; j++) {
+			for (k = 0; k < adev->gfx.me.num_queue_per_pipe; k++) {
+				drm_printf(p, "me %d, pipe %d, queue %d\n", i, j, k);
+				for (reg = 0; reg < reg_count; reg++) {
+					drm_
Re: [PATCH v3 1/4] drm/amdgpu: update the ip_dump to ipdump_core
On 5/16/2024 1:37 AM, Deucher, Alexander wrote:

-----Original Message-----
From: Sunil Khatri
Sent: Wednesday, May 15, 2024 8:18 AM
To: Deucher, Alexander; Koenig, Christian
Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil
Subject: [PATCH v3 1/4] drm/amdgpu: update the ip_dump to ipdump_core

Update the memory pointer from ip_dump to ipdump_core to make it specific to core registers; the rest of the registers will be dumped into their own respective memories.

Signed-off-by: Sunil Khatri
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h |  2 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 14 +++---
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index 109f471ff315..30d7f9c29478 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -435,7 +435,7 @@ struct amdgpu_gfx {
 	bool				mcbp; /* mid command buffer preemption */

 	/* IP reg dump */
-	uint32_t			*ip_dump;
+	uint32_t			*ipdump_core;

I think this looks cleaner as ip_dump_core.

Noted
Alex

 };

 struct amdgpu_gfx_ras_reg_entry {

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 953df202953a..f6d6a4b9802d 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -4603,9 +4603,9 @@ static void gfx_v10_0_alloc_dump_mem(struct amdgpu_device *adev)
 	ptr = kcalloc(reg_count, sizeof(uint32_t), GFP_KERNEL);
 	if (ptr == NULL) {
 		DRM_ERROR("Failed to allocate memory for IP Dump\n");
-		adev->gfx.ip_dump = NULL;
+		adev->gfx.ipdump_core = NULL;
 	} else {
-		adev->gfx.ip_dump = ptr;
+		adev->gfx.ipdump_core = ptr;
 	}
 }

@@ -4815,7 +4815,7 @@ static int gfx_v10_0_sw_fini(void *handle)

 	gfx_v10_0_free_microcode(adev);

-	kfree(adev->gfx.ip_dump);
+	kfree(adev->gfx.ipdump_core);

 	return 0;
 }

@@ -9283,13 +9283,13 @@ static void gfx_v10_ip_print(void *handle, struct drm_printer *p)
 	uint32_t i;
 	uint32_t reg_count = ARRAY_SIZE(gc_reg_list_10_1);

-	if (!adev->gfx.ip_dump)
+	if (!adev->gfx.ipdump_core)
 		return;

 	for (i = 0; i < reg_count; i++)
 		drm_printf(p, "%-50s \t 0x%08x\n",
 			   gc_reg_list_10_1[i].reg_name,
-			   adev->gfx.ip_dump[i]);
+			   adev->gfx.ipdump_core[i]);
 }

 static void gfx_v10_ip_dump(void *handle)
@@ -9298,12 +9298,12 @@ static void gfx_v10_ip_dump(void *handle)
 	uint32_t i;
 	uint32_t reg_count = ARRAY_SIZE(gc_reg_list_10_1);

-	if (!adev->gfx.ip_dump)
+	if (!adev->gfx.ipdump_core)
 		return;

 	amdgpu_gfx_off_ctrl(adev, false);
 	for (i = 0; i < reg_count; i++)
-		adev->gfx.ip_dump[i] = RREG32(SOC15_REG_ENTRY_OFFSET(gc_reg_list_10_1[i]));
+		adev->gfx.ipdump_core[i] =
+			RREG32(SOC15_REG_ENTRY_OFFSET(gc_reg_list_10_1[i]));

 	amdgpu_gfx_off_ctrl(adev, true);
 }
--
2.34.1
Re: [PATCH v1 3/4] drm/amdgpu: add compute registers in ip dump for gfx10
On 5/3/2024 9:52 PM, Alex Deucher wrote:
On Fri, May 3, 2024 at 12:09 PM Khatri, Sunil wrote:
On 5/3/2024 9:18 PM, Khatri, Sunil wrote:
On 5/3/2024 8:52 PM, Alex Deucher wrote:
On Fri, May 3, 2024 at 4:45 AM Sunil Khatri wrote:

Add compute registers in the set of registers to dump during ip dump for gfx10.

Signed-off-by: Sunil Khatri
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 42 +-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 953df202953a..00c7a842ea3b 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -378,7 +378,47 @@ static const struct amdgpu_hwip_reg_entry gc_reg_list_10_1[] = {
 	SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE0),
 	SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE1),
 	SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE2),
-	SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE3)
+	SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE3),
+	/* compute registers */
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_VMID),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PERSISTENT_STATE),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PIPE_PRIORITY),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_QUEUE_PRIORITY),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_QUANTUM),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_BASE),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_BASE_HI),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_RPTR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_POLL_ADDR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_POLL_ADDR_HI),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_DOORBELL_CONTROL),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_CONTROL),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_BASE_ADDR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_BASE_ADDR_HI),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_RPTR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_CONTROL),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_REQUEST),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_BASE_ADDR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_BASE_ADDR_HI),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_CONTROL),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_RPTR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_WPTR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_EVENTS),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_BASE_ADDR_LO),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_BASE_ADDR_HI),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_CONTROL),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CNTL_STACK_OFFSET),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CNTL_STACK_SIZE),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_WG_STATE_OFFSET),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_SIZE),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_GDS_RESOURCE_STATE),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_ERROR),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_WPTR_MEM),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_LO),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_HI),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_CNTL_STACK_OFFSET),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_CNTL_STACK_DW_CNT),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_WG_STATE_OFFSET),
+	SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_STATUS)

The registers in patches 3 and 4 are multi-instance, so we should ideally print every instance of them rather than just one. Use nv_grbm_select() to select the pipes and queues. Make sure to protect access using the adev->srbm_mutex mutex.

E.g., for the compute registers (patch 3):

mutex_lock(&adev->srbm_mutex);
for (i = 0; i < adev->gfx.mec.num_mec; ++i) {
	for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
		for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
			drm_printf(p, "mec %d, pipe %d, queue %d\n", i, j, k);
			nv_grbm_select(adev, i, j, k, 0);
			for (reg = 0; reg < ARRAY_SIZE(compute_regs); reg++)
				drm_printf(...RREG(compute_regs[reg]));
		}
	}
}
nv_grbm_select(adev, 0, 0, 0, 0);
mutex_unlock(&adev->srbm_mutex);

For gfx registers (patch 4):

mutex_lock(&adev->srbm_mutex);
for (i = 0; i < adev->gfx.me.num_me; ++i) {
	for (j = 0; j < adev->gfx.me.num_pipe_per_me; j++) {
		for (k = 0; k < adev->gfx.me.num_queue_per_pipe; k++) {
			drm_printf(p, "me %d, pipe %d, queue %d\n", i, j, k);
			nv_grbm_select(adev, i, j, k, 0);
			for (reg = 0; reg < ARRAY_SIZE(gfx_regs); reg++)
				drm_printf(...RREG(gfx_regs[
Re: [PATCH v1 3/4] drm/amdgpu: add compute registers in ip dump for gfx10
On 5/3/2024 9:18 PM, Khatri, Sunil wrote:
On 5/3/2024 8:52 PM, Alex Deucher wrote:
On Fri, May 3, 2024 at 4:45 AM Sunil Khatri wrote:

[snip - patch, register list, and sample loop code quoted in full above]

The registers in patches 3 and 4 are multi-instance, so we should ideally print every instance of them rather than just one. Use nv_grbm_select() to select the pipes and queues. Make sure to protect access using the adev->srbm_mutex mutex.

I see one problem here: we dump the registers in memory allocated first, and read before and store, and the
Re: [PATCH v1 3/4] drm/amdgpu: add compute registers in ip dump for gfx10
On 5/3/2024 8:52 PM, Alex Deucher wrote: On Fri, May 3, 2024 at 4:45 AM Sunil Khatri wrote: add compute registers in set of registers to dump during ip dump for gfx10. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 42 +- 1 file changed, 41 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c index 953df202953a..00c7a842ea3b 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c @@ -378,7 +378,47 @@ static const struct amdgpu_hwip_reg_entry gc_reg_list_10_1[] = { SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE0), SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE1), SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE2), - SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE3) + SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS_SE3), + /* compute registers */ + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_VMID), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PERSISTENT_STATE), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PIPE_PRIORITY), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_QUEUE_PRIORITY), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_QUANTUM), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_BASE), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_BASE_HI), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_RPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_POLL_ADDR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_POLL_ADDR_HI), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_DOORBELL_CONTROL), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_CONTROL), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_BASE_ADDR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_BASE_ADDR_HI), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_RPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_IB_CONTROL), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_REQUEST), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_BASE_ADDR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_BASE_ADDR_HI), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_CONTROL), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_RPTR), + 
SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_WPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_EVENTS), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_BASE_ADDR_LO), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_BASE_ADDR_HI), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_CONTROL), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CNTL_STACK_OFFSET), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CNTL_STACK_SIZE), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_WG_STATE_OFFSET), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_CTX_SAVE_SIZE), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_GDS_RESOURCE_STATE), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_ERROR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_EOP_WPTR_MEM), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_LO), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_PQ_WPTR_HI), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_CNTL_STACK_OFFSET), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_CNTL_STACK_DW_CNT), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_SUSPEND_WG_STATE_OFFSET), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_HQD_DEQUEUE_STATUS) The registers in patches 3 and 4 are multi-instance, so we should ideally print every instance of them rather than just one. Use nv_grbm_select() to select the pipes and queues. Make sure to protect access using the adev->srbm_mutex mutex. 
E.g., for the compute registers (patch 3):

mutex_lock(&adev->srbm_mutex);
for (i = 0; i < adev->gfx.mec.num_mec; ++i) {
	for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
		for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
			drm_printf("mec %d, pipe %d, queue %d\n", i, j, k);
			nv_grbm_select(adev, i, j, k, 0);
			for (reg = 0; reg < ARRAY_SIZE(compute_regs); reg++)
				drm_printf(...RREG(compute_regs[reg]));
		}
	}
}
nv_grbm_select(adev, 0, 0, 0, 0);
mutex_unlock(&adev->srbm_mutex);

For the gfx registers (patch 4):

mutex_lock(&adev->srbm_mutex);
for (i = 0; i < adev->gfx.me.num_me; ++i) {
	for (j = 0; j < adev->gfx.me.num_pipe_per_me; j++) {
		for (k = 0; k < adev->gfx.me.num_queue_per_pipe; k++) {
			drm_printf("me %d, pipe %d, queue %d\n", i, j, k);
			nv_grbm_select(adev, i, j, k, 0);
			for (reg = 0; reg < ARRAY_SIZE(gfx_regs); reg++)
				drm_printf(...RREG(gfx_regs[reg]));
		}
	}
}
nv_grbm_select(adev, 0, 0, 0, 0);
mutex_unlock(&adev->srbm_mutex);

Thanks for pointing that out and suggesting sample code for how it should be done. Will take care of this in the next version.
Re: [PATCH] drm/amdgpu: skip ip dump if devcoredump flag is set
On 4/25/2024 7:43 PM, Lazar, Lijo wrote:
On 4/25/2024 3:53 PM, Sunil Khatri wrote:
Do not dump the ip registers during driver reload in a passthrough environment.
Signed-off-by: Sunil Khatri
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 869256394136..b50758482530 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5372,10 +5372,12 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
 	amdgpu_reset_reg_dumps(tmp_adev);

Probably not related, but can the above step be clubbed with what's being done below? Or can we move all such steps to start with amdgpu_reset_dump_*?

Sure Lijo, I will club both dump_ip_state and amdgpu_reset_reg_dumps under one if condition in the patch to push.
Regards
Sunil

 	/* Trigger ip dump before we reset the asic */
-	for (i = 0; i < tmp_adev->num_ip_blocks; i++)
-		if (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
-			tmp_adev->ip_blocks[i].version->funcs->dump_ip_state(
-				(void *)tmp_adev);
+	if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
+		for (i = 0; i < tmp_adev->num_ip_blocks; i++)
+			if (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
+				tmp_adev->ip_blocks[i].version->funcs
+					->dump_ip_state((void *)tmp_adev);
+	}

Anyway, Reviewed-by: Lijo Lazar
Thanks,
Lijo

 	reset_context->reset_device_list = device_list_handle;
 	r = amdgpu_reset_perform_reset(tmp_adev, reset_context);
Re: [PATCH v5 2/6] drm/amdgpu: add support of gfx10 register dump
On 4/17/2024 10:21 PM, Alex Deucher wrote:
On Wed, Apr 17, 2024 at 12:24 PM Lazar, Lijo wrote:
Yes, right now that API doesn't return anything. What I meant is to add that check as well, since the coredump API is essentially used in hang situations. In older times, access to registers while in GFXOFF resulted in a system hang (basically it won't go beyond this point). If that happens, then the purpose of the patch - to get the context of a device hang - is lost. We may not even get a proper dmesg log.

Maybe add a call to amdgpu_get_gfx_off_status(), but unfortunately, it's not implemented on every chip yet.

So we need both: disable gfx_off, then try the status, then read the registers, and enable gfx_off again.

amdgpu_gfx_off_ctrl(adev, false);
r = amdgpu_get_gfx_off_status(...);
if (!r) {
	for (i = 0; i < reg_count; i++)
		adev->gfx.ip_dump[i] = RREG32(SOC15_REG_ENTRY_OFFSET(gc_reg_list_10_1[i]));
}
amdgpu_gfx_off_ctrl(adev, true);

Sunil

Alex

Thanks,
Lijo

-----Original Message-----
From: Khatri, Sunil
Sent: Wednesday, April 17, 2024 9:42 PM
To: Lazar, Lijo ; Alex Deucher ; Khatri, Sunil
Cc: Deucher, Alexander ; Koenig, Christian ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH v5 2/6] drm/amdgpu: add support of gfx10 register dump

On 4/17/2024 9:31 PM, Lazar, Lijo wrote:
On 4/17/2024 9:21 PM, Alex Deucher wrote:
On Wed, Apr 17, 2024 at 5:38 AM Sunil Khatri wrote:
Adding gfx10 gc registers to be used for register dump via devcoredump during a gpu reset.
Signed-off-by: Sunil Khatri Reviewed-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 8 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h | 4 + drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c| 130 +- drivers/gpu/drm/amd/amdgpu/soc15.h| 2 + .../include/asic_reg/gc/gc_10_1_0_offset.h| 12 ++ 5 files changed, 155 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index e0d7f4ee7e16..cac0ca64367b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -139,6 +139,14 @@ enum amdgpu_ss { AMDGPU_SS_DRV_UNLOAD }; +struct amdgpu_hwip_reg_entry { + u32 hwip; + u32 inst; + u32 seg; + u32 reg_offset; + const char *reg_name; +}; + struct amdgpu_watchdog_timer { bool timeout_fatal_disable; uint32_t period; /* maxCycles = (1 << period), the number of cycles before a timeout */ diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h index 04a86dff71e6..64f197bbc866 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h @@ -433,6 +433,10 @@ struct amdgpu_gfx { uint32_tnum_xcc_per_xcp; struct mutexpartition_mutex; boolmcbp; /* mid command buffer preemption */ + + /* IP reg dump */ + uint32_t*ip_dump; + uint32_treg_count; }; struct amdgpu_gfx_ras_reg_entry { diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c index a0bc4196ff8b..4a54161f4837 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c @@ -276,6 +276,99 @@ MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec.bin"); MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec2.bin"); MODULE_FIRMWARE("amdgpu/gc_10_3_7_rlc.bin"); +static const struct amdgpu_hwip_reg_entry gc_reg_list_10_1[] = { + SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS), + SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS2), + SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS3), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_STALLED_STAT1), + SOC15_REG_ENTRY_STR(GC, 0, 
mmCP_STALLED_STAT2), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_STALLED_STAT1), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_STALLED_STAT1), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_BUSY_STAT), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_BUSY_STAT), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_BUSY_STAT), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_BUSY_STAT2), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_BUSY_STAT2), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_STATUS), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_ERROR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HPD_STATUS0), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_BASE), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_RPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_WPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_BASE), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_RPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_WPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_BASE), + SOC15_REG_ENTRY_STR(GC, 0, mmCP
Re: [PATCH v5 2/6] drm/amdgpu: add support of gfx10 register dump
On 4/17/2024 9:31 PM, Lazar, Lijo wrote: On 4/17/2024 9:21 PM, Alex Deucher wrote: On Wed, Apr 17, 2024 at 5:38 AM Sunil Khatri wrote: Adding gfx10 gc registers to be used for register dump via devcoredump during a gpu reset. Signed-off-by: Sunil Khatri Reviewed-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 8 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h | 4 + drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c| 130 +- drivers/gpu/drm/amd/amdgpu/soc15.h| 2 + .../include/asic_reg/gc/gc_10_1_0_offset.h| 12 ++ 5 files changed, 155 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index e0d7f4ee7e16..cac0ca64367b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -139,6 +139,14 @@ enum amdgpu_ss { AMDGPU_SS_DRV_UNLOAD }; +struct amdgpu_hwip_reg_entry { + u32 hwip; + u32 inst; + u32 seg; + u32 reg_offset; + const char *reg_name; +}; + struct amdgpu_watchdog_timer { bool timeout_fatal_disable; uint32_t period; /* maxCycles = (1 << period), the number of cycles before a timeout */ diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h index 04a86dff71e6..64f197bbc866 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h @@ -433,6 +433,10 @@ struct amdgpu_gfx { uint32_tnum_xcc_per_xcp; struct mutexpartition_mutex; boolmcbp; /* mid command buffer preemption */ + + /* IP reg dump */ + uint32_t*ip_dump; + uint32_treg_count; }; struct amdgpu_gfx_ras_reg_entry { diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c index a0bc4196ff8b..4a54161f4837 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c @@ -276,6 +276,99 @@ MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec.bin"); MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec2.bin"); MODULE_FIRMWARE("amdgpu/gc_10_3_7_rlc.bin"); +static const struct amdgpu_hwip_reg_entry 
gc_reg_list_10_1[] = { + SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS), + SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS2), + SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS3), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_STALLED_STAT1), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_STALLED_STAT2), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_STALLED_STAT1), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_STALLED_STAT1), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_BUSY_STAT), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_BUSY_STAT), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_BUSY_STAT), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_BUSY_STAT2), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_BUSY_STAT2), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_STATUS), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_ERROR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HPD_STATUS0), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_BASE), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_RPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_WPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_BASE), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_RPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_WPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_BASE), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_RPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_WPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB2_BASE), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB2_WPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB2_WPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_CMD_BUFSZ), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_CMD_BUFSZ), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_CMD_BUFSZ), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_CMD_BUFSZ), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_BASE_LO), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_BASE_HI), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_BUFSZ), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_BASE_LO), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_BASE_HI), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_BUFSZ), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_BASE_LO), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_BASE_HI), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_BUFSZ), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_BASE_LO), + 
SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_BASE_HI), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_BUFSZ), + SOC15_REG_ENTRY_STR(GC, 0, mmCPF_UTCL1_STATUS), + SOC15_REG_ENTRY_STR(GC, 0, mmCPC_UTCL1_STATUS), + SOC15_REG_ENTRY_STR(GC, 0, mmCPG_UTCL1_STATUS), + SOC15_REG_ENTRY_STR(GC, 0, mmGDS_PROTECTION_FAULT), + SOC15_REG_ENTRY_STR(GC, 0,
Re: [PATCH v4 2/6] drm/amdgpu: add support of gfx10 register dump
On 4/17/2024 2:15 PM, Christian König wrote: Am 17.04.24 um 10:18 schrieb Sunil Khatri: Adding gfx10 gc registers to be used for register dump via devcoredump during a gpu reset. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 8 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h | 4 + drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 130 +- drivers/gpu/drm/amd/amdgpu/soc15.h | 2 + .../include/asic_reg/gc/gc_10_1_0_offset.h | 12 ++ 5 files changed, 155 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index e0d7f4ee7e16..210af65a744c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -139,6 +139,14 @@ enum amdgpu_ss { AMDGPU_SS_DRV_UNLOAD }; +struct amdgpu_hwip_reg_entry { + u32 hwip; + u32 inst; + u32 seg; + u32 reg_offset; + char reg_name[50]; Make that a const char *. Otherwise it bloats up the final binary because the compiler has to add zeros at the end. Noted. +}; + struct amdgpu_watchdog_timer { bool timeout_fatal_disable; uint32_t period; /* maxCycles = (1 << period), the number of cycles before a timeout */ diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h index 04a86dff71e6..64f197bbc866 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h @@ -433,6 +433,10 @@ struct amdgpu_gfx { uint32_t num_xcc_per_xcp; struct mutex partition_mutex; bool mcbp; /* mid command buffer preemption */ + + /* IP reg dump */ + uint32_t *ip_dump; + uint32_t reg_count; }; struct amdgpu_gfx_ras_reg_entry { diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c index a0bc4196ff8b..4a54161f4837 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c @@ -276,6 +276,99 @@ MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec.bin"); MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec2.bin"); MODULE_FIRMWARE("amdgpu/gc_10_3_7_rlc.bin"); 
+static const struct amdgpu_hwip_reg_entry gc_reg_list_10_1[] = { + SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS), + SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS2), + SOC15_REG_ENTRY_STR(GC, 0, mmGRBM_STATUS3), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_STALLED_STAT1), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_STALLED_STAT2), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_STALLED_STAT1), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_STALLED_STAT1), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_BUSY_STAT), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_BUSY_STAT), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_BUSY_STAT), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPC_BUSY_STAT2), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_BUSY_STAT2), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CPF_STATUS), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_ERROR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_GFX_HPD_STATUS0), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_BASE), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_RPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB_WPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_BASE), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_RPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB0_WPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_BASE), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_RPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB1_WPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB2_BASE), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB2_WPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_RB2_WPTR), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_CMD_BUFSZ), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_CMD_BUFSZ), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_CMD_BUFSZ), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_CMD_BUFSZ), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_BASE_LO), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_BASE_HI), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB1_BUFSZ), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_BASE_LO), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_BASE_HI), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_CE_IB2_BUFSZ), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_BASE_LO), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_BASE_HI), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB1_BUFSZ), + 
SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_BASE_LO), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_BASE_HI), + SOC15_REG_ENTRY_STR(GC, 0, mmCP_IB2_BUFSZ), + SOC15_REG_ENTRY_STR(GC, 0, mmCPF_UTCL1_STATUS), + SOC15_REG_ENTRY_STR(GC, 0, mmCPC_UTCL1_STATUS), + SOC15_REG_ENTRY_STR(GC, 0, mmCPG_UTCL1_STATUS), + SOC15_REG_ENTRY_STR(GC, 0, mmGDS_PROTECTION_FAULT), + SOC15_REG_ENTRY_STR(GC, 0, mmGDS_VM_PROTECTION_FAULT), + SOC15_REG_ENTRY_STR(GC, 0, mmIA_UTCL1_STATUS), + SOC15_REG_ENTRY_STR(GC, 0, mmIA_UTCL1_STATUS_2), + SOC15_REG_ENTRY_STR(GC, 0, mmPA_CL_CNTL_STATUS), + SOC15_REG_ENTRY_STR(GC, 0,
Re: [PATCH v2] drm/amdgpu: Skip the coredump collection on reset during driver reload
On 4/17/2024 1:19 PM, Lazar, Lijo wrote:
On 4/17/2024 1:14 PM, Khatri, Sunil wrote:
On 4/17/2024 1:06 PM, Khatri, Sunil wrote:
devcoredump is used to debug gpu hangs/resets. So in the normal flow, when there is a hang due to a ring timeout or page fault, we do a hard reset since soft reset fails in those cases. How are we making sure that the devcoredump is triggered in those cases and captured?
Regards
Sunil Khatri

On 4/17/2024 9:43 AM, Ahmad Rehman wrote:
In a passthrough environment, the driver triggers the mode-1 reset on reload. The reset causes the core dump collection, which is a delayed task and prevents the driver from unloading until it is completed. Since we do not need to collect data in the "reset on reload" case, we can skip core dump collection.
v2: Use the same flag to avoid calling amdgpu_reset_reg_dumps as well.
Signed-off-by: Ahmad Rehman
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 1 +
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1b2e177bc2d6..c718982cffa8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5357,7 +5357,9 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
 	/* Try reset handler method first */
 	tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
 				    reset_list);
-	amdgpu_reset_reg_dumps(tmp_adev);
+
+	if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))
+		amdgpu_reset_reg_dumps(tmp_adev);

 	reset_context->reset_device_list = device_list_handle;
 	r = amdgpu_reset_perform_reset(tmp_adev, reset_context);
@@ -5430,7 +5432,8 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,

 			vram_lost = amdgpu_device_check_vram_lost(tmp_adev);

-			amdgpu_coredump(tmp_adev, vram_lost, reset_context);
+			if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))
+				amdgpu_coredump(tmp_adev, vram_lost, reset_context);

 			if (vram_lost) {
 				DRM_INFO("VRAM is lost due to GPU reset!\n");
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 6ea893ad9a36..c512f70b8272 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2481,6 +2481,7 @@ static void amdgpu_drv_delayed_reset_work_handler(struct work_struct *work)
 	/* Use a common context, just need to make sure full reset is done */
 	set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
+	set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);

If this is used for guests only, can we rather have a flag like amdgpu_sriov_vf for setting the skip-coredump flag?

A reset is not always triggered just because of a hang. There are other cases, e.g. we want to do a reset after a suspend/resume cycle so that the device starts from a clean state. Those are intentionally triggered by the driver. Also, there are cases like RAS errors where we reset, and those also really don't need a core dump. In all such cases, this flag is required, and this is one such case (this patch only addresses passthrough).
Thanks
Lijo

Able to verify that in normal hangs the dump is working.
Regards
Sunil

 	r = amdgpu_do_asic_reset(&device_list, &reset_context);
 	if (r) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index 66125d43cf21..b11d190ece53 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -32,6 +32,7 @@ enum AMDGPU_RESET_FLAGS {
 	AMDGPU_NEED_FULL_RESET = 0,
 	AMDGPU_SKIP_HW_RESET = 1,
+	AMDGPU_SKIP_COREDUMP = 2,
 };

 struct amdgpu_reset_context {
Re: [PATCH 6/6] drm/amdgpu: add ip dump for each ip in devcoredump
On 4/16/2024 7:29 PM, Alex Deucher wrote:
On Tue, Apr 16, 2024 at 8:08 AM Sunil Khatri wrote:
Add ip dump for each ip of the asic in the devcoredump for all the ips where a callback is registered for register dump.
Signed-off-by: Sunil Khatri
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
index 64fe564b8036..70167f63b4f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
@@ -262,6 +262,21 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
 	drm_printf(&p, "Faulty page starting at address: 0x%016llx\n", fault_info->addr);
 	drm_printf(&p, "Protection fault status register: 0x%x\n\n", fault_info->status);

+	/* dump the ip state for each ip */
+	drm_printf(&p, "Register Dump\n");
+	for (int i = 0; i < coredump->adev->num_ip_blocks; i++) {
+		if (coredump->adev->ip_blocks[i].version->funcs->print_ip_state) {
+			drm_printf(&p, "IP: %s\n",
+				   coredump->adev->ip_blocks[i]
+					   .version->funcs->name);
+			drm_printf(&p, "Offset \t Value\n");

I think we can drop the drm_printf line above if we use register names rather than offsets in the print functions. This also allows IPs to dump stuff besides registers if they want.

Noted
Sunil

Alex

+			coredump->adev->ip_blocks[i]
+				.version->funcs->print_ip_state(
+					(void *)coredump->adev, &p);
+			drm_printf(&p, "\n");
+		}
+	}
+
 	/* Add ring buffer information */
 	drm_printf(&p, "Ring buffer information\n");
 	for (int i = 0; i < coredump->adev->num_rings; i++) {
-- 
2.34.1
Re: [PATCH 4/6] drm/amdgpu: add support for gfx v10 print
On 4/16/2024 7:27 PM, Alex Deucher wrote:
On Tue, Apr 16, 2024 at 8:08 AM Sunil Khatri wrote:
Add support to print ip information to be used to print registers in the devcoredump buffer.
Signed-off-by: Sunil Khatri
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 822bee932041..a7c2a3ddd613 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -9268,6 +9268,21 @@ static void gfx_v10_0_emit_mem_sync(struct amdgpu_ring *ring)
 	amdgpu_ring_write(ring, gcr_cntl);	/* GCR_CNTL */
 }

+static void gfx_v10_ip_print(void *handle, struct drm_printer *p)
+{
+	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
+	uint32_t i;
+	uint32_t reg_count = ARRAY_SIZE(gc_reg_list_10_1);
+
+	if (!adev->gfx.ip_dump)
+		return;
+
+	for (i = 0; i < reg_count; i++)
+		drm_printf(p, "0x%04x \t 0x%08x\n",
+			   adev->gfx.ip_dump[i].offset,

Print the name of the register rather than the offset here to make the output easier to read. See my comments from patch 2.

Is just the register name and value fine, or do we need the offset too? Also, I am assuming stringifying the macro is good enough?
e.g.: #define mmGRBM_STATUS 0x0da4
So printing the register name exactly as mmGRBM_STATUS is acceptable? We don't need to strip the mm prefix, as that makes it complicated.

+			   adev->gfx.ip_dump[i].value);
+}
+
 static void gfx_v10_ip_dump(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
@@ -9300,7 +9315,7 @@ static const struct amd_ip_funcs gfx_v10_0_ip_funcs = {
 	.set_powergating_state = gfx_v10_0_set_powergating_state,
 	.get_clockgating_state = gfx_v10_0_get_clockgating_state,
 	.dump_ip_state = gfx_v10_ip_dump,
-	.print_ip_state = NULL,
+	.print_ip_state = gfx_v10_ip_print,
 };

 static const struct amdgpu_ring_funcs gfx_v10_0_ring_funcs_gfx = {
-- 
2.34.1
Re: [PATCH 2/6] drm/amdgpu: add support of gfx10 register dump
On 4/16/2024 7:30 PM, Christian König wrote: Am 16.04.24 um 15:55 schrieb Alex Deucher: On Tue, Apr 16, 2024 at 8:08 AM Sunil Khatri wrote: Adding gfx10 gc registers to be used for register dump via devcoredump during a gpu reset. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 12 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h | 4 + drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 131 +- .../include/asic_reg/gc/gc_10_1_0_offset.h | 12 ++ 4 files changed, 158 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index e0d7f4ee7e16..e016ac33629d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -139,6 +139,18 @@ enum amdgpu_ss { AMDGPU_SS_DRV_UNLOAD }; +struct hwip_reg_entry { + u32 hwip; + u32 inst; + u32 seg; + u32 reg_offset; +}; + +struct reg_pair { + u32 offset; + u32 value; +}; + struct amdgpu_watchdog_timer { bool timeout_fatal_disable; uint32_t period; /* maxCycles = (1 << period), the number of cycles before a timeout */ diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h index 04a86dff71e6..295a2c8d2e48 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h @@ -433,6 +433,10 @@ struct amdgpu_gfx { uint32_t num_xcc_per_xcp; struct mutex partition_mutex; bool mcbp; /* mid command buffer preemption */ + + /* IP reg dump */ + struct reg_pair *ip_dump; + uint32_t reg_count; }; struct amdgpu_gfx_ras_reg_entry { diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c index a0bc4196ff8b..46e136609ff1 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c @@ -276,6 +276,99 @@ MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec.bin"); MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec2.bin"); MODULE_FIRMWARE("amdgpu/gc_10_3_7_rlc.bin"); +static const struct hwip_reg_entry gc_reg_list_10_1[] = { + { 
SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS) }, + { SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS2) }, + { SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS3) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT1) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT2) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_STALLED_STAT1) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_STALLED_STAT1) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_BUSY_STAT) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_BUSY_STAT) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_BUSY_STAT) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_BUSY_STAT2) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_BUSY_STAT2) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_STATUS) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_GFX_ERROR) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_GFX_HPD_STATUS0) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB_BASE) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB_RPTR) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB_WPTR) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB0_BASE) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB0_RPTR) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB0_WPTR) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB1_BASE) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB1_RPTR) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB1_WPTR) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB2_BASE) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB2_WPTR) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB2_WPTR) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB1_CMD_BUFSZ) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB2_CMD_BUFSZ) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_IB1_CMD_BUFSZ) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_IB2_CMD_BUFSZ) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB1_BASE_LO) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB1_BASE_HI) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB1_BUFSZ) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB2_BASE_LO) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB2_BASE_HI) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB2_BUFSZ) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_IB1_BASE_LO) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_IB1_BASE_HI) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_IB1_BUFSZ) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_IB2_BASE_LO) }, + { SOC15_REG_ENTRY(GC, 0, 
mmCP_IB2_BASE_HI) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_IB2_BUFSZ) }, + { SOC15_REG_ENTRY(GC, 0, mmCPF_UTCL1_STATUS) }, + { SOC15_REG_ENTRY(GC, 0, mmCPC_UTCL1_STATUS) }, + { SOC15_REG_ENTRY(GC, 0, mmCPG_UTCL1_STATUS) }, + { SOC15_REG_ENTRY(GC, 0, mmGDS_PROTECTION_FAULT) }, + { SOC15_REG_ENTRY(GC, 0, mmGDS_VM_PROTECTION_FAULT) }, + { SOC15_REG_ENTRY(GC, 0, mmIA_UTCL1_STATUS) }, + {
Re: [PATCH 2/6] drm/amdgpu: add support of gfx10 register dump
On 4/16/2024 7:25 PM, Alex Deucher wrote: On Tue, Apr 16, 2024 at 8:08 AM Sunil Khatri wrote: Adding gfx10 gc registers to be used for register dump via devcoredump during a gpu reset. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 12 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h | 4 + drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c| 131 +- .../include/asic_reg/gc/gc_10_1_0_offset.h| 12 ++ 4 files changed, 158 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index e0d7f4ee7e16..e016ac33629d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -139,6 +139,18 @@ enum amdgpu_ss { AMDGPU_SS_DRV_UNLOAD }; +struct hwip_reg_entry { + u32 hwip; + u32 inst; + u32 seg; + u32 reg_offset; +}; + +struct reg_pair { + u32 offset; + u32 value; +}; + struct amdgpu_watchdog_timer { bool timeout_fatal_disable; uint32_t period; /* maxCycles = (1 << period), the number of cycles before a timeout */ diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h index 04a86dff71e6..295a2c8d2e48 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h @@ -433,6 +433,10 @@ struct amdgpu_gfx { uint32_tnum_xcc_per_xcp; struct mutexpartition_mutex; boolmcbp; /* mid command buffer preemption */ + + /* IP reg dump */ + struct reg_pair *ip_dump; + uint32_treg_count; }; struct amdgpu_gfx_ras_reg_entry { diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c index a0bc4196ff8b..46e136609ff1 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c @@ -276,6 +276,99 @@ MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec.bin"); MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec2.bin"); MODULE_FIRMWARE("amdgpu/gc_10_3_7_rlc.bin"); +static const struct hwip_reg_entry gc_reg_list_10_1[] = { + { SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS) }, + { SOC15_REG_ENTRY(GC, 0, 
mmGRBM_STATUS2) }, + { SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS3) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT1) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT2) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_STALLED_STAT1) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_STALLED_STAT1) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_BUSY_STAT) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_BUSY_STAT) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_BUSY_STAT) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_BUSY_STAT2) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_BUSY_STAT2) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_STATUS) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_GFX_ERROR) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_GFX_HPD_STATUS0) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB_BASE) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB_RPTR) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB_WPTR) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB0_BASE) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB0_RPTR) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB0_WPTR) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB1_BASE) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB1_RPTR) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB1_WPTR) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB2_BASE) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB2_WPTR) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_RB2_WPTR) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB1_CMD_BUFSZ) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB2_CMD_BUFSZ) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_IB1_CMD_BUFSZ) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_IB2_CMD_BUFSZ) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB1_BASE_LO) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB1_BASE_HI) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB1_BUFSZ) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB2_BASE_LO) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB2_BASE_HI) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CE_IB2_BUFSZ) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_IB1_BASE_LO) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_IB1_BASE_HI) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_IB1_BUFSZ) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_IB2_BASE_LO) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_IB2_BASE_HI) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_IB2_BUFSZ) }, + { 
SOC15_REG_ENTRY(GC, 0, mmCPF_UTCL1_STATUS) }, + { SOC15_REG_ENTRY(GC, 0, mmCPC_UTCL1_STATUS) }, + { SOC15_REG_ENTRY(GC, 0, mmCPG_UTCL1_STATUS) }, + { SOC15_REG_ENTRY(GC, 0, mmGDS_PROTECTION_FAULT) }, + { SOC15_REG_ENTRY(GC, 0, mmGDS_VM_PROTECTION_FAULT) }, + { SOC15_REG_ENTRY(GC, 0, mmIA_UTCL1_STATUS) }, + { SOC15_REG_ENTRY(GC, 0, mmIA_UTCL1_STATUS_2) }, + {
Re: [PATCH v3 4/5] drm/amdgpu: enable redirection of irq's for IH V6.0
On 4/16/2024 7:56 PM, Alex Deucher wrote: On Tue, Apr 16, 2024 at 9:34 AM Sunil Khatri wrote: Enable redirection of irq for pagefaults for specific clients to avoid overflow without dropping interrupts. So here we redirect the interrupts to another IH ring i.e ring1 where only these interrupts are processed. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/ih_v6_0.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c index 26dc99232eb6..8869aac03b82 100644 --- a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c +++ b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c @@ -346,6 +346,21 @@ static int ih_v6_0_irq_init(struct amdgpu_device *adev) DELAY, 3); WREG32_SOC15(OSSSYS, 0, regIH_MSI_STORM_CTRL, tmp); + /* Redirect the interrupts to IH RB1 fpr dGPU */ fpr -> for Sure will fix it when pushing the change to staging branch. Regards Sunil khatri Alex + if (adev->irq.ih1.ring_size) { + tmp = RREG32_SOC15(OSSSYS, 0, regIH_RING1_CLIENT_CFG_INDEX); + tmp = REG_SET_FIELD(tmp, IH_RING1_CLIENT_CFG_INDEX, INDEX, 0); + WREG32_SOC15(OSSSYS, 0, regIH_RING1_CLIENT_CFG_INDEX, tmp); + + tmp = RREG32_SOC15(OSSSYS, 0, regIH_RING1_CLIENT_CFG_DATA); + tmp = REG_SET_FIELD(tmp, IH_RING1_CLIENT_CFG_DATA, CLIENT_ID, 0xa); + tmp = REG_SET_FIELD(tmp, IH_RING1_CLIENT_CFG_DATA, SOURCE_ID, 0x0); + tmp = REG_SET_FIELD(tmp, IH_RING1_CLIENT_CFG_DATA, + SOURCE_ID_MATCH_ENABLE, 0x1); + + WREG32_SOC15(OSSSYS, 0, regIH_RING1_CLIENT_CFG_DATA, tmp); + } + pci_set_master(adev->pdev); /* enable interrupts */ -- 2.34.1
RE: [PATCH v2 2/2] drm/amdgpu: Add support of gfx10 register dump
[AMD Official Use Only - General] -Original Message- From: Alex Deucher Sent: Saturday, April 13, 2024 1:56 AM To: Khatri, Sunil Cc: Khatri, Sunil ; Deucher, Alexander ; Koenig, Christian ; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH v2 2/2] drm/amdgpu: Add support of gfx10 register dump On Fri, Apr 12, 2024 at 1:31 PM Khatri, Sunil wrote: > > > On 4/12/2024 10:42 PM, Alex Deucher wrote: > > On Fri, Apr 12, 2024 at 1:05 PM Khatri, Sunil wrote: > > On 4/12/2024 8:50 PM, Alex Deucher wrote: > > On Fri, Apr 12, 2024 at 10:00 AM Sunil Khatri wrote: > > Adding initial set of registers for ipdump during devcoredump starting > with gfx10 gc registers. > > ip dump is triggered when gpu reset happens via devcoredump and the > memory is allocated by each ip and is freed once the dump is complete > by devcoredump. > > Signed-off-by: Sunil Khatri > --- > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 16 +++ > .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 22 +++ > > I would split this into two patches, one to add the core > infrastructure in devcoredump and one to add gfx10 support. The core > support could be squashed into patch 1 as well. 
> > Sure > > drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c| 127 +- > .../include/asic_reg/gc/gc_10_1_0_offset.h| 12 ++ > 4 files changed, 176 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > index 65c17c59c152..e173ad86a241 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > @@ -139,6 +139,18 @@ enum amdgpu_ss { > AMDGPU_SS_DRV_UNLOAD > }; > > +struct hwip_reg_entry { > + u32 hwip; > + u32 inst; > + u32 seg; > + u32 reg_offset; > +}; > + > +struct reg_pair { > + u32 offset; > + u32 value; > +}; > + > struct amdgpu_watchdog_timer { > bool timeout_fatal_disable; > uint32_t period; /* maxCycles = (1 << period), the number of > cycles before a timeout */ @@ -1152,6 +1164,10 @@ struct amdgpu_device { > booldebug_largebar; > booldebug_disable_soft_recovery; > booldebug_use_vram_fw_buf; > + > + /* IP register dump */ > + struct reg_pair *ip_dump; > + uint32_tnum_regs; > }; > > static inline uint32_t amdgpu_ip_version(const struct amdgpu_device > *adev, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c > index 1129e5e5fb42..2079f67c9fac 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c > @@ -261,6 +261,18 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, > size_t count, > drm_printf(&p, "Faulty page starting at address: 0x%016llx\n", > fault_info->addr); > drm_printf(&p, "Protection fault status register: 0x%x\n\n", > fault_info->status); > > + /* Add IP dump for each ip */ > + if (coredump->adev->ip_dump != NULL) { > + struct reg_pair *pair; > + > + pair = (struct reg_pair *)coredump->adev->ip_dump; > + drm_printf(&p, "IP register dump\n"); > + drm_printf(&p, "Offset \t Value\n"); > + for (int i = 0; i < coredump->adev->num_regs; i++) > + drm_printf(&p, "0x%04x \t 0x%08x\n", pair[i].offset, > pair[i].value); > + drm_printf(&p, "\n"); 

> + } > + > /* Add ring buffer information */ > drm_printf(&p, "Ring buffer information\n"); > for (int i = 0; i < coredump->adev->num_rings; i++) { @@ > -299,6 +311,11 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, > size_t count, > > static void amdgpu_devcoredump_free(void *data) > { > + struct amdgpu_coredump_info *temp = data; > + > + kfree(temp->adev->ip_dump); > + temp->adev->ip_dump = NULL; > + temp->adev->num_regs = 0; > kfree(data); > } > > @@ -337,6 +354,11 @@ void amdgpu_coredump(struct amdgpu_device *adev, > bool vram_lost, > > coredump->adev = adev; > > + /* Trigger ip dump here to capture the value of registers */ > + for (int i = 0; i < adev->num_ip_blocks; i++) > + if (adev->ip_blocks[i].version->funcs->dump_ip_state) > + adev->ip_blocks[i].version->funcs->dump_ip_state((void *)adev);
Re: [PATCH v2 2/2] drm/amdgpu: Add support of gfx10 register dump
On 4/12/2024 10:42 PM, Alex Deucher wrote: On Fri, Apr 12, 2024 at 1:05 PM Khatri, Sunil wrote: On 4/12/2024 8:50 PM, Alex Deucher wrote: On Fri, Apr 12, 2024 at 10:00 AM Sunil Khatri wrote: Adding initial set of registers for ipdump during devcoredump starting with gfx10 gc registers. ip dump is triggered when gpu reset happens via devcoredump and the memory is allocated by each ip and is freed once the dump is complete by devcoredump. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 16 +++ .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 22 +++ I would split this into two patches, one to add the core infrastructure in devcoredump and one to add gfx10 support. The core support could be squashed into patch 1 as well. Sure drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c| 127 +- .../include/asic_reg/gc/gc_10_1_0_offset.h| 12 ++ 4 files changed, 176 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 65c17c59c152..e173ad86a241 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -139,6 +139,18 @@ enum amdgpu_ss { AMDGPU_SS_DRV_UNLOAD }; +struct hwip_reg_entry { + u32 hwip; + u32 inst; + u32 seg; + u32 reg_offset; +}; + +struct reg_pair { + u32 offset; + u32 value; +}; + struct amdgpu_watchdog_timer { bool timeout_fatal_disable; uint32_t period; /* maxCycles = (1 << period), the number of cycles before a timeout */ @@ -1152,6 +1164,10 @@ struct amdgpu_device { booldebug_largebar; booldebug_disable_soft_recovery; booldebug_use_vram_fw_buf; + + /* IP register dump */ + struct reg_pair *ip_dump; + uint32_tnum_regs; }; static inline uint32_t amdgpu_ip_version(const struct amdgpu_device *adev, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c index 1129e5e5fb42..2079f67c9fac 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c @@ 
-261,6 +261,18 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, drm_printf(&p, "Faulty page starting at address: 0x%016llx\n", fault_info->addr); drm_printf(&p, "Protection fault status register: 0x%x\n\n", fault_info->status); + /* Add IP dump for each ip */ + if (coredump->adev->ip_dump != NULL) { + struct reg_pair *pair; + + pair = (struct reg_pair *)coredump->adev->ip_dump; + drm_printf(&p, "IP register dump\n"); + drm_printf(&p, "Offset \t Value\n"); + for (int i = 0; i < coredump->adev->num_regs; i++) + drm_printf(&p, "0x%04x \t 0x%08x\n", pair[i].offset, pair[i].value); + drm_printf(&p, "\n"); + } + /* Add ring buffer information */ drm_printf(&p, "Ring buffer information\n"); for (int i = 0; i < coredump->adev->num_rings; i++) { @@ -299,6 +311,11 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, static void amdgpu_devcoredump_free(void *data) { + struct amdgpu_coredump_info *temp = data; + + kfree(temp->adev->ip_dump); + temp->adev->ip_dump = NULL; + temp->adev->num_regs = 0; kfree(data); } @@ -337,6 +354,11 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, coredump->adev = adev; + /* Trigger ip dump here to capture the value of registers */ + for (int i = 0; i < adev->num_ip_blocks; i++) + if (adev->ip_blocks[i].version->funcs->dump_ip_state) + adev->ip_blocks[i].version->funcs->dump_ip_state((void *)adev); + This seems too complicated. I think it would be easier to This is how all other per IP functions are called. What do you suggest ?
ktime_get_ts64(&coredump->reset_time); dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_NOWAIT, diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c index a0bc4196ff8b..66e2915a8b4d 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c @@ -276,6 +276,99 @@ MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec.bin"); MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec2.bin"); MODULE_FIRMWARE("amdgpu/gc_10_3_7_rlc.bin"); +static const struct hwip_reg_entry gc_reg_list_10_1[] = { + { SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS) }, + { SOC15_RE
Re: [PATCH v2 2/2] drm/amdgpu: Add support of gfx10 register dump
On 4/12/2024 8:50 PM, Alex Deucher wrote: On Fri, Apr 12, 2024 at 10:00 AM Sunil Khatri wrote: Adding initial set of registers for ipdump during devcoredump starting with gfx10 gc registers. ip dump is triggered when gpu reset happens via devcoredump and the memory is allocated by each ip and is freed once the dump is complete by devcoredump. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 16 +++ .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 22 +++ I would split this into two patches, one to add the core infrastructure in devcoredump and one to add gfx10 support. The core support could be squashed into patch 1 as well. Sure drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c| 127 +- .../include/asic_reg/gc/gc_10_1_0_offset.h| 12 ++ 4 files changed, 176 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 65c17c59c152..e173ad86a241 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -139,6 +139,18 @@ enum amdgpu_ss { AMDGPU_SS_DRV_UNLOAD }; +struct hwip_reg_entry { + u32 hwip; + u32 inst; + u32 seg; + u32 reg_offset; +}; + +struct reg_pair { + u32 offset; + u32 value; +}; + struct amdgpu_watchdog_timer { bool timeout_fatal_disable; uint32_t period; /* maxCycles = (1 << period), the number of cycles before a timeout */ @@ -1152,6 +1164,10 @@ struct amdgpu_device { booldebug_largebar; booldebug_disable_soft_recovery; booldebug_use_vram_fw_buf; + + /* IP register dump */ + struct reg_pair *ip_dump; + uint32_tnum_regs; }; static inline uint32_t amdgpu_ip_version(const struct amdgpu_device *adev, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c index 1129e5e5fb42..2079f67c9fac 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c @@ -261,6 +261,18 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, 
drm_printf(&p, "Faulty page starting at address: 0x%016llx\n", fault_info->addr); drm_printf(&p, "Protection fault status register: 0x%x\n\n", fault_info->status); + /* Add IP dump for each ip */ + if (coredump->adev->ip_dump != NULL) { + struct reg_pair *pair; + + pair = (struct reg_pair *)coredump->adev->ip_dump; + drm_printf(&p, "IP register dump\n"); + drm_printf(&p, "Offset \t Value\n"); + for (int i = 0; i < coredump->adev->num_regs; i++) + drm_printf(&p, "0x%04x \t 0x%08x\n", pair[i].offset, pair[i].value); + drm_printf(&p, "\n"); + } + /* Add ring buffer information */ drm_printf(&p, "Ring buffer information\n"); for (int i = 0; i < coredump->adev->num_rings; i++) { @@ -299,6 +311,11 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, static void amdgpu_devcoredump_free(void *data) { + struct amdgpu_coredump_info *temp = data; + + kfree(temp->adev->ip_dump); + temp->adev->ip_dump = NULL; + temp->adev->num_regs = 0; kfree(data); } @@ -337,6 +354,11 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, coredump->adev = adev; + /* Trigger ip dump here to capture the value of registers */ + for (int i = 0; i < adev->num_ip_blocks; i++) + if (adev->ip_blocks[i].version->funcs->dump_ip_state) + adev->ip_blocks[i].version->funcs->dump_ip_state((void *)adev); + This seems too complicated. I think it would be easier to This is how all other per IP functions are called. What do you suggest ?
ktime_get_ts64(&coredump->reset_time); dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_NOWAIT, diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c index a0bc4196ff8b..66e2915a8b4d 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c @@ -276,6 +276,99 @@ MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec.bin"); MODULE_FIRMWARE("amdgpu/gc_10_3_7_mec2.bin"); MODULE_FIRMWARE("amdgpu/gc_10_3_7_rlc.bin"); +static const struct hwip_reg_entry gc_reg_list_10_1[] = { + { SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS) }, + { SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS2) }, + { SOC15_REG_ENTRY(GC, 0, mmGRBM_STATUS3) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT1) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_STALLED_STAT2) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CPC_STALLED_STAT1) }, + { SOC15_REG_ENTRY(GC, 0, mmCP_CPF_STALLED_STAT1) }, + {
Re: [PATCH v2 2/2] drm/amdgpu: Add support of gfx10 register dump
On 4/12/2024 8:50 PM, Alex Deucher wrote: I would split this into two patches, one to add the core infrastructure in devcoredump and one to add gfx10 support. The core support could be squashed into patch 1 as well. Sure would push the v3 with the changes. Regards Sunil
RE: [PATCH 0/2] First set in IP dump patches
[AMD Official Use Only - General] Ignore the series sent by mistake -Original Message- From: Sunil Khatri Sent: Friday, April 12, 2024 2:30 PM To: Deucher, Alexander ; Koenig, Christian Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil Subject: [PATCH 0/2] First set in IP dump patches Adding infrastructure needed for ipdump along with dumping gfx10 registers. Sunil Khatri (2): drm/amdgpu: add prototype to dump ip state drm/amdgpu: Add support of gfx10 register dump drivers/gpu/drm/amd/amdgpu/amdgpu.h | 16 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_acp.c | 1 + .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 22 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_umsch_mm.c | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 1 + drivers/gpu/drm/amd/amdgpu/cik.c | 1 + drivers/gpu/drm/amd/amdgpu/cik_ih.c | 1 + drivers/gpu/drm/amd/amdgpu/cik_sdma.c | 1 + drivers/gpu/drm/amd/amdgpu/cz_ih.c| 1 + drivers/gpu/drm/amd/amdgpu/dce_v10_0.c| 1 + drivers/gpu/drm/amd/amdgpu/dce_v11_0.c| 1 + drivers/gpu/drm/amd/amdgpu/dce_v6_0.c | 1 + drivers/gpu/drm/amd/amdgpu/dce_v8_0.c | 1 + drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c| 142 ++ drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c| 1 + drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c | 1 + drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c | 1 + drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c | 1 + drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 1 + drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 1 + drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c | 1 + drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c | 1 + drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 1 + drivers/gpu/drm/amd/amdgpu/iceland_ih.c | 1 + drivers/gpu/drm/amd/amdgpu/ih_v6_0.c | 1 + drivers/gpu/drm/amd/amdgpu/ih_v6_1.c | 1 + drivers/gpu/drm/amd/amdgpu/ih_v7_0.c | 1 + drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c| 1 + drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c| 2 + drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c| 1 + drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c| 1 + drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 1 + drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_5.c | 1 + drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c | 1 + 
drivers/gpu/drm/amd/amdgpu/mes_v10_1.c| 1 + drivers/gpu/drm/amd/amdgpu/mes_v11_0.c| 1 + drivers/gpu/drm/amd/amdgpu/navi10_ih.c| 1 + drivers/gpu/drm/amd/amdgpu/nv.c | 1 + drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c| 1 + drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c| 1 + drivers/gpu/drm/amd/amdgpu/si.c | 1 + drivers/gpu/drm/amd/amdgpu/si_dma.c | 1 + drivers/gpu/drm/amd/amdgpu/si_ih.c| 1 + drivers/gpu/drm/amd/amdgpu/soc15.c| 1 + drivers/gpu/drm/amd/amdgpu/soc21.c| 1 + drivers/gpu/drm/amd/amdgpu/tonga_ih.c | 1 + drivers/gpu/drm/amd/amdgpu/uvd_v3_1.c | 1 + drivers/gpu/drm/amd/amdgpu/uvd_v4_2.c | 1 + drivers/gpu/drm/amd/amdgpu/uvd_v5_0.c | 1 + drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c | 1 + drivers/gpu/drm/amd/amdgpu/vce_v2_0.c | 1 + drivers/gpu/drm/amd/amdgpu/vce_v3_0.c | 1 + drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c | 1 + drivers/gpu/drm/amd/amdgpu/vcn_v2_0.c | 1 + drivers/gpu/drm/amd/amdgpu/vcn_v2_5.c | 2 + drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 1 + drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c | 1 + drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 1 + drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c | 1 + drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 1 + drivers/gpu/drm/amd/amdgpu/vi.c | 1 + .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 1 + drivers/gpu/drm/amd/include/amd_shared.h | 1 + drivers/gpu/drm/amd/pm/legacy-dpm/kv_dpm.c| 1 + drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c| 1 + .../gpu/drm/amd/pm/powerplay/amd_powerplay.c | 1 + 66 files changed, 245 insertions(+) -- 2.34.1
RE: [PATCH 2/2] drm/amdgpu: Add support of gfx10 register dump
[AMD Official Use Only - General] Ignore sent by mistake. -Original Message- From: Sunil Khatri Sent: Friday, April 12, 2024 2:30 PM To: Deucher, Alexander ; Koenig, Christian Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil Subject: [PATCH 2/2] drm/amdgpu: Add support of gfx10 register dump Adding initial set of registers for ipdump during devcoredump starting with gfx10 gc registers. ip dump is triggered when gpu reset happens via devcoredump and the memory is allocated by each ip and is freed once the dump is complete by devcoredump. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 16 ++ .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 22 +++ drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c| 143 +- 3 files changed, 180 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 65c17c59c152..e173ad86a241 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -139,6 +139,18 @@ enum amdgpu_ss { AMDGPU_SS_DRV_UNLOAD }; +struct hwip_reg_entry { + u32 hwip; + u32 inst; + u32 seg; + u32 reg_offset; +}; + +struct reg_pair { + u32 offset; + u32 value; +}; + struct amdgpu_watchdog_timer { bool timeout_fatal_disable; uint32_t period; /* maxCycles = (1 << period), the number of cycles before a timeout */ @@ -1152,6 +1164,10 @@ struct amdgpu_device { booldebug_largebar; booldebug_disable_soft_recovery; booldebug_use_vram_fw_buf; + + /* IP register dump */ + struct reg_pair *ip_dump; + uint32_tnum_regs; }; static inline uint32_t amdgpu_ip_version(const struct amdgpu_device *adev, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c index 1129e5e5fb42..2079f67c9fac 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c @@ -261,6 +261,18 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, drm_printf(, "Faulty page starting at 
address: 0x%016llx\n", fault_info->addr); drm_printf(&p, "Protection fault status register: 0x%x\n\n", fault_info->status); + /* Add IP dump for each ip */ + if (coredump->adev->ip_dump != NULL) { + struct reg_pair *pair; + + pair = (struct reg_pair *)coredump->adev->ip_dump; + drm_printf(&p, "IP register dump\n"); + drm_printf(&p, "Offset \t Value\n"); + for (int i = 0; i < coredump->adev->num_regs; i++) + drm_printf(&p, "0x%04x \t 0x%08x\n", pair[i].offset, pair[i].value); + drm_printf(&p, "\n"); + } + /* Add ring buffer information */ drm_printf(&p, "Ring buffer information\n"); for (int i = 0; i < coredump->adev->num_rings; i++) { @@ -299,6 +311,11 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, static void amdgpu_devcoredump_free(void *data) { + struct amdgpu_coredump_info *temp = data; + + kfree(temp->adev->ip_dump); + temp->adev->ip_dump = NULL; + temp->adev->num_regs = 0; kfree(data); } @@ -337,6 +354,11 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, coredump->adev = adev; + /* Trigger ip dump here to capture the value of registers */ + for (int i = 0; i < adev->num_ip_blocks; i++) + if (adev->ip_blocks[i].version->funcs->dump_ip_state) + adev->ip_blocks[i].version->funcs->dump_ip_state((void *)adev); + ktime_get_ts64(&coredump->reset_time); dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_NOWAIT, diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c index a0bc4196ff8b..05c4b1d62132 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c @@ -47,6 +47,22 @@ #include "gfx_v10_0.h" #include "nbio_v2_3.h" +/* + * Manually adding some of the missing gfx10 registers from spec */ +#define mmCP_DEBUG_BASE_IDX0 +#define mmCP_DEBUG 0x1e1f +#define mmCP_MES_DEBUG_INTERRUPT_INSTR_PNTR_BASE_IDX 1 +#define mmCP_MES_DEBUG_INTERRUPT_INSTR_PNTR0x2840 +#define mmRLC_GPM_DEBUG_INST_A_BASE_IDX1 +#define mmRLC_GPM_DEBUG_INST_A 0x4c22 +#define 
mmRLC_GPM_DEBUG_INST_B_BASE_IDX1 +#define mmRLC_GPM_DEBUG_INST_B 0x
Re: [PATCH] drm/amdgpu: add IP's FW information to devcoredump
On 3/28/2024 8:38 AM, Alex Deucher wrote: On Tue, Mar 26, 2024 at 1:31 PM Sunil Khatri wrote: Add FW information of all the IP's in the devcoredump. Signed-off-by: Sunil Khatri Might want to include the vbios version info as well, e.g., atom_context->name atom_context->vbios_pn atom_context->vbios_ver_str atom_context->date Sure i will add those parameters too. Regards Sunil Either way, Reviewed-by: Alex Deucher --- .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 122 ++ 1 file changed, 122 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c index 44c5da8aa9ce..d598b6520ec9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c @@ -69,6 +69,124 @@ const char *hw_ip_names[MAX_HWIP] = { [PCIE_HWIP] = "PCIE", }; +static void amdgpu_devcoredump_fw_info(struct amdgpu_device *adev, + struct drm_printer *p) +{ + uint32_t version; + uint32_t feature; + uint8_t smu_program, smu_major, smu_minor, smu_debug; + + drm_printf(p, "VCE feature version: %u, fw version: 0x%08x\n", + adev->vce.fb_version, adev->vce.fw_version); + drm_printf(p, "UVD feature version: %u, fw version: 0x%08x\n", 0, + adev->uvd.fw_version); + drm_printf(p, "GMC feature version: %u, fw version: 0x%08x\n", 0, + adev->gmc.fw_version); + drm_printf(p, "ME feature version: %u, fw version: 0x%08x\n", + adev->gfx.me_feature_version, adev->gfx.me_fw_version); + drm_printf(p, "PFP feature version: %u, fw version: 0x%08x\n", + adev->gfx.pfp_feature_version, adev->gfx.pfp_fw_version); + drm_printf(p, "CE feature version: %u, fw version: 0x%08x\n", + adev->gfx.ce_feature_version, adev->gfx.ce_fw_version); + drm_printf(p, "RLC feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_feature_version, adev->gfx.rlc_fw_version); + + drm_printf(p, "RLC SRLC feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_srlc_feature_version, + adev->gfx.rlc_srlc_fw_version); + drm_printf(p, 
"RLC SRLG feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_srlg_feature_version, + adev->gfx.rlc_srlg_fw_version); + drm_printf(p, "RLC SRLS feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_srls_feature_version, + adev->gfx.rlc_srls_fw_version); + drm_printf(p, "RLCP feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlcp_ucode_feature_version, + adev->gfx.rlcp_ucode_version); + drm_printf(p, "RLCV feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlcv_ucode_feature_version, + adev->gfx.rlcv_ucode_version); + drm_printf(p, "MEC feature version: %u, fw version: 0x%08x\n", + adev->gfx.mec_feature_version, adev->gfx.mec_fw_version); + + if (adev->gfx.mec2_fw) + drm_printf(p, "MEC2 feature version: %u, fw version: 0x%08x\n", + adev->gfx.mec2_feature_version, + adev->gfx.mec2_fw_version); + + drm_printf(p, "IMU feature version: %u, fw version: 0x%08x\n", 0, + adev->gfx.imu_fw_version); + drm_printf(p, "PSP SOS feature version: %u, fw version: 0x%08x\n", + adev->psp.sos.feature_version, adev->psp.sos.fw_version); + drm_printf(p, "PSP ASD feature version: %u, fw version: 0x%08x\n", + adev->psp.asd_context.bin_desc.feature_version, + adev->psp.asd_context.bin_desc.fw_version); + + drm_printf(p, "TA XGMI feature version: 0x%08x, fw version: 0x%08x\n", + adev->psp.xgmi_context.context.bin_desc.feature_version, + adev->psp.xgmi_context.context.bin_desc.fw_version); + drm_printf(p, "TA RAS feature version: 0x%08x, fw version: 0x%08x\n", + adev->psp.ras_context.context.bin_desc.feature_version, + adev->psp.ras_context.context.bin_desc.fw_version); + drm_printf(p, "TA HDCP feature version: 0x%08x, fw version: 0x%08x\n", + adev->psp.hdcp_context.context.bin_desc.feature_version, + adev->psp.hdcp_context.context.bin_desc.fw_version); + drm_printf(p, "TA DTM feature version: 0x%08x, fw version: 0x%08x\n", + adev->psp.dtm_context.context.bin_desc.feature_version, + adev->psp.dtm_context.context.bin_desc.fw_version); + drm_printf(p, "TA RAP feature 
version: 0x%08x, fw version: 0x%08x\n", + adev->psp.rap_context.context.bin_desc.feature_version, +
Re: [PATCH] drm/amdgpu: add support of bios dump in devcoredump
On 3/26/2024 10:23 PM, Alex Deucher wrote: On Tue, Mar 26, 2024 at 10:38 AM Sunil Khatri wrote: dump the bios binary in the devcoredump. Signed-off-by: Sunil Khatri --- .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 20 +++ 1 file changed, 20 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c index 44c5da8aa9ce..f33963d777eb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c @@ -132,6 +132,26 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, drm_printf(&p, "Faulty page starting at address: 0x%016llx\n", fault_info->addr); drm_printf(&p, "Protection fault status register: 0x%x\n\n", fault_info->status); + /* Dump BIOS */ + if (coredump->adev->bios && coredump->adev->bios_size) { + int i = 0; + + drm_printf(&p, "BIOS Binary dump\n"); + drm_printf(&p, "Valid BIOS Size:%d bytes type:%s\n", + coredump->adev->bios_size, + coredump->adev->is_atom_fw ? + "Atom bios":"Non Atom Bios"); + + while (i < coredump->adev->bios_size) { + /* Printing 15 bytes in a line */ + if (i % 15 == 0) + drm_printf(&p, "\n"); + drm_printf(&p, "0x%x \t", coredump->adev->bios[i]); + i++; + } + drm_printf(&p, "\n"); + } I don't think it's too useful to dump this as text. I was hoping it could be a binary. I guess, we can just get this from debugfs if we need it if a binary is not possible. Yes, this dumps in text format only and the binary is already available in debugfs. So discarding the patch. Alex + /* Add ring buffer information */ drm_printf(&p, "Ring buffer information\n"); for (int i = 0; i < coredump->adev->num_rings; i++) { -- 2.34.1
RE: [PATCH v2] drm/amdgpu: refactor code to reuse system information
[AMD Official Use Only - General] Ignore this as I have send v3. -Original Message- From: Sunil Khatri Sent: Tuesday, March 19, 2024 8:41 PM To: Deucher, Alexander ; Koenig, Christian ; Sharma, Shashank Cc: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org; linux-ker...@vger.kernel.org; Zhang, Hawking ; Kuehling, Felix ; Lazar, Lijo ; Khatri, Sunil Subject: [PATCH v2] drm/amdgpu: refactor code to reuse system information Refactor the code so debugfs and devcoredump can reuse the common information and avoid unnecessary copy of it. created a new file which would be the right place to hold functions which will be used between ioctl, debugfs and devcoredump. Cc: Christian König Cc: Alex Deucher Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/Makefile | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c | 146 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.h | 33 + drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 117 +-- 4 files changed, 182 insertions(+), 116 deletions(-) create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.h diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile b/drivers/gpu/drm/amd/amdgpu/Makefile index 4536c8ad0e11..2c5c800c1ed6 100644 --- a/drivers/gpu/drm/amd/amdgpu/Makefile +++ b/drivers/gpu/drm/amd/amdgpu/Makefile @@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \ amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \ amdgpu_fw_attestation.o amdgpu_securedisplay.o \ amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \ - amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o + amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o +amdgpu_coreinfo.o amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c new file mode 100644 index ..597fc9d432ce --- /dev/null +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_coreinfo.c @@ 
-0,0 +1,146 @@ +// SPDX-License-Identifier: MIT +/* + * Copyright 2024 Advanced Micro Devices, Inc. + * + * Permission is hereby granted, free of charge, to any person +obtaining a + * copy of this software and associated documentation files (the +"Software"), + * to deal in the Software without restriction, including without +limitation + * the rights to use, copy, modify, merge, publish, distribute, +sublicense, + * and/or sell copies of the Software, and to permit persons to whom +the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be +included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT +SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, +DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR +OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE +OR + * OTHER DEALINGS IN THE SOFTWARE. 
+ * + */ + +#include "amdgpu_coreinfo.h" +#include "amd_pcie.h" + + +void amdgpu_coreinfo_devinfo(struct amdgpu_device *adev, struct +drm_amdgpu_info_device *dev_info) { + int ret; + uint64_t vm_size; + uint32_t pcie_gen_mask; + + dev_info->device_id = adev->pdev->device; + dev_info->chip_rev = adev->rev_id; + dev_info->external_rev = adev->external_rev_id; + dev_info->pci_rev = adev->pdev->revision; + dev_info->family = adev->family; + dev_info->num_shader_engines = adev->gfx.config.max_shader_engines; + dev_info->num_shader_arrays_per_engine = adev->gfx.config.max_sh_per_se; + /* return all clocks in KHz */ + dev_info->gpu_counter_freq = amdgpu_asic_get_xclk(adev) * 10; + if (adev->pm.dpm_enabled) { + dev_info->max_engine_clock = amdgpu_dpm_get_sclk(adev, false) * 10; + dev_info->max_memory_clock = amdgpu_dpm_get_mclk(adev, false) * 10; + dev_info->min_engine_clock = amdgpu_dpm_get_sclk(adev, true) * 10; + dev_info->min_memory_clock = amdgpu_dpm_get_mclk(adev, true) * 10; + } else { + dev_info->max_engine_clock = + dev_info->min_engine_clock = + adev->clock.default_sclk * 10; + dev_info->max_memory_clock = + dev_info->min_memory_clock = + adev->clock.default_mclk * 10; + } + dev_info->enab
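The "/* return all clocks in KHz */" comment in the patch above reflects that amdgpu reports clocks internally in units of 10 KHz, so every value is multiplied by 10 before being returned. A trivial stand-alone sketch of that conversion (the helper name is illustrative, not from the driver):

```c
#include <assert.h>
#include <stdint.h>

/* amdgpu clock getters (xclk, sclk, mclk) return values in 10 KHz
 * units; the devinfo path multiplies by 10 to report KHz. */
static uint32_t clk_10khz_to_khz(uint32_t v_10khz)
{
	return v_10khz * 10;
}
```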
Re: [PATCH] drm/amdgpu: refactor code to reuse system information
Sent a new patch based on discussion with Alex. On 3/19/2024 8:34 PM, Christian König wrote: Am 19.03.24 um 15:59 schrieb Alex Deucher: On Tue, Mar 19, 2024 at 10:56 AM Christian König wrote: Am 19.03.24 um 15:26 schrieb Alex Deucher: On Tue, Mar 19, 2024 at 8:32 AM Sunil Khatri wrote: Refactor the code so debugfs and devcoredump can reuse the common information and avoid unnecessary copy of it. created a new file which would be the right place to hold functions which will be used between sysfs, debugfs and devcoredump. Cc: Christian König Cc: Alex Deucher Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/Makefile | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +-- 4 files changed, 157 insertions(+), 115 deletions(-) create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile b/drivers/gpu/drm/amd/amdgpu/Makefile index 4536c8ad0e11..05d34f4b18f5 100644 --- a/drivers/gpu/drm/amd/amdgpu/Makefile +++ b/drivers/gpu/drm/amd/amdgpu/Makefile @@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \ amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \ amdgpu_fw_attestation.o amdgpu_securedisplay.o \ amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \ - amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o + amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o amdgpu_devinfo.o amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 9c62552bec34..0267870aa9b1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1609,4 +1609,5 @@ extern const struct attribute_group amdgpu_vram_mgr_attr_group; extern const struct attribute_group amdgpu_gtt_mgr_attr_group; extern const struct attribute_group amdgpu_flash_attr_group; +int amdgpu_device_info(struct 
amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info); #endif diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c new file mode 100644 index ..d2c15a1dcb0d --- /dev/null +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c @@ -0,0 +1,151 @@ +// SPDX-License-Identifier: MIT +/* + * Copyright 2024 Advanced Micro Devices, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. + * + */ + +#include "amdgpu.h" +#include "amd_pcie.h" + +#include + +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info) We can probably keep this in amdgpu_kms.c unless that file is getting too big. I don't think it warrants a new file at this point. If you do keep it in amdgpu_kms.c, I'd recommend renaming it to something like amdgpu_kms_device_info() to keep the naming conventions. We should not be using this for anything new in the first place. 
A whole bunch of the stuff inside the devinfo structure has been deprecated because we found that putting everything into one structure was a bad idea. It's a convenient way to collect a lot of useful information that we want in the core dump. Plus it's not going anywhere because we need to keep compatibility in the IOCTL. Yeah, and exactly that is what I'm strictly against. The devinfo wasn't used for new stuff because we found that it is way too inflexible. That's why we have multiple separate IOCTLs for the memory and firmware information, for example. We should really *not* reuse that for the device core dumping. Rather, just use the same information from the different IPs and subsystems directly. E.g. add a function to the VM, GFX etc. for printing out devcoredump infos. I have pushed new v2 based
Re: [PATCH v2] drm/amdgpu: refactor code to reuse system information
On 3/19/2024 8:07 PM, Christian König wrote: Am 19.03.24 um 15:25 schrieb Sunil Khatri: Refactor the code so debugfs and devcoredump can reuse the common information and avoid unnecessary copy of it. created a new file which would be the right place to hold functions which will be used between ioctl, debugfs and devcoredump. Ok, taking a closer look that is certainly not a good idea. The devinfo structure was just created because somebody thought that mixing all that stuff into one structure would be a good idea. We have pretty much deprecated that approach and should *really* not change anything here any more. To support the ioctl we are keeping that information same without changing it. The intent to add a new file is because we will have more information coming in this new file. Next in line is firmware information which is again a huge function with lot of information and to use that information in devcoredump and ioctl and sysfs the new file seems to be right idea after some discussions. FYI: this will not be just one function in new file but more to come so code can be reused without copying it. Regards, Christian. 
Cc: Christian König Cc: Alex Deucher Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/Makefile | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +-- 4 files changed, 157 insertions(+), 115 deletions(-) create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile b/drivers/gpu/drm/amd/amdgpu/Makefile index 4536c8ad0e11..05d34f4b18f5 100644 --- a/drivers/gpu/drm/amd/amdgpu/Makefile +++ b/drivers/gpu/drm/amd/amdgpu/Makefile @@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \ amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \ amdgpu_fw_attestation.o amdgpu_securedisplay.o \ amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \ - amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o + amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o amdgpu_devinfo.o amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 9c62552bec34..0267870aa9b1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1609,4 +1609,5 @@ extern const struct attribute_group amdgpu_vram_mgr_attr_group; extern const struct attribute_group amdgpu_gtt_mgr_attr_group; extern const struct attribute_group amdgpu_flash_attr_group; +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info); #endif diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c new file mode 100644 index ..fdcbc1984031 --- /dev/null +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c @@ -0,0 +1,151 @@ +// SPDX-License-Identifier: MIT +/* + * Copyright 2024 Advanced Micro Devices, Inc. 
+ * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. + * + */ + +#include "amdgpu.h" +#include "amd_pcie.h" + +#include + +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info) +{ + int ret; + uint64_t vm_size; + uint32_t pcie_gen_mask; + + if (dev_info == NULL) + return -EINVAL; + + dev_info->device_id = adev->pdev->device; + dev_info->chip_rev = adev->rev_id; + dev_info->external_rev = adev->external_rev_id; + dev_info->pci_rev = adev->pdev->revision; + dev_info->family = adev->family; + dev_info->num_shader_engines = adev->gfx.config.max_shader_engines; + dev_info->num_shader_arrays_per_engine = adev->gfx.config.max_sh_per_se; + /* return all clocks in KHz */ +
Re: [PATCH] drm/amdgpu: refactor code to reuse system information
On 3/19/2024 7:43 PM, Lazar, Lijo wrote: On 3/19/2024 7:27 PM, Khatri, Sunil wrote: On 3/19/2024 7:19 PM, Lazar, Lijo wrote: On 3/19/2024 6:02 PM, Sunil Khatri wrote: Refactor the code so debugfs and devcoredump can reuse the common information and avoid unnecessary copy of it. created a new file which would be the right place to hold functions which will be used between sysfs, debugfs and devcoredump. Cc: Christian König Cc: Alex Deucher Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/Makefile | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +-- 4 files changed, 157 insertions(+), 115 deletions(-) create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile b/drivers/gpu/drm/amd/amdgpu/Makefile index 4536c8ad0e11..05d34f4b18f5 100644 --- a/drivers/gpu/drm/amd/amdgpu/Makefile +++ b/drivers/gpu/drm/amd/amdgpu/Makefile @@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \ amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \ amdgpu_fw_attestation.o amdgpu_securedisplay.o \ amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \ - amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o + amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o amdgpu_devinfo.o amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 9c62552bec34..0267870aa9b1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1609,4 +1609,5 @@ extern const struct attribute_group amdgpu_vram_mgr_attr_group; extern const struct attribute_group amdgpu_gtt_mgr_attr_group; extern const struct attribute_group amdgpu_flash_attr_group; +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info); #endif diff --git 
a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c new file mode 100644 index ..d2c15a1dcb0d --- /dev/null +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c @@ -0,0 +1,151 @@ +// SPDX-License-Identifier: MIT +/* + * Copyright 2024 Advanced Micro Devices, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. 
+ * + */ + +#include "amdgpu.h" +#include "amd_pcie.h" + +#include + +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info) +{ + int ret; + uint64_t vm_size; + uint32_t pcie_gen_mask; + + if (dev_info == NULL) + return -EINVAL; + + dev_info->device_id = adev->pdev->device; + dev_info->chip_rev = adev->rev_id; + dev_info->external_rev = adev->external_rev_id; + dev_info->pci_rev = adev->pdev->revision; + dev_info->family = adev->family; + dev_info->num_shader_engines = adev->gfx.config.max_shader_engines; + dev_info->num_shader_arrays_per_engine = adev->gfx.config.max_sh_per_se; + /* return all clocks in KHz */ + dev_info->gpu_counter_freq = amdgpu_asic_get_xclk(adev) * 10; + if (adev->pm.dpm_enabled) { + dev_info->max_engine_clock = amdgpu_dpm_get_sclk(adev, false) * 10; + dev_info->max_memory_clock = amdgpu_dpm_get_mclk(adev, false) * 10; + dev_info->min_engine_clock = amdgpu_dpm_get_sclk(adev, true) * 10; + dev_info->min_memory_clock = amdgpu_dpm_get_mclk(adev, true) * 10; + } else { + dev_info->max_engine_clock = + dev_info->min_engine_clock = + adev->clock.default_sclk * 10; + dev_info->max_memory_clock = + dev_info->min_memory_clock = + adev->clock.default_mc
Re: [PATCH] drm/amdgpu: refactor code to reuse system information
On 3/19/2024 7:19 PM, Lazar, Lijo wrote: On 3/19/2024 6:02 PM, Sunil Khatri wrote: Refactor the code so debugfs and devcoredump can reuse the common information and avoid unnecessary copy of it. created a new file which would be the right place to hold functions which will be used between sysfs, debugfs and devcoredump. Cc: Christian König Cc: Alex Deucher Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/Makefile | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +-- 4 files changed, 157 insertions(+), 115 deletions(-) create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile b/drivers/gpu/drm/amd/amdgpu/Makefile index 4536c8ad0e11..05d34f4b18f5 100644 --- a/drivers/gpu/drm/amd/amdgpu/Makefile +++ b/drivers/gpu/drm/amd/amdgpu/Makefile @@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \ amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \ amdgpu_fw_attestation.o amdgpu_securedisplay.o \ amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \ - amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o + amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o amdgpu_devinfo.o amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 9c62552bec34..0267870aa9b1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1609,4 +1609,5 @@ extern const struct attribute_group amdgpu_vram_mgr_attr_group; extern const struct attribute_group amdgpu_gtt_mgr_attr_group; extern const struct attribute_group amdgpu_flash_attr_group; +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info); #endif diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c new file mode 100644 index 
..d2c15a1dcb0d --- /dev/null +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c @@ -0,0 +1,151 @@ +// SPDX-License-Identifier: MIT +/* + * Copyright 2024 Advanced Micro Devices, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. 
+ * + */ + +#include "amdgpu.h" +#include "amd_pcie.h" + +#include + +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info) +{ + int ret; + uint64_t vm_size; + uint32_t pcie_gen_mask; + + if (dev_info == NULL) + return -EINVAL; + + dev_info->device_id = adev->pdev->device; + dev_info->chip_rev = adev->rev_id; + dev_info->external_rev = adev->external_rev_id; + dev_info->pci_rev = adev->pdev->revision; + dev_info->family = adev->family; + dev_info->num_shader_engines = adev->gfx.config.max_shader_engines; + dev_info->num_shader_arrays_per_engine = adev->gfx.config.max_sh_per_se; + /* return all clocks in KHz */ + dev_info->gpu_counter_freq = amdgpu_asic_get_xclk(adev) * 10; + if (adev->pm.dpm_enabled) { + dev_info->max_engine_clock = amdgpu_dpm_get_sclk(adev, false) * 10; + dev_info->max_memory_clock = amdgpu_dpm_get_mclk(adev, false) * 10; + dev_info->min_engine_clock = amdgpu_dpm_get_sclk(adev, true) * 10; + dev_info->min_memory_clock = amdgpu_dpm_get_mclk(adev, true) * 10; + } else { + dev_info->max_engine_clock = + dev_info->min_engine_clock = + adev->clock.default_sclk * 10; + dev_info->max_memory_clock = + dev_info->min_memory_clock = + adev->clock.default_mclk * 10; + } +
Re: [PATCH] drm/amdgpu: refactor code to reuse system information
Validated the code by using the function in same way as ioctl would use in devcoredump and getting the valid values. Also this would be the container of the information that we need to share between ioctl, debugfs and devcoredump and keep updating this based on information needed. On 3/19/2024 6:02 PM, Sunil Khatri wrote: Refactor the code so debugfs and devcoredump can reuse the common information and avoid unnecessary copy of it. created a new file which would be the right place to hold functions which will be used between sysfs, debugfs and devcoredump. Cc: Christian König Cc: Alex Deucher Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/Makefile | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c | 151 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 118 +-- 4 files changed, 157 insertions(+), 115 deletions(-) create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile b/drivers/gpu/drm/amd/amdgpu/Makefile index 4536c8ad0e11..05d34f4b18f5 100644 --- a/drivers/gpu/drm/amd/amdgpu/Makefile +++ b/drivers/gpu/drm/amd/amdgpu/Makefile @@ -80,7 +80,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \ amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \ amdgpu_fw_attestation.o amdgpu_securedisplay.o \ amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \ - amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o + amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o amdgpu_devinfo.o amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 9c62552bec34..0267870aa9b1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1609,4 +1609,5 @@ extern const struct attribute_group amdgpu_vram_mgr_attr_group; extern const struct attribute_group amdgpu_gtt_mgr_attr_group; extern const struct attribute_group 
amdgpu_flash_attr_group; +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info); #endif diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c new file mode 100644 index ..d2c15a1dcb0d --- /dev/null +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devinfo.c @@ -0,0 +1,151 @@ +// SPDX-License-Identifier: MIT +/* + * Copyright 2024 Advanced Micro Devices, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. 
+ * + */ + +#include "amdgpu.h" +#include "amd_pcie.h" + +#include + +int amdgpu_device_info(struct amdgpu_device *adev, struct drm_amdgpu_info_device *dev_info) +{ + int ret; + uint64_t vm_size; + uint32_t pcie_gen_mask; + + if (dev_info == NULL) + return -EINVAL; + + dev_info->device_id = adev->pdev->device; + dev_info->chip_rev = adev->rev_id; + dev_info->external_rev = adev->external_rev_id; + dev_info->pci_rev = adev->pdev->revision; + dev_info->family = adev->family; + dev_info->num_shader_engines = adev->gfx.config.max_shader_engines; + dev_info->num_shader_arrays_per_engine = adev->gfx.config.max_sh_per_se; + /* return all clocks in KHz */ + dev_info->gpu_counter_freq = amdgpu_asic_get_xclk(adev) * 10; + if (adev->pm.dpm_enabled) { + dev_info->max_engine_clock = amdgpu_dpm_get_sclk(adev, false) * 10; + dev_info->max_memory_clock = amdgpu_dpm_get_mclk(adev, false) * 10; + dev_info->min_engine_clock = amdgpu_dpm_get_sclk(adev, true) * 10; + dev_info->min_memory_clock = amdgpu_dpm_get_mclk(adev, true) * 10; + } else { + dev_info->max_engine_clock = + dev_info->min_engine_clock = +
RE: [bug report] drm/amdgpu: add ring buffer information in devcoredump
[AMD Official Use Only - General] Got it. Thanks for reporting that. Sent the patch for review. Regards Sunil khatri -Original Message- From: Dan Carpenter Sent: Saturday, March 16, 2024 2:42 PM To: Khatri, Sunil Cc: Khatri, Sunil ; Koenig, Christian ; Deucher, Alexander ; amd-gfx@lists.freedesktop.org Subject: Re: [bug report] drm/amdgpu: add ring buffer information in devcoredump The static checker is just complaining about NULL checking that doesn't make sense. It raises the question: can the pointer be NULL or not? Based on your comments and from reviewing the code, I do not think it can be NULL. Thus the correct thing is to remove the unnecessary NULL check. regards, dan carpenter
Re: [bug report] drm/amdgpu: add ring buffer information in devcoredump
Thanks for pointing these out. I do have some doubts and I raised them inline. On 3/15/2024 8:46 PM, Dan Carpenter wrote: Hello Sunil Khatri, Commit 42742cc541bb ("drm/amdgpu: add ring buffer information in devcoredump") from Mar 11, 2024 (linux-next), leads to the following Smatch static checker warning: drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c:219 amdgpu_devcoredump_read() error: we previously assumed 'coredump->adev' could be null (see line 206) drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 171 static ssize_t 172 amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, 173 void *data, size_t datalen) 174 { 175 struct drm_printer p; 176 struct amdgpu_coredump_info *coredump = data; 177 struct drm_print_iterator iter; 178 int i; 179 180 iter.data = buffer; 181 iter.offset = 0; 182 iter.start = offset; 183 iter.remain = count; 184 185 p = drm_coredump_printer(&iter); 186 187 drm_printf(&p, " AMDGPU Device Coredump \n"); 188 drm_printf(&p, "version: " AMDGPU_COREDUMP_VERSION "\n"); 189 drm_printf(&p, "kernel: " UTS_RELEASE "\n"); 190 drm_printf(&p, "module: " KBUILD_MODNAME "\n"); 191 drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec, 192 coredump->reset_time.tv_nsec); 193 194 if (coredump->reset_task_info.pid) 195 drm_printf(&p, "process_name: %s PID: %d\n", 196 coredump->reset_task_info.process_name, 197 coredump->reset_task_info.pid); 198 199 if (coredump->ring) { 200 drm_printf(&p, "\nRing timed out details\n"); 201 drm_printf(&p, "IP Type: %d Ring Name: %s\n", 202 coredump->ring->funcs->type, 203 coredump->ring->name); 204 } 205 206 if (coredump->adev) { ^^ Check for NULL This is the check for NULL. Is there any issue here? 207 struct amdgpu_vm_fault_info *fault_info = 208 &coredump->adev->vm_manager.fault_info; 209 210 drm_printf(&p, "\n[%s] Page fault observed\n", 211 fault_info->vmhub ? 
"mmhub" : "gfxhub"); 212 drm_printf(&p, "Faulty page starting at address: 0x%016llx\n", 213 fault_info->addr); 214 drm_printf(&p, "Protection fault status register: 0x%x\n\n", 215 fault_info->status); 216 } 217 218 drm_printf(&p, "Ring buffer information\n"); --> 219 for (int i = 0; i < coredump->adev->num_rings; i++) { ^^ Unchecked dereference Agree 220 int j = 0; 221 struct amdgpu_ring *ring = coredump->adev->rings[i]; 222 223 drm_printf(&p, "ring name: %s\n", ring->name); 224 drm_printf(&p, "Rptr: 0x%llx Wptr: 0x%llx RB mask: %x\n", 225 amdgpu_ring_get_rptr(ring), 226 amdgpu_ring_get_wptr(ring), 227 ring->buf_mask); 228 drm_printf(&p, "Ring size in dwords: %d\n", 229 ring->ring_size / 4); 230 drm_printf(&p, "Ring contents\n"); 231 drm_printf(&p, "Offset \t Value\n"); 232 233 while (j < ring->ring_size) { 234 drm_printf(&p, "0x%x \t 0x%x\n", j, ring->ring[j/4]); 235 j += 4; 236 } 237 } 238 239 if (coredump->reset_vram_lost) 240 drm_printf(&p, "VRAM is lost due to GPU reset!\n"); 241 if (coredump->adev->reset_info.num_regs) { ^^ Here too Agree. 242 drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n"); 243 244 for (i = 0; i < coredump->adev->reset_info.num_regs; i++) 245 drm_printf(&p, "0x%08x: 0x%08x\n", 246 coredump->adev->reset_info.reset_dump_reg_list[i], 247 coredump->adev->reset_info.reset_dump_reg_value[i]); 248 } 249 250 return count - iter.remain; 251 } Although adev is a global structure and never in the code is it being checked for NULL, as it won't be NULL until the driver is unloaded. I can add a check for adev in the beginning of the function amdgpu_devcoredump_read for
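The ring-contents loop flagged above steps a byte offset `j` in increments of 4 while indexing the ring as an array of 32-bit dwords (`ring->ring[j/4]`). A stand-alone sketch of just that formatting (hypothetical helper writing into a buffer; the kernel code uses drm_printf against the coredump printer instead):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Emit "offset \t value" lines for a ring of 32-bit dwords, the way
 * the devcoredump ring loop does: the offset advances in bytes, and
 * each dword is fetched with offset/4. */
static size_t dump_ring(char *out, size_t outsz,
			const uint32_t *ring, unsigned int ring_size_bytes)
{
	size_t pos = 0;
	unsigned int j = 0;

	while (j < ring_size_bytes) {
		pos += snprintf(out + pos, outsz - pos,
				"0x%x \t 0x%x\n", j, ring[j / 4]);
		j += 4;	/* one dword per line */
	}
	return pos;
}
```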
Re: [PATCH] drm/amdgpu: add the hw_ip version of all IP's
On 3/15/2024 6:45 PM, Alex Deucher wrote:
On Fri, Mar 15, 2024 at 8:13 AM Sunil Khatri wrote:

Add all the IP's version information on a SOC to the devcoredump.

Signed-off-by: Sunil Khatri

This looks great. Reviewed-by: Alex Deucher

Thanks Alex

---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 62 +++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index a0dbccad2f53..3d4bfe0a5a7c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -29,6 +29,43 @@
 #include "sienna_cichlid.h"
 #include "smu_v13_0_10.h"

+const char *hw_ip_names[MAX_HWIP] = {
+	[GC_HWIP]	= "GC",
+	[HDP_HWIP]	= "HDP",
+	[SDMA0_HWIP]	= "SDMA0",
+	[SDMA1_HWIP]	= "SDMA1",
+	[SDMA2_HWIP]	= "SDMA2",
+	[SDMA3_HWIP]	= "SDMA3",
+	[SDMA4_HWIP]	= "SDMA4",
+	[SDMA5_HWIP]	= "SDMA5",
+	[SDMA6_HWIP]	= "SDMA6",
+	[SDMA7_HWIP]	= "SDMA7",
+	[LSDMA_HWIP]	= "LSDMA",
+	[MMHUB_HWIP]	= "MMHUB",
+	[ATHUB_HWIP]	= "ATHUB",
+	[NBIO_HWIP]	= "NBIO",
+	[MP0_HWIP]	= "MP0",
+	[MP1_HWIP]	= "MP1",
+	[UVD_HWIP]	= "UVD/JPEG/VCN",
+	[VCN1_HWIP]	= "VCN1",
+	[VCE_HWIP]	= "VCE",
+	[VPE_HWIP]	= "VPE",
+	[DF_HWIP]	= "DF",
+	[DCE_HWIP]	= "DCE",
+	[OSSSYS_HWIP]	= "OSSSYS",
+	[SMUIO_HWIP]	= "SMUIO",
+	[PWR_HWIP]	= "PWR",
+	[NBIF_HWIP]	= "NBIF",
+	[THM_HWIP]	= "THM",
+	[CLK_HWIP]	= "CLK",
+	[UMC_HWIP]	= "UMC",
+	[RSMU_HWIP]	= "RSMU",
+	[XGMI_HWIP]	= "XGMI",
+	[DCI_HWIP]	= "DCI",
+	[PCIE_HWIP]	= "PCIE",
+};
+
 int amdgpu_reset_init(struct amdgpu_device *adev)
 {
 	int ret = 0;
@@ -196,6 +233,31 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
 			   coredump->reset_task_info.process_name,
 			   coredump->reset_task_info.pid);

+	/* GPU IP's information of the SOC */
+	if (coredump->adev) {
+		drm_printf(&p, "\nIP Information\n");
+		drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
+		drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
+		drm_printf(&p, "SOC External Revision id: %d\n",
+			   coredump->adev->external_rev_id);
+
+		for (int i = 1; i < MAX_HWIP; i++) {
+			for (int j = 0; j < HWIP_MAX_INSTANCE; j++) {
+				int ver = coredump->adev->ip_versions[i][j];
+
+				if (ver)
+					drm_printf(&p, "HWIP: %s[%d][%d]: v%d.%d.%d.%d.%d\n",
+						   hw_ip_names[i], i, j,
+						   IP_VERSION_MAJ(ver),
+						   IP_VERSION_MIN(ver),
+						   IP_VERSION_REV(ver),
+						   IP_VERSION_VARIANT(ver),
+						   IP_VERSION_SUBREV(ver));
+			}
+		}
+	}
+
 	if (coredump->ring) {
 		drm_printf(&p, "\nRing timed out details\n");
 		drm_printf(&p, "IP Type: %d Ring Name: %s\n",
--
2.34.1
RE: [PATCH] drm/amdgpu: add the hw_ip version of all IP's
[AMD Official Use Only - General]

Hello Alex,

Added the information directly from the ip_version and also added names for each IP so the version information makes more sense to the user. Below is the output in devcoredump now:

IP Information
SOC Family: 143
SOC Revision id: 0
SOC External Revision id: 50
HWIP: GC[1][0]: v10.3.2.0.0
HWIP: HDP[2][0]: v5.0.3.0.0
HWIP: SDMA0[3][0]: v5.2.2.0.0
HWIP: SDMA1[4][0]: v5.2.2.0.0
HWIP: MMHUB[12][0]: v2.1.0.0.0
HWIP: ATHUB[13][0]: v2.1.0.0.0
HWIP: NBIO[14][0]: v3.3.1.0.0
HWIP: MP0[15][0]: v11.0.11.0.0
HWIP: MP1[16][0]: v11.0.11.0.0
HWIP: UVD/JPEG/VCN[17][0]: v3.0.0.0.0
HWIP: UVD/JPEG/VCN[17][1]: v3.0.1.0.0
HWIP: DF[21][0]: v3.7.3.0.0
HWIP: DCE[22][0]: v3.0.0.0.0
HWIP: OSSSYS[23][0]: v5.0.3.0.0
HWIP: SMUIO[24][0]: v11.0.6.0.0
HWIP: NBIF[26][0]: v3.3.1.0.0
HWIP: THM[27][0]: v11.0.5.0.0
HWIP: CLK[28][0]: v11.0.3.0.0
HWIP: CLK[28][1]: v11.0.3.0.0
HWIP: CLK[28][2]: v11.0.3.0.0
HWIP: CLK[28][3]: v11.0.3.0.0
HWIP: CLK[28][4]: v11.0.3.0.0
HWIP: CLK[28][5]: v11.0.3.0.0
HWIP: CLK[28][6]: v11.0.3.0.0
HWIP: CLK[28][7]: v11.0.3.0.0
HWIP: UMC[29][0]: v8.7.1.0.0
HWIP: UMC[29][1]: v8.7.1.0.0
HWIP: UMC[29][2]: v8.7.1.0.0
HWIP: UMC[29][3]: v8.7.1.0.0
HWIP: UMC[29][4]: v8.7.1.0.0
HWIP: UMC[29][5]: v8.7.1.0.0
HWIP: PCIE[33][0]: v6.5.0.0.0

-----Original Message-----
From: Sunil Khatri
Sent: Friday, March 15, 2024 5:43 PM
To: Deucher, Alexander ; Koenig, Christian ; Sharma, Shashank
Cc: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org; linux-ker...@vger.kernel.org; Khatri, Sunil
Subject: [PATCH] drm/amdgpu: add the hw_ip version of all IP's

Add all the IP's version information on a SOC to the devcoredump.
Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 62 +++ 1 file changed, 62 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index a0dbccad2f53..3d4bfe0a5a7c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -29,6 +29,43 @@ #include "sienna_cichlid.h" #include "smu_v13_0_10.h" +const char *hw_ip_names[MAX_HWIP] = { + [GC_HWIP] = "GC", + [HDP_HWIP] = "HDP", + [SDMA0_HWIP]= "SDMA0", + [SDMA1_HWIP]= "SDMA1", + [SDMA2_HWIP]= "SDMA2", + [SDMA3_HWIP]= "SDMA3", + [SDMA4_HWIP]= "SDMA4", + [SDMA5_HWIP]= "SDMA5", + [SDMA6_HWIP]= "SDMA6", + [SDMA7_HWIP]= "SDMA7", + [LSDMA_HWIP]= "LSDMA", + [MMHUB_HWIP]= "MMHUB", + [ATHUB_HWIP]= "ATHUB", + [NBIO_HWIP] = "NBIO", + [MP0_HWIP] = "MP0", + [MP1_HWIP] = "MP1", + [UVD_HWIP] = "UVD/JPEG/VCN", + [VCN1_HWIP] = "VCN1", + [VCE_HWIP] = "VCE", + [VPE_HWIP] = "VPE", + [DF_HWIP] = "DF", + [DCE_HWIP] = "DCE", + [OSSSYS_HWIP] = "OSSSYS", + [SMUIO_HWIP]= "SMUIO", + [PWR_HWIP] = "PWR", + [NBIF_HWIP] = "NBIF", + [THM_HWIP] = "THM", + [CLK_HWIP] = "CLK", + [UMC_HWIP] = "UMC", + [RSMU_HWIP] = "RSMU", + [XGMI_HWIP] = "XGMI", + [DCI_HWIP] = "DCI", + [PCIE_HWIP] = "PCIE", +}; + + int amdgpu_reset_init(struct amdgpu_device *adev) { int ret = 0; @@ -196,6 +233,31 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, coredump->reset_task_info.process_name, coredump->reset_task_info.pid); + /* GPU IP's information of the SOC */ + if (coredump->adev) { + + drm_printf(, "\nIP Information\n"); + drm_printf(, "SOC Family: %d\n", coredump->adev->family); + drm_printf(, "SOC Revision id: %d\n", coredump->adev->rev_id); + drm_printf(, "SOC External Revision id: %d\n", + coredump->adev->external_rev_id); + + for (int i = 1; i < MAX_HWIP; i++) { + for (int j = 0; j < HWIP_MAX_INSTANCE; j++) { + int ver = coredump->adev->ip_versions[i][j]; + + if (ver) +
Re: [PATCH 1/2] drm/amdgpu: add the IP information of the soc
On 3/14/2024 8:12 PM, Alex Deucher wrote: On Thu, Mar 14, 2024 at 1:44 AM Khatri, Sunil wrote: On 3/14/2024 1:58 AM, Alex Deucher wrote: On Tue, Mar 12, 2024 at 8:41 AM Sunil Khatri wrote: Add all the IP's information on a SOC to the devcoredump. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index a0dbccad2f53..611fdb90a1fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, coredump->reset_task_info.process_name, coredump->reset_task_info.pid); + /* GPU IP's information of the SOC */ + if (coredump->adev) { + drm_printf(, "\nIP Information\n"); + drm_printf(, "SOC Family: %d\n", coredump->adev->family); + drm_printf(, "SOC Revision id: %d\n", coredump->adev->rev_id); + + for (int i = 0; i < coredump->adev->num_ip_blocks; i++) { + struct amdgpu_ip_block *ip = + >adev->ip_blocks[i]; + drm_printf(, "IP type: %d IP name: %s\n", + ip->version->type, + ip->version->funcs->name); + drm_printf(, "IP version: (%d,%d,%d)\n\n", + ip->version->major, + ip->version->minor, + ip->version->rev); + } + } I think the IP discovery table would be more useful. Either walk the adev->ip_versions structure, or just include the IP discovery binary. I did explore the adev->ip_versions and if i just go through the array it doesn't give any useful information directly. There are no ways to find directly from adev->ip_versions below things until i also reparse the discovery binary again like done the discovery amdgpu_discovery_reg_base_init and walk through the headers of various ips using discovery binary. a. Which IP is available on soc or not. b. How many instances are there Also i again have to change back to major, minor and rev convention for this information to be useful. 
I am exploring it more; if I find some other information I will update.

adev->ip_block[] is derived from IP discovery only for each block which is present on the SOC, so we are not reading information which isn't applicable to the SOC. We have the name, type and version number of the IPs available on the SOC. If you want I could add the number of instances of each IP too, if you think that's useful information here. Could you share what information is missing in this approach so I can include it?

I was hoping to get the actual IP versions for the IPs from IP discovery rather than the versions from the ip_block array. The latter are common, so you can end up with the same version used across a wide variety of chips (e.g., all gfx10.x based chips use the same gfx 10 IP code even if the actual IP version is different for most of the chips).

Got it, let me check how this could be done correctly.

For dumping the IP discovery binary, I don't understand how that information would be useful directly, since it needs to be decoded like we are doing in discovery init. Please correct me if my understanding is wrong here.

It's probably not a high priority, I was just thinking it might be useful to have in case there ended up being some problem related to the IP discovery table on some boards. E.g., we'd know that all boards with a certain harvest config seem to align with a reported problem. Similar for vbios. It's more for telemetry. E.g., all the boards reporting some problem have a particular powerplay config or whatever.

I got it. But there are two points of contention here in my understanding. First, the dump works only when there is a reset, so I am not sure whether it could be used very early in development or not. Second, devcoredump is 4096 bytes/4 KB of memory where we are dumping all the information. Not sure if that could be increased, but it might not be enough if we are planning to dump everything to it. Another point is that we already have sysfs/debugfs/info ioctl etc. information available.
We should sort out what really is helpful in debugging a GPU hang and add that to devcore.

Regards
Sunil

Alex

+
 	if (coredump->ring) {
 		drm_printf(&p, "\nRing timed out details\n");
 		drm_printf(&p, "IP Type: %d Ring Name: %s\n",
--
2.34.1
Re: [PATCH 2/2] drm:amdgpu: add firmware information of all IP's
On 3/14/2024 11:40 AM, Sharma, Shashank wrote:
On 14/03/2024 06:58, Khatri, Sunil wrote:
On 3/14/2024 2:06 AM, Alex Deucher wrote:
On Tue, Mar 12, 2024 at 8:42 AM Sunil Khatri wrote:

Add firmware version information of each IP and each instance where applicable.

Is there a way we can share some common code with devcoredump, debugfs, and the info IOCTL? All three places need to query this information, and the same logic is repeated in each case.

Hello Alex,

Yes, you are absolutely right: the same information is being retrieved again, as is done in debugfs. I can reorganize the code so the same function could be used by debugfs and devcoredump, but this is exactly what I tried to avoid here. I tried to use minimal functionality in devcoredump without shuffling a lot of code here and there. Also, our devcoredump is implemented in amdgpu_reset.c, and not all the information is available there, so we might have to include a lot of headers and cross-function calls in amdgpu_reset unless we add a dedicated file for devcoredump.

I think Alex is suggesting to have one common backend that generates all the core debug info, and then different wrapper functions which can pack this raw info into packets aligned with the respective infra like devcore/debugfs/info IOCTL. That seems like a good idea to me. If you think you need a new file for this backend, that should be fine.

My suggestion was along the same lines: if we want to use the same infra to access information from different parts of the code, we need to reorganize. And at the same time, since there is quite some data we are planning to add to devcoredump, I recommend having a dedicated .c/.h instead of using amdgpu_reset.c, so a clean include is easy to maintain. Once Alex confirms, I can start working on the design and on what information we need here.
Regards Sunil something like: amdgpu_debug_core.c:: struct amdgpu_core_debug_info { /* Superset of all the info you are collecting from HW */ }; - amdgpu_debug_generate_core_info { /* This function collects the core debug info from HW and saves in amdgpu_core_debug_info, we can update this periodically regardless of a request */ } and then: devcore_info *amdgpu_debug_pack_for_devcore(core_debug_info) { /* convert core debug info into devcore aligned format/data */ } ioctl_info *amdgpu_debug_pack_for_info_ioctl(core_debug_info) { /* convert core debug info into info IOCTL aligned format/data */ } debugfs_info *amdgpu_debug_pack_for_debugfs(core_debug_info) { /* convert core debug info into debugfs aligned format/data */ } - Shashank Info IOCTL does have a lot of information which also is in pipeline to be dumped but this if we want to reuse the functionality of IOCTL we need to reorganize a lot of code. If that is the need of the hour i could work on that. Please let me know. This is my suggestion if it makes sense: 1. If we want to reuse a lot of functionality then we need to modularize some of the functions further so they could be consumed directly by devcoredump. 2. We should also have a dedicated file for devcoredump.c/.h so its easy to include headers of needed functionality cleanly and easy to expand devcoredump. 3. based on the priority and importance of this task we can add information else some repetition is a real possibility. 
Alex Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 122 ++ 1 file changed, 122 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index 611fdb90a1fc..78ddc58aef67 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -168,6 +168,123 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, { } #else +static void amdgpu_devcoredump_fw_info(struct amdgpu_device *adev, struct drm_printer *p) +{ + uint32_t version; + uint32_t feature; + uint8_t smu_program, smu_major, smu_minor, smu_debug; + + drm_printf(p, "VCE feature version: %u, fw version: 0x%08x\n", + adev->vce.fb_version, adev->vce.fw_version); + drm_printf(p, "UVD feature version: %u, fw version: 0x%08x\n", + 0, adev->uvd.fw_version); + drm_printf(p, "GMC feature version: %u, fw version: 0x%08x\n", + 0, adev->gmc.fw_version); + drm_printf(p, "ME feature version: %u, fw version: 0x%08x\n", + adev->gfx.me_feature_version, adev->gfx.me_fw_version); + drm_printf(p, "PFP feature version: %u, fw version: 0x%08x\n", + adev->gfx.pfp_feature_version, adev->gfx.pfp_fw_version); + drm_printf(p, "CE feature version: %u, fw version: 0x%08x\n", + adev->gfx.ce_feature_version, adev->gfx.ce_fw_version); +
Re: [PATCH 2/2] drm:amdgpu: add firmware information of all IP's
On 3/14/2024 2:06 AM, Alex Deucher wrote: On Tue, Mar 12, 2024 at 8:42 AM Sunil Khatri wrote: Add firmware version information of each IP and each instance where applicable. Is there a way we can share some common code with devcoredump, debugfs, and the info IOCTL? All three places need to query this information and the same logic is repeated in each case. Hello Alex, Yes you re absolutely right the same information is being retrieved again as done in debugfs. I can reorganize the code so same function could be used by debugfs and devcoredump but this is exactly what i tried to avoid here. I did try to use minimum functionality in devcoredump without shuffling a lot of code here and there. Also our devcoredump is implemented in amdgpu_reset.c and not all the information is available here and there we might have to include lot of header and cross functions in amdgpu_reset until we want a dedicated file for devcoredump. Info IOCTL does have a lot of information which also is in pipeline to be dumped but this if we want to reuse the functionality of IOCTL we need to reorganize a lot of code. If that is the need of the hour i could work on that. Please let me know. This is my suggestion if it makes sense: 1. If we want to reuse a lot of functionality then we need to modularize some of the functions further so they could be consumed directly by devcoredump. 2. We should also have a dedicated file for devcoredump.c/.h so its easy to include headers of needed functionality cleanly and easy to expand devcoredump. 3. based on the priority and importance of this task we can add information else some repetition is a real possibility. 
Alex Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 122 ++ 1 file changed, 122 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index 611fdb90a1fc..78ddc58aef67 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -168,6 +168,123 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, { } #else +static void amdgpu_devcoredump_fw_info(struct amdgpu_device *adev, struct drm_printer *p) +{ + uint32_t version; + uint32_t feature; + uint8_t smu_program, smu_major, smu_minor, smu_debug; + + drm_printf(p, "VCE feature version: %u, fw version: 0x%08x\n", + adev->vce.fb_version, adev->vce.fw_version); + drm_printf(p, "UVD feature version: %u, fw version: 0x%08x\n", + 0, adev->uvd.fw_version); + drm_printf(p, "GMC feature version: %u, fw version: 0x%08x\n", + 0, adev->gmc.fw_version); + drm_printf(p, "ME feature version: %u, fw version: 0x%08x\n", + adev->gfx.me_feature_version, adev->gfx.me_fw_version); + drm_printf(p, "PFP feature version: %u, fw version: 0x%08x\n", + adev->gfx.pfp_feature_version, adev->gfx.pfp_fw_version); + drm_printf(p, "CE feature version: %u, fw version: 0x%08x\n", + adev->gfx.ce_feature_version, adev->gfx.ce_fw_version); + drm_printf(p, "RLC feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_feature_version, adev->gfx.rlc_fw_version); + + drm_printf(p, "RLC SRLC feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_srlc_feature_version, + adev->gfx.rlc_srlc_fw_version); + drm_printf(p, "RLC SRLG feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_srlg_feature_version, + adev->gfx.rlc_srlg_fw_version); + drm_printf(p, "RLC SRLS feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_srls_feature_version, + adev->gfx.rlc_srls_fw_version); + drm_printf(p, "RLCP feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlcp_ucode_feature_version, + 
adev->gfx.rlcp_ucode_version); + drm_printf(p, "RLCV feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlcv_ucode_feature_version, + adev->gfx.rlcv_ucode_version); + drm_printf(p, "MEC feature version: %u, fw version: 0x%08x\n", + adev->gfx.mec_feature_version, + adev->gfx.mec_fw_version); + + if (adev->gfx.mec2_fw) + drm_printf(p, + "MEC2 feature version: %u, fw version: 0x%08x\n", + adev->gfx.mec2_feature_version, + adev->gfx.mec2_fw_version); + + drm_printf(p, "IMU feature version: %u, fw version: 0x%08x\n", + 0, adev->gfx.imu_fw_version); + drm_printf(p, "PSP SOS feature version: %u, fw version: 0x%08x\n", + adev->psp.sos.feature_version, + adev->psp.sos.fw_version); + drm_printf(p, "PSP ASD
Re: [PATCH 1/2] drm/amdgpu: add the IP information of the soc
On 3/14/2024 1:58 AM, Alex Deucher wrote: On Tue, Mar 12, 2024 at 8:41 AM Sunil Khatri wrote: Add all the IP's information on a SOC to the devcoredump. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index a0dbccad2f53..611fdb90a1fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, coredump->reset_task_info.process_name, coredump->reset_task_info.pid); + /* GPU IP's information of the SOC */ + if (coredump->adev) { + drm_printf(, "\nIP Information\n"); + drm_printf(, "SOC Family: %d\n", coredump->adev->family); + drm_printf(, "SOC Revision id: %d\n", coredump->adev->rev_id); + + for (int i = 0; i < coredump->adev->num_ip_blocks; i++) { + struct amdgpu_ip_block *ip = + >adev->ip_blocks[i]; + drm_printf(, "IP type: %d IP name: %s\n", + ip->version->type, + ip->version->funcs->name); + drm_printf(, "IP version: (%d,%d,%d)\n\n", + ip->version->major, + ip->version->minor, + ip->version->rev); + } + } I think the IP discovery table would be more useful. Either walk the adev->ip_versions structure, or just include the IP discovery binary. I did explore the adev->ip_versions and if i just go through the array it doesn't give any useful information directly. There are no ways to find directly from adev->ip_versions below things until i also reparse the discovery binary again like done the discovery amdgpu_discovery_reg_base_init and walk through the headers of various ips using discovery binary. a. Which IP is available on soc or not. b. How many instances are there Also i again have to change back to major, minor and rev convention for this information to be useful. I am exploring it more if i find some other information i will update. 
adev->ip_block[] is derived from ip discovery only for each block which is there on the SOC, so we are not reading information which isnt applicable for the soc. We have name , type and version no of the IPs available on the soc. If you want i could add no of instances of each IP too if you think that's useful information here. Could you share what information is missing in this approach so i can include that. For dumping the IP discovery binary, i dont understand how that information would be useful directly and needs to be decoded like we are doing in discovery init. Please correct me if my understanding is wrong here. Alex + if (coredump->ring) { drm_printf(, "\nRing timed out details\n"); drm_printf(, "IP Type: %d Ring Name: %s\n", -- 2.34.1
Re: [PATCH 2/2] drm:amdgpu: add firmware information of all IP's
[AMD Official Use Only - General] Gentle reminder Regards Sunil Get Outlook for Android<https://aka.ms/AAb9ysg> From: Sunil Khatri Sent: Tuesday, March 12, 2024 6:11:48 PM To: Deucher, Alexander ; Koenig, Christian ; Sharma, Shashank Cc: amd-gfx@lists.freedesktop.org ; dri-de...@lists.freedesktop.org ; linux-ker...@vger.kernel.org ; Khatri, Sunil Subject: [PATCH 2/2] drm:amdgpu: add firmware information of all IP's Add firmware version information of each IP and each instance where applicable. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 122 ++ 1 file changed, 122 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index 611fdb90a1fc..78ddc58aef67 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -168,6 +168,123 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, { } #else +static void amdgpu_devcoredump_fw_info(struct amdgpu_device *adev, struct drm_printer *p) +{ + uint32_t version; + uint32_t feature; + uint8_t smu_program, smu_major, smu_minor, smu_debug; + + drm_printf(p, "VCE feature version: %u, fw version: 0x%08x\n", + adev->vce.fb_version, adev->vce.fw_version); + drm_printf(p, "UVD feature version: %u, fw version: 0x%08x\n", + 0, adev->uvd.fw_version); + drm_printf(p, "GMC feature version: %u, fw version: 0x%08x\n", + 0, adev->gmc.fw_version); + drm_printf(p, "ME feature version: %u, fw version: 0x%08x\n", + adev->gfx.me_feature_version, adev->gfx.me_fw_version); + drm_printf(p, "PFP feature version: %u, fw version: 0x%08x\n", + adev->gfx.pfp_feature_version, adev->gfx.pfp_fw_version); + drm_printf(p, "CE feature version: %u, fw version: 0x%08x\n", + adev->gfx.ce_feature_version, adev->gfx.ce_fw_version); + drm_printf(p, "RLC feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_feature_version, adev->gfx.rlc_fw_version); + + drm_printf(p, "RLC SRLC feature version: %u, fw version: 
0x%08x\n", + adev->gfx.rlc_srlc_feature_version, + adev->gfx.rlc_srlc_fw_version); + drm_printf(p, "RLC SRLG feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_srlg_feature_version, + adev->gfx.rlc_srlg_fw_version); + drm_printf(p, "RLC SRLS feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlc_srls_feature_version, + adev->gfx.rlc_srls_fw_version); + drm_printf(p, "RLCP feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlcp_ucode_feature_version, + adev->gfx.rlcp_ucode_version); + drm_printf(p, "RLCV feature version: %u, fw version: 0x%08x\n", + adev->gfx.rlcv_ucode_feature_version, + adev->gfx.rlcv_ucode_version); + drm_printf(p, "MEC feature version: %u, fw version: 0x%08x\n", + adev->gfx.mec_feature_version, + adev->gfx.mec_fw_version); + + if (adev->gfx.mec2_fw) + drm_printf(p, + "MEC2 feature version: %u, fw version: 0x%08x\n", + adev->gfx.mec2_feature_version, + adev->gfx.mec2_fw_version); + + drm_printf(p, "IMU feature version: %u, fw version: 0x%08x\n", + 0, adev->gfx.imu_fw_version); + drm_printf(p, "PSP SOS feature version: %u, fw version: 0x%08x\n", + adev->psp.sos.feature_version, + adev->psp.sos.fw_version); + drm_printf(p, "PSP ASD feature version: %u, fw version: 0x%08x\n", + adev->psp.asd_context.bin_desc.feature_version, + adev->psp.asd_context.bin_desc.fw_version); + + drm_printf(p, "TA XGMI feature version: 0x%08x, fw version: 0x%08x\n", + adev->psp.xgmi_context.context.bin_desc.feature_version, + adev->psp.xgmi_context.context.bin_desc.fw_version); + drm_printf(p, "TA RAS feature version: 0x%08x, fw version: 0x%08x\n", + adev->psp.ras_context.context.bin_desc.feature_version, + adev->psp.ras_context.context.bin_desc.fw_version); + drm_printf(p, "TA HDCP feature version: 0x%08x, fw version: 0x%08x\n", + adev->psp.hdcp_context.context.bin_desc.feature_version, + adev->psp.hdcp_context.context.b
Re: [PATCH 1/2] drm/amdgpu: add the IP information of the soc
[AMD Official Use Only - General] Gentle reminder for review. Regards Sunil Get Outlook for Android<https://aka.ms/AAb9ysg> From: Sunil Khatri Sent: Tuesday, March 12, 2024 6:11:47 PM To: Deucher, Alexander ; Koenig, Christian ; Sharma, Shashank Cc: amd-gfx@lists.freedesktop.org ; dri-de...@lists.freedesktop.org ; linux-ker...@vger.kernel.org ; Khatri, Sunil Subject: [PATCH 1/2] drm/amdgpu: add the IP information of the soc Add all the IP's information on a SOC to the devcoredump. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index a0dbccad2f53..611fdb90a1fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, coredump->reset_task_info.process_name, coredump->reset_task_info.pid); + /* GPU IP's information of the SOC */ + if (coredump->adev) { + drm_printf(, "\nIP Information\n"); + drm_printf(, "SOC Family: %d\n", coredump->adev->family); + drm_printf(, "SOC Revision id: %d\n", coredump->adev->rev_id); + + for (int i = 0; i < coredump->adev->num_ip_blocks; i++) { + struct amdgpu_ip_block *ip = + >adev->ip_blocks[i]; + drm_printf(, "IP type: %d IP name: %s\n", + ip->version->type, + ip->version->funcs->name); + drm_printf(, "IP version: (%d,%d,%d)\n\n", + ip->version->major, + ip->version->minor, + ip->version->rev); + } + } + if (coredump->ring) { drm_printf(, "\nRing timed out details\n"); drm_printf(, "IP Type: %d Ring Name: %s\n", -- 2.34.1
Re: [PATCH 1/2] drm/amdgpu: add the IP information of the soc
[AMD Official Use Only - General] Gentle Reminder for review. Regards, Sunil Get Outlook for Android<https://aka.ms/AAb9ysg> From: Sunil Khatri Sent: Tuesday, March 12, 2024 6:11:47 PM To: Deucher, Alexander ; Koenig, Christian ; Sharma, Shashank Cc: amd-gfx@lists.freedesktop.org ; dri-de...@lists.freedesktop.org ; linux-ker...@vger.kernel.org ; Khatri, Sunil Subject: [PATCH 1/2] drm/amdgpu: add the IP information of the soc Add all the IP's information on a SOC to the devcoredump. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index a0dbccad2f53..611fdb90a1fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, coredump->reset_task_info.process_name, coredump->reset_task_info.pid); + /* GPU IP's information of the SOC */ + if (coredump->adev) { + drm_printf(, "\nIP Information\n"); + drm_printf(, "SOC Family: %d\n", coredump->adev->family); + drm_printf(, "SOC Revision id: %d\n", coredump->adev->rev_id); + + for (int i = 0; i < coredump->adev->num_ip_blocks; i++) { + struct amdgpu_ip_block *ip = + >adev->ip_blocks[i]; + drm_printf(, "IP type: %d IP name: %s\n", + ip->version->type, + ip->version->funcs->name); + drm_printf(, "IP version: (%d,%d,%d)\n\n", + ip->version->major, + ip->version->minor, + ip->version->rev); + } + } + if (coredump->ring) { drm_printf(, "\nRing timed out details\n"); drm_printf(, "IP Type: %d Ring Name: %s\n", -- 2.34.1
Re: [PATCH] drm/amdgpu: add ring buffer information in devcoredump
On 3/11/2024 7:29 PM, Christian König wrote:
Am 11.03.24 um 13:22 schrieb Sunil Khatri:

Add relevant ring buffer information such as rptr, wptr, ring name, ring size and also the ring contents for each ring on a GPU reset.

Signed-off-by: Sunil Khatri

---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 6d059f853adc..1992760039da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -215,6 +215,27 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
 			   fault_info->status);
 	}

+	drm_printf(&p, "Ring buffer information\n");
+	for (int i = 0; i < coredump->adev->num_rings; i++) {
+		int j = 0;
+		struct amdgpu_ring *ring = coredump->adev->rings[i];
+
+		drm_printf(&p, "ring name: %s\n", ring->name);
+		drm_printf(&p, "Rptr: 0x%llx Wptr: 0x%llx\n",
+			   amdgpu_ring_get_rptr(ring) & ring->buf_mask,
+			   amdgpu_ring_get_wptr(ring) & ring->buf_mask);

Don't apply the mask here. We do have some use cases where the rptr and wptr are outside the ring buffer.

Sure, I will remove the mask.

+		drm_printf(&p, "Ring size in dwords: %d\n",
+			   ring->ring_size / 4);

Rather print the mask as an additional value here.

Does it help to add the mask value?

+		drm_printf(&p, "Ring contents\n");
+		drm_printf(&p, "Offset \t Value\n");
+
+		while (j < ring->ring_size) {
+			drm_printf(&p, "0x%x \t 0x%x\n", j, ring->ring[j/4]);
+			j += 4;
+		}
+		drm_printf(&p, "Ring dumped\n");

That seems superfluous.

Noted

Regards Sunil

Regards,
Christian.

+	}
+
 	if (coredump->reset_vram_lost)
 		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
 	if (coredump->adev->reset_info.num_regs) {
RE: [PATCH] drm/amdgpu: add all ringbuffer information in devcoredump
Ignore this as I updated commit message and subject so sending new mail. -Original Message- From: Sunil Khatri Sent: Monday, March 11, 2024 5:04 PM To: Deucher, Alexander ; Koenig, Christian ; Sharma, Shashank Cc: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org; linux-ker...@vger.kernel.org; Khatri, Sunil Subject: [PATCH] drm/amdgpu: add all ringbuffer information in devcoredump Add ringbuffer information such as: rptr, wptr, ring name, ring size and also the ring contents for each ring on a gpu reset. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 21 + 1 file changed, 21 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index 6d059f853adc..1992760039da 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -215,6 +215,27 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, fault_info->status); } + drm_printf(, "Ring buffer information\n"); + for (int i = 0; i < coredump->adev->num_rings; i++) { + int j = 0; + struct amdgpu_ring *ring = coredump->adev->rings[i]; + + drm_printf(, "ring name: %s\n", ring->name); + drm_printf(, "Rptr: 0x%llx Wptr: 0x%llx\n", + amdgpu_ring_get_rptr(ring) & ring->buf_mask, + amdgpu_ring_get_wptr(ring) & ring->buf_mask); + drm_printf(, "Ring size in dwords: %d\n", + ring->ring_size / 4); + drm_printf(, "Ring contents\n"); + drm_printf(, "Offset \t Value\n"); + + while (j < ring->ring_size) { + drm_printf(, "0x%x \t 0x%x\n", j, ring->ring[j/4]); + j += 4; + } + drm_printf(, "Ring dumped\n"); + } + if (coredump->reset_vram_lost) drm_printf(, "VRAM is lost due to GPU reset!\n"); if (coredump->adev->reset_info.num_regs) { -- 2.34.1
Re: [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump
On 3/8/2024 2:39 PM, Christian König wrote: Am 07.03.24 um 21:50 schrieb Sunil Khatri: Add page fault information to the devcoredump. Output of devcoredump: AMDGPU Device Coredump version: 1 kernel: 6.7.0-amd-staging-drm-next module: amdgpu time: 29.725011811 process_name: soft_recovery_p PID: 1720 Ring timed out details IP Type: 0 Ring Name: gfx_0.0.0 [gfxhub] Page fault observed Faulty page starting at address: 0x Protection fault status register: 0x301031 VRAM is lost due to GPU reset! Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index 147100c27c2d..8794a3c21176 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, coredump->ring->name); } + if (coredump->adev) { + struct amdgpu_vm_fault_info *fault_info = + >adev->vm_manager.fault_info; + + drm_printf(, "\n[%s] Page fault observed\n", + fault_info->vmhub ? "mmhub" : "gfxhub"); + drm_printf(, "Faulty page starting at address: 0x%016llx\n", + fault_info->addr); + drm_printf(, "Protection fault status register: 0x%x\n", + fault_info->status); + } + if (coredump->reset_vram_lost) - drm_printf(, "VRAM is lost due to GPU reset!\n"); + drm_printf(, "\nVRAM is lost due to GPU reset!\n"); Why this additional new line? The intent is the devcoredump have different sections clearly demarcated with an new line else "VRAM is lost due to GPU reset!" seems part of the page fault information. [gfxhub] Page fault observed Faulty page starting at address: 0x Protection fault status register: 0x301031 VRAM is lost due to GPU reset! Regards Sunil Apart from that looks really good to me. Regards, Christian. if (coredump->adev->reset_info.num_regs) { drm_printf(, "AMDGPU register dumps:\nOffset: Value:\n");
Re: [PATCH 2/2] drm/amdgpu: add vm fault information to devcoredump
On 3/8/2024 12:44 AM, Alex Deucher wrote: On Thu, Mar 7, 2024 at 12:00 PM Sunil Khatri wrote: Add page fault information to the devcoredump. Output of devcoredump: AMDGPU Device Coredump version: 1 kernel: 6.7.0-amd-staging-drm-next module: amdgpu time: 29.725011811 process_name: soft_recovery_p PID: 1720 Ring timed out details IP Type: 0 Ring Name: gfx_0.0.0 [gfxhub] Page fault observed Faulty page starting at address 0x Do you want a : before the address for consistency? sure. Protection fault status register:0x301031 How about a space after the : for consistency? For parsability, it may make more sense to just have a list of key value pairs: [GPU page fault] hub: addr: status: [Ring timeout details] IP: ring: name: etc. Sure i agree but till now i was capturing information like we shared in dmesg which is user readable. But surely one we have enough data i could arrange all in key: value pairs like you suggest in a patch later if that works ? VRAM is lost due to GPU reset! Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index 147100c27c2d..dd39e614d907 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, coredump->ring->name); } + if (coredump->adev) { + struct amdgpu_vm_fault_info *fault_info = + >adev->vm_manager.fault_info; + + drm_printf(, "\n[%s] Page fault observed\n", + fault_info->vmhub ? 
"mmhub" : "gfxhub"); + drm_printf(, "Faulty page starting at address 0x%016llx\n", + fault_info->addr); + drm_printf(, "Protection fault status register:0x%x\n", + fault_info->status); + } + if (coredump->reset_vram_lost) - drm_printf(, "VRAM is lost due to GPU reset!\n"); + drm_printf(, "\nVRAM is lost due to GPU reset!\n"); if (coredump->adev->reset_info.num_regs) { drm_printf(, "AMDGPU register dumps:\nOffset: Value:\n"); -- 2.34.1
Re: [PATCH] drm/amdgpu: add vm fault information to devcoredump
On 3/7/2024 6:10 PM, Christian König wrote: Am 07.03.24 um 09:37 schrieb Khatri, Sunil: On 3/7/2024 1:47 PM, Christian König wrote: Am 06.03.24 um 19:19 schrieb Sunil Khatri: Add page fault information to the devcoredump. Output of devcoredump: AMDGPU Device Coredump version: 1 kernel: 6.7.0-amd-staging-drm-next module: amdgpu time: 29.725011811 process_name: soft_recovery_p PID: 1720 Ring timed out details IP Type: 0 Ring Name: gfx_0.0.0 [gfxhub] Page fault observed for GPU family:143 Faulty page starting at address 0x Protection fault status register:0x301031 VRAM is lost due to GPU reset! Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 + 2 files changed, 15 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index 147100c27c2d..d7fea6cdf2f9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, coredump->ring->name); } + if (coredump->fault_info.status) { + struct amdgpu_vm_fault_info *fault_info = >fault_info; + + drm_printf(, "\n[%s] Page fault observed for GPU family:%d\n", + fault_info->vmhub ? "mmhub" : "gfxhub", + coredump->adev->family); + drm_printf(, "Faulty page starting at address 0x%016llx\n", + fault_info->addr); + drm_printf(, "Protection fault status register:0x%x\n", + fault_info->status); + } + if (coredump->reset_vram_lost) - drm_printf(, "VRAM is lost due to GPU reset!\n"); + drm_printf(, "\nVRAM is lost due to GPU reset!\n"); if (coredump->adev->reset_info.num_regs) { drm_printf(, "AMDGPU register dumps:\nOffset: Value:\n"); @@ -253,6 +265,7 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, if (job) { s_job = >base; coredump->ring = to_amdgpu_ring(s_job->sched); + coredump->fault_info = job->vm->fault_info; That's illegal. 
The VM pointer might already be stale at this point. I think you need to add the fault info of the last fault globally in the VRAM manager or move this to the process info Shashank is working on. Are you saying that during the reset or otherwise a vm which is part of this job could have been freed and we might have a NULL dereference or invalid reference? Till now based on the resets and pagefaults that i have created till now using the same app which we are using for IH overflow i am able to get the valid vm only. Assuming amdgpu_vm is freed for this job or stale, are you suggesting to update this information in adev-> vm_manager along with existing per vm fault_info or only in vm_manager ? Good question. having it both in the VM as well as the VM manager sounds like the simplest option for now. Let me update the patch then with information in VM manager. Regards Sunil Regards, Christian. Regards, Christian. } coredump->adev = adev; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h index 60522963aaca..3197955264f9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h @@ -98,6 +98,7 @@ struct amdgpu_coredump_info { struct timespec64 reset_time; bool reset_vram_lost; struct amdgpu_ring *ring; + struct amdgpu_vm_fault_info fault_info; }; #endif
Re: [PATCH] drm/amdgpu: add vm fault information to devcoredump
On 3/7/2024 1:47 PM, Christian König wrote: Am 06.03.24 um 19:19 schrieb Sunil Khatri: Add page fault information to the devcoredump. Output of devcoredump: AMDGPU Device Coredump version: 1 kernel: 6.7.0-amd-staging-drm-next module: amdgpu time: 29.725011811 process_name: soft_recovery_p PID: 1720 Ring timed out details IP Type: 0 Ring Name: gfx_0.0.0 [gfxhub] Page fault observed for GPU family:143 Faulty page starting at address 0x Protection fault status register:0x301031 VRAM is lost due to GPU reset! Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 + 2 files changed, 15 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index 147100c27c2d..d7fea6cdf2f9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, coredump->ring->name); } + if (coredump->fault_info.status) { + struct amdgpu_vm_fault_info *fault_info = >fault_info; + + drm_printf(, "\n[%s] Page fault observed for GPU family:%d\n", + fault_info->vmhub ? "mmhub" : "gfxhub", + coredump->adev->family); + drm_printf(, "Faulty page starting at address 0x%016llx\n", + fault_info->addr); + drm_printf(, "Protection fault status register:0x%x\n", + fault_info->status); + } + if (coredump->reset_vram_lost) - drm_printf(, "VRAM is lost due to GPU reset!\n"); + drm_printf(, "\nVRAM is lost due to GPU reset!\n"); if (coredump->adev->reset_info.num_regs) { drm_printf(, "AMDGPU register dumps:\nOffset: Value:\n"); @@ -253,6 +265,7 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, if (job) { s_job = >base; coredump->ring = to_amdgpu_ring(s_job->sched); + coredump->fault_info = job->vm->fault_info; That's illegal. The VM pointer might already be stale at this point. 
I think you need to add the fault info of the last fault globally in the VRAM manager or move this to the process info Shashank is working on. Are you saying that during the reset or otherwise a vm which is part of this job could have been freed and we might have a NULL dereference or invalid reference? Till now based on the resets and pagefaults that i have created till now using the same app which we are using for IH overflow i am able to get the valid vm only. Assuming amdgpu_vm is freed for this job or stale, are you suggesting to update this information in adev-> vm_manager along with existing per vm fault_info or only in vm_manager ? Regards, Christian. } coredump->adev = adev; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h index 60522963aaca..3197955264f9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h @@ -98,6 +98,7 @@ struct amdgpu_coredump_info { struct timespec64 reset_time; bool reset_vram_lost; struct amdgpu_ring *ring; + struct amdgpu_vm_fault_info fault_info; }; #endif
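Christian's objection is a lifetime issue: by the time `amdgpu_devcoredump_read()` runs, the VM that belonged to the job may already be destroyed, so the coredump must never keep a pointer into it. The fix agreed in this thread — record the fault by value in both the per-VM cache and a long-lived device-scoped cache — looks roughly like this in miniature (all names are illustrative stand-ins, not the actual amdgpu structures):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct fault_info {
    uint64_t addr;
    uint32_t status;
};

/* Short-lived object, analogous to amdgpu_vm. */
struct vm {
    struct fault_info fault_info;
};

/* Long-lived, device-scoped object, analogous to adev->vm_manager. */
struct vm_manager {
    struct fault_info last_fault; /* updated at fault time, by value */
};

/* Fault-handler path: record the fault in BOTH caches so the data
 * survives the VM being torn down before the coredump is read. */
static void record_fault(struct vm *vm, struct vm_manager *mgr,
                         uint64_t addr, uint32_t status)
{
    vm->fault_info.addr = addr;
    vm->fault_info.status = status;
    mgr->last_fault = vm->fault_info; /* struct copy, no pointer kept */
}
```

The struct copy is the whole point: the coredump later reads `mgr->last_fault` and never dereferences a possibly-stale `job->vm`.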
Re: [PATCH] drm/amdgpu: add vm fault information to devcoredump
On 3/7/2024 12:51 AM, Deucher, Alexander wrote: [Public] -Original Message- From: Sunil Khatri Sent: Wednesday, March 6, 2024 1:20 PM To: Deucher, Alexander ; Koenig, Christian ; Sharma, Shashank Cc: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org; linux- ker...@vger.kernel.org; Joshi, Mukul ; Paneer Selvam, Arunpravin ; Khatri, Sunil Subject: [PATCH] drm/amdgpu: add vm fault information to devcoredump Add page fault information to the devcoredump. Output of devcoredump: AMDGPU Device Coredump version: 1 kernel: 6.7.0-amd-staging-drm-next module: amdgpu time: 29.725011811 process_name: soft_recovery_p PID: 1720 Ring timed out details IP Type: 0 Ring Name: gfx_0.0.0 [gfxhub] Page fault observed for GPU family:143 Faulty page starting at I think we should add a separate section for the GPU identification information (family, PCI ids, IP versions, etc.). For this patch, I think fine to just print the fault address and status. Noted Regards Sunil Alex address 0x Protection fault status register:0x301031 VRAM is lost due to GPU reset! Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 + 2 files changed, 15 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index 147100c27c2d..d7fea6cdf2f9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, coredump->ring->name); } + if (coredump->fault_info.status) { + struct amdgpu_vm_fault_info *fault_info = fault_info; + + drm_printf(, "\n[%s] Page fault observed for GPU family:%d\n", +fault_info->vmhub ? 
"mmhub" : "gfxhub", +coredump->adev->family); + drm_printf(, "Faulty page starting at address 0x%016llx\n", +fault_info->addr); + drm_printf(, "Protection fault status register:0x%x\n", +fault_info->status); + } + if (coredump->reset_vram_lost) - drm_printf(, "VRAM is lost due to GPU reset!\n"); + drm_printf(, "\nVRAM is lost due to GPU reset!\n"); if (coredump->adev->reset_info.num_regs) { drm_printf(, "AMDGPU register dumps:\nOffset: Value:\n"); @@ -253,6 +265,7 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, if (job) { s_job = >base; coredump->ring = to_amdgpu_ring(s_job->sched); + coredump->fault_info = job->vm->fault_info; } coredump->adev = adev; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h index 60522963aaca..3197955264f9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h @@ -98,6 +98,7 @@ struct amdgpu_coredump_info { struct timespec64 reset_time; boolreset_vram_lost; struct amdgpu_ring *ring; + struct amdgpu_vm_fault_info fault_info; }; #endif -- 2.34.1
RE: [PATCH] drm/amdgpu: cache in more vm fault information
[AMD Official Use Only - General] Ignore this. Triggered wrongly. -Original Message- From: Sunil Khatri Sent: Wednesday, March 6, 2024 11:50 PM To: Deucher, Alexander ; Koenig, Christian ; Sharma, Shashank Cc: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org; linux-ker...@vger.kernel.org; Joshi, Mukul ; Paneer Selvam, Arunpravin ; Khatri, Sunil Subject: [PATCH] drm/amdgpu: cache in more vm fault information When an page fault interrupt is raised there is a lot more information that is useful for developers to analyse the pagefault. Add all such information in the last cached pagefault from an interrupt handler. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 9 +++-- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 7 ++- drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 2 +- drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 2 +- drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c | 2 +- drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 2 +- drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 2 +- 7 files changed, 18 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 4299ce386322..b77e8e28769d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -2905,7 +2905,7 @@ void amdgpu_debugfs_vm_bo_info(struct amdgpu_vm *vm, struct seq_file *m) * Cache the fault info for later use by userspace in debugging. */ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev, - unsigned int pasid, + struct amdgpu_iv_entry *entry, uint64_t addr, uint32_t status, unsigned int vmhub) @@ -2915,7 +2915,7 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev, xa_lock_irqsave(>vm_manager.pasids, flags); - vm = xa_load(>vm_manager.pasids, pasid); + vm = xa_load(>vm_manager.pasids, entry->pasid); /* Don't update the fault cache if status is 0. 
In the multiple * fault case, subsequent faults will return a 0 status which is * useless for userspace and replaces the useful fault status, so @@ -2924,6 +2924,11 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev, if (vm && status) { vm->fault_info.addr = addr; vm->fault_info.status = status; + vm->fault_info.client_id = entry->client_id; + vm->fault_info.src_id = entry->src_id; + vm->fault_info.vmid = entry->vmid; + vm->fault_info.pasid = entry->pasid; + vm->fault_info.ring_id = entry->ring_id; if (AMDGPU_IS_GFXHUB(vmhub)) { vm->fault_info.vmhub = AMDGPU_VMHUB_TYPE_GFX; vm->fault_info.vmhub |= diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h index 047ec1930d12..c7782a89bdb5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h @@ -286,6 +286,11 @@ struct amdgpu_vm_fault_info { uint32_tstatus; /* which vmhub? gfxhub, mmhub, etc. */ unsigned intvmhub; + unsigned intclient_id; + unsigned intsrc_id; + unsigned intring_id; + unsigned intpasid; + unsigned intvmid; }; struct amdgpu_vm { @@ -605,7 +610,7 @@ static inline void amdgpu_vm_eviction_unlock(struct amdgpu_vm *vm) } void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev, - unsigned int pasid, + struct amdgpu_iv_entry *entry, uint64_t addr, uint32_t status, unsigned int vmhub); diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c index d933e19e0cf5..6b177ce8db0e 100644 --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c @@ -150,7 +150,7 @@ static int gmc_v10_0_process_interrupt(struct amdgpu_device *adev, status = RREG32(hub->vm_l2_pro_fault_status); WREG32_P(hub->vm_l2_pro_fault_cntl, 1, ~1); - amdgpu_vm_update_fault_cache(adev, entry->pasid, addr, status, + amdgpu_vm_update_fault_cache(adev, entry, addr, status, entry->vmid_src ? 
AMDGPU_MMHUB0(0) : AMDGPU_GFXHUB(0)); } diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c index 527dc917e049..bcf254856a3e 100644 --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c @@ -121,7 +121,7 @@ static int gmc_v11_0_process_interrupt(struct amdgpu_
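The core of the patch above — widening the fault-cache update to take the whole IV entry while keeping the status!=0 guard — can be sketched standalone. `iv_entry` and `fault_cache` here are simplified stand-ins for `amdgpu_iv_entry` and `amdgpu_vm_fault_info`:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-ins for the interrupt vector entry and the fault cache. */
struct iv_entry {
    unsigned int client_id, src_id, ring_id, vmid, pasid;
};

struct fault_cache {
    uint64_t addr;
    uint32_t status;
    unsigned int client_id, src_id, ring_id, vmid, pasid;
};

/* Mirrors the guard in amdgpu_vm_update_fault_cache(): in a multi-fault
 * burst, later faults report status 0, which must not clobber the useful
 * first status.  Returns 1 if the cache was updated. */
static int update_fault_cache(struct fault_cache *fc,
                              const struct iv_entry *entry,
                              uint64_t addr, uint32_t status)
{
    if (!status)
        return 0;
    fc->addr = addr;
    fc->status = status;
    fc->client_id = entry->client_id;
    fc->src_id = entry->src_id;
    fc->ring_id = entry->ring_id;
    fc->vmid = entry->vmid;
    fc->pasid = entry->pasid;
    return 1;
}
```

Passing the entry rather than just the pasid means future fields (node_id, xcc_id and so on) can be cached without touching every gmc_v*_process_interrupt() caller again.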
Re: [PATCH] drm/amdgpu: cache in more vm fault information
As discussion we decided that we dont need the client id, srcid, pasid etc in page fault information dump. So this patch isnt needed anymore. So dropping this patch and will add the new information in the devcoredump for pagefault which is all available in existing structures. As discussed, we just need to provide faulting address, Fault status register with gpu family to decode the fault along with process information. Regards Sunil Khatri On 3/6/2024 9:56 PM, Khatri, Sunil wrote: On 3/6/2024 9:49 PM, Christian König wrote: Am 06.03.24 um 17:06 schrieb Khatri, Sunil: On 3/6/2024 9:07 PM, Christian König wrote: Am 06.03.24 um 16:13 schrieb Khatri, Sunil: On 3/6/2024 8:34 PM, Christian König wrote: Am 06.03.24 um 15:29 schrieb Alex Deucher: On Wed, Mar 6, 2024 at 8:04 AM Khatri, Sunil wrote: On 3/6/2024 6:12 PM, Christian König wrote: Am 06.03.24 um 11:40 schrieb Khatri, Sunil: On 3/6/2024 3:37 PM, Christian König wrote: Am 06.03.24 um 10:04 schrieb Sunil Khatri: When an page fault interrupt is raised there is a lot more information that is useful for developers to analyse the pagefault. Well actually those information are not that interesting because they are hw generation specific. You should probably rather use the decoded strings here, e.g. hub, client, xcc_id, node_id etc... See gmc_v9_0_process_interrupt() an example. I saw this v9 does provide more information than what v10 and v11 provide like node_id and fault from which die but thats again very specific to IP_VERSION(9, 4, 3)) i dont know why thats information is not there in v10 and v11. I agree to your point but, as of now during a pagefault we are dumping this information which is useful like which client has generated an interrupt and for which src and other information like address. So i think to provide the similar information in the devcoredump. Currently we do not have all this information from either job or vm being derived from the job during a reset. 
We surely could add more relevant information later on as per request but this information is useful as eventually its developers only who would use the dump file provided by customer to debug. Below is the information that i dump in devcore and i feel that is good information but new information could be added which could be picked later. Page fault information [gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773) in page starting at address 0x from client 0x1b (UTCL2) This is a perfect example what I mean. You record in the patch is the client_id, but this is is basically meaningless unless you have access to the AMD internal hw documentation. What you really need is the client in decoded form, in this case UTCL2. You can keep the client_id additionally, but the decoded client string is mandatory to have I think. Sure i am capturing that information as i am trying to minimise the memory interaction to minimum as we are still in interrupt context here that why i recorded the integer information compared to decoding and writing strings there itself but to postpone till we dump. Like decoding to the gfxhub/mmhub based on vmhub/vmid_src and client string from client id. So are we good to go with the information with the above information of sharing details in devcoredump using the additional information from pagefault cached. I think amdgpu_vm_fault_info() has everything you need already (vmhub, status, and addr). client_id and src_id are just tokens in the interrupt cookie so we know which IP to route the interrupt to. We know what they will be because otherwise we'd be in the interrupt handler for a different IP. I don't think ring_id has any useful information in this context and vmid and pasid are probably not too useful because they are just tokens to associate the fault with a process. It would be better to have the process name. 
Just to share context here Alex, i am preparing this for devcoredump, my intention was to replicate the information which in KMD we are sharing in Dmesg for page faults. If assuming we do not add client id specially we would not be able to share enough information in devcoredump. It would be just address and hub(gfxhub/mmhub) and i think that is partial information as src id and client id and ip block shares good information. For process related information we are capturing that information part of dump from existing functionality. AMDGPU Device Coredump version: 1 kernel: 6.7.0-amd-staging-drm-next module: amdgpu time: 45.084775181 process_name: soft_recovery_p PID: 1780 Ring timed out details IP Type: 0 Ring Name: gfx_0.0.0 Page fault information [gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773) in page starting at address 0x from client 0x1b (UTCL2) VRAM is lost due to GPU reset! Regards Sunil The decoded client name would be really useful I think since the fault handled is a catch all and handles a whole
Re: [PATCH] drm/amdgpu: cache in more vm fault information
On 3/6/2024 9:59 PM, Alex Deucher wrote: On Wed, Mar 6, 2024 at 11:21 AM Khatri, Sunil wrote: On 3/6/2024 9:45 PM, Alex Deucher wrote: On Wed, Mar 6, 2024 at 11:06 AM Khatri, Sunil wrote: On 3/6/2024 9:07 PM, Christian König wrote: Am 06.03.24 um 16:13 schrieb Khatri, Sunil: On 3/6/2024 8:34 PM, Christian König wrote: Am 06.03.24 um 15:29 schrieb Alex Deucher: On Wed, Mar 6, 2024 at 8:04 AM Khatri, Sunil wrote: On 3/6/2024 6:12 PM, Christian König wrote: Am 06.03.24 um 11:40 schrieb Khatri, Sunil: On 3/6/2024 3:37 PM, Christian König wrote: Am 06.03.24 um 10:04 schrieb Sunil Khatri: When an page fault interrupt is raised there is a lot more information that is useful for developers to analyse the pagefault. Well actually those information are not that interesting because they are hw generation specific. You should probably rather use the decoded strings here, e.g. hub, client, xcc_id, node_id etc... See gmc_v9_0_process_interrupt() an example. I saw this v9 does provide more information than what v10 and v11 provide like node_id and fault from which die but thats again very specific to IP_VERSION(9, 4, 3)) i dont know why thats information is not there in v10 and v11. I agree to your point but, as of now during a pagefault we are dumping this information which is useful like which client has generated an interrupt and for which src and other information like address. So i think to provide the similar information in the devcoredump. Currently we do not have all this information from either job or vm being derived from the job during a reset. We surely could add more relevant information later on as per request but this information is useful as eventually its developers only who would use the dump file provided by customer to debug. Below is the information that i dump in devcore and i feel that is good information but new information could be added which could be picked later. 
Page fault information [gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773) in page starting at address 0x from client 0x1b (UTCL2) This is a perfect example what I mean. You record in the patch is the client_id, but this is is basically meaningless unless you have access to the AMD internal hw documentation. What you really need is the client in decoded form, in this case UTCL2. You can keep the client_id additionally, but the decoded client string is mandatory to have I think. Sure i am capturing that information as i am trying to minimise the memory interaction to minimum as we are still in interrupt context here that why i recorded the integer information compared to decoding and writing strings there itself but to postpone till we dump. Like decoding to the gfxhub/mmhub based on vmhub/vmid_src and client string from client id. So are we good to go with the information with the above information of sharing details in devcoredump using the additional information from pagefault cached. I think amdgpu_vm_fault_info() has everything you need already (vmhub, status, and addr). client_id and src_id are just tokens in the interrupt cookie so we know which IP to route the interrupt to. We know what they will be because otherwise we'd be in the interrupt handler for a different IP. I don't think ring_id has any useful information in this context and vmid and pasid are probably not too useful because they are just tokens to associate the fault with a process. It would be better to have the process name. Just to share context here Alex, i am preparing this for devcoredump, my intention was to replicate the information which in KMD we are sharing in Dmesg for page faults. If assuming we do not add client id specially we would not be able to share enough information in devcoredump. It would be just address and hub(gfxhub/mmhub) and i think that is partial information as src id and client id and ip block shares good information. 
For process related information we are capturing that information part of dump from existing functionality. AMDGPU Device Coredump version: 1 kernel: 6.7.0-amd-staging-drm-next module: amdgpu time: 45.084775181 process_name: soft_recovery_p PID: 1780 Ring timed out details IP Type: 0 Ring Name: gfx_0.0.0 Page fault information [gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773) in page starting at address 0x from client 0x1b (UTCL2) VRAM is lost due to GPU reset! Regards Sunil The decoded client name would be really useful I think since the fault handled is a catch all and handles a whole bunch of different clients. But that should be ideally passed in as const string instead of the hw generation specific client_id. As long as it's only a pointer we also don't run into the trouble that we need to allocate memory for it. I agree but i prefer adding the client id and decoding it in devcorecump using soc15_ih_clientid_name[fault_info->client_id]) is better else we have to do an sprintf this string to fault_info in
Re: [PATCH] drm/amdgpu: cache in more vm fault information
On 3/6/2024 9:49 PM, Christian König wrote: Am 06.03.24 um 17:06 schrieb Khatri, Sunil: On 3/6/2024 9:07 PM, Christian König wrote: Am 06.03.24 um 16:13 schrieb Khatri, Sunil: On 3/6/2024 8:34 PM, Christian König wrote: Am 06.03.24 um 15:29 schrieb Alex Deucher: On Wed, Mar 6, 2024 at 8:04 AM Khatri, Sunil wrote: On 3/6/2024 6:12 PM, Christian König wrote: Am 06.03.24 um 11:40 schrieb Khatri, Sunil: On 3/6/2024 3:37 PM, Christian König wrote: Am 06.03.24 um 10:04 schrieb Sunil Khatri: When an page fault interrupt is raised there is a lot more information that is useful for developers to analyse the pagefault. Well actually those information are not that interesting because they are hw generation specific. You should probably rather use the decoded strings here, e.g. hub, client, xcc_id, node_id etc... See gmc_v9_0_process_interrupt() an example. I saw this v9 does provide more information than what v10 and v11 provide like node_id and fault from which die but thats again very specific to IP_VERSION(9, 4, 3)) i dont know why thats information is not there in v10 and v11. I agree to your point but, as of now during a pagefault we are dumping this information which is useful like which client has generated an interrupt and for which src and other information like address. So i think to provide the similar information in the devcoredump. Currently we do not have all this information from either job or vm being derived from the job during a reset. We surely could add more relevant information later on as per request but this information is useful as eventually its developers only who would use the dump file provided by customer to debug. Below is the information that i dump in devcore and i feel that is good information but new information could be added which could be picked later. Page fault information [gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773) in page starting at address 0x from client 0x1b (UTCL2) This is a perfect example what I mean. 
You record in the patch is the client_id, but this is is basically meaningless unless you have access to the AMD internal hw documentation. What you really need is the client in decoded form, in this case UTCL2. You can keep the client_id additionally, but the decoded client string is mandatory to have I think. Sure i am capturing that information as i am trying to minimise the memory interaction to minimum as we are still in interrupt context here that why i recorded the integer information compared to decoding and writing strings there itself but to postpone till we dump. Like decoding to the gfxhub/mmhub based on vmhub/vmid_src and client string from client id. So are we good to go with the information with the above information of sharing details in devcoredump using the additional information from pagefault cached. I think amdgpu_vm_fault_info() has everything you need already (vmhub, status, and addr). client_id and src_id are just tokens in the interrupt cookie so we know which IP to route the interrupt to. We know what they will be because otherwise we'd be in the interrupt handler for a different IP. I don't think ring_id has any useful information in this context and vmid and pasid are probably not too useful because they are just tokens to associate the fault with a process. It would be better to have the process name. Just to share context here Alex, i am preparing this for devcoredump, my intention was to replicate the information which in KMD we are sharing in Dmesg for page faults. If assuming we do not add client id specially we would not be able to share enough information in devcoredump. It would be just address and hub(gfxhub/mmhub) and i think that is partial information as src id and client id and ip block shares good information. For process related information we are capturing that information part of dump from existing functionality. 
AMDGPU Device Coredump version: 1 kernel: 6.7.0-amd-staging-drm-next module: amdgpu time: 45.084775181 process_name: soft_recovery_p PID: 1780 Ring timed out details IP Type: 0 Ring Name: gfx_0.0.0 Page fault information [gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773) in page starting at address 0x from client 0x1b (UTCL2) VRAM is lost due to GPU reset! Regards Sunil The decoded client name would be really useful I think since the fault handled is a catch all and handles a whole bunch of different clients. But that should be ideally passed in as const string instead of the hw generation specific client_id. As long as it's only a pointer we also don't run into the trouble that we need to allocate memory for it. I agree but i prefer adding the client id and decoding it in devcorecump using soc15_ih_clientid_name[fault_info->client_id]) is better else we have to do an sprintf this string to fault_info in irq context which is writing more bytes to memory i gu
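Sunil's point about IRQ context — store only the integer client_id at fault time, and decode it to a string when the devcoredump is actually read — in miniature. The table below is a two-entry illustrative excerpt, not the real `soc15_ih_clientid_name[]` (which lives in the kernel and is much larger); the 0x1b → UTCL2 mapping matches the dmesg line quoted in this thread, the 0x00 entry is a placeholder:

```c
#include <assert.h>
#include <string.h>

/* Illustrative excerpt of a client-id -> name table in the spirit of
 * soc15_ih_clientid_name[].  Index 0x00 is a placeholder; 0x1b (UTCL2)
 * matches the fault log discussed above. */
static const char *const clientid_name[] = {
    [0x00] = "IH",
    [0x1b] = "UTCL2",
};

/* Dump-time decode: the IRQ handler only stored the integer id; the
 * string lookup (and any formatting) happens here, outside IRQ context,
 * so no memory is written or allocated in the interrupt handler. */
static const char *decode_client(unsigned int client_id)
{
    if (client_id < sizeof(clientid_name) / sizeof(clientid_name[0]) &&
        clientid_name[client_id])
        return clientid_name[client_id];
    return "UNKNOWN";
}
```

Since the table is `static const`, the dump path hands out a pointer into rodata: no sprintf into the fault-info struct, and nothing to free.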
Re: [PATCH] drm/amdgpu: cache in more vm fault information
On 3/6/2024 9:19 PM, Alex Deucher wrote: On Wed, Mar 6, 2024 at 10:32 AM Alex Deucher wrote: On Wed, Mar 6, 2024 at 10:13 AM Khatri, Sunil wrote: On 3/6/2024 8:34 PM, Christian König wrote: Am 06.03.24 um 15:29 schrieb Alex Deucher: On Wed, Mar 6, 2024 at 8:04 AM Khatri, Sunil wrote: On 3/6/2024 6:12 PM, Christian König wrote: Am 06.03.24 um 11:40 schrieb Khatri, Sunil: On 3/6/2024 3:37 PM, Christian König wrote: Am 06.03.24 um 10:04 schrieb Sunil Khatri: When a page fault interrupt is raised, there is a lot more information that is useful for developers to analyse the page fault. Well, actually that information is not that interesting because it is hw generation specific. You should probably rather use the decoded strings here, e.g. hub, client, xcc_id, node_id etc... See gmc_v9_0_process_interrupt() for an example. I saw this; v9 does provide more information than what v10 and v11 provide, like node_id and which die the fault came from, but that again is very specific to IP_VERSION(9, 4, 3); I don't know why that information is not there in v10 and v11. I agree with your point, but as of now during a page fault we are dumping information which is useful, like which client generated the interrupt, for which src, and other information like the address. So I think we should provide similar information in the devcoredump. Currently we do not have all this information from either the job or the vm derived from the job during a reset. We could surely add more relevant information later on request, but this information is useful, as eventually it is developers who will use the dump file provided by the customer to debug. Below is the information that I dump in devcore, and I feel that is good information; new information could be added and picked up later. Page fault information [gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773) in page starting at address 0x from client 0x1b (UTCL2) This is a perfect example of what I mean.
Just to share context here, Alex: I am preparing this for devcoredump, and my intention was to replicate the information that the KMD already shares in dmesg for page faults. If we do not add the client id specifically, we would not be able to share enough information in devcoredump. It would be just the address and the hub (gfxhub/mmhub), and I think that is partial information, since src id, client id and IP block together share good information. We also need to include the status register value. That contains the important information (type of access, fault type, client, etc.).
Client_id and src_id are only used to route the interrupt to the right software code. E.g., a different client_id and src_id would be a completely different interrupt (e.g., vblank or fence, etc.). For GPU page faults the client_id and src_id will always be the same. The devcoredump should also include information about the GPU itself (e.g., PCI DID/VID, maybe some of the relevant IP versions). We already have "status", which is the register GCVM_L2_PROTECTION_FAULT_STATUS. But the problem here is that this all needs to be captured in interrupt context, which I want to avoid, and these are family-specific calls. The chip family would also be good. And also the VRAM size. If we have a way to identify the chip and we have the vm status register and the vm fault address, we can decode all of the fault information. In this patch I am focusing on page fault specific information only [taking one thing at a time]. But eventually I will be adding more information as per the devcoredump JIRA plan, and will keep this in the todo too for the other information that you
Re: [PATCH] drm/amdgpu: cache in more vm fault information
On 3/6/2024 8:34 PM, Christian König wrote: [earlier client_id discussion snipped; same exchange as quoted above] I agree, but I prefer adding the client id and decoding it in devcoredump using soc15_ih_clientid_name[fault_info->client_id]; otherwise we have to sprintf this string into fault_info in IRQ context, which I guess writes more bytes to memory compared to an integer :) We can argue on values like pasid, vmid and ring id being taken off if they are totally not useful. Regards, Sunil
Re: [PATCH] drm/amdgpu: cache in more vm fault information
On 3/6/2024 6:12 PM, Christian König wrote: [earlier client_id discussion snipped; same exchange as quoted above] Regards, Sunil

When a page fault interrupt is raised, there is a lot more information that is useful for developers to analyse the page fault. Add all such information in the last cached page fault from the interrupt handler.

Signed-off-by: Sunil Khatri
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 9 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 7 ++-
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 2 +-
 7 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 4299ce386322..b77e8e28769d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2905,7 +2905,7 @@ void amdgpu_debugfs_vm_bo_info(struct amdgpu_vm *vm, struct seq_file *m)
  * Cache the fault info for later use by userspace in debugging.
  */
 void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
-				  unsigned int pasid,
+				  struct amdgpu_iv_entry *entry,
 				  uint64_t addr,
 				  uint32_t status,
 				  unsigned int vmhub)
@@ -2915,7 +2915,7 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
 	xa_lock_irqsave(&adev->vm_manager.pasids, flags);
-	vm = xa_load(&adev->vm_manager.pasids, pasid);
+	vm = xa_load(&adev->vm_manager.pasids, entry->pasid);
 	/* Don't update the fault cache if status is 0.  In the multiple
 	 * fault case, subsequent faults will return a 0 status which is
 	 * useless for userspace and replaces the useful fault status, so
@@ -2924,6 +2924,11 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
 	if (vm && status) {
 		vm->fault_info.addr = addr;
 		vm->fault_info.status = status;
+		vm->fault_info.client_id = entry->client_id;
+		vm->fault_info.src_id = entry->src_id;
+		vm->fault_info.vmid = entry->vmid;
+		vm->fault_info.pasid = entry->pasid;
+		vm->fault_info.ring_id = entry->ring_id;
 		if (AMDGPU_IS_GFXHUB(vmhub)) {
 			vm->fault_info.vmhub = AMDGPU_VMHUB_TYPE_GFX;
 			vm->fault_info.vmhub |=
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gp
Re: [PATCH] drm/amdgpu: cache in more vm fault information
On 3/6/2024 3:37 PM, Christian König wrote: [earlier client_id discussion snipped; same exchange as quoted above] Regards Sunil Khatri Regards, Christian.

When a page fault interrupt is raised, there is a lot more information that is useful for developers to analyse the page fault. Add all such information in the last cached page fault from the interrupt handler.

Signed-off-by: Sunil Khatri
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 9 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 7 ++-
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 2 +-
 7 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 4299ce386322..b77e8e28769d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2905,7 +2905,7 @@ void amdgpu_debugfs_vm_bo_info(struct amdgpu_vm *vm, struct seq_file *m)
  * Cache the fault info for later use by userspace in debugging.
  */
 void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
-				  unsigned int pasid,
+				  struct amdgpu_iv_entry *entry,
 				  uint64_t addr,
 				  uint32_t status,
 				  unsigned int vmhub)
@@ -2915,7 +2915,7 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
 	xa_lock_irqsave(&adev->vm_manager.pasids, flags);
-	vm = xa_load(&adev->vm_manager.pasids, pasid);
+	vm = xa_load(&adev->vm_manager.pasids, entry->pasid);
 	/* Don't update the fault cache if status is 0.  In the multiple
 	 * fault case, subsequent faults will return a 0 status which is
 	 * useless for userspace and replaces the useful fault status, so
@@ -2924,6 +2924,11 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
 	if (vm && status) {
 		vm->fault_info.addr = addr;
 		vm->fault_info.status = status;
+		vm->fault_info.client_id = entry->client_id;
+		vm->fault_info.src_id = entry->src_id;
+		vm->fault_info.vmid = entry->vmid;
+		vm->fault_info.pasid = entry->pasid;
+		vm->fault_info.ring_id = entry->ring_id;
 		if (AMDGPU_IS_GFXHUB(vmhub)) {
 			vm->fault_info.vmhub = AMDGPU_VMHUB_TYPE_GFX;
 			vm->fault_info.vmhub |=
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 047ec1930d12..c7782a89bdb5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -286,6 +286,11 @@ struct amdgpu_vm_fault_info {
 	uint32_t status;
 	/* which vmhub? gfxhub, mmhub, etc. */
 	unsigned int vmhub;
+	unsigned int client_id;
+	unsigned int src_id;
+	unsigned int ring_id;
+	unsigned int pasid;
+	unsigned int vmid;
 };

 struct amdgpu_vm {
@@ -605,7 +610,7 @@ static inline void amdgpu_vm_eviction_unlock(struct amdgpu_vm *vm)
 }

 void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
-				  unsigned int pasid,
+				  struct amdgpu_iv_entry *entry,
 				  uint64_t addr,
 				  uint32_t status,
 				  unsigned int vmhub);
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index d933e19e0cf5..6b177ce8db0e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -150,7 +150,7
Re: [PATCH v2] drm/amdgpu: add ring timeout information in devcoredump
On 3/5/2024 6:40 PM, Christian König wrote: Am 05.03.24 um 12:58 schrieb Sunil Khatri:

Add ring timeout related information to the amdgpu devcoredump file for debugging purposes. During the GPU recovery process the registered callback is triggered and adds the debug information to the data file created by the devcoredump framework under the directory /sys/class/devcoredump/devcdX/.

Signed-off-by: Sunil Khatri
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  2 ++
 2 files changed, 17 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index a59364e9b6ed..aa7fed59a0d5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -196,6 +196,13 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
 		   coredump->reset_task_info.process_name,
 		   coredump->reset_task_info.pid);
+	if (coredump->ring_timeout) {
+		drm_printf(&p, "\nRing timed out details\n");
+		drm_printf(&p, "IP Type: %d Ring Name: %s\n",
+			   coredump->ring->funcs->type,
+			   coredump->ring->name);
+	}
+
 	if (coredump->reset_vram_lost)
 		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
 	if (coredump->adev->reset_info.num_regs) {
@@ -220,6 +227,8 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
 {
 	struct amdgpu_coredump_info *coredump;
 	struct drm_device *dev = adev_to_drm(adev);
+	struct amdgpu_job *job = reset_context->job;
+	struct drm_sched_job *s_job;

 	coredump = kzalloc(sizeof(*coredump), GFP_NOWAIT);
@@ -228,6 +237,12 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
 		return;
 	}
+	if (job) {
+		s_job = &job->base;
+		coredump->ring = to_amdgpu_ring(s_job->sched);
+		coredump->ring_timeout = true;
+	}
+
 	coredump->reset_vram_lost = vram_lost;
 	if (reset_context->job && reset_context->job->vm) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index 19899f6b9b2b..6d67001a1057 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -97,6 +97,8 @@ struct amdgpu_coredump_info {
 	struct amdgpu_task_info reset_task_info;
 	struct timespec64 reset_time;
 	bool reset_vram_lost;
+	struct amdgpu_ring *ring;
+	bool ring_timeout;

I think you can drop ring_timeout; just having ring as optional information should be enough. Apart from that it looks pretty good, I think. - A GPU reset can happen for at least two reasons: 1. via sysfs (cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover), where there is no timeout or page fault. In this case we need to know whether a ring timeout happened or not, else it will try to print empty information in devcoredump. The same goes for a page fault: in that case we also need to see whether recovery ran due to a page fault, and only then add that information. So to cover all use cases I added this parameter. Thanks Sunil Regards, Christian.

 };
 #endif
Re: [PATCH] drm/amdgpu: add ring timeout information in devcoredump
On 3/5/2024 2:53 PM, Christian König wrote:
> On 01.03.24 at 13:43, Sunil Khatri wrote:
>> Add ring timeout related information in the amdgpu
>> devcoredump file for debugging purposes.
>>
>> During the gpu recovery process the registered callback
>> is triggered and adds the debug information to the data
>> file created by the devcoredump framework under the
>> directory /sys/class/devcoredump/devcdX/
>>
>> Signed-off-by: Sunil Khatri
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu.h       | 15 +++++++++++++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   | 11 +++++++++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 12 +++++++++++-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  1 +
>>  4 files changed, 38 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 9246bca0a008..9f57c7795c47 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -816,6 +816,17 @@ struct amdgpu_reset_info {
>>  #endif
>>  };
>>
>> +/*
>> + * IP and Queue information during timeout
>> + */
>> +struct amdgpu_ring_timeout_info {
>> +	bool timeout;
>
> What should that be good for?

In case of a page fault or a gpu reset due to other reasons there is no
timeout. In that case we are not adding any information, and we use this
flag while dumping the information.

>> +	char ring_name[32];
>> +	enum amdgpu_ring_type ip_type;
>
> That information should already be available in the core dump.

Will update.

>> +	bool soft_recovery;
>
> That doesn't make sense since we don't do a core dump in case of a
> soft recovery.

Noted, this can be removed.

>> +};
>> +
>> +
>>  /*
>>   * Non-zero (true) if the GPU has VRAM. Zero (false) otherwise.
>>   */
>> @@ -1150,6 +1161,10 @@ struct amdgpu_device {
>>  	bool                            debug_largebar;
>>  	bool                            debug_disable_soft_recovery;
>>  	bool                            debug_use_vram_fw_buf;
>> +
>> +#ifdef CONFIG_DEV_COREDUMP
>> +	struct amdgpu_ring_timeout_info ring_timeout_info;
>> +#endif
>
> Please never store core dump related information in the amdgpu_device
> structure.

Let me see to it. Point taken.

Thanks
Sunil

> Regards,
> Christian.
>
>>  };
>>
>>  static inline uint32_t amdgpu_ip_version(const struct amdgpu_device *adev,
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 71a5cf37b472..e36b7352b2de 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -51,8 +51,19 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>  	memset(&ti, 0, sizeof(struct amdgpu_task_info));
>>  	adev->job_hang = true;
>>
>> +#ifdef CONFIG_DEV_COREDUMP
>> +	/* Update the ring timeout info for coredump */
>> +	adev->ring_timeout_info.timeout = TRUE;
>> +	sprintf(adev->ring_timeout_info.ring_name, s_job->sched->name);
>> +	adev->ring_timeout_info.ip_type = ring->funcs->type;
>> +	adev->ring_timeout_info.soft_recovery = FALSE;
>> +#endif
>> +
>>  	if (amdgpu_gpu_recovery &&
>>  	    amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
>> +#ifdef CONFIG_DEV_COREDUMP
>> +		adev->ring_timeout_info.soft_recovery = TRUE;
>> +#endif
>>  		DRM_ERROR("ring %s timeout, but soft recovered\n",
>>  			  s_job->sched->name);
>>  		goto exit;
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> index 4baa300121d8..d4f892ed105f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> @@ -196,8 +196,16 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
>>  		   coredump->reset_task_info.process_name,
>>  		   coredump->reset_task_info.pid);
>>
>> +	if (coredump->timeout_info.timeout) {
>> +		drm_printf(&p, "\nRing timed out details\n");
>> +		drm_printf(&p, "IP Type: %d Ring Name: %s Soft Recovery: %s\n",
>> +			   coredump->timeout_info.ip_type,
>> +			   coredump->timeout_info.ring_name,
>> +			   coredump->timeout_info.soft_recovery ?
>> 				"Successful" : "Failed");
>> +	}
>> +
>>  	if (coredump->reset_vram_lost)
>> -		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>> +		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
>>
>>  	if (coredump->adev->reset_info.num_regs) {
>>  		drm_printf(&p, "AMDGPU register dumps:\nOffset:     Value:\n");
>>
>> @@ -228,6 +236,7 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
>>  		return;
>>  	}
>>
>> +	coredump->timeout_info = adev->ring_timeout_info;
>>  	coredump->reset_vram_lost = vram_lost;
>>
>>  	if (reset_context->job && reset_context->job->vm)
>> @@ -236,6 +245,7 @@ void amdgpu_coredump(struct amdgpu_device
RE: [PATCH] drm/amdgpu/gmc11: implement get_vbios_fb_size()
[AMD Official Use Only - General]

Acked-by: Sunil Khatri

-----Original Message-----
From: amd-gfx On Behalf Of Alex Deucher
Sent: Thursday, May 11, 2023 8:13 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander
Subject: [PATCH] drm/amdgpu/gmc11: implement get_vbios_fb_size()

Implement get_vbios_fb_size() so we can properly reserve the vbios
splash screen to avoid potential artifacts on the screen during the
transition from the pre-OS console to the OS console.

Signed-off-by: Alex Deucher
---
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
index f73c238f3145..2f570fb5febe 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -31,6 +31,8 @@
 #include "umc_v8_10.h"
 #include "athub/athub_3_0_0_sh_mask.h"
 #include "athub/athub_3_0_0_offset.h"
+#include "dcn/dcn_3_2_0_offset.h"
+#include "dcn/dcn_3_2_0_sh_mask.h"
 #include "oss/osssys_6_0_0_offset.h"
 #include "ivsrcid/vmc/irqsrcs_vmc_1_0.h"
 #include "navi10_enum.h"
@@ -546,7 +548,24 @@ static void gmc_v11_0_get_vm_pte(struct amdgpu_device *adev,

 static unsigned gmc_v11_0_get_vbios_fb_size(struct amdgpu_device *adev)
 {
-	return 0;
+	u32 d1vga_control = RREG32_SOC15(DCE, 0, regD1VGA_CONTROL);
+	unsigned size;
+
+	if (REG_GET_FIELD(d1vga_control, D1VGA_CONTROL, D1VGA_MODE_ENABLE)) {
+		size = AMDGPU_VBIOS_VGA_ALLOCATION;
+	} else {
+		u32 viewport;
+		u32 pitch;
+
+		viewport = RREG32_SOC15(DCE, 0, regHUBP0_DCSURF_PRI_VIEWPORT_DIMENSION);
+		pitch = RREG32_SOC15(DCE, 0, regHUBPREQ0_DCSURF_SURFACE_PITCH);
+		size = (REG_GET_FIELD(viewport,
+				      HUBP0_DCSURF_PRI_VIEWPORT_DIMENSION, PRI_VIEWPORT_HEIGHT) *
+			REG_GET_FIELD(pitch, HUBPREQ0_DCSURF_SURFACE_PITCH, PITCH) *
+			4);
+	}
+
+	return size;
 }

 static const struct amdgpu_gmc_funcs gmc_v11_0_gmc_funcs = {
--
2.40.1
RE: Help debug amdgpu faults
[AMD Official Use Only - General]

Hello Alex, Robert

I too am facing similar issues on chrome. Are there any tools in the
linux environment which can help debug issues like page faults or
kernel panics caused by invalid pointer accesses?

I have used tools like the ramdump parser, which takes the ramdump
after a crash and lets you check a lot of static data in memory; even
the page tables can be checked by walking through them manually. We
used to load the kernel symbols along with the ramdump to go line by
line.

I would appreciate it if you can point to some documents or tools
already used by linux graphics teams, either UMD or KMD drivers, so
the chrome team can also use those to debug issues.

Regards
Sunil Khatri

-----Original Message-----
From: amd-gfx On Behalf Of Alex Deucher
Sent: Tuesday, November 22, 2022 7:42 PM
To: Robert Beckett
Cc: Dmitrii Osipenko ; Adrián Martínez Larumbe ; Koenig, Christian ; amd-gfx@lists.freedesktop.org; Daniel Stone
Subject: Re: Help debug amdgpu faults

On Tue, Nov 22, 2022 at 6:53 AM Robert Beckett wrote:
>
> Hi,
>
> does anyone know any documentation, or can provide advice on debugging
> amdgpu fault reports?

This is a GPU page fault, so it refers to the GPU virtual address space
of the application. Each process (well, fd really) gets its own GPU
virtual address space into which system memory, system mmio space, or
vram can be mapped. The user mode drivers control their GPU virtual
address space.

> e.g:
>
> Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: [gfxhub]
> page fault (src_id:0 ring:8 vmid:1 pasid:32769, for process vkcube pid
> 999 thread vkcube pid 999)

This is the process that caused the fault.

> Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: in page
> starting at address 0x80010070 from client 0x1b (UTCL2)

This is the virtual address that faulted.

> Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:
> GCVM_L2_PROTECTION_FAULT_STATUS:0x00101A10
> Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: Faulty
> UTCL2 client ID: SDMA0 (0xd)

The fault came from the SDMA engine.

> Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:
> MORE_FAULTS: 0x0
> Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:
> WALKER_ERROR: 0x0
> Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:
> PERMISSION_FAULTS: 0x1

The page was not marked as valid in the GPU page table.

> Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:
> MAPPING_ERROR: 0x0
> Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: RW: 0x0

SDMA attempted to read an invalid page.

> see https://gitlab.freedesktop.org/drm/amd/-/issues/2267
> for more context.
>
> We have a complicated setup involving rendering then blitting to
> virtio-gpu exported dmabufs, with plenty of hacks in the mesa and
> xwayland stacks, so we are considering this our issue to debug, and
> not an issue with the driver at this point.
> However, having debugged all the interesting parts leading to these
> faults, I am unable to decode the fault report to correlate to a
> buffer.
>
> in the fault report, what address space is the address from?
> given that the fault handler shifts the reported address up by 12, I
> assume it is a 4K pfn, which makes me assume a physical address. is
> this correct?
> if so, is that a vram pa or a host system memory pa?
> is there any documentation for the other fields reported like the
> fault status etc?

See the comments above.

There is some kernel doc as well:
https://docs.kernel.org/gpu/amdgpu/driver-core.html#amdgpu-virtual-memory

> I'd appreciate any advice you could give to help us debug further.

Some operation you are doing in the user mode driver is reading an
invalid page. Possibly reading past the end of a buffer or something
mis-aligned. Compare the faulting GPU address to the GPU virtual
address space in the application and you should be able to track down
what is happening.

Alex

> Thanks
>
> Bob
RE: [PATCH] drm/amdgpu: enable tmz by default for skyrim
[AMD Official Use Only - General]

@Ernst Sjöstrand
Makes sense. Thanks for the review. Pushed another patch without any
such names.

Regards
Sunil Khatri

From: Ernst Sjöstrand
Sent: Tuesday, May 31, 2022 1:47 AM
To: Khatri, Sunil
Cc: Deucher, Alexander ; amd-gfx mailing list
Subject: Re: [PATCH] drm/amdgpu: enable tmz by default for skyrim

Skyrim is maybe not the best code name ever for a GPU; perhaps don't
include it upstream if it's not official?

Regards //Ernst

On Mon, May 30, 2022 at 20:03, Sunil Khatri <sunil.kha...@amd.com> wrote:

Enable the tmz feature by default for skyrim, i.e. IP GC 10.3.7.

Signed-off-by: Sunil Khatri <sunil.kha...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 798c56214a23..aebc384531ac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -518,6 +518,8 @@ void amdgpu_gmc_tmz_set(struct amdgpu_device *adev)
 	case IP_VERSION(9, 1, 0):
 	/* RENOIR looks like RAVEN */
 	case IP_VERSION(9, 3, 0):
+	/* GC 10.3.7 */
+	case IP_VERSION(10, 3, 7):
 		if (amdgpu_tmz == 0) {
 			adev->gmc.tmz_enabled = false;
 			dev_info(adev->dev,
@@ -540,8 +542,6 @@ void amdgpu_gmc_tmz_set(struct amdgpu_device *adev)
 	case IP_VERSION(10, 3, 1):
 	/* YELLOW_CARP*/
 	case IP_VERSION(10, 3, 3):
-	/* GC 10.3.7 */
-	case IP_VERSION(10, 3, 7):
 		/* Don't enable it by default yet. */
 		if (amdgpu_tmz < 1) {
--
2.25.1