[PATCH v2] drm/amdgpu: fix memleak of ring sched and fence driver

2025-09-07 Thread Lin . Cao
Commit 4220d2c7c41b ("drm/amdgpu: remove is_mes_queue flag") sets ring->adev->rings[ring->idx] to NULL at the end of amdgpu_ring_fini(), which causes amdgpu_fence_driver_sw_fini() to skip drm_sched_fini() and the freeing of fence_drv.fence, leaking memory. Remove set rings[ring->idx] a
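The ordering problem described above can be modelled in a few lines of plain C. This is only a toy sketch, not the driver code: `ring_sketch`, `device_sketch`, `ring_fini_clearing_slot()` and `fence_driver_sw_fini_sketch()` are hypothetical names, and the `sched_live` flag merely stands in for the scheduler and fence allocations that the real cleanup would release.

```c
#include <stddef.h>

#define MAX_RINGS 4   /* hypothetical number of ring slots */

/* Hypothetical stand-in for a ring that owns scheduler/fence resources. */
struct ring_sketch {
    int idx;
    int sched_live;   /* 1 while the pretend resources are still allocated */
};

/* Hypothetical stand-in for the device's ring table. */
struct device_sketch {
    struct ring_sketch *rings[MAX_RINGS];
};

/* Models the problematic step: ring teardown clears the table slot. */
static void ring_fini_clearing_slot(struct device_sketch *dev,
                                    struct ring_sketch *ring)
{
    dev->rings[ring->idx] = NULL;
}

/*
 * Models a later cleanup pass that walks the table and skips empty slots.
 * Returns the number of rings whose resources were released.
 */
static int fence_driver_sw_fini_sketch(struct device_sketch *dev)
{
    int cleaned = 0;

    for (int i = 0; i < MAX_RINGS; i++) {
        struct ring_sketch *ring = dev->rings[i];

        if (!ring)
            continue;          /* slot already cleared: nothing is released */
        ring->sched_live = 0;  /* stands in for drm_sched_fini() and the free */
        cleaned++;
    }
    return cleaned;
}
```

If `ring_fini_clearing_slot()` runs first, the later pass no longer sees the ring and its resources stay allocated, which is the shape of the leak the commit message reports.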

[PATCH] drm/amdgpu: fix memleak of ring sched and fence driver

2025-08-26 Thread Lin . Cao
Commit 4220d2c7c41b ("drm/amdgpu: remove is_mes_queue flag") sets ring->adev->rings[ring->idx] to NULL at the end of amdgpu_ring_fini(), which causes amdgpu_fence_driver_sw_fini() to skip drm_sched_fini() and the freeing of fence_drv.fence, leaking memory. Release these resources at the

[PATCH] drm/scheduler: Fix sched hang when killing app with dependent jobs

2025-07-09 Thread Lin . Cao
When Application A submits jobs (a1, a2, a3) and application B submits job b1 with a dependency on a2's scheduler fence, killing application A before run_job(a1) causes drm_sched_entity_kill_jobs_work() to force signal all jobs sequentially. However, due to missing work_run_job or work_free_job in

[PATCH] drm/amdgpu: put ctx's ref count in amdgpu_ctx_mgr_entity_fini()

2025-06-24 Thread Lin . Cao
Patch daf823f1d0cd ("drm/amdgpu: Remove duplicated "context still alive" check") removed the ctx put, so amdgpu_ctx_fini() is never called; finished fences added by amdgpu_ctx_add_fence() are then never released, causing a memory leak. Signed-off-by: Lin.Cao --- drivers/gpu

[PATCH] drm/amdgpu: Disable cleaner shader in sr-iov multi-vf environment

2025-06-13 Thread Lin . Cao
The cleaner shader causes a function level reset when compute and gfx benchmarks run at the same time in a multi-VF environment. Disable the cleaner shader in multi-VF environments. Signed-off-by: Lin.Cao --- drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion

[PATCH] drm/amdgpu: remove variable vm in function amdgpu_ib_schedule

2025-01-20 Thread Lin . Cao
Use job && job->vm to check whether the IB has a VMID, and use job && job->vmid to check whether the switch buffer should be emitted. Signed-off-by: Lin.Cao --- drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/driv

[PATCH] drm/amdgpu: fix ring timeout issue in gfx10 sr-iov environment

2025-01-14 Thread Lin . Cao
Commit 6e66dc05b54f ("drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare") sets job->vm to NULL if there is no fence, which causes the switch buffer emission to be skipped. Checking job rather than vm solves this problem. Signed-off-by: Lin.Cao --- drivers/gpu/drm/amd/

[PATCH] drm/amdgpu/swsmu: disable force reprogram HW state in SR-IOV environment

2024-09-19 Thread Lin . Cao
SR-IOV does not need to force reprogramming of the HW state on init; that state should be set from the host side. Signed-off-by: Lin.Cao --- drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_

[PATCH] drm/buddy: fix issue that force_merge cannot free all roots

2024-08-13 Thread Lin . Cao
If the buddy manager has more than one root and each root has sub-blocks that need to be freed, then when drm_buddy_fini() is called, the first loop of force_merge will merge and free all of the sub-blocks of the first root, whose offset is 0x0 and whose size is the biggest (more than half of the mm size). In subsequent force_merge

[PATCH] drm/buddy: fix issue that force_merge cannot free all roots

2024-08-07 Thread Lin . Cao
If the buddy manager has more than one root and each root has sub-blocks that need to be freed, then when drm_buddy_fini() is called, the first loop of force_merge will merge and free all of the sub-blocks of the first root, whose offset is 0x0 and whose size is the biggest (more than half of the mm size). In subsequent force_merge

[PATCH] drm/amdgpu: fix failure mapping legacy queue when FLR

2024-05-31 Thread Lin . Cao
Flag "mes.ring.sched.ready" is set to true after mes hw init and to false on mes hw fini to avoid duplicate initialization. But hw fini is not called during a function level reset, so mes hw init is skipped during FLR, which leads to a failure when mapping the legacy queue. Set thi

[PATCH] drm/amdkfd: Check debug trap enable before write dbg_ev_file

2024-05-06 Thread Lin . Cao
In interrupt context, the write to dbg_ev_file is run by a work queue. The write can therefore execute after debug_trap_disable, which causes a NULL pointer access. v2: cancel work "debug_event_workarea" before setting dbg_ev_file to NULL. Signed-off-by: Lin.Cao --- drivers/gpu/drm/amd/amdkf

[PATCH] drm/amdkfd: Check debug trap enable before write dbg_ev_file

2024-04-23 Thread Lin . Cao
In interrupt context, the write to dbg_ev_file is run by a work queue. The write can therefore execute after debug_trap_disable, which causes a NULL pointer access. Signed-off-by: Lin.Cao --- drivers/gpu/drm/amd/amdkfd/kfd_debug.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)

[PATCH] drm/amd/pm set pp_dpm_*clk as read only for SRIOV one VF mode

2024-03-14 Thread Lin . Cao
pp_dpm_*clk should be read only in SR-IOV one VF mode. Remove the S_IWUGO flag and the _store function of these debugfs entries in one VF mode. Signed-off-by: Lin.Cao --- drivers/gpu/drm/amd/pm/amdgpu_pm.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/p

[PATCH v2] drm/amdgpu doorbell range should be set when gpu recovery

2023-10-30 Thread Lin . Cao
The GFX doorbell range should be set after FLR; otherwise it will overlap with MEC. v2: remove the "amdgpu_sriov_vf" and "amdgpu_in_reset" checks, and add grbm select for the case of 2 gfx rings. Signed-off-by: Lin.Cao --- drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 7 +++ 1 file

[PATCH] drm/amdgpu set doorbell range when gpu recovery in sriov environment

2023-10-27 Thread Lin . Cao
The GFX doorbell range should be set after FLR; otherwise it will overlap with MEC. Signed-off-by: Lin.Cao --- drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_

[PATCH] drm/amd check num of link levels when update pcie param

2023-10-19 Thread Lin . Cao
In an SR-IOV environment, the value of pcie_table->num_of_link_levels will be 0, and num_of_levels - 1 will cause an array index out of bounds. Signed-off-by: Lin.Cao --- drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/amd/pm/swsmu/s
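To illustrate why a level count of 0 is dangerous, here is a minimal, standalone C sketch. It is not the patch itself (the diff is truncated above): `pcie_table_sketch`, `MAX_LINK_LEVELS` and `last_link_level()` are hypothetical names, and the count is assumed to be an unsigned type, in which case `0 - 1` wraps to a very large index instead of going negative.

```c
#define MAX_LINK_LEVELS 3   /* hypothetical table capacity */

/* Hypothetical stand-in for the driver's PCIe link-level table. */
struct pcie_table_sketch {
    unsigned int num_of_link_levels;
    unsigned int pcie_gen[MAX_LINK_LEVELS];
};

/*
 * Return the index of the highest populated link level, or -1 when the
 * table is empty. Checking the count first avoids computing 0 - 1 on an
 * unsigned value and then indexing far outside the array.
 */
static int last_link_level(const struct pcie_table_sketch *table)
{
    if (table->num_of_link_levels == 0)
        return -1;
    return (int)table->num_of_link_levels - 1;
}
```

A caller would test for the -1 result and skip the table update, rather than indexing `pcie_gen[]` with the wrapped value.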

[PATCH] drm/amdgpu remove restriction of sriov max_pfn on Vega10

2023-10-17 Thread Lin . Cao
Remove the restriction on the SR-IOV max_pfn so that TBA and TMA can move to the high 47-bit address range. Regression test: change the range alloc flag of libdrm to AMDGPU_VA_RANGE_HIGH; no FLR occurs when running amdgpu_test of drm. Signed-off-by: Lin.Cao --- drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 7 ++

[PATCH] drm/amdgpu: save VCN instances init info before jpeg init

2023-10-09 Thread Lin . Cao
The JPEG init header overwrites the VCN init header info, which loses some debug information. Signed-off-by: Lin.Cao --- drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c i

[PATCH] drm/amdgpu: Return -EINVAL when MMSCH init status incorrect

2023-10-08 Thread Lin . Cao
Return -EINVAL when MMSCH init fails so that amdgpu_device_reset_sriov() can handle it correctly. Signed-off-by: Lin.Cao --- drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c b/drivers/gpu

[PATCH] drm/amdkfd: update struct pm4_mes_runlist. Struct pm4_mes_runlist in amdgpu conflicts with the spec. Add the last dword of the spec's design to struct pm4_mes_runlist

2023-09-06 Thread Lin . Cao
Signed-off-by: Lin.Cao --- drivers/gpu/drm/amd/amdkfd/kfd_pm4_headers_ai.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_pm4_headers_ai.h b/drivers/gpu/drm/amd/amdkfd/kfd_pm4_headers_ai.h index 8b6b2bd5c148..ed937f70895c 100644 --- a/drivers/gpu/d

[PATCH] SWDEV-420310 - struct pm4_mes_runlist in amdgpu conflicts with spec. struct pm4_mes_runlist differs from the mes pm4 packet nv10 spec. Modification: add last dword of the design of spec into

2023-09-05 Thread Lin . Cao
Signed-off-by: Lin.Cao Change-Id: I1322c010d1428b2c1df5080b72da94e90cf17fec --- drivers/gpu/drm/amd/amdkfd/kfd_pm4_headers_ai.h | 12 1 file changed, 12 insertions(+) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_pm4_headers_ai.h b/drivers/gpu/drm/amd/amdkfd/kfd_pm4_headers_ai.h inde

[PATCH] drm/amdgpu: Fix vram recover doesn't work after whole GPU reset

2023-05-04 Thread Lin . Cao
v1: vmbo->shadow is used to back up the VRAM bo when VRAM is lost, so we should set shadow to vmbo->shadow to recover vmbo->bo. v2: Change if(vmbo->shadow) shadow = vmbo->shadow to if(!vmbo->shadow) continue; Fix: 'commit e18aaea733da ("drm/amdgpu: move shadow_list to amdgpu_bo_vm")' Signed-off-by: L

[PATCH] drm/amdgpu: Recover vram from vmbo->shadow rather than vmbo->bo

2023-04-26 Thread Lin . Cao
vmbo->shadow is used to back up the VRAM bo when VRAM is lost, so we should set shadow to vmbo->shadow to recover vmbo->bo. Fix: 'commit e18aaea733da ("drm/amdgpu: move shadow_list to amdgpu_bo_vm")' Signed-off-by: Lin.Cao --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 +++- 1 file change

[PATCH] drm/amdgpu: Call trace info was found in dmesg when loading amdgpu

2022-07-13 Thread lin cao
In the case of SR-IOV, the register smnMp1_PMI_3_FIFO returns an invalid value, which causes a "shift out of bounds". On Ubuntu 22.04, this condition is checked and a related call trace is reported in dmesg. Signed-off-by: lin cao --- drivers/gpu/drm/amd/pm/s
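For readers unfamiliar with this class of warning: in C, shifting a 32-bit value by 32 or more bits is undefined behaviour, and a shift amount derived from an unvalidated register read can trigger it. The following is a generic sketch of a guarded shift, not the fix from this patch (the diff is truncated above); `safe_shl32()` is a hypothetical helper name.

```c
#include <stdint.h>

/*
 * Shift a 32-bit value left by an amount that may come from an untrusted
 * source such as a hardware register. Out-of-range shift counts return 0
 * instead of invoking undefined behaviour.
 */
static uint32_t safe_shl32(uint32_t value, uint32_t shift)
{
    if (shift >= 32)
        return 0;
    return value << shift;
}
```

Whether returning 0 is the right fallback depends on the caller; a driver may prefer to skip the computation entirely or report an error instead.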