amdgpu_sriov_vf returns 0x0 or 0x4 to indicate whether the device is an SR-IOV VF,
but F32_POLL_ENABLE needs 0x0 or 0x1 to determine whether polling is enabled.
Writing 0x4 into F32_POLL_ENABLE stops SDMA0_GFX_RB_WPTR_POLL_CNTL from working.
Change-Id: I7d13ed35469ebd7bdf10c90341181977c6cfd38d
Signed-off-by: Wentao Lou
---
drivers/g
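The mismatch described above, a masked bit (0x4) written into a one-bit field, can be sketched in plain C. This is only an illustration: the `sketch_*` names, the capability bit, and the field position at bit 2 are assumptions and are not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the VF capability is bit 2 (0x4) of a caps word, which
 * matches the 0x0 / 0x4 values mentioned in the commit message. */
#define SKETCH_CAPS_IS_VF (1u << 2)

/* Mimics a helper that returns the masked bit, not a boolean. */
static uint32_t sketch_sriov_vf(uint32_t caps)
{
    return caps & SKETCH_CAPS_IS_VF;
}

/* Hypothetical one-bit register field located at bit 'shift'. */
static uint32_t sketch_set_field_1bit(uint32_t reg, unsigned shift,
                                      uint32_t value)
{
    uint32_t mask = 1u << shift;

    /* Only the lowest bit of 'value' survives the mask. */
    return (reg & ~mask) | ((value << shift) & mask);
}

/* Normalizes the masked bit to 0 or 1 before writing the field. */
static uint32_t sketch_poll_enable(uint32_t reg, uint32_t caps)
{
    /* !! collapses any nonzero value to exactly 1. */
    return sketch_set_field_1bit(reg, 2, !!sketch_sriov_vf(caps));
}
```

Passing the raw 0x4 to `sketch_set_field_1bit` leaves the field cleared, because bit 0 of 0x4 is zero; normalizing with `!!` first sets it as intended.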
amdgpu_bo_destroy had a bug: it called amdgpu_bo_unref outside mutex_lock.
If amdgpu_device_recover_vram ran between amdgpu_bo_unref and
list_del_init,
it would read a NULL shadow->parent, causing a call trace and a failed
GPU reset.
Change-Id: I41d7b54605e613e87ee03c3ad89c191063c19230
Sign
amdgpu_bo_create_shadow adds the shadow to shadow_list
while shadow->tbo.mem is not yet fully configured.
tbo.mem is only fully configured by amdgpu_vm_sdma_map_table once
amdgpu_vm_clear_bo is called.
If an SR-IOV TDR occurs between amdgpu_bo_create_shadow and
amdgpu_vm_sdma_map_table,
am
amdgpu_bo_restore_shadow assigns zero to r on success,
so r remains zero if there is only one node in shadow_list.
The current code always returns failure when r <= 0.
Restarting the timeout for each wait was a rather problematic bug as well.
The value of tmo SHOULD be changed, otherwise we
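A minimal sketch of the two points above: a return value of zero means success, and the timeout budget is carried across waits instead of being restarted. The helper names and the simulated per-fence cost are assumptions for illustration; this is not the driver code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for a fence wait. 'cost' is how long this fence takes.
 * Returns the time left (> 0) on success and 0 on timeout. */
static long sketch_wait_timeout(long cost, long tmo)
{
    return cost < tmo ? tmo - cost : 0;
}

/* Returns 0 when all n waits finish within one shared budget,
 * otherwise -ETIMEDOUT. */
static int sketch_recover_all(const long *cost, size_t n, long budget)
{
    long tmo = budget; /* shared across waits, never reset */
    int r = 0;         /* stays 0 on success, so 0 is not a failure */
    size_t i;

    for (i = 0; i < n; i++) {
        tmo = sketch_wait_timeout(cost[i], tmo);
        if (tmo == 0) {
            r = -ETIMEDOUT;
            break;
        }
    }
    return r;
}
```

With a single entry the function returns 0, which a caller must accept as success. Three waits of 40 units against a budget of 100 time out, whereas a budget restarted for each wait would have let all three pass.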
amdgpu_bo_restore_shadow assigns zero to r on success,
so r remains zero if there is only one node in shadow_list.
The current code always returns failure when r <= 0.
Change-Id: Iae6880e7c78b71fde6a6754c69665c2e312a80a5
Signed-off-by: Wentao Lou
---
drivers/gpu/drm/amd/amdgpu/amdgpu_d
Add amdgpu_amdkfd_pre_reset and amdgpu_amdkfd_post_reset inside
amdgpu_device_reset_sriov.
Change-Id: Icf2839f0b620ce9d47d6414b6c32b9d06672f2ac
Signed-off-by: Wentao Lou
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/amd/amd
Calling gpu_recover for SR-IOV inside xgpu_ai_mailbox_flr_work causes a duplicate
recovery during TDR.
TDR's gpu_recover is already triggered by amdgpu_job_timedout;
relying on that path avoids vk-cts failures caused by an unexpected recovery.
Change-Id: I840dfc145e4e1be9ece6eac8d9f3501da9b28ebf
Signed-off-by: wentalou
--
SR-IOV needs to restrict max_pfn to below AMDGPU_GMC_HOLE.
Accessing the hole results in a range fault interrupt, IIRC.
Change-Id: I0add197a24a54388a128a545056e9a9f0330abfb
Signed-off-by: Wentao Lou
---
drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c | 3 +--
drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 6 +-
2
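The restriction on max_pfn can be sketched as a simple clamp. The constant values below (a 4 KiB GPU page and a hole starting at 2^47) are assumptions used for illustration, and the `SKETCH_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define SKETCH_GPU_PAGE_SHIFT 12
#define SKETCH_GMC_HOLE_START 0x0000800000000000ULL

/* Clamps a page-frame count so that (max_pfn << page shift) never
 * reaches past the start of the address hole. */
static uint64_t sketch_clamp_max_pfn(uint64_t max_pfn)
{
    uint64_t limit = SKETCH_GMC_HOLE_START >> SKETCH_GPU_PAGE_SHIFT;

    return max_pfn < limit ? max_pfn : limit;
}
```

Values already below the limit pass through unchanged, so the clamp only affects configurations whose address space would otherwise extend into the hole.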
Calling gpu_recover for SR-IOV inside xgpu_ai_mailbox_flr_work causes a duplicate
recovery during TDR.
TDR's gpu_recover is already triggered by amdgpu_job_timedout;
relying on that path avoids vk-cts failures caused by an unexpected recovery.
Change-Id: Ifcba4ac43a0229ae19061aad3b0ddc96957ff9c6
Signed-off-by: wentalou
--
Since vm_size was enlarged to 0x4 GB,
SR-IOV needs to put the CSA below AMDGPU_GMC_HOLE.
Otherwise amdgpu_vm_alloc_pts would receive an saddr inside AMDGPU_GMC_HOLE,
resulting in a range fault interrupt, IIRC.
Change-Id: I405a25a01d949f3130889b346f71bedad8ebcae7
Signed-off-by: Wenta Lou
---
drivers/gpu/drm/amd/a
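One way to picture placing the CSA below the hole is to cap the top of the usable range at the hole start before subtracting the reserved size. The constants and the 128 KiB reserved size are assumptions for illustration; the function name is hypothetical and this is not the patch itself.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define SKETCH_GPU_PAGE_SHIFT    12
#define SKETCH_GMC_HOLE_START    0x0000800000000000ULL
#define SKETCH_CSA_RESERVED_SIZE 0x20000ULL /* hypothetical 128 KiB */

/* Returns a CSA start address that ends at or below the hole start. */
static uint64_t sketch_csa_vaddr(uint64_t max_pfn)
{
    uint64_t top;

    /* Precondition: the shift below must not overflow 64 bits. */
    assert(max_pfn <= (UINT64_MAX >> SKETCH_GPU_PAGE_SHIFT));

    top = max_pfn << SKETCH_GPU_PAGE_SHIFT;
    if (top > SKETCH_GMC_HOLE_START)
        top = SKETCH_GMC_HOLE_START;

    /* Precondition: the range must be large enough to hold the CSA. */
    assert(top >= SKETCH_CSA_RESERVED_SIZE);
    return top - SKETCH_CSA_RESERVED_SIZE;
}
```

For a small VM the result is simply the top of the range minus the reserved size; for a VM whose range extends past the hole start, the CSA is pulled down to sit just under the hole.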
The SR-IOV guest driver fails to load
if amdgpu_asic_reset is called in amdgpu_device_init.
SR-IOV should skip asic_reset in device_init.
Change-Id: I6c03b2fcdbf29200fab09459bbffd87726047908
Signed-off-by: Wentao Lou
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
1 file changed, 1 ins
After removing unnecessary VM size calculations,
vm_manager.max_pfn would reach 0x10,,
max_pfn << AMDGPU_GPU_PAGE_SHIFT exceeding AMDGPU_GMC_HOLE_START
would cause a GPU reset.
Change-Id: I47ad0be2b0bd9fb7490c4e1d7bb7bdacf71132cb
Signed-off-by: wentalou
---
drivers/gpu/drm/amd/
Distinguish ip_reinit_early_sriov from ip_reinit_late_sriov
by using different log tags, RE-INIT-early and RE-INIT-late.
Change-Id: If4dd78cb807790e9f8daffb04d893cc7fd2b0e60
Signed-off-by: Wentao Lou
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff
When two rings time out at the same time, each triggers job_timedout separately.
Each job_timedout calls gpu_recover, but one gpu_recover is blocked by
the other's mutex_lock.
The bad job’s callback should be removed by dma_fence_remove_callback, but that call is blocked
inside mutex_lock.
So dma_fence_remove_callback could
psp_ring_destroy inside psp_load_fw leaves psp->km_ring.ring_mem NULL,
so a call trace occurs in psp_cmd_submit.
psp_ring_stop should be used instead.
Change-Id: Ib332004b3b9edc9e002adc532b2d45cdad929b05
Signed-off-by: Wentao Lou
---
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 2 +-
1 file changed, 1 ins
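The difference between stopping and destroying a ring can be sketched with a toy model: stopping keeps the backing memory so the ring can be restarted, while destroying frees it and leaves later submissions with nothing to write to. All `sketch_*` names are hypothetical, and the model returns an error where the real driver hit a NULL pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct sketch_ring {
    unsigned char *ring_mem; /* backing memory for commands */
    size_t size;
    bool running;
};

static int sketch_ring_create(struct sketch_ring *ring, size_t size)
{
    ring->ring_mem = calloc(1, size);
    if (!ring->ring_mem)
        return -1;
    ring->size = size;
    ring->running = true;
    return 0;
}

/* Stop: the ring goes idle but keeps its memory. */
static void sketch_ring_stop(struct sketch_ring *ring)
{
    ring->running = false;
}

/* Destroy: the ring goes idle and its memory is released. */
static void sketch_ring_destroy(struct sketch_ring *ring)
{
    ring->running = false;
    free(ring->ring_mem);
    ring->ring_mem = NULL;
    ring->size = 0;
}

/* Restart only works if the backing memory still exists. */
static int sketch_ring_restart(struct sketch_ring *ring)
{
    if (!ring->ring_mem)
        return -1;
    ring->running = true;
    return 0;
}

/* Submit writes into ring memory; fails instead of dereferencing NULL. */
static int sketch_cmd_submit(struct sketch_ring *ring)
{
    if (!ring->running || !ring->ring_mem)
        return -1;
    ring->ring_mem[0] = 1;
    return 0;
}
```

After a stop, restart and submit both succeed; after a destroy, both fail because `ring_mem` is NULL, which mirrors the failure described above.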
The XGMI hive change put kfd_pre_reset into amdgpu_device_lock_adev,
but outside req_full_gpu for SR-IOV.
This makes SR-IOV hang during reset.
Change-Id: I5b3e2a42c77b3b9635419df4470d021df7be34d1
Signed-off-by: Wentao Lou
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++
1 file changed, 6 ins
amdgpu_ring_soft_recovery triggers a call trace
when s_fence->parent is NULL inside amdgpu_job_timedout.
Check the fence first, as drm_sched_hw_job_reset does.
Change-Id: Ibb062e36feb4e2522a59641fe0d2d76b9773cda7
Signed-off-by: Wentao Lou
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +-
1 file
amdgpu_ring_soft_recovery triggers a call trace
when s_job->s_fence->parent is NULL inside amdgpu_job_timedout.
Check the parent first, as drm_sched_hw_job_reset does.
Change-Id: I0b674ffd96afd44bcefe37a66fb157b1dbba61a0
Signed-off-by: Wentao Lou
---
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
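The guard described above amounts to checking the parent fence pointer before attempting soft recovery. A minimal sketch with simplified stand-in structures (the `sketch_*` types and function are hypothetical and far smaller than the real scheduler types):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the hardware fence, scheduler fence, and job. */
struct sketch_fence {
    int signaled;
};

struct sketch_sched_fence {
    struct sketch_fence *parent; /* NULL until the job reaches hardware */
};

struct sketch_job {
    struct sketch_sched_fence *s_fence;
};

/* Returns 1 if soft recovery was attempted, 0 if it was skipped
 * because there is no parent fence to recover. */
static int sketch_try_soft_recovery(const struct sketch_job *job)
{
    if (!job || !job->s_fence || !job->s_fence->parent)
        return 0;

    /* The real recovery logic would go here and use the parent fence. */
    return 1;
}
```

A job whose scheduler fence has no parent is skipped rather than dereferenced, which is the behavior the commit message asks for.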
KIQ in a VF’s init can be delayed by another VF’s reset,
which occasionally causes late_init to fail.
Enlarging MAX_KIQ_REG_TRY from 20 to 80 fixes this issue.
Change-Id: Iac680af3cbd6afe4f8e408785f0795e1b23dba83
Signed-off-by: wentalou
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 +-
1 file
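The effect of a larger retry limit can be sketched with a bounded polling loop. The callback type, the helper names, and the simulated delay of 50 polls are assumptions for illustration; the real driver also sleeps between attempts, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_OLD_KIQ_REG_TRY 20
#define SKETCH_MAX_KIQ_REG_TRY 80

/* Poll callback: returns nonzero once the operation has completed. */
typedef int (*sketch_poll_fn)(void *ctx);

/* Polls up to max_try times. Returns the number of attempts used on
 * success, or -1 if the limit was reached first. */
static int sketch_poll_with_retry(sketch_poll_fn poll, void *ctx, int max_try)
{
    int i;

    for (i = 0; i < max_try; i++) {
        if (poll(ctx))
            return i + 1;
    }
    return -1;
}

/* Simulated device: reports "not ready" *remaining times, then ready. */
static int sketch_ready_after(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return 0;
    }
    return 1;
}
```

If another VF's reset delays completion by 50 polls, a limit of 20 gives up while a limit of 80 succeeds on the 51st attempt.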
SWDEV-171843: KIQ in a VF’s init can be delayed by another VF’s reset.
late_init occasionally fails if it overlaps with another VF’s reset.
Enlarging MAX_KIQ_REG_TRY from 20 to 80 fixes this issue.
Change-Id: I841774bdd9ebf125c5aa2046b1dcebd65e07
Signed-off-by: wentalou
---
drivers/gpu/drm/amd