On 4/8/26 15:44, Chenglei Xie wrote:
> During GPU reset, the application could still run CPU page table updates.
> Each commit called amdgpu_device_flush_hdp(), which on SR-IOV sends work
> through the KIQ ring. That can advance sync_seq while the GPU is being
> reset, leaving fence writeback out of sync and causing
> amdgpu_fence_emit_polling() to time out on later KIQ use.
>
> Fix:
> amdgpu_vm_cpu_commit():
> Reset will flush HDP anyway, so the HDP flush in amdgpu_vm_cpu_commit()
> can be skipped when a reset is ongoing.
> Take reset_domain->sem with down_read_trylock() before
> amdgpu_device_flush_hdp().
> If the reset path holds the write lock, skip the HDP flush so no
> HDP-related HW access (including KIQ) runs during reset; state is
> re-established after reset.
>
> Signed-off-by: Chenglei Xie <[email protected]>
> Change-Id: I938bce0cab93a794dbdb02fe3ca9e041f9ac1424

Reviewed-by: Christian König <[email protected]>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> index 22e2e5b473415..f078db3fef79e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> @@ -21,6 +21,8 @@
>   */
>
>  #include "amdgpu_vm.h"
> +#include "amdgpu.h"
> +#include "amdgpu_reset.h"
>  #include "amdgpu_object.h"
>  #include "amdgpu_trace.h"
>
> @@ -108,11 +110,19 @@ static int amdgpu_vm_cpu_update(struct amdgpu_vm_update_params *p,
>  static int amdgpu_vm_cpu_commit(struct amdgpu_vm_update_params *p,
>  				struct dma_fence **fence)
>  {
> +	struct amdgpu_device *adev = p->adev;
> +
>  	if (p->needs_flush)
>  		atomic64_inc(&p->vm->tlb_seq);
>
>  	mb();
> -	amdgpu_device_flush_hdp(p->adev, NULL);
> +	/* A reset flushed the HDP anyway, so that here can be skipped when a reset is ongoing */
> +	if (!down_read_trylock(&adev->reset_domain->sem))
> +		return 0;
> +
> +	amdgpu_device_flush_hdp(adev, NULL);
> +	up_read(&adev->reset_domain->sem);
> +
>  	return 0;
>  }
>
