On 4/8/26 15:44, Chenglei Xie wrote:
> During GPU reset, the application could still run CPU page table updates.
> Each commit called amdgpu_device_flush_hdp(), which on SR-IOV sends work
> through the KIQ ring. That can advance sync_seq while the GPU is being
> reset, leaving fence writeback out of sync and causing
> amdgpu_fence_emit_polling() to time out on later KIQ use.
> 
> Fix:
> amdgpu_vm_cpu_commit():
>   A reset flushes HDP anyway, so the HDP flush in amdgpu_vm_cpu_commit()
>   can be skipped when a reset is ongoing.
>   Take reset_domain->sem with down_read_trylock() before
>   amdgpu_device_flush_hdp().
>   If the reset path holds the write lock, skip the HDP flush so no
>   HDP-related HW access (including KIQ) runs during reset; state is
>   re-established after reset.
> 
> Signed-off-by: Chenglei Xie <[email protected]>
> Change-Id: I938bce0cab93a794dbdb02fe3ca9e041f9ac1424

Reviewed-by: Christian König <[email protected]>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> index 22e2e5b473415..f078db3fef79e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> @@ -21,6 +21,8 @@
>   */
>  
>  #include "amdgpu_vm.h"
> +#include "amdgpu.h"
> +#include "amdgpu_reset.h"
>  #include "amdgpu_object.h"
>  #include "amdgpu_trace.h"
>  
> @@ -108,11 +110,19 @@ static int amdgpu_vm_cpu_update(struct amdgpu_vm_update_params *p,
>  static int amdgpu_vm_cpu_commit(struct amdgpu_vm_update_params *p,
>                               struct dma_fence **fence)
>  {
> +     struct amdgpu_device *adev = p->adev;
> +
>       if (p->needs_flush)
>               atomic64_inc(&p->vm->tlb_seq);
>  
>       mb();
> -     amdgpu_device_flush_hdp(p->adev, NULL);
> +     /* A reset flushed the HDP anyway, so that here can be skipped when a reset is ongoing */
> +     if (!down_read_trylock(&adev->reset_domain->sem))
> +             return 0;
> +
> +     amdgpu_device_flush_hdp(adev, NULL);
> +     up_read(&adev->reset_domain->sem);
> +
>       return 0;
>  }
>  
