On 4/7/26 18:25, Chenglei Xie wrote:
> During GPU reset, the application could still run CPU page table updates. 
> Each commit called
> amdgpu_device_flush_hdp(), which on SR-IOV sends work through the KIQ ring.
> That can advance sync_seq while the GPU is being reset,
> leaving fence writeback out of sync and causing amdgpu_fence_emit_polling()
> to time out on later KIQ use.
> 
> Fix:
> amdgpu_vm_cpu_commit():
>   Take reset_domain->sem with down_read_trylock() before 
> amdgpu_device_flush_hdp().
>   If the reset path holds the write lock, skip the HDP flush so no 
> HDP-related HW
>   access (including KIQ) runs during reset; state is re-established after 
> reset.
> 
> amdgpu_virt_pre_reset():
>   After stopping the data exchange thread and setting MP1 FLR state, call
>   amdgpu_fence_driver_force_completion() on each initialized KIQ ring so 
> pending
>   fences are signalled and writeback is aligned before reset proceeds.
> 
> Signed-off-by: Chenglei Xie <[email protected]>
> Change-Id: I938bce0cab93a794dbdb02fe3ca9e041f9ac1424
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 11 +++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c | 16 +++++++++++++++-
>  2 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index 6974b1c5b56c2..0127b0d6c7277 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -1188,9 +1188,20 @@ enum amdgpu_sriov_vf_mode 
> amdgpu_virt_get_sriov_vf_mode(struct amdgpu_device *ad
>  
>  void amdgpu_virt_pre_reset(struct amdgpu_device *adev)
>  {
> +     int i;
> +
>       /* stop the data exchange thread */
>       amdgpu_virt_fini_data_exchange(adev);
>       amdgpu_dpm_set_mp1_state(adev, PP_MP1_STATE_FLR);
> +
> +     /* Force completion on KIQ ring fences so pending fences are signalled. 
> */
> +     for (i = 0; i < AMDGPU_MAX_GC_INSTANCES; i++) {
> +             struct amdgpu_ring *ring = &adev->gfx.kiq[i].ring;
> +
> +             if (!ring->fence_drv.initialized)
> +                     continue;
> +             amdgpu_fence_driver_force_completion(ring);

Well that is unrelated and clearly incorrect. The KIQ is re-initialized through 
a reset and should *NEVER* be force signaled.

> +     }
>  }
>  
>  void amdgpu_virt_post_reset(struct amdgpu_device *adev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> index 22e2e5b473415..a9e33b7e87406 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> @@ -21,6 +21,8 @@
>   */
>  
>  #include "amdgpu_vm.h"
> +#include "amdgpu.h"
> +#include "amdgpu_reset.h"
>  #include "amdgpu_object.h"
>  #include "amdgpu_trace.h"
>  
> @@ -108,11 +110,23 @@ static int amdgpu_vm_cpu_update(struct 
> amdgpu_vm_update_params *p,
>  static int amdgpu_vm_cpu_commit(struct amdgpu_vm_update_params *p,
>                               struct dma_fence **fence)
>  {
> +     struct amdgpu_device *adev = p->adev;
> +
>       if (p->needs_flush)
>               atomic64_inc(&p->vm->tlb_seq);
>  
>       mb();
> -     amdgpu_device_flush_hdp(p->adev, NULL);
> +     /*
> +      * While GPU reset holds reset_domain write lock, skip HDP flush 
> entirely so
> +      * no HDP-related HW access runs during reset;
> +      * reset re-establishes consistent state afterward.
> +      */

That comment explains what is done but not why.

Rather use something like this:

/* A reset flushed the HDP anyway, so that here can be skipped when a reset is 
ongoing */

Regards,
Christian.

> +     if (!down_read_trylock(&adev->reset_domain->sem))
> +             return 0;
> +
> +     amdgpu_device_flush_hdp(adev, NULL);
> +     up_read(&adev->reset_domain->sem);
> +
>       return 0;
>  }
>  

Reply via email to