[AMD Official Use Only - AMD Internal Distribution Only]

Hi @Koenig, Christian,

>  void amdgpu_virt_pre_reset(struct amdgpu_device *adev)  {
> +     int i;
> +
>       /* stop the data exchange thread */
>       amdgpu_virt_fini_data_exchange(adev);
>       amdgpu_dpm_set_mp1_state(adev, PP_MP1_STATE_FLR);
> +
> +     /* Force completion on KIQ ring fences so pending fences are signalled. 
> */
> +     for (i = 0; i < AMDGPU_MAX_GC_INSTANCES; i++) {
> +             struct amdgpu_ring *ring = &adev->gfx.kiq[i].ring;
> +
> +             if (!ring->fence_drv.initialized)
> +                     continue;
> +             amdgpu_fence_driver_force_completion(ring);

>>Well that is unrelated and clearly incorrect. The KIQ is re-initialized 
>>through a reset and should *NEVER* be force signaled.

HW KIQ is re-inited after reset, but that path does not reinitialize or reset 
ring->fence_drv.
amdgpu_fence_driver_force_completion() is SW-only: pending dma_fences get 
-ECANCELED, writeback is set to sync_seq, and fences are signaled so 
bookkeeping and waiters reflect that the queue’s prior work will not complete 
on HW. Without that, seq/writeback can stay wrong and 
amdgpu_fence_emit_polling() can time out on later KIQ use.

The generic loop in amdgpu_device_pre_asic_reset() only force-completes 
scheduler-ready rings, so it skips KIQ. This VF path covers KIQ unless we add 
equivalent logic elsewhere. So we still need this SW cleanup to avoid KIQ fence 
state mismatch after reset.

Thanks,
Chenglei

-----Original Message-----
From: Koenig, Christian <[email protected]>
Sent: Tuesday, April 7, 2026 1:07 PM
To: Xie, Chenglei <[email protected]>; [email protected]
Cc: Lazar, Lijo <[email protected]>; Kuehling, Felix <[email protected]>; 
Deucher, Alexander <[email protected]>; Chan, Hing Pong 
<[email protected]>; Luo, Zhigang <[email protected]>; Kasiviswanathan, 
Harish <[email protected]>; Zhao, Victor <[email protected]>; 
Yat Sin, David <[email protected]>; Dhinakararam, Lokesh 
<[email protected]>
Subject: Re: [[PATCH v6]] drm/amdgpu: gate VM CPU HDP flush on reset lock; 
force-complete KIQ before VF reset

On 4/7/26 18:25, Chenglei Xie wrote:
> During GPU reset, the application could still run CPU page table
> updates. Each commit called amdgpu_device_flush_hdp(), which on SR-IOV sends 
> work through the KIQ ring.
> That can advance sync_seq while the GPU is being reset, leaving fence
> writeback out of sync and causing amdgpu_fence_emit_polling() to time
> out on later KIQ use.
>
> Fix:
> amdgpu_vm_cpu_commit():
>   Take reset_domain->sem with down_read_trylock() before 
> amdgpu_device_flush_hdp().
>   If the reset path holds the write lock, skip the HDP flush so no 
> HDP-related HW
>   access (including KIQ) runs during reset; state is re-established after 
> reset.
>
> amdgpu_virt_pre_reset():
>   After stopping the data exchange thread and setting MP1 FLR state, call
>   amdgpu_fence_driver_force_completion() on each initialized KIQ ring so 
> pending
>   fences are signalled and writeback is aligned before reset proceeds.
>
> Signed-off-by: Chenglei Xie <[email protected]>
> Change-Id: I938bce0cab93a794dbdb02fe3ca9e041f9ac1424
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 11 +++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c | 16 +++++++++++++++-
>  2 files changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index 6974b1c5b56c2..0127b0d6c7277 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -1188,9 +1188,20 @@ enum amdgpu_sriov_vf_mode
> amdgpu_virt_get_sriov_vf_mode(struct amdgpu_device *ad
>
>  void amdgpu_virt_pre_reset(struct amdgpu_device *adev)  {
> +     int i;
> +
>       /* stop the data exchange thread */
>       amdgpu_virt_fini_data_exchange(adev);
>       amdgpu_dpm_set_mp1_state(adev, PP_MP1_STATE_FLR);
> +
> +     /* Force completion on KIQ ring fences so pending fences are signalled. 
> */
> +     for (i = 0; i < AMDGPU_MAX_GC_INSTANCES; i++) {
> +             struct amdgpu_ring *ring = &adev->gfx.kiq[i].ring;
> +
> +             if (!ring->fence_drv.initialized)
> +                     continue;
> +             amdgpu_fence_driver_force_completion(ring);

Well that is unrelated and clearly incorrect. The KIQ is re-initialized through 
a reset and should *NEVER* be force signaled.

> +     }
>  }
>
>  void amdgpu_virt_post_reset(struct amdgpu_device *adev) diff --git
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> index 22e2e5b473415..a9e33b7e87406 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> @@ -21,6 +21,8 @@
>   */
>
>  #include "amdgpu_vm.h"
> +#include "amdgpu.h"
> +#include "amdgpu_reset.h"
>  #include "amdgpu_object.h"
>  #include "amdgpu_trace.h"
>
> @@ -108,11 +110,23 @@ static int amdgpu_vm_cpu_update(struct
> amdgpu_vm_update_params *p,  static int amdgpu_vm_cpu_commit(struct 
> amdgpu_vm_update_params *p,
>                               struct dma_fence **fence)
>  {
> +     struct amdgpu_device *adev = p->adev;
> +
>       if (p->needs_flush)
>               atomic64_inc(&p->vm->tlb_seq);
>
>       mb();
> -     amdgpu_device_flush_hdp(p->adev, NULL);
> +     /*
> +      * While GPU reset holds reset_domain write lock, skip HDP flush 
> entirely so
> +      * no HDP-related HW access runs during reset;
> +      * reset re-establishes consistent state afterward.
> +      */

That comment explains what is done but not why.

Rather use something like this:

/* A reset flushed the HDP anyway, so that here can be skipped when a reset is 
ongoing */

Regards,
Christian.

> +     if (!down_read_trylock(&adev->reset_domain->sem))
> +             return 0;
> +
> +     amdgpu_device_flush_hdp(adev, NULL);
> +     up_read(&adev->reset_domain->sem);
> +
>       return 0;
>  }
>

Reply via email to