[AMD Official Use Only - AMD Internal Distribution Only]

I see, I will create new patch for review. I tried a few loops of sanity test 
and the results looks positive.

Thanks,
Chenglei

-----Original Message-----
From: Koenig, Christian <[email protected]>
Sent: Wednesday, April 8, 2026 4:09 AM
To: Xie, Chenglei <[email protected]>; [email protected]
Cc: Lazar, Lijo <[email protected]>; Kuehling, Felix <[email protected]>; 
Deucher, Alexander <[email protected]>; Chan, Hing Pong 
<[email protected]>; Luo, Zhigang <[email protected]>; Zhao, Victor 
<[email protected]>; Yat Sin, David <[email protected]>; Dhinakararam, 
Lokesh <[email protected]>
Subject: Re: [[PATCH v6]] drm/amdgpu: gate VM CPU HDP flush on reset lock; 
force-complete KIQ before VF reset

On 4/7/26 22:55, Xie, Chenglei wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Hi @Koenig, Christian,
>
>>  void amdgpu_virt_pre_reset(struct amdgpu_device *adev)  {
>> +     int i;
>> +
>>       /* stop the data exchange thread */
>>       amdgpu_virt_fini_data_exchange(adev);
>>       amdgpu_dpm_set_mp1_state(adev, PP_MP1_STATE_FLR);
>> +
>> +     /* Force completion on KIQ ring fences so pending fences are 
>> signalled. */
>> +     for (i = 0; i < AMDGPU_MAX_GC_INSTANCES; i++) {
>> +             struct amdgpu_ring *ring = &adev->gfx.kiq[i].ring;
>> +
>> +             if (!ring->fence_drv.initialized)
>> +                     continue;
>> +             amdgpu_fence_driver_force_completion(ring);
>
>>> Well that is unrelated and clearly incorrect. The KIQ is re-initialized 
>>> through a reset and should *NEVER* be force signaled.
>
> HW KIQ is re-inited after reset, but that path does not reinitialize or reset 
> ring->fence_drv.
> amdgpu_fence_driver_force_completion() is SW-only: pending dma_fences get 
> -ECANCELED, writeback is set to sync_seq, and fences are signaled so 
> bookkeeping and waiters reflect that the queue’s prior work will not complete 
> on HW. Without that, seq/writeback can stay wrong and 
> amdgpu_fence_emit_polling() can time out on later KIQ use.
>
> The generic loop in amdgpu_device_pre_asic_reset() only force-completes 
> scheduler-ready rings, so it skips KIQ. This VF path covers KIQ unless we add 
> equivalent logic elsewhere. So we still need this SW cleanup to avoid KIQ 
> fence state mismatch after reset.

Well you just explained it.

The KIQ is skipped in amdgpu_device_pre_asic_reset() because it doesn't use 
dma_fences and so calling amdgpu_fence_driver_force_completion() on it is 
absolutely nonsense.

I mean when there is an intentional skipping of calling a function why in the 
world do you think that you need to do that manually?

Regards,
Christian.

>
> Thanks,
> Chenglei
>
> -----Original Message-----
> From: Koenig, Christian <[email protected]>
> Sent: Tuesday, April 7, 2026 1:07 PM
> To: Xie, Chenglei <[email protected]>;
> [email protected]
> Cc: Lazar, Lijo <[email protected]>; Kuehling, Felix
> <[email protected]>; Deucher, Alexander
> <[email protected]>; Chan, Hing Pong <[email protected]>;
> Luo, Zhigang <[email protected]>; Kasiviswanathan, Harish
> <[email protected]>; Zhao, Victor <[email protected]>;
> Yat Sin, David <[email protected]>; Dhinakararam, Lokesh
> <[email protected]>
> Subject: Re: [[PATCH v6]] drm/amdgpu: gate VM CPU HDP flush on reset
> lock; force-complete KIQ before VF reset
>
> On 4/7/26 18:25, Chenglei Xie wrote:
>> During GPU reset, the application could still run CPU page table
>> updates. Each commit called amdgpu_device_flush_hdp(), which on SR-IOV sends 
>> work through the KIQ ring.
>> That can advance sync_seq while the GPU is being reset, leaving fence
>> writeback out of sync and causing amdgpu_fence_emit_polling() to time
>> out on later KIQ use.
>>
>> Fix:
>> amdgpu_vm_cpu_commit():
>>   Take reset_domain->sem with down_read_trylock() before 
>> amdgpu_device_flush_hdp().
>>   If the reset path holds the write lock, skip the HDP flush so no 
>> HDP-related HW
>>   access (including KIQ) runs during reset; state is re-established after 
>> reset.
>>
>> amdgpu_virt_pre_reset():
>>   After stopping the data exchange thread and setting MP1 FLR state, call
>>   amdgpu_fence_driver_force_completion() on each initialized KIQ ring so 
>> pending
>>   fences are signalled and writeback is aligned before reset proceeds.
>>
>> Signed-off-by: Chenglei Xie <[email protected]>
>> Change-Id: I938bce0cab93a794dbdb02fe3ca9e041f9ac1424
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 11 +++++++++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c | 16 +++++++++++++++-
>>  2 files changed, 26 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>> index 6974b1c5b56c2..0127b0d6c7277 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>> @@ -1188,9 +1188,20 @@ enum amdgpu_sriov_vf_mode
>> amdgpu_virt_get_sriov_vf_mode(struct amdgpu_device *ad
>>
>>  void amdgpu_virt_pre_reset(struct amdgpu_device *adev)  {
>> +     int i;
>> +
>>       /* stop the data exchange thread */
>>       amdgpu_virt_fini_data_exchange(adev);
>>       amdgpu_dpm_set_mp1_state(adev, PP_MP1_STATE_FLR);
>> +
>> +     /* Force completion on KIQ ring fences so pending fences are 
>> signalled. */
>> +     for (i = 0; i < AMDGPU_MAX_GC_INSTANCES; i++) {
>> +             struct amdgpu_ring *ring = &adev->gfx.kiq[i].ring;
>> +
>> +             if (!ring->fence_drv.initialized)
>> +                     continue;
>> +             amdgpu_fence_driver_force_completion(ring);
>
> Well that is unrelated and clearly incorrect. The KIQ is re-initialized 
> through a reset and should *NEVER* be force signaled.
>
>> +     }
>>  }
>>
>>  void amdgpu_virt_post_reset(struct amdgpu_device *adev) diff --git
>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
>> index 22e2e5b473415..a9e33b7e87406 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
>> @@ -21,6 +21,8 @@
>>   */
>>
>>  #include "amdgpu_vm.h"
>> +#include "amdgpu.h"
>> +#include "amdgpu_reset.h"
>>  #include "amdgpu_object.h"
>>  #include "amdgpu_trace.h"
>>
>> @@ -108,11 +110,23 @@ static int amdgpu_vm_cpu_update(struct
>> amdgpu_vm_update_params *p,  static int amdgpu_vm_cpu_commit(struct 
>> amdgpu_vm_update_params *p,
>>                               struct dma_fence **fence)  {
>> +     struct amdgpu_device *adev = p->adev;
>> +
>>       if (p->needs_flush)
>>               atomic64_inc(&p->vm->tlb_seq);
>>
>>       mb();
>> -     amdgpu_device_flush_hdp(p->adev, NULL);
>> +     /*
>> +      * While GPU reset holds reset_domain write lock, skip HDP flush 
>> entirely so
>> +      * no HDP-related HW access runs during reset;
>> +      * reset re-establishes consistent state afterward.
>> +      */
>
> That comment explains what is done but not why.
>
> Rather use something like this:
>
> /* A reset flushed the HDP anyway, so that here can be skipped when a
> reset is ongoing */
>
> Regards,
> Christian.
>
>> +     if (!down_read_trylock(&adev->reset_domain->sem))
>> +             return 0;
>> +
>> +     amdgpu_device_flush_hdp(adev, NULL);
>> +     up_read(&adev->reset_domain->sem);
>> +
>>       return 0;
>>  }
>>
>

Reply via email to