The earlier commit b8adc31cc0ca ("drm/amdgpu: Avoid extra evict-restore
process.") changed amdgpu_vm_wait_idle() to use drm_sched_entity_flush()
instead of dma_resv_wait_timeout() to avoid KFD eviction fence signaling.
But this introduces a race condition when processes are killed.

When a process is being killed, drm_sched_entity_flush() kills the VM
entities, so concurrent job submissions from the same process fail.

Fix this by skipping the VM entity flush when the process is being killed.

Signed-off-by: Liu01 Tong <tong.li...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 283dd44f04b0..ae43a378f866 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2415,6 +2415,13 @@ void amdgpu_vm_adjust_size(struct amdgpu_device *adev, uint32_t min_vm_size,
  */
 long amdgpu_vm_wait_idle(struct amdgpu_vm *vm, long timeout)
 {
+       /* If the process is being killed, skip flushing the VM entities,
+        * as the entities of concurrent job submissions from this process
+        * might be in an inconsistent state.
+        */
+       if (current->flags & PF_EXITING)
+               return timeout;
+
        timeout = drm_sched_entity_flush(&vm->immediate, timeout);
        if (timeout <= 0)
                return timeout;
-- 
2.34.1
