The earlier commit b8adc31cc0ca ("drm/amdgpu: Avoid extra evict-restore process.") changed amdgpu_vm_wait_idle() to use drm_sched_entity_flush() instead of dma_resv_wait_timeout() to avoid signaling the KFD eviction fence. However, this introduced a race condition when processes are killed.

During process kill, drm_sched_entity_flush() kills the VM entities, so concurrent job submissions from the same process fail.

Fix this by skipping the VM entity flush when the process is being killed.

Signed-off-by: Liu01 Tong <tong.li...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 283dd44f04b0..ae43a378f866 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2415,6 +2415,13 @@ void amdgpu_vm_adjust_size(struct amdgpu_device *adev, uint32_t min_vm_size,
  */
 long amdgpu_vm_wait_idle(struct amdgpu_vm *vm, long timeout)
 {
+	/* If the process is being killed, skip flush VM entities
+	 * as entities of concurrent job submission of this process
+	 * might be in an inconsistent state
+	 */
+	if (current->flags & PF_EXITING)
+		return timeout;
+
 	timeout = drm_sched_entity_flush(&vm->immediate, timeout);
 	if (timeout <= 0)
 		return timeout;
-- 
2.34.1