Re: [PATCH v2] drm/amd/amdgpu: consider kernel job always not guilty

2021-07-21 Thread Christian König

Am 21.07.21 um 04:05 schrieb Jingwen Chen:

[Why]
Currently all timedout job will be considered to be guilty. In SRIOV
multi-vf use case, the vf flr happens first and then job time out is
found. There can be several jobs timeout during a very small time slice.
And if the innocent sdma job time out is found before the real bad
job, then the innocent sdma job will be set to guilty. This will lead
to a page fault after resubmitting job.

[How]
If the job is a kernel job, we will always consider it not guilty

Signed-off-by: Jingwen Chen 


Reviewed-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 37fa199be8b3..40461547701a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4410,7 +4410,7 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
*adev,
amdgpu_fence_driver_force_completion(ring);
}
  
-	if(job)

+   if (job && job->vm)
drm_sched_increase_karma(>base);
  
  	r = amdgpu_reset_prepare_hwcontext(adev, reset_context);

@@ -4874,7 +4874,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as 
another already in progress",
job ? job->base.id : -1, hive->hive_id);
amdgpu_put_xgmi_hive(hive);
-   if (job)
+   if (job && job->vm)
drm_sched_increase_karma(>base);
return 0;
}
@@ -4898,7 +4898,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
job ? job->base.id : -1);
  
  		/* even we skipped this reset, still need to set the job to guilty */

-   if (job)
+   if (job && job->vm)
drm_sched_increase_karma(>base);
goto skip_recovery;
}


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH v2] drm/amd/amdgpu: consider kernel job always not guilty

2021-07-21 Thread Jingwen Chen
[Why]
Currently all timedout job will be considered to be guilty. In SRIOV
multi-vf use case, the vf flr happens first and then job time out is
found. There can be several jobs timeout during a very small time slice.
And if the innocent sdma job time out is found before the real bad
job, then the innocent sdma job will be set to guilty. This will lead
to a page fault after resubmitting job.

[How]
If the job is a kernel job, we will always consider it not guilty

Signed-off-by: Jingwen Chen 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 37fa199be8b3..40461547701a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4410,7 +4410,7 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
*adev,
amdgpu_fence_driver_force_completion(ring);
}
 
-   if(job)
+   if (job && job->vm)
drm_sched_increase_karma(>base);
 
r = amdgpu_reset_prepare_hwcontext(adev, reset_context);
@@ -4874,7 +4874,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as 
another already in progress",
job ? job->base.id : -1, hive->hive_id);
amdgpu_put_xgmi_hive(hive);
-   if (job)
+   if (job && job->vm)
drm_sched_increase_karma(>base);
return 0;
}
@@ -4898,7 +4898,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
job ? job->base.id : -1);
 
/* even we skipped this reset, still need to set the job to 
guilty */
-   if (job)
+   if (job && job->vm)
drm_sched_increase_karma(>base);
goto skip_recovery;
}
-- 
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx