Re: [PATCH v8] drm/amd/amdgpu implement tdr advanced mode

2021-03-12 Thread Christian König

Hi Jack,

the scheduler changes are becoming too invasive.

You should split that patch up into two: the first one making the
scheduler changes and the second one using them in amdgpu.


Christian.

On 11.03.21 at 16:37, Jack Zhang wrote:

[Why]
The previous TDR design treats the first job in job_timeout as the bad job.
But sometimes a later bad compute job can block a good gfx job and
cause an unexpected gfx job timeout, because the gfx and compute rings
share internal GC hardware.

[How]
This patch implements an advanced TDR mode. It adds an additional
synchronous pre-resubmit step (Step0 Resubmit) before the normal resubmit
step in order to find the real bad job.

1. At the Step0 Resubmit stage, the first pending job of each ring is
resubmitted synchronously and we wait for it to be signaled. If the wait
times out, we identify that job as guilty and do a HW reset. After that,
we do the normal resubmit step to resubmit the remaining jobs.

2. For a whole GPU reset (VRAM lost), resubmit the old way.

Signed-off-by: Jack Zhang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  74 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|   2 +-
  drivers/gpu/drm/scheduler/sched_main.c | 103 ++---
  include/drm/gpu_scheduler.h|   3 +
  4 files changed, 149 insertions(+), 33 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e247c3a2ec08..01352ed725d5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4617,6 +4617,73 @@ static int amdgpu_device_suspend_display_audio(struct amdgpu_device *adev)
return 0;
  }
  
+void amdgpu_device_correct_karma(struct amdgpu_device *adev,
+                                struct amdgpu_hive_info *hive,
+                                struct list_head *device_list_handle,
+                                bool *need_full_reset)
+{
+    int i, r = 0;
+
+    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+        struct amdgpu_ring *ring = adev->rings[i];
+        int ret = 0;
+        struct drm_sched_job *s_job;
+
+        if (!ring || !ring->sched.thread)
+            continue;
+
+        s_job = list_first_entry_or_null(&ring->sched.pending_list,
+                struct drm_sched_job, list);
+        if (s_job == NULL)
+            continue;
+
+        /* clear the job's guilty flag and rely on the following step to decide the real one */
+        drm_sched_reset_karma(s_job);
+        drm_sched_resubmit_jobs_ext(&ring->sched, 1);
+
+        ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
+        if (ret == 0) { /* timeout */
+            DRM_ERROR("Found the real bad job! ring:%s, job_id:%llx\n",
+                      ring->sched.name, s_job->id);
+
+            /* set guilty */
+            drm_sched_increase_karma(s_job);
+retry:
+            /* do hw reset */
+            if (amdgpu_sriov_vf(adev)) {
+                amdgpu_virt_fini_data_exchange(adev);
+                r = amdgpu_device_reset_sriov(adev, false);
+                if (r)
+                    adev->asic_reset_res = r;
+            } else {
+                r = amdgpu_do_asic_reset(hive, device_list_handle,
+                                         need_full_reset, false);
+                if (r && r == -EAGAIN)
+                    goto retry;
+            }
+
+            /*
+             * add reset counter so that the following
+             * resubmitted job could flush vmid
+             */
+            atomic_inc(&adev->gpu_reset_counter);
+            continue;
+        }
+
+        /* got the hw fence, signal the finished fence */
+        atomic_dec(&ring->sched.num_jobs);
+        dma_fence_get(&s_job->s_fence->finished);
+        dma_fence_signal(&s_job->s_fence->finished);
+        dma_fence_put(&s_job->s_fence->finished);
+
+        /* remove the node from the list and free the job */
+        spin_lock(&ring->sched.job_list_lock);
+        list_del_init(&s_job->node);
+        spin_unlock(&ring->sched.job_list_lock);
+        ring->sched.ops->free_job(s_job);
+    }
+}
+
  /**
   * amdgpu_device_gpu_recover - reset the asic and recover scheduler
   *
@@ -4639,6 +4706,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
int i, r = 0;
bool need_emergency_restart = false;
bool audio_suspended = false;
+   int tmp_vram_lost_counter;
  
	/*
	 * Special case: RAS triggered and full reset isn't supported
@@ -4788,6 +4856,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
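
The drivers/gpu/drm/scheduler/sched_main.c hunk is truncated in the quote
above, so the two new scheduler helpers the amdgpu code relies on,
drm_sched_reset_karma() and drm_sched_resubmit_jobs_ext(), are not visible
here. Going only by how they are called, a minimal sketch of what they are
expected to do might look like the following; this is an illustration, not
the actual patch content:

    /* Sketch only: drop the karma (guilty count) previously attributed to
     * this job, so the Step0 pass can re-decide guilt from scratch. */
    void drm_sched_reset_karma(struct drm_sched_job *bad);

    /* Sketch only: like drm_sched_resubmit_jobs(), but push at most 'max'
     * jobs from the head of the pending list back to the hardware ring,
     * re-creating their hardware fences.  Called with max == 1 above so the
     * driver can wait on a single job at a time. */
    void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max)
    {
        struct drm_sched_job *s_job, *tmp;
        int i = 0;

        list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
            struct dma_fence *fence;

            if (i++ >= max)
                break;

            fence = sched->ops->run_job(s_job);  /* hand the job to the HW */
            if (!IS_ERR_OR_NULL(fence))
                s_job->s_fence->parent = dma_fence_get(fence);
            else
                s_job->s_fence->parent = NULL;
        }
    }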

Re: [PATCH v8] drm/amd/amdgpu implement tdr advanced mode

2021-03-11 Thread Andrey Grodzovsky

I would use a more meaningful name than
amdgpu_device_correct_karma and also put a comment
above describing the algorithm (just as in the patch description).
Other than that it looks good to me -
Reviewed-by: Andrey Grodzovsky andrey.grodzov...@amd.com

Andrey


[PATCH v8] drm/amd/amdgpu implement tdr advanced mode

2021-03-11 Thread Jack Zhang
[Why]
The previous TDR design treats the first job in job_timeout as the bad job.
But sometimes a later bad compute job can block a good gfx job and
cause an unexpected gfx job timeout, because the gfx and compute rings
share internal GC hardware.

[How]
This patch implements an advanced TDR mode. It adds an additional
synchronous pre-resubmit step (Step0 Resubmit) before the normal resubmit
step in order to find the real bad job.

1. At the Step0 Resubmit stage, the first pending job of each ring is
resubmitted synchronously and we wait for it to be signaled. If the wait
times out, we identify that job as guilty and do a HW reset. After that,
we do the normal resubmit step to resubmit the remaining jobs.

2. For a whole GPU reset (VRAM lost), resubmit the old way.

Signed-off-by: Jack Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  74 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|   2 +-
 drivers/gpu/drm/scheduler/sched_main.c | 103 ++---
 include/drm/gpu_scheduler.h|   3 +
 4 files changed, 149 insertions(+), 33 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e247c3a2ec08..01352ed725d5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4617,6 +4617,73 @@ static int amdgpu_device_suspend_display_audio(struct amdgpu_device *adev)
return 0;
 }
 
+void amdgpu_device_correct_karma(struct amdgpu_device *adev,
+                                struct amdgpu_hive_info *hive,
+                                struct list_head *device_list_handle,
+                                bool *need_full_reset)
+{
+    int i, r = 0;
+
+    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+        struct amdgpu_ring *ring = adev->rings[i];
+        int ret = 0;
+        struct drm_sched_job *s_job;
+
+        if (!ring || !ring->sched.thread)
+            continue;
+
+        s_job = list_first_entry_or_null(&ring->sched.pending_list,
+                struct drm_sched_job, list);
+        if (s_job == NULL)
+            continue;
+
+        /* clear the job's guilty flag and rely on the following step to decide the real one */
+        drm_sched_reset_karma(s_job);
+        drm_sched_resubmit_jobs_ext(&ring->sched, 1);
+
+        ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
+        if (ret == 0) { /* timeout */
+            DRM_ERROR("Found the real bad job! ring:%s, job_id:%llx\n",
+                      ring->sched.name, s_job->id);
+
+            /* set guilty */
+            drm_sched_increase_karma(s_job);
+retry:
+            /* do hw reset */
+            if (amdgpu_sriov_vf(adev)) {
+                amdgpu_virt_fini_data_exchange(adev);
+                r = amdgpu_device_reset_sriov(adev, false);
+                if (r)
+                    adev->asic_reset_res = r;
+            } else {
+                r = amdgpu_do_asic_reset(hive, device_list_handle,
+                                         need_full_reset, false);
+                if (r && r == -EAGAIN)
+                    goto retry;
+            }
+
+            /*
+             * add reset counter so that the following
+             * resubmitted job could flush vmid
+             */
+            atomic_inc(&adev->gpu_reset_counter);
+            continue;
+        }
+
+        /* got the hw fence, signal the finished fence */
+        atomic_dec(&ring->sched.num_jobs);
+        dma_fence_get(&s_job->s_fence->finished);
+        dma_fence_signal(&s_job->s_fence->finished);
+        dma_fence_put(&s_job->s_fence->finished);
+
+        /* remove the node from the list and free the job */
+        spin_lock(&ring->sched.job_list_lock);
+        list_del_init(&s_job->node);
+        spin_unlock(&ring->sched.job_list_lock);
+        ring->sched.ops->free_job(s_job);
+    }
+}
+
 /**
  * amdgpu_device_gpu_recover - reset the asic and recover scheduler
  *
@@ -4639,6 +4706,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
int i, r = 0;
bool need_emergency_restart = false;
bool audio_suspended = false;
+   int tmp_vram_lost_counter;
 
/*
 * Special case: RAS triggered and full reset isn't supported
@@ -4788,6 +4856,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
}
}
 
+   tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
/* Actual ASIC resets if needed.*/
/* TODO Implement XGMI hive reset logic for SRIOV */
if (amdgpu_sriov_vf(adev)) {
@@ 
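
The gpu_recover hunk above is cut off right after tmp_vram_lost_counter is
sampled, so the call site of amdgpu_device_correct_karma() is not visible in
this copy of the patch. Going by the commit message (Step0 runs only when
VRAM was not lost) and by the small amdgpu_drv.c change in the diffstat,
which presumably extends the gpu_recovery module parameter with the new
mode, the wiring is expected to look roughly like the sketch below; the
"amdgpu_gpu_recovery == 2" guard is an assumption, not something shown in
the visible diff:

    /* Sketch only, not the literal (truncated) patch content.  Run the
     * Step0 Resubmit pass only in the new advanced TDR mode and only when
     * VRAM survived the reset; a whole-GPU reset with VRAM lost keeps
     * using the old resubmit path unchanged. */
    if (amdgpu_gpu_recovery == 2 &&
        !(tmp_vram_lost_counter < atomic_read(&adev->vram_lost_counter)))
        amdgpu_device_correct_karma(adev, hive, device_list_handle,
                                    &need_full_reset);

    /* ... after which the existing per-ring drm_sched_resubmit_jobs() /
     * drm_sched_start() loop resubmits the remaining jobs as before. */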
