On Fri, Jan 30, 2026 at 10:30 AM Christian König
<[email protected]> wrote:
>
>
>
> On 1/29/26 21:37, Alex Deucher wrote:
> > Need to re-add the bad job to the pending list before we
> > restart the scheduler.
> >
> > Reviewed-by: Jesse Zhang <[email protected]>
> > Signed-off-by: Alex Deucher <[email protected]>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 6 ++++++
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 4 ----
> > 2 files changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index aaf5477fcd7ac..9b10470321be3 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -135,8 +135,14 @@ static enum drm_gpu_sched_stat
> > amdgpu_job_timedout(struct drm_sched_job *s_job)
> > ring->funcs->reset) {
> > dev_err(adev->dev, "Starting %s ring reset\n",
> > s_job->sched->name);
> > + /* Stop the scheduler to prevent anybody else from touching
> > the ring buffer. */
> > + drm_sched_wqueue_stop(&ring->sched);
> > r = amdgpu_ring_reset(ring, job->vmid, job->hw_fence);
> > if (!r) {
> > + /* add the job back to the pending list */
> > + list_add(&s_job->list, &s_job->sched->pending_list);
>
> This is explicitly forbidden by the scheduler maintainer.
>
> So we seriously can't do that here.
>
> Correct approach would be to return the proper code to the scheduler so that
> the scheduler does that.
Ah sorry, the tree I was working against didn't have this change yet.
Will fix it up.
Alex
>
> Regards,
> Christian.
>
> > + /* Start the scheduler again */
> > + drm_sched_wqueue_start(&ring->sched);
> > atomic_inc(&ring->adev->gpu_reset_counter);
> > dev_err(adev->dev, "Ring %s reset succeeded\n",
> > ring->sched.name);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> > index b82357c657237..129ad51386535 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> > @@ -868,8 +868,6 @@ bool amdgpu_ring_sched_ready(struct amdgpu_ring *ring)
> > void amdgpu_ring_reset_helper_begin(struct amdgpu_ring *ring,
> > struct amdgpu_fence *guilty_fence)
> > {
> > - /* Stop the scheduler to prevent anybody else from touching the ring
> > buffer. */
> > - drm_sched_wqueue_stop(&ring->sched);
> > /* back up the non-guilty commands */
> > amdgpu_ring_backup_unprocessed_commands(ring, guilty_fence);
> > }
> > @@ -895,8 +893,6 @@ int amdgpu_ring_reset_helper_end(struct amdgpu_ring
> > *ring,
> > amdgpu_ring_write(ring, ring->ring_backup[i]);
> > amdgpu_ring_commit(ring);
> > }
> > - /* Start the scheduler again */
> > - drm_sched_wqueue_start(&ring->sched);
> > return 0;
> > }
> >
>