amdgpu: re-add the bad job to the pending list for ring resets

Philipp Stanner Sun, 01 Feb 2026 02:53:14 -0800

On Fri, 2026-01-30 at 16:30 +0100, Christian König wrote:
> 
> 
> On 1/29/26 21:37, Alex Deucher wrote:
> > Need to re-add the bad job to the pending list before we
> > restart the scheduler.
> > 
> > Reviewed-by: Jesse Zhang <[email protected]>
> > Signed-off-by: Alex Deucher <[email protected]>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  | 6 ++++++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 4 ----
> >  2 files changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index aaf5477fcd7ac..9b10470321be3 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -135,8 +135,14 @@ static enum drm_gpu_sched_stat 
> > amdgpu_job_timedout(struct drm_sched_job *s_job)
> >         ring->funcs->reset) {
> >             dev_err(adev->dev, "Starting %s ring reset\n",
> >                     s_job->sched->name);
> > +           /* Stop the scheduler to prevent anybody else from touching the 
> > ring buffer. */
> > +           drm_sched_wqueue_stop(&ring->sched);
> >             r = amdgpu_ring_reset(ring, job->vmid, job->hw_fence);
> >             if (!r) {
> > +                   /* add the job back to the pending list */
> > +                   list_add(&s_job->list, &s_job->sched->pending_list);
> 
> This is explicitly forbidden by the scheduler maintainer.


Fun fact: In English they sometimes use our german word "verboten" :)

> 
> So we seriously can't do that here.
> 
> Correct approach would be to return the proper code to the scheduler so that 
> the scheduler does that.

Yes.
Thank you very much, Christian.

Layering violations and ossification of API internals is one of the
main reasons that brought drm_sched to the brink of unmaintainability.
I think I was very explicit about this at XDC.

Our friends at AMD shall take Panfrost as an inspiration:
https://elixir.bootlin.com/linux/v6.19-rc4/source/drivers/gpu/drm/panfrost/panfrost_job.c#L760


P.

> 
> Regards,
> Christian.
> 
> > +                   /* Start the scheduler again */
> > +                   drm_sched_wqueue_start(&ring->sched);
> >                     atomic_inc(&ring->adev->gpu_reset_counter);
> >                     dev_err(adev->dev, "Ring %s reset succeeded\n",
> >                             ring->sched.name);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> > index b82357c657237..129ad51386535 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> > @@ -868,8 +868,6 @@ bool amdgpu_ring_sched_ready(struct amdgpu_ring *ring)
> >  void amdgpu_ring_reset_helper_begin(struct amdgpu_ring *ring,
> >                                 struct amdgpu_fence *guilty_fence)
> >  {
> > -   /* Stop the scheduler to prevent anybody else from touching the ring 
> > buffer. */
> > -   drm_sched_wqueue_stop(&ring->sched);
> >     /* back up the non-guilty commands */
> >     amdgpu_ring_backup_unprocessed_commands(ring, guilty_fence);
> >  }
> > @@ -895,8 +893,6 @@ int amdgpu_ring_reset_helper_end(struct amdgpu_ring 
> > *ring,
> >                     amdgpu_ring_write(ring, ring->ring_backup[i]);
> >             amdgpu_ring_commit(ring);
> >     }
> > -   /* Start the scheduler again */
> > -   drm_sched_wqueue_start(&ring->sched);
> >     return 0;
> >  }
> >  
>

Re: [PATCH 01/12] drm/amdgpu: re-add the bad job to the pending list for ring resets

Reply via email to