On Tue, 2026-01-13 at 15:37 +0100, Christian König wrote:
> On 1/13/26 14:34, Philipp Stanner wrote:
> > On Tue, 2026-01-13 at 14:17 +0100, Christian König wrote:
> > > On 1/8/26 15:48, Alex Deucher wrote:
> > > > We only want to stop the work queues, not mess with the
> > > > pending list, so just stop the work queues.
> > 
> > Ideally amdgpu could stop touching the pending_list altogether forever,
> > as discussed at XDC. Is work for that in the pipe? Is that what this
> > patch is for?
> 
> Yes.

Good!

> 
> > 
> > > 
> > > Oh, yes please! I can't remember how long we have worked towards
> > > that.
> > > 
> > > But we also need to change the return code so that the scheduler
> > > now re-inserts the job into the pending list.
> > 
> > You're referring to false-positive timeouts. Porting users to that
> > typically consists of adding that return code and also removing
> > whatever the driver used to do to inject the non-timed-out job back
> > into the scheduler.
> > 
> > How is that being done here?
> 
> Previously drm_sched_stop() would insert the job back into the
> pending list after stopping the scheduler thread.

Why does it even do that?

The entire function drm_sched_stop() looks insane to me. I'm not even
sure we're talking about the same reinserting.

Why would anyone have any interest at all in keeping a broken job in
the pending_list?
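
For reference, here is my reading of the relevant bit of
drm_sched_stop(), as a simplified sketch from memory (the exact field
names may be off):

	void drm_sched_stop(struct drm_gpu_scheduler *sched,
			    struct drm_sched_job *bad)
	{
		drm_sched_wqueue_stop(sched);

		/*
		 * The part I'm questioning: the timed-out ("bad") job
		 * is put back at the head of pending_list, even though
		 * the driver is about to reset the hardware behind it.
		 */
		if (bad && bad->sched == sched)
			list_add(&bad->list, &sched->pending_list);

		/*
		 * ... it then detaches the HW fence callbacks of all
		 * pending jobs so that drm_sched_start() can re-arm
		 * them later.
		 */
	}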


> 
> But when that is replaced with drm_sched_wqueue_stop() then that
> won't happen any more. That is a good thing and prevents us from
> running into problems like a UAF when the HW fence signals.
> 
> As far as I can see we should start returning
> DRM_GPU_SCHED_STAT_NO_HANG from amdgpu even when there was actually a
> hang (maybe rename the return code).

Well, no. That's not how NO_HANG was designed.

Why would you want that?
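
Just as a reminder, the intended use is roughly this (hypothetical
callback, assuming the current enum names; my_driver_reset_hw() is
made up):

	static enum drm_gpu_sched_stat
	my_timedout_job(struct drm_sched_job *sched_job)
	{
		/*
		 * False positive: the HW fence signaled while timeout
		 * handling was racing with it. Tell the scheduler to
		 * keep the job and carry on; no reset happened.
		 */
		if (sched_job->s_fence->parent &&
		    dma_fence_is_signaled(sched_job->s_fence->parent))
			return DRM_GPU_SCHED_STAT_NO_HANG;

		/* Real hang: reset the HW, let the scheduler clean up. */
		my_driver_reset_hw(sched_job);
		return DRM_GPU_SCHED_STAT_RESET;
	}

Returning NO_HANG after an actual reset would defeat the whole point
of the status code.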


P.

> 
> Regards,
> Christian.
> 
> > 
> > P.
> > 
> > > 
> > > Adding Philip on CC to double check what I say above.
> > > 
> > > Regards,
> > > Christian.
> > > 
> > > > 
> > > > Signed-off-by: Alex Deucher <[email protected]>
> > > > ---
> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > index 80572f71ff627..868ab5314c0d1 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > @@ -6301,7 +6301,7 @@ static void amdgpu_device_halt_activities(struct amdgpu_device *adev,
> > > >                         if (!amdgpu_ring_sched_ready(ring))
> > > >                                 continue;
> > > >  
> > > > -                       drm_sched_stop(&ring->sched, job ? &job->base : NULL);
> > > > +                       drm_sched_wqueue_stop(&ring->sched);
> > > >  
> > > >                         if (need_emergency_restart)
> > > >                                 amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
> > > > @@ -6385,7 +6385,7 @@ static int amdgpu_device_sched_resume(struct list_head *device_list,
> > > >                         if (!amdgpu_ring_sched_ready(ring))
> > > >                                 continue;
> > > >  
> > > > -                       drm_sched_start(&ring->sched, 0);
> > > > +                       drm_sched_wqueue_start(&ring->sched);
> > > >                 }
> > > >  
> > > >                 if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
> > > 
> > 
> 
