On Tue, 2026-01-13 at 15:37 +0100, Christian König wrote:
> On 1/13/26 14:34, Philipp Stanner wrote:
> > On Tue, 2026-01-13 at 14:17 +0100, Christian König wrote:
> > > On 1/8/26 15:48, Alex Deucher wrote:
> > > > We only want to stop the work queues, not mess with the
> > > > pending list so just stop the work queues.
> > 
> > Ideally amdgpu could stop touching the pending_list altogether forever,
> > as discussed at XDC. Is work for that in the pipe? Is that what this
> > patch is for?
> 
> Yes.
Good!

> > > Oh, yes please! I can't remember how long we have worked towards
> > > that.
> > > 
> > > But we also need to change the return code so that the scheduler
> > > now re-inserts the job into the pending list.
> > 
> > You're referring to false-positive timeouts. Porting users to that
> > typically consists of adding that return code and also removing
> > whatever the driver used to do to inject the non-timed-out job into
> > the scheduler again.
> > 
> > How is that being done here?
> 
> Previously drm_sched_stop() would insert the job back into the
> pending list after stopping the scheduler thread.

Why does it even do that? The entire function drm_sched_stop() looks
insane to me.

I'm not even sure we're talking about the same reinserting. Why is it
that anyone has any interest at all in keeping a broken job in the
pending_list?

> But when that is replaced with drm_sched_wqueue_stop() then that
> won't happen any more. That is a good thing and prevents us from
> running into problems like a UAF because the HW fence signaled.
> 
> As far as I can see we should start returning
> DRM_GPU_SCHED_STAT_NO_HANG from amdgpu even when there was actually a
> hang (maybe rename the return code).

Well, no. That's not how NO_HANG was designed. Why would you want that?

P.

> Regards,
> Christian.
> 
> > P.
> > 
> > > Adding Philip on CC to double check what I say above.
> > > 
> > > Regards,
> > > Christian.
> > > 
> > > > Signed-off-by: Alex Deucher <[email protected]>
> > > > ---
> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > index 80572f71ff627..868ab5314c0d1 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > @@ -6301,7 +6301,7 @@ static void amdgpu_device_halt_activities(struct amdgpu_device *adev,
> > > >  		if (!amdgpu_ring_sched_ready(ring))
> > > >  			continue;
> > > > 
> > > > -		drm_sched_stop(&ring->sched, job ? &job->base : NULL);
> > > > +		drm_sched_wqueue_stop(&ring->sched);
> > > > 
> > > >  		if (need_emergency_restart)
> > > >  			amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
> > > > @@ -6385,7 +6385,7 @@ static int amdgpu_device_sched_resume(struct list_head *device_list,
> > > >  		if (!amdgpu_ring_sched_ready(ring))
> > > >  			continue;
> > > > 
> > > > -		drm_sched_start(&ring->sched, 0);
> > > > +		drm_sched_wqueue_start(&ring->sched);
> > > >  	}
> > > > 
> > > >  	if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
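For reference, "how NO_HANG was designed": DRM_GPU_SCHED_STAT_NO_HANG is
meant for false-positive timeouts only. The driver's timedout_job callback
returns it when the job turns out not to be hung (for instance because its
hardware fence already signaled), and the scheduler then reinserts the job
into its own pending list and rearms the timeout; the driver never touches
the pending list itself. A minimal sketch of such a handler follows, using
a hypothetical driver; my_job, hw_fence, my_device_reset() and
my_timedout_job() are illustration names, not amdgpu code:

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>

/* Hypothetical driver job; only what the sketch needs. */
struct my_job {
	struct drm_sched_job base;
	struct dma_fence *hw_fence;	/* signaled by the HW on completion */
};

static void my_device_reset(struct my_job *job)
{
	/* Driver-specific recovery (stop queues, reset HW, resubmit); elided. */
}

static enum drm_gpu_sched_stat my_timedout_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);

	/*
	 * False positive: the job actually finished and the timeout merely
	 * raced with the fence signaling. NO_HANG tells the scheduler to
	 * put the job back on its pending list and rearm the timeout; the
	 * driver does not touch the pending list.
	 */
	if (dma_fence_is_signaled(job->hw_fence))
		return DRM_GPU_SCHED_STAT_NO_HANG;

	/* Real hang: recover, then report DRM_GPU_SCHED_STAT_RESET
	 * (formerly DRM_GPU_SCHED_STAT_NOMINAL). */
	my_device_reset(job);
	return DRM_GPU_SCHED_STAT_RESET;
}

static const struct drm_sched_backend_ops my_sched_ops = {
	.timedout_job = my_timedout_job,
	/* .run_job, .free_job etc. elided */
};

Returning NO_HANG for a job that really did hang would leave a broken job
on the pending list with a rearmed timeout, which is the objection above.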
