On 1/13/26 14:34, Philipp Stanner wrote: > On Tue, 2026-01-13 at 14:17 +0100, Christian König wrote: >> On 1/8/26 15:48, Alex Deucher wrote: >>> We only want to stop the work queues, not mess with the >>> pending list so just stop the work queues. > > Ideally amdgpu could stop touching the pending_list altogether forever, > as discussed at XDC. Is work for that in the pipe? Is that what this > patch is for?
Yes. > >> >> Oh, yes please! I can't remember how long we have worked towards that. >> >> But we also need to change the return code so that the scheduler now >> re-inserts the job into the pending list. > > You're referring to false-positive timeouts. Porting users to that > typically consists of adding that return code and also removing > whatever the driver used to do to inject the non-timedout job into the > scheduler again. > > How is that being done here? Previously drm_sched_stop() would insert the job back into the pending list after stopping the scheduler thread. But when that is replaced with drm_sched_wqueue_stop() then that won't happen any more. That is a good thing and prevents us from running into problems like UAF because the HW fence signaled. As far as I can see we should start returning DRM_GPU_SCHED_STAT_NO_HANG from amdgpu even when there was actually a hang (maybe rename the return code). Regards, Christian. > > P. > >> >> Adding Philip on CC to double check what I say above. >> >> Regards, >> Christian. >> >>> >>> Signed-off-by: Alex Deucher <[email protected]> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++-- >>> 1 file changed, 2 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> index 80572f71ff627..868ab5314c0d1 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> @@ -6301,7 +6301,7 @@ static void amdgpu_device_halt_activities(struct >>> amdgpu_device *adev, >>> if (!amdgpu_ring_sched_ready(ring)) >>> continue; >>> >>> - drm_sched_stop(&ring->sched, job ? &job->base : NULL); >>> + drm_sched_wqueue_stop(&ring->sched); >>> >>> if (need_emergency_restart) >>> amdgpu_job_stop_all_jobs_on_sched(&ring->sched); >>> @@ -6385,7 +6385,7 @@ static int amdgpu_device_sched_resume(struct >>> list_head *device_list, >>> if (!amdgpu_ring_sched_ready(ring)) >>> continue; >>> >>> - drm_sched_start(&ring->sched, 0); >>> + drm_sched_wqueue_start(&ring->sched); >>> } >>> >>> if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && >>> !job_signaled) >> >
