On 1/13/26 14:34, Philipp Stanner wrote:
> On Tue, 2026-01-13 at 14:17 +0100, Christian König wrote:
>> On 1/8/26 15:48, Alex Deucher wrote:
>>> We only want to stop the work queues, not mess with the
>>> pending list so just stop the work queues.
> 
> Ideally amdgpu could stop touching the pending_list altogether forever,
> as discussed at XDC. Is work for that in the pipe? Is that what this
> patch is for?

Yes.

> 
>>
>> Oh, yes please! I can't remember how long we have worked towards that.
>>
>> But we also need to change the return code so that the scheduler now 
>> re-inserts the job into the pending list.
> 
> You're referring to false-positive timeouts. Porting users to that
> typically consists of adding that return code and also removing
> whatever the driver used to do to inject the non-timedout job into the
> scheduler again.
> 
> How is that being done here?

Previously drm_sched_stop() would insert the job back into the pending list 
after stopping the scheduler thread.

But when that is replaced with drm_sched_wqueue_stop() then that won't happen 
any more. That is a good thing and prevents us from running into problems like 
UAF because the HW fence signaled.

As far as I can see we should start returning DRM_GPU_SCHED_STAT_NO_HANG from 
amdgpu even when there was actually a hang (maybe rename the return code).

Regards,
Christian.

> 
> P.
> 
>>
>> Adding Philip on CC to double check what I say above.
>>
>> Regards,
>> Christian.
>>
>>>
>>> Signed-off-by: Alex Deucher <[email protected]>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 80572f71ff627..868ab5314c0d1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -6301,7 +6301,7 @@ static void amdgpu_device_halt_activities(struct 
>>> amdgpu_device *adev,
>>>                     if (!amdgpu_ring_sched_ready(ring))
>>>                             continue;
>>>  
>>> -                   drm_sched_stop(&ring->sched, job ? &job->base : NULL);
>>> +                   drm_sched_wqueue_stop(&ring->sched);
>>>  
>>>                     if (need_emergency_restart)
>>>                             amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
>>> @@ -6385,7 +6385,7 @@ static int amdgpu_device_sched_resume(struct 
>>> list_head *device_list,
>>>                     if (!amdgpu_ring_sched_ready(ring))
>>>                             continue;
>>>  
>>> -                   drm_sched_start(&ring->sched, 0);
>>> +                   drm_sched_wqueue_start(&ring->sched);
>>>             }
>>>  
>>>             if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && 
>>> !job_signaled)
>>
> 

Reply via email to