On 1/9/26 10:44, Zhang, Jesse(Jie) wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>> -----Original Message-----
>> From: Koenig, Christian <[email protected]>
>> Sent: Friday, January 9, 2026 5:02 PM
>> To: Zhang, Jesse(Jie) <[email protected]>; [email protected]
>> Cc: Deucher, Alexander <[email protected]>
>> Subject: Re: [PATCH 1/2] drm/amd/amdgpu: Add timeout for user queue fence
>> waiting
>>
>> On 1/9/26 03:34, Jesse.Zhang wrote:
>>> In certain error scenarios (e.g., malformed commands), a fence may never
>> become signaled, causing the kernel to hang indefinitely when waiting with
>> MAX_SCHEDULE_TIMEOUT.
>>> To prevent such hangs and ensure system responsiveness, replace the
>>> indefinite
>> timeout with a reasonable 2-second limit.
>>>
>>> This ensures that even if a fence never signals, the wait will time
>>> out and appropriate error handling can take place, rather than stalling the
>>> driver
>> indefinitely.
>>>
>>> v2: make timeout per queue type (adev->gfx_timeout vs
>>> adev->compute_timeout vs adev->sdma_timeout) to be consistent with
>>> kernel queues. (Alex)
>>>
>>> Suggested-by: Alex Deucher <[email protected]>
>>> Signed-off-by: Jesse Zhang <[email protected]>
>>
>> Once more: Absolutely clear NAK to this here!
>>
>> The function amdgpu_userq_wait_for_last_fence() *MUST* wait forever for the
>> last
>> fence to signal otherwise we run into massive problems.
> [Zhang, Jesse(Jie)]
> If data packets are corrupted (e.g., injecting error data packets), it can
> cause the queue to hang. The last fence will not return, and the process will
> wait indefinitely.
Yes that is expected behavior.
> Therefore, we should set an appropriate timeout period.
No, exactly that is absolutely broken.
> We can trigger a detection and reset mechanism to recover the hung queue and,
> if necessary, set the error status of the last fence.
Yes, that is the right thing to do.
The timeout detection must be independent from the fence, e.g. you don't wait
for a timeout on the fence but rather have the timeout start as soon as the
fence is initialized.
The timeout will then trigger, does the queue reset and signals the fence with
an error code.
This will then unblock the waiting processes.
Regards,
Christian.
>
> Thanks
> Jesse.
>
>>
>> Regards,
>> Christian.
>>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 20 +++++++++++++++++++-
>>> 1 file changed, 19 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> index 98110f543307..402307581293 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> @@ -367,11 +367,29 @@ static int amdgpu_userq_map_helper(struct
>>> amdgpu_usermode_queue *queue) static int
>>> amdgpu_userq_wait_for_last_fence(struct amdgpu_usermode_queue *queue) {
>>> struct amdgpu_userq_mgr *uq_mgr = queue->userq_mgr;
>>> + struct amdgpu_device *adev = uq_mgr->adev;
>>> struct dma_fence *f = queue->last_fence;
>>> + signed long timeout;
>>> int ret = 0;
>>>
>>> + /* Determine timeout based on queue type */
>>> + switch (queue->queue_type) {
>>> + case AMDGPU_RING_TYPE_GFX:
>>> + timeout = adev->gfx_timeout;
>>> + break;
>>> + case AMDGPU_RING_TYPE_COMPUTE:
>>> + timeout = adev->compute_timeout;
>>> + break;
>>> + case AMDGPU_RING_TYPE_SDMA:
>>> + timeout = adev->sdma_timeout;
>>> + break;
>>> + default:
>>> + timeout = adev->gfx_timeout;
>>> + break;
>>> + }
>>> +
>>> if (f && !dma_fence_is_signaled(f)) {
>>> - ret = dma_fence_wait_timeout(f, true,
>> MAX_SCHEDULE_TIMEOUT);
>>> + ret = dma_fence_wait_timeout(f, true, timeout);
>>> if (ret <= 0) {
>>> drm_file_err(uq_mgr->file, "Timed out waiting for
>> fence=%llu:%llu\n",
>>> f->context, f->seqno);
>