[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Koenig, Christian <[email protected]>
> Sent: Friday, January 9, 2026 5:02 PM
> To: Zhang, Jesse(Jie) <[email protected]>; [email protected]
> Cc: Deucher, Alexander <[email protected]>
> Subject: Re: [PATCH 1/2] drm/amd/amdgpu: Add timeout for user queue fence
> waiting
>
> On 1/9/26 03:34, Jesse.Zhang wrote:
> > In certain error scenarios (e.g., malformed commands), a fence may never
> become signaled, causing the kernel to hang indefinitely when waiting with
> MAX_SCHEDULE_TIMEOUT.
> > To prevent such hangs and ensure system responsiveness, replace the 
> > indefinite
> timeout with a reasonable 2-second limit.
> >
> > This ensures that even if a fence never signals, the wait will time
> > out and appropriate error handling can take place, rather than stalling the 
> > driver
> indefinitely.
> >
> > v2: make timeout per queue type (adev->gfx_timeout vs
> > adev->compute_timeout vs adev->sdma_timeout) to be consistent with
> > kernel queues. (Alex)
> >
> > Suggested-by: Alex Deucher <[email protected]>
> > Signed-off-by: Jesse Zhang <[email protected]>
>
> Once more: Absolutely clear NAK to this here!
>
> The function amdgpu_userq_wait_for_last_fence() *MUST* wait forever for the 
> last
> fence to signal otherwise we run into massive problems.
[Zhang, Jesse(Jie)]
If data packets are corrupted (e.g., injecting error data packets), it can 
cause the queue to hang. The last fence will not return, and the process will 
wait indefinitely.
Therefore, we should set an appropriate timeout period.
We can trigger a detection and reset mechanism to recover the hung queue and, 
if necessary, set the error status of the last fence.

Thanks
Jesse.

>
> Regards,
> Christian.
>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 20 +++++++++++++++++++-
> >  1 file changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> > index 98110f543307..402307581293 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> > @@ -367,11 +367,29 @@ static int amdgpu_userq_map_helper(struct
> > amdgpu_usermode_queue *queue)  static int
> > amdgpu_userq_wait_for_last_fence(struct amdgpu_usermode_queue *queue)  {
> >     struct amdgpu_userq_mgr *uq_mgr = queue->userq_mgr;
> > +   struct amdgpu_device *adev = uq_mgr->adev;
> >     struct dma_fence *f = queue->last_fence;
> > +   signed long timeout;
> >     int ret = 0;
> >
> > +   /* Determine timeout based on queue type */
> > +   switch (queue->queue_type) {
> > +   case AMDGPU_RING_TYPE_GFX:
> > +           timeout = adev->gfx_timeout;
> > +           break;
> > +   case AMDGPU_RING_TYPE_COMPUTE:
> > +           timeout = adev->compute_timeout;
> > +           break;
> > +   case AMDGPU_RING_TYPE_SDMA:
> > +           timeout = adev->sdma_timeout;
> > +           break;
> > +   default:
> > +           timeout = adev->gfx_timeout;
> > +           break;
> > +   }
> > +
> >     if (f && !dma_fence_is_signaled(f)) {
> > -           ret = dma_fence_wait_timeout(f, true,
> MAX_SCHEDULE_TIMEOUT);
> > +           ret = dma_fence_wait_timeout(f, true, timeout);
> >             if (ret <= 0) {
> >                     drm_file_err(uq_mgr->file, "Timed out waiting for
> fence=%llu:%llu\n",
> >                                  f->context, f->seqno);

Reply via email to