On 15.09.25 19:09, Alex Deucher wrote:
> On Mon, Sep 15, 2025 at 10:32 AM Christian König
> <[email protected]> wrote:
>>
>> On 15.09.25 16:24, Alex Deucher wrote:
>>> On Mon, Sep 15, 2025 at 9:59 AM Christian König
>>> <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 15.09.25 15:33, Alex Deucher wrote:
>>>>> When we backup ring contents to reemit after a queue reset,
>>>>> we don't backup ring contents from the bad context. When
>>>>> we signal the fences, we should set an error on those
>>>>> fences as well.
>>>>>
>>>>> v2: misc cleanups
>>>>> v3: add locking for fence error, fix comment (Christian)
>>>>>
>>>>> Fixes: 77cc0da39c7c ("drm/amdgpu: track ring state associated with a fence")
>>>>> Signed-off-by: Alex Deucher <[email protected]>
>>>>> ---
>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 33 ++++++++++++++++++++---
>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c  |  2 +-
>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h  |  2 +-
>>>>>  3 files changed, 31 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> index fd8cca241da62..72f0f16924476 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> @@ -758,11 +758,36 @@ void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring)
>>>>>   * @fence: fence of the ring to signal
>>>>>   *
>>>>>   */
>>>>> -void amdgpu_fence_driver_guilty_force_completion(struct amdgpu_fence *fence)
>>>>> +void amdgpu_fence_driver_guilty_force_completion(struct amdgpu_fence *af)
>>>>>  {
>>>>> -	dma_fence_set_error(&fence->base, -ETIME);
>>>>> -	amdgpu_fence_write(fence->ring, fence->seq);
>>>>> -	amdgpu_fence_process(fence->ring);
>>>>> +	struct dma_fence *unprocessed;
>>>>> +	struct dma_fence __rcu **ptr;
>>>>> +	struct amdgpu_fence *fence;
>>>>> +	struct amdgpu_ring *ring = af->ring;
>>>>> +	unsigned long flags;
>>>>> +	u64 i, seqno;
>>>>> +
>>>>> +	seqno = amdgpu_fence_read(ring);
>>>>> +
>>>>> +	spin_lock_irqsave(&ring->fence_drv.lock, flags);
>>>>> +	for (i = seqno + 1; i <= ring->fence_drv.sync_seq; ++i) {
>>>>
>>>> That won't work, the seqno can wrap around.
>>>>
>>>> I would just go over all fences, e.g. from 0 to end of array.
>>>>
>>>> The checks below should make sure that we don't try to set an error on
>>>> something already processed.
>>>>
>>>>> +		ptr = &ring->fence_drv.fences[i & ring->fence_drv.num_fences_mask];
>>>>> +		rcu_read_lock();
>>>>> +		unprocessed = rcu_dereference(*ptr);
>>>>> +
>>>>> +		if (unprocessed && !dma_fence_is_signaled(unprocessed)) {
>>>>
>>>> dma_fence_is_signaled_locked(), otherwise the function would try to lock
>>>> the same spinlock again.
>>>>
>>>>> +			fence = container_of(unprocessed, struct amdgpu_fence, base);
>>>>> +
>>>>> +			if (fence == af)
>>>>> +				dma_fence_set_error(&fence->base, -ETIME);
>>>>> +			else if (fence->context == af->context)
>>>>> +				dma_fence_set_error(&fence->base, -ECANCELED);
>>>>> +		}
>>>>> +		rcu_read_unlock();
>>>>> +	}
>>>>> +	spin_unlock_irqrestore(&ring->fence_drv.lock, flags);
>>>>
>>>>> +	amdgpu_fence_write(ring, af->seq);
>>>>> +	amdgpu_fence_process(ring);
>>>>
>>>> That's actually wrong. We want the other fences to signal later on when we
>>>> process them.
>>>
>>> This fence is the original guilty fence. The other fences will
>>> naturally signal when the later fences signal.
>>
>> Ah, got it. But that is still racy as hell.
>>
>> We should probably rather signal the guilty fence separately by calling
>> dma_fence_signal().
>
> What is it racing with? amdgpu_fence_process() ultimately calls
> dma_fence_signal(), plus it handles various AMD specific bookkeeping.
Potentially concurrent signaling. Keep in mind that it is perfectly
possible that the guilty fence just completes while we do all this here.

Regards,
Christian.

>
> Alex
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Alex
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>  }
>>>>>
>>>>>  void amdgpu_fence_save_wptr(struct dma_fence *fence)
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> index 6379bb25bf5ce..f6c9decedbd51 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> @@ -812,7 +812,7 @@ int amdgpu_ring_reset_helper_end(struct amdgpu_ring *ring,
>>>>>  	if (r)
>>>>>  		return r;
>>>>>
>>>>> -	/* signal the fence of the bad job */
>>>>> +	/* signal the guilty fence and set an error on all fences from the context */
>>>>>  	if (guilty_fence)
>>>>>  		amdgpu_fence_driver_guilty_force_completion(guilty_fence);
>>>>>  	/* Re-emit the non-guilty commands */
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> index 7670f5d82b9e4..0704fd9b85fdb 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> @@ -155,7 +155,7 @@ extern const struct drm_sched_backend_ops amdgpu_sched_ops;
>>>>>  void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring);
>>>>>  void amdgpu_fence_driver_set_error(struct amdgpu_ring *ring, int error);
>>>>>  void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring);
>>>>> -void amdgpu_fence_driver_guilty_force_completion(struct amdgpu_fence *af);
>>>>>  void amdgpu_fence_save_wptr(struct dma_fence *fence);
>>>>>
>>>>>  int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring);
>>>>
>>
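
For reference, the error-marking loop with Christian's two inline review
comments applied might look roughly like the sketch below: walk every slot
of the fence array rather than a seqno range (so sequence-number wrap-around
no longer matters), and use dma_fence_is_signaled_locked() because
fence_drv.lock, which is also the dma_fence lock of amdgpu fences, is already
held at this point. This is only an untested illustration reusing the local
variables declared in the quoted patch, not a replacement hunk:

	spin_lock_irqsave(&ring->fence_drv.lock, flags);
	/* Walk the whole ring buffer of fences; NULL or already signaled
	 * slots are skipped, so previously processed fences stay untouched.
	 */
	for (i = 0; i <= ring->fence_drv.num_fences_mask; ++i) {
		ptr = &ring->fence_drv.fences[i];
		rcu_read_lock();
		unprocessed = rcu_dereference(*ptr);

		/* Use the _locked variant: fence_drv.lock is held here and
		 * the unlocked check could try to take the same spinlock.
		 */
		if (unprocessed && !dma_fence_is_signaled_locked(unprocessed)) {
			fence = container_of(unprocessed, struct amdgpu_fence, base);

			if (fence == af)
				dma_fence_set_error(&fence->base, -ETIME);
			else if (fence->context == af->context)
				dma_fence_set_error(&fence->base, -ECANCELED);
		}
		rcu_read_unlock();
	}
	spin_unlock_irqrestore(&ring->fence_drv.lock, flags);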
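
One possible reading of Christian's "signal the guilty fence separately"
suggestion, again only a sketch: replace the trailing
amdgpu_fence_write()/amdgpu_fence_process() pair with a plain
dma_fence_signal() on the guilty fence. dma_fence_signal() returns -EINVAL
and does nothing if the fence has already signaled, which covers the case
of the guilty fence completing concurrently; whether the AMD-specific
bookkeeping in amdgpu_fence_process() is still needed here is exactly what
the thread is debating.

	/* Complete only the guilty fence instead of forcing the ring's
	 * seqno forward; the non-guilty fences signal naturally once their
	 * commands are re-emitted and processed.
	 */
	dma_fence_signal(&af->base);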
