On 6/5/25 15:21, Alex Deucher wrote:
>>> +	am_fence->start_ring_wptr = 0;
>>> +	am_fence->end_ring_wptr = 0;
>>
>> Why do we need the start here? I would just keep the end around and
>> then jump from fence to fence while re-submitting them.
>
> I need to know the start and end of the ring contents associated with
> each fence.  When I re-emit, I just copy over the ring contents for
> all fences that don't match the bad one.  Also we submit multiple
> fences per IB depending on whether we do a vm flush.  Those fences are
> internal to the IB frame so they don't really need a start and end,
> hence 0.
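(For illustration, the re-emit side of what Alex describes could look
roughly like the standalone sketch below. The model_fence type and the
backup_range() helper are hypothetical stand-ins rather than the
driver's API, and the fence array is assumed to already be in
submission order; see the ring-buffer caveat at the end of the thread.)

#include <stdbool.h>
#include <stdint.h>

struct model_fence {
	uint64_t start_ring_wptr;	/* ring->wptr before the IB frame */
	uint64_t end_ring_wptr;		/* ring->wptr after the IB frame  */
	bool signaled;
};

/* Stand-in for copying one wptr range out of the ring buffer. */
static void backup_range(uint64_t start, uint64_t end)
{
	(void)start;
	(void)end;
}

static void backup_unprocessed(struct model_fence *fences,
			       unsigned int count,
			       const struct model_fence *bad)
{
	unsigned int i;

	for (i = 0; i < count; i++) {
		struct model_fence *f = &fences[i];

		/* Skip the bad fence and everything already signaled. */
		if (f == bad || f->signaled)
			continue;
		/* Fences emitted inside an IB frame (e.g. for a VM
		 * flush) have no ring range of their own, hence
		 * start == end == 0.
		 */
		if (!f->end_ring_wptr)
			continue;
		backup_range(f->start_ring_wptr, f->end_ring_wptr);
	}
}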
What I wanted to do is the following:

ptr = bad_fence->end_ring_wptr;
for (i = (bad_fence->seq + 1) & fence_drv->mask;
     i != (bad_fence->seq & fence_drv->mask);
     i = (i + 1) & fence_drv->mask) {
	fence = fence_drv->fences[i];

	if (dma_fence_is_signaled(fence))
		break;

	if (!fence->end_ring_wptr)
		continue;

	if (fence->context != bad_fence->context)
		backup(ptr, fence->end_ring_wptr);

	ptr = fence->end_ring_wptr;
}

But it could be that it is better to back up start/end explicitly.

>>> +void amdgpu_fence_driver_seq_force_completion(struct amdgpu_ring *ring,
>>> +					      u32 seq)
>>
>> Better give the full fence structure here.
>
> You mean pass the fence directly?

Yes.

>
>>
>>> +{
>>> +	amdgpu_fence_write(ring, seq);
>>> +	amdgpu_fence_process(ring);
>>> +}
>>> +
>>>  /*
>>>   * Common fence implementation
>>>   */
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>> index 802743efa3b39..636941697a740 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>> @@ -126,7 +126,9 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
>>>  			   struct dma_fence **f)
>>>  {
>>>  	struct amdgpu_device *adev = ring->adev;
>>> +	u64 start_ring_wptr, end_ring_wptr;
>>>  	struct amdgpu_ib *ib = &ibs[0];
>>> +	struct amdgpu_fence *am_fence;
>>>  	struct dma_fence *tmp = NULL;
>>>  	bool need_ctx_switch;
>>>  	struct amdgpu_vm *vm;
>>> @@ -138,7 +140,6 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
>>>  	int vmid = AMDGPU_JOB_GET_VMID(job);
>>>  	bool need_pipe_sync = false;
>>>  	unsigned int cond_exec;
>>> -
>>>  	unsigned int i;
>>>  	int r = 0;
>>>  
>>> @@ -187,6 +188,7 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
>>>  		dev_err(adev->dev, "scheduling IB failed (%d).\n", r);
>>>  		return r;
>>>  	}
>>> +	start_ring_wptr = ring->wptr;
>>>  
>>>  	need_ctx_switch = ring->current_ctx != fence_ctx;
>>>  	if (ring->funcs->emit_pipeline_sync && job &&
>>> @@ -306,6 +308,15 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
>>>  
>>>  	amdgpu_ring_ib_end(ring);
>>>  	amdgpu_ring_commit(ring);
>>> +	/* This must be last for resets to work properly
>>> +	 * as we need to save the wptr associated with this
>>> +	 * fence.
>>> +	 */
>>> +	end_ring_wptr = ring->wptr;
>>> +	am_fence = container_of(*f, struct amdgpu_fence, base);
>>> +	am_fence->start_ring_wptr = start_ring_wptr;
>>> +	am_fence->end_ring_wptr = end_ring_wptr;
>>
>> The end_ring_wptr variable is superfluous and I would put assigning
>> that into a helper in amdgpu_fence.c.
>
> I'm not following.  I need the start and end wptrs in order to know
> what ranges of the ring I need to save.

But you have:

end_ring_wptr = ring->wptr;
...
am_fence->end_ring_wptr = end_ring_wptr;

start_ring_wptr is available as ring->wptr_old, btw.

>>> +void amdgpu_ring_backup_unprocessed_jobs(struct amdgpu_ring *ring,
>>> +					 bool is_guilty,
>>> +					 struct amdgpu_fence *bad_fence)
>>> +{
>>> +	struct amdgpu_fence *fence;
>>> +	struct dma_fence *old, **ptr;
>>> +	int i;
>>> +
>>> +	ring->ring_backup_entries_to_copy = 0;
>>> +	for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
>>
>> That is the wrong order for the fences, you need to start/end at the
>> last submitted one.
>
> I'm not sure I follow.  When I back up the ring contents, I need to go
> from oldest to newest so the order is correct when I re-emit.

Yeah, but 0 is not the oldest.  fence_drv->fences is a ring buffer!

You need to start with something like
fence_drv->fences[bad_fence->seq & mask].

Christian.

>
> Alex
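(To make the ring-buffer point concrete, a minimal standalone sketch of
walking the fence slots in submission order, starting right after the
bad fence's slot. The model_fence type and the buffer size here are
hypothetical; only the seq-and-mask indexing mirrors how
fence_drv->fences is addressed.)

#include <stdbool.h>
#include <stdint.h>

#define NUM_FENCES	8u		/* must be a power of two */
#define FENCE_MASK	(NUM_FENCES - 1u)

struct model_fence {
	uint64_t seq;			/* global sequence number */
	bool signaled;
};

/* fences[seq & FENCE_MASK] holds the fence with sequence number seq,
 * so slot 0 is not the oldest entry - the buffer wraps.
 */
static void walk_from_bad(struct model_fence *fences[NUM_FENCES],
			  const struct model_fence *bad)
{
	uint64_t i;

	for (i = (bad->seq + 1) & FENCE_MASK;
	     i != (bad->seq & FENCE_MASK);
	     i = (i + 1) & FENCE_MASK) {
		struct model_fence *f = fences[i];

		/* Slots past the newest submission hold old, already
		 * signaled fences (or nothing yet); stop there.
		 */
		if (!f || f->signaled)
			break;

		/* ... back up this fence's ring range here ... */
	}
}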