On Wed, May 28, 2025 at 9:57 AM Alex Deucher <alexdeuc...@gmail.com> wrote:
>
> On Wed, May 28, 2025 at 9:48 AM Christian König
> <christian.koe...@amd.com> wrote:
> >
> > On 5/28/25 15:38, Alex Deucher wrote:
> > > On Wed, May 28, 2025 at 7:40 AM Christian König
> > > <christian.koe...@amd.com> wrote:
> > >>
> > >> On 5/28/25 06:19, Alex Deucher wrote:
> > >>> Re-emit the unprocessed state after resetting the queue.
> > >>>
> > >>> Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
> > >>> ---
> > >>>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 19 ++++++++++++-------
> > >>>  1 file changed, 12 insertions(+), 7 deletions(-)
> > >>>
> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> > >>> index 3193eb88b6889..f6e04cf21abcc 100644
> > >>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> > >>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> > >>> @@ -9537,6 +9537,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> > >>>       struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
> > >>>       struct amdgpu_ring *kiq_ring = &kiq->ring;
> > >>>       unsigned long flags;
> > >>> +     unsigned int i;
> > >>>       u32 tmp;
> > >>>       u64 addr;
> > >>>       int r;
> > >>> @@ -9571,10 +9572,8 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> > >>>                                  SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), 0, 0xffffffff);
> > >>>       kiq->pmf->kiq_map_queues(kiq_ring, ring);
> > >>>       amdgpu_ring_commit(kiq_ring);
> > >>> -
> > >>> -     spin_unlock_irqrestore(&kiq->ring_lock, flags);
> > >>> -
> > >>>       r = amdgpu_ring_test_ring(kiq_ring);
> > >>
> > >> I don't think we should do a ring test on the KIQ here. That basically
> > >> doesn't tell us much and might cause additional problems.
> > >
> > > We need some way to wait for the KIQ submission to complete.  This is
> > > a simple way to accomplish that.
> >
> > Ok, that makes sense.
> >
> > >
> > >>
> > >>> +     spin_unlock_irqrestore(&kiq->ring_lock, flags);
> > >>>       if (r)
> > >>>               return r;
> > >>>
> > >>> @@ -9584,7 +9583,15 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring,
> > >>>               return r;
> > >>>       }
> > >>>
> > >>> -     return amdgpu_ring_test_ring(ring);
> > >>> +     if (amdgpu_ring_alloc(ring, 8 + ring->ring_backup_entries_to_copy))
> > >>> +             return -ENOMEM;
> > >>> +     amdgpu_ring_emit_fence(ring, ring->fence_drv.gpu_addr,
> > >>> +                            ring->ring_backup_seq, 0);
> > >>> +     for (i = 0; i < ring->ring_backup_entries_to_copy; i++)
> > >>> +             amdgpu_ring_write(ring, ring->ring_backup[i]);
> > >>> +     amdgpu_ring_commit(ring);
> > >>
> > >> I'm not sure if the commands are always relocatable. We should probably
> > >> just instruct the ring to re-start with the original RPTR/WPTR.
> > >>
> > >> That would also avoid the need to save/restore the ring content with the CPU.
> > >
> > > I tried that originally, but I couldn't make it work for a few reasons:
> > > 1. We need to have enforce isolation enabled, otherwise we almost
> > > always reset the wrong VMID, so when we execute the rest of the
> > > pipeline, we hang again.
> > > 2. When enforce isolation is enabled, we need to signal the fence
> > > associated with the guilty job first, otherwise we get stuck waiting on
> > > the pipeline sync when we execute the rest of the pipeline.
> >
> > So we execute the guilty job submission again? What prevents that one from
> > running?
>
> Without enforce isolation we usually end up resetting the wrong job.  I
> re-emit everything from the ring after the bad job.  What usually
> happens is that a game is running and then you start an app which
> hangs the GPU (e.g., an infinite loop in a shader).  Most of the time
> we reset the game rather than the bad app.  The scheduler times out on
> the app first and then we reset the queue and re-emit the following
> jobs, one of which is the bad app.  Queue reset fails and we fall back
> to a full adapter reset.  With enforce isolation, we reset the bad app
> instead.
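For reference, with the register programming and error paths trimmed, the
flow in this patch boils down to roughly the following.  The function name
below is made up purely for illustration; the ring_backup fields and all of
the calls are taken from the diff above.

/* Illustration only: condensed view of gfx_v10_0_reset_kgq() as changed by
 * this patch, with the per-VMID register setup and error handling omitted.
 * ring_backup, ring_backup_seq and ring_backup_entries_to_copy are the new
 * amdgpu_ring fields used by this series.
 */
static int reset_kgq_sketch(struct amdgpu_ring *ring, unsigned int vmid)
{
        struct amdgpu_device *adev = ring->adev;
        struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
        struct amdgpu_ring *kiq_ring = &kiq->ring;
        unsigned long flags;
        unsigned int i;
        int r;

        spin_lock_irqsave(&kiq->ring_lock, flags);
        /* ... emit the per-VMID reset and kiq_map_queues() on the KIQ ... */
        amdgpu_ring_commit(kiq_ring);
        /* The KIQ ring test is only a cheap way to wait for the KIQ
         * submission above to complete before dropping the lock. */
        r = amdgpu_ring_test_ring(kiq_ring);
        spin_unlock_irqrestore(&kiq->ring_lock, flags);
        if (r)
                return r;

        /* Signal the guilty job's fence, then replay everything the CP had
         * not processed yet from the CPU-side backup of the ring. */
        if (amdgpu_ring_alloc(ring, 8 + ring->ring_backup_entries_to_copy))
                return -ENOMEM;
        amdgpu_ring_emit_fence(ring, ring->fence_drv.gpu_addr,
                               ring->ring_backup_seq, 0);
        for (i = 0; i < ring->ring_backup_entries_to_copy; i++)
                amdgpu_ring_write(ring, ring->ring_backup[i]);
        amdgpu_ring_commit(ring);

        return gfx_v10_0_ring_test_ib(ring, AMDGPU_QUEUE_RESET_TIMEOUT);
}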
Thinking about it more, we'll always end up in a full adapter reset if
we have back-to-back bad apps.  I think what we need to do is emit the
seq number from the bad app, verify that it completes, and use that to
determine success in the reset function, then re-emit the follow-on
work and exit the function.  That way the reset succeeds, and if a
follow-on job times out later, we can repeat the process cleanly.  (See
the rough sketch at the bottom of this mail.)

Alex

>
> Alex
> >
> > Christian.
> >
> > >
> > > Alex
> > >
> > >>
> > >> Regards,
> > >> Christian.
> > >>
> > >>> +
> > >>> +     return gfx_v10_0_ring_test_ib(ring, AMDGPU_QUEUE_RESET_TIMEOUT);
> > >>>  }
> > >>>
> > >>>  static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
> > >>> @@ -9612,9 +9619,8 @@ static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
> > >>>       kiq->pmf->kiq_unmap_queues(kiq_ring, ring, RESET_QUEUES,
> > >>>                                  0, 0);
> > >>>       amdgpu_ring_commit(kiq_ring);
> > >>> -     spin_unlock_irqrestore(&kiq->ring_lock, flags);
> > >>> -
> > >>>       r = amdgpu_ring_test_ring(kiq_ring);
> > >>> +     spin_unlock_irqrestore(&kiq->ring_lock, flags);
> > >>>       if (r)
> > >>>               return r;
> > >>>
> > >>> @@ -9891,7 +9897,6 @@ static const struct amdgpu_ring_funcs gfx_v10_0_ring_funcs_gfx = {
> > >>>       .emit_wreg = gfx_v10_0_ring_emit_wreg,
> > >>>       .emit_reg_wait = gfx_v10_0_ring_emit_reg_wait,
> > >>>       .emit_reg_write_reg_wait = gfx_v10_0_ring_emit_reg_write_reg_wait,
> > >>> -     .soft_recovery = gfx_v10_0_ring_soft_recovery,
> > >>>       .emit_mem_sync = gfx_v10_0_emit_mem_sync,
> > >>>       .reset = gfx_v10_0_reset_kgq,
> > >>>       .emit_cleaner_shader = gfx_v10_0_ring_emit_cleaner_shader,
> > >>
> >
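Concretely, the idea above would look something like the completely
untested sketch below.  The function name is made up, I'm assuming
amdgpu_fence_wait_polling() (with a placeholder timeout) is an acceptable
way to wait in the reset path, and the ring_backup fields are the ones
from this series.

/* Untested sketch, not what the posted patch does: emit and wait for only
 * the guilty job's fence first, use that to decide whether the queue reset
 * actually worked, and only then replay the backed-up follow-on work.
 */
static int reset_kgq_verify_then_reemit(struct amdgpu_ring *ring)
{
        /* Placeholder timeout; amdgpu_fence_wait_polling() busy-waits in
         * roughly microsecond steps, so this is not in jiffies. */
        signed long timeout = 100000;
        unsigned int i;

        /* Step 1: signal only the guilty job's fence on the reset queue. */
        if (amdgpu_ring_alloc(ring, 8))
                return -ENOMEM;
        amdgpu_ring_emit_fence(ring, ring->fence_drv.gpu_addr,
                               ring->ring_backup_seq, 0);
        amdgpu_ring_commit(ring);

        /* Step 2: if that fence never signals, the per-queue reset did not
         * work and we should fall back to a full adapter reset. */
        if (!amdgpu_fence_wait_polling(ring, ring->ring_backup_seq, timeout))
                return -ETIMEDOUT;

        /* Step 3: the reset is known good; re-emit the follow-on work and
         * return success.  If one of those jobs is also bad, its own timeout
         * restarts this whole sequence instead of escalating straight to an
         * adapter reset. */
        if (amdgpu_ring_alloc(ring, ring->ring_backup_entries_to_copy))
                return -ENOMEM;
        for (i = 0; i < ring->ring_backup_entries_to_copy; i++)
                amdgpu_ring_write(ring, ring->ring_backup[i]);
        amdgpu_ring_commit(ring);

        return 0;
}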