AMD General

> -----Original Message-----
> From: Shetaia, Amir <[email protected]>
> Sent: Wednesday, May 13, 2026 1:29 PM
> To: Timur Kristóf <[email protected]>; Alex Deucher
> <[email protected]>
> Cc: [email protected]; Deucher, Alexander
> <[email protected]>; Koenig, Christian
> <[email protected]>; Marek Olšák <[email protected]>; Natalie
> Vock <[email protected]>; Melissa Wen <[email protected]>
> Subject: RE: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
>
> Hi Timur, Alex,
>
> Thanks for looping me in. Yes, we've been deep in NV4 (gfx1201) XNACK for
> the past few weeks and what you're describing on NV48 lines up closely with
> what we've seen.
>
> Quick highlights from my work:
>
> 1. IH retry CAM ACK doesn't actually free the slot when written via
> WDOORBELL on NV4; we have to use MMIO (WREG32_SOC15(OSSSYS, 0,
> regIH_RETRY_CAM_ACK, cam_index & 0x3ff)).
> I think you may want to check that, since "fault never resolves" is exactly
> the symptom you'd see if the CAM never gets cleared.
>
> 2. gfx12 needs its own retry-fault detection path: the
> amdgpu_gmc_handle_retry_fault check built on gfx9-era constants
> (AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY on src_data[1]) never
> matches on gfx12. We added a gfx12-native handler that reads from
> src_data[2] for NV4.
>
> 3. TLB flush making it worse is a known trap: on NV4 we see the same. The
> flush adds more pressure on the same UTC L2 that is already saturated by the
> retry storm, so the GCR can't drain. We have UMR captures showing GCVM_L2
> stuck busy on the user VMID with SDMA parked on a GCR ack.
>
> 4. Up to ~512 MiB our patches resolve faults cleanly; at 1 GiB we see random
> hangs that we've isolated to an SDMA -> GCR -> GC-cache deadlock when the
> BO-clear runs in ih_soft_work context.
>
> Could you reply with your series? I tried searching the inbox but couldn't
> find it. Once I have it, I can diff against ours to see what overlaps and
> what's net-new on each side.
>

Here's the patch series:
https://patchwork.freedesktop.org/series/166522/

Alex

> AMIR SHETAIA
> Senior Software Development Engineer  |  AMD Software Platform
> Architecture Team
> 1 Commerce Valley Drive, Markham, ON L3T 7X6  |  amd.com
>
>
>
>
> -----Original Message-----
> From: Timur Kristóf <[email protected]>
> Sent: Wednesday, May 13, 2026 12:43 PM
> To: Shetaia, Amir <[email protected]>; Alex Deucher
> <[email protected]>
> Cc: [email protected]; Deucher, Alexander
> <[email protected]>; Koenig, Christian
> <[email protected]>; Marek Olšák <[email protected]>; Natalie
> Vock <[email protected]>; Melissa Wen <[email protected]>
> Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
>
> On Wednesday, May 13, 2026 6:36:02 PM Central European Summer Time
> Alex Deucher wrote:
> > + Amir
> >
> > Amir may have some insights on navi4x as he was looking at this recently.
> >
> > Alex
>
> Hi Alex, Amir,
>
> I think we are very close to enabling retry faults by default on Navi 3.
> I'd be happy to receive feedback on the above series.
>
> With regards to Navi 4:
>
> I also attempted to get it working on Navi 48, and I managed to get retry
> faults enabled, but it seems that amdgpu_vm_handle_fault() can't actually
> resolve the page fault on Navi 48. It just keeps retrying until it times out.
> Christian suggested this may be due to an invalid page being stuck in the
> cache. I tried adding a TLB flush but unfortunately that just made it worse
> (it hangs irrecoverably).
>
> Any insight is appreciated!
>
> Thanks & best regards,
> Timur
>
> >
> > On Wed, May 13, 2026 at 12:30 PM Timur Kristóf
> > <[email protected]> wrote:
> > > Fix some issues regarding retry fault handling, such as enabling the
> > > retry fault interrupt (necessary for retry faults to work).
> > >
> > > Improve retry faults on Navi 3 dGPUs by enabling the filter CAM,
> > > which can filter the repeated page fault interrupts that happen when
> > > retry faults are enabled, making the handling more efficient.
> > >
> > > With this series, the kernel is able to mitigate most page faults on
> > > Navi 3 without causing a hang and without a need to reset the GPU,
> > > when the amdgpu.noretry=0 module parameter is set.
> > >
> > > Timur Kristóf (6):
> > >   drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly
> > >   drm/amdgpu/gfxhub: Enable retry fault interrupts when needed
> > >   drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed
> > >   drm/amdgpu/gmc: Don't compare page fault timestamps with other
> > >     interrupts
> > >   drm/amdgpu/ih: Add retry_cam_ack IH function pointer
> > >   drm/amdgpu: Enable retry CAM on Navi 3 dGPUs
> > >
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c     |  7 +++++--
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h     |  1 +
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h      |  1 +
> > >  drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c | 17 ++++++++++-------
> > >  drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c   | 17 ++++++++++-------
> > >  drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c   | 19 +++++++++++--------
> > >  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c    | 15 +++++++++------
> > >  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c    | 15 +++++++++------
> > >  drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c    | 15 +++++++++------
> > >  drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c    | 15 +++++++++------
> > >  drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c    | 17 ++++++++++-------
> > >  drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c  | 17 ++++++++++-------
> > >  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c      |  5 ++++-
> > >  drivers/gpu/drm/amd/amdgpu/ih_v6_0.c        | 18 +++++++++++++++++-
> > >  drivers/gpu/drm/amd/amdgpu/ih_v7_0.c        |  6 ++++++
> > >  drivers/gpu/drm/amd/amdgpu/mmhub_v3_0.c     |  2 +-
> > >  drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_1.c   |  2 +-
> > >  drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_2.c   |  2 +-
> > >  drivers/gpu/drm/amd/amdgpu/mmhub_v3_3.c     |  2 +-
> > >  drivers/gpu/drm/amd/amdgpu/mmhub_v4_1_0.c   |  2 +-
> > >  drivers/gpu/drm/amd/amdgpu/mmhub_v4_2_0.c   |  2 +-
> > >  drivers/gpu/drm/amd/amdgpu/vega20_ih.c      |  8 +++++++-
> > >  22 files changed, 134 insertions(+), 71 deletions(-)
> > >
> > > --
> > > 2.54.0
>
>
>
>
