Hi Timur, Alex,

Thanks for looping me in. Yes, we've been deep in NV4 (gfx1201) XNACK for the past few weeks, and what you're describing on NV48 lines up closely with what we've seen.

Quick highlights from my work:

1. IH retry CAM ACK doesn't actually free the slot when written via WDOORBELL on NV4; we have to use MMIO (WREG32_SOC15(OSSSYS, 0, regIH_RETRY_CAM_ACK, cam_index & 0x3ff)). I think you may want to check that, since "fault never resolves" is exactly the symptom you'd see if the CAM never gets cleared.

2. gfx12 needs its own retry-fault detection path: amdgpu_gmc_handle_retry_fault gated on gfx9-era constants (AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY on src_data[1]) never matches on gfx12. We added a gfx12-native handler that reads from src_data[2] for NV4.

3. TLB flush making it worse is a known trap; on NV4 we see the same. The flush adds more pressure on the same UTC L2 that is already saturated by the retry storm, so the GCR can't drain. We have UMR captures showing GCVM_L2 stuck busy on the user VMID with SDMA parked on a GCR ack.

4. Up to ~512 MiB our patches resolve faults cleanly; at 1 GiB we see random hangs that we've isolated to an SDMA -> GCR -> GC-cache deadlock when the BO-clear runs in ih_soft_work context.

Could you reply with your series? I tried searching the inbox but couldn't find it. Once I have it, I can diff against ours to see what overlaps and what's net-new on each side.
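
To make points 1 and 2 above concrete, here is a minimal standalone sketch (plain userspace C, not the actual NV4 patch). It only models which src_data word is tested and how the ACK value is masked. The struct and function names are hypothetical, the 0x80 value for the gfx9-era retry flag is quoted from memory of amdgpu_gmc.h and should be treated as an assumption, and the bit tested in src_data[2] is a placeholder since the real gfx12 layout isn't stated in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct amdgpu_iv_entry; only src_data is modeled. */
struct iv_entry_sketch {
	uint32_t src_data[4];
};

/* Assumed value of AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY (tested on src_data[1]). */
#define SKETCH_GMC9_RETRY_FLAG	0x80u

/* PLACEHOLDER: the real gfx12 retry bit in src_data[2] is not given in the thread. */
#define SKETCH_GFX12_RETRY_FLAG	0x80u

/* gfx9-era check: looks for the retry flag in src_data[1]. */
static bool gfx9_style_is_retry(const struct iv_entry_sketch *entry)
{
	assert(entry != NULL);
	return (entry->src_data[1] & SKETCH_GMC9_RETRY_FLAG) != 0;
}

/* gfx12-style check as described in point 2: looks in src_data[2] instead. */
static bool gfx12_style_is_retry_sketch(const struct iv_entry_sketch *entry)
{
	assert(entry != NULL);
	return (entry->src_data[2] & SKETCH_GFX12_RETRY_FLAG) != 0;
}

/* Value written to IH_RETRY_CAM_ACK in point 1: the CAM index masked to 10 bits. */
static uint32_t retry_cam_ack_value(uint32_t cam_index)
{
	return cam_index & 0x3ffu;
}
```

With an entry whose retry indication sits only in src_data[2], the gfx9-style check returns false while the gfx12-style one returns true, which is the "never matches" behaviour described in point 2. In the driver the masked value would go out through the MMIO write quoted in point 1 rather than through the doorbell.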

AMIR SHETAIA
Senior Software Development Engineer | AMD
Software Platform Architecture Team
1 Commerce Valley Drive, Markham, ON L3T 7X6

-----Original Message-----
From: Timur Kristóf <[email protected]>
Sent: Wednesday, May 13, 2026 12:43 PM
To: Shetaia, Amir <[email protected]>; Alex Deucher <[email protected]>
Cc: [email protected]; Deucher, Alexander <[email protected]>; Koenig, Christian <[email protected]>; Marek Olšák <[email protected]>; Natalie Vock <[email protected]>; Melissa Wen <[email protected]>
Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling

On Wednesday, May 13, 2026 6:36:02 PM Central European Summer Time Alex Deucher wrote:
> + Amir
>
> Amir may have some insights on navi4x as he was looking at this recently.
>
> Alex

Hi Alex, Amir,

I think we are very close to enabling retry faults by default on Navi 3.
I'd be happy to receive feedback on the above series.

With regards to Navi 4:
I also attempted to get it working on Navi 48, and I managed to get retry
faults enabled, but it seems that amdgpu_vm_handle_fault() can't actually
resolve the page fault on Navi 48. It just keeps retrying until it times out.
Christian suggested this may be due to an invalid page being stuck in the
cache. I tried adding a TLB flush but unfortunately that just made it worse
(it hangs irrecoverably).

Any insight is appreciated!

Thanks & best regards,
Timur

>
> On Wed, May 13, 2026 at 12:30 PM Timur Kristóf
> <[email protected]> wrote:
> > Fix some issues regarding retry fault handling, such as enabling the
> > retry fault interrupt (necessary for retry faults to work) and such.
> >
> > Improve retry faults on Navi 3 dGPUs by enabling the filter CAM,
> > which can filter the repeated page fault interrupts that happen when
> > retry faults are enabled, making the handling more efficient.
> >
> > With this series, the kernel is able to mitigate most page faults on
> > Navi 3 without causing a hang and without a need to reset the GPU,
> > when the amdgpu.noretry=0 module parameter is set.
> >
> > Timur Kristóf (6):
> >   drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly
> >   drm/amdgpu/gfxhub: Enable retry fault interrupts when needed
> >   drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed
> >   drm/amdgpu/gmc: Don't compare page fault timestamps with other
> >     interrupts
> >   drm/amdgpu/ih: Add retry_cam_ack IH function pointer
> >   drm/amdgpu: Enable retry CAM on Navi 3 dGPUs
> >
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c     |  7 +++++--
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h     |  1 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h      |  1 +
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c | 17 ++++++++++-------
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c   | 17 ++++++++++-------
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c   | 19 +++++++++++--------
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c    | 15 +++++++++------
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c    | 15 +++++++++------
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c    | 15 +++++++++------
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c    | 15 +++++++++------
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c    | 17 ++++++++++-------
> >  drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c  | 17 ++++++++++-------
> >  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c      |  5 ++++-
> >  drivers/gpu/drm/amd/amdgpu/ih_v6_0.c        | 18 +++++++++++++++++-
> >  drivers/gpu/drm/amd/amdgpu/ih_v7_0.c        |  6 ++++++
> >  drivers/gpu/drm/amd/amdgpu/mmhub_v3_0.c     |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_1.c   |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_2.c   |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/mmhub_v3_3.c     |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/mmhub_v4_1_0.c   |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/mmhub_v4_2_0.c   |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/vega20_ih.c      |  8 +++++++-
> >  22 files changed, 134 insertions(+), 71 deletions(-)
> >
> > --
> > 2.54.0
