amdgpu: Improve retry fault handling

Shetaia, Amir Wed, 13 May 2026 13:34:50 -0700

AMD General

Hi Timur and Alex,


Thanks for sending the series.

Timur, you are right: your patch 6 already does the MMIO ACK for 
gmc_v11_0/ih_v6_0, and I missed that. The gap is only in patch 5's ih_v7_0 
implementation, which still uses WDOORBELL. That's where I'd suggest swapping 
in MMIO for NV4.

Some answers to your questions:

1. "Fault never resolves on NV48" has a different shape from our 
broken-CAM-ACK symptom:

You're right, those are different. Our symptom, where the CAM index walks 
monotonically, only shows up when CAM is enabled but the ACK is broken.
On your NV48 setup CAM probably isn't enabled at all (your patch 6 only enables 
it in ih_v6_0_irq_init, with no equivalent in ih_v7_0_irq_init), so retries fire 
repeatedly on the IH ring instead of being deduped by CAM.
That matches what you're seeing: amdgpu_vm_handle_fault keeps being called, 
but each call is on a fresh IRQ for the same address.
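
The three cases discussed here (CAM disabled, CAM enabled with a broken ACK, 
CAM enabled with a working ACK) can be sketched with a toy userspace model. 
This is not driver code: the slot count, the toy_* names, and the "full CAM 
drops the event" behaviour are assumptions made only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CAM_SLOTS 16  /* hypothetical size, not the hardware value */

struct toy_cam {
    bool     enabled;
    bool     used[CAM_SLOTS];
    uint64_t page[CAM_SLOTS];
};

/* Returns true if the retry fault would reach the IRQ handler. */
static bool toy_cam_fault(struct toy_cam *cam, uint64_t page, int *slot_out)
{
    int free_slot = -1;

    *slot_out = -1;
    if (!cam->enabled)
        return true;            /* no dedup: every retry is delivered */

    for (int i = 0; i < CAM_SLOTS; i++) {
        if (cam->used[i] && cam->page[i] == page)
            return false;       /* already tracked: filtered out */
        if (!cam->used[i] && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return false;           /* assumption: a full CAM drops the event */

    cam->used[free_slot] = true;
    cam->page[free_slot] = page;
    *slot_out = free_slot;
    return true;
}

/* A working ACK frees the slot so the next fault on that page is seen. */
static void toy_cam_ack(struct toy_cam *cam, int slot)
{
    assert(slot >= 0 && slot < CAM_SLOTS);
    cam->used[slot] = false;
}

/* Count delivered IRQs for n back-to-back retries on a single page. */
static int toy_deliveries(bool cam_enabled, bool ack_works, int n)
{
    struct toy_cam cam = { .enabled = cam_enabled };
    int delivered = 0, slot;

    for (int i = 0; i < n; i++) {
        if (!toy_cam_fault(&cam, 0x1000, &slot))
            continue;
        delivered++;
        if (cam_enabled && ack_works)
            toy_cam_ack(&cam, slot);
    }
    return delivered;
}
```

In this model, a disabled CAM delivers every retry (the NV48 shape), while an 
enabled CAM with a broken ACK delivers only the first fault per page and 
filters the rest (the broken-CAM-ACK shape).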

Two things that could be happening underneath:
- The fault handler runs but the updated PTE never reaches UTC L0 (TLB 
invalidation gap). On NV4 we see this as "valid PTEs failing to translate" in 
our UMR captures.
- Or amdgpu_vm_handle_fault is bailing early without actually fixing the mapping.

Quickest discriminator: enable the CAM in ih_v7_0_irq_init (set 
IH_RETRY_INT_CAM_CNTL.ENABLE=1, CAM_SIZE=0xF, 
adev->irq.retry_cam_enabled=true), use MMIO ACK from gmc_v12_0, and see if the 
symptom changes from "infinite retries" to "first batch of pages map, then it 
hangs after a few hundred."

2. What bits we check on src_data[2]:

Honestly, we don't use src_data[2] for retry detection. We use it only for the 
cam_index:

    cam_index = entry->src_data[2] & 0x3ff;   /* low 10 bits = CAM slot */
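
A minimal standalone sketch of that extraction, assuming a stand-in entry 
struct (toy_iv_entry is hypothetical and only mirrors the src_data array of 
the real IV entry):

```c
#include <assert.h>
#include <stdint.h>

#define CAM_INDEX_MASK 0x3ffu   /* low 10 bits, as stated above */

struct toy_iv_entry {           /* hypothetical stand-in for the IV entry */
    uint32_t src_data[4];
};

/* Extract the CAM slot; upper bits of src_data[2] are ignored. */
static uint32_t toy_cam_index(const struct toy_iv_entry *entry)
{
    assert(entry != 0);
    return entry->src_data[2] & CAM_INDEX_MASK;
}
```

The masked value is what would then be written to the ACK register over MMIO.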

For retry detection we initially used the gfx9 constant on src_data[1] like 
you, but observed the bit cleared on a lot of NV4 events that should have been 
retries (waves were hung in xnack-stall but no IH event matched).
So we just go through the retry path unconditionally on NV4 and let 
amdgpu_vm_handle_fault sort it out via SVM range migration. This may be 
specific to gfx1201 / our test path.
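
The two detection policies can be contrasted in a short sketch. The 0x80 
value for the retry bit is an assumption (believed to match 
AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY, but not verified here), and the toy_* 
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value of the gfx9-era retry bit on src_data[1]; unverified. */
#define TOY_FAULT_SOURCE_DATA_RETRY 0x80u

enum toy_retry_policy {
    TOY_RETRY_CHECK_BIT,      /* gfx9-style: trust the bit in src_data[1] */
    TOY_RETRY_UNCONDITIONAL,  /* NV4 workaround described above */
};

/* Decide whether a fault entry is routed to the retry path. */
static bool toy_take_retry_path(uint32_t src_data1, enum toy_retry_policy p)
{
    if (p == TOY_RETRY_UNCONDITIONAL)
        return true;
    return (src_data1 & TOY_FAULT_SOURCE_DATA_RETRY) != 0;
}
```

With the unconditional policy, an event whose retry bit is cleared still 
reaches the fault handler, which is the behaviour the workaround relies on.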

3. TLB flush making it worse, and any clue about what to do:

Honest answer: not really, not a SW-only fix. Our 1 GiB hang is an 
architectural deadlock ... ih_soft_work blocks on a dma_fence for an SDMA 
BO-clear, the BO-clear is stalled on a GCR (cache flush) request,
and the GC cache block isn't ACK'ing the GCR while UTC L2 is saturated by the 
user shader's XNACK retry storm. Adding a TLB flush adds another translation 
request to the same saturated UTC, which is why it makes things worse.

4. IH1 ring on NV4:

Same as you ... retry faults on NV4 always come in on IH0. We delegate from IH0 
to ih.ring_soft (amdgpu_irq_delegate(adev, entry, 8)) so the SVM/migration path 
can sleep, but the original entry is on IH0. We haven't tried IH1 routing.

Re your branch: thanks for the gitlab link, easier than digging through 
patchwork.
I'll cherry-pick patches 1, 3, 4 into our test build to see if patch 4 cleans 
up the timestamp filter delta we're seeing (97k entered / 2.8k completed at 1 
GiB might be partly explained by your Strix Halo bug).

AMIR SHETAIA



-----Original Message-----
From: Timur Kristóf <[email protected]>
Sent: Wednesday, May 13, 2026 1:52 PM
To: Alex Deucher <[email protected]>; Shetaia, Amir <[email protected]>
Cc: [email protected]; Deucher, Alexander 
<[email protected]>; Koenig, Christian <[email protected]>; 
Marek Olšák <[email protected]>; Natalie Vock <[email protected]>; Melissa Wen 
<[email protected]>
Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling


Hi Amir,

Thanks for the quick response!
See my replies below.

On Wednesday, May 13, 2026 7:28:41 PM Central European Summer Time Shetaia,
>
> Thanks for looping me in. Yes, we've been deep in NV4 (gfx1201) XNACK
> for the past few weeks and what you're describing on NV48 lines up
> closely with what we've seen

> Quick highlights from my work:
>
> 1. IH retry CAM ACK doesn't actually free the slot when written via
> WDOORBELL on NV4 .. we have to use MMIO (WREG32_SOC15(OSSSYS, 0,
> regIH_RETRY_CAM_ACK, cam_index & 0x3ff)).

I agree. That's my conclusion as well and that's exactly what I'm doing in my 
series for Navi 31, see the following patch:
"drm/amdgpu: Enable retry CAM on Navi 3 dGPUs"

> "fault never resolves" is exactly the symptom you'd see if the CAM
> never gets cleared.

Not exactly.

When the CAM never gets cleared, the first page fault is still resolved, but 
subsequent page faults (that belong to the same CAM entry) will cause a hang 
because the IRQ handler is not called (because the IRQ is filtered out).

That's not what I see on Navi 48. Instead what I see is that the IRQ is fired 
repeatedly and amdgpu_vm_handle_fault() is called repeatedly, but just doesn't 
resolve the fault.

> 2. gfx12 needs its own retry-fault detection path ..
> amdgpu_gmc_handle_retry_fault on gfx9-era constants
> (AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY on src_data[1]) never matches on
> gfx12. We added a gfx12-native handler that reads from src_data[2] for NV4.

Interesting. Could you share what bits you checked on src_data[2]?

The gfx9-era constants worked for me on both Navi 31 and 48 for detecting retry 
faults; however I needed to program some extra register fields in the gfxhub 
code to actually enable retry fault interrupts.

>
> 3. TLB flush making it worse is a known trap .. on NV4 we see the
> same. The flush adds more pressure on the same UTC L2 already
> saturated by the retry storm; the GCR can't drain. We have UMR
> captures showing GCVM_L2 stuck busy on the user VMID with SDMA parked
> on a GCR ack.

I am pretty sure this is what I saw.
Do you have any clue about what can be done about this?

> 4. Up to ~512 MiB our patches resolve faults cleanly;

That's pretty impressive! Nice work!

> at 1 GiB we see random
> hangs that we've isolated to an SDMA -> GCR -> GC-cache deadlock when
> the BO-clear runs in ih_soft_work context.

Actually something I forgot to ask: on Navi 4x is it possible to use the IH1 
ring? On my machine it seemed that the retry fault interrupts always come in on 
the IH0 ring even though the IH1 is enabled and configured upstream already.

> Could you reply with your series? I tried searching the inbox but
> couldn't find it. Once I have it, I can diff against ours to see what
> overlaps and what's net-new on each side.

You can view it on patchwork or the mailing list archives:
https://patchwork.freedesktop.org/series/166522/
https://lists.freedesktop.org/archives/amd-gfx/2026-May/thread.html#144500

Or if that's more comfortable for you, here is my GitLab branch:
https://gitlab.freedesktop.org/Venemo/linux/-/commits/ven_retry_faults

Thanks & best regards,
Timur
