On Thu, Jan 22, 2026 at 12:07 AM Lazar, Lijo <[email protected]> wrote:
>
>
>
> On 21-Jan-26 11:54 PM, Alex Deucher wrote:
> > From: Jon Doron <[email protected]>
> >
> > On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and
> > ih2 interrupt ring buffers are not initialized. This is by design, as
> > these secondary IH rings are only available on discrete GPUs. See
> > vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when
> > AMD_IS_APU is set.
> >
> > However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to
> > get the timestamp of the last interrupt entry. When retry faults are
> > enabled on APUs (noretry=0), this function is called from the SVM page
> > fault recovery path, resulting in a NULL pointer dereference when
> > amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[].
> >
> > The crash manifests as:
> >
> >    BUG: kernel NULL pointer dereference, address: 0000000000000004
> >    RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu]
> >    Call Trace:
> >     amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu]
> >     svm_range_restore_pages+0xae5/0x11c0 [amdgpu]
> >     amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu]
> >     gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu]
> >     amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu]
> >     amdgpu_ih_process+0x84/0x100 [amdgpu]
> >
> > This issue was exposed by commit 1446226d32a4 ("drm/amdgpu: Remove GC HW
> > IP 9.3.0 from noretry=1") which changed the default for Renoir APU from
> > noretry=1 to noretry=0, enabling retry fault handling and thus
> > exercising the buggy code path.
> >
> > Fix this by adding a check for ih1.ring_size before attempting to use
> > it. Also restore the soft_ih support from commit dd299441654f ("drm/amdgpu:
> > Rework retry fault removal").  This is needed if the hardware doesn't
> > support secondary HW IH rings.
> >
> > v2: additional updates (Alex)
> >
> > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814
> > Fixes: dd299441654f ("drm/amdgpu: Rework retry fault removal")
> > Cc: [email protected]
> > Signed-off-by: Jon Doron <[email protected]>
> > Signed-off-by: Alex Deucher <[email protected]>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 7 ++++++-
> >   1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > index 8e65fec9f534e..243d75917458a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > @@ -498,8 +498,13 @@ void amdgpu_gmc_filter_faults_remove(struct 
> > amdgpu_device *adev, uint64_t addr,
> >
> >       if (adev->irq.retry_cam_enabled)
> >               return;
> > +     else if (adev->irq.ih1.ring_size)
> > +             ih = &adev->irq.ih1;
> > +     else if (adev->irq.ih_soft.enabled)
> > +             ih = &adev->irq.ih_soft;
>
> Faults are delegated to soft ring when retry_cam is enabled -
> https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c#L541
>
> That matches with the original logic in d299441654f ("drm/amdgpu: Rework
> retry fault removal").
>
> To match exactly with the logic in above commit, I think it should use
> soft ring only when retry cam is enabled. Presently, it's returning
> without doing anything.

+ Philip

That logic was changed in:
commit e61801f162ddcf8874c820639483ec4849b0fb0b
Author: Philip Yang <[email protected]>
Date:   Mon Aug 28 14:05:55 2023 -0400

    drm/amdkfd: Don't use sw fault filter if retry cam enabled

    If retry cam enabled, we don't use sw retry fault filter and add fault
    into sw filter ring, so we shouldn't remove fault from sw filter.

    Signed-off-by: Philip Yang <[email protected]>
    Reviewed-by: Christian König <[email protected]>
    Signed-off-by: Alex Deucher <[email protected]>

So I retained that logic.

Alex

>
> Thanks,
> Lijo
>
> > +     else
> > +             return;
> >
> > -     ih = &adev->irq.ih1;
> >       /* Get the WPTR of the last entry in IH ring */
> >       last_wptr = amdgpu_ih_get_wptr(adev, ih);
> >       /* Order wptr with ring data. */
>

Reply via email to