On Tue, Sep 23, 2025 at 7:42 PM Rodrigo Siqueira <[email protected]> wrote:
>
> On 09/23, Alex Deucher wrote:
> > On Tue, Sep 23, 2025 at 5:12 PM Rodrigo Siqueira <[email protected]> wrote:
> > >
> > > When trying to unload amdgpu in the SteamDeck (TTY mode), the following
> > > set of errors happens and the system gets unstable:
> > >
> > > [..]
> > >  [drm] Initialized amdgpu 3.64.0 for 0000:04:00.0 on minor 0
> > >  amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx_0.0.0 (-110).
> > >  amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110).
> > > [..]
> > >  amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
> > >  amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
> > >  amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
> > >  amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
> > > [..]
> > >
> > > When the driver initializes the GPU, the PSP validates all the firmware
> > > loaded, and after that, it is not possible to load any other firmware
> > > unless the device is reset. What is happening in the load/unload
> > > situation is that PSP halts the GC engine because it suspects that
> > > something is amiss. To address this issue, this commit ensures that the
> > > GPU is reset (mode 2 reset) in the load/unload sequence.
> > >
> > > Suggested-by: Alex Deucher <[email protected]>
> > > Signed-off-by: Rodrigo Siqueira <[email protected]>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/nv.c | 7 +++++++
> > >  1 file changed, 7 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
> > > index 50e77d9b30af..1964aa37c499 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/nv.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
> > > @@ -543,6 +543,13 @@ static bool nv_need_reset_on_init(struct amdgpu_device *adev)
> > >  {
> > >         u32 sol_reg;
> > >
> > > +       /* GFX in the SteamDeck hangs when amdgpu module is reloaded, since the
> > > +        * firmware is already loaded. To avoid this issue, ensure that the
> > > +        * device is reset to put the PSP in a good state.
> > > +        */
> > > +       if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(10, 3, 1))
> > > +               return true;
> >
> > This will force a reset every time the driver loads.  That will add a
> > lot of latency to the driver load sequence.  I think it would be
> > better to reset on unload or add a check to see if CP firmware is
> > already loaded here so we only reset if the driver has been previously
> > loaded.
>
> Hi Alex,
>
> Thanks for the feedback.
>
> First, I tried to call amdgpu_asic_reset() in amdgpu_pci_remove(), and
> then in amdgpu_device_fini_hw(). Something like this:
>
> r = amdgpu_asic_reset(adev); // mode 2

Where did you call it?  It should be after we call hw_fini for all of the IPs.

>
> However, the situation worsened, causing a hang followed by the
> SteamDeck fan to spin really fast, and then the system shut down. In
> this sense, do you have any suggestions on which stage I should invoke
> the GPU reset in the unload phase? It feels like amdgpu_device_fini_hw()
> and amdgpu_pci_remove() are already too late to invoke the GPU reset. Or
> maybe the reset operation that I used was not the correct one?

If amdgpu_asic_need_reset_on_init() returns true, we end up calling
amdgpu_asic_reset() so it's the same code path.  In the init case, the
device should be in the same state as doing it at the end of
pci_remove because it's the first thing we do at init time.  The only
restriction would be that you need to execute it before we unmap the
MMIO BAR because it requires MMIO access.
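
To make the ordering concrete, here is a rough user-space model of the
unload sequence described above: hw_fini for all of the IPs first, then
the reset, and only then the MMIO unmap.  This is not amdgpu code.  Every
name in it (model_dev, model_hw_fini_all, model_asic_reset,
model_unmap_mmio, model_unload) and both error values are made up for
illustration, and the real driver may structure this differently.

```c
/*
 * User-space model of the unload ordering; NOT amdgpu code.
 * All names and error values below are hypothetical.
 */
#include <stdbool.h>

struct model_dev {
	bool ips_running;	/* IP blocks still active (hw_fini not done) */
	bool mmio_mapped;	/* MMIO BAR is still mapped */
	bool was_reset;		/* a reset has completed */
};

/* Stands in for calling hw_fini on every IP block. */
static void model_hw_fini_all(struct model_dev *d)
{
	d->ips_running = false;
}

/*
 * Stands in for the ASIC reset. It needs register access, so it
 * fails once the BAR is gone, and it is only expected to run after
 * the IP blocks have been shut down.
 */
static int model_asic_reset(struct model_dev *d)
{
	if (!d->mmio_mapped)
		return -1;	/* too late: no MMIO access */
	if (d->ips_running)
		return -2;	/* too early: hw_fini not done yet */
	d->was_reset = true;
	return 0;
}

/* Stands in for unmapping the MMIO BAR. */
static void model_unmap_mmio(struct model_dev *d)
{
	d->mmio_mapped = false;
}

/* Unload path: hw_fini first, reset second, unmap last. */
static int model_unload(struct model_dev *d)
{
	int r;

	model_hw_fini_all(d);
	r = model_asic_reset(d);
	model_unmap_mmio(d);
	return r;
}
```

In this model, calling model_asic_reset() before model_hw_fini_all() or
after model_unmap_mmio() returns an error, which is the window the reset
has to fit into on unload.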

Alex

>
> Thanks
>
> >
> > Alex
> >
> > > +
> > >         if (adev->flags & AMD_IS_APU)
> > >                 return false;
> > >
> > > --
> > > 2.51.0
> > >
>
> --
> Rodrigo Siqueira
