On 10/16, Alex Deucher wrote:
> On Thu, Oct 16, 2025 at 5:00 PM Rodrigo Siqueira <[email protected]> wrote:
> >
> > When trying to unload amdgpu in the SteamDeck (TTY mode), the following
> > set of errors happens and the system gets unstable:
> >
> > [..]
> > [drm] Initialized amdgpu 3.64.0 for 0000:04:00.0 on minor 0
> > amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test
> > failed on gfx_0.0.0 (-110).
> > amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110).
> > [..]
> > amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command:
> > SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
> > amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
> > amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command:
> > SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
> > amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
> > [..]
> >
> > When the driver initializes the GPU, the PSP validates all the firmware
> > loaded, and after that, it is not possible to load any other firmware
> > unless the device is reset. What is happening in the load/unload
> > situation is that PSP halts the GC engine because it suspects that
> > something is amiss. To address this issue, this commit ensures that the
> > GPU is reset (mode 2 reset) in the unload sequence.
> >
> > Suggested-by: Alex Deucher <[email protected]>
> > Signed-off-by: Rodrigo Siqueira <[email protected]>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 ++++++++++++-
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 0d5585bc3b04..78009b93855b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -3613,7 +3613,7 @@ static void amdgpu_device_smu_fini_early(struct
> > amdgpu_device *adev)
> >
> >  static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
> >  {
> > -       int i, r;
> > +       int i, r, current_reset_method;
> >
> >         for (i = 0; i < adev->num_ip_blocks; i++) {
> >                 if (!adev->ip_blocks[i].version->funcs->early_fini)
> > @@ -3649,6 +3649,17 @@ static int amdgpu_device_ip_fini_early(struct
> > amdgpu_device *adev)
> >                                 "failed to release exclusive mode on
> > fini\n");
> >         }
> >
> > +       /* Reset the device before entirely removing it to avoid load issues
> > +        * caused by firmware validation.
> > +        */
> > +       current_reset_method = amdgpu_reset_method;
> > +       amdgpu_reset_method = AMD_RESET_METHOD_MODE2;
>
> This would only be needed if the user has overridden the reset method
> via a kernel module parameter. If they've done that they get to keep
> the pieces. MODE2 reset is only used on certain chips so this won't
> work generally. Better to just drop this. amdgpu_asic_reset() will
> automatically default to the right reset method for the chip.
> Alternative is to set AMD_RESET_METHOD_NONE which is the automatic
> setting.
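
For reference, here is a minimal standalone sketch of the point above as I
understand it: when no override is set, the reset path picks a per-chip
default, so forcing mode 2 globally only helps on chips where mode 2 is
already the default. Every mock_* name, the two chip types, and the
chip-to-method mapping are assumptions made up for illustration; this is not
the driver's code.

```c
#include <stdio.h>

/* Illustrative stand-ins for the reset methods discussed in the thread.
 * MOCK_RESET_NONE plays the role of the "automatic" setting.
 */
enum mock_reset_method {
	MOCK_RESET_NONE = -1,
	MOCK_RESET_MODE1 = 1,
	MOCK_RESET_MODE2 = 2,
};

/* Hypothetical chips: A defaults to mode 2, B defaults to mode 1. */
enum mock_chip {
	MOCK_CHIP_A,
	MOCK_CHIP_B,
};

/* Stand-in for a user-settable module parameter; automatic by default. */
int mock_reset_method_param = MOCK_RESET_NONE;

/* Per-chip default, chosen when no override is in effect. */
enum mock_reset_method mock_chip_default(enum mock_chip chip)
{
	return chip == MOCK_CHIP_A ? MOCK_RESET_MODE2 : MOCK_RESET_MODE1;
}

/* An explicit override wins; otherwise fall back to the chip default. */
enum mock_reset_method mock_pick_reset_method(enum mock_chip chip)
{
	if (mock_reset_method_param != MOCK_RESET_NONE)
		return (enum mock_reset_method)mock_reset_method_param;

	return mock_chip_default(chip);
}

/* Small helper to print what would be selected for a chip. */
void mock_report(const char *name, enum mock_chip chip)
{
	printf("%s: method %d\n", name, (int)mock_pick_reset_method(chip));
}
```

With the parameter left at MOCK_RESET_NONE, chip A gets mode 2 and chip B
gets mode 1. Forcing MOCK_RESET_MODE2 changes nothing for chip A and picks
a method chip B does not default to, which is why the override can simply
be dropped.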

I'll send a V3 without the method mode 2 setup.

Thanks a lot

>
> Alex
>
> > +       r = amdgpu_asic_reset(adev);
> > +       if (r)
> > +               dev_err(adev->dev, "asic reset on %s failed\n", __func__);
> > +
> > +       amdgpu_reset_method = current_reset_method;
> > +
> >         return 0;
> >  }
> >
> > --
> > 2.51.0
> >

--
Rodrigo Siqueira
