On 28-Jan-26 11:53 AM, Perry Yuan wrote:
Add a full memory barrier after clearing no_hw_access in amdgpu_device_mode1_reset() so subsequent PCI state restore access cannot observe stale state on other CPUs.
Just want to reiterate that this approach masks the original logical errors within amdgpu.
For ex: this is one such which would not have been caught in the first place with shortcuts like these.
12caf3b76150 drm/amdkfd: Handle GPU reset and drain retry fault race Thanks, Lijo
Fixes: 91ae0045130b ("drm/amd/pm: Disable MMIO access during SMU Mode 1 reset") Signed-off-by: Perry Yuan <[email protected]> Reviewed-by: Yifan Zhang <[email protected]> --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index b2deb6a74eb2..e69ab8a923e3 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5735,6 +5735,9 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) /* enable mmio access after mode 1 reset completed */ adev->no_hw_access = false;+ /* ensure no_hw_access is updated before we access hw */+ smp_mb(); + amdgpu_device_load_pci_state(adev->pdev); ret = amdgpu_psp_wait_for_bootloader(adev); if (ret)
