On 29.09.25 10:51, Jesse.Zhang wrote:
> After GPU reset with VRAM loss, a general protection fault occurs
> during user queue restoration when accessing vm_bo->vm after
> spinlock release in amdgpu_vm_bo_reset_state_machine.
>
> The root cause is that vm_bo points to the last entry from the
> list_for_each_entry loop, but this becomes invalid after the
> spinlock is released. Accessing vm_bo->vm at this point leads
> to memory corruption.
>
> Crash log shows:
> [ 326.981811] Oops: general protection fault, probably for non-canonical
> address 0x4156415741e58ac8: 0000 [#1] SMP NOPTI
> [ 326.981820] CPU: 13 UID: 0 PID: 1035 Comm: kworker/13:3 Tainted: G
> E 6.16.0+ #25 PREEMPT(voluntary)
> [ 326.981826] Tainted: [E]=UNSIGNED_MODULE
> [ 326.981827] Hardware name: Gigabyte Technology Co., Ltd. X870E AORUS PRO
> ICE/X870E AORUS PRO ICE, BIOS F3i 12/19/2024
> [ 326.981831] Workqueue: events amdgpu_userq_restore_worker [amdgpu]
> [ 326.981999] RIP: 0010:amdgpu_vm_assert_locked+0x16/0x70 [amdgpu]
> [ 326.982094] Code: 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
> 90 0f 1f 44 00 00 48 85 ff 74 45 48 8b 87 80 03 00 00 48 85 c0 74 40 <48> 8b
> b8 80 01 00 00 48 85 ff 74 3b 8b 05 0c b7 0e f0 85 c0 75 05
> [ 326.982098] RSP: 0018:ffffaa91c2a6bc20 EFLAGS: 00010206
> [ 326.982100] RAX: 4156415741e58948 RBX: ffff9e8f013e8330 RCX:
> 0000000000000000
> [ 326.982102] RDX: 0000000000000005 RSI: 000000001d254e88 RDI:
> ffffffffc144814a
> [ 326.982104] RBP: ffffaa91c2a6bc68 R08: 0000004c21a25674 R09:
> 0000000000000001
> [ 326.982106] R10: 0000000000000001 R11: dccaf3f2f82863fc R12:
> ffff9e8f013e8000
> [ 326.982108] R13: ffff9e8f013e8000 R14: 0000000000000000 R15:
> ffff9e8f09980000
> [ 326.982110] FS: 0000000000000000(0000) GS:ffff9e9e79995000(0000)
> knlGS:0000000000000000
> [ 326.982112] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 326.982114] CR2: 000055ed6c9caa80 CR3: 0000000797060000 CR4:
> 0000000000750ef0
> [ 326.982116] PKRU: 55555554
>
> Signed-off-by: Jesse Zhang <[email protected]>

Reviewed-by: Christian König <[email protected]>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 563cad9c6cbc..86c8288c665f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -265,7 +265,7 @@ static void amdgpu_vm_bo_reset_state_machine(struct
> amdgpu_vm *vm)
>  		vm_bo->moved = true;
>  	spin_unlock(&vm->invalidated_lock);
>
> -	amdgpu_vm_assert_locked(vm_bo->vm);
> +	amdgpu_vm_assert_locked(vm);
>  	list_for_each_entry_safe(vm_bo, tmp, &vm->idle, vm_status) {
>  		struct amdgpu_bo *bo = vm_bo->bo;
>
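
For anyone reading along, a small userspace sketch of why the loop cursor cannot be used once list_for_each_entry() has run to completion: at that point the cursor is computed from the list head itself via container_of(), so it is not one of the real entries. Everything below is an illustration under my own assumptions. The demo_vm and demo_vm_bo structs, the helper names and the simplified list macros are hypothetical stand-ins and not the amdgpu or <linux/list.h> code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel list helpers. */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each_entry(pos, head, member)                           \
	for (pos = container_of((head)->next, __typeof__(*pos), member);  \
	     &pos->member != (head);                                      \
	     pos = container_of(pos->member.next, __typeof__(*pos), member))

struct demo_vm;

/* Hypothetical per-BO entry, loosely modelled on the patch context. */
struct demo_vm_bo {
	struct demo_vm *vm;
	bool moved;
	struct list_head vm_status;
};

/* Hypothetical VM object that owns the list head. */
struct demo_vm {
	int id;
	struct list_head invalidated;
};

static void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Marks every entry as moved and returns the cursor as the loop leaves it. */
static struct demo_vm_bo *mark_all_moved(struct demo_vm *vm)
{
	struct demo_vm_bo *vm_bo;

	list_for_each_entry(vm_bo, &vm->invalidated, vm_status)
		vm_bo->moved = true;

	/* Do not dereference vm_bo here: it was derived from the list head. */
	return vm_bo;
}

/* True when the cursor is the fake "entry" wrapped around the list head. */
static bool cursor_is_list_head(const struct demo_vm_bo *cursor,
				const struct demo_vm *vm)
{
	return &cursor->vm_status == &vm->invalidated;
}

/* Returns 1 when the leftover cursor matches no real entry. */
static int demo(void)
{
	struct demo_vm vm = { .id = 1 };
	struct demo_vm_bo a = { .vm = &vm }, b = { .vm = &vm };
	struct demo_vm_bo *cursor;

	list_init(&vm.invalidated);
	list_add_tail(&a.vm_status, &vm.invalidated);
	list_add_tail(&b.vm_status, &vm.invalidated);

	cursor = mark_all_moved(&vm);
	assert(a.moved && b.moved);

	return cursor != &a && cursor != &b && cursor_is_list_head(cursor, &vm);
}
```

Reading cursor->vm in this sketch would fetch whatever happens to sit in front of the list head in memory, which is why passing the vm argument directly, as the patch does, is the safe choice.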
