[AMD Official Use Only - AMD Internal Distribution Only]
Hi,
The change looks good to me.
Reviewed-by: Ce Sun <[email protected]>
________________________________
From: Lazar, Lijo <[email protected]>
Sent: Tuesday, February 24, 2026 1:06 PM
To: [email protected] <[email protected]>
Cc: Zhang, Hawking <[email protected]>; Deucher, Alexander
<[email protected]>; Sun, Ce(Overlord) <[email protected]>
Subject: [PATCH] drm/amdgpu: Fix error handling in slot reset
If the device has not recovered after slot reset is called, it goes to
out label for error handling. There it could make decision based on
uninitialized hive pointer and could result in accessing an uninitialized
list.
Initialize the list and hive properly so that it handles the error
situation and also releases the reset domain lock which is acquired
during error_detected callback.
Fixes: 732c6cefc1ec ("drm/amdgpu: Replace tmp_adev with hive in
amdgpu_pci_slot_reset")
Signed-off-by: Lijo Lazar <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3020cd2493f7..05362f262726 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -7045,6 +7045,15 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev
*pdev)
dev_info(adev->dev, "PCI error: slot reset callback!!\n");
memset(&reset_context, 0, sizeof(reset_context));
+ INIT_LIST_HEAD(&device_list);
+ hive = amdgpu_get_xgmi_hive(adev);
+ if (hive) {
+ mutex_lock(&hive->hive_lock);
+ list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
+ list_add_tail(&tmp_adev->reset_list, &device_list);
+ } else {
+ list_add_tail(&adev->reset_list, &device_list);
+ }
if (adev->pcie_reset_ctx.swus)
link_dev = adev->pcie_reset_ctx.swus;
@@ -7085,19 +7094,13 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev
*pdev)
reset_context.reset_req_dev = adev;
set_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
- INIT_LIST_HEAD(&device_list);
- hive = amdgpu_get_xgmi_hive(adev);
if (hive) {
- mutex_lock(&hive->hive_lock);
reset_context.hive = hive;
- list_for_each_entry(tmp_adev, &hive->device_list,
gmc.xgmi.head) {
+ list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
tmp_adev->pcie_reset_ctx.in_link_reset = true;
- list_add_tail(&tmp_adev->reset_list, &device_list);
- }
} else {
set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
- list_add_tail(&adev->reset_list, &device_list);
}
r = amdgpu_device_asic_reset(adev, &device_list, &reset_context);
--
2.49.0