[RFC PATCH 1/1] drm/amdgpu: Fix null-ptr-deref in amdgpu_device_fini_sw()

2022-09-30 Thread Zhang Boyang
After amdgpu_device_init() failed, adev->reset_domain may be NULL. Thus
subsequent call to amdgpu_device_fini_sw() may result in null-ptr-deref.

This patch fixes the problem by adding a NULL pointer check around the
code of releasing adev->reset_domain in amdgpu_device_fini_sw().

Fixes: cfbb6b004744 ("drm/amdgpu: Rework reset domain to be refcounted.")

Signed-off-by: Zhang Boyang 
Link: 
https://lore.kernel.org/lkml/a8bce489-8ccc-aa95-3de6-f854e03ad...@suddenlinkmail.com/
Link: https://lore.kernel.org/lkml/at9whr.3z1t3vi9a2...@att.net/
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index be7aff2d4a57..204daad06b32 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4021,8 +4021,10 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
if (adev->mman.discovery_bin)
amdgpu_discovery_fini(adev);
 
-   amdgpu_reset_put_reset_domain(adev->reset_domain);
-   adev->reset_domain = NULL;
+   if (adev->reset_domain) {
+   amdgpu_reset_put_reset_domain(adev->reset_domain);
+   adev->reset_domain = NULL;
+   }
 
kfree(adev->pci_state);
 
-- 
2.30.2



Re: [RFC PATCH 1/1] drm/amdgpu: Fix null-ptr-deref in amdgpu_device_fini_sw()

2022-10-03 Thread Steven J Abner

I had done more delving into this also, thankfully was not forgotten.
Additional info to solve blackout, was going to contact AMD:
The problem as far as I could trace occurs in amdgpu_psp.c in function 
psp_cmd_submit_buf().
Normal counter 'timeout' seems to use a max of about 400 with 
usleep_range(5, 80).
Under normal operation 'psp->fence_buf' will equal 'index' and drop 
from loop. I could not track what
fills the kernel buffer objects 'virtual' pointer's reference 
(psp->fence_buf). I assume that
it's to a firmware file that on read is returning an error, and getting 
stuck in a loop lock.
The error condition I found occurs when 'psp->fence_buf' does not equal 
'index',

breaking from loop with 'timeout' == 0.
Blackout seemed to be about 80% of reboots, but found that in 
reconfiguration of kernel,
on 5.18.19, with CONFIG_ATA_ACPI=y drops blackouts to about 30% of 
reboots. I have now avoided
all blackouts with addition of CONFIG_SATA_PMP=y (at least a week 
free?).
This is pure guess but, maybe the ARM PSP processor is dependent on 
libata procedures, one

of which causes a lock up some of the time.
Steve

On Fri, Sep 30, 2022 at 21:41, Zhang Boyang  
wrote:
After amdgpu_device_init() failed, adev->reset_domain may be NULL. 
Thus
subsequent call to amdgpu_device_fini_sw() may result in 
null-ptr-deref.


This patch fixes the problem by adding a NULL pointer check around the
code of releasing adev->reset_domain in amdgpu_device_fini_sw().

Fixes: cfbb6b004744 ("drm/amdgpu: Rework reset domain to be 
refcounted.")


Signed-off-by: Zhang Boyang 
Link: 
https://lore.kernel.org/lkml/a8bce489-8ccc-aa95-3de6-f854e03ad...@suddenlinkmail.com/

Link: https://lore.kernel.org/lkml/at9whr.3z1t3vi9a2...@att.net/
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index be7aff2d4a57..204daad06b32 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4021,8 +4021,10 @@ void amdgpu_device_fini_sw(struct 
amdgpu_device *adev)

if (adev->mman.discovery_bin)
amdgpu_discovery_fini(adev);

-   amdgpu_reset_put_reset_domain(adev->reset_domain);
-   adev->reset_domain = NULL;
+   if (adev->reset_domain) {
+   amdgpu_reset_put_reset_domain(adev->reset_domain);
+   adev->reset_domain = NULL;
+   }

kfree(adev->pci_state);

--
2.30.2