After I updated my Radeon VII system from kernel 5.6 to kernel 5.7, the
following started to happen on each boot:

        ...
        BUG: kernel NULL pointer dereference, address: 0000000000000128
        ...
        CPU: 9 PID: 1940 Comm: modprobe Tainted: G            E     5.7.2-200.im0.fc32.x86_64 #1
        Hardware name: System manufacturer System Product Name/PRIME X570-P, BIOS 1407 04/02/2020
        RIP: 0010:lock_bus+0x42/0x60 [amdgpu]
        ...
        Call Trace:
         i2c_smbus_xfer+0x3d/0xf0
         i2c_default_probe+0xf3/0x130
         i2c_detect.isra.0+0xfe/0x2b0
         ? kfree+0xa3/0x200
         ? kobject_uevent_env+0x11f/0x6a0
         ? i2c_detect.isra.0+0x2b0/0x2b0
         __process_new_driver+0x1b/0x20
         bus_for_each_dev+0x64/0x90
         ? 0xffffffffc0f34000
         i2c_register_driver+0x73/0xc0
         do_one_initcall+0x46/0x200
         ? _cond_resched+0x16/0x40
         ? kmem_cache_alloc_trace+0x167/0x220
         ? do_init_module+0x23/0x260
         do_init_module+0x5c/0x260
         __do_sys_init_module+0x14f/0x170
         do_syscall_64+0x5b/0xf0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ...

The error appears when an i2c device driver tries to probe for devices
using the adapter registered by `smu_v11_0_i2c_eeprom_control_init()`.
The code supporting this adapter requires `adev->psp.ras.ras` to be
non-NULL, which is true only when `amdgpu_ras_init()` detects HW support
by calling `amdgpu_ras_check_supported()`.

Before 9015d60c9ee1, the adapter was registered by

        -> amdgpu_device_ip_init()
          -> amdgpu_ras_recovery_init()
            -> amdgpu_ras_eeprom_init()
              -> smu_v11_0_i2c_eeprom_control_init()

after verifying that `adev->psp.ras.ras` is not NULL in
`amdgpu_ras_recovery_init()`. Currently it is registered
unconditionally by

        -> amdgpu_device_ip_init()
          -> pp_sw_init()
            -> hwmgr_sw_init()
              -> vega20_smu_init()
                -> smu_v11_0_i2c_eeprom_control_init()

The fix adds a HW support check (ras == NULL => no support) before
calling `smu_v11_0_i2c_eeprom_control_{init,fini}()`.
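
For reviewers, the effect of the guard can be modelled in user space.
The sketch below is only an illustration and is not driver code: every
`model_*` name and both structs are hypothetical stand-ins, and the
assumption that the adapter's lock callback dereferences the RAS context
is taken from the description above, not from the real `lock_bus()`
source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the RAS context; not a real amdgpu type. */
struct ras_ctx {
	int lock_depth;
};

/* Hypothetical stand-in for the device; not struct amdgpu_device. */
struct model_dev {
	struct ras_ctx *ras;     /* NULL when RAS is unsupported */
	bool adapter_registered; /* adapter is visible to probing drivers */
};

/* Assumed behaviour: the lock callback needs a valid RAS context. */
static void model_lock_bus(struct model_dev *dev)
{
	assert(dev->ras != NULL); /* the kernel would oops here instead */
	dev->ras->lock_depth++;
}

static int model_eeprom_control_init(struct model_dev *dev)
{
	dev->adapter_registered = true;
	return 0;
}

static void model_eeprom_control_fini(struct model_dev *dev)
{
	dev->adapter_registered = false;
}

/* Guarded init: register the adapter only when RAS is present. */
static int model_smu_init(struct model_dev *dev)
{
	if (dev->ras)
		return model_eeprom_control_init(dev);
	return 0;
}

/* Guarded fini: mirrors the condition used in init. */
static void model_smu_fini(struct model_dev *dev)
{
	if (dev->ras)
		model_eeprom_control_fini(dev);
}

/* Models a client driver probing; returns true if the bus was locked. */
static bool model_probe(struct model_dev *dev)
{
	if (!dev->adapter_registered)
		return false;
	model_lock_bus(dev);
	return true;
}
```

With `ras == NULL` the adapter is never registered, so a probing driver
has nothing to lock and the NULL dereference cannot be reached. Using
the same condition in init and fini keeps the two paths symmetric.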

Please note that a similar fix may also be required for CHIP_ARCTURUS.
I do not know whether any actual Arcturus hardware without RAS exists,
or whether calling `smu_i2c_eeprom_init()` makes any sense when there is
no HW support.

Cc: sta...@vger.kernel.org
Fixes: 9015d60c9ee1 ("drm/amdgpu: Move EEPROM I2C adapter to amdgpu_device")
Signed-off-by: Ivan Mironov <mironov.i...@gmail.com>
Tested-by: Bjorn Nostvold <bjorn.nostv...@gmail.com>
---
Changelog:

v1:
  - Added "Tested-by" for another user who used this patch to fix the
    same issue.

v0:
  - Patch introduced.
---
 drivers/gpu/drm/amd/powerplay/smumgr/vega20_smumgr.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/smumgr/vega20_smumgr.c b/drivers/gpu/drm/amd/powerplay/smumgr/vega20_smumgr.c
index 2fb97554134f..c2e0fbbccf56 100644
--- a/drivers/gpu/drm/amd/powerplay/smumgr/vega20_smumgr.c
+++ b/drivers/gpu/drm/amd/powerplay/smumgr/vega20_smumgr.c
@@ -522,9 +522,11 @@ static int vega20_smu_init(struct pp_hwmgr *hwmgr)
        priv->smu_tables.entry[TABLE_ACTIVITY_MONITOR_COEFF].version = 0x01;
        priv->smu_tables.entry[TABLE_ACTIVITY_MONITOR_COEFF].size = sizeof(DpmActivityMonitorCoeffInt_t);
 
-       ret = smu_v11_0_i2c_eeprom_control_init(&adev->pm.smu_i2c);
-       if (ret)
-               goto err4;
+       if (adev->psp.ras.ras) {
+               ret = smu_v11_0_i2c_eeprom_control_init(&adev->pm.smu_i2c);
+               if (ret)
+                       goto err4;
+       }
 
        return 0;
 
@@ -560,7 +562,8 @@ static int vega20_smu_fini(struct pp_hwmgr *hwmgr)
                        (struct vega20_smumgr *)(hwmgr->smu_backend);
        struct amdgpu_device *adev = hwmgr->adev;
 
-       smu_v11_0_i2c_eeprom_control_fini(&adev->pm.smu_i2c);
+       if (adev->psp.ras.ras)
+               smu_v11_0_i2c_eeprom_control_fini(&adev->pm.smu_i2c);
 
        if (priv) {
                amdgpu_bo_free_kernel(&priv->smu_tables.entry[TABLE_PPTABLE].handle,
-- 
2.26.2
