RE: [PATCH 2/2] drm/amdgpu: Use the default reset when loading amdgpu driver
[AMD Official Use Only - General] Hi Feifei, Thanks for your feedback. Will add comments inside and modify commit messages. I think you are a little misunderstood. It should do a mode1 reset when loading or reloading the driver, regardless of the module parameter reset_method. It will call amdgpu_device_mode1_reset in amdgpu_asic_reset if amdgpu_reset_method is set to AMD_RESET_METHOD_NONE. Here's an example, modprobe amdgpu modprobe -r amdgpu modprobe amdgpu reset_method=3 //The real reset method should be mode1 reset, since it is initialization. Regards, Lyndon > -Original Message- > From: Xu, Feifei > Sent: Monday, April 24, 2023 2:00 PM > To: Li, Lyndon ; amd-gfx@lists.freedesktop.org > Cc: Liu, Shaoyun ; Zhao, Victor > ; Feng, Kenneth ; Li, > Yunxiang (Teddy) ; Li, Lyndon > Subject: RE: [PATCH 2/2] drm/amdgpu: Use the default reset when loading > amdgpu driver > > [AMD Official Use Only - General] > > I think you might be refer to : mod parameter reset_method will not affect > the loading driver code path. If loading driver, it should use the default > reset > which might be mode1/mode2 or BACO instead of the specific mode2. > > With the confusing commit msg corrected. And adding comment before the > code " r = amdgpu_asic_reset(adev);" > > > > Reviewed-by: Feifei Xu > > -Original Message- > From: lyndonli > Sent: Monday, April 24, 2023 9:58 AM > To: amd-gfx@lists.freedesktop.org > Cc: Liu, Shaoyun ; Zhao, Victor > ; Feng, Kenneth ; Xu, > Feifei ; Li, Yunxiang (Teddy) ; > Li, Lyndon > Subject: [PATCH 2/2] drm/amdgpu: Use the default reset when loading > amdgpu driver > > Below call trace and errors are observed when reloading amdgpu driver with > the module parameter reset_method=3. > > It should do a mode1 reset when loading the driver. > > [ +2.180243] [drm] psp gfx command ID_LOAD_TOC(0x20) failed and > response status is (0x0) [ +0.11] [drm:psp_hw_start [amdgpu]] > *ERROR* Failed to load toc [ +0.000890] [drm:psp_hw_start [amdgpu]] > *ERROR* PSP tmr init failed! > [ +0.020683] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear > memory with ring turned off. > [ +0.03] RIP: 0010:amdgpu_bo_release_notify+0x1ef/0x210 [amdgpu] > [ +0.04] Call Trace: > [ +0.03] > [ +0.08] ttm_bo_release+0x2c4/0x330 [amdttm] [ +0.26] > amdttm_bo_put+0x3c/0x70 [amdttm] [ +0.20] > amdgpu_bo_free_kernel+0xe6/0x140 [amdgpu] [ +0.000728] > psp_v11_0_ring_destroy+0x34/0x60 [amdgpu] [ +0.000826] > psp_hw_init+0xe7/0x2f0 [amdgpu] [ +0.000813] > amdgpu_device_fw_loading+0x1ad/0x2d0 [amdgpu] [ +0.000731] > amdgpu_device_init.cold+0x108e/0x2002 [amdgpu] [ +0.001071] ? > do_pci_enable_device+0xe1/0x110 [ +0.11] > amdgpu_driver_load_kms+0x1a/0x160 [amdgpu] [ +0.000729] > amdgpu_pci_probe+0x179/0x3a0 [amdgpu] > > Signed-off-by: lyndonli > Signed-off-by: Yunxiang Li > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 > 1 file changed, 4 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index e536886f6d42..9738e3660cf1 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -3578,6 +3578,7 @@ int amdgpu_device_init(struct amdgpu_device > *adev, > int r, i; > bool px = false; > u32 max_MBps; > + int tmp; > > adev->shutdown = false; > adev->flags = flags; > @@ -3799,7 +3800,10 @@ int amdgpu_device_init(struct amdgpu_device > *adev, > } > } > } else { > + tmp = amdgpu_reset_method; > + amdgpu_reset_method = > AMD_RESET_METHOD_NONE; > r = amdgpu_asic_reset(adev); > + amdgpu_reset_method = tmp; > if (r) { > dev_err(adev->dev, "asic reset on init > failed\n"); > goto failed; > -- > 2.34.1
RE: [PATCH] drm/amdgpu: Fix the warning info when removing amdgpu device
[AMD Official Use Only - General] Thanks, will update it. Regards, Lyndon > -Original Message- > From: Christian König > Sent: Tuesday, March 7, 2023 2:22 PM > To: Chen, Guchun ; Li, Lyndon > ; amd-gfx@lists.freedesktop.org > Cc: Xu, Feifei ; Ma, Jun ; > Prosyak, Vitaly ; Deucher, Alexander > ; Koenig, Christian > > Subject: Re: [PATCH] drm/amdgpu: Fix the warning info when removing > amdgpu device > > The commit message reads a bit bumpy. Generally best practice are: > > Short (72 chars or less) summary > > More detailed explanatory text. Wrap it to 72 characters. The blank line > separating the summary from the body is critical (unless you omit the body > entirely). > > Write your commit message in the imperative: "Fix bug" and not "Fixed bug" > or "Fixes bug." This convention matches up with commit messages > generated by commands like git merge and git revert. > > Further paragraphs come after blank lines. > > - Bullet points are okay, too. > - Typically a hyphen or asterisk is used for the bullet, followed by a > single space. Use a hanging indent. > > Apart from that the patch is Acked-by: Christian König > > > Regards, > Christian. > > Am 07.03.23 um 03:23 schrieb Chen, Guchun: > > Reviewed-by: Guchun Chen > > > > Regards, > > Guchun > > > > -Original Message- > > From: lyndonli > > Sent: Tuesday, March 7, 2023 10:12 AM > > To: amd-gfx@lists.freedesktop.org > > Cc: Prosyak, Vitaly ; Koenig, Christian > > ; Deucher, Alexander > > ; Chen, Guchun > ; Xu, > > Feifei ; Ma, Jun ; Li, Lyndon > > > > Subject: [PATCH] drm/amdgpu: Fix the warning info when removing > amdgpu > > device > > > > Actually, the drm_dev_enter in psp_cmd_submit_buf does not protect > anything. > > And it is not used to prevent concurrent access. > > If DRM device is unplugged, it will always check the condition in WARN_ON. > > We'd better not keep adding commands to the list. > > Simply moving the drm_dev_enter/drm_dev_exit higher level will not > solve the issue. > > Because psp_cmd_submit_buf is called in many places, such as > psp_hw_init-->psp_load_fw, psp_suspend-->psp_xgmi_terminate, > amdgpu_device_gpu_recover-->amdgpu_ras_suspend. > > So drop drm_dev_enter/drm_dev_exit in psp_cmd_submit_buf. > > > > When removing amdgpu, the calling order as follows: > > amdgpu_pci_remove > > drm_dev_unplug > > amdgpu_driver_unload_kms > > amdgpu_device_fini_hw > > amdgpu_device_ip_fini_early > > psp_hw_fini > > psp_ras_terminate > > psp_ta_unloadye > > > psp_cmd_submit_buf > > > > [ 4507.740388] Call Trace: > > [ 4507.740389] > > [ 4507.740391] psp_ta_unload+0x44/0x70 [amdgpu] [ 4507.740485] > > psp_ras_terminate+0x4d/0x70 [amdgpu] [ 4507.740575] > > psp_hw_fini+0x28/0xa0 [amdgpu] [ 4507.740662] > > amdgpu_device_fini_hw+0x328/0x442 [amdgpu] [ 4507.740791] > > amdgpu_driver_unload_kms+0x51/0x60 [amdgpu] [ 4507.740875] > > amdgpu_pci_remove+0x5a/0x140 [amdgpu] [ 4507.740962] ? > > _raw_spin_unlock_irqrestore+0x27/0x43 > > [ 4507.740965] ? __pm_runtime_resume+0x60/0x90 [ 4507.740968] > > pci_device_remove+0x39/0xb0 [ 4507.740971] device_remove+0x46/0x70 > [ > > 4507.740972] device_release_driver_internal+0xd1/0x160 > > [ 4507.740974] driver_detach+0x4a/0x90 [ 4507.740975] > > bus_remove_driver+0x6c/0xf0 [ 4507.740976] > > driver_unregister+0x31/0x50 [ 4507.740977] > > pci_unregister_driver+0x40/0x90 [ 4507.740978] amdgpu_exit+0x15/0x120 > > [amdgpu] > > > > Signed-off-by: lyndonli > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 17 + > > 1 file changed, 1 insertion(+), 16 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > > index 4c617faaa7c9..02f948adae72 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > > @@ -603,27 +603,14 @@ psp_cmd_submit_buf(struct psp_context *psp, > >struct psp_gfx_cmd_resp *cmd, uint64_t fence_mc_addr) > { > > int ret; > > - int index, idx; > > + int index; > > int timeout = 2; > > bool ras_intr = false; > > bool skip_unsupport = false; > > - bool dev_entered; > > > > if (psp->ad
RE: [PATCH] drm/amdgpu: Fix call trace warning and hang when removing amdgpu device
Sorry, the code base was incorrect. I just sent out a v2. Regards, Lyndon > -Original Message- > From: Chen, Guchun > Sent: Thursday, March 2, 2023 1:28 PM > To: Li, Lyndon ; amd-gfx@lists.freedesktop.org > Cc: Koenig, Christian ; Chai, Thomas > ; Xu, Feifei ; Li, Lyndon > > Subject: RE: [PATCH] drm/amdgpu: Fix call trace warning and hang when > removing amdgpu device > > - adev->in_suspend || adev->ddev.unplugged) > > I don't think the code base is correct. Please double check it. > > Regards, > Guchun > > -Original Message- > From: lyndonli > Sent: Thursday, March 2, 2023 12:57 PM > To: amd-gfx@lists.freedesktop.org > Cc: Koenig, Christian ; Chen, Guchun > ; Chai, Thomas ; Xu, > Feifei ; Li, Lyndon > Subject: [PATCH] drm/amdgpu: Fix call trace warning and hang when > removing amdgpu device > > On GPUs with RAS enabled, below call trace and hang are observed when > shutting down device. > > v2: use DRM device unplugged flag instead of shutdown flag as the check to > prevent memory wipe in shutdown stage. > > [ +0.00] RIP: 0010:amdgpu_vram_mgr_fini+0x18d/0x1c0 [amdgpu] > [ +0.01] PKRU: 5554 [ +0.01] Call Trace: > [ +0.01] > [ +0.02] amdgpu_ttm_fini+0x140/0x1c0 [amdgpu] [ +0.000183] > amdgpu_bo_fini+0x27/0xa0 [amdgpu] [ +0.000184] > gmc_v11_0_sw_fini+0x2b/0x40 [amdgpu] [ +0.000163] > amdgpu_device_fini_sw+0xb6/0x510 [amdgpu] [ +0.000152] > amdgpu_driver_release_kms+0x16/0x30 [amdgpu] [ +0.90] > drm_dev_release+0x28/0x50 [drm] [ +0.16] > devm_drm_dev_init_release+0x38/0x60 [drm] [ +0.11] > devm_action_release+0x15/0x20 [ +0.03] release_nodes+0x40/0xc0 > [ +0.01] devres_release_all+0x9e/0xe0 [ +0.01] > device_unbind_cleanup+0x12/0x80 [ +0.03] > device_release_driver_internal+0xff/0x160 > [ +0.01] driver_detach+0x4a/0x90 > [ +0.01] bus_remove_driver+0x6c/0xf0 [ +0.01] > driver_unregister+0x31/0x50 [ +0.01] pci_unregister_driver+0x40/0x90 > [ +0.03] amdgpu_exit+0x15/0x120 [amdgpu] > > Signed-off-by: lyndonli > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c > index d50f8bfb9be9..5554ff22d724 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c > @@ -1314,7 +1314,7 @@ void amdgpu_bo_release_notify(struct > ttm_buffer_object *bo) > > if (!bo->resource || bo->resource->mem_type != TTM_PL_VRAM || > !(abo->flags & > AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE) || > - adev->in_suspend || adev->ddev.unplugged) > + adev->in_suspend || > drm_dev_is_unplugged(adev_to_drm(adev))) > return; > > if (WARN_ON_ONCE(!dma_resv_trylock(bo->base.resv))) > -- > 2.34.1