RE: [PATCH 2/2] drm/amdgpu: Use the default reset when loading amdgpu driver

2023-04-24 Thread Li, Lyndon
[AMD Official Use Only - General]

Hi Feifei,

Thanks for your feedback. Will add comments inside and modify commit messages.
I think you are a little misunderstood.

It should do a mode1 reset when loading or reloading the driver, regardless of 
the module parameter reset_method. 
It will call amdgpu_device_mode1_reset in amdgpu_asic_reset if 
amdgpu_reset_method is set to AMD_RESET_METHOD_NONE. 
Here's an example,
modprobe amdgpu
modprobe -r amdgpu
modprobe amdgpu reset_method=3 //The real reset method should be mode1 reset, 
since it is initialization.

Regards,
Lyndon

> -Original Message-
> From: Xu, Feifei 
> Sent: Monday, April 24, 2023 2:00 PM
> To: Li, Lyndon ; amd-gfx@lists.freedesktop.org
> Cc: Liu, Shaoyun ; Zhao, Victor
> ; Feng, Kenneth ; Li,
> Yunxiang (Teddy) ; Li, Lyndon 
> Subject: RE: [PATCH 2/2] drm/amdgpu: Use the default reset when loading
> amdgpu driver
> 
> [AMD Official Use Only - General]
> 
> I think you might be refer to : mod parameter reset_method will not affect
> the loading driver code path. If loading driver, it should use the default 
> reset
> which might be mode1/mode2 or BACO instead of the specific mode2.
> 
> With the confusing commit msg corrected. And adding comment before the
> code " r = amdgpu_asic_reset(adev);"
> 
> 
> 
> Reviewed-by: Feifei Xu 
> 
> -Original Message-
> From: lyndonli 
> Sent: Monday, April 24, 2023 9:58 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Liu, Shaoyun ; Zhao, Victor
> ; Feng, Kenneth ; Xu,
> Feifei ; Li, Yunxiang (Teddy) ;
> Li, Lyndon 
> Subject: [PATCH 2/2] drm/amdgpu: Use the default reset when loading
> amdgpu driver
> 
> Below call trace and errors are observed when reloading amdgpu driver with
> the module parameter reset_method=3.
> 
> It should do a mode1 reset when loading the driver.
> 
> [  +2.180243] [drm] psp gfx command ID_LOAD_TOC(0x20) failed and
> response status is (0x0) [  +0.11] [drm:psp_hw_start [amdgpu]]
> *ERROR* Failed to load toc [  +0.000890] [drm:psp_hw_start [amdgpu]]
> *ERROR* PSP tmr init failed!
> [  +0.020683] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear
> memory with ring turned off.
> [  +0.03] RIP: 0010:amdgpu_bo_release_notify+0x1ef/0x210 [amdgpu]
> [  +0.04] Call Trace:
> [  +0.03]  
> [  +0.08]  ttm_bo_release+0x2c4/0x330 [amdttm] [  +0.26]
> amdttm_bo_put+0x3c/0x70 [amdttm] [  +0.20]
> amdgpu_bo_free_kernel+0xe6/0x140 [amdgpu] [  +0.000728]
> psp_v11_0_ring_destroy+0x34/0x60 [amdgpu] [  +0.000826]
> psp_hw_init+0xe7/0x2f0 [amdgpu] [  +0.000813]
> amdgpu_device_fw_loading+0x1ad/0x2d0 [amdgpu] [  +0.000731]
> amdgpu_device_init.cold+0x108e/0x2002 [amdgpu] [  +0.001071]  ?
> do_pci_enable_device+0xe1/0x110 [  +0.11]
> amdgpu_driver_load_kms+0x1a/0x160 [amdgpu] [  +0.000729]
> amdgpu_pci_probe+0x179/0x3a0 [amdgpu]
> 
> Signed-off-by: lyndonli 
> Signed-off-by: Yunxiang Li 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e536886f6d42..9738e3660cf1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3578,6 +3578,7 @@ int amdgpu_device_init(struct amdgpu_device
> *adev,
>   int r, i;
>   bool px = false;
>   u32 max_MBps;
> + int tmp;
> 
>   adev->shutdown = false;
>   adev->flags = flags;
> @@ -3799,7 +3800,10 @@ int amdgpu_device_init(struct amdgpu_device
> *adev,
>   }
>   }
>   } else {
> + tmp = amdgpu_reset_method;
> + amdgpu_reset_method =
> AMD_RESET_METHOD_NONE;
>   r = amdgpu_asic_reset(adev);
> + amdgpu_reset_method = tmp;
>   if (r) {
>   dev_err(adev->dev, "asic reset on init
> failed\n");
>   goto failed;
> --
> 2.34.1

RE: [PATCH] drm/amdgpu: Fix the warning info when removing amdgpu device

2023-03-06 Thread Li, Lyndon
[AMD Official Use Only - General]

Thanks, will update it.

Regards,
Lyndon

> -Original Message-
> From: Christian König 
> Sent: Tuesday, March 7, 2023 2:22 PM
> To: Chen, Guchun ; Li, Lyndon
> ; amd-gfx@lists.freedesktop.org
> Cc: Xu, Feifei ; Ma, Jun ;
> Prosyak, Vitaly ; Deucher, Alexander
> ; Koenig, Christian
> 
> Subject: Re: [PATCH] drm/amdgpu: Fix the warning info when removing
> amdgpu device
> 
> The commit message reads a bit bumpy. Generally best practice are:
> 
> Short (72 chars or less) summary
> 
> More detailed explanatory text. Wrap it to 72 characters. The blank line
> separating the summary from the body is critical (unless you omit the body
> entirely).
> 
> Write your commit message in the imperative: "Fix bug" and not "Fixed bug"
> or "Fixes bug." This convention matches up with commit messages
> generated by commands like git merge and git revert.
> 
> Further paragraphs come after blank lines.
> 
> - Bullet points are okay, too.
> - Typically a hyphen or asterisk is used for the bullet, followed by a
>    single space. Use a hanging indent.
> 
> Apart from that the patch is Acked-by: Christian König
> 
> 
> Regards,
> Christian.
> 
> Am 07.03.23 um 03:23 schrieb Chen, Guchun:
> > Reviewed-by: Guchun Chen 
> >
> > Regards,
> > Guchun
> >
> > -Original Message-
> > From: lyndonli 
> > Sent: Tuesday, March 7, 2023 10:12 AM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Prosyak, Vitaly ; Koenig, Christian
> > ; Deucher, Alexander
> > ; Chen, Guchun
> ; Xu,
> > Feifei ; Ma, Jun ; Li, Lyndon
> > 
> > Subject: [PATCH] drm/amdgpu: Fix the warning info when removing
> amdgpu
> > device
> >
> > Actually, the drm_dev_enter in psp_cmd_submit_buf does not protect
> anything.
> > And it is not used to prevent concurrent access.
> > If DRM device is unplugged, it will always check the condition in WARN_ON.
> > We'd better not keep adding commands to the list.
> > Simply moving the drm_dev_enter/drm_dev_exit higher level will not
> solve the issue.
> > Because psp_cmd_submit_buf is called in many places, such as
> psp_hw_init-->psp_load_fw, psp_suspend-->psp_xgmi_terminate,
> amdgpu_device_gpu_recover-->amdgpu_ras_suspend.
> > So drop drm_dev_enter/drm_dev_exit in psp_cmd_submit_buf.
> >
> > When removing amdgpu, the calling order as follows:
> > amdgpu_pci_remove
> > drm_dev_unplug
> > amdgpu_driver_unload_kms
> > amdgpu_device_fini_hw
> > amdgpu_device_ip_fini_early
> > psp_hw_fini
> > psp_ras_terminate
> > psp_ta_unloadye
> >
>   psp_cmd_submit_buf
> >
> > [ 4507.740388] Call Trace:
> > [ 4507.740389]  
> > [ 4507.740391]  psp_ta_unload+0x44/0x70 [amdgpu] [ 4507.740485]
> > psp_ras_terminate+0x4d/0x70 [amdgpu] [ 4507.740575]
> > psp_hw_fini+0x28/0xa0 [amdgpu] [ 4507.740662]
> > amdgpu_device_fini_hw+0x328/0x442 [amdgpu] [ 4507.740791]
> > amdgpu_driver_unload_kms+0x51/0x60 [amdgpu] [ 4507.740875]
> > amdgpu_pci_remove+0x5a/0x140 [amdgpu] [ 4507.740962]  ?
> > _raw_spin_unlock_irqrestore+0x27/0x43
> > [ 4507.740965]  ? __pm_runtime_resume+0x60/0x90 [ 4507.740968]
> > pci_device_remove+0x39/0xb0 [ 4507.740971]  device_remove+0x46/0x70
> [
> > 4507.740972]  device_release_driver_internal+0xd1/0x160
> > [ 4507.740974]  driver_detach+0x4a/0x90 [ 4507.740975]
> > bus_remove_driver+0x6c/0xf0 [ 4507.740976]
> > driver_unregister+0x31/0x50 [ 4507.740977]
> > pci_unregister_driver+0x40/0x90 [ 4507.740978]  amdgpu_exit+0x15/0x120
> > [amdgpu]
> >
> > Signed-off-by: lyndonli 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 17 +
> >   1 file changed, 1 insertion(+), 16 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > index 4c617faaa7c9..02f948adae72 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > @@ -603,27 +603,14 @@ psp_cmd_submit_buf(struct psp_context *psp,
> >struct psp_gfx_cmd_resp *cmd, uint64_t fence_mc_addr)
> {
> > int ret;
> > -   int index, idx;
> > +   int index;
> > int timeout = 2;
> > bool ras_intr = false;
> > bool skip_unsupport = false;
> > -   bool dev_entered;
> >
> > if (psp->ad

RE: [PATCH] drm/amdgpu: Fix call trace warning and hang when removing amdgpu device

2023-03-01 Thread Li, Lyndon
Sorry, the code base was incorrect.
I just sent out a v2.

Regards,
Lyndon

> -Original Message-
> From: Chen, Guchun 
> Sent: Thursday, March 2, 2023 1:28 PM
> To: Li, Lyndon ; amd-gfx@lists.freedesktop.org
> Cc: Koenig, Christian ; Chai, Thomas
> ; Xu, Feifei ; Li, Lyndon
> 
> Subject: RE: [PATCH] drm/amdgpu: Fix call trace warning and hang when
> removing amdgpu device
> 
> - adev->in_suspend || adev->ddev.unplugged)
> 
> I don't think the code base is correct. Please double check it.
> 
> Regards,
> Guchun
> 
> -Original Message-
> From: lyndonli 
> Sent: Thursday, March 2, 2023 12:57 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Koenig, Christian ; Chen, Guchun
> ; Chai, Thomas ; Xu,
> Feifei ; Li, Lyndon 
> Subject: [PATCH] drm/amdgpu: Fix call trace warning and hang when
> removing amdgpu device
> 
> On GPUs with RAS enabled, below call trace and hang are observed when
> shutting down device.
> 
> v2: use DRM device unplugged flag instead of shutdown flag as the check to
> prevent memory wipe in shutdown stage.
> 
> [ +0.00] RIP: 0010:amdgpu_vram_mgr_fini+0x18d/0x1c0 [amdgpu]
> [ +0.01] PKRU: 5554 [ +0.01] Call Trace:
> [ +0.01] 
> [ +0.02] amdgpu_ttm_fini+0x140/0x1c0 [amdgpu] [ +0.000183]
> amdgpu_bo_fini+0x27/0xa0 [amdgpu] [ +0.000184]
> gmc_v11_0_sw_fini+0x2b/0x40 [amdgpu] [ +0.000163]
> amdgpu_device_fini_sw+0xb6/0x510 [amdgpu] [ +0.000152]
> amdgpu_driver_release_kms+0x16/0x30 [amdgpu] [ +0.90]
> drm_dev_release+0x28/0x50 [drm] [ +0.16]
> devm_drm_dev_init_release+0x38/0x60 [drm] [ +0.11]
> devm_action_release+0x15/0x20 [ +0.03] release_nodes+0x40/0xc0
> [ +0.01] devres_release_all+0x9e/0xe0 [ +0.01]
> device_unbind_cleanup+0x12/0x80 [ +0.03]
> device_release_driver_internal+0xff/0x160
> [ +0.01] driver_detach+0x4a/0x90
> [ +0.01] bus_remove_driver+0x6c/0xf0 [ +0.01]
> driver_unregister+0x31/0x50 [ +0.01] pci_unregister_driver+0x40/0x90
> [ +0.03] amdgpu_exit+0x15/0x120 [amdgpu]
> 
> Signed-off-by: lyndonli 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index d50f8bfb9be9..5554ff22d724 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -1314,7 +1314,7 @@ void amdgpu_bo_release_notify(struct
> ttm_buffer_object *bo)
> 
>   if (!bo->resource || bo->resource->mem_type != TTM_PL_VRAM ||
>   !(abo->flags &
> AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE) ||
> - adev->in_suspend || adev->ddev.unplugged)
> + adev->in_suspend ||
> drm_dev_is_unplugged(adev_to_drm(adev)))
>   return;
> 
>   if (WARN_ON_ONCE(!dma_resv_trylock(bo->base.resv)))
> --
> 2.34.1