RE: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2
[AMD Official Use Only - General] Yes, I will add the sequence adjustment to the comment. - Best Regards, Thomas From: Zhang, Hawking Sent: Wednesday, September 7, 2022 11:42 AM To: Chai, Thomas ; amd-gfx@lists.freedesktop.org Cc: Zhou1, Tao ; Wang, Yang(Kevin) Subject: Re: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2 Thanks. Can you please share more details to help me understand the sequence adjustment in suspend? Regards, Hawking From: Chai, Thomas mailto:yipeng.c...@amd.com>> Date: Wednesday, September 7, 2022 at 11:29 To: Zhang, Hawking mailto:hawking.zh...@amd.com>>, amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> mailto:amd-gfx@lists.freedesktop.org>> Cc: Zhou1, Tao mailto:tao.zh...@amd.com>>, Wang, Yang(Kevin) mailto:kevinyang.w...@amd.com>> Subject: RE: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2 [AMD Official Use Only - General] OK, I will update patch. - Best Regards, Thomas From: Zhang, Hawking mailto:hawking.zh...@amd.com>> Sent: Wednesday, September 7, 2022 10:40 AM To: Chai, Thomas mailto:yipeng.c...@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> Cc: Chai, Thomas mailto:yipeng.c...@amd.com>>; Zhou1, Tao mailto:tao.zh...@amd.com>>; Wang, Yang(Kevin) mailto:kevinyang.w...@amd.com>> Subject: Re: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2 [AMD Official Use Only - General] +static void amdgpu_device_gpu_reset(struct amdgpu_device *adev) +{ + struct amdgpu_reset_context reset_context; + + memset(_context, 0, sizeof(reset_context)); + reset_context.method = AMD_RESET_METHOD_NONE; + reset_context.reset_req_dev = adev; + set_bit(AMDGPU_NEED_FULL_RESET, _context.flags); + set_bit(AMDGPU_RESET_FOR_DEVICE_REMOVE, _context.flags); + + amdgpu_device_gpu_recover(adev, NULL, _context); +} This wrapper is kind of confusing. Let's keep amdgpu_device_gpu_recover as the only entry point for recovery handling. If possible, please drop this wrapper, initialize reset_context and call amdgpu_device_gpu_recover directly + /* If in_remove is true, psp_hw_fini should be executed after +* psp_suspend to free psp shared buffers. +*/ + if (adev->in_remove && (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_PSP)) + continue; Can you please share more details to help me understand the sequence adjustment here? Regards, Hawking From: Chai, Thomas mailto:yipeng.c...@amd.com>> Date: Tuesday, September 6, 2022 at 15:48 To: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> mailto:amd-gfx@lists.freedesktop.org>> Cc: Chai, Thomas mailto:yipeng.c...@amd.com>>, Zhang, Hawking mailto:hawking.zh...@amd.com>>, Zhou1, Tao mailto:tao.zh...@amd.com>>, Wang, Yang(Kevin) mailto:kevinyang.w...@amd.com>>, Chai, Thomas mailto:yipeng.c...@amd.com>> Subject: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2 Adjust removal control flow for smu v13_0_2: During amdgpu uninstallation, when removing the first device, the kernel needs to first send a mode1reset message to all gpu devices. Otherwise, smu initialization will fail the next time amdgpu is installed. V2: 1. Update commit comments. 2. Remove the global variable amdgpu_device_remove_cnt and add a variable to the structure amdgpu_hive_info. 3. Use hive to detect the first removed device instead of a global variable. Signed-off-by: YiPeng Chai mailto:yipeng.c...@amd.com>> --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 3 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 +- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 35 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c| 16 - drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h | 1 + drivers/gpu/drm/amd/pm/amdgpu_pm.c | 6 +++- 7 files changed, 99 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 79bb6fd83094..465295318830 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -997,6 +997,9 @@ struct amdgpu_device { boolin_s4; boolin_s0ix; + /* uninstall */ + boolin_remove; + enum pp_mp1_state mp1_state; struct amdgpu_doorbell_index doorbell_index; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 62b26f0e37b0..1402717673f7 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devi
Re: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2
Thanks. Can you please share more details to help me understand the sequence adjustment in suspend? Regards, Hawking From: Chai, Thomas Date: Wednesday, September 7, 2022 at 11:29 To: Zhang, Hawking , amd-gfx@lists.freedesktop.org Cc: Zhou1, Tao , Wang, Yang(Kevin) Subject: RE: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2 [AMD Official Use Only - General] OK, I will update patch. - Best Regards, Thomas From: Zhang, Hawking Sent: Wednesday, September 7, 2022 10:40 AM To: Chai, Thomas ; amd-gfx@lists.freedesktop.org Cc: Chai, Thomas ; Zhou1, Tao ; Wang, Yang(Kevin) Subject: Re: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2 [AMD Official Use Only - General] +static void amdgpu_device_gpu_reset(struct amdgpu_device *adev) +{ + struct amdgpu_reset_context reset_context; + + memset(_context, 0, sizeof(reset_context)); + reset_context.method = AMD_RESET_METHOD_NONE; + reset_context.reset_req_dev = adev; + set_bit(AMDGPU_NEED_FULL_RESET, _context.flags); + set_bit(AMDGPU_RESET_FOR_DEVICE_REMOVE, _context.flags); + + amdgpu_device_gpu_recover(adev, NULL, _context); +} This wrapper is kind of confusing. Let’s keep amdgpu_device_gpu_recover as the only entry point for recovery handling. If possible, please drop this wrapper, initialize reset_context and call amdgpu_device_gpu_recover directly + /* If in_remove is true, psp_hw_fini should be executed after +* psp_suspend to free psp shared buffers. +*/ + if (adev->in_remove && (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_PSP)) + continue; Can you please share more details to help me understand the sequence adjustment here? Regards, Hawking From: Chai, Thomas mailto:yipeng.c...@amd.com>> Date: Tuesday, September 6, 2022 at 15:48 To: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> mailto:amd-gfx@lists.freedesktop.org>> Cc: Chai, Thomas mailto:yipeng.c...@amd.com>>, Zhang, Hawking mailto:hawking.zh...@amd.com>>, Zhou1, Tao mailto:tao.zh...@amd.com>>, Wang, Yang(Kevin) mailto:kevinyang.w...@amd.com>>, Chai, Thomas mailto:yipeng.c...@amd.com>> Subject: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2 Adjust removal control flow for smu v13_0_2: During amdgpu uninstallation, when removing the first device, the kernel needs to first send a mode1reset message to all gpu devices. Otherwise, smu initialization will fail the next time amdgpu is installed. V2: 1. Update commit comments. 2. Remove the global variable amdgpu_device_remove_cnt and add a variable to the structure amdgpu_hive_info. 3. Use hive to detect the first removed device instead of a global variable. Signed-off-by: YiPeng Chai mailto:yipeng.c...@amd.com>> --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 3 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 +- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 35 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c| 16 - drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h | 1 + drivers/gpu/drm/amd/pm/amdgpu_pm.c | 6 +++- 7 files changed, 99 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 79bb6fd83094..465295318830 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -997,6 +997,9 @@ struct amdgpu_device { boolin_s4; boolin_s0ix; + /* uninstall */ + boolin_remove; + enum pp_mp1_state mp1_state; struct amdgpu_doorbell_index doorbell_index; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 62b26f0e37b0..1402717673f7 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -2999,6 +2999,13 @@ static int amdgpu_device_ip_suspend_phase2(struct amdgpu_device *adev) DRM_ERROR("suspend of IP block <%s> failed %d\n", adev->ip_blocks[i].version->funcs->name, r); } + + /* If in_remove is true, psp_hw_fini should be executed after +* psp_suspend to free psp shared buffers. +*/ + if (adev->in_remove && (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_PSP)) + continue; + adev->ip_blocks[i].status.hw = false; /* handle putting the SMC in the appropriate state */ if(!amdgpu_sriov_vf(adev)){ @@ -4739,6 +4746
RE: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2
[AMD Official Use Only - General] OK, I will update patch. - Best Regards, Thomas From: Zhang, Hawking Sent: Wednesday, September 7, 2022 10:40 AM To: Chai, Thomas ; amd-gfx@lists.freedesktop.org Cc: Chai, Thomas ; Zhou1, Tao ; Wang, Yang(Kevin) Subject: Re: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2 [AMD Official Use Only - General] +static void amdgpu_device_gpu_reset(struct amdgpu_device *adev) +{ + struct amdgpu_reset_context reset_context; + + memset(_context, 0, sizeof(reset_context)); + reset_context.method = AMD_RESET_METHOD_NONE; + reset_context.reset_req_dev = adev; + set_bit(AMDGPU_NEED_FULL_RESET, _context.flags); + set_bit(AMDGPU_RESET_FOR_DEVICE_REMOVE, _context.flags); + + amdgpu_device_gpu_recover(adev, NULL, _context); +} This wrapper is kind of confusing. Let's keep amdgpu_device_gpu_recover as the only entry point for recovery handling. If possible, please drop this wrapper, initialize reset_context and call amdgpu_device_gpu_recover directly + /* If in_remove is true, psp_hw_fini should be executed after +* psp_suspend to free psp shared buffers. +*/ + if (adev->in_remove && (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_PSP)) + continue; Can you please share more details to help me understand the sequence adjustment here? Regards, Hawking From: Chai, Thomas mailto:yipeng.c...@amd.com>> Date: Tuesday, September 6, 2022 at 15:48 To: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> mailto:amd-gfx@lists.freedesktop.org>> Cc: Chai, Thomas mailto:yipeng.c...@amd.com>>, Zhang, Hawking mailto:hawking.zh...@amd.com>>, Zhou1, Tao mailto:tao.zh...@amd.com>>, Wang, Yang(Kevin) mailto:kevinyang.w...@amd.com>>, Chai, Thomas mailto:yipeng.c...@amd.com>> Subject: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2 Adjust removal control flow for smu v13_0_2: During amdgpu uninstallation, when removing the first device, the kernel needs to first send a mode1reset message to all gpu devices. Otherwise, smu initialization will fail the next time amdgpu is installed. V2: 1. Update commit comments. 2. Remove the global variable amdgpu_device_remove_cnt and add a variable to the structure amdgpu_hive_info. 3. Use hive to detect the first removed device instead of a global variable. Signed-off-by: YiPeng Chai mailto:yipeng.c...@amd.com>> --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 3 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 +- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 35 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c| 16 - drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h | 1 + drivers/gpu/drm/amd/pm/amdgpu_pm.c | 6 +++- 7 files changed, 99 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 79bb6fd83094..465295318830 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -997,6 +997,9 @@ struct amdgpu_device { boolin_s4; boolin_s0ix; + /* uninstall */ + boolin_remove; + enum pp_mp1_state mp1_state; struct amdgpu_doorbell_index doorbell_index; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 62b26f0e37b0..1402717673f7 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -2999,6 +2999,13 @@ static int amdgpu_device_ip_suspend_phase2(struct amdgpu_device *adev) DRM_ERROR("suspend of IP block <%s> failed %d\n", adev->ip_blocks[i].version->funcs->name, r); } + + /* If in_remove is true, psp_hw_fini should be executed after +* psp_suspend to free psp shared buffers. +*/ + if (adev->in_remove && (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_PSP)) + continue; + adev->ip_blocks[i].status.hw = false; /* handle putting the SMC in the appropriate state */ if(!amdgpu_sriov_vf(adev)){ @@ -4739,6 +4746,7 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle, struct amdgpu_device *tmp_adev = NULL; bool need_full_reset, skip_hw_reset, vram_lost = false; int r = 0; + bool gpu_reset_for_dev_remove = 0; /* Try reset handler method first */ tmp_adev = list_first_entry(device_list_handle, struct amdgpu_devi
Re: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2
[AMD Official Use Only - General] +static void amdgpu_device_gpu_reset(struct amdgpu_device *adev) +{ + struct amdgpu_reset_context reset_context; + + memset(_context, 0, sizeof(reset_context)); + reset_context.method = AMD_RESET_METHOD_NONE; + reset_context.reset_req_dev = adev; + set_bit(AMDGPU_NEED_FULL_RESET, _context.flags); + set_bit(AMDGPU_RESET_FOR_DEVICE_REMOVE, _context.flags); + + amdgpu_device_gpu_recover(adev, NULL, _context); +} This wrapper is kind of confusing. Let’s keep amdgpu_device_gpu_recover as the only entry point for recovery handling. If possible, please drop this wrapper, initialize reset_context and call amdgpu_device_gpu_recover directly + /* If in_remove is true, psp_hw_fini should be executed after +* psp_suspend to free psp shared buffers. +*/ + if (adev->in_remove && (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_PSP)) + continue; Can you please share more details to help me understand the sequence adjustment here? Regards, Hawking From: Chai, Thomas Date: Tuesday, September 6, 2022 at 15:48 To: amd-gfx@lists.freedesktop.org Cc: Chai, Thomas , Zhang, Hawking , Zhou1, Tao , Wang, Yang(Kevin) , Chai, Thomas Subject: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2 Adjust removal control flow for smu v13_0_2: During amdgpu uninstallation, when removing the first device, the kernel needs to first send a mode1reset message to all gpu devices. Otherwise, smu initialization will fail the next time amdgpu is installed. V2: 1. Update commit comments. 2. Remove the global variable amdgpu_device_remove_cnt and add a variable to the structure amdgpu_hive_info. 3. Use hive to detect the first removed device instead of a global variable. Signed-off-by: YiPeng Chai --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 3 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 +- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 35 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c| 16 - drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h | 1 + drivers/gpu/drm/amd/pm/amdgpu_pm.c | 6 +++- 7 files changed, 99 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 79bb6fd83094..465295318830 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -997,6 +997,9 @@ struct amdgpu_device { boolin_s4; boolin_s0ix; + /* uninstall */ + boolin_remove; + enum pp_mp1_state mp1_state; struct amdgpu_doorbell_index doorbell_index; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 62b26f0e37b0..1402717673f7 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -2999,6 +2999,13 @@ static int amdgpu_device_ip_suspend_phase2(struct amdgpu_device *adev) DRM_ERROR("suspend of IP block <%s> failed %d\n", adev->ip_blocks[i].version->funcs->name, r); } + + /* If in_remove is true, psp_hw_fini should be executed after +* psp_suspend to free psp shared buffers. +*/ + if (adev->in_remove && (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_PSP)) + continue; + adev->ip_blocks[i].status.hw = false; /* handle putting the SMC in the appropriate state */ if(!amdgpu_sriov_vf(adev)){ @@ -4739,6 +4746,7 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle, struct amdgpu_device *tmp_adev = NULL; bool need_full_reset, skip_hw_reset, vram_lost = false; int r = 0; + bool gpu_reset_for_dev_remove = 0; /* Try reset handler method first */ tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device, @@ -4758,6 +4766,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle, test_bit(AMDGPU_NEED_FULL_RESET, _context->flags); skip_hw_reset = test_bit(AMDGPU_SKIP_HW_RESET, _context->flags); + gpu_reset_for_dev_remove = + test_bit(AMDGPU_RESET_FOR_DEVICE_REMOVE, _context->flags) && + test_bit(AMDGPU_NEED_FULL_RESET, _context->flags); + /* * ASIC reset has to be done on all XGMI hive nodes ASAP * to allow proper links negotiation in FW (within 1 sec) @@ -4802,6 +4814,16 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle, amdgpu_ras_intr_cleared(); } + /* Fixed the problem