RE: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2

2022-09-06 Thread Chai, Thomas
[AMD Official Use Only - General]

Yes, I will add the sequence adjustment to the comment.


-
Best Regards,
Thomas

From: Zhang, Hawking 
Sent: Wednesday, September 7, 2022 11:42 AM
To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
Cc: Zhou1, Tao ; Wang, Yang(Kevin) 
Subject: Re: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2

Thanks.

Can you please share more details to help me understand the sequence adjustment 
in suspend?

Regards,
Hawking

From: Chai, Thomas mailto:yipeng.c...@amd.com>>
Date: Wednesday, September 7, 2022 at 11:29
To: Zhang, Hawking mailto:hawking.zh...@amd.com>>, 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> 
mailto:amd-gfx@lists.freedesktop.org>>
Cc: Zhou1, Tao mailto:tao.zh...@amd.com>>, Wang, Yang(Kevin) 
mailto:kevinyang.w...@amd.com>>
Subject: RE: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2

[AMD Official Use Only - General]

OK, I will update patch.


-
Best Regards,
Thomas

From: Zhang, Hawking mailto:hawking.zh...@amd.com>>
Sent: Wednesday, September 7, 2022 10:40 AM
To: Chai, Thomas mailto:yipeng.c...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Cc: Chai, Thomas mailto:yipeng.c...@amd.com>>; Zhou1, Tao 
mailto:tao.zh...@amd.com>>; Wang, Yang(Kevin) 
mailto:kevinyang.w...@amd.com>>
Subject: Re: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2


[AMD Official Use Only - General]

+static void amdgpu_device_gpu_reset(struct amdgpu_device *adev)
+{
+   struct amdgpu_reset_context reset_context;
+
+   memset(_context, 0, sizeof(reset_context));
+   reset_context.method = AMD_RESET_METHOD_NONE;
+   reset_context.reset_req_dev = adev;
+   set_bit(AMDGPU_NEED_FULL_RESET, _context.flags);
+   set_bit(AMDGPU_RESET_FOR_DEVICE_REMOVE, _context.flags);
+
+   amdgpu_device_gpu_recover(adev, NULL, _context);
+}

This wrapper is kind of confusing. Let's keep amdgpu_device_gpu_recover as the 
only entry point for recovery handling. If possible, please drop this wrapper,  
initialize reset_context and call amdgpu_device_gpu_recover directly


+   /* If in_remove is true, psp_hw_fini should be executed after
+*  psp_suspend to free psp shared buffers.
+*/
+   if (adev->in_remove && (adev->ip_blocks[i].version->type == 
AMD_IP_BLOCK_TYPE_PSP))
+   continue;
Can you please share more details to help me understand the sequence adjustment 
here?

Regards,
Hawking

From: Chai, Thomas mailto:yipeng.c...@amd.com>>
Date: Tuesday, September 6, 2022 at 15:48
To: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> 
mailto:amd-gfx@lists.freedesktop.org>>
Cc: Chai, Thomas mailto:yipeng.c...@amd.com>>, Zhang, 
Hawking mailto:hawking.zh...@amd.com>>, Zhou1, Tao 
mailto:tao.zh...@amd.com>>, Wang, Yang(Kevin) 
mailto:kevinyang.w...@amd.com>>, Chai, Thomas 
mailto:yipeng.c...@amd.com>>
Subject: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2
Adjust removal control flow for smu v13_0_2:
   During amdgpu uninstallation, when removing the first
device, the kernel needs to first send a mode1reset message
to all gpu devices. Otherwise, smu initialization will fail
the next time amdgpu is installed.

V2:
1. Update commit comments.
2. Remove the global variable amdgpu_device_remove_cnt
   and add a variable to the structure amdgpu_hive_info.
3. Use hive to detect the first removed device instead of
   a global variable.

Signed-off-by: YiPeng Chai mailto:yipeng.c...@amd.com>>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  3 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 35 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c| 16 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  1 +
 drivers/gpu/drm/amd/pm/amdgpu_pm.c |  6 +++-
 7 files changed, 99 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 79bb6fd83094..465295318830 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -997,6 +997,9 @@ struct amdgpu_device {
 boolin_s4;
 boolin_s0ix;

+   /* uninstall */
+   boolin_remove;
+
 enum pp_mp1_state   mp1_state;
 struct amdgpu_doorbell_index doorbell_index;

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 62b26f0e37b0..1402717673f7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_devi

Re: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2

2022-09-06 Thread Zhang, Hawking
Thanks.

Can you please share more details to help me understand the sequence adjustment 
in suspend?

Regards,
Hawking

From: Chai, Thomas 
Date: Wednesday, September 7, 2022 at 11:29
To: Zhang, Hawking , amd-gfx@lists.freedesktop.org 

Cc: Zhou1, Tao , Wang, Yang(Kevin) 
Subject: RE: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2

[AMD Official Use Only - General]

OK, I will update patch.


-
Best Regards,
Thomas

From: Zhang, Hawking 
Sent: Wednesday, September 7, 2022 10:40 AM
To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
Cc: Chai, Thomas ; Zhou1, Tao ; Wang, 
Yang(Kevin) 
Subject: Re: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2


[AMD Official Use Only - General]

+static void amdgpu_device_gpu_reset(struct amdgpu_device *adev)
+{
+   struct amdgpu_reset_context reset_context;
+
+   memset(_context, 0, sizeof(reset_context));
+   reset_context.method = AMD_RESET_METHOD_NONE;
+   reset_context.reset_req_dev = adev;
+   set_bit(AMDGPU_NEED_FULL_RESET, _context.flags);
+   set_bit(AMDGPU_RESET_FOR_DEVICE_REMOVE, _context.flags);
+
+   amdgpu_device_gpu_recover(adev, NULL, _context);
+}

This wrapper is kind of confusing. Let’s keep amdgpu_device_gpu_recover as the 
only entry point for recovery handling. If possible, please drop this wrapper,  
initialize reset_context and call amdgpu_device_gpu_recover directly


+   /* If in_remove is true, psp_hw_fini should be executed after
+*  psp_suspend to free psp shared buffers.
+*/
+   if (adev->in_remove && (adev->ip_blocks[i].version->type == 
AMD_IP_BLOCK_TYPE_PSP))
+   continue;
Can you please share more details to help me understand the sequence adjustment 
here?

Regards,
Hawking

From: Chai, Thomas mailto:yipeng.c...@amd.com>>
Date: Tuesday, September 6, 2022 at 15:48
To: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> 
mailto:amd-gfx@lists.freedesktop.org>>
Cc: Chai, Thomas mailto:yipeng.c...@amd.com>>, Zhang, 
Hawking mailto:hawking.zh...@amd.com>>, Zhou1, Tao 
mailto:tao.zh...@amd.com>>, Wang, Yang(Kevin) 
mailto:kevinyang.w...@amd.com>>, Chai, Thomas 
mailto:yipeng.c...@amd.com>>
Subject: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2
Adjust removal control flow for smu v13_0_2:
   During amdgpu uninstallation, when removing the first
device, the kernel needs to first send a mode1reset message
to all gpu devices. Otherwise, smu initialization will fail
the next time amdgpu is installed.

V2:
1. Update commit comments.
2. Remove the global variable amdgpu_device_remove_cnt
   and add a variable to the structure amdgpu_hive_info.
3. Use hive to detect the first removed device instead of
   a global variable.

Signed-off-by: YiPeng Chai mailto:yipeng.c...@amd.com>>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  3 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 35 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c| 16 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  1 +
 drivers/gpu/drm/amd/pm/amdgpu_pm.c |  6 +++-
 7 files changed, 99 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 79bb6fd83094..465295318830 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -997,6 +997,9 @@ struct amdgpu_device {
 boolin_s4;
 boolin_s0ix;

+   /* uninstall */
+   boolin_remove;
+
 enum pp_mp1_state   mp1_state;
 struct amdgpu_doorbell_index doorbell_index;

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 62b26f0e37b0..1402717673f7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2999,6 +2999,13 @@ static int amdgpu_device_ip_suspend_phase2(struct 
amdgpu_device *adev)
 DRM_ERROR("suspend of IP block <%s> failed %d\n",
   adev->ip_blocks[i].version->funcs->name, r);
 }
+
+   /* If in_remove is true, psp_hw_fini should be executed after
+*  psp_suspend to free psp shared buffers.
+*/
+   if (adev->in_remove && (adev->ip_blocks[i].version->type == 
AMD_IP_BLOCK_TYPE_PSP))
+   continue;
+
 adev->ip_blocks[i].status.hw = false;
 /* handle putting the SMC in the appropriate state */
 if(!amdgpu_sriov_vf(adev)){
@@ -4739,6 +4746

RE: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2

2022-09-06 Thread Chai, Thomas
[AMD Official Use Only - General]

OK, I will update patch.


-
Best Regards,
Thomas

From: Zhang, Hawking 
Sent: Wednesday, September 7, 2022 10:40 AM
To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
Cc: Chai, Thomas ; Zhou1, Tao ; Wang, 
Yang(Kevin) 
Subject: Re: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2


[AMD Official Use Only - General]

+static void amdgpu_device_gpu_reset(struct amdgpu_device *adev)
+{
+   struct amdgpu_reset_context reset_context;
+
+   memset(_context, 0, sizeof(reset_context));
+   reset_context.method = AMD_RESET_METHOD_NONE;
+   reset_context.reset_req_dev = adev;
+   set_bit(AMDGPU_NEED_FULL_RESET, _context.flags);
+   set_bit(AMDGPU_RESET_FOR_DEVICE_REMOVE, _context.flags);
+
+   amdgpu_device_gpu_recover(adev, NULL, _context);
+}

This wrapper is kind of confusing. Let's keep amdgpu_device_gpu_recover as the 
only entry point for recovery handling. If possible, please drop this wrapper,  
initialize reset_context and call amdgpu_device_gpu_recover directly


+   /* If in_remove is true, psp_hw_fini should be executed after
+*  psp_suspend to free psp shared buffers.
+*/
+   if (adev->in_remove && (adev->ip_blocks[i].version->type == 
AMD_IP_BLOCK_TYPE_PSP))
+   continue;
Can you please share more details to help me understand the sequence adjustment 
here?

Regards,
Hawking

From: Chai, Thomas mailto:yipeng.c...@amd.com>>
Date: Tuesday, September 6, 2022 at 15:48
To: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> 
mailto:amd-gfx@lists.freedesktop.org>>
Cc: Chai, Thomas mailto:yipeng.c...@amd.com>>, Zhang, 
Hawking mailto:hawking.zh...@amd.com>>, Zhou1, Tao 
mailto:tao.zh...@amd.com>>, Wang, Yang(Kevin) 
mailto:kevinyang.w...@amd.com>>, Chai, Thomas 
mailto:yipeng.c...@amd.com>>
Subject: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2
Adjust removal control flow for smu v13_0_2:
   During amdgpu uninstallation, when removing the first
device, the kernel needs to first send a mode1reset message
to all gpu devices. Otherwise, smu initialization will fail
the next time amdgpu is installed.

V2:
1. Update commit comments.
2. Remove the global variable amdgpu_device_remove_cnt
   and add a variable to the structure amdgpu_hive_info.
3. Use hive to detect the first removed device instead of
   a global variable.

Signed-off-by: YiPeng Chai mailto:yipeng.c...@amd.com>>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  3 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 35 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c| 16 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  1 +
 drivers/gpu/drm/amd/pm/amdgpu_pm.c |  6 +++-
 7 files changed, 99 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 79bb6fd83094..465295318830 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -997,6 +997,9 @@ struct amdgpu_device {
 boolin_s4;
 boolin_s0ix;

+   /* uninstall */
+   boolin_remove;
+
 enum pp_mp1_state   mp1_state;
 struct amdgpu_doorbell_index doorbell_index;

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 62b26f0e37b0..1402717673f7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2999,6 +2999,13 @@ static int amdgpu_device_ip_suspend_phase2(struct 
amdgpu_device *adev)
 DRM_ERROR("suspend of IP block <%s> failed %d\n",
   adev->ip_blocks[i].version->funcs->name, r);
 }
+
+   /* If in_remove is true, psp_hw_fini should be executed after
+*  psp_suspend to free psp shared buffers.
+*/
+   if (adev->in_remove && (adev->ip_blocks[i].version->type == 
AMD_IP_BLOCK_TYPE_PSP))
+   continue;
+
 adev->ip_blocks[i].status.hw = false;
 /* handle putting the SMC in the appropriate state */
 if(!amdgpu_sriov_vf(adev)){
@@ -4739,6 +4746,7 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,
 struct amdgpu_device *tmp_adev = NULL;
 bool need_full_reset, skip_hw_reset, vram_lost = false;
 int r = 0;
+   bool gpu_reset_for_dev_remove = 0;

 /* Try reset handler method first */
 tmp_adev = list_first_entry(device_list_handle, struct amdgpu_devi

Re: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2

2022-09-06 Thread Zhang, Hawking
[AMD Official Use Only - General]

+static void amdgpu_device_gpu_reset(struct amdgpu_device *adev)
+{
+   struct amdgpu_reset_context reset_context;
+
+   memset(_context, 0, sizeof(reset_context));
+   reset_context.method = AMD_RESET_METHOD_NONE;
+   reset_context.reset_req_dev = adev;
+   set_bit(AMDGPU_NEED_FULL_RESET, _context.flags);
+   set_bit(AMDGPU_RESET_FOR_DEVICE_REMOVE, _context.flags);
+
+   amdgpu_device_gpu_recover(adev, NULL, _context);
+}

This wrapper is kind of confusing. Let’s keep amdgpu_device_gpu_recover as the 
only entry point for recovery handling. If possible, please drop this wrapper,  
initialize reset_context and call amdgpu_device_gpu_recover directly


+   /* If in_remove is true, psp_hw_fini should be executed after
+*  psp_suspend to free psp shared buffers.
+*/
+   if (adev->in_remove && (adev->ip_blocks[i].version->type == 
AMD_IP_BLOCK_TYPE_PSP))
+   continue;

Can you please share more details to help me understand the sequence adjustment 
here?

Regards,
Hawking

From: Chai, Thomas 
Date: Tuesday, September 6, 2022 at 15:48
To: amd-gfx@lists.freedesktop.org 
Cc: Chai, Thomas , Zhang, Hawking , 
Zhou1, Tao , Wang, Yang(Kevin) , 
Chai, Thomas 
Subject: [PATCH V2] drm/amdgpu: Adjust removal control flow for smu v13_0_2
Adjust removal control flow for smu v13_0_2:
   During amdgpu uninstallation, when removing the first
device, the kernel needs to first send a mode1reset message
to all gpu devices. Otherwise, smu initialization will fail
the next time amdgpu is installed.

V2:
1. Update commit comments.
2. Remove the global variable amdgpu_device_remove_cnt
   and add a variable to the structure amdgpu_hive_info.
3. Use hive to detect the first removed device instead of
   a global variable.

Signed-off-by: YiPeng Chai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  3 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 35 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c| 16 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  1 +
 drivers/gpu/drm/amd/pm/amdgpu_pm.c |  6 +++-
 7 files changed, 99 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 79bb6fd83094..465295318830 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -997,6 +997,9 @@ struct amdgpu_device {
 boolin_s4;
 boolin_s0ix;

+   /* uninstall */
+   boolin_remove;
+
 enum pp_mp1_state   mp1_state;
 struct amdgpu_doorbell_index doorbell_index;

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 62b26f0e37b0..1402717673f7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2999,6 +2999,13 @@ static int amdgpu_device_ip_suspend_phase2(struct 
amdgpu_device *adev)
 DRM_ERROR("suspend of IP block <%s> failed %d\n",
   adev->ip_blocks[i].version->funcs->name, r);
 }
+
+   /* If in_remove is true, psp_hw_fini should be executed after
+*  psp_suspend to free psp shared buffers.
+*/
+   if (adev->in_remove && (adev->ip_blocks[i].version->type == 
AMD_IP_BLOCK_TYPE_PSP))
+   continue;
+
 adev->ip_blocks[i].status.hw = false;
 /* handle putting the SMC in the appropriate state */
 if(!amdgpu_sriov_vf(adev)){
@@ -4739,6 +4746,7 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,
 struct amdgpu_device *tmp_adev = NULL;
 bool need_full_reset, skip_hw_reset, vram_lost = false;
 int r = 0;
+   bool gpu_reset_for_dev_remove = 0;

 /* Try reset handler method first */
 tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
@@ -4758,6 +4766,10 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,
 test_bit(AMDGPU_NEED_FULL_RESET, _context->flags);
 skip_hw_reset = test_bit(AMDGPU_SKIP_HW_RESET, _context->flags);

+   gpu_reset_for_dev_remove =
+   test_bit(AMDGPU_RESET_FOR_DEVICE_REMOVE, _context->flags) 
&&
+   test_bit(AMDGPU_NEED_FULL_RESET, _context->flags);
+
 /*
  * ASIC reset has to be done on all XGMI hive nodes ASAP
  * to allow proper links negotiation in FW (within 1 sec)
@@ -4802,6 +4814,16 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,
 amdgpu_ras_intr_cleared();
 }

+   /* Fixed the problem