RE: [PATCH] drm/amdgpu: put SMU into proper state on runpm suspending for BOCO capable platform

2021-12-24 Thread Quan, Evan
[AMD Official Use Only]



> -Original Message-
> From: Lazar, Lijo 
> Sent: Friday, December 24, 2021 12:44 PM
> To: Quan, Evan ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Chen, Guchun
> 
> Subject: Re: [PATCH] drm/amdgpu: put SMU into proper state on runpm
> suspending for BOCO capable platform
> 
> 
> 
> On 12/24/2021 8:46 AM, Evan Quan wrote:
> > By setting mp1_state as PP_MP1_STATE_UNLOAD, MP1 will do some
> proper
> > cleanups and put itself into a state ready for PNP(which fits the scenario
> BOCO stands for).
> 
> "BOCO similar to PNP" is not correct. Mention this as a workaround. With that
> changed
[Quan, Evan] Sorry for the confusing. I did not mean "BOCO" == "PNP". What I 
wanted to express is for BOCO/PNP(unlike BACO), SMU does not have to be alive.
From that perspective, it's reasonable to share the SMU cleanup process 
designed for PNP for BOCO. Anyway thanks for pointing this out. I will drop 
that confusing expression.

BR
Evan
>   Reviewed-by: Lijo Lazar 
> 
> Thanks,
> Lijo
> 
> > That can address some random resuming failure observed on BOCO
> capable platforms.
> >
> > Signed-off-by: Evan Quan 
> > Change-Id: I9804c4f04b6d2ef737b076cabf85d2880179efe2
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 15 +++
> >   1 file changed, 15 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index e431c7f10755..ad8370b41e74 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -2235,12 +2235,27 @@ static int
> amdgpu_pmops_runtime_suspend(struct device *dev)
> > if (amdgpu_device_supports_px(drm_dev))
> > drm_dev->switch_power_state =
> DRM_SWITCH_POWER_CHANGING;
> >
> > +   /*
> > +* By setting mp1_state as PP_MP1_STATE_UNLOAD, MP1 will do
> some
> > +* proper cleanups and put itself into a state ready for PNP. That
> > +* can address some random resuming failure observed on BOCO
> capable
> > +* platforms.
> > +* TODO: this may be also needed for PX capable platform.
> > +*/
> > +   if (amdgpu_device_supports_boco(drm_dev))
> > +   adev->mp1_state = PP_MP1_STATE_UNLOAD;
> > +
> > ret = amdgpu_device_suspend(drm_dev, false);
> > if (ret) {
> > adev->in_runpm = false;
> > +   if (amdgpu_device_supports_boco(drm_dev))
> > +   adev->mp1_state = PP_MP1_STATE_NONE;
> > return ret;
> > }
> >
> > +   if (amdgpu_device_supports_boco(drm_dev))
> > +   adev->mp1_state = PP_MP1_STATE_NONE;
> > +
> > if (amdgpu_device_supports_px(drm_dev)) {
> > /* Only need to handle PCI state in the driver for ATPX
> >  * PCI core handles it for _PR3.
> >


Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2021-12-24 Thread JingWen Chen
I do agree with shaoyun, if the host find the gpu engine hangs first, and do 
the flr, guest side thread may not know this and still try to access HW(e.g. 
kfd is using a lot of amdgpu_in_reset and reset_sem to identify the reset 
status). And this may lead to very bad result.

On 2021/12/24 下午4:58, Deng, Emily wrote:
> These patches look good to me. JingWen will pull these patches and do some 
> basic TDR test on sriov environment, and give feedback.
>
> Best wishes
> Emily Deng
>
>
>
>> -Original Message-
>> From: Liu, Monk 
>> Sent: Thursday, December 23, 2021 6:14 PM
>> To: Koenig, Christian ; Grodzovsky, Andrey
>> ; dri-de...@lists.freedesktop.org; amd-
>> g...@lists.freedesktop.org; Chen, Horace ; Chen,
>> JingWen ; Deng, Emily 
>> Cc: dan...@ffwll.ch
>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>> for SRIOV
>>
>> [AMD Official Use Only]
>>
>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>
>> Please take a review on Andrey's patch
>>
>> Thanks
>> ---
>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>> ---
>> we are hiring software manager for CVS core team
>> ---
>>
>> -Original Message-
>> From: Koenig, Christian 
>> Sent: Thursday, December 23, 2021 4:42 PM
>> To: Grodzovsky, Andrey ; dri-
>> de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>> Cc: dan...@ffwll.ch; Liu, Monk ; Chen, Horace
>> 
>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>> for SRIOV
>>
>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>> Since now flr work is serialized against  GPU resets there is no need
>>> for this.
>>>
>>> Signed-off-by: Andrey Grodzovsky 
>> Acked-by: Christian König 
>>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 ---
>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 ---
>>>   2 files changed, 22 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> index 487cd654b69e..7d59a66e3988 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>> work_struct *work)
>>> struct amdgpu_device *adev = container_of(virt, struct
>> amdgpu_device, virt);
>>> int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>
>>> -   /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>> -* otherwise the mailbox msg will be ruined/reseted by
>>> -* the VF FLR.
>>> -*/
>>> -   if (!down_write_trylock(>reset_sem))
>>> -   return;
>>> -
>>> amdgpu_virt_fini_data_exchange(adev);
>>> -   atomic_set(>in_gpu_reset, 1);
>>>
>>> xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>
>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>> work_struct *work)
>>> } while (timeout > 1);
>>>
>>>   flr_done:
>>> -   atomic_set(>in_gpu_reset, 0);
>>> -   up_write(>reset_sem);
>>> -
>>> /* Trigger recovery for world switch failure if no TDR */
>>> if (amdgpu_device_should_recover_gpu(adev)
>>> && (!amdgpu_device_has_job_running(adev) || diff --git
>>> a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> index e3869067a31d..f82c066c8e8d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>> work_struct *work)
>>> struct amdgpu_device *adev = container_of(virt, struct
>> amdgpu_device, virt);
>>> int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>
>>> -   /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>> -* otherwise the mailbox msg will be ruined/reseted by
>>> -* the VF FLR.
>>> -*/
>>> -   if (!down_write_trylock(>reset_sem))
>>> -   return;
>>> -
>>> amdgpu_virt_fini_data_exchange(adev);
>>> -   atomic_set(>in_gpu_reset, 1);
>>>
>>> xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>
>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>> work_struct *work)
>>> } while (timeout > 1);
>>>
>>>   flr_done:
>>> -   atomic_set(>in_gpu_reset, 0);
>>> -   up_write(>reset_sem);
>>> -
>>> /* Trigger recovery for world switch failure if no TDR */
>>> if (amdgpu_device_should_recover_gpu(adev)
>>> && (!amdgpu_device_has_job_running(adev) ||


RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2021-12-24 Thread Deng, Emily
These patches look good to me. JingWen will pull these patches and do some 
basic TDR test on sriov environment, and give feedback.

Best wishes
Emily Deng



>-Original Message-
>From: Liu, Monk 
>Sent: Thursday, December 23, 2021 6:14 PM
>To: Koenig, Christian ; Grodzovsky, Andrey
>; dri-de...@lists.freedesktop.org; amd-
>g...@lists.freedesktop.org; Chen, Horace ; Chen,
>JingWen ; Deng, Emily 
>Cc: dan...@ffwll.ch
>Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>for SRIOV
>
>[AMD Official Use Only]
>
>@Chen, Horace @Chen, JingWen @Deng, Emily
>
>Please take a review on Andrey's patch
>
>Thanks
>---
>Monk Liu | Cloud GPU & Virtualization Solution | AMD
>---
>we are hiring software manager for CVS core team
>---
>
>-Original Message-
>From: Koenig, Christian 
>Sent: Thursday, December 23, 2021 4:42 PM
>To: Grodzovsky, Andrey ; dri-
>de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>Cc: dan...@ffwll.ch; Liu, Monk ; Chen, Horace
>
>Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
>for SRIOV
>
>Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>> Since now flr work is serialized against  GPU resets there is no need
>> for this.
>>
>> Signed-off-by: Andrey Grodzovsky 
>
>Acked-by: Christian König 
>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 ---
>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 ---
>>   2 files changed, 22 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>> index 487cd654b69e..7d59a66e3988 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>work_struct *work)
>>  struct amdgpu_device *adev = container_of(virt, struct
>amdgpu_device, virt);
>>  int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>
>> -/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>> - * otherwise the mailbox msg will be ruined/reseted by
>> - * the VF FLR.
>> - */
>> -if (!down_write_trylock(>reset_sem))
>> -return;
>> -
>>  amdgpu_virt_fini_data_exchange(adev);
>> -atomic_set(>in_gpu_reset, 1);
>>
>>  xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>
>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>work_struct *work)
>>  } while (timeout > 1);
>>
>>   flr_done:
>> -atomic_set(>in_gpu_reset, 0);
>> -up_write(>reset_sem);
>> -
>>  /* Trigger recovery for world switch failure if no TDR */
>>  if (amdgpu_device_should_recover_gpu(adev)
>>  && (!amdgpu_device_has_job_running(adev) || diff --git
>> a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>> index e3869067a31d..f82c066c8e8d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>work_struct *work)
>>  struct amdgpu_device *adev = container_of(virt, struct
>amdgpu_device, virt);
>>  int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>
>> -/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>> - * otherwise the mailbox msg will be ruined/reseted by
>> - * the VF FLR.
>> - */
>> -if (!down_write_trylock(>reset_sem))
>> -return;
>> -
>>  amdgpu_virt_fini_data_exchange(adev);
>> -atomic_set(>in_gpu_reset, 1);
>>
>>  xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>
>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>work_struct *work)
>>  } while (timeout > 1);
>>
>>   flr_done:
>> -atomic_set(>in_gpu_reset, 0);
>> -up_write(>reset_sem);
>> -
>>  /* Trigger recovery for world switch failure if no TDR */
>>  if (amdgpu_device_should_recover_gpu(adev)
>>  && (!amdgpu_device_has_job_running(adev) ||


RE: [PATCH] drm/amdgpu: no DC support for headless chips

2021-12-24 Thread Chen, Guchun
[Public]

Hi Alex,

Thanks for clarification. The patch is: Reviewed-by: Guchun Chen 
 .

My concern is that amdgpu_device_has_dc_support is called at multiple places. 
Before this patch, for ARCTURUS and ALDEBARAN, it goes to default case, and 
returns true by default, but hardcoded IP discovery setting guarantees no DC is 
initialized on those two, so far, it's fine. However, after this patch, 
amdgpu_device_has_dc_support will explicitly return false, and accordingly it 
changed some setting/execution like driver_feature or in suspend/resume. I am 
not pretty sure about the impact. Anyway, we can re-visit it if there is 
regression.

Regards,
Guchun

-Original Message-
From: Alex Deucher  
Sent: Friday, December 24, 2021 2:16 PM
To: Chen, Guchun 
Cc: Deucher, Alexander ; 
amd-gfx@lists.freedesktop.org; tarequemd.ha...@yahoo.com
Subject: Re: [PATCH] drm/amdgpu: no DC support for headless chips

On Thu, Dec 23, 2021 at 9:54 PM Chen, Guchun  wrote:
>
> [Public]
>
> For the first two CHIP_HAINAN and CHIP_TOPAZ, using asic_type is fine. But 
> for CHIP_ARCTURUS and CHIP_ALDEBARAN, I wonder if there is any dc hardware 
> harvesting info carried by harvest table in VBIOS. If that's the case, I 
> think we can drop these two, as we can promise it by checking 
> AMD_HARVEST_IP_DMU_MASK in amdgpu_device_has_dc_support.

There is no IP discovery table for these chips, but they don't have any display 
IPs in the hardcoded IP discovery info in the driver.  I don't think this 
should affect them, but I wasn't sure..

Alex


>
> Regards,
> Guchun
>
> -Original Message-
> From: amd-gfx  On Behalf Of 
> Alex Deucher
> Sent: Friday, December 24, 2021 3:20 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; 
> tarequemd.ha...@yahoo.com
> Subject: [PATCH] drm/amdgpu: no DC support for headless chips
>
> Chips with no display hardware should return false for DC support.
>
> Fixes: f7f12b25823c0d ("drm/amdgpu: default to true in 
> amdgpu_device_asic_has_dc_support")
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 9dc86c5a1cad..58e2034984de 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3166,6 +3166,14 @@ static void amdgpu_device_detect_sriov_bios(struct 
> amdgpu_device *adev)  bool amdgpu_device_asic_has_dc_support(enum 
> amd_asic_type asic_type)  {
> switch (asic_type) {
> +#ifdef CONFIG_DRM_AMDGPU_SI
> +   case CHIP_HAINAN:
> +#endif
> +   case CHIP_TOPAZ:
> +   case CHIP_ARCTURUS:
> +   case CHIP_ALDEBARAN:
> +   /* chips with no display hardware */
> +   return false;
>  #if defined(CONFIG_DRM_AMD_DC)
> case CHIP_TAHITI:
> case CHIP_PITCAIRN:
> --
> 2.33.1