Re: [PATCH] drm/amdgpu: no DC support for headless chips

2021-12-23 Thread Alex Deucher
On Thu, Dec 23, 2021 at 9:54 PM Chen, Guchun  wrote:
>
> [Public]
>
> For the first two CHIP_HAINAN and CHIP_TOPAZ, using asic_type is fine. But 
> for CHIP_ARCTURUS and CHIP_ALDEBARAN, I wonder if there is any dc hardware 
> harvesting info carried by harvest table in VBIOS. If that's the case, I 
> think we can drop these two, as we can promise it by checking 
> AMD_HARVEST_IP_DMU_MASK in amdgpu_device_has_dc_support.

There is no IP discovery table for these chips, but they don't have
any display IPs in the hardcoded IP discovery info in the driver.  I
don't think this should affect them, but I wasn't sure..

Alex


>
> Regards,
> Guchun
>
> -Original Message-
> From: amd-gfx  On Behalf Of Alex 
> Deucher
> Sent: Friday, December 24, 2021 3:20 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; tarequemd.ha...@yahoo.com
> Subject: [PATCH] drm/amdgpu: no DC support for headless chips
>
> Chips with no display hardware should return false for DC support.
>
> Fixes: f7f12b25823c0d ("drm/amdgpu: default to true in 
> amdgpu_device_asic_has_dc_support")
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 9dc86c5a1cad..58e2034984de 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3166,6 +3166,14 @@ static void amdgpu_device_detect_sriov_bios(struct 
> amdgpu_device *adev)  bool amdgpu_device_asic_has_dc_support(enum 
> amd_asic_type asic_type)  {
> switch (asic_type) {
> +#ifdef CONFIG_DRM_AMDGPU_SI
> +   case CHIP_HAINAN:
> +#endif
> +   case CHIP_TOPAZ:
> +   case CHIP_ARCTURUS:
> +   case CHIP_ALDEBARAN:
> +   /* chips with no display hardware */
> +   return false;
>  #if defined(CONFIG_DRM_AMD_DC)
> case CHIP_TAHITI:
> case CHIP_PITCAIRN:
> --
> 2.33.1


RE: [PATCH] drm/amdgpu: put SMU into proper state on runpm suspending for BOCO capable platform

2021-12-23 Thread Chen, Guchun
[Public]

Never mind. I checked the code in amdgpu_device_supports_boco, it's only true 
on dGPU. So pls skip my question.

Regards,
Guchun

-Original Message-
From: Chen, Guchun 
Sent: Friday, December 24, 2021 12:42 PM
To: Evan Quan ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Quan, Evan 

Subject: RE: [PATCH] drm/amdgpu: put SMU into proper state on runpm suspending 
for BOCO capable platform

[Public]

Thank you Evan. Shall we limit it to dGPU only?

With above concern clarified, the patch is:

Reviewed-by: Guchun Chen 

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Evan Quan
Sent: Friday, December 24, 2021 11:17 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Quan, Evan 
; Chen, Guchun 
Subject: [PATCH] drm/amdgpu: put SMU into proper state on runpm suspending for 
BOCO capable platform

By setting mp1_state as PP_MP1_STATE_UNLOAD, MP1 will do some proper cleanups 
and put itself into a state ready for PNP(which fits the scenario BOCO stands 
for).
That can address some random resuming failure observed on BOCO capable 
platforms.

Signed-off-by: Evan Quan 
Change-Id: I9804c4f04b6d2ef737b076cabf85d2880179efe2
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index e431c7f10755..ad8370b41e74 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2235,12 +2235,27 @@ static int amdgpu_pmops_runtime_suspend(struct device 
*dev)
if (amdgpu_device_supports_px(drm_dev))
drm_dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
 
+   /*
+* By setting mp1_state as PP_MP1_STATE_UNLOAD, MP1 will do some
+* proper cleanups and put itself into a state ready for PNP. That
+* can address some random resuming failure observed on BOCO capable
+* platforms.
+* TODO: this may be also needed for PX capable platform.
+*/
+   if (amdgpu_device_supports_boco(drm_dev))
+   adev->mp1_state = PP_MP1_STATE_UNLOAD;
+
ret = amdgpu_device_suspend(drm_dev, false);
if (ret) {
adev->in_runpm = false;
+   if (amdgpu_device_supports_boco(drm_dev))
+   adev->mp1_state = PP_MP1_STATE_NONE;
return ret;
}
 
+   if (amdgpu_device_supports_boco(drm_dev))
+   adev->mp1_state = PP_MP1_STATE_NONE;
+
if (amdgpu_device_supports_px(drm_dev)) {
/* Only need to handle PCI state in the driver for ATPX
 * PCI core handles it for _PR3.
--
2.29.0


Re: [PATCH] drm/amdgpu: put SMU into proper state on runpm suspending for BOCO capable platform

2021-12-23 Thread Lazar, Lijo




On 12/24/2021 8:46 AM, Evan Quan wrote:

By setting mp1_state as PP_MP1_STATE_UNLOAD, MP1 will do some proper cleanups 
and
put itself into a state ready for PNP(which fits the scenario BOCO stands for).


"BOCO similar to PNP" is not correct. Mention this as a workaround. With 
that changed

Reviewed-by: Lijo Lazar 

Thanks,
Lijo


That can address some random resuming failure observed on BOCO capable 
platforms.

Signed-off-by: Evan Quan 
Change-Id: I9804c4f04b6d2ef737b076cabf85d2880179efe2
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 15 +++
  1 file changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index e431c7f10755..ad8370b41e74 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2235,12 +2235,27 @@ static int amdgpu_pmops_runtime_suspend(struct device 
*dev)
if (amdgpu_device_supports_px(drm_dev))
drm_dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
  
+	/*

+* By setting mp1_state as PP_MP1_STATE_UNLOAD, MP1 will do some
+* proper cleanups and put itself into a state ready for PNP. That
+* can address some random resuming failure observed on BOCO capable
+* platforms.
+* TODO: this may be also needed for PX capable platform.
+*/
+   if (amdgpu_device_supports_boco(drm_dev))
+   adev->mp1_state = PP_MP1_STATE_UNLOAD;
+
ret = amdgpu_device_suspend(drm_dev, false);
if (ret) {
adev->in_runpm = false;
+   if (amdgpu_device_supports_boco(drm_dev))
+   adev->mp1_state = PP_MP1_STATE_NONE;
return ret;
}
  
+	if (amdgpu_device_supports_boco(drm_dev))

+   adev->mp1_state = PP_MP1_STATE_NONE;
+
if (amdgpu_device_supports_px(drm_dev)) {
/* Only need to handle PCI state in the driver for ATPX
 * PCI core handles it for _PR3.



RE: [PATCH] drm/amdgpu: put SMU into proper state on runpm suspending for BOCO capable platform

2021-12-23 Thread Chen, Guchun
[Public]

Thank you Evan. Shall we limit it to dGPU only?

With above concern clarified, the patch is:

Reviewed-by: Guchun Chen 

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Evan Quan
Sent: Friday, December 24, 2021 11:17 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Quan, Evan 
; Chen, Guchun 
Subject: [PATCH] drm/amdgpu: put SMU into proper state on runpm suspending for 
BOCO capable platform

By setting mp1_state as PP_MP1_STATE_UNLOAD, MP1 will do some proper cleanups 
and put itself into a state ready for PNP(which fits the scenario BOCO stands 
for).
That can address some random resuming failure observed on BOCO capable 
platforms.

Signed-off-by: Evan Quan 
Change-Id: I9804c4f04b6d2ef737b076cabf85d2880179efe2
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index e431c7f10755..ad8370b41e74 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2235,12 +2235,27 @@ static int amdgpu_pmops_runtime_suspend(struct device 
*dev)
if (amdgpu_device_supports_px(drm_dev))
drm_dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
 
+   /*
+* By setting mp1_state as PP_MP1_STATE_UNLOAD, MP1 will do some
+* proper cleanups and put itself into a state ready for PNP. That
+* can address some random resuming failure observed on BOCO capable
+* platforms.
+* TODO: this may be also needed for PX capable platform.
+*/
+   if (amdgpu_device_supports_boco(drm_dev))
+   adev->mp1_state = PP_MP1_STATE_UNLOAD;
+
ret = amdgpu_device_suspend(drm_dev, false);
if (ret) {
adev->in_runpm = false;
+   if (amdgpu_device_supports_boco(drm_dev))
+   adev->mp1_state = PP_MP1_STATE_NONE;
return ret;
}
 
+   if (amdgpu_device_supports_boco(drm_dev))
+   adev->mp1_state = PP_MP1_STATE_NONE;
+
if (amdgpu_device_supports_px(drm_dev)) {
/* Only need to handle PCI state in the driver for ATPX
 * PCI core handles it for _PR3.
--
2.29.0


Re: [PATCH] drm/amdgpu: put SMU into proper state on runpm suspending for BOCO capable platform

2021-12-23 Thread Alex Deucher
On Thu, Dec 23, 2021 at 10:17 PM Evan Quan  wrote:
>
> By setting mp1_state as PP_MP1_STATE_UNLOAD, MP1 will do some proper cleanups 
> and
> put itself into a state ready for PNP(which fits the scenario BOCO stands 
> for).
> That can address some random resuming failure observed on BOCO capable 
> platforms.
>
> Signed-off-by: Evan Quan 
> Change-Id: I9804c4f04b6d2ef737b076cabf85d2880179efe2

Acked-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 15 +++
>  1 file changed, 15 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index e431c7f10755..ad8370b41e74 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2235,12 +2235,27 @@ static int amdgpu_pmops_runtime_suspend(struct device 
> *dev)
> if (amdgpu_device_supports_px(drm_dev))
> drm_dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
>
> +   /*
> +* By setting mp1_state as PP_MP1_STATE_UNLOAD, MP1 will do some
> +* proper cleanups and put itself into a state ready for PNP. That
> +* can address some random resuming failure observed on BOCO capable
> +* platforms.
> +* TODO: this may be also needed for PX capable platform.
> +*/
> +   if (amdgpu_device_supports_boco(drm_dev))
> +   adev->mp1_state = PP_MP1_STATE_UNLOAD;
> +
> ret = amdgpu_device_suspend(drm_dev, false);
> if (ret) {
> adev->in_runpm = false;
> +   if (amdgpu_device_supports_boco(drm_dev))
> +   adev->mp1_state = PP_MP1_STATE_NONE;
> return ret;
> }
>
> +   if (amdgpu_device_supports_boco(drm_dev))
> +   adev->mp1_state = PP_MP1_STATE_NONE;
> +
> if (amdgpu_device_supports_px(drm_dev)) {
> /* Only need to handle PCI state in the driver for ATPX
>  * PCI core handles it for _PR3.
> --
> 2.29.0
>


[PATCH] drm/amdgpu: put SMU into proper state on runpm suspending for BOCO capable platform

2021-12-23 Thread Evan Quan
By setting mp1_state as PP_MP1_STATE_UNLOAD, MP1 will do some proper cleanups 
and
put itself into a state ready for PNP(which fits the scenario BOCO stands for).
That can address some random resuming failure observed on BOCO capable 
platforms.

Signed-off-by: Evan Quan 
Change-Id: I9804c4f04b6d2ef737b076cabf85d2880179efe2
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index e431c7f10755..ad8370b41e74 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2235,12 +2235,27 @@ static int amdgpu_pmops_runtime_suspend(struct device 
*dev)
if (amdgpu_device_supports_px(drm_dev))
drm_dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
 
+   /*
+* By setting mp1_state as PP_MP1_STATE_UNLOAD, MP1 will do some
+* proper cleanups and put itself into a state ready for PNP. That
+* can address some random resuming failure observed on BOCO capable
+* platforms.
+* TODO: this may be also needed for PX capable platform.
+*/
+   if (amdgpu_device_supports_boco(drm_dev))
+   adev->mp1_state = PP_MP1_STATE_UNLOAD;
+
ret = amdgpu_device_suspend(drm_dev, false);
if (ret) {
adev->in_runpm = false;
+   if (amdgpu_device_supports_boco(drm_dev))
+   adev->mp1_state = PP_MP1_STATE_NONE;
return ret;
}
 
+   if (amdgpu_device_supports_boco(drm_dev))
+   adev->mp1_state = PP_MP1_STATE_NONE;
+
if (amdgpu_device_supports_px(drm_dev)) {
/* Only need to handle PCI state in the driver for ATPX
 * PCI core handles it for _PR3.
-- 
2.29.0



RE: [PATCH] drm/amdgpu: no DC support for headless chips

2021-12-23 Thread Chen, Guchun
[Public]

For the first two CHIP_HAINAN and CHIP_TOPAZ, using asic_type is fine. But for 
CHIP_ARCTURUS and CHIP_ALDEBARAN, I wonder if there is any dc hardware 
harvesting info carried by harvest table in VBIOS. If that's the case, I think 
we can drop these two, as we can promise it by checking AMD_HARVEST_IP_DMU_MASK 
in amdgpu_device_has_dc_support.

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Friday, December 24, 2021 3:20 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; tarequemd.ha...@yahoo.com
Subject: [PATCH] drm/amdgpu: no DC support for headless chips

Chips with no display hardware should return false for DC support.

Fixes: f7f12b25823c0d ("drm/amdgpu: default to true in 
amdgpu_device_asic_has_dc_support")
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9dc86c5a1cad..58e2034984de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3166,6 +3166,14 @@ static void amdgpu_device_detect_sriov_bios(struct 
amdgpu_device *adev)  bool amdgpu_device_asic_has_dc_support(enum amd_asic_type 
asic_type)  {
switch (asic_type) {
+#ifdef CONFIG_DRM_AMDGPU_SI
+   case CHIP_HAINAN:
+#endif
+   case CHIP_TOPAZ:
+   case CHIP_ARCTURUS:
+   case CHIP_ALDEBARAN:
+   /* chips with no display hardware */
+   return false;
 #if defined(CONFIG_DRM_AMD_DC)
case CHIP_TAHITI:
case CHIP_PITCAIRN:
--
2.33.1


Re: [PATCH v2] drm/amd/display: move calcs folder into DML

2021-12-23 Thread Dhillon, Jasdeep
[AMD Official Use Only]

Hi Isabbasso,

The patch fails to compile when there is No DCN because the calc object files 
fail to generate since dml depends on the CONFIG_DRM_AMD_DC_DCN being enabled 
(Makefile inside dc folder):

ifdef CONFIG_DRM_AMD_DC_DCN
DC_LIBS += dcn20
DC_LIBS += dsc
DC_LIBS += dcn10 dml
DC_LIBS += dcn21
DC_LIBS += dcn30
DC_LIBS += dcn301
DC_LIBS += dcn302
DC_LIBS += dcn303
endif


A few changes need to be made to the patch, which are:

-The Makefile in dc needs the line: DC_LIBS+= dml/calcs
-the Makefile in the calcs folder that the patch deletes can be placed inside 
of dc/dml/calcs instead of adding it to the Makefiles in dc/dml

Could you revise your patch based on these changes.

Regards,
Jasdeep

From: isabba...@riseup.net 
Sent: December 20, 2021 6:23 PM
To: Deucher, Alexander ; Koenig, Christian 
; dan...@ffwll.ch ; Wentland, Harry 
; Siqueira, Rodrigo ; Li, Sun 
peng (Leo) ; Pan, Xinhui ; Zhuo, 
Qingqing (Lillian) ; Dhillon, Jasdeep 
; m...@igalia.com 
Cc: amd-gfx@lists.freedesktop.org ; 
~lkcamp/patc...@lists.sr.ht <~lkcamp/patc...@lists.sr.ht>
Subject: Re: [PATCH v2] drm/amd/display: move calcs folder into DML

On 2021-12-20 20:20, Isabella Basso wrote:
> The calcs folder has FPU code on it, which should be isolated inside the
> DML folder as per 
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.freedesktop.org%2Fseries%2F93042%2Fdata=04%7C01%7Cjasdeep.dhillon%40amd.com%7C01959e019f6e45e25a6208d9c40fc233%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637756394247493762%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=bB4zotGMArbsTzZNDr0u2uw3VBD4jxNornJMol9YJlA%3Dreserved=0.
>
> This commit aims single-handedly to correct the location of such FPU
> code and does not refactor any functions.
>
> Signed-off-by: Isabella Basso 
> ---
>  drivers/gpu/drm/amd/display/dc/Makefile   |  2 +-
>  drivers/gpu/drm/amd/display/dc/calcs/Makefile | 68 ---
>  drivers/gpu/drm/amd/display/dc/dml/Makefile   | 13 +++-
>  .../amd/display/dc/{ => dml}/calcs/bw_fixed.c |  0
>  .../display/dc/{ => dml}/calcs/calcs_logger.h |  0
>  .../display/dc/{ => dml}/calcs/custom_float.c |  0
>  .../display/dc/{ => dml}/calcs/dce_calcs.c|  0
>  .../dc/{ => dml}/calcs/dcn_calc_auto.c|  0
>  .../dc/{ => dml}/calcs/dcn_calc_auto.h|  0
>  .../dc/{ => dml}/calcs/dcn_calc_math.c|  0
>  .../display/dc/{ => dml}/calcs/dcn_calcs.c|  0
>  11 files changed, 13 insertions(+), 70 deletions(-)
>  delete mode 100644 drivers/gpu/drm/amd/display/dc/calcs/Makefile
>  rename drivers/gpu/drm/amd/display/dc/{ => dml}/calcs/bw_fixed.c (100%)
>  rename drivers/gpu/drm/amd/display/dc/{ => dml}/calcs/calcs_logger.h (100%)
>  rename drivers/gpu/drm/amd/display/dc/{ => dml}/calcs/custom_float.c (100%)
>  rename drivers/gpu/drm/amd/display/dc/{ => dml}/calcs/dce_calcs.c (100%)
>  rename drivers/gpu/drm/amd/display/dc/{ => dml}/calcs/dcn_calc_auto.c (100%)
>  rename drivers/gpu/drm/amd/display/dc/{ => dml}/calcs/dcn_calc_auto.h (100%)
>  rename drivers/gpu/drm/amd/display/dc/{ => dml}/calcs/dcn_calc_math.c (100%)
>  rename drivers/gpu/drm/amd/display/dc/{ => dml}/calcs/dcn_calcs.c (100%)
>
> diff --git a/drivers/gpu/drm/amd/display/dc/Makefile
> b/drivers/gpu/drm/amd/display/dc/Makefile
> index b1f0d6260226..1872adc96a00 100644
> --- a/drivers/gpu/drm/amd/display/dc/Makefile
> +++ b/drivers/gpu/drm/amd/display/dc/Makefile
> @@ -23,7 +23,7 @@
>  # Makefile for Display Core (dc) component.
>  #
>
> -DC_LIBS = basics bios calcs clk_mgr dce gpio irq virtual
> +DC_LIBS = basics bios clk_mgr dce gpio irq virtual
>
>  ifdef CONFIG_DRM_AMD_DC_DCN
>  DC_LIBS += dcn20
> diff --git a/drivers/gpu/drm/amd/display/dc/calcs/Makefile
> b/drivers/gpu/drm/amd/display/dc/calcs/Makefile
> deleted file mode 100644
> index f3c00f479e1c..
> --- a/drivers/gpu/drm/amd/display/dc/calcs/Makefile
> +++ /dev/null
> @@ -1,68 +0,0 @@
> -#
> -# Copyright 2017 Advanced Micro Devices, Inc.
> -# Copyright 2019 Raptor Engineering, LLC
> -#
> -# Permission is hereby granted, free of charge, to any person obtaining a
> -# copy of this software and associated documentation files (the "Software"),
> -# to deal in the Software without restriction, including without limitation
> -# the rights to use, copy, modify, merge, publish, distribute, sublicense,
> -# and/or sell copies of the Software, and to permit persons to whom the
> -# Software is furnished to do so, subject to the following conditions:
> -#
> -# The above copyright notice and this permission notice shall be included in
> -# all copies or substantial portions of the Software.
> -#
> -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> -# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> -# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> -# THE COPYRIGHT HOLDER(S) OR 

Re: [bisected][regression] drm: amdgpu: system hangs on tty login prompt

2021-12-23 Thread Alex Deucher
The attached patch should fix this.

Alex

On Thu, Dec 23, 2021 at 2:04 PM Tareque Md.Hanif
 wrote:
>
> Hi,
>
> I have been testing the 5.16-rc6 kernel and it hangs on tty login prompt on 
> my laptop. Sometimes it lets me to login but "startx" hangs the system. The 
> system is booting fine in 5.15.7.
>
> So I bisected the bug.
>
> first bad commit: [f7f12b25823c0dce1165b390522d29f99c4585b4] drm/amdgpu: 
> default to true in amdgpu_device_asic_has_dc_support
>
> Reverting this commit fixes the issue.
>
> Device Information:
> Laptop model: Dell Inspiron 15 5567
> GPU 0: Intel HD Graphics 620
> GPU 1: AMD ATI Radeon R7 M445
>
> Regards,
> Tareque.
>
From 7b076f1ab562211d2f21d22f2ff9920503796f06 Mon Sep 17 00:00:00 2001
From: Alex Deucher 
Date: Thu, 23 Dec 2021 14:13:02 -0500
Subject: [PATCH] drm/amdgpu: no DC support for headless chips

Chips with no display hardware should return false for
DC support.

Fixes: f7f12b25823c0d ("drm/amdgpu: default to true in amdgpu_device_asic_has_dc_support")
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9dc86c5a1cad..58e2034984de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3166,6 +3166,14 @@ static void amdgpu_device_detect_sriov_bios(struct amdgpu_device *adev)
 bool amdgpu_device_asic_has_dc_support(enum amd_asic_type asic_type)
 {
 	switch (asic_type) {
+#ifdef CONFIG_DRM_AMDGPU_SI
+	case CHIP_HAINAN:
+#endif
+	case CHIP_TOPAZ:
+	case CHIP_ARCTURUS:
+	case CHIP_ALDEBARAN:
+		/* chips with no display hardware */
+		return false;
 #if defined(CONFIG_DRM_AMD_DC)
 	case CHIP_TAHITI:
 	case CHIP_PITCAIRN:
-- 
2.33.1



[PATCH] drm/amdgpu: no DC support for headless chips

2021-12-23 Thread Alex Deucher
Chips with no display hardware should return false for
DC support.

Fixes: f7f12b25823c0d ("drm/amdgpu: default to true in 
amdgpu_device_asic_has_dc_support")
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9dc86c5a1cad..58e2034984de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3166,6 +3166,14 @@ static void amdgpu_device_detect_sriov_bios(struct 
amdgpu_device *adev)
 bool amdgpu_device_asic_has_dc_support(enum amd_asic_type asic_type)
 {
switch (asic_type) {
+#ifdef CONFIG_DRM_AMDGPU_SI
+   case CHIP_HAINAN:
+#endif
+   case CHIP_TOPAZ:
+   case CHIP_ARCTURUS:
+   case CHIP_ALDEBARAN:
+   /* chips with no display hardware */
+   return false;
 #if defined(CONFIG_DRM_AMD_DC)
case CHIP_TAHITI:
case CHIP_PITCAIRN:
-- 
2.33.1



[PATCH] SWDEV-311259 - dc: move FPU associated DCN302 code to DML

2021-12-23 Thread Jasdeep Dhillon
[Why & How]
As part of the FPU isolation work documented in
https://patchwork.freedesktop.org/series/93042/, isolate
code that uses FPU in DCN302 to DML, where all FPU code
should locate.

Signed-off-by: Jasdeep Dhillon 
---
 CMakeLists.txt  |   1 +
 dc/dcn302/Makefile  |  43 ++--
 dc/dcn302/dcn302_resource.c | 340 +---
 dc/dcn302/dcn302_resource.h |   3 +
 dc/dml/Makefile |   2 +
 dc/dml/dcn302/dcn302_fpu.c  | 381 
 dc/dml/dcn302/dcn302_fpu.h  |  32 +++
 7 files changed, 448 insertions(+), 354 deletions(-)
 create mode 100644 dc/dml/dcn302/dcn302_fpu.c
 create mode 100644 dc/dml/dcn302/dcn302_fpu.h

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 01c6724df..dfee2cd38 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -243,6 +243,7 @@ list (APPEND srcs "dc/dcn301/dcn301_hubbub.c")
 list (APPEND srcs "dc/dcn302/dcn302_init.c")
 list (APPEND srcs "dc/dcn302/dcn302_hwseq.c")
 list (APPEND srcs "dc/dcn302/dcn302_resource.c")
+list (APPEND srcs "dc/dml/dcn302/dcn302_fpu.c")
 list (APPEND srcs "dc/dcn303/dcn303_init.c")
 list (APPEND srcs "dc/dcn303/dcn303_hwseq.c")
 list (APPEND srcs "dc/dcn303/dcn303_resource.c")
diff --git a/dc/dcn302/Makefile b/dc/dcn302/Makefile
index 101620a88..35a6ffbdd 100644
--- a/dc/dcn302/Makefile
+++ b/dc/dcn302/Makefile
@@ -1,42 +1,37 @@
 #
 # (c) Copyright 2020 Advanced Micro Devices, Inc. All the rights reserved
 #
-#  All rights reserved.  This notice is intended as a precaution against
-#  inadvertent publication and does not imply publication or any waiver
-#  of confidentiality.  The year included in the foregoing notice is the
-#  year of creation of the work.
-#
 #  Authors: AMD
 #
 # Makefile for dcn302.
 
 DCN3_02 = dcn302_init.o dcn302_hwseq.o dcn302_resource.o
 
-ifdef CONFIG_X86
-CFLAGS_$(AMDDALPATH)/dc/dcn302/dcn302_resource.o := -msse
-endif
+#ifdef CONFIG_X86
+#CFLAGS_$(AMDDALPATH)/dc/dcn302/dcn302_resource.o := -msse
+#endif
 
-ifdef CONFIG_PPC64
-CFLAGS_$(AMDDALPATH)/dc/dcn302/dcn302_resource.o := -mhard-float -maltivec
-endif
+#ifdef CONFIG_PPC64
+#CFLAGS_$(AMDDALPATH)/dc/dcn302/dcn302_resource.o := -mhard-float -maltivec
+#endif
 
-ifdef CONFIG_CC_IS_GCC
-ifeq ($(call cc-ifversion, -lt, 0701, y), y)
-IS_OLD_GCC = 1
-endif
-CFLAGS_$(AMDDALPATH)/dc/dcn302/dcn302_resource.o += -mhard-float
-endif
+#ifdef CONFIG_CC_IS_GCC
+#ifeq ($(call cc-ifversion, -lt, 0701, y), y)
+#IS_OLD_GCC = 1
+#endif
+#CFLAGS_$(AMDDALPATH)/dc/dcn302/dcn302_resource.o += -mhard-float
+#endif
 
-ifdef CONFIG_X86
-ifdef IS_OLD_GCC
+#ifdef CONFIG_X86
+#ifdef IS_OLD_GCC
 # Stack alignment mismatch, proceed with caution.
 # GCC < 7.1 cannot compile code using `double` and -mpreferred-stack-boundary=3
 # (8B stack alignment).
-CFLAGS_$(AMDDALPATH)/dc/dcn302/dcn302_resource.o += 
-mpreferred-stack-boundary=4
-else
-CFLAGS_$(AMDDALPATH)/dc/dcn302/dcn302_resource.o += -msse2
-endif
-endif
+#CFLAGS_$(AMDDALPATH)/dc/dcn302/dcn302_resource.o += 
-mpreferred-stack-boundary=4
+#else
+#CFLAGS_$(AMDDALPATH)/dc/dcn302/dcn302_resource.o += -msse2
+#endif
+#endif
 
 AMD_DAL_DCN3_02 = $(addprefix $(AMDDALPATH)/dc/dcn302/,$(DCN3_02))
 
diff --git a/dc/dcn302/dcn302_resource.c b/dc/dcn302/dcn302_resource.c
index a0f02aa9c..8fa290164 100644
--- a/dc/dcn302/dcn302_resource.c
+++ b/dc/dcn302/dcn302_resource.c
@@ -65,6 +65,8 @@
 #include "resource.h"
 #include "vm_helper.h"
 
+#include "dml/dcn302/dcn302_fpu.h"
+
 #include "include_legacy/dcn3/sienna_cichlid_ip_offset.h"
 #include "include_legacy/dcn302/dcn_3_0_2_offset.h"
 #include "include_legacy/dcn302/dcn_3_0_2_sh_mask.h"
@@ -81,164 +83,6 @@
 #define DC_LOGGER_INIT(logger)
 #endif
 
-struct _vcs_dpi_ip_params_st dcn3_02_ip = {
-   .use_min_dcfclk = 0,
-   .clamp_min_dcfclk = 0,
-   .odm_capable = 1,
-   .gpuvm_enable = 1,
-   .hostvm_enable = 0,
-   .gpuvm_max_page_table_levels = 4,
-   .hostvm_max_page_table_levels = 4,
-   .hostvm_cached_page_table_levels = 0,
-   .pte_group_size_bytes = 2048,
-   .num_dsc = 5,
-   .rob_buffer_size_kbytes = 184,
-   .det_buffer_size_kbytes = 184,
-   .dpte_buffer_size_in_pte_reqs_luma = 64,
-   .dpte_buffer_size_in_pte_reqs_chroma = 34,
-   .pde_proc_buffer_size_64k_reqs = 48,
-   .dpp_output_buffer_pixels = 2560,
-   .opp_output_buffer_lines = 1,
-   .pixel_chunk_size_kbytes = 8,
-   .pte_enable = 1,
-   .max_page_table_levels = 2,
-   .pte_chunk_size_kbytes = 2,  // ?
-   .meta_chunk_size_kbytes = 2,
-   .writeback_chunk_size_kbytes = 8,
-   .line_buffer_size_bits = 789504,
-   .is_line_buffer_bpp_fixed = 0,  // ?
-   .line_buffer_fixed_bpp = 0, // ?
-   .dcc_supported = true,
-   

[RFC v3 5/8] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.

2021-12-23 Thread Andrey Grodzovsky
No need to to trigger another work queue inside the work queue.

v3:

Problem:
Extra reset caused by host side FLR notification
following guest side triggered reset.
Fix: Preven qeuing flr_work from mailbox irq if guest
already executing a reset.

Suggested-by: Liu Shaoyun 
Signed-off-by: Andrey Grodzovsky 
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 9 ++---
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 9 ++---
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 9 ++---
 3 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 23b066bcffb2..bdeb8e933bb4 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -276,7 +276,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
if (amdgpu_device_should_recover_gpu(adev)
&& (!amdgpu_device_has_job_running(adev) ||
adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT))
-   amdgpu_device_gpu_recover(adev, NULL);
+   amdgpu_device_gpu_recover_imp(adev, NULL);
 }
 
 static int xgpu_ai_set_mailbox_rcv_irq(struct amdgpu_device *adev,
@@ -301,8 +301,11 @@ static int xgpu_ai_mailbox_rcv_irq(struct amdgpu_device 
*adev,
 
switch (event) {
case IDH_FLR_NOTIFICATION:
-   if (amdgpu_sriov_runtime(adev))
-   schedule_work(>virt.flr_work);
+   if (amdgpu_sriov_runtime(adev) && !amdgpu_in_reset(adev))
+   WARN_ONCE(!queue_work(adev->reset_domain.wq,
+ >virt.flr_work),
+ "Failed to queue work! at %s",
+ __FUNCTION__ );
break;
case IDH_QUERY_ALIVE:
xgpu_ai_mailbox_send_ack(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index a35e6d87e537..dd8dc0f6028c 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -308,7 +308,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
adev->video_timeout == MAX_SCHEDULE_TIMEOUT))
-   amdgpu_device_gpu_recover(adev, NULL);
+   amdgpu_device_gpu_recover_imp(adev, NULL);
 }
 
 static int xgpu_nv_set_mailbox_rcv_irq(struct amdgpu_device *adev,
@@ -336,8 +336,11 @@ static int xgpu_nv_mailbox_rcv_irq(struct amdgpu_device 
*adev,
 
switch (event) {
case IDH_FLR_NOTIFICATION:
-   if (amdgpu_sriov_runtime(adev))
-   schedule_work(>virt.flr_work);
+   if (amdgpu_sriov_runtime(adev) && !amdgpu_in_reset(adev))
+   WARN_ONCE(!queue_work(adev->reset_domain.wq,
+ >virt.flr_work),
+ "Failed to queue work! at %s",
+ __FUNCTION__ );
break;
/* READY_TO_ACCESS_GPU is fetched by kernel polling, IRQ can 
ignore
 * it byfar since that polling thread will handle it,
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index aef9d059ae52..c2afb72f97ac 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -521,7 +521,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct 
*work)
 
/* Trigger recovery due to world switch failure */
if (amdgpu_device_should_recover_gpu(adev))
-   amdgpu_device_gpu_recover(adev, NULL);
+   amdgpu_device_gpu_recover_imp(adev, NULL);
 }
 
 static int xgpu_vi_set_mailbox_rcv_irq(struct amdgpu_device *adev,
@@ -550,8 +550,11 @@ static int xgpu_vi_mailbox_rcv_irq(struct amdgpu_device 
*adev,
r = xgpu_vi_mailbox_rcv_msg(adev, IDH_FLR_NOTIFICATION);
 
/* only handle FLR_NOTIFY now */
-   if (!r)
-   schedule_work(>virt.flr_work);
+   if (!r && !amdgpu_in_reset(adev))
+   WARN_ONCE(!queue_work(adev->reset_domain.wq,
+ >virt.flr_work),
+ "Failed to queue work! at %s",
+ __FUNCTION__ );
}
 
return 0;
-- 
2.25.1



RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2021-12-23 Thread Liu, Shaoyun
[AMD Official Use Only]

I have  a discussion with  Andrey  about this offline.   It seems dangerous  to 
remove the in_gpu_reset and  reset_semm directly inside the  flr_work.  In the 
case when the reset is triggered from host side , gpu need to be locked while 
host perform reset after flr_work reply the host with  READY_TO_RESET. 
The original comments seems need to be updated. 

Regards
Shaoyun.liu
 

-Original Message-
From: amd-gfx  On Behalf Of Andrey 
Grodzovsky
Sent: Wednesday, December 22, 2021 5:14 PM
To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
Cc: Liu, Monk ; Grodzovsky, Andrey 
; Chen, Horace ; Koenig, 
Christian ; dan...@ffwll.ch
Subject: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for 
SRIOV

Since now flr work is serialized against  GPU resets there is no need for this.

Signed-off-by: Andrey Grodzovsky 
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 ---  
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 ---
 2 files changed, 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 487cd654b69e..7d59a66e3988 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, 
virt);
int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
 
-   /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
-* otherwise the mailbox msg will be ruined/reseted by
-* the VF FLR.
-*/
-   if (!down_write_trylock(>reset_sem))
-   return;
-
amdgpu_virt_fini_data_exchange(adev);
-   atomic_set(>in_gpu_reset, 1);
 
xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
@@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
} while (timeout > 1);
 
 flr_done:
-   atomic_set(>in_gpu_reset, 0);
-   up_write(>reset_sem);
-
/* Trigger recovery for world switch failure if no TDR */
if (amdgpu_device_should_recover_gpu(adev)
&& (!amdgpu_device_has_job_running(adev) || diff --git 
a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index e3869067a31d..f82c066c8e8d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, 
virt);
int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
 
-   /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
-* otherwise the mailbox msg will be ruined/reseted by
-* the VF FLR.
-*/
-   if (!down_write_trylock(>reset_sem))
-   return;
-
amdgpu_virt_fini_data_exchange(adev);
-   atomic_set(>in_gpu_reset, 1);
 
xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
@@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
} while (timeout > 1);
 
 flr_done:
-   atomic_set(>in_gpu_reset, 0);
-   up_write(>reset_sem);
-
/* Trigger recovery for world switch failure if no TDR */
if (amdgpu_device_should_recover_gpu(adev)
&& (!amdgpu_device_has_job_running(adev) ||
--
2.25.1


Re: [PATCH] drm/amd/display: fix dereference before NULL check

2021-12-23 Thread Harry Wentland
On 2021-12-16 13:14, José Expósito wrote:
> The "plane_state" pointer was access before checking if it was NULL.
> 
> Avoid a possible NULL pointer dereference by accessing the plane
> address after the check.
> 
> Addresses-Coverity-ID: 1474582 ("Dereference before null check")
> Fixes: 3f68c01be9a22 ("drm/amd/display: add cyan_skillfish display support")
> Signed-off-by: José Expósito 

Reviewed-by: Harry Wentland 

Harry

> ---
>  drivers/gpu/drm/amd/display/dc/dcn201/dcn201_hwseq.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/display/dc/dcn201/dcn201_hwseq.c 
> b/drivers/gpu/drm/amd/display/dc/dcn201/dcn201_hwseq.c
> index cfd09b3f705e..fe22530242d2 100644
> --- a/drivers/gpu/drm/amd/display/dc/dcn201/dcn201_hwseq.c
> +++ b/drivers/gpu/drm/amd/display/dc/dcn201/dcn201_hwseq.c
> @@ -134,11 +134,12 @@ void dcn201_update_plane_addr(const struct dc *dc, 
> struct pipe_ctx *pipe_ctx)
>   PHYSICAL_ADDRESS_LOC addr;
>   struct dc_plane_state *plane_state = pipe_ctx->plane_state;
>   struct dce_hwseq *hws = dc->hwseq;
> - struct dc_plane_address uma = plane_state->address;
> + struct dc_plane_address uma;
>  
>   if (plane_state == NULL)
>   return;
>  
> + uma = plane_state->address;
>   addr_patched = patch_address_for_sbs_tb_stereo(pipe_ctx, );
>  
>   plane_address_in_gpu_space_to_uma(hws, );


Re: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure

2021-12-23 Thread Andrey Grodzovsky

Adding back public list and Leslie specifically.

Lijo is right and it's not MTTR release only, also all the unmaps in 
amdgpu_device_unmap_mmio
since this patch makes call to amdgpu_device_unmap_mmio conditioned on 
device unplugged
we need then to still take care of unmapping even when device is NOT 
unplugged - for this i suggest
to look at 'drm/amdgpu: Unmap all MMIO mappings' and just conditionally 
call all the deleted unmaps

in the patch where the condition is 'if (drm_dev_enter(dev))'.

Andrey

On 2021-12-23 8:47 a.m., Lazar, Lijo wrote:

[AMD Official Use Only]

Actually, I was asking specifically about this -

gmc_v9_0_sw_init -> amdgpu_bo_init->adev->gmc.vram_mtrr = 
arch_phys_wc_add(adev->gmc.aper_base,
 adev->gmc.aper_size);


As per this patch, if the driver load failed due to some error which happens 
during hw_init(), this action is not undone. I was thinking why it's not 
considered important to skip this on driver failure to load. For ex: when some 
MTRR register is reserved for this mapping.

Thanks,
Lijo

-Original Message-
From: Grodzovsky, Andrey 
Sent: Friday, December 17, 2021 9:05 PM
To: Lazar, Lijo 
Subject: Re: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device 
is unplugged to prevent crash in GPU initialization failure

We do unmap on unload, this function is a collection of all MMIO unmapps we 
already do on unload, it's just does them early on in case of device hot 
removal. Before pci_driver.remove callback (amdgpu_pci_remove) finish execution.

Andrey

On 2021-12-17 10:23 a.m., Lazar, Lijo wrote:

[AMD Official Use Only]

As a side note,  even if it's a failed driver load, why it is not important to 
undo the mappings created during driver load? I'm wondering what is the impact 
on a system like MI200 A+A.

Thanks,
Lijo

-Original Message-
From: amd-gfx  On Behalf Of
Andrey Grodzovsky
Sent: Friday, December 17, 2021 8:32 PM
To: Koenig, Christian ; Shi, Leslie
; Pan, Xinhui ; Deucher,
Alexander ; amd-gfx@lists.freedesktop.org
Cc: Chen, Guchun 
Subject: Re: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if
device is unplugged to prevent crash in GPU initialization failure

Reviewed-by: Andrey Grodzovsky 

Andrey

On 2021-12-17 3:49 a.m., Christian König wrote:

Am 17.12.21 um 03:26 schrieb Leslie Shi:

[Why]
In amdgpu_driver_load_kms, when amdgpu_device_init returns error
during driver modprobe, it will start the error handle path
immediately and call into amdgpu_device_unmap_mmio as well to
release mapped VRAM. However, in the following release callback,
driver stills visits the unmapped memory like
vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini. So a kernel crash occurs.

[How]
call amdgpu_device_unmap_mmio() if device is unplugged to prevent
invalid memory address in
vcn_v3_0_sw_fini() when GPU initialization failure.

Signed-off-by: Leslie Shi 

Looks sane to me, but Andrey should probably nood as well.

Acked-by: Christian König 


---
    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
    1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f31caec669e7..f6a47b927cfd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3899,7 +3899,9 @@ void amdgpu_device_fini_hw(struct
amdgpu_device
*adev)
      amdgpu_gart_dummy_page_fini(adev);
    -    amdgpu_device_unmap_mmio(adev);
+    if (drm_dev_is_unplugged(adev_to_drm(adev)))
+    amdgpu_device_unmap_mmio(adev);
+
    }
      void amdgpu_device_fini_sw(struct amdgpu_device *adev)


RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2021-12-23 Thread Liu, Monk
[AMD Official Use Only]

@Chen, Horace @Chen, JingWen @Deng, Emily

Please take a review on Andrey's patch 

Thanks 
---
Monk Liu | Cloud GPU & Virtualization Solution | AMD
---
we are hiring software manager for CVS core team
---

-Original Message-
From: Koenig, Christian  
Sent: Thursday, December 23, 2021 4:42 PM
To: Grodzovsky, Andrey ; 
dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
Cc: dan...@ffwll.ch; Liu, Monk ; Chen, Horace 

Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection 
for SRIOV

Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
> Since now flr work is serialized against  GPU resets there is no need 
> for this.
>
> Signed-off-by: Andrey Grodzovsky 

Acked-by: Christian König 

> ---
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 ---
>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 ---
>   2 files changed, 22 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> index 487cd654b69e..7d59a66e3988 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
> *work)
>   struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, 
> virt);
>   int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>   
> - /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
> -  * otherwise the mailbox msg will be ruined/reseted by
> -  * the VF FLR.
> -  */
> - if (!down_write_trylock(>reset_sem))
> - return;
> -
>   amdgpu_virt_fini_data_exchange(adev);
> - atomic_set(>in_gpu_reset, 1);
>   
>   xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>   
> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
> *work)
>   } while (timeout > 1);
>   
>   flr_done:
> - atomic_set(>in_gpu_reset, 0);
> - up_write(>reset_sem);
> -
>   /* Trigger recovery for world switch failure if no TDR */
>   if (amdgpu_device_should_recover_gpu(adev)
>   && (!amdgpu_device_has_job_running(adev) || diff --git 
> a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c 
> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> index e3869067a31d..f82c066c8e8d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
> *work)
>   struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, 
> virt);
>   int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>   
> - /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
> -  * otherwise the mailbox msg will be ruined/reseted by
> -  * the VF FLR.
> -  */
> - if (!down_write_trylock(>reset_sem))
> - return;
> -
>   amdgpu_virt_fini_data_exchange(adev);
> - atomic_set(>in_gpu_reset, 1);
>   
>   xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>   
> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
> *work)
>   } while (timeout > 1);
>   
>   flr_done:
> - atomic_set(>in_gpu_reset, 0);
> - up_write(>reset_sem);
> -
>   /* Trigger recovery for world switch failure if no TDR */
>   if (amdgpu_device_should_recover_gpu(adev)
>   && (!amdgpu_device_has_job_running(adev) ||


Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2021-12-23 Thread Christian König

Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:

Since now flr work is serialized against  GPU resets
there is no need for this.

Signed-off-by: Andrey Grodzovsky 


Acked-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 ---
  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 ---
  2 files changed, 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 487cd654b69e..7d59a66e3988 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, 
virt);
int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
  
-	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,

-* otherwise the mailbox msg will be ruined/reseted by
-* the VF FLR.
-*/
-   if (!down_write_trylock(>reset_sem))
-   return;
-
amdgpu_virt_fini_data_exchange(adev);
-   atomic_set(>in_gpu_reset, 1);
  
  	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
  
@@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)

} while (timeout > 1);
  
  flr_done:

-   atomic_set(>in_gpu_reset, 0);
-   up_write(>reset_sem);
-
/* Trigger recovery for world switch failure if no TDR */
if (amdgpu_device_should_recover_gpu(adev)
&& (!amdgpu_device_has_job_running(adev) ||
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index e3869067a31d..f82c066c8e8d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, 
virt);
int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
  
-	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,

-* otherwise the mailbox msg will be ruined/reseted by
-* the VF FLR.
-*/
-   if (!down_write_trylock(>reset_sem))
-   return;
-
amdgpu_virt_fini_data_exchange(adev);
-   atomic_set(>in_gpu_reset, 1);
  
  	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
  
@@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)

} while (timeout > 1);
  
  flr_done:

-   atomic_set(>in_gpu_reset, 0);
-   up_write(>reset_sem);
-
/* Trigger recovery for world switch failure if no TDR */
if (amdgpu_device_should_recover_gpu(adev)
&& (!amdgpu_device_has_job_running(adev) ||




Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2021-12-23 Thread Christian König

Am 22.12.21 um 23:05 schrieb Andrey Grodzovsky:

Use reset domain wq also for non TDR gpu recovery trigers
such as sysfs and RAS. We must serialize all possible
GPU recoveries to gurantee no concurrency there.
For TDR call the original recovery function directly since
it's already executed from within the wq. For others just
use a wrapper to qeueue work and wait on it to finish.

v2: Rename to amdgpu_recover_work_struct

Signed-off-by: Andrey Grodzovsky 


Reviewed-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  2 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|  2 +-
  3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index b5ff76aae7e0..8e96b9a14452 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1296,6 +1296,8 @@ bool amdgpu_device_has_job_running(struct amdgpu_device 
*adev);
  bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
  int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
  struct amdgpu_job* job);
+int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
+ struct amdgpu_job *job);
  void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
  int amdgpu_device_pci_reset(struct amdgpu_device *adev);
  bool amdgpu_device_need_post(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 7c063fd37389..258ec3c0b2af 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4979,7 +4979,7 @@ static void amdgpu_device_recheck_guilty_jobs(
   * Returns 0 for success or an error on failure.
   */
  
-int amdgpu_device_gpu_recover(struct amdgpu_device *adev,

+int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
  struct amdgpu_job *job)
  {
struct list_head device_list, *device_list_handle =  NULL;
@@ -5237,6 +5237,37 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
return r;
  }
  
+struct amdgpu_recover_work_struct {

+   struct work_struct base;
+   struct amdgpu_device *adev;
+   struct amdgpu_job *job;
+   int ret;
+};
+
+static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
+{
+   struct amdgpu_recover_work_struct *recover_work = container_of(work, 
struct amdgpu_recover_work_struct, base);
+
+   recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, 
recover_work->job);
+}
+/*
+ * Serialize gpu recover into reset domain single threaded wq
+ */
+int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
+   struct amdgpu_job *job)
+{
+   struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};
+
+   INIT_WORK(, amdgpu_device_queue_gpu_recover_work);
+
+   if (!queue_work(adev->reset_domain.wq, ))
+   return -EAGAIN;
+
+   flush_work();
+
+   return work.ret;
+}
+
  /**
   * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
   *
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index bfc47bea23db..38c9fd7b7ad4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -63,7 +63,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct 
drm_sched_job *s_job)
  ti.process_name, ti.tgid, ti.task_name, ti.pid);
  
  	if (amdgpu_device_should_recover_gpu(ring->adev)) {

-   amdgpu_device_gpu_recover(ring->adev, job);
+   amdgpu_device_gpu_recover_imp(ring->adev, job);
} else {
drm_sched_suspend_timeout(>sched);
if (amdgpu_sriov_vf(adev))




Re: [RFC v2 3/8] drm/amdgpu: Fix crash on modprobe

2021-12-23 Thread Christian König




Am 22.12.21 um 23:05 schrieb Andrey Grodzovsky:

Restrict jobs resubmission to suspend case
only since schedulers not initialised yet on
probe.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 5527c68c51de..8ebd954e06c6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -582,7 +582,7 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
if (!ring || !ring->fence_drv.initialized)
continue;
  
-		if (!ring->no_scheduler) {

+   if (adev->in_suspend && !ring->no_scheduler) {


Please add a TODO comment, something like "restructure resume to make 
that unnecessary".


With that done the patch is Reviewed-by: Christian König 
 as well.


Christian.


drm_sched_resubmit_jobs(>sched);
drm_sched_start(>sched, true);
}




Re: [RFC v2 2/8] drm/amdgpu: Move scheduler init to after XGMI is ready

2021-12-23 Thread Christian König

Am 22.12.21 um 23:05 schrieb Andrey Grodzovsky:

Before we initialize schedulers we must know which reset
domain are we in - for single device there iis a single
domain per device and so single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.

Signed-off-by: Andrey Grodzovsky 


Reviewed-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
  3 files changed, 51 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0f3e6c078f88..7c063fd37389 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2284,6 +2284,47 @@ static int amdgpu_device_fw_loading(struct amdgpu_device 
*adev)
return r;
  }
  
+static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)

+{
+   long timeout;
+   int r, i;
+
+   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+   struct amdgpu_ring *ring = adev->rings[i];
+
+   /* No need to setup the GPU scheduler for rings that don't need 
it */
+   if (!ring || ring->no_scheduler)
+   continue;
+
+   switch (ring->funcs->type) {
+   case AMDGPU_RING_TYPE_GFX:
+   timeout = adev->gfx_timeout;
+   break;
+   case AMDGPU_RING_TYPE_COMPUTE:
+   timeout = adev->compute_timeout;
+   break;
+   case AMDGPU_RING_TYPE_SDMA:
+   timeout = adev->sdma_timeout;
+   break;
+   default:
+   timeout = adev->video_timeout;
+   break;
+   }
+
+   r = drm_sched_init(>sched, _sched_ops,
+  ring->num_hw_submission, 
amdgpu_job_hang_limit,
+  timeout, adev->reset_domain.wq, 
ring->sched_score, ring->name);
+   if (r) {
+   DRM_ERROR("Failed to create scheduler on ring %s.\n",
+ ring->name);
+   return r;
+   }
+   }
+
+   return 0;
+}
+
+
  /**
   * amdgpu_device_ip_init - run init for hardware IPs
   *
@@ -2412,6 +2453,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
}
}
  
+	r = amdgpu_device_init_schedulers(adev);

+   if (r)
+   goto init_failed;
+
/* Don't init kfd if whole hive need to be reset during init */
if (!adev->gmc.xgmi.pending_reset)
amdgpu_amdkfd_device_init(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 3b7e86ea7167..5527c68c51de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -456,8 +456,6 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
  atomic_t *sched_score)
  {
struct amdgpu_device *adev = ring->adev;
-   long timeout;
-   int r;
  
  	if (!adev)

return -EINVAL;
@@ -477,36 +475,12 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
*ring,
spin_lock_init(>fence_drv.lock);
ring->fence_drv.fences = kcalloc(num_hw_submission * 2, sizeof(void *),
 GFP_KERNEL);
-   if (!ring->fence_drv.fences)
-   return -ENOMEM;
  
-	/* No need to setup the GPU scheduler for rings that don't need it */

-   if (ring->no_scheduler)
-   return 0;
+   ring->num_hw_submission = num_hw_submission;
+   ring->sched_score = sched_score;
  
-	switch (ring->funcs->type) {

-   case AMDGPU_RING_TYPE_GFX:
-   timeout = adev->gfx_timeout;
-   break;
-   case AMDGPU_RING_TYPE_COMPUTE:
-   timeout = adev->compute_timeout;
-   break;
-   case AMDGPU_RING_TYPE_SDMA:
-   timeout = adev->sdma_timeout;
-   break;
-   default:
-   timeout = adev->video_timeout;
-   break;
-   }
-
-   r = drm_sched_init(>sched, _sched_ops,
-  num_hw_submission, amdgpu_job_hang_limit,
-  timeout, NULL, sched_score, ring->name);
-   if (r) {
-   DRM_ERROR("Failed to create scheduler on ring %s.\n",
- ring->name);
-   return r;
-   }
+   if (!ring->fence_drv.fences)
+   return -ENOMEM;
  
  	return 0;

  }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 4d380e79752c..a4b8279e3011 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++