RE: [PATCH] drm/amdgpu/nbio7.4: use original HDP_FLUSH bits for navi1x

2021-10-21 Thread Chen, Guchun
[Public]

Reviewed-and-tested-by: Guchun Chen 

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Friday, October 22, 2021 12:30 PM
To: Deucher, Alexander 
Cc: amd-gfx list 
Subject: Re: [PATCH] drm/amdgpu/nbio7.4: use original HDP_FLUSH bits for navi1x

On Fri, Oct 22, 2021 at 12:21 AM Alex Deucher  wrote:
>

Copy paste typo in the patch title fixed locally.

> The extended bits were not available for use on vega20 and presumably 
> arcturus as well.
>
> Fixes: a0f9f854666834 ("drm/amdgpu/nbio7.4: don't use GPU_HDP_FLUSH 
> bit 12")
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c |  5 -
>  drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c| 15 +++
>  drivers/gpu/drm/amd/amdgpu/nbio_v7_4.h|  1 +
>  3 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> index 814e9620fac5..208a784475bd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> @@ -1125,10 +1125,13 @@ int amdgpu_discovery_set_ip_blocks(struct 
> amdgpu_device *adev)
> break;
> case IP_VERSION(7, 4, 0):
> case IP_VERSION(7, 4, 1):
> -   case IP_VERSION(7, 4, 4):
> adev->nbio.funcs = &nbio_v7_4_funcs;
> adev->nbio.hdp_flush_reg = &nbio_v7_4_hdp_flush_reg;
> break;
> +   case IP_VERSION(7, 4, 4):
> +   adev->nbio.funcs = &nbio_v7_4_funcs;
> +   adev->nbio.hdp_flush_reg = &nbio_v7_4_hdp_flush_reg_ald;
> +   break;
> case IP_VERSION(7, 2, 0):
> case IP_VERSION(7, 2, 1):
> case IP_VERSION(7, 5, 0):
> diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c 
> b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> index 3b7775d74bb2..b8bd03d16dba 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> +++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> @@ -325,6 +325,21 @@ static u32 nbio_v7_4_get_pcie_data_offset(struct 
> amdgpu_device *adev)  }
>
>  const struct nbio_hdp_flush_reg nbio_v7_4_hdp_flush_reg = {
> +   .ref_and_mask_cp0 = GPU_HDP_FLUSH_DONE__CP0_MASK,
> +   .ref_and_mask_cp1 = GPU_HDP_FLUSH_DONE__CP1_MASK,
> +   .ref_and_mask_cp2 = GPU_HDP_FLUSH_DONE__CP2_MASK,
> +   .ref_and_mask_cp3 = GPU_HDP_FLUSH_DONE__CP3_MASK,
> +   .ref_and_mask_cp4 = GPU_HDP_FLUSH_DONE__CP4_MASK,
> +   .ref_and_mask_cp5 = GPU_HDP_FLUSH_DONE__CP5_MASK,
> +   .ref_and_mask_cp6 = GPU_HDP_FLUSH_DONE__CP6_MASK,
> +   .ref_and_mask_cp7 = GPU_HDP_FLUSH_DONE__CP7_MASK,
> +   .ref_and_mask_cp8 = GPU_HDP_FLUSH_DONE__CP8_MASK,
> +   .ref_and_mask_cp9 = GPU_HDP_FLUSH_DONE__CP9_MASK,
> +   .ref_and_mask_sdma0 = GPU_HDP_FLUSH_DONE__SDMA0_MASK,
> +   .ref_and_mask_sdma1 = GPU_HDP_FLUSH_DONE__SDMA1_MASK, };
> +
> +const struct nbio_hdp_flush_reg nbio_v7_4_hdp_flush_reg_ald = {
> .ref_and_mask_cp0 = GPU_HDP_FLUSH_DONE__CP0_MASK,
> .ref_and_mask_cp1 = GPU_HDP_FLUSH_DONE__CP1_MASK,
> .ref_and_mask_cp2 = GPU_HDP_FLUSH_DONE__CP2_MASK, diff --git 
> a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.h 
> b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.h
> index b8216581ec8d..cc5692db6f98 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.h
> +++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.h
> @@ -27,6 +27,7 @@
>  #include "soc15_common.h"
>
>  extern const struct nbio_hdp_flush_reg nbio_v7_4_hdp_flush_reg;
> +extern const struct nbio_hdp_flush_reg nbio_v7_4_hdp_flush_reg_ald;
>  extern const struct amdgpu_nbio_funcs nbio_v7_4_funcs;  extern const 
> struct amdgpu_nbio_ras_funcs nbio_v7_4_ras_funcs;
>
> --
> 2.31.1
>


Re: [PATCH] drm/amdgpu/nbio7.4: use original HDP_FLUSH bits for navi1x

2021-10-21 Thread Alex Deucher
On Fri, Oct 22, 2021 at 12:21 AM Alex Deucher  wrote:
>

Copy paste typo in the patch title fixed locally.

> The extended bits were not available for use on vega20 and
> presumably arcturus as well.
>
> Fixes: a0f9f854666834 ("drm/amdgpu/nbio7.4: don't use GPU_HDP_FLUSH bit 12")
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c |  5 -
>  drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c| 15 +++
>  drivers/gpu/drm/amd/amdgpu/nbio_v7_4.h|  1 +
>  3 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> index 814e9620fac5..208a784475bd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> @@ -1125,10 +1125,13 @@ int amdgpu_discovery_set_ip_blocks(struct 
> amdgpu_device *adev)
> break;
> case IP_VERSION(7, 4, 0):
> case IP_VERSION(7, 4, 1):
> -   case IP_VERSION(7, 4, 4):
> adev->nbio.funcs = &nbio_v7_4_funcs;
> adev->nbio.hdp_flush_reg = &nbio_v7_4_hdp_flush_reg;
> break;
> +   case IP_VERSION(7, 4, 4):
> +   adev->nbio.funcs = &nbio_v7_4_funcs;
> +   adev->nbio.hdp_flush_reg = &nbio_v7_4_hdp_flush_reg_ald;
> +   break;
> case IP_VERSION(7, 2, 0):
> case IP_VERSION(7, 2, 1):
> case IP_VERSION(7, 5, 0):
> diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c 
> b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> index 3b7775d74bb2..b8bd03d16dba 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> +++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> @@ -325,6 +325,21 @@ static u32 nbio_v7_4_get_pcie_data_offset(struct 
> amdgpu_device *adev)
>  }
>
>  const struct nbio_hdp_flush_reg nbio_v7_4_hdp_flush_reg = {
> +   .ref_and_mask_cp0 = GPU_HDP_FLUSH_DONE__CP0_MASK,
> +   .ref_and_mask_cp1 = GPU_HDP_FLUSH_DONE__CP1_MASK,
> +   .ref_and_mask_cp2 = GPU_HDP_FLUSH_DONE__CP2_MASK,
> +   .ref_and_mask_cp3 = GPU_HDP_FLUSH_DONE__CP3_MASK,
> +   .ref_and_mask_cp4 = GPU_HDP_FLUSH_DONE__CP4_MASK,
> +   .ref_and_mask_cp5 = GPU_HDP_FLUSH_DONE__CP5_MASK,
> +   .ref_and_mask_cp6 = GPU_HDP_FLUSH_DONE__CP6_MASK,
> +   .ref_and_mask_cp7 = GPU_HDP_FLUSH_DONE__CP7_MASK,
> +   .ref_and_mask_cp8 = GPU_HDP_FLUSH_DONE__CP8_MASK,
> +   .ref_and_mask_cp9 = GPU_HDP_FLUSH_DONE__CP9_MASK,
> +   .ref_and_mask_sdma0 = GPU_HDP_FLUSH_DONE__SDMA0_MASK,
> +   .ref_and_mask_sdma1 = GPU_HDP_FLUSH_DONE__SDMA1_MASK,
> +};
> +
> +const struct nbio_hdp_flush_reg nbio_v7_4_hdp_flush_reg_ald = {
> .ref_and_mask_cp0 = GPU_HDP_FLUSH_DONE__CP0_MASK,
> .ref_and_mask_cp1 = GPU_HDP_FLUSH_DONE__CP1_MASK,
> .ref_and_mask_cp2 = GPU_HDP_FLUSH_DONE__CP2_MASK,
> diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.h 
> b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.h
> index b8216581ec8d..cc5692db6f98 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.h
> +++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.h
> @@ -27,6 +27,7 @@
>  #include "soc15_common.h"
>
>  extern const struct nbio_hdp_flush_reg nbio_v7_4_hdp_flush_reg;
> +extern const struct nbio_hdp_flush_reg nbio_v7_4_hdp_flush_reg_ald;
>  extern const struct amdgpu_nbio_funcs nbio_v7_4_funcs;
>  extern const struct amdgpu_nbio_ras_funcs nbio_v7_4_ras_funcs;
>
> --
> 2.31.1
>


Re: [PATCH 1/2] drm/amdgpu/nbio7.4: don't use GPU_HDP_FLUSH bit 12

2021-10-21 Thread Alex Deucher
On Thu, Oct 21, 2021 at 11:53 PM Chen, Guchun  wrote:
>
> [Public]
>
> This patch caused ring test of SDMA failure on Vega20.

Fix sent out.  Sorry for the breakage.

Alex

>
> Oct 12 00:18:24 vega20-ebd-11 kernel: [   11.900968] IPv6: 
> ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.007480] AMD-Vi: AMD IOMMUv2 
> driver by Joerg Roedel 
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.007482] AMD-Vi: AMD IOMMUv2 
> functionality not available on this system
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.069082] [drm] amdgpu kernel 
> modesetting enabled.
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.069226] amdgpu: CRAT table not 
> found
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.069229] amdgpu: Virtual CRAT 
> table created for CPU
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.069288] amdgpu: Topology: Add 
> CPU node
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.069415] checking generic 
> (9000 30) vs hw (9000 1000)
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.069416] fb0: switching to 
> amdgpudrmfb from EFI VGA
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.069700] Console: switching to 
> colour dummy device 80x25
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.069755] amdgpu :03:00.0: 
> vgaarb: deactivate vga console
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070047] amdgpu :03:00.0: 
> enabling device (0006 -> 0007)
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070241] [drm] initializing 
> kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x081E 0x06).
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070244] amdgpu :03:00.0: 
> amdgpu: Trusted Memory Zone (TMZ) feature not supported
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070257] [drm] register mmio 
> base: 0xA030
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070258] [drm] register mmio 
> size: 524288
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070263] [drm] add ip block 
> number 0 
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070264] [drm] add ip block 
> number 1 
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070265] [drm] add ip block 
> number 2 
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070266] [drm] add ip block 
> number 3 
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070266] [drm] add ip block 
> number 4 
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070267] [drm] add ip block 
> number 5 
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070267] [drm] add ip block 
> number 6 
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070268] [drm] add ip block 
> number 7 
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070269] [drm] add ip block 
> number 8 
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070269] [drm] add ip block 
> number 9 
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070286] amdgpu :03:00.0: 
> amdgpu: Fetched VBIOS from VFCT
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070293] amdgpu: ATOM BIOS: 
> 113-D1640600-103
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072517] [drm] UVD(0) is enabled 
> in VM mode
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072519] [drm] UVD(1) is enabled 
> in VM mode
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072520] [drm] UVD(0) ENC is 
> enabled in VM mode
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072520] [drm] UVD(1) ENC is 
> enabled in VM mode
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072521] [drm] VCE enabled in VM 
> mode
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072632] amdgpu :03:00.0: 
> amdgpu: MEM ECC is active.
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072633] amdgpu :03:00.0: 
> amdgpu: SRAM ECC is not presented.
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072651] amdgpu :03:00.0: 
> amdgpu: RAS INFO: ras initialized successfully, hardware ability[105] 
> ras_mask[105]
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072657] [drm] vm size is 262144 
> GB, 4 levels, block size is 9-bit, fragment size is 9-bit
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072668] amdgpu :03:00.0: 
> amdgpu: VRAM: 16368M 0x0080 - 0x0083FEFF (16368M used)
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072669] amdgpu :03:00.0: 
> amdgpu: GART: 512M 0x - 0x1FFF
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072676] amdgpu :03:00.0: 
> amdgpu: AGP: 267894784M 0x0084 - 0x
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072683] [drm] Detected VRAM 
> RAM=16368M, BAR=256M
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072684] [drm] RAM width 4096bits 
> HBM
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072736] [drm] amdgpu: 16368M of 
> VRAM memory ready
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072738] [drm] amdgpu: 16368M of 
> GTT memory ready.
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072745] [drm] GART: num cpu 
> pages 131072, num gpu pages 131072
> Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072819] [drm] PCIE GART of 512M 
> enabled.
> Oct 12 

[PATCH] drm/amdgpu/nbio7.4: use original HDP_FLUSH bits for navi1x

2021-10-21 Thread Alex Deucher
The extended bits were not available for use on vega20 and
presumably arcturus as well.

Fixes: a0f9f854666834 ("drm/amdgpu/nbio7.4: don't use GPU_HDP_FLUSH bit 12")
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c |  5 -
 drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c| 15 +++
 drivers/gpu/drm/amd/amdgpu/nbio_v7_4.h|  1 +
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index 814e9620fac5..208a784475bd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -1125,10 +1125,13 @@ int amdgpu_discovery_set_ip_blocks(struct amdgpu_device 
*adev)
break;
case IP_VERSION(7, 4, 0):
case IP_VERSION(7, 4, 1):
-   case IP_VERSION(7, 4, 4):
adev->nbio.funcs = &nbio_v7_4_funcs;
adev->nbio.hdp_flush_reg = &nbio_v7_4_hdp_flush_reg;
break;
+   case IP_VERSION(7, 4, 4):
+   adev->nbio.funcs = &nbio_v7_4_funcs;
+   adev->nbio.hdp_flush_reg = &nbio_v7_4_hdp_flush_reg_ald;
+   break;
case IP_VERSION(7, 2, 0):
case IP_VERSION(7, 2, 1):
case IP_VERSION(7, 5, 0):
diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c 
b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
index 3b7775d74bb2..b8bd03d16dba 100644
--- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
+++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
@@ -325,6 +325,21 @@ static u32 nbio_v7_4_get_pcie_data_offset(struct 
amdgpu_device *adev)
 }
 
 const struct nbio_hdp_flush_reg nbio_v7_4_hdp_flush_reg = {
+   .ref_and_mask_cp0 = GPU_HDP_FLUSH_DONE__CP0_MASK,
+   .ref_and_mask_cp1 = GPU_HDP_FLUSH_DONE__CP1_MASK,
+   .ref_and_mask_cp2 = GPU_HDP_FLUSH_DONE__CP2_MASK,
+   .ref_and_mask_cp3 = GPU_HDP_FLUSH_DONE__CP3_MASK,
+   .ref_and_mask_cp4 = GPU_HDP_FLUSH_DONE__CP4_MASK,
+   .ref_and_mask_cp5 = GPU_HDP_FLUSH_DONE__CP5_MASK,
+   .ref_and_mask_cp6 = GPU_HDP_FLUSH_DONE__CP6_MASK,
+   .ref_and_mask_cp7 = GPU_HDP_FLUSH_DONE__CP7_MASK,
+   .ref_and_mask_cp8 = GPU_HDP_FLUSH_DONE__CP8_MASK,
+   .ref_and_mask_cp9 = GPU_HDP_FLUSH_DONE__CP9_MASK,
+   .ref_and_mask_sdma0 = GPU_HDP_FLUSH_DONE__SDMA0_MASK,
+   .ref_and_mask_sdma1 = GPU_HDP_FLUSH_DONE__SDMA1_MASK,
+};
+
+const struct nbio_hdp_flush_reg nbio_v7_4_hdp_flush_reg_ald = {
.ref_and_mask_cp0 = GPU_HDP_FLUSH_DONE__CP0_MASK,
.ref_and_mask_cp1 = GPU_HDP_FLUSH_DONE__CP1_MASK,
.ref_and_mask_cp2 = GPU_HDP_FLUSH_DONE__CP2_MASK,
diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.h 
b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.h
index b8216581ec8d..cc5692db6f98 100644
--- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.h
+++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.h
@@ -27,6 +27,7 @@
 #include "soc15_common.h"
 
 extern const struct nbio_hdp_flush_reg nbio_v7_4_hdp_flush_reg;
+extern const struct nbio_hdp_flush_reg nbio_v7_4_hdp_flush_reg_ald;
 extern const struct amdgpu_nbio_funcs nbio_v7_4_funcs;
 extern const struct amdgpu_nbio_ras_funcs nbio_v7_4_ras_funcs;
 
-- 
2.31.1



RE: [PATCH 1/2] drm/amdgpu/nbio7.4: don't use GPU_HDP_FLUSH bit 12

2021-10-21 Thread Chen, Guchun
[Public]

This patch caused ring test of SDMA failure on Vega20.

Oct 12 00:18:24 vega20-ebd-11 kernel: [   11.900968] IPv6: 
ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.007480] AMD-Vi: AMD IOMMUv2 driver 
by Joerg Roedel 
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.007482] AMD-Vi: AMD IOMMUv2 
functionality not available on this system
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.069082] [drm] amdgpu kernel 
modesetting enabled.
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.069226] amdgpu: CRAT table not 
found
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.069229] amdgpu: Virtual CRAT table 
created for CPU
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.069288] amdgpu: Topology: Add CPU 
node
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.069415] checking generic (9000 
30) vs hw (9000 1000)
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.069416] fb0: switching to 
amdgpudrmfb from EFI VGA
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.069700] Console: switching to 
colour dummy device 80x25
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.069755] amdgpu :03:00.0: 
vgaarb: deactivate vga console
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070047] amdgpu :03:00.0: 
enabling device (0006 -> 0007)
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070241] [drm] initializing kernel 
modesetting (VEGA20 0x1002:0x66A1 0x1002:0x081E 0x06).
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070244] amdgpu :03:00.0: 
amdgpu: Trusted Memory Zone (TMZ) feature not supported
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070257] [drm] register mmio base: 
0xA030
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070258] [drm] register mmio size: 
524288
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070263] [drm] add ip block number 
0 
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070264] [drm] add ip block number 
1 
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070265] [drm] add ip block number 
2 
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070266] [drm] add ip block number 
3 
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070266] [drm] add ip block number 
4 
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070267] [drm] add ip block number 
5 
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070267] [drm] add ip block number 
6 
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070268] [drm] add ip block number 
7 
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070269] [drm] add ip block number 
8 
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070269] [drm] add ip block number 
9 
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070286] amdgpu :03:00.0: 
amdgpu: Fetched VBIOS from VFCT
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.070293] amdgpu: ATOM BIOS: 
113-D1640600-103
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072517] [drm] UVD(0) is enabled in 
VM mode
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072519] [drm] UVD(1) is enabled in 
VM mode
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072520] [drm] UVD(0) ENC is 
enabled in VM mode
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072520] [drm] UVD(1) ENC is 
enabled in VM mode
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072521] [drm] VCE enabled in VM 
mode
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072632] amdgpu :03:00.0: 
amdgpu: MEM ECC is active.
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072633] amdgpu :03:00.0: 
amdgpu: SRAM ECC is not presented.
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072651] amdgpu :03:00.0: 
amdgpu: RAS INFO: ras initialized successfully, hardware ability[105] 
ras_mask[105]
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072657] [drm] vm size is 262144 
GB, 4 levels, block size is 9-bit, fragment size is 9-bit
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072668] amdgpu :03:00.0: 
amdgpu: VRAM: 16368M 0x0080 - 0x0083FEFF (16368M used)
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072669] amdgpu :03:00.0: 
amdgpu: GART: 512M 0x - 0x1FFF
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072676] amdgpu :03:00.0: 
amdgpu: AGP: 267894784M 0x0084 - 0x
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072683] [drm] Detected VRAM 
RAM=16368M, BAR=256M
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072684] [drm] RAM width 4096bits 
HBM
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072736] [drm] amdgpu: 16368M of 
VRAM memory ready
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072738] [drm] amdgpu: 16368M of 
GTT memory ready.
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072745] [drm] GART: num cpu pages 
131072, num gpu pages 131072
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072819] [drm] PCIE GART of 512M 
enabled.
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.072820] [drm] PTB located at 
0x00800030
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.075598] amdgpu :03:00.0: 
amdgpu: PSP runtime database doesn't exist
Oct 12 00:18:39 vega20-ebd-11 kernel: [   27.075605] amdgpu: hwmgr_sw_init smu 
backed is 

Re: PROBLEM: Laptop kills USB-C hubs

2021-10-21 Thread Alex Deucher
On Thu, Oct 21, 2021 at 6:55 PM Gabriel  wrote:
>
> [1.] One line summary of the problem:
>
> Laptop kills USB-C hubs.
>
>
> [2.] Full description of the problem/report:
>
> I own a ThinkPad T14s (AMD) and used a USB-C Hub [a] without any
> problems for half a year. Then, I bought a new ThinkPad E15 Gen3 (AMD)
> and used it with my USB-C hub, which worked fine at the beginning. After
> a few days, the USB-C hub stopped working. I thought nothing of it and
> ordered a new one [b].
> The new one worked fine for 1-2 days, until the HDMI ports stopped
> working while being connected to the new laptop. USB/Ethernet ports
> continued to work.
> I ordered the USB-C hub [b] again. Same story, except it died
> completely. All ports stopped working.
>
> I noticed a kernel warning [6.], which is the same for all three USB-C
> hub deaths. This only happened after the system woke up from susped mode.
>
> I dont't know if this is a hardware or kernel issue. Unfortunately I
> can't debug this any further, because I don't have any excess USB-C hubs
> to brick.

There was a recent regression in the driver related to USB-C DP
alt-mode not working after resume from suspend in some cases.  You can
find more details and the fix on this bug:
https://gitlab.freedesktop.org/drm/amd/-/issues/1735
I haven't heard of any damage to hardware however.

Alex


RE: [PATCH] drm/amdgpu/smu11.0: add missing IP version check

2021-10-21 Thread Chen, Guchun
[Public]

Reviewed-by: Guchun Chen 

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Friday, October 22, 2021 11:19 AM
To: Deucher, Alexander 
Cc: amd-gfx list 
Subject: Re: [PATCH] drm/amdgpu/smu11.0: add missing IP version check

Ping?

On Tue, Oct 19, 2021 at 11:31 AM Alex Deucher  wrote:
>
> Add missing check in smu_v11_0_init_display_count(),
>
> Fixes: af3b89d3a639d5 ("drm/amdgpu/smu11.0: convert to IP version checking")
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
> index 5c1703cc25fd..28b7c0562b99 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
> @@ -755,6 +755,7 @@ int smu_v11_0_init_display_count(struct smu_context *smu, 
> uint32_t count)
>  */
> if (adev->ip_versions[MP1_HWIP][0] == IP_VERSION(11, 0, 11) ||
> adev->ip_versions[MP1_HWIP][0] == IP_VERSION(11, 5, 0) ||
> +   adev->ip_versions[MP1_HWIP][0] == IP_VERSION(11, 0, 12) ||
> adev->ip_versions[MP1_HWIP][0] == IP_VERSION(11, 0, 13))
> return 0;
>
> --
> 2.31.1
>


[PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow

2021-10-21 Thread Jingwen Chen
[Why]
In advance tdr mode, the real bad job will be resubmitted twice, while
in drm_sched_resubmit_jobs_ext, there's a dma_fence_put, so the bad job
is put one more time than other jobs.

[How]
Adding dma_fence_get before resbumit job in
amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs

Signed-off-by: Jingwen Chen 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 41ce86244144..975f069f6fe8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
 
/* clear job's guilty and depend the folowing step to decide 
the real one */
drm_sched_reset_karma(s_job);
+   /* for the real bad job, it will be resubmitted twice, adding a 
dma_fence_get
+* to make sure fence is balanced */
+   dma_fence_get(s_job->s_fence->parent);
drm_sched_resubmit_jobs_ext(&ring->sched, 1);
 
ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, 
ring->sched.timeout);
@@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
 
/* got the hw fence, signal finished fence */
atomic_dec(ring->sched.score);
+   dma_fence_put(s_job->s_fence->parent);
dma_fence_get(&s_job->s_fence->finished);
dma_fence_signal(&s_job->s_fence->finished);
dma_fence_put(&s_job->s_fence->finished);
-- 
2.30.2



Re: [PATCH] drm/amdgpu/smu11.0: add missing IP version check

2021-10-21 Thread Alex Deucher
Ping?

On Tue, Oct 19, 2021 at 11:31 AM Alex Deucher  wrote:
>
> Add missing check in smu_v11_0_init_display_count(),
>
> Fixes: af3b89d3a639d5 ("drm/amdgpu/smu11.0: convert to IP version checking")
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
> index 5c1703cc25fd..28b7c0562b99 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
> @@ -755,6 +755,7 @@ int smu_v11_0_init_display_count(struct smu_context *smu, 
> uint32_t count)
>  */
> if (adev->ip_versions[MP1_HWIP][0] == IP_VERSION(11, 0, 11) ||
> adev->ip_versions[MP1_HWIP][0] == IP_VERSION(11, 5, 0) ||
> +   adev->ip_versions[MP1_HWIP][0] == IP_VERSION(11, 0, 12) ||
> adev->ip_versions[MP1_HWIP][0] == IP_VERSION(11, 0, 13))
> return 0;
>
> --
> 2.31.1
>


Re: [RFC PATCH 0/4] drm/dp: Use DP2.0 DPCD 248h updated register/field names for DP PHY CTS

2021-10-21 Thread Almahallawy, Khaled
On Thu, 2021-10-21 at 13:00 +0300, Jani Nikula wrote:
> On Wed, 20 Oct 2021, Khaled Almahallawy wrote:
> > This series updates DPCD 248h register name and PHY test patterns
> > names to follow DP 2.0 Specs.
> > Also updates the DP PHY CTS codes of the affected drivers (i915,
> > amd, msm)
> > No functional changes expected.
> >  
> > Reference: “DPCD 248h/10Bh/10Ch/10Dh/10Eh Name/Description
> > Consistency”
> > https://groups.vesa.org/wg/AllMem/documentComment/2738
> 
> You can't do renames like this piece by piece. Every commit must
> build.

Noted, I apologize for messing that up.
 
I will send v2 to make sure all renames of DP_PHY_TEST_PATTERN
done in one patch with the other changes you requested.

Thank you for your review
Khaled

> 
> Incidentally, this is one of the reasons we often don't bother with
> renames to follow spec changes, but rather stick to the original
> names.
> 
> However, in this case you could switch all drivers to the different
> test
> pattern macros piece by piece, as they're already there.
> 
> 
> BR,
> Jani.
> 
> 
> > Khaled Almahallawy (4):
> >   drm/dp: Rename DPCD 248h according to DP 2.0 specs
> >   drm/i915/dp: Use DP 2.0 LINK_QUAL_PATTERN_* Phy test pattern
> > definitions
> >   drm/amd/dc: Use DPCD 248h DP 2.0 new name
> >   drm/msm/dp: Use DPCD 248h DP 2.0 new names/definitions
> > 
> >  drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c |  2 +-
> >  drivers/gpu/drm/drm_dp_helper.c  |  6 +++---
> >  drivers/gpu/drm/i915/display/intel_dp.c  | 12 ++--
> >  drivers/gpu/drm/msm/dp/dp_catalog.c  | 12 ++--
> >  drivers/gpu/drm/msm/dp/dp_ctrl.c | 12 ++--
> >  drivers/gpu/drm/msm/dp/dp_link.c | 16 --
> > --
> >  include/drm/drm_dp_helper.h  | 13 +++-
> > -
> >  7 files changed, 33 insertions(+), 40 deletions(-)


RE: [PATCH 1/2] drm/amdgpu: Workaround harvesting info for some navy flounder boards

2021-10-21 Thread Chen, Guchun
[Public]

This series are: Reviewed-and-tested-by: Guchun Chen  , on 
top of "drm/amdgpu/vcn3.0: handle harvesting in firmware setup".

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Chen, Guchun
Sent: Friday, October 22, 2021 8:21 AM
To: Deucher, Alexander ; 
amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: RE: [PATCH 1/2] drm/amdgpu: Workaround harvesting info for some navy 
flounder boards

[Public]

I will try it.

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Friday, October 22, 2021 5:52 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: [PATCH 1/2] drm/amdgpu: Workaround harvesting info for some navy 
flounder boards

Some navy flounder boards do not properly mark harvested VCN instances.  Fix 
that here.

v2: use IP versions

Fixes: 1b592d00b4ac83 ("drm/amdgpu/vcn: remove manual instance setting")
Bug: 
https://gitlab.freedesktop.org/drm/amd/-/issues/1743
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index dfb92f229748..814e9620fac5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -507,6 +507,10 @@ void amdgpu_discovery_harvest_ip(struct amdgpu_device 
*adev)
break;
}
}
+   /* some IP discovery tables on Navy Flounder don't have this set 
correctly */
+   if ((adev->ip_versions[UVD_HWIP][1] == IP_VERSION(3, 0, 1)) &&
+   (adev->ip_versions[GC_HWIP][0] == IP_VERSION(10, 3, 2)))
+   adev->vcn.harvest_config |= AMDGPU_VCN_HARVEST_VCN1;
if (vcn_harvest_count == adev->vcn.num_vcn_inst) {
adev->harvest_ip_mask |= AMD_HARVEST_IP_VCN_MASK;
adev->harvest_ip_mask |= AMD_HARVEST_IP_JPEG_MASK;
--
2.31.1


RE: [PATCH 1/2] drm/amdgpu: Workaround harvesting info for some navy flounder boards

2021-10-21 Thread Chen, Guchun
[Public]

I will try it.

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Friday, October 22, 2021 5:52 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: [PATCH 1/2] drm/amdgpu: Workaround harvesting info for some navy 
flounder boards

Some navy flounder boards do not properly mark harvested VCN instances.  Fix 
that here.

v2: use IP versions

Fixes: 1b592d00b4ac83 ("drm/amdgpu/vcn: remove manual instance setting")
Bug: 
https://gitlab.freedesktop.org/drm/amd/-/issues/1743
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index dfb92f229748..814e9620fac5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -507,6 +507,10 @@ void amdgpu_discovery_harvest_ip(struct amdgpu_device 
*adev)
break;
}
}
+   /* some IP discovery tables on Navy Flounder don't have this set 
correctly */
+   if ((adev->ip_versions[UVD_HWIP][1] == IP_VERSION(3, 0, 1)) &&
+   (adev->ip_versions[GC_HWIP][0] == IP_VERSION(10, 3, 2)))
+   adev->vcn.harvest_config |= AMDGPU_VCN_HARVEST_VCN1;
if (vcn_harvest_count == adev->vcn.num_vcn_inst) {
adev->harvest_ip_mask |= AMD_HARVEST_IP_VCN_MASK;
adev->harvest_ip_mask |= AMD_HARVEST_IP_JPEG_MASK;
--
2.31.1


RE: FW: [PATCH 1/3] drm/amdgpu: fix a potential memory leak in amdgpu_device_fini_sw()

2021-10-21 Thread Yu, Lang
[AMD Official Use Only]



>-Original Message-
>From: Grodzovsky, Andrey 
>Sent: Thursday, October 21, 2021 11:18 PM
>To: Yu, Lang ; amd-gfx@lists.freedesktop.org
>Subject: Re: FW: [PATCH 1/3] drm/amdgpu: fix a potential memory leak in
>amdgpu_device_fini_sw()
>
>On 2021-10-21 3:19 a.m., Yu, Lang wrote:
>
>> [AMD Official Use Only]
>>
>>
>>
>>> -Original Message-
>>> From: Yu, Lang 
>>> Sent: Thursday, October 21, 2021 3:18 PM
>>> To: Grodzovsky, Andrey 
>>> Cc: Deucher, Alexander ; Koenig, Christian
>>> ; Huang, Ray ; Yu, Lang
>>> 
>>> Subject: [PATCH 1/3] drm/amdgpu: fix a potential memory leak in
>>> amdgpu_device_fini_sw()
>>>
>>> amdgpu_fence_driver_sw_fini() should be executed before
>>> amdgpu_device_ip_fini(), otherwise fence driver resource won't be
>>> properly freed as adev->rings have been tore down.
>
>
>Cam you clarify more where exactly the memleak happens ?
>
>Andrey

See amdgpu_fence_driver_sw_fini(), ring->fence_drv.fences will only be freed
when adev->rings[i] is not NULL.

void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
{
unsigned int i, j;

for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
struct amdgpu_ring *ring = adev->rings[i];

if (!ring || !ring->fence_drv.initialized)
continue;

if (!ring->no_scheduler)
drm_sched_fini(>sched);

for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
dma_fence_put(ring->fence_drv.fences[j]);
kfree(ring->fence_drv.fences);
ring->fence_drv.fences = NULL;
ring->fence_drv.initialized = false;
}
}

If amdgpu_device_ip_fini() is executed before amdgpu_fence_driver_sw_fini(), 
amdgpu_device_ip_fini() will call gfx_vX_0_sw_fini() 
then call amdgpu_ring_fini() and set adev->rings[i] to NULL.
Nothing will be freed in amdgpu_fence_driver_sw_fini().
ring->fence_drv.fences memory leak happened!

void amdgpu_ring_fini(struct amdgpu_ring *ring)
{
..
ring->adev->rings[ring->idx] = NULL;
}

Regards,
Lang

>
>
>>>
>>> Fixes: 72c8c97b1522 ("drm/amdgpu: Split amdgpu_device_fini into early
>>> and late")
>>>
>>> Signed-off-by: Lang Yu 
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 41ce86244144..5654c4790773 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3843,8 +3843,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device
>>> *adev)
>>>
>>> void amdgpu_device_fini_sw(struct amdgpu_device *adev)  {
>>> -   amdgpu_device_ip_fini(adev);
>>> amdgpu_fence_driver_sw_fini(adev);
>>> +   amdgpu_device_ip_fini(adev);
>>> release_firmware(adev->firmware.gpu_info_fw);
>>> adev->firmware.gpu_info_fw = NULL;
>>> adev->accel_working = false;
>>> --
>>> 2.25.1


PROBLEM: Laptop kills USB-C hubs

2021-10-21 Thread Gabriel

[1.] One line summary of the problem:

Laptop kills USB-C hubs.


[2.] Full description of the problem/report:

I own a ThinkPad T14s (AMD) and used a USB-C Hub [a] without any 
problems for half a year. Then, I bought a new ThinkPad E15 Gen3 (AMD) 
and used it with my USB-C hub, which worked fine at the beginning. After 
a few days, the USB-C hub stopped working. I thought nothing of it and 
ordered a new one [b].
The new one worked fine for 1-2 days, until the HDMI ports stopped 
working while being connected to the new laptop. USB/Ethernet ports 
continued to work.
I ordered the USB-C hub [b] again. Same story, except it died 
completely. All ports stopped working.


I noticed a kernel warning [6.], which is the same for all three USB-C 
hub deaths. This only happened after the system woke up from suspend mode.


I don't know if this is a hardware or kernel issue. Unfortunately I 
can't debug this any further, because I don't have any excess USB-C hubs 
to brick.


[a] https://uniaccessories [dot] com/products/usb-c_8in1_hub
[b] https://www [dot] amazon [dot] 
de/gp/product/B08LDGYM2W/ref=ppx_yo_dt_b_asin_title_o01_s00?ie=UTF8=1



[3.] Keywords (i.e., modules, networking, kernel):

amd display


[4.] Kernel information
[4.1.] Kernel version (from /proc/version):

Linux version 5.14.10-1-MANJARO (builduser@fv-az72-723) (gcc (GCC) 
11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Thu Oct 7 06:43:34 
UTC 2021



[4.2.] Kernel .config file:
[5.] Most recent kernel version which did not have the bug:
[6.] Output of Oops.. message (if applicable) with symbolic information
 resolved (see Documentation/admin-guide/bug-hunting.rst)

[Do Okt 21 22:29:12 2021] WARNING: CPU: 13 PID: 3789 at 
drivers/gpu/drm/amd/amdgpu/../display/dc/dcn21/dcn21_link_encoder.c:216 
dcn21_link_encoder_acquire_phy+0x11d/0x160 [amdgpu]
[Do Okt 21 22:29:12 2021] Modules linked in: ccm rfcomm snd_usb_audio 
snd_usbmidi_lib snd_rawmidi snd_seq_device usbhid r8153_ecm cdc_ether 
usbnet r8152 mii cmac algif_hash algif_skcipher af_alg bnep btusb btrtl 
btbcm btintel uvcvideo bluetooth videobuf2_vmalloc videobuf2_memops 
videobuf2_v4l2 videobuf2_common videodev ecdh_generic ecc mc amdgpu 
snd_ctl_led snd_hda_codec_realtek joydev snd_hda_codec_generic 
snd_hda_codec_hdmi mousedev gpu_sched i2c_algo_bit drm_ttm_helper 
snd_hda_intel ttm snd_intel_dspcfg snd_intel_sdw_acpi drm_kms_helper 
snd_hda_codec snd_hda_core cec snd_hwdep agpgart snd_pcm syscopyarea 
intel_rapl_msr sysfillrect sysimgblt snd_rn_pci_acp3x intel_rapl_common 
fb_sys_fops snd_timer snd_pci_acp3x qrtr edac_mce_amd ns kvm_amd 
rtw88_8822ce rtw88_8822c ccp rtw88_pci rtw88_core kvm squashfs irqbypass 
mac80211 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel 
ucsi_acpi crypto_simd r8169 wmi_bmof thinkpad_acpi cryptd typec_ucsi 
realtek vfat mdio_devres platform_profile
[Do Okt 21 22:29:12 2021]  sp5100_tco cfg80211 fat typec rapl psmouse 
libphy ledtrig_audio pcspkr tpm_crb libarc4 k10temp i2c_piix4 roles 
rfkill wmi snd video soundcore tpm_tis tpm_tis_core tpm i2c_scmi 
acpi_cpufreq rng_core pinctrl_amd mac_hid loop drm uinput sg fuse 
crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 
serio_raw atkbd libps2 i8042 crc32c_intel xhci_pci serio
[Do Okt 21 22:29:12 2021] CPU: 13 PID: 3789 Comm: kworker/u32:56 Not 
tainted 5.14.10-1-MANJARO #1
[Do Okt 21 22:29:12 2021] Hardware name: LENOVO 20YHS00A00/20YHS00A00, 
BIOS R1OET27W (1.06 ) 05/17/2021

[Do Okt 21 22:29:12 2021] Workqueue: events_unbound async_run_entry_fn
[Do Okt 21 22:29:12 2021] RIP: 
0010:dcn21_link_encoder_acquire_phy+0x11d/0x160 [amdgpu]
[Do Okt 21 22:29:12 2021] Code: 00 00 00 0f b6 89 8b 00 00 00 e8 ae 69 
05 00 b8 01 00 00 00 48 8b 54 24 08 65 48 2b 14 25 28 00 00 00 75 43 48 
83 c4 10 5b c3 <0f> 0b 31 c0 eb e4 0f 0b 48 8b 53 60 48 8b 43 68 41 b9 
01 00 00 00

[Do Okt 21 22:29:12 2021] RSP: 0018:b2e849e2f5c0 EFLAGS: 00010246
[Do Okt 21 22:29:12 2021] RAX: 0016 RBX: 982fe0a96800 
RCX: 0011
[Do Okt 21 22:29:12 2021] RDX:  RSI: 608e 
RDI: 982fdfa4
[Do Okt 21 22:29:12 2021] RBP: 982fe0a96800 R08: b2e849e2f5c4 
R09: 0002
[Do Okt 21 22:29:12 2021] R10: 982fe282 R11: 982fe0a97d00 
R12: b2e849e2f690
[Do Okt 21 22:29:12 2021] R13: 0008 R14: 982fc90d0de0 
R15: 982fe0a96800
[Do Okt 21 22:29:12 2021] FS:  () 
GS:98329f14() knlGS:

[Do Okt 21 22:29:12 2021] CS:  0010 DS:  ES:  CR0: 80050033
[Do Okt 21 22:29:12 2021] CR2:  CR3: 000251c1 
CR4: 00350ee0

[Do Okt 21 22:29:12 2021] Call Trace:
[Do Okt 21 22:29:12 2021] 
dcn21_link_encoder_enable_dp_mst_output+0x18/0x40 [amdgpu]

[Do Okt 21 22:29:12 2021]  dp_enable_link_phy+0x1ce/0x2d0 [amdgpu]
[Do Okt 21 22:29:12 2021] 
perform_link_training_with_retries+0x108/0x240 [amdgpu]

[Do Okt 21 22:29:12 2021]  

[PATCH 1/2] drm/amdgpu: Workaround harvesting info for some navy flounder boards

2021-10-21 Thread Alex Deucher
Some navy flounder boards do not properly mark harvested
VCN instances.  Fix that here.

v2: use IP versions

Fixes: 1b592d00b4ac83 ("drm/amdgpu/vcn: remove manual instance setting")
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1743
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index dfb92f229748..814e9620fac5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -507,6 +507,10 @@ void amdgpu_discovery_harvest_ip(struct amdgpu_device 
*adev)
break;
}
}
+   /* some IP discovery tables on Navy Flounder don't have this set 
correctly */
+   if ((adev->ip_versions[UVD_HWIP][1] == IP_VERSION(3, 0, 1)) &&
+   (adev->ip_versions[GC_HWIP][0] == IP_VERSION(10, 3, 2)))
+   adev->vcn.harvest_config |= AMDGPU_VCN_HARVEST_VCN1;
if (vcn_harvest_count == adev->vcn.num_vcn_inst) {
adev->harvest_ip_mask |= AMD_HARVEST_IP_VCN_MASK;
adev->harvest_ip_mask |= AMD_HARVEST_IP_JPEG_MASK;
-- 
2.31.1



[PATCH 2/2] drm/amdgpu/swsmu: handle VCN harvesting for VCN SMU setup

2021-10-21 Thread Alex Deucher
Check if VCN instances are harvested when controlling
VCN power gating and setting up VCN clocks.

Fixes: 1b592d00b4ac83 ("drm/amdgpu/vcn: remove manual instance setting")
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1743
Signed-off-by: Alex Deucher 
---
 .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c   | 95 +--
 1 file changed, 24 insertions(+), 71 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
index 15e66e1912de..a4108025fe29 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -670,7 +670,7 @@ static int sienna_cichlid_set_default_dpm_table(struct 
smu_context *smu)
struct smu_11_0_dpm_context *dpm_context = smu->smu_dpm.dpm_context;
struct smu_11_0_dpm_table *dpm_table;
struct amdgpu_device *adev = smu->adev;
-   int ret = 0;
+   int i, ret = 0;
DpmDescriptor_t *table_member;
 
/* socclk dpm table setup */
@@ -746,78 +746,45 @@ static int sienna_cichlid_set_default_dpm_table(struct 
smu_context *smu)
dpm_table->max = dpm_table->dpm_levels[0].value;
}
 
-   /* vclk0 dpm table setup */
-   dpm_table = _context->dpm_tables.vclk_table;
-   if (smu_cmn_feature_is_enabled(smu, SMU_FEATURE_MM_DPM_PG_BIT)) {
-   ret = smu_v11_0_set_single_dpm_table(smu,
-SMU_VCLK,
-dpm_table);
-   if (ret)
-   return ret;
-   dpm_table->is_fine_grained =
-   !table_member[PPCLK_VCLK_0].SnapToDiscrete;
-   } else {
-   dpm_table->count = 1;
-   dpm_table->dpm_levels[0].value = 
smu->smu_table.boot_values.vclk / 100;
-   dpm_table->dpm_levels[0].enabled = true;
-   dpm_table->min = dpm_table->dpm_levels[0].value;
-   dpm_table->max = dpm_table->dpm_levels[0].value;
-   }
+   /* vclk0/1 dpm table setup */
+   for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
+   if (adev->vcn.harvest_config & (1 << i))
+   continue;
 
-   /* vclk1 dpm table setup */
-   if (adev->vcn.num_vcn_inst > 1) {
-   dpm_table = _context->dpm_tables.vclk1_table;
+   dpm_table = _context->dpm_tables.vclk_table;
if (smu_cmn_feature_is_enabled(smu, SMU_FEATURE_MM_DPM_PG_BIT)) 
{
ret = smu_v11_0_set_single_dpm_table(smu,
-SMU_VCLK1,
+i ? SMU_VCLK1 : 
SMU_VCLK,
 dpm_table);
if (ret)
return ret;
dpm_table->is_fine_grained =
-   !table_member[PPCLK_VCLK_1].SnapToDiscrete;
+   !table_member[i ? PPCLK_VCLK_1 : 
PPCLK_VCLK_0].SnapToDiscrete;
} else {
dpm_table->count = 1;
-   dpm_table->dpm_levels[0].value =
-   smu->smu_table.boot_values.vclk / 100;
+   dpm_table->dpm_levels[0].value = 
smu->smu_table.boot_values.vclk / 100;
dpm_table->dpm_levels[0].enabled = true;
dpm_table->min = dpm_table->dpm_levels[0].value;
dpm_table->max = dpm_table->dpm_levels[0].value;
}
}
 
-   /* dclk0 dpm table setup */
-   dpm_table = _context->dpm_tables.dclk_table;
-   if (smu_cmn_feature_is_enabled(smu, SMU_FEATURE_MM_DPM_PG_BIT)) {
-   ret = smu_v11_0_set_single_dpm_table(smu,
-SMU_DCLK,
-dpm_table);
-   if (ret)
-   return ret;
-   dpm_table->is_fine_grained =
-   !table_member[PPCLK_DCLK_0].SnapToDiscrete;
-   } else {
-   dpm_table->count = 1;
-   dpm_table->dpm_levels[0].value = 
smu->smu_table.boot_values.dclk / 100;
-   dpm_table->dpm_levels[0].enabled = true;
-   dpm_table->min = dpm_table->dpm_levels[0].value;
-   dpm_table->max = dpm_table->dpm_levels[0].value;
-   }
-
-   /* dclk1 dpm table setup */
-   if (adev->vcn.num_vcn_inst > 1) {
-   dpm_table = _context->dpm_tables.dclk1_table;
+   /* dclk0/1 dpm table setup */
+   for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
+   if (adev->vcn.harvest_config & (1 << i))
+   continue;
+   dpm_table = _context->dpm_tables.dclk_table;
if 

[pull] amdgpu drm-fixes-5.15

2021-10-21 Thread Alex Deucher
Hi Dave, Daniel,

Fixes for 5.15.

The following changes since commit 519d81956ee277b4419c723adfb154603c2565ba:

  Linux 5.15-rc6 (2021-10-17 20:00:13 -1000)

are available in the Git repository at:

  https://gitlab.freedesktop.org/agd5f/linux.git 
tags/amd-drm-fixes-5.15-2021-10-21

for you to fetch changes up to 53c2ff8bcb06acd07e24a62e7f5a0247bd7c6f67:

  drm/amdgpu: support B0 external revision id for yellow carp (2021-10-20 
15:27:31 -0400)


amd-drm-fixes-5.15-2021-10-21:

amdgpu:
- Fix a potential out of bounds write in debugfs
- Fix revision handling for Yellow Carp
- Display fixes for Yellow Carp


Aaron Liu (1):
  drm/amdgpu: support B0 external revision id for yellow carp

Eric Yang (1):
  drm/amd/display: increase Z9 latency to workaround underflow in Z9

Jake Wang (1):
  drm/amd/display: Moved dccg init to after bios golden init

Nicholas Kazlauskas (2):
  drm/amd/display: Fix prefetch bandwidth calculation for DCN3.1
  drm/amd/display: Require immediate flip support for DCN3.1 planes

Nikola Cornij (2):
  drm/amd/display: Limit display scaling to up to true 4k for DCN 3.1
  drm/amd/display: Increase watermark latencies for DCN3.1

Thelford Williams (1):
  drm/amdgpu: fix out of bounds write

 drivers/gpu/drm/amd/amdgpu/nv.c  |  2 +-
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_debugfs.c|  2 +-
 .../gpu/drm/amd/display/dc/clk_mgr/dcn31/dcn31_clk_mgr.c | 16 
 drivers/gpu/drm/amd/display/dc/dcn31/dcn31_hwseq.c   |  7 +++
 drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c| 13 ++---
 .../drm/amd/display/dc/dml/dcn31/display_mode_vba_31.c   |  6 +++---
 drivers/gpu/drm/amd/display/include/dal_asic_id.h|  2 +-
 7 files changed, 27 insertions(+), 21 deletions(-)


Re: [PATCH] amd/display: remove ChromeOS workaround

2021-10-21 Thread Harry Wentland



On 2021-10-21 13:55, Rodrigo Siqueira Jordao wrote:
> Hi Simon,
> 
> I tested this patch and it lgtm. I also agree to revert it.
> 
> Btw, did you send the revert patch for "amd/display: only require overlay 
> plane to cover whole CRTC on ChromeOS"? I think we need to revert it as well.
> 

Agreed that this patch is good but we'll need to also revert the is_chromeos 
w/a.

This patch is
Reviewed-by: Harry Wentland 

Harry

> Sean, Mark
> 
> For ChromeOS, we should ignore this patch. Do we need to take any action to 
> avoid landing this patch on ChromeOS tree?
> 
> Thanks
> Siqueira
> 
> On 2021-10-14 11:35 a.m., Simon Ser wrote:
>> This reverts commits ddab8bd788f5 ("drm/amd/display: Fix two cursor 
>> duplication
>> when using overlay") and e7d9560aeae5 ("Revert "drm/amd/display: Fix overlay
>> validation by considering cursors"").
>>
>> tl;dr ChromeOS uses the atomic interface for everything except the cursor. 
>> This
>> is incorrect and forces amdgpu to disable some hardware features. Let's 
>> revert
>> the ChromeOS-specific workaround in mainline and allow the Chrome team to 
>> keep
>> it internally in their own tree.
>>
>> See [1] for more details. This patch is an alternative to [2], which added
>> ChromeOS detection.
>>
>> [1]: 
>> https://lore.kernel.org/amd-gfx/JIQ_93_cHcshiIDsrMU1huBzx9P9LVQxucx8hQArpQu7Wk5DrCl_vTXj_Q20m_L-8C8A5dSpNcSJ8ehfcCrsQpfB5QG_Spn14EYkH9chtg0=@emersion.fr/>>
>>  [2]: 
>> https://lore.kernel.org/amd-gfx/20211011151609.452132-1-cont...@emersion.fr/>>
>>  Signed-off-by: Simon Ser 
>> Cc: Alex Deucher 
>> Cc: Harry Wentland 
>> Cc: Nicholas Kazlauskas 
>> Cc: Bas Nieuwenhuizen 
>> Cc: Rodrigo Siqueira 
>> Cc: Sean Paul 
>> Fixes: ddab8bd788f5 ("drm/amd/display: Fix two cursor duplication when using 
>> overlay")
>> Fixes: e7d9560aeae5 ("Revert "drm/amd/display: Fix overlay validation by 
>> considering cursors"")
>> ---
>>   .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 51 ---
>>   1 file changed, 51 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
>> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> index 20065a145851..014c5a9fe461 100644
>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> @@ -10628,53 +10628,6 @@ static int add_affected_mst_dsc_crtcs(struct 
>> drm_atomic_state *state, struct drm
>>   }
>>   #endif
>>   -static int validate_overlay(struct drm_atomic_state *state)
>> -{
>> -    int i;
>> -    struct drm_plane *plane;
>> -    struct drm_plane_state *new_plane_state;
>> -    struct drm_plane_state *primary_state, *overlay_state = NULL;
>> -
>> -    /* Check if primary plane is contained inside overlay */
>> -    for_each_new_plane_in_state_reverse(state, plane, new_plane_state, i) {
>> -    if (plane->type == DRM_PLANE_TYPE_OVERLAY) {
>> -    if (drm_atomic_plane_disabling(plane->state, new_plane_state))
>> -    return 0;
>> -
>> -    overlay_state = new_plane_state;
>> -    continue;
>> -    }
>> -    }
>> -
>> -    /* check if we're making changes to the overlay plane */
>> -    if (!overlay_state)
>> -    return 0;
>> -
>> -    /* check if overlay plane is enabled */
>> -    if (!overlay_state->crtc)
>> -    return 0;
>> -
>> -    /* find the primary plane for the CRTC that the overlay is enabled on */
>> -    primary_state = drm_atomic_get_plane_state(state, 
>> overlay_state->crtc->primary);
>> -    if (IS_ERR(primary_state))
>> -    return PTR_ERR(primary_state);
>> -
>> -    /* check if primary plane is enabled */
>> -    if (!primary_state->crtc)
>> -    return 0;
>> -
>> -    /* Perform the bounds check to ensure the overlay plane covers the 
>> primary */
>> -    if (primary_state->crtc_x < overlay_state->crtc_x ||
>> -    primary_state->crtc_y < overlay_state->crtc_y ||
>> -    primary_state->crtc_x + primary_state->crtc_w > 
>> overlay_state->crtc_x + overlay_state->crtc_w ||
>> -    primary_state->crtc_y + primary_state->crtc_h > 
>> overlay_state->crtc_y + overlay_state->crtc_h) {
>> -    DRM_DEBUG_ATOMIC("Overlay plane is enabled with hardware cursor but 
>> does not fully cover primary plane\n");
>> -    return -EINVAL;
>> -    }
>> -
>> -    return 0;
>> -}
>> -
>>   /**
>>    * amdgpu_dm_atomic_check() - Atomic check implementation for AMDgpu DM.
>>    * @dev: The DRM device
>> @@ -10856,10 +10809,6 @@ static int amdgpu_dm_atomic_check(struct drm_device 
>> *dev,
>>   goto fail;
>>   }
>>   -    ret = validate_overlay(state);
>> -    if (ret)
>> -    goto fail;
>> -
>>   /* Add new/modified planes */
>>   for_each_oldnew_plane_in_state_reverse(state, plane, old_plane_state, 
>> new_plane_state, i) {
>>   ret = dm_update_plane_state(dc, state, plane,
>>
> 



Re: [PATCH v5] amd/display: only require overlay plane to cover whole CRTC on ChromeOS

2021-10-21 Thread Rodrigo Siqueira Jordao



On 2021-10-11 11:16 a.m., Simon Ser wrote:

Commit ddab8bd788f5 ("drm/amd/display: Fix two cursor duplication when
using overlay") changed the atomic validation code to forbid the
overlay plane from being used if it doesn't cover the whole CRTC. The
motivation is that ChromeOS uses the atomic API for everything except
the cursor plane (which uses the legacy API). Thus amdgpu must always
be prepared to enable/disable/move the cursor plane at any time without
failing (or else ChromeOS will trip over).

As discussed in [1], there's no reason why the ChromeOS limitation
should prevent other fully atomic users from taking advantage of the
overlay plane. Let's limit the check to ChromeOS.

v4: fix ChromeOS detection (Harry)

v5: fix conflict with linux-next

[1]: https://lore.kernel.org/amd-gfx/JIQ_93_cHcshiIDsrMU1huBzx9P9LVQxucx8hQArpQu7Wk5DrCl_vTXj_Q20m_L-8C8A5dSpNcSJ8ehfcCrsQpfB5QG_Spn14EYkH9chtg0=@emersion.fr/>> 
Signed-off-by: Simon Ser 

Cc: Alex Deucher 
Cc: Harry Wentland 
Cc: Nicholas Kazlauskas 
Cc: Bas Nieuwenhuizen 
Cc: Rodrigo Siqueira 
Cc: Sean Paul 
Fixes: ddab8bd788f5 ("drm/amd/display: Fix two cursor duplication when using 
overlay")
---
  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 29 +++
  1 file changed, 29 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index f35561b5a465..2eeda1fec506 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -10594,6 +10594,31 @@ static int add_affected_mst_dsc_crtcs(struct 
drm_atomic_state *state, struct drm
  }
  #endif
  
+static bool is_chromeos(void)

+{
+   struct mm_struct *mm = current->mm;
+   struct file *exe_file;
+   bool ret;
+
+   /* ChromeOS renames its thread to DrmThread. Also check the executable
+* name. */
+   if (strcmp(current->comm, "DrmThread") != 0 || !mm)
+   return false;
+
+   rcu_read_lock();
+   exe_file = rcu_dereference(mm->exe_file);
+   if (exe_file && !get_file_rcu(exe_file))
+   exe_file = NULL;
+   rcu_read_unlock();
+
+   if (!exe_file)
+   return false;
+   ret = strcmp(exe_file->f_path.dentry->d_name.name, "chrome") == 0;
+   fput(exe_file);
+
+   return ret;
+}
+
  static int validate_overlay(struct drm_atomic_state *state)
  {
int i;
@@ -10601,6 +10626,10 @@ static int validate_overlay(struct drm_atomic_state 
*state)
struct drm_plane_state *new_plane_state;
struct drm_plane_state *primary_state, *overlay_state = NULL;
  
+	/* This is a workaround for ChromeOS only */

+   if (!is_chromeos())
+   return 0;
+
/* Check if primary plane is contained inside overlay */
for_each_new_plane_in_state_reverse(state, plane, new_plane_state, i) {
if (plane->type == DRM_PLANE_TYPE_OVERLAY) {



Hi Mark,

I tested this patch on ChromeOS, and this can be helpful in two ways:

1. When ChromeOS GUI is running, the workaround for fixing the two 
cursor issues works as expected.
2. When we turn off ChromeOS GUI, this patch also works by making some 
of the overlay tests pass.


I think we should cherry-pick this patch to the ChromeOS tree. Is it ok 
for you?


Thanks
Siqueira



Re: [PATCH] amd/display: remove ChromeOS workaround

2021-10-21 Thread Rodrigo Siqueira Jordao

Hi Simon,

I tested this patch and it lgtm. I also agree to revert it.

Btw, did you send the revert patch for "amd/display: only require 
overlay plane to cover whole CRTC on ChromeOS"? I think we need to 
revert it as well.


Sean, Mark

For ChromeOS, we should ignore this patch. Do we need to take any action 
to avoid landing this patch on ChromeOS tree?


Thanks
Siqueira

On 2021-10-14 11:35 a.m., Simon Ser wrote:

This reverts commits ddab8bd788f5 ("drm/amd/display: Fix two cursor duplication
when using overlay") and e7d9560aeae5 ("Revert "drm/amd/display: Fix overlay
validation by considering cursors"").

tl;dr ChromeOS uses the atomic interface for everything except the cursor. This
is incorrect and forces amdgpu to disable some hardware features. Let's revert
the ChromeOS-specific workaround in mainline and allow the Chrome team to keep
it internally in their own tree.

See [1] for more details. This patch is an alternative to [2], which added
ChromeOS detection.

[1]: https://lore.kernel.org/amd-gfx/JIQ_93_cHcshiIDsrMU1huBzx9P9LVQxucx8hQArpQu7Wk5DrCl_vTXj_Q20m_L-8C8A5dSpNcSJ8ehfcCrsQpfB5QG_Spn14EYkH9chtg0=@emersion.fr/>> [2]: https://lore.kernel.org/amd-gfx/20211011151609.452132-1-cont...@emersion.fr/>> 
Signed-off-by: Simon Ser 

Cc: Alex Deucher 
Cc: Harry Wentland 
Cc: Nicholas Kazlauskas 
Cc: Bas Nieuwenhuizen 
Cc: Rodrigo Siqueira 
Cc: Sean Paul 
Fixes: ddab8bd788f5 ("drm/amd/display: Fix two cursor duplication when using 
overlay")
Fixes: e7d9560aeae5 ("Revert "drm/amd/display: Fix overlay validation by considering 
cursors"")
---
  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 51 ---
  1 file changed, 51 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 20065a145851..014c5a9fe461 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -10628,53 +10628,6 @@ static int add_affected_mst_dsc_crtcs(struct 
drm_atomic_state *state, struct drm
  }
  #endif
  
-static int validate_overlay(struct drm_atomic_state *state)

-{
-   int i;
-   struct drm_plane *plane;
-   struct drm_plane_state *new_plane_state;
-   struct drm_plane_state *primary_state, *overlay_state = NULL;
-
-   /* Check if primary plane is contained inside overlay */
-   for_each_new_plane_in_state_reverse(state, plane, new_plane_state, i) {
-   if (plane->type == DRM_PLANE_TYPE_OVERLAY) {
-   if (drm_atomic_plane_disabling(plane->state, 
new_plane_state))
-   return 0;
-
-   overlay_state = new_plane_state;
-   continue;
-   }
-   }
-
-   /* check if we're making changes to the overlay plane */
-   if (!overlay_state)
-   return 0;
-
-   /* check if overlay plane is enabled */
-   if (!overlay_state->crtc)
-   return 0;
-
-   /* find the primary plane for the CRTC that the overlay is enabled on */
-   primary_state = drm_atomic_get_plane_state(state, 
overlay_state->crtc->primary);
-   if (IS_ERR(primary_state))
-   return PTR_ERR(primary_state);
-
-   /* check if primary plane is enabled */
-   if (!primary_state->crtc)
-   return 0;
-
-   /* Perform the bounds check to ensure the overlay plane covers the 
primary */
-   if (primary_state->crtc_x < overlay_state->crtc_x ||
-   primary_state->crtc_y < overlay_state->crtc_y ||
-   primary_state->crtc_x + primary_state->crtc_w > overlay_state->crtc_x 
+ overlay_state->crtc_w ||
-   primary_state->crtc_y + primary_state->crtc_h > overlay_state->crtc_y 
+ overlay_state->crtc_h) {
-   DRM_DEBUG_ATOMIC("Overlay plane is enabled with hardware cursor but 
does not fully cover primary plane\n");
-   return -EINVAL;
-   }
-
-   return 0;
-}
-
  /**
   * amdgpu_dm_atomic_check() - Atomic check implementation for AMDgpu DM.
   * @dev: The DRM device
@@ -10856,10 +10809,6 @@ static int amdgpu_dm_atomic_check(struct drm_device 
*dev,
goto fail;
}
  
-	ret = validate_overlay(state);

-   if (ret)
-   goto fail;
-
/* Add new/modified planes */
for_each_oldnew_plane_in_state_reverse(state, plane, old_plane_state, 
new_plane_state, i) {
ret = dm_update_plane_state(dc, state, plane,





Re: [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold

2021-10-21 Thread Luben Tuikov
On 2021-10-21 13:26, Kent Russell wrote:
> dmesg doesn't warn when the number of bad pages approaches the
> threshold for page retirement. WARN when the number of bad pages
> is at 90% or greater for easier checks and planning, instead of waiting
> until the GPU is full of bad pages.
>
> Cc: Luben Tuikov 
> Cc: Mukul Joshi 
> Signed-off-by: Kent Russell 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index f4c05ff4b26c..8309eea09df3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -1077,6 +1077,13 @@ int amdgpu_ras_eeprom_init(struct 
> amdgpu_ras_eeprom_control *control,
>   if (res)
>   DRM_ERROR("RAS table incorrect checksum or error:%d\n",
> res);
> +
> + /* Warn if we are at 90% of the threshold or above
> +  */
> + if (10 * control->ras_num_recs >= ras->bad_page_cnt_threshold * 
> 9)

Change this to " >= 9 * ras->bad_page_cnt_threshold ". With that fixed, this 
patch is:

Reviewed-by: Luben Tuikov 

Regards,
Luben

> + dev_warn(adev->dev, "RAS records:%u exceeds 90%% of 
> threshold:%d",
> + control->ras_num_recs,
> + ras->bad_page_cnt_threshold);
>   } else if (hdr->header == RAS_TABLE_HDR_BAD &&
>  amdgpu_bad_page_threshold != 0) {
>   res = __verify_ras_table_checksum(control);



[PATCH 2/3] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold

2021-10-21 Thread Kent Russell
When a GPU hits the bad_page_threshold, it will not be initialized by
the amdgpu driver. This means that the table cannot be cleared, nor can
information gathering be performed (getting serial number, BDF, etc).

If the bad_page_threshold kernel parameter is set to -2,
continue to initialize the GPU, while printing a warning to dmesg that
this action has been done

Cc: Luben Tuikov 
Cc: Mukul Joshi 
Signed-off-by: Kent Russell 
Acked-by: Felix Kuehling 
Reviewed-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 12 
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index d58e37fd01f4..b85b67a88a3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
 extern int amdgpu_ras_enable;
 extern uint amdgpu_ras_mask;
 extern int amdgpu_bad_page_threshold;
+extern bool amdgpu_ignore_bad_page_threshold;
 extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
 extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 96bd63aeeddd..eee3cf874e7a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -877,7 +877,7 @@ module_param_named(reset_method, amdgpu_reset_method, int, 
0444);
  * result in the GPU entering bad status when the number of total
  * faulty pages by ECC exceeds the threshold value.
  */
-MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default 
value), 0 = disable bad page retirement)");
+MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default 
value), 0 = disable bad page retirement, -2 = ignore bad page threshold)");
 module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
 
 MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup 
(8 if set to greater than 8 or less than 0, only affect gfx 8+)");
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 8309eea09df3..0428a1d3d22a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1105,11 +1105,15 @@ int amdgpu_ras_eeprom_init(struct 
amdgpu_ras_eeprom_control *control,
res = amdgpu_ras_eeprom_correct_header_tag(control,
   
RAS_TABLE_HDR_VAL);
} else {
-   *exceed_err_limit = true;
-   dev_err(adev->dev,
-   "RAS records:%d exceed threshold:%d, "
-   "GPU will not be initialized. Replace this GPU 
or increase the threshold",
+   dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
control->ras_num_recs, 
ras->bad_page_cnt_threshold);
+   if (amdgpu_bad_page_threshold == -2) {
+   dev_warn(adev->dev, "GPU will be initialized 
due to bad_page_threshold = -2.");
+   res = 0;
+   } else {
+   *exceed_err_limit = true;
+   dev_err(adev->dev, "GPU will not be 
initialized. Replace this GPU or increase the threshold.");
+   }
}
} else {
DRM_INFO("Creating a new EEPROM table");
-- 
2.25.1



[PATCH 3/3] drm/amdgpu: Make EEPROM messages dev_ instead of DRM_

2021-10-21 Thread Kent Russell
Since the EEPROM is specific to the device for each of these messages,
use the dev_* macro instead of DRM_* to make it easier to identify the
GPU that correlates to the EEPROM messages.

Signed-off-by: Kent Russell 
---
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 40 +--
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 0428a1d3d22a..3792a69b876f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -201,9 +201,9 @@ static int __write_table_header(struct 
amdgpu_ras_eeprom_control *control)
up_read(>reset_sem);
 
if (res < 0) {
-   DRM_ERROR("Failed to write EEPROM table header:%d", res);
+   dev_err(adev->dev, "Failed to write EEPROM table header:%d", 
res);
} else if (res < RAS_TABLE_HEADER_SIZE) {
-   DRM_ERROR("Short write:%d out of %d\n",
+   dev_err(adev->dev, "Short write:%d out of %d\n",
  res, RAS_TABLE_HEADER_SIZE);
res = -EIO;
} else {
@@ -395,12 +395,12 @@ static int __amdgpu_ras_eeprom_write(struct 
amdgpu_ras_eeprom_control *control,
  buf, buf_size);
up_read(>reset_sem);
if (res < 0) {
-   DRM_ERROR("Writing %d EEPROM table records error:%d",
+   dev_err(adev->dev, "Writing %d EEPROM table records error:%d",
  num, res);
} else if (res < buf_size) {
/* Short write, return error.
 */
-   DRM_ERROR("Wrote %d records out of %d",
+   dev_err(adev->dev, "Wrote %d records out of %d",
  res / RAS_TABLE_RECORD_SIZE, num);
res = -EIO;
} else {
@@ -541,7 +541,7 @@ amdgpu_ras_eeprom_update_header(struct 
amdgpu_ras_eeprom_control *control)
buf_size = control->ras_num_recs * RAS_TABLE_RECORD_SIZE;
buf = kcalloc(control->ras_num_recs, RAS_TABLE_RECORD_SIZE, GFP_KERNEL);
if (!buf) {
-   DRM_ERROR("allocating memory for table of size %d bytes 
failed\n",
+   dev_err(adev->dev, "allocating memory for table of size %d 
bytes failed\n",
  control->tbl_hdr.tbl_size);
res = -ENOMEM;
goto Out;
@@ -554,11 +554,11 @@ amdgpu_ras_eeprom_update_header(struct 
amdgpu_ras_eeprom_control *control)
 buf, buf_size);
up_read(>reset_sem);
if (res < 0) {
-   DRM_ERROR("EEPROM failed reading records:%d\n",
+   dev_err(adev->dev, "EEPROM failed reading records:%d\n",
  res);
goto Out;
} else if (res < buf_size) {
-   DRM_ERROR("EEPROM read %d out of %d bytes\n",
+   dev_err(adev->dev, "EEPROM read %d out of %d bytes\n",
  res, buf_size);
res = -EIO;
goto Out;
@@ -604,10 +604,10 @@ int amdgpu_ras_eeprom_append(struct 
amdgpu_ras_eeprom_control *control,
return 0;
 
if (num == 0) {
-   DRM_ERROR("will not append 0 records\n");
+   dev_err(adev->dev, "will not append 0 records\n");
return -EINVAL;
} else if (num > control->ras_max_record_count) {
-   DRM_ERROR("cannot append %d records than the size of table 
%d\n",
+   dev_err(adev->dev, "cannot append %d records than the size of 
table %d\n",
  num, control->ras_max_record_count);
return -EINVAL;
}
@@ -650,12 +650,12 @@ static int __amdgpu_ras_eeprom_read(struct 
amdgpu_ras_eeprom_control *control,
 buf, buf_size);
up_read(>reset_sem);
if (res < 0) {
-   DRM_ERROR("Reading %d EEPROM table records error:%d",
+   dev_err(adev->dev, "Reading %d EEPROM table records error:%d",
  num, res);
} else if (res < buf_size) {
/* Short read, return error.
 */
-   DRM_ERROR("Read %d records out of %d",
+   dev_err(adev->dev, "Read %d records out of %d",
  res / RAS_TABLE_RECORD_SIZE, num);
res = -EIO;
} else {
@@ -689,10 +689,10 @@ int amdgpu_ras_eeprom_read(struct 
amdgpu_ras_eeprom_control *control,
return 0;
 
if (num == 0) {
-   DRM_ERROR("will not read 0 records\n");
+   dev_err(adev->dev, "will not read 0 records\n");
return -EINVAL;
} else if (num > control->ras_num_recs) {
-   DRM_ERROR("too many records to read:%d available:%d\n",
+   dev_err(adev->dev, "too many records to read:%d available:%d\n",
  

[PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold

2021-10-21 Thread Kent Russell
dmesg doesn't warn when the number of bad pages approaches the
threshold for page retirement. WARN when the number of bad pages
is at 90% or greater for easier checks and planning, instead of waiting
until the GPU is full of bad pages.

Cc: Luben Tuikov 
Cc: Mukul Joshi 
Signed-off-by: Kent Russell 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index f4c05ff4b26c..8309eea09df3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1077,6 +1077,13 @@ int amdgpu_ras_eeprom_init(struct 
amdgpu_ras_eeprom_control *control,
if (res)
DRM_ERROR("RAS table incorrect checksum or error:%d\n",
  res);
+
+   /* Warn if we are at 90% of the threshold or above
+*/
+   if (10 * control->ras_num_recs >= ras->bad_page_cnt_threshold * 
9)
+   dev_warn(adev->dev, "RAS records:%u exceeds 90%% of 
threshold:%d",
+   control->ras_num_recs,
+   ras->bad_page_cnt_threshold);
} else if (hdr->header == RAS_TABLE_HDR_BAD &&
   amdgpu_bad_page_threshold != 0) {
res = __verify_ras_table_checksum(control);
-- 
2.25.1



Re: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold

2021-10-21 Thread Felix Kuehling
Am 2021-10-21 um 12:35 p.m. schrieb Luben Tuikov:
> On 2021-10-21 11:57, Kent Russell wrote:
>> dmesg doesn't warn when the number of bad pages approaches the
>> threshold for page retirement. WARN when the number of bad pages
>> is at 90% or greater for easier checks and planning, instead of waiting
>> until the GPU is full of bad pages.
>>
>> Cc: Luben Tuikov 
>> Cc: Mukul Joshi 
>> Signed-off-by: Kent Russell 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> index f4c05ff4b26c..ce5089216474 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> @@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct 
>> amdgpu_ras_eeprom_control *control,
>>  if (res)
>>  DRM_ERROR("RAS table incorrect checksum or error:%d\n",
>>res);
>> +
>> +/* Warn if we are at 90% of the threshold or above */
> The kernel uses a couple of styles, this is one of them:
>
> /* Warn ...
>  */

That's a waste of space. That means there can be no single-line comments?

checkpatch.pl complains about multi-line comments that don't have the
final */ on their own line. But a single line comment is OK. Let's stick
with the coding style recommended by checkpatch.pl. If we make up our
own arbitrary rules for different reviewers and different parts of the
code, I think it becomes a mine-field of pointless cosmetic fixes for
anyone trespassing on unfamiliar code.


> if (...)
>
> Please use this style as it is used extensively in the amdgpu_ras_eeprom.c 
> file.
>
>> +if ((10 * control->ras_num_recs) >= 
>> (ras->bad_page_cnt_threshold * 9))
> You don't need the extra parenthesis around multiplication--it has higher 
> precedence than relational operators--drop the extra parenthesis.

I agree. With that fixed, the patch is

Reviewed-by: Felix Kuehling 


>
> Regards,
> Luben
>
>> +DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
>> +control->ras_num_recs,
>> +ras->bad_page_cnt_threshold);
>>  } else if (hdr->header == RAS_TABLE_HDR_BAD &&
>> amdgpu_bad_page_threshold != 0) {
>>  res = __verify_ras_table_checksum(control);


Re: [PATCH 2/2] drm/amdkfd: Remove cu mask from struct queue_properties

2021-10-21 Thread Felix Kuehling
Am 2021-10-15 um 4:48 a.m. schrieb Lang Yu:
> +enum queue_update_flag {
> + UPDATE_FLAG_PROPERTITY = 0,
> + UPDATE_FLAG_CU_MASK,
> +};
> +
> +struct queue_update_info {
> + union {
> + struct queue_properties properties;
> + struct {
> + uint32_t count; /* Must be a multiple of 32 */
> + uint32_t *ptr;
> + } cu_mask;
> + };
> +
> + enum queue_update_flag update_flag;
> +};
> +

This doesn't make sense to me. As I understand it, queue_update_info is
for information that is not stored in queue_properties but only in the
MQDs. Therefore, it should not include the queue_properties.

All the low level functions in the MQD managers get both the
queue_properties and the queue_update_info. So trying to wrap both in
the same union doesn't make sense there either.

I think you only need this because you tried to generalize
pqm_update_queue to handle both updates to queue_properties and CU mask
updates with a single argument. IMO this does not make the interface any
clearer. I think it would be more straight-forward to keep a separate
pqm_set_cu_mask function that takes a queue_update_info parameter. If
you're looking for more generic names, I suggest the following:

  * Rename pqm_update_queue to pqm_update_queue_properties
  * Rename struct queue_update_info to struct mqd_update_info
  * Rename pqm_set_cu_mask to pqm_update_mqd. For now this is only used
for CU mask (the union has only one struct member for now). It may
be used for other MQD properties that don't need to be stored in
queue_properties in the future

Regards,
  Felix




Re: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold

2021-10-21 Thread Luben Tuikov

  
On 2021-10-21 12:49, Russell, Kent
  wrote:


  [AMD Official Use Only]




  
-Original Message-
From: Tuikov, Luben 
Sent: Thursday, October 21, 2021 12:47 PM
To: Russell, Kent ; amd-gfx@lists.freedesktop.org
Cc: Joshi, Mukul ; Kuehling, Felix 
Subject: Re: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page
threshold

On 2021-10-21 12:42, Russell, Kent wrote:


  [AMD Official Use Only]




  
-Original Message-
From: Tuikov, Luben 
Sent: Thursday, October 21, 2021 12:21 PM
To: Russell, Kent ; amd-gfx@lists.freedesktop.org
Cc: Joshi, Mukul ; Kuehling, Felix ;
Tuikov, Luben 
Subject: Re: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad

  

page


  
threshold

On 2021-10-21 11:57, Kent Russell wrote:


  When a GPU hits the bad_page_threshold, it will not be initialized by
the amdgpu driver. This means that the table cannot be cleared, nor can
information gathering be performed (getting serial number, BDF, etc).

If the bad_page_threshold kernel parameter is set to -2,
continue to initialize the GPU, while printing a warning to dmesg that
this action has been done.

Cc: Luben Tuikov 
Cc: Mukul Joshi 
Signed-off-by: Kent Russell 
Acked-by: Felix Kuehling 
Reviewed-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 12 
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h


b/drivers/gpu/drm/amd/amdgpu/amdgpu.h


  index d58e37fd01f4..b85b67a88a3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
 extern int amdgpu_ras_enable;
 extern uint amdgpu_ras_mask;
 extern int amdgpu_bad_page_threshold;
+extern bool amdgpu_ignore_bad_page_threshold;
 extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
 extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c


b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c


  index 96bd63aeeddd..eee3cf874e7a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -877,7 +877,7 @@ module_param_named(reset_method,


  

amdgpu_reset_method,


  
int, 0444);


* result in the GPU entering bad status when the number of total
  * faulty pages by ECC exceeds the threshold value.
  */
-MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default


value), 0 = disable bad page retirement)");


  +MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default


value), 0 = disable bad page retirement, -2 = ignore bad page threshold)");


   module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int,


  

0444);


  

  
 MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to


  

setup


  
(8 if set to greater than 8 or less than 0, only affect gfx 8+)");


  diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c


b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c


  index ce5089216474..bd6ed43b0df2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1104,11 +1104,15 @@ int amdgpu_ras_eeprom_init(struct


amdgpu_ras_eeprom_control *control,


   			res = amdgpu_ras_eeprom_correct_header_tag(control,
    RAS_TABLE_HDR_VAL);
 		} else {
-			*exceed_err_limit = true;
-			dev_err(adev->dev,
-"RAS records:%d exceed threshold:%d, "
-"GPU will not be initialized. Replace this GPU or increase the


threshold",


  +			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
 control->ras_num_recs, ras->bad_page_cnt_threshold);


I thought this would all go in a single set of patches. I wasn't aware a singleton patch

  

went


  
in already which changed just this line--this change was always a part of a patch set.


  
  Ah sorry. When you reviewed the original patch2 clarifying the message, I merged it and


then re-submitted the remaining 3 (which pared down to 2) for 

Re: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold

2021-10-21 Thread Luben Tuikov
On 2021-10-21 12:47, Russell, Kent wrote:
> [AMD Official Use Only]
>
>
>
>> -Original Message-
>> From: Tuikov, Luben 
>> Sent: Thursday, October 21, 2021 12:45 PM
>> To: Russell, Kent ; amd-gfx@lists.freedesktop.org
>> Cc: Joshi, Mukul 
>> Subject: Re: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% 
>> threshold
>>
>> On 2021-10-21 12:35, Luben Tuikov wrote:
>>> On 2021-10-21 11:57, Kent Russell wrote:
 dmesg doesn't warn when the number of bad pages approaches the
 threshold for page retirement. WARN when the number of bad pages
 is at 90% or greater for easier checks and planning, instead of waiting
 until the GPU is full of bad pages.

 Cc: Luben Tuikov 
 Cc: Mukul Joshi 
 Signed-off-by: Kent Russell 
 ---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++
  1 file changed, 6 insertions(+)

 diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
 index f4c05ff4b26c..ce5089216474 100644
 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
 +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
 @@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct
>> amdgpu_ras_eeprom_control *control,
if (res)
DRM_ERROR("RAS table incorrect checksum or error:%d\n",
  res);
 +
 +  /* Warn if we are at 90% of the threshold or above */
>>> The kernel uses a couple of styles, this is one of them:
>>>
>>> /* Warn ...
>>>  */
>>> if (...)
>>>
>>> Please use this style as it is used extensively in the amdgpu_ras_eeprom.c 
>>> file.
>>>
 +  if ((10 * control->ras_num_recs) >= 
 (ras->bad_page_cnt_threshold * 9))
>>> You don't need the extra parenthesis around multiplication--it has higher 
>>> precedence
>> than relational operators--drop the extra parenthesis.
>>> Regards,
>>> Luben
>>>
 +  DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
 +  control->ras_num_recs,
 +  ras->bad_page_cnt_threshold);
>> One more note: The code uses "dev_err()" for this very similar message:
>>
>>         dev_err(adev->dev,
>>             "RAS records:%d exceed threshold:%d, "
>>             "GPU will not be initialized. Replace this GPU or increase 
>> the threshold",
>>             control->ras_num_recs, ras->bad_page_cnt_threshold);
>>
>> Since your message is essentially the same, sans the "90% of threshold", 
>> perhaps you want
>> to use dev_warn(), instead of "DRM_WARN()".
> Agreed. Lijo had a similar comment. I may follow up with another patch to 
> change all of these table-specific DRM_* messages to dev_*

You can still do that, but for this patch, make the changes I requested and 
change this to dev_warn().

Regards,
Luben

>
>  Kent
>
>> Regards,
>> Luben
>>
} else if (hdr->header == RAS_TABLE_HDR_BAD &&
   amdgpu_bad_page_threshold != 0) {
res = __verify_ras_table_checksum(control);



RE: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold

2021-10-21 Thread Russell, Kent
[AMD Official Use Only]



> -Original Message-
> From: Tuikov, Luben 
> Sent: Thursday, October 21, 2021 12:47 PM
> To: Russell, Kent ; amd-gfx@lists.freedesktop.org
> Cc: Joshi, Mukul ; Kuehling, Felix 
> 
> Subject: Re: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for 
> ignoring bad page
> threshold
> 
> On 2021-10-21 12:42, Russell, Kent wrote:
> > [AMD Official Use Only]
> >
> >
> >
> >> -Original Message-
> >> From: Tuikov, Luben 
> >> Sent: Thursday, October 21, 2021 12:21 PM
> >> To: Russell, Kent ; amd-gfx@lists.freedesktop.org
> >> Cc: Joshi, Mukul ; Kuehling, Felix 
> >> ;
> >> Tuikov, Luben 
> >> Subject: Re: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for 
> >> ignoring bad
> page
> >> threshold
> >>
> >> On 2021-10-21 11:57, Kent Russell wrote:
> >>> When a GPU hits the bad_page_threshold, it will not be initialized by
> >>> the amdgpu driver. This means that the table cannot be cleared, nor can
> >>> information gathering be performed (getting serial number, BDF, etc).
> >>>
> >>> If the bad_page_threshold kernel parameter is set to -2,
> >>> continue to initialize the GPU, while printing a warning to dmesg that
> >>> this action has been done
> >>>
> >>> Cc: Luben Tuikov 
> >>> Cc: Mukul Joshi 
> >>> Signed-off-by: Kent Russell 
> >>> Acked-by: Felix Kuehling 
> >>> Reviewed-by: Luben Tuikov 
> >>> ---
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  1 +
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 12 
> >>>  3 files changed, 10 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>> index d58e37fd01f4..b85b67a88a3d 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>> @@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
> >>>  extern int amdgpu_ras_enable;
> >>>  extern uint amdgpu_ras_mask;
> >>>  extern int amdgpu_bad_page_threshold;
> >>> +extern bool amdgpu_ignore_bad_page_threshold;
> >>>  extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
> >>>  extern int amdgpu_async_gfx_ring;
> >>>  extern int amdgpu_mcbp;
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>> index 96bd63aeeddd..eee3cf874e7a 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>> @@ -877,7 +877,7 @@ module_param_named(reset_method,
> amdgpu_reset_method,
> >> int, 0444);
> >>>   * result in the GPU entering bad status when the number of total
> >>>   * faulty pages by ECC exceeds the threshold value.
> >>>   */
> >>> -MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = 
> >>> auto(default
> >> value), 0 = disable bad page retirement)");
> >>> +MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = 
> >>> auto(default
> >> value), 0 = disable bad page retirement, -2 = ignore bad page threshold)");
> >>>  module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int,
> 0444);
> >>>
> >>>  MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to
> setup
> >> (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >>> index ce5089216474..bd6ed43b0df2 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >>> @@ -1104,11 +1104,15 @@ int amdgpu_ras_eeprom_init(struct
> >> amdgpu_ras_eeprom_control *control,
> >>>   res = amdgpu_ras_eeprom_correct_header_tag(control,
> >>>  
> >>> RAS_TABLE_HDR_VAL);
> >>>   } else {
> >>> - *exceed_err_limit = true;
> >>> - dev_err(adev->dev,
> >>> - "RAS records:%d exceed threshold:%d, "
> >>> - "GPU will not be initialized. Replace this GPU 
> >>> or increase the
> >> threshold",
> >>> + dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
> >>>   control->ras_num_recs, 
> >>> ras->bad_page_cnt_threshold);
> >> I thought this would all go in a single set of patches. I wasn't aware a 
> >> singleton patch
> went
> >> in already which changed just this line--this change was always a part of 
> >> a patch set.
> >>
> > Ah sorry. When you reviewed the original patch2 clarifying the message, I 
> > merged it and
> then re-submitted the remaining 3 (which pared down to 2) for review. Sorry 
> for the
> confusion, I was trying to minimize the number of moving parts.
> 
> Admittedly, now you have 3 patches, one singleton and two coming in. It would
> probably have been best to submit only the current two.
> 
> No worries for now--for the future.

Thanks. For 

RE: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold

2021-10-21 Thread Russell, Kent
[AMD Official Use Only]



> -Original Message-
> From: Tuikov, Luben 
> Sent: Thursday, October 21, 2021 12:45 PM
> To: Russell, Kent ; amd-gfx@lists.freedesktop.org
> Cc: Joshi, Mukul 
> Subject: Re: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% 
> threshold
> 
> On 2021-10-21 12:35, Luben Tuikov wrote:
> > On 2021-10-21 11:57, Kent Russell wrote:
> >> dmesg doesn't warn when the number of bad pages approaches the
> >> threshold for page retirement. WARN when the number of bad pages
> >> is at 90% or greater for easier checks and planning, instead of waiting
> >> until the GPU is full of bad pages.
> >>
> >> Cc: Luben Tuikov 
> >> Cc: Mukul Joshi 
> >> Signed-off-by: Kent Russell 
> >> ---
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++
> >>  1 file changed, 6 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >> index f4c05ff4b26c..ce5089216474 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >> @@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct
> amdgpu_ras_eeprom_control *control,
> >>if (res)
> >>DRM_ERROR("RAS table incorrect checksum or error:%d\n",
> >>  res);
> >> +
> >> +  /* Warn if we are at 90% of the threshold or above */
> > The kernel uses a couple of styles, this is one of them:
> >
> > /* Warn ...
> >  */
> > if (...)
> >
> > Please use this style as it is used extensively in the amdgpu_ras_eeprom.c 
> > file.
> >
> >> +  if ((10 * control->ras_num_recs) >= 
> >> (ras->bad_page_cnt_threshold * 9))
> > You don't need the extra parenthesis around multiplication--it has higher 
> > precedence
> than relational operators--drop the extra parenthesis.
> >
> > Regards,
> > Luben
> >
> >> +  DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
> >> +  control->ras_num_recs,
> >> +  ras->bad_page_cnt_threshold);
> 
> One more note: The code uses "dev_err()" for this very similar message:
> 
>         dev_err(adev->dev,
>             "RAS records:%d exceed threshold:%d, "
>             "GPU will not be initialized. Replace this GPU or increase 
> the threshold",
>             control->ras_num_recs, ras->bad_page_cnt_threshold);
> 
> Since your message is essentially the same, sans the "90% of threshold", 
> perhaps you want
> to use dev_warn(), instead of "DRM_WARN()".

Agreed. Lijo had a similar comment. I may follow up with another patch to 
change all of these table-specific DRM_* messages to dev_*

 Kent

> 
> Regards,
> Luben
> 
> >>} else if (hdr->header == RAS_TABLE_HDR_BAD &&
> >>   amdgpu_bad_page_threshold != 0) {
> >>res = __verify_ras_table_checksum(control);


Re: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold

2021-10-21 Thread Luben Tuikov
On 2021-10-21 12:42, Russell, Kent wrote:
> [AMD Official Use Only]
>
>
>
>> -Original Message-
>> From: Tuikov, Luben 
>> Sent: Thursday, October 21, 2021 12:21 PM
>> To: Russell, Kent ; amd-gfx@lists.freedesktop.org
>> Cc: Joshi, Mukul ; Kuehling, Felix 
>> ;
>> Tuikov, Luben 
>> Subject: Re: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for 
>> ignoring bad page
>> threshold
>>
>> On 2021-10-21 11:57, Kent Russell wrote:
>>> When a GPU hits the bad_page_threshold, it will not be initialized by
>>> the amdgpu driver. This means that the table cannot be cleared, nor can
>>> information gathering be performed (getting serial number, BDF, etc).
>>>
>>> If the bad_page_threshold kernel parameter is set to -2,
>>> continue to initialize the GPU, while printing a warning to dmesg that
>>> this action has been done
>>>
>>> Cc: Luben Tuikov 
>>> Cc: Mukul Joshi 
>>> Signed-off-by: Kent Russell 
>>> Acked-by: Felix Kuehling 
>>> Reviewed-by: Luben Tuikov 
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  1 +
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 12 
>>>  3 files changed, 10 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index d58e37fd01f4..b85b67a88a3d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
>>>  extern int amdgpu_ras_enable;
>>>  extern uint amdgpu_ras_mask;
>>>  extern int amdgpu_bad_page_threshold;
>>> +extern bool amdgpu_ignore_bad_page_threshold;
>>>  extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
>>>  extern int amdgpu_async_gfx_ring;
>>>  extern int amdgpu_mcbp;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> index 96bd63aeeddd..eee3cf874e7a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> @@ -877,7 +877,7 @@ module_param_named(reset_method, amdgpu_reset_method,
>> int, 0444);
>>>   * result in the GPU entering bad status when the number of total
>>>   * faulty pages by ECC exceeds the threshold value.
>>>   */
>>> -MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default
>> value), 0 = disable bad page retirement)");
>>> +MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default
>> value), 0 = disable bad page retirement, -2 = ignore bad page threshold)");
>>>  module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 
>>> 0444);
>>>
>>>  MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to 
>>> setup
>> (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> index ce5089216474..bd6ed43b0df2 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> @@ -1104,11 +1104,15 @@ int amdgpu_ras_eeprom_init(struct
>> amdgpu_ras_eeprom_control *control,
>>> res = amdgpu_ras_eeprom_correct_header_tag(control,
>>>
>>> RAS_TABLE_HDR_VAL);
>>> } else {
>>> -   *exceed_err_limit = true;
>>> -   dev_err(adev->dev,
>>> -   "RAS records:%d exceed threshold:%d, "
>>> -   "GPU will not be initialized. Replace this GPU 
>>> or increase the
>> threshold",
>>> +   dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
>>> control->ras_num_recs, 
>>> ras->bad_page_cnt_threshold);
>> I thought this would all go in a single set of patches. I wasn't aware a 
>> singleton patch went
>> in already which changed just this line--this change was always a part of a 
>> patch set.
>>
> Ah sorry. When you reviewed the original patch2 clarifying the message, I 
> merged it and then re-submitted the remaining 3 (which pared down to 2) for 
> review. Sorry for the confusion, I was trying to minimize the number of 
> moving parts.

Admittedly, now you have 3 patches, one singleton and two coming in. It would
probably have been best to submit only the current two.

No worries for now--for the future.

Regards,
Luben

>
>  Kent
>
>> Regards,
>> Luben
>>
>>> +   if (amdgpu_bad_page_threshold == -2) {
>>> +   dev_warn(adev->dev, "GPU will be initialized 
>>> due to
>> bad_page_threshold = -2.");
>>> +   res = 0;
>>> +   } else {
>>> +   *exceed_err_limit = true;
>>> +   dev_err(adev->dev, "GPU will not be 
>>> initialized. Replace this
>> GPU or increase the threshold.");
>>> +   }
>>> 

Re: [PATCH 1/2] drm/amdkfd: Add an optional argument into update queue operation

2021-10-21 Thread Felix Kuehling


Am 2021-10-15 um 4:48 a.m. schrieb Lang Yu:
> Currently, queue is updated with data stored in queue_properties.
> And all allocated resource in queue_properties will not be freed
> until the queue is destroyed.
>
> But some properties(e.g., cu mask) bring some memory management
> headaches(e.g., memory leak) and make code complex. Actually they
> don't have to persist in queue_properties.
>
> So add an argument into update queue to pass such properties and
> remove them from queue_properties.
>
> Signed-off-by: Lang Yu 
> ---
>  .../drm/amd/amdkfd/kfd_device_queue_manager.c |  4 ++--
>  .../drm/amd/amdkfd/kfd_device_queue_manager.h |  2 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.h  |  2 +-
>  .../gpu/drm/amd/amdkfd/kfd_mqd_manager_cik.c  | 18 +++
>  .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v10.c  |  8 +++
>  .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   |  8 +++
>  .../gpu/drm/amd/amdkfd/kfd_mqd_manager_vi.c   | 22 +--
>  .../amd/amdkfd/kfd_process_queue_manager.c|  6 ++---
>  8 files changed, 35 insertions(+), 35 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index f8fce9d05f50..7f6f4937eedb 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -557,7 +557,7 @@ static int destroy_queue_nocpsch(struct 
> device_queue_manager *dqm,
>   return retval;
>  }
>  
> -static int update_queue(struct device_queue_manager *dqm, struct queue *q)
> +static int update_queue(struct device_queue_manager *dqm, struct queue *q, 
> void *args)

Please don't use a void * here. If you don't want to declare the struct
queue_update_info in this patch, you can just declare it as an abstract
type:

struct queue_update_info;

You can cast NULL to (struct queue_update_info *) without requiring the
structure definition.

Regards,
  Felix


>  {
>   int retval = 0;
>   struct mqd_manager *mqd_mgr;
> @@ -605,7 +605,7 @@ static int update_queue(struct device_queue_manager *dqm, 
> struct queue *q)
>   }
>   }
>  
> - mqd_mgr->update_mqd(mqd_mgr, q->mqd, >properties);
> + mqd_mgr->update_mqd(mqd_mgr, q->mqd, >properties, args);
>  
>   /*
>* check active state vs. the previous state and modify
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> index c8719682c4da..08cfc2a2fdbb 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> @@ -93,7 +93,7 @@ struct device_queue_manager_ops {
>   struct queue *q);
>  
>   int (*update_queue)(struct device_queue_manager *dqm,
> - struct queue *q);
> + struct queue *q, void *args);
>  
>   int (*register_process)(struct device_queue_manager *dqm,
>   struct qcm_process_device *qpd);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.h 
> b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.h
> index 6e6918ccedfd..6ddf93629b8c 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.h
> @@ -80,7 +80,7 @@ struct mqd_manager {
>   struct mm_struct *mms);
>  
>   void(*update_mqd)(struct mqd_manager *mm, void *mqd,
> - struct queue_properties *q);
> + struct queue_properties *q, void *args);
>  
>   int (*destroy_mqd)(struct mqd_manager *mm, void *mqd,
>   enum kfd_preempt_type type,
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_cik.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_cik.c
> index 064914e1e8d6..8bb2fd4cba41 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_cik.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_cik.c
> @@ -135,7 +135,7 @@ static void init_mqd(struct mqd_manager *mm, void **mqd,
>   *mqd = m;
>   if (gart_addr)
>   *gart_addr = addr;
> - mm->update_mqd(mm, m, q);
> + mm->update_mqd(mm, m, q, NULL);
>  }
>  
>  static void init_mqd_sdma(struct mqd_manager *mm, void **mqd,
> @@ -152,7 +152,7 @@ static void init_mqd_sdma(struct mqd_manager *mm, void 
> **mqd,
>   if (gart_addr)
>   *gart_addr = mqd_mem_obj->gpu_addr;
>  
> - mm->update_mqd(mm, m, q);
> + mm->update_mqd(mm, m, q, NULL);
>  }
>  
>  static void free_mqd(struct mqd_manager *mm, void *mqd,
> @@ -185,7 +185,7 @@ static int load_mqd_sdma(struct mqd_manager *mm, void 
> *mqd,
>  }
>  
>  static void __update_mqd(struct mqd_manager *mm, void *mqd,
> - struct queue_properties *q, unsigned int atc_bit)
> + struct queue_properties *q, void *args, unsigned int 
> atc_bit)

Re: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold

2021-10-21 Thread Luben Tuikov
On 2021-10-21 12:35, Luben Tuikov wrote:
> On 2021-10-21 11:57, Kent Russell wrote:
>> dmesg doesn't warn when the number of bad pages approaches the
>> threshold for page retirement. WARN when the number of bad pages
>> is at 90% or greater for easier checks and planning, instead of waiting
>> until the GPU is full of bad pages.
>>
>> Cc: Luben Tuikov 
>> Cc: Mukul Joshi 
>> Signed-off-by: Kent Russell 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> index f4c05ff4b26c..ce5089216474 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> @@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct 
>> amdgpu_ras_eeprom_control *control,
>>  if (res)
>>  DRM_ERROR("RAS table incorrect checksum or error:%d\n",
>>res);
>> +
>> +/* Warn if we are at 90% of the threshold or above */
> The kernel uses a couple of styles, this is one of them:
>
> /* Warn ...
>  */
> if (...)
>
> Please use this style as it is used extensively in the amdgpu_ras_eeprom.c 
> file.
>
>> +if ((10 * control->ras_num_recs) >= 
>> (ras->bad_page_cnt_threshold * 9))
> You don't need the extra parenthesis around multiplication--it has higher 
> precedence than relational operators--drop the extra parenthesis.
>
> Regards,
> Luben
>
>> +DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
>> +control->ras_num_recs,
>> +ras->bad_page_cnt_threshold);

One more note: The code uses "dev_err()" for this very similar message:

        dev_err(adev->dev,
            "RAS records:%d exceed threshold:%d, "
            "GPU will not be initialized. Replace this GPU or increase the 
threshold",
            control->ras_num_recs, ras->bad_page_cnt_threshold);

Since your message is essentially the same, sans the "90% of threshold", 
perhaps you want to use dev_warn(), instead of "DRM_WARN()".

Regards,
Luben

>>  } else if (hdr->header == RAS_TABLE_HDR_BAD &&
>> amdgpu_bad_page_threshold != 0) {
>>  res = __verify_ras_table_checksum(control);



RE: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold

2021-10-21 Thread Russell, Kent
[AMD Official Use Only]



> -Original Message-
> From: Tuikov, Luben 
> Sent: Thursday, October 21, 2021 12:21 PM
> To: Russell, Kent ; amd-gfx@lists.freedesktop.org
> Cc: Joshi, Mukul ; Kuehling, Felix 
> ;
> Tuikov, Luben 
> Subject: Re: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for 
> ignoring bad page
> threshold
> 
> On 2021-10-21 11:57, Kent Russell wrote:
> > When a GPU hits the bad_page_threshold, it will not be initialized by
> > the amdgpu driver. This means that the table cannot be cleared, nor can
> > information gathering be performed (getting serial number, BDF, etc).
> >
> > If the bad_page_threshold kernel parameter is set to -2,
> > continue to initialize the GPU, while printing a warning to dmesg that
> > this action has been done
> >
> > Cc: Luben Tuikov 
> > Cc: Mukul Joshi 
> > Signed-off-by: Kent Russell 
> > Acked-by: Felix Kuehling 
> > Reviewed-by: Luben Tuikov 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  1 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 12 
> >  3 files changed, 10 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index d58e37fd01f4..b85b67a88a3d 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
> >  extern int amdgpu_ras_enable;
> >  extern uint amdgpu_ras_mask;
> >  extern int amdgpu_bad_page_threshold;
> > +extern bool amdgpu_ignore_bad_page_threshold;
> >  extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
> >  extern int amdgpu_async_gfx_ring;
> >  extern int amdgpu_mcbp;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index 96bd63aeeddd..eee3cf874e7a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -877,7 +877,7 @@ module_param_named(reset_method, amdgpu_reset_method,
> int, 0444);
> >   * result in the GPU entering bad status when the number of total
> >   * faulty pages by ECC exceeds the threshold value.
> >   */
> > -MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default
> value), 0 = disable bad page retirement)");
> > +MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default
> value), 0 = disable bad page retirement, -2 = ignore bad page threshold)");
> >  module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 
> > 0444);
> >
> >  MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to 
> > setup
> (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index ce5089216474..bd6ed43b0df2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > @@ -1104,11 +1104,15 @@ int amdgpu_ras_eeprom_init(struct
> amdgpu_ras_eeprom_control *control,
> > res = amdgpu_ras_eeprom_correct_header_tag(control,
> >
> > RAS_TABLE_HDR_VAL);
> > } else {
> > -   *exceed_err_limit = true;
> > -   dev_err(adev->dev,
> > -   "RAS records:%d exceed threshold:%d, "
> > -   "GPU will not be initialized. Replace this GPU 
> > or increase the
> threshold",
> > +   dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
> > control->ras_num_recs, 
> > ras->bad_page_cnt_threshold);
> 
> I thought this would all go in a single set of patches. I wasn't aware a 
> singleton patch went
> in already which changed just this line--this change was always a part of a 
> patch set.
> 

Ah sorry. When you reviewed the original patch2 clarifying the message, I 
merged it and then re-submitted the remaining 3 (which pared down to 2) for 
review. Sorry for the confusion, I was trying to minimize the number of moving 
parts.

 Kent

> Regards,
> Luben
> 
> > +   if (amdgpu_bad_page_threshold == -2) {
> > +   dev_warn(adev->dev, "GPU will be initialized 
> > due to
> bad_page_threshold = -2.");
> > +   res = 0;
> > +   } else {
> > +   *exceed_err_limit = true;
> > +   dev_err(adev->dev, "GPU will not be 
> > initialized. Replace this
> GPU or increase the threshold.");
> > +   }
> > }
> > } else {
> > DRM_INFO("Creating a new EEPROM table");


RE: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold

2021-10-21 Thread Russell, Kent
[Public]

I had noticed that all of these RAS messages use DRM instead of dev_warn. I 
wasn't sure if there was a reason for that or not. It's definitely inconsistent.

DRM_ERROR("Partial read for checksum, res:%d\n", res);
DRM_DEBUG_DRIVER("Found existing EEPROM table with %d records",
DRM_ERROR("RAS table incorrect checksum or error:%d\n",
DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
DRM_ERROR("RAS Table incorrect checksum or error:%d\n",
dev_info(adev->dev,  "records:%d threshold:%d, resetting RAS table header 
signature",
dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
dev_warn(adev->dev, "GPU will be initialized due to bad_page_threshold = -2.");
DRM_INFO("Creating a new EEPROM table");

It might be worth making a separate patch to handle those inconsistencies. I agree 
that identifying the device is useful for this kind of error/warning/info message.

Kent

From: Lazar, Lijo 
Sent: Thursday, October 21, 2021 12:31 PM
To: Russell, Kent ; amd-gfx@lists.freedesktop.org
Cc: Russell, Kent ; Tuikov, Luben ; 
Joshi, Mukul 
Subject: Re: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% 
threshold


[Public]

Nit pick - suggest to use dev_warn for easy identification of the device.

Thanks,
Lijo

From: amd-gfx 
mailto:amd-gfx-boun...@lists.freedesktop.org>>
 on behalf of Kent Russell mailto:kent.russ...@amd.com>>
Sent: Thursday, October 21, 2021 9:27:10 PM
To: amd-gfx@lists.freedesktop.org 
mailto:amd-gfx@lists.freedesktop.org>>
Cc: Russell, Kent mailto:kent.russ...@amd.com>>; Tuikov, 
Luben mailto:luben.tui...@amd.com>>; Joshi, Mukul 
mailto:mukul.jo...@amd.com>>
Subject: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold

dmesg doesn't warn when the number of bad pages approaches the
threshold for page retirement. WARN when the number of bad pages
is at 90% or greater for easier checks and planning, instead of waiting
until the GPU is full of bad pages.

Cc: Luben Tuikov mailto:luben.tui...@amd.com>>
Cc: Mukul Joshi mailto:mukul.jo...@amd.com>>
Signed-off-by: Kent Russell mailto:kent.russ...@amd.com>>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index f4c05ff4b26c..ce5089216474 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct 
amdgpu_ras_eeprom_control *control,
 if (res)
 DRM_ERROR("RAS table incorrect checksum or error:%d\n",
   res);
+
+   /* Warn if we are at 90% of the threshold or above */
+   if ((10 * control->ras_num_recs) >= 
(ras->bad_page_cnt_threshold * 9))
+   DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
+   control->ras_num_recs,
+   ras->bad_page_cnt_threshold);
 } else if (hdr->header == RAS_TABLE_HDR_BAD &&
amdgpu_bad_page_threshold != 0) {
 res = __verify_ras_table_checksum(control);
--
2.25.1


Re: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold

2021-10-21 Thread Luben Tuikov
On 2021-10-21 11:57, Kent Russell wrote:
> dmesg doesn't warn when the number of bad pages approaches the
> threshold for page retirement. WARN when the number of bad pages
> is at 90% or greater for easier checks and planning, instead of waiting
> until the GPU is full of bad pages.
>
> Cc: Luben Tuikov 
> Cc: Mukul Joshi 
> Signed-off-by: Kent Russell 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index f4c05ff4b26c..ce5089216474 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct 
> amdgpu_ras_eeprom_control *control,
>   if (res)
>   DRM_ERROR("RAS table incorrect checksum or error:%d\n",
> res);
> +
> + /* Warn if we are at 90% of the threshold or above */

The kernel uses a couple of styles, this is one of them:

/* Warn ...
 */
if (...)

Please use this style as it is used extensively in the amdgpu_ras_eeprom.c file.

> + if ((10 * control->ras_num_recs) >= 
> (ras->bad_page_cnt_threshold * 9))

You don't need the extra parenthesis around multiplication--it has higher 
precedence than relational operators--drop the extra parenthesis.

Regards,
Luben

> + DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
> + control->ras_num_recs,
> + ras->bad_page_cnt_threshold);
>   } else if (hdr->header == RAS_TABLE_HDR_BAD &&
>  amdgpu_bad_page_threshold != 0) {
>   res = __verify_ras_table_checksum(control);



Re: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold

2021-10-21 Thread Lazar, Lijo
[Public]

Nit pick - suggest to use dev_warn for easy identification of the device.

Thanks,
Lijo

From: amd-gfx  on behalf of Kent Russell 

Sent: Thursday, October 21, 2021 9:27:10 PM
To: amd-gfx@lists.freedesktop.org 
Cc: Russell, Kent ; Tuikov, Luben ; 
Joshi, Mukul 
Subject: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold

dmesg doesn't warn when the number of bad pages approaches the
threshold for page retirement. WARN when the number of bad pages
is at 90% or greater for easier checks and planning, instead of waiting
until the GPU is full of bad pages.

Cc: Luben Tuikov 
Cc: Mukul Joshi 
Signed-off-by: Kent Russell 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index f4c05ff4b26c..ce5089216474 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct 
amdgpu_ras_eeprom_control *control,
 if (res)
 DRM_ERROR("RAS table incorrect checksum or error:%d\n",
   res);
+
+   /* Warn if we are at 90% of the threshold or above */
+   if ((10 * control->ras_num_recs) >= 
(ras->bad_page_cnt_threshold * 9))
+   DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
+   control->ras_num_recs,
+   ras->bad_page_cnt_threshold);
 } else if (hdr->header == RAS_TABLE_HDR_BAD &&
amdgpu_bad_page_threshold != 0) {
 res = __verify_ras_table_checksum(control);
--
2.25.1



Re: [PATCH] amd/display: remove ChromeOS workaround

2021-10-21 Thread Simon Ser
On Thursday, October 14th, 2021 at 17:35, Simon Ser  wrote:

> This reverts commits ddab8bd788f5 ("drm/amd/display: Fix two cursor 
> duplication
> when using overlay") and e7d9560aeae5 ("Revert "drm/amd/display: Fix overlay
> validation by considering cursors"").
>
> tl;dr ChromeOS uses the atomic interface for everything except the cursor. 
> This
> is incorrect and forces amdgpu to disable some hardware features. Let's revert
> the ChromeOS-specific workaround in mainline and allow the Chrome team to keep
> it internally in their own tree.
>
> See [1] for more details. This patch is an alternative to [2], which added
> ChromeOS detection.
>
> [1]: 
> https://lore.kernel.org/amd-gfx/JIQ_93_cHcshiIDsrMU1huBzx9P9LVQxucx8hQArpQu7Wk5DrCl_vTXj_Q20m_L-8C8A5dSpNcSJ8ehfcCrsQpfB5QG_Spn14EYkH9chtg0=@emersion.fr/
> [2]: 
> https://lore.kernel.org/amd-gfx/20211011151609.452132-1-cont...@emersion.fr/

Alex, are you okay with moving forward with this patch, or do you prefer the
other approach?


Re: [PATCH] amd/display: remove ChromeOS workaround

2021-10-21 Thread Paul Menzel

Dear Simon,


Am 21.10.21 um 18:08 schrieb Simon Ser:


On Tuesday, October 19th, 2021 at 10:25, Paul Menzel wrote:



Am 19.10.21 um 10:10 schrieb Simon Ser:

On Tuesday, October 19th, 2021 at 01:21, Paul Menzel wrote:


Am 19.10.21 um 01:06 schrieb Simon Ser:

On Tuesday, October 19th, 2021 at 01:03, Paul Menzel wrote:


Excuse my ignorance. Reading the commit message, there was a Linux
kernel change, that broke Chrome OS userspace, right? If so, and we do
not know if there is other userspace using the API incorrectly,
shouldn’t the patch breaking Chrome OS userspace be reverted to adhere
to Linux’ no-regression rule?


No. There was a ChromeOS bug which has been thought to be an amdgpu bug. But
fixing that "bug" breaks other user-space.


Thank you for the explanation. I guess the bug was only surfacing
because Chrome OS devices, like Chromebooks, have only been using AMD
hardware for a short while (maybe since last year).

Reading your message *amdgpu: atomic API and cursor/overlay planes* [1]
again, it says:


Up until now we were using cursor and overlay planes in gamescope [3],
but some changes in the amdgpu driver [1] makes us unable to use planes


So this statement was incorrect? Which changes are that? Or did Chrome
OS ever work correctly with an older Linux kernel or not?


The sequence of events is as follows:

- gamescope can use cursor and overlay planes.
- ChromeOS-specific commit lands, fixing some ChromeOS issues related to video
playback. This breaks gamescope overlays.


I guess, I am confused, which Chrome OS specific commit that is. Is it
one of the reverted commits below? Which one?

1.  ddab8bd788f5 ("drm/amd/display: Fix two cursor duplication
when using overlay")
2.  e7d9560aeae5 ("Revert "drm/amd/display: Fix overlay validation by
considering cursors"")


ddab8bd788f5 ("drm/amd/display: Fix two cursor duplication when using overlay")
is the commit which introduced the validate_overlay logic fixing ChromeOS and
breaking gamescope.


Thank you for elaborating on this. I guess I mixed up Chrome OS and 
gamescope, and was especially confused because the commit message of 
ddab8bd788f5 does not explicitly list the problematic userspace. Despite 
the commit message being well written, this crucial information is missing.



Later, 33f409e60eb0 ("drm/amd/display: Fix overlay validation by considering
cursors") relaxed validate_overlay. This breaks ChromeOS and partially fixes
gamescope (when the overlay is used and the cursor plane is unused).

Finally, e7d9560aeae5 ("Revert "drm/amd/display: Fix overlay validation by
considering cursors"") has reverted that change, fixing ChromeOS (again) and
breaking gamescope completely again.


- Discussion to restrict the ChromeOS-specific logic to ChromeOS, or to revert
it, either of these fix gamescope.

Given this, I don't see how the quoted statement is incorrect? Maybe I'm
missing something?


Your reply from August 2021 to commit ddab8bd788f5 (drm/amd/display: Fix
two cursor duplication when using overlay) from April 2021 [2]:


Hm. This patch causes a regression for me. I was using primary + overlay
not covering the whole primary plane + cursor before. This patch breaks it.

This patch makes the overlay plane very useless for me, because the primary
plane is always under the overlay plane.


So, I would have thought, everything worked fine before some Linux
kernel commit changed behavior, and regressed userspace.


I've tried to explain the full story above. My user-space went from working to
broken to partially broken to broken. The quoted reply is a complaint that the
commit flipped gamescope from partially broken to completely broken. At the
time I didn't realize that ddab8bd788f5 caused some pain too.

Does that clear things up?


Yes, it does. Thank you very much for taking the time for walking me 
through this.



Kind regards,

Paul


Re: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold

2021-10-21 Thread Luben Tuikov
On 2021-10-21 11:57, Kent Russell wrote:
> When a GPU hits the bad_page_threshold, it will not be initialized by
> the amdgpu driver. This means that the table cannot be cleared, nor can
> information gathering be performed (getting serial number, BDF, etc).
>
> If the bad_page_threshold kernel parameter is set to -2,
> continue to initialize the GPU, while printing a warning to dmesg that
> this action has been done
>
> Cc: Luben Tuikov 
> Cc: Mukul Joshi 
> Signed-off-by: Kent Russell 
> Acked-by: Felix Kuehling 
> Reviewed-by: Luben Tuikov 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 12 
>  3 files changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index d58e37fd01f4..b85b67a88a3d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
>  extern int amdgpu_ras_enable;
>  extern uint amdgpu_ras_mask;
>  extern int amdgpu_bad_page_threshold;
> +extern bool amdgpu_ignore_bad_page_threshold;
>  extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
>  extern int amdgpu_async_gfx_ring;
>  extern int amdgpu_mcbp;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 96bd63aeeddd..eee3cf874e7a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -877,7 +877,7 @@ module_param_named(reset_method, amdgpu_reset_method, 
> int, 0444);
>   * result in the GPU entering bad status when the number of total
>   * faulty pages by ECC exceeds the threshold value.
>   */
> -MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default 
> value), 0 = disable bad page retirement)");
> +MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default 
> value), 0 = disable bad page retirement, -2 = ignore bad page threshold)");
>  module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
>  
>  MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup 
> (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index ce5089216474..bd6ed43b0df2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -1104,11 +1104,15 @@ int amdgpu_ras_eeprom_init(struct 
> amdgpu_ras_eeprom_control *control,
>   res = amdgpu_ras_eeprom_correct_header_tag(control,
>  
> RAS_TABLE_HDR_VAL);
>   } else {
> - *exceed_err_limit = true;
> - dev_err(adev->dev,
> - "RAS records:%d exceed threshold:%d, "
> - "GPU will not be initialized. Replace this GPU 
> or increase the threshold",
> + dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
>   control->ras_num_recs, 
> ras->bad_page_cnt_threshold); 

I thought this would all go in a single set of patches. I wasn't aware a 
singleton patch went in already which changed just this line--this change was 
always a part of a patch set.

Regards,
Luben

> + if (amdgpu_bad_page_threshold == -2) {
> + dev_warn(adev->dev, "GPU will be initialized 
> due to bad_page_threshold = -2.");
> + res = 0;
> + } else {
> + *exceed_err_limit = true;
> + dev_err(adev->dev, "GPU will not be 
> initialized. Replace this GPU or increase the threshold.");
> + }
>   }
>   } else {
>   DRM_INFO("Creating a new EEPROM table");



Re: [PATCH] amd/display: remove ChromeOS workaround

2021-10-21 Thread Simon Ser
Hi again,

On Tuesday, October 19th, 2021 at 10:25, Paul Menzel  
wrote:

> Dear Simon,
>
>
> Am 19.10.21 um 10:10 schrieb Simon Ser:
> > On Tuesday, October 19th, 2021 at 01:21, Paul Menzel 
> >  wrote:
> >
> >> Am 19.10.21 um 01:06 schrieb Simon Ser:
> >>> On Tuesday, October 19th, 2021 at 01:03, Paul Menzel wrote:
> >>>
>  Excuse my ignorance. Reading the commit message, there was a Linux
>  kernel change, that broke Chrome OS userspace, right? If so, and we do
>  not know if there is other userspace using the API incorrectly,
>  shouldn’t the patch breaking Chrome OS userspace be reverted to adhere
>  to Linux’ no-regression rule?
> >>>
> >>> No. There was a ChromeOS bug which has been thought to be an amdgpu bug. 
> >>> But
> >>> fixing that "bug" breaks other user-space.
> >>
> >> Thank you for the explanation. I guess the bug was only surfacing
> >> because Chrome OS device, like Chromebooks, are only using AMD hardware
> >> since a short while (maybe last year).
> >>
> >> Reading your message *amdgpu: atomic API and cursor/overlay planes* [1]
> >> again, it says:
> >>
> >>> Up until now we were using cursor and overlay planes in gamescope [3],
> >>> but some changes in the amdgpu driver [1] makes us unable to use planes
> >>
> >> So this statement was incorrect? Which changes are that? Or did Chrome
> >> OS ever work correctly with an older Linux kernel or not?
> >
> > The sequence of events is as follows:
> >
> > - gamescope can use cursor and overlay planes.
> > - ChromeOS-specific commit lands, fixing some ChromeOS issues related to 
> > video
> >playback. This breaks gamescope overlays.
>
> I guess, I am confused, which Chrome OS specific commit that is. Is it
> one of the reverted commits below? Which one?
>
> 1.  ddab8bd788f5 ("drm/amd/display: Fix two cursor duplication
> when using overlay")
> 2.  e7d9560aeae5 ("Revert "drm/amd/display: Fix overlay validation by
> considering cursors"")

ddab8bd788f5 ("drm/amd/display: Fix two cursor duplication when using overlay")
is the commit which introduced the validate_overlay logic fixing ChromeOS and
breaking gamescope.

Later, 33f409e60eb0 ("drm/amd/display: Fix overlay validation by considering
cursors") relaxed validate_overlay. This breaks ChromeOS and partially fixes
gamescope (when the overlay is used and the cursor plane is unused).

Finally, e7d9560aeae5 ("Revert "drm/amd/display: Fix overlay validation by
considering cursors"") has reverted that change, fixing ChromeOS (again) and
breaking gamescope completely again.

> > - Discussion to restrict the ChromeOS-specific logic to ChromeOS, or to 
> > revert
> >it, either of these fix gamescope.
> >
> > Given this, I don't see how the quoted statement is incorrect? Maybe I'm
> > missing something?
>
> Your reply from August 2021 to commit ddab8bd788f5 (drm/amd/display: Fix
> two cursor duplication when using overlay) from April 2021 [2]:
>
> > Hm. This patch causes a regression for me. I was using primary + overlay
> > not covering the whole primary plane + cursor before. This patch breaks it.
> >
> > This patch makes the overlay plane very useless for me, because the primary
> > plane is always under the overlay plane.
>
> So, I would have thought, everything worked fine before some Linux
> kernel commit changed behavior, and regressed userspace.

I've tried to explain the full story above. My user-space went from working to
broken to partially broken to broken. The quoted reply is a complaint that the
commit flipped gamescope from partially broken to completely broken. At the
time I didn't realize that ddab8bd788f5 caused some pain too.

Does that clear things up?


[PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold

2021-10-21 Thread Kent Russell
When a GPU hits the bad_page_threshold, it will not be initialized by
the amdgpu driver. This means that the table cannot be cleared, nor can
information gathering be performed (getting serial number, BDF, etc).

If the bad_page_threshold kernel parameter is set to -2,
continue to initialize the GPU, while printing a warning to dmesg that
this action has been done

Cc: Luben Tuikov 
Cc: Mukul Joshi 
Signed-off-by: Kent Russell 
Acked-by: Felix Kuehling 
Reviewed-by: Luben Tuikov 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 12 
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index d58e37fd01f4..b85b67a88a3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
 extern int amdgpu_ras_enable;
 extern uint amdgpu_ras_mask;
 extern int amdgpu_bad_page_threshold;
+extern bool amdgpu_ignore_bad_page_threshold;
 extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
 extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 96bd63aeeddd..eee3cf874e7a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -877,7 +877,7 @@ module_param_named(reset_method, amdgpu_reset_method, int, 
0444);
  * result in the GPU entering bad status when the number of total
  * faulty pages by ECC exceeds the threshold value.
  */
-MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default 
value), 0 = disable bad page retirement)");
+MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default 
value), 0 = disable bad page retirement, -2 = ignore bad page threshold)");
 module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
 
 MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup 
(8 if set to greater than 8 or less than 0, only affect gfx 8+)");
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index ce5089216474..bd6ed43b0df2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1104,11 +1104,15 @@ int amdgpu_ras_eeprom_init(struct 
amdgpu_ras_eeprom_control *control,
res = amdgpu_ras_eeprom_correct_header_tag(control,
   
RAS_TABLE_HDR_VAL);
} else {
-   *exceed_err_limit = true;
-   dev_err(adev->dev,
-   "RAS records:%d exceed threshold:%d, "
-   "GPU will not be initialized. Replace this GPU 
or increase the threshold",
+   dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
control->ras_num_recs, 
ras->bad_page_cnt_threshold);
+   if (amdgpu_bad_page_threshold == -2) {
+   dev_warn(adev->dev, "GPU will be initialized 
due to bad_page_threshold = -2.");
+   res = 0;
+   } else {
+   *exceed_err_limit = true;
+   dev_err(adev->dev, "GPU will not be 
initialized. Replace this GPU or increase the threshold.");
+   }
}
} else {
DRM_INFO("Creating a new EEPROM table");
-- 
2.25.1



Re: [PATCH v2] drm/amdkfd: Separate pinned BOs destruction from general routine

2021-10-21 Thread Felix Kuehling
Am 2021-10-15 um 2:54 a.m. schrieb Lang Yu:
> Currently, all kfd BOs use same destruction routine. But pinned
> BOs are not unpinned properly. Separate them from general routine.
>
> v2 (Felix):
> Add safeguard to prevent user space from freeing signal BO.
> Kunmap signal BO in the event of setting event page error.
> Just kunmap signal BO to avoid duplicating the code.
>
> Signed-off-by: Lang Yu 

Reviewed-by: Felix Kuehling 


> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|   2 +
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  10 ++
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  31 +++--
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h |   3 +
>  drivers/gpu/drm/amd/amdkfd/kfd_process.c  | 110 +-
>  5 files changed, 119 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 69de31754907..751557af09bb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -279,6 +279,8 @@ int amdgpu_amdkfd_gpuvm_sync_memory(
>   struct kgd_dev *kgd, struct kgd_mem *mem, bool intr);
>  int amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel(struct kgd_dev *kgd,
>   struct kgd_mem *mem, void **kptr, uint64_t *size);
> +void amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel(struct kgd_dev *kgd, 
> struct kgd_mem *mem);
> +
>  int amdgpu_amdkfd_gpuvm_restore_process_bos(void *process_info,
>   struct dma_fence **ef);
>  int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct kgd_dev *kgd,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index cdf46bd0d8d5..4969763c2e47 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1871,6 +1871,16 @@ int amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel(struct 
> kgd_dev *kgd,
>   return ret;
>  }
>  
> +void amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel(struct kgd_dev *kgd, 
> struct kgd_mem *mem)
> +{
> + struct amdgpu_bo *bo = mem->bo;
> +
> + amdgpu_bo_reserve(bo, true);
> + amdgpu_bo_kunmap(bo);
> + amdgpu_bo_unpin(bo);
> + amdgpu_bo_unreserve(bo);
> +}
> +
>  int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct kgd_dev *kgd,
> struct kfd_vm_fault_info *mem)
>  {
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> index f1e7edeb4e6b..9317a2e238d0 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> @@ -1011,11 +1011,6 @@ static int kfd_ioctl_create_event(struct file *filp, 
> struct kfd_process *p,
>   void *mem, *kern_addr;
>   uint64_t size;
>  
> - if (p->signal_page) {
> - pr_err("Event page is already set\n");
> - return -EINVAL;
> - }
> -
>   kfd = kfd_device_by_id(GET_GPU_ID(args->event_page_offset));
>   if (!kfd) {
>   pr_err("Getting device by id failed in %s\n", __func__);
> @@ -1023,6 +1018,13 @@ static int kfd_ioctl_create_event(struct file *filp, 
> struct kfd_process *p,
>   }
>  
>   mutex_lock(>mutex);
> +
> + if (p->signal_page) {
> + pr_err("Event page is already set\n");
> + err = -EINVAL;
> + goto out_unlock;
> + }
> +
>   pdd = kfd_bind_process_to_device(kfd, p);
>   if (IS_ERR(pdd)) {
>   err = PTR_ERR(pdd);
> @@ -1037,20 +1039,24 @@ static int kfd_ioctl_create_event(struct file *filp, 
> struct kfd_process *p,
>   err = -EINVAL;
>   goto out_unlock;
>   }
> - mutex_unlock(>mutex);
>  
>   err = amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel(kfd->kgd,
>   mem, _addr, );
>   if (err) {
>   pr_err("Failed to map event page to kernel\n");
> - return err;
> + goto out_unlock;
>   }
>  
>   err = kfd_event_page_set(p, kern_addr, size);
>   if (err) {
>   pr_err("Failed to set event page\n");
> - return err;
> + amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel(kfd->kgd, 
> mem);
> + goto out_unlock;
>   }
> +
> + p->signal_handle = args->event_page_offset;
> +
> + mutex_unlock(>mutex);
>   }
>  
>   err = kfd_event_create(filp, p, args->event_type,
> @@ -1368,6 +1374,15 @@ static int kfd_ioctl_free_memory_of_gpu(struct file 
> *filep,
>   return -EINVAL;
>  
>   mutex_lock(>mutex);
> + /*
> +  * 

[PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold

2021-10-21 Thread Kent Russell
dmesg doesn't warn when the number of bad pages approaches the
threshold for page retirement. WARN when the number of bad pages
is at 90% or greater for easier checks and planning, instead of waiting
until the GPU is full of bad pages.

Cc: Luben Tuikov 
Cc: Mukul Joshi 
Signed-off-by: Kent Russell 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index f4c05ff4b26c..ce5089216474 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct 
amdgpu_ras_eeprom_control *control,
if (res)
DRM_ERROR("RAS table incorrect checksum or error:%d\n",
  res);
+
+   /* Warn if we are at 90% of the threshold or above */
+   if ((10 * control->ras_num_recs) >= 
(ras->bad_page_cnt_threshold * 9))
+   DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
+   control->ras_num_recs,
+   ras->bad_page_cnt_threshold);
} else if (hdr->header == RAS_TABLE_HDR_BAD &&
   amdgpu_bad_page_threshold != 0) {
res = __verify_ras_table_checksum(control);
-- 
2.25.1



Re: FW: [PATCH 1/3] drm/amdgpu: fix a potential memory leak in amdgpu_device_fini_sw()

2021-10-21 Thread Andrey Grodzovsky

On 2021-10-21 3:19 a.m., Yu, Lang wrote:


[AMD Official Use Only]




-Original Message-
From: Yu, Lang 
Sent: Thursday, October 21, 2021 3:18 PM
To: Grodzovsky, Andrey 
Cc: Deucher, Alexander ; Koenig, Christian
; Huang, Ray ; Yu, Lang

Subject: [PATCH 1/3] drm/amdgpu: fix a potential memory leak in
amdgpu_device_fini_sw()

amdgpu_fence_driver_sw_fini() should be executed before
amdgpu_device_ip_fini(), otherwise fence driver resources won't be properly freed
as adev->rings have been torn down.



Can you clarify more where exactly the memleak happens?

Andrey




Fixes: 72c8c97b1522 ("drm/amdgpu: Split amdgpu_device_fini into early and late")

Signed-off-by: Lang Yu 
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 41ce86244144..5654c4790773 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3843,8 +3843,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device
*adev)

void amdgpu_device_fini_sw(struct amdgpu_device *adev)  {
-   amdgpu_device_ip_fini(adev);
amdgpu_fence_driver_sw_fini(adev);
+   amdgpu_device_ip_fini(adev);
release_firmware(adev->firmware.gpu_info_fw);
adev->firmware.gpu_info_fw = NULL;
adev->accel_working = false;
--
2.25.1


Re: [PATCH v2 1/3] drm/amdgpu: do not pass ttm_resource_manager to gtt_mgr

2021-10-21 Thread Christian König

Am 21.10.21 um 16:31 schrieb Nirmoy Das:

Do not allow exported amdgpu_gtt_mgr_*() to accept
any ttm_resource_manager pointer. Also there is no need
to force other module to call a ttm function just to
eventually call gtt_mgr functions.

v2: pass adev's gtt_mgr instead of adev

Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  4 ++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c | 23 ++---
  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c |  4 ++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h |  4 ++--
  4 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 41ce86244144..2b53d86aebac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4287,7 +4287,7 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device 
*adev,

amdgpu_virt_init_data_exchange(adev);
/* we need recover gart prior to run SMC/CP/SDMA resume */
-   amdgpu_gtt_mgr_recover(ttm_manager_type(>mman.bdev, TTM_PL_TT));
+   amdgpu_gtt_mgr_recover(>mman.gtt_mgr);

r = amdgpu_device_fw_loading(adev);
if (r)
@@ -4604,7 +4604,7 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,
amdgpu_inc_vram_lost(tmp_adev);
}

-   r = 
amdgpu_gtt_mgr_recover(ttm_manager_type(_adev->mman.bdev, TTM_PL_TT));
+   r = 
amdgpu_gtt_mgr_recover(_adev->mman.gtt_mgr);
if (r)
goto out;

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
index c18f16b3be9c..e429f2df73be 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
@@ -77,10 +77,8 @@ static ssize_t amdgpu_mem_info_gtt_used_show(struct device 
*dev,
  {
struct drm_device *ddev = dev_get_drvdata(dev);
struct amdgpu_device *adev = drm_to_adev(ddev);
-   struct ttm_resource_manager *man;

-   man = ttm_manager_type(>mman.bdev, TTM_PL_TT);
-   return sysfs_emit(buf, "%llu\n", amdgpu_gtt_mgr_usage(man));
+   return sysfs_emit(buf, "%llu\n", 
amdgpu_gtt_mgr_usage(>mman.gtt_mgr));
  }

  static DEVICE_ATTR(mem_info_gtt_total, S_IRUGO,
@@ -206,14 +204,15 @@ static void amdgpu_gtt_mgr_del(struct 
ttm_resource_manager *man,
  /**
   * amdgpu_gtt_mgr_usage - return usage of GTT domain
   *
- * @man: TTM memory type manager
+ * @mgr: amdgpu_gtt_mgr pointer
   *
   * Return how many bytes are used in the GTT domain
   */
-uint64_t amdgpu_gtt_mgr_usage(struct ttm_resource_manager *man)
+uint64_t amdgpu_gtt_mgr_usage(struct amdgpu_gtt_mgr *mgr)
  {
-   struct amdgpu_gtt_mgr *mgr = to_gtt_mgr(man);
-   s64 result = man->size - atomic64_read(>available);
+   s64 result;
+
+   result = mgr->manager.size - atomic64_read(>available);

return (result > 0 ? result : 0) * PAGE_SIZE;
  }
@@ -221,16 +220,15 @@ uint64_t amdgpu_gtt_mgr_usage(struct ttm_resource_manager 
*man)
  /**
   * amdgpu_gtt_mgr_recover - re-init gart
   *
- * @man: TTM memory type manager
+ * @mgr: amdgpu_gtt_mgr pointer
   *
   * Re-init the gart for each known BO in the GTT.
   */
-int amdgpu_gtt_mgr_recover(struct ttm_resource_manager *man)
+int amdgpu_gtt_mgr_recover(struct amdgpu_gtt_mgr *mgr)
  {
-   struct amdgpu_gtt_mgr *mgr = to_gtt_mgr(man);
-   struct amdgpu_device *adev;
struct amdgpu_gtt_node *node;
struct drm_mm_node *mm_node;
+   struct amdgpu_device *adev;
int r = 0;

adev = container_of(mgr, typeof(*adev), mman.gtt_mgr);
@@ -260,6 +258,7 @@ static void amdgpu_gtt_mgr_debug(struct 
ttm_resource_manager *man,
 struct drm_printer *printer)
  {
struct amdgpu_gtt_mgr *mgr = to_gtt_mgr(man);
+   struct amdgpu_device *adev = container_of(mgr, typeof(*adev), 
mman.gtt_mgr);

spin_lock(>lock);
drm_mm_print(>mm, printer);
@@ -267,7 +266,7 @@ static void amdgpu_gtt_mgr_debug(struct 
ttm_resource_manager *man,

drm_printf(printer, "man size:%llu pages, gtt available:%lld pages, 
usage:%lluMB\n",
   man->size, (u64)atomic64_read(>available),
-  amdgpu_gtt_mgr_usage(man) >> 20);
+  amdgpu_gtt_mgr_usage(>mman.gtt_mgr) >> 20);


That here needs fixing, we shouldn't use the adev->mman.gtt_mgr here but 
rather upcast man.


Regards,
Christian.


  }

  static const struct ttm_resource_manager_func amdgpu_gtt_mgr_func = {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index d2955ea4a62b..603ce32db5c5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -678,7 +678,7 @@ int amdgpu_info_ioctl(struct 

[PATCH v2 2/3] drm/amdgpu: do not pass ttm_resource_manager to vram_mgr

2021-10-21 Thread Nirmoy Das
Do not allow exported amdgpu_vram_mgr_*() to accept
any ttm_resource_manager pointer. Also there is no need
to force other module to call a ttm function just to
eventually call vram_mgr functions.

v2: pass adev's vram_mgr instead of adev
Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c   |  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c   |  5 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c  | 10 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c  |  6 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h  |  8 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c |  5 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 38 
 7 files changed, 30 insertions(+), 45 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 7077f21f0021..df818e145d9a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -531,9 +531,8 @@ int amdgpu_amdkfd_get_dmabuf_info(struct kgd_dev *kgd, int 
dma_buf_fd,
 uint64_t amdgpu_amdkfd_get_vram_usage(struct kgd_dev *kgd)
 {
struct amdgpu_device *adev = (struct amdgpu_device *)kgd;
-   struct ttm_resource_manager *vram_man = 
ttm_manager_type(>mman.bdev, TTM_PL_VRAM);

-   return amdgpu_vram_mgr_usage(vram_man);
+   return amdgpu_vram_mgr_usage(>mman.vram_mgr);
 }

 uint64_t amdgpu_amdkfd_get_hive_id(struct kgd_dev *kgd)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 76fe5b71e35d..7e745164a624 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -298,7 +298,6 @@ static void amdgpu_cs_get_threshold_for_moves(struct 
amdgpu_device *adev,
 {
s64 time_us, increment_us;
u64 free_vram, total_vram, used_vram;
-   struct ttm_resource_manager *vram_man = 
ttm_manager_type(>mman.bdev, TTM_PL_VRAM);
/* Allow a maximum of 200 accumulated ms. This is basically per-IB
 * throttling.
 *
@@ -315,7 +314,7 @@ static void amdgpu_cs_get_threshold_for_moves(struct 
amdgpu_device *adev,
}

total_vram = adev->gmc.real_vram_size - 
atomic64_read(>vram_pin_size);
-   used_vram = amdgpu_vram_mgr_usage(vram_man);
+   used_vram = amdgpu_vram_mgr_usage(>mman.vram_mgr);
free_vram = used_vram >= total_vram ? 0 : total_vram - used_vram;

spin_lock(>mm_stats.lock);
@@ -362,7 +361,7 @@ static void amdgpu_cs_get_threshold_for_moves(struct 
amdgpu_device *adev,
if (!amdgpu_gmc_vram_full_visible(>gmc)) {
u64 total_vis_vram = adev->gmc.visible_vram_size;
u64 used_vis_vram =
- amdgpu_vram_mgr_vis_usage(vram_man);
+ amdgpu_vram_mgr_vis_usage(>mman.vram_mgr);

if (used_vis_vram < total_vis_vram) {
u64 free_vis_vram = total_vis_vram - used_vis_vram;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 603ce32db5c5..b426e03ad630 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -672,10 +672,10 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, 
struct drm_file *filp)
ui64 = atomic64_read(>num_vram_cpu_page_faults);
return copy_to_user(out, , min(size, 8u)) ? -EFAULT : 0;
case AMDGPU_INFO_VRAM_USAGE:
-   ui64 = amdgpu_vram_mgr_usage(ttm_manager_type(>mman.bdev, 
TTM_PL_VRAM));
+   ui64 = amdgpu_vram_mgr_usage(>mman.vram_mgr);
return copy_to_user(out, , min(size, 8u)) ? -EFAULT : 0;
case AMDGPU_INFO_VIS_VRAM_USAGE:
-   ui64 = 
amdgpu_vram_mgr_vis_usage(ttm_manager_type(>mman.bdev, TTM_PL_VRAM));
+   ui64 = amdgpu_vram_mgr_vis_usage(>mman.vram_mgr);
return copy_to_user(out, , min(size, 8u)) ? -EFAULT : 0;
case AMDGPU_INFO_GTT_USAGE:
ui64 = amdgpu_gtt_mgr_usage(>mman.gtt_mgr);
@@ -709,8 +709,6 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, 
struct drm_file *filp)
}
case AMDGPU_INFO_MEMORY: {
struct drm_amdgpu_memory_info mem;
-   struct ttm_resource_manager *vram_man =
-   ttm_manager_type(>mman.bdev, TTM_PL_VRAM);
struct ttm_resource_manager *gtt_man =
ttm_manager_type(>mman.bdev, TTM_PL_TT);
memset(, 0, sizeof(mem));
@@ -719,7 +717,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, 
struct drm_file *filp)
atomic64_read(>vram_pin_size) -
AMDGPU_VM_RESERVED_VRAM;
mem.vram.heap_usage =
-   amdgpu_vram_mgr_usage(vram_man);
+   amdgpu_vram_mgr_usage(>mman.vram_mgr);
mem.vram.max_allocation = mem.vram.usable_heap_size * 3 / 4;


[PATCH v3 3/3] drm/amdgpu: recover gart table at resume

2021-10-21 Thread Nirmoy Das
Get rid off pin/unpin of gart BO at resume/suspend and
instead pin only once and try to recover gart content
at resume time. This is much more stable in case there
is an OOM situation at the 2nd call to amdgpu_device_evict_resources()
while evicting the GART table.

v3: remove gart recovery from other places
v2: pin gart at amdgpu_gart_table_vram_alloc()
Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c   | 80 ++
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c |  3 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c  |  3 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c  |  3 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c  |  3 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  |  3 +-
 7 files changed, 12 insertions(+), 94 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 2b53d86aebac..f0c70e9d37fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3935,16 +3935,11 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
fbcon)
if (!adev->in_s0ix)
amdgpu_amdkfd_suspend(adev, adev->in_runpm);

-   /* First evict vram memory */
amdgpu_device_evict_resources(adev);

amdgpu_fence_driver_hw_fini(adev);

amdgpu_device_ip_suspend_phase2(adev);
-   /* This second call to evict device resources is to evict
-* the gart page table using the CPU.
-*/
-   amdgpu_device_evict_resources(adev);

return 0;
 }
@@ -4286,8 +4281,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device 
*adev,
goto error;

amdgpu_virt_init_data_exchange(adev);
-   /* we need recover gart prior to run SMC/CP/SDMA resume */
-   amdgpu_gtt_mgr_recover(>mman.gtt_mgr);

r = amdgpu_device_fw_loading(adev);
if (r)
@@ -4604,10 +4597,6 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,
amdgpu_inc_vram_lost(tmp_adev);
}

-   r = 
amdgpu_gtt_mgr_recover(_adev->mman.gtt_mgr);
-   if (r)
-   goto out;
-
r = amdgpu_device_fw_loading(tmp_adev);
if (r)
return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
index d3e4203f6217..679eec122bb5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
@@ -116,78 +116,16 @@ int amdgpu_gart_table_vram_alloc(struct amdgpu_device 
*adev)
 {
int r;

-   if (adev->gart.bo == NULL) {
-   struct amdgpu_bo_param bp;
-
-   memset(, 0, sizeof(bp));
-   bp.size = adev->gart.table_size;
-   bp.byte_align = PAGE_SIZE;
-   bp.domain = AMDGPU_GEM_DOMAIN_VRAM;
-   bp.flags = AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED |
-   AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS;
-   bp.type = ttm_bo_type_kernel;
-   bp.resv = NULL;
-   bp.bo_ptr_size = sizeof(struct amdgpu_bo);
-
-   r = amdgpu_bo_create(adev, , >gart.bo);
-   if (r) {
-   return r;
-   }
-   }
-   return 0;
-}
-
-/**
- * amdgpu_gart_table_vram_pin - pin gart page table in vram
- *
- * @adev: amdgpu_device pointer
- *
- * Pin the GART page table in vram so it will not be moved
- * by the memory manager (pcie r4xx, r5xx+).  These asics require the
- * gart table to be in video memory.
- * Returns 0 for success, error for failure.
- */
-int amdgpu_gart_table_vram_pin(struct amdgpu_device *adev)
-{
-   int r;
+   if (adev->gart.bo != NULL)
+   return 0;

-   r = amdgpu_bo_reserve(adev->gart.bo, false);
-   if (unlikely(r != 0))
-   return r;
-   r = amdgpu_bo_pin(adev->gart.bo, AMDGPU_GEM_DOMAIN_VRAM);
+   r = amdgpu_bo_create_kernel(adev,  adev->gart.table_size, PAGE_SIZE,
+   AMDGPU_GEM_DOMAIN_VRAM, >gart.bo,
+   NULL, (void *)>gart.ptr);
if (r) {
-   amdgpu_bo_unreserve(adev->gart.bo);
return r;
}
-   r = amdgpu_bo_kmap(adev->gart.bo, >gart.ptr);
-   if (r)
-   amdgpu_bo_unpin(adev->gart.bo);
-   amdgpu_bo_unreserve(adev->gart.bo);
-   return r;
-}
-
-/**
- * amdgpu_gart_table_vram_unpin - unpin gart page table in vram
- *
- * @adev: amdgpu_device pointer
- *
- * Unpin the GART page table in vram (pcie r4xx, r5xx+).
- * These asics require the gart table to be in video memory.
- */
-void amdgpu_gart_table_vram_unpin(struct amdgpu_device *adev)
-{
-   int r;
-
-   if (adev->gart.bo == NULL) {
-  

[PATCH v2 1/3] drm/amdgpu: do not pass ttm_resource_manager to gtt_mgr

2021-10-21 Thread Nirmoy Das
Do not allow exported amdgpu_gtt_mgr_*() to accept
any ttm_resource_manager pointer. Also there is no need
to force other module to call a ttm function just to
eventually call gtt_mgr functions.

v2: pass adev's gtt_mgr instead of adev

Signed-off-by: Nirmoy Das 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c | 23 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c |  4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h |  4 ++--
 4 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 41ce86244144..2b53d86aebac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4287,7 +4287,7 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device 
*adev,

amdgpu_virt_init_data_exchange(adev);
/* we need recover gart prior to run SMC/CP/SDMA resume */
-   amdgpu_gtt_mgr_recover(ttm_manager_type(>mman.bdev, TTM_PL_TT));
+   amdgpu_gtt_mgr_recover(>mman.gtt_mgr);

r = amdgpu_device_fw_loading(adev);
if (r)
@@ -4604,7 +4604,7 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,
amdgpu_inc_vram_lost(tmp_adev);
}

-   r = 
amdgpu_gtt_mgr_recover(ttm_manager_type(_adev->mman.bdev, TTM_PL_TT));
+   r = 
amdgpu_gtt_mgr_recover(_adev->mman.gtt_mgr);
if (r)
goto out;

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
index c18f16b3be9c..e429f2df73be 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
@@ -77,10 +77,8 @@ static ssize_t amdgpu_mem_info_gtt_used_show(struct device 
*dev,
 {
struct drm_device *ddev = dev_get_drvdata(dev);
struct amdgpu_device *adev = drm_to_adev(ddev);
-   struct ttm_resource_manager *man;

-   man = ttm_manager_type(>mman.bdev, TTM_PL_TT);
-   return sysfs_emit(buf, "%llu\n", amdgpu_gtt_mgr_usage(man));
+   return sysfs_emit(buf, "%llu\n", 
amdgpu_gtt_mgr_usage(>mman.gtt_mgr));
 }

 static DEVICE_ATTR(mem_info_gtt_total, S_IRUGO,
@@ -206,14 +204,15 @@ static void amdgpu_gtt_mgr_del(struct 
ttm_resource_manager *man,
 /**
  * amdgpu_gtt_mgr_usage - return usage of GTT domain
  *
- * @man: TTM memory type manager
+ * @mgr: amdgpu_gtt_mgr pointer
  *
  * Return how many bytes are used in the GTT domain
  */
-uint64_t amdgpu_gtt_mgr_usage(struct ttm_resource_manager *man)
+uint64_t amdgpu_gtt_mgr_usage(struct amdgpu_gtt_mgr *mgr)
 {
-   struct amdgpu_gtt_mgr *mgr = to_gtt_mgr(man);
-   s64 result = man->size - atomic64_read(>available);
+   s64 result;
+
+   result = mgr->manager.size - atomic64_read(>available);

return (result > 0 ? result : 0) * PAGE_SIZE;
 }
@@ -221,16 +220,15 @@ uint64_t amdgpu_gtt_mgr_usage(struct ttm_resource_manager 
*man)
 /**
  * amdgpu_gtt_mgr_recover - re-init gart
  *
- * @man: TTM memory type manager
+ * @mgr: amdgpu_gtt_mgr pointer
  *
  * Re-init the gart for each known BO in the GTT.
  */
-int amdgpu_gtt_mgr_recover(struct ttm_resource_manager *man)
+int amdgpu_gtt_mgr_recover(struct amdgpu_gtt_mgr *mgr)
 {
-   struct amdgpu_gtt_mgr *mgr = to_gtt_mgr(man);
-   struct amdgpu_device *adev;
struct amdgpu_gtt_node *node;
struct drm_mm_node *mm_node;
+   struct amdgpu_device *adev;
int r = 0;

adev = container_of(mgr, typeof(*adev), mman.gtt_mgr);
@@ -260,6 +258,7 @@ static void amdgpu_gtt_mgr_debug(struct 
ttm_resource_manager *man,
 struct drm_printer *printer)
 {
struct amdgpu_gtt_mgr *mgr = to_gtt_mgr(man);
+   struct amdgpu_device *adev = container_of(mgr, typeof(*adev), 
mman.gtt_mgr);

spin_lock(>lock);
drm_mm_print(>mm, printer);
@@ -267,7 +266,7 @@ static void amdgpu_gtt_mgr_debug(struct 
ttm_resource_manager *man,

drm_printf(printer, "man size:%llu pages, gtt available:%lld pages, 
usage:%lluMB\n",
   man->size, (u64)atomic64_read(>available),
-  amdgpu_gtt_mgr_usage(man) >> 20);
+  amdgpu_gtt_mgr_usage(>mman.gtt_mgr) >> 20);
 }

 static const struct ttm_resource_manager_func amdgpu_gtt_mgr_func = {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index d2955ea4a62b..603ce32db5c5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -678,7 +678,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, 
struct drm_file *filp)
ui64 = 
amdgpu_vram_mgr_vis_usage(ttm_manager_type(>mman.bdev, TTM_PL_VRAM));
return copy_to_user(out, , 

Re: [PATCH 2/2] drm/amdkfd: debug message to count successfully migrated pages

2021-10-21 Thread Felix Kuehling
Am 2021-10-20 um 8:47 p.m. schrieb Philip Yang:
> Not all migrate.cpages returned from migrate_vma_setup can be migrated,
> for example non anonymous page, or out of device memory. So after
> migrate_vma_pages returns, add debug message to count pages are
> successfully migrated which has MIGRATE_PFN_VALID and
> MIGRATE_PFN_MIGRATE flag set.
>
> Signed-off-by: Philip Yang 

The series is

Reviewed-by: Felix Kuehling 


> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 21 +
>  1 file changed, 21 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> index a14d0077e262..6d8634e40b3b 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> @@ -268,6 +268,19 @@ static void svm_migrate_put_sys_page(unsigned long addr)
>   put_page(page);
>  }
>  
> +static unsigned long svm_migrate_successful_pages(struct migrate_vma 
> *migrate)
> +{
> + unsigned long cpages = 0;
> + unsigned long i;
> +
> + for (i = 0; i < migrate->npages; i++) {
> + if (migrate->src[i] & MIGRATE_PFN_VALID &&
> + migrate->src[i] & MIGRATE_PFN_MIGRATE)
> + cpages++;
> + }
> + return cpages;
> +}
> +
>  static int
>  svm_migrate_copy_to_vram(struct amdgpu_device *adev, struct svm_range 
> *prange,
>struct migrate_vma *migrate, struct dma_fence **mfence,
> @@ -429,6 +442,10 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, 
> struct svm_range *prange,
>  
>   r = svm_migrate_copy_to_vram(adev, prange, , , scratch);
>   migrate_vma_pages();
> +
> + pr_debug("successful/cpages/npages 0x%lx/0x%lx/0x%lx\n",
> + svm_migrate_successful_pages(), cpages, migrate.npages);
> +
>   svm_migrate_copy_done(adev, mfence);
>   migrate_vma_finalize();
>  
> @@ -665,6 +682,10 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, 
> struct svm_range *prange,
>   r = svm_migrate_copy_to_ram(adev, prange, , ,
>   scratch, npages);
>   migrate_vma_pages();
> +
> + pr_debug("successful/cpages/npages 0x%lx/0x%lx/0x%lx\n",
> + svm_migrate_successful_pages(), cpages, migrate.npages);
> +
>   svm_migrate_copy_done(adev, mfence);
>   migrate_vma_finalize();
>   svm_range_dma_unmap(adev->dev, scratch, 0, npages);


[PATCH 2/2] drm/amdgpu/swsmu: handle VCN harvesting for VCN PG control

2021-10-21 Thread Alex Deucher
Check if VCN instances are harvested when controlling
VCN power gating.

Fixes: 1b592d00b4ac83 ("drm/amdgpu/vcn: remove manual instance setting")
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1743
Signed-off-by: Alex Deucher 
---
 .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c   | 28 +--
 1 file changed, 7 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
index 15e66e1912de..9326547fe5fb 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -902,32 +902,18 @@ static int sienna_cichlid_set_default_dpm_table(struct 
smu_context *smu)
 static int sienna_cichlid_dpm_set_vcn_enable(struct smu_context *smu, bool 
enable)
 {
struct amdgpu_device *adev = smu->adev;
-   int ret = 0;
+   int i, ret = 0;
 
-   if (enable) {
+   for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
+   if (adev->vcn.harvest_config & (1 << i))
+   continue;
/* vcn dpm on is a prerequisite for vcn power gate messages */
if (smu_cmn_feature_is_enabled(smu, SMU_FEATURE_MM_DPM_PG_BIT)) 
{
-   ret = smu_cmn_send_smc_msg_with_param(smu, 
SMU_MSG_PowerUpVcn, 0, NULL);
+   ret = smu_cmn_send_smc_msg_with_param(smu, enable ?
+ 
SMU_MSG_PowerUpVcn : SMU_MSG_PowerDownVcn,
+ 0x1 * i, 
NULL);
if (ret)
return ret;
-   if (adev->vcn.num_vcn_inst > 1) {
-   ret = smu_cmn_send_smc_msg_with_param(smu, 
SMU_MSG_PowerUpVcn,
- 0x1, 
NULL);
-   if (ret)
-   return ret;
-   }
-   }
-   } else {
-   if (smu_cmn_feature_is_enabled(smu, SMU_FEATURE_MM_DPM_PG_BIT)) 
{
-   ret = smu_cmn_send_smc_msg_with_param(smu, 
SMU_MSG_PowerDownVcn, 0, NULL);
-   if (ret)
-   return ret;
-   if (adev->vcn.num_vcn_inst > 1) {
-   ret = smu_cmn_send_smc_msg_with_param(smu, 
SMU_MSG_PowerDownVcn,
- 0x1, 
NULL);
-   if (ret)
-   return ret;
-   }
}
}
 
-- 
2.31.1



[PATCH 1/2] drm/amdgpu: Workaround harvesting info for some navy flounder boards

2021-10-21 Thread Alex Deucher
Some navy flounder boards do not properly mark harvested
VCN instances.  Fix that here.

v2: use IP versions

Fixes: 1b592d00b4ac83 ("drm/amdgpu/vcn: remove manual instance setting")
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1743
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index dfb92f229748..814e9620fac5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -507,6 +507,10 @@ void amdgpu_discovery_harvest_ip(struct amdgpu_device 
*adev)
break;
}
}
+   /* some IP discovery tables on Navy Flounder don't have this set 
correctly */
+   if ((adev->ip_versions[UVD_HWIP][1] == IP_VERSION(3, 0, 1)) &&
+   (adev->ip_versions[GC_HWIP][0] == IP_VERSION(10, 3, 2)))
+   adev->vcn.harvest_config |= AMDGPU_VCN_HARVEST_VCN1;
if (vcn_harvest_count == adev->vcn.num_vcn_inst) {
adev->harvest_ip_mask |= AMD_HARVEST_IP_VCN_MASK;
adev->harvest_ip_mask |= AMD_HARVEST_IP_JPEG_MASK;
-- 
2.31.1



RE: [PATCH] drm/amdgpu: limit VCN instance number to 1 for NAVY_FLOUNDER

2021-10-21 Thread Chen, Guchun
Why use asic_type to check this? The issue is caused by the IP discovery 
series, and I thought that series' goal was to remove DID/asic_type as much as 
possible in the kernel driver.

+   /* some IP discovery tables on NF don't have this set correctly */
+   if (adev->asic_type == CHIP_NAVY_FLOUNDER)
+   adev->vcn.harvest_config |= AMDGPU_VCN_HARVEST_VCN1;

Regards,
Guchun

-Original Message-
From: Alex Deucher  
Sent: Thursday, October 21, 2021 10:02 PM
To: Chen, Guchun 
Cc: amd-gfx list ; Koenig, Christian 
; Pan, Xinhui ; Deucher, 
Alexander ; Liu, Leo 
Subject: Re: [PATCH] drm/amdgpu: limit VCN instance number to 1 for 
NAVY_FLOUNDER

Thanks.  I think this patch set fixes it in a bit more future proof way:
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.freedesktop.org%2Fseries%2F96132%2Fdata=04%7C01%7CGuchun.Chen%40amd.com%7C52fab5ccf8f64b6eb09b08d9949b548f%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637704217145304873%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=2KMrUDLZZ1s3colyVy1WwY4Yz6GbyI9z53qixn%2BuUwQ%3Dreserved=0

Alex

On Thu, Oct 21, 2021 at 9:34 AM Chen, Guchun  wrote:
>
> Additionally, in sienna_cichlid_dpm_set_vcn_enable, we also use num_vcn_inst 
> to set dpm for VCN1 if it's > 1.
> The main problem here is VCN harvest info is not set correctly, so 
> vcn.harvest_config is not reliable in this case.
>
> if (smu_cmn_feature_is_enabled(smu, SMU_FEATURE_MM_DPM_PG_BIT)) {
> ret = smu_cmn_send_smc_msg_with_param(smu, 
> SMU_MSG_PowerUpVcn, 0, NULL);
> if (ret)
> return ret;
> if (adev->vcn.num_vcn_inst > 1) {
> ret = smu_cmn_send_smc_msg_with_param(smu, 
> SMU_MSG_PowerUpVcn,
>   0x1, 
> NULL);
> if (ret)
> return ret;
> }
> }
>
> Regards,
> Guchun
>
> -Original Message-
> From: Chen, Guchun
> Sent: Thursday, October 21, 2021 9:14 PM
> To: Alex Deucher 
> Cc: amd-gfx list ; Koenig, Christian 
> ; Pan, Xinhui ; Deucher, 
> Alexander ; Liu, Leo 
> Subject: RE: [PATCH] drm/amdgpu: limit VCN instance number to 1 for 
> NAVY_FLOUNDER
>
> Hi Alex,
>
> No, it does not help.
>
> adev->vcn.harvest_config is 0 after retrieving harvest info from VBIOS. It looks 
> like the harvest info in VBIOS does not reflect the case that VCN1 is power gated.
>
> I checked several navy flounders SKUs, the observation is the same, so this 
> is likely a common case. Perhaps we need to check with VBIOS/SMU guys.
>
> Regards,
> Guchun
>
> -Original Message-
> From: Alex Deucher 
> Sent: Thursday, October 21, 2021 9:06 PM
> To: Chen, Guchun 
> Cc: amd-gfx list ; Koenig, Christian 
> ; Pan, Xinhui ; Deucher, 
> Alexander ; Liu, Leo 
> Subject: Re: [PATCH] drm/amdgpu: limit VCN instance number to 1 for 
> NAVY_FLOUNDER
>
> On Thu, Oct 21, 2021 at 3:15 AM Guchun Chen  wrote:
> >
> > VCN instance 1 is power gated permanently by SMU.
> >
> > Bug:
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgi
> > tl 
> > ab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1743data=04%7C01%
> > 7C
> > guchun.chen%40amd.com%7Cda80a308a28049d543ad08d99493847d%7C3dd8961fe
> > 48 
> > 84e608e11a82d994e183d%7C0%7C0%7C637704183581593964%7CUnknown%7CTWFpb
> > GZ
> > sb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0
> > %3 
> > D%7C1000sdata=2vNLj9bXE2oV97rxBiUOiaFNpKopVSJefL%2BMcQE%2BSfo%3
> > D&
> > amp;reserved=0
> >
> > Fixes: f6b6d7d6bc2d("drm/amdgpu/vcn: remove manual instance 
> > setting")
> > Signed-off-by: Guchun Chen 
>
> Doesn't this patch effectively do the same thing?
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatc
> hwork.freedesktop.org%2Fpatch%2F460329%2Fdata=04%7C01%7CGuchun.Ch
> en%40amd.com%7C52fab5ccf8f64b6eb09b08d9949b548f%7C3dd8961fe4884e608e11
> a82d994e183d%7C0%7C0%7C637704217145304873%7CUnknown%7CTWFpbGZsb3d8eyJW
> IjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&
> amp;sdata=EmyT%2BNBnV8rIhJSqncnyFwR94smOvu2AGeb4vESFhdE%3Dreserve
> d=0 Where else is num_vcn_inst used that it causes a problem?  Or is 
> the VCN harvesting not set correctly on some navy flounders?
>
> Alex
>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 9 +
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> > b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> > index dbfd92984655..4848922667f2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> > @@ -103,6 +103,15 @@ static int vcn_v3_0_early_init(void *handle)
> > adev->vcn.num_enc_rings = 0;
> > else
> > 

RE: [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold

2021-10-21 Thread Russell, Kent
[AMD Official Use Only]

My editor won't let me reply in-line without making it look like garbage.

Thanks for the insight, Luben! They're all useful points, especially the 
consolidation and relying on the threshold_validation which has already 
occurred before we get to this point (which I should've checked).

And I overdid the transitive multiplication explanation, so I wouldn't have to 
answer questions about it later. But your concise comment below pretty much 
covers things and shouldn't cause any unnecessary inquiries.

Kent

From: Tuikov, Luben 
Sent: Wednesday, October 20, 2021 5:48 PM
To: Russell, Kent ; amd-gfx@lists.freedesktop.org
Cc: Joshi, Mukul 
Subject: Re: [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% 
threshold

On 2021-10-20 12:35, Kent Russell wrote:

Currently dmesg doesn't warn when the number of bad pages approaches the

"Currently" is redundant in this sentence as it is already in present simple 
tense.

Fair point



threshold for page retirement. WARN when the number of bad pages

is at 90% or greater for easier checks and planning, instead of waiting

until the GPU is full of bad pages

Missing full-stop (period) above.







Cc: Luben Tuikov 

Cc: Mukul Joshi 

Signed-off-by: Kent Russell 

---

 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 17 +

 1 file changed, 17 insertions(+)



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c

index f4c05ff4b26c..1ede0f0d6f55 100644

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c

+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c

@@ -1071,12 +1071,29 @@ int amdgpu_ras_eeprom_init(struct 
amdgpu_ras_eeprom_control *control,

control->ras_fri = RAS_OFFSET_TO_INDEX(control, hdr->first_rec_offset);



if (hdr->header == RAS_TABLE_HDR_VAL) {

+   int threshold = 0;

DRM_DEBUG_DRIVER("Found existing EEPROM table with %d records",

control->ras_num_recs);

res = __verify_ras_table_checksum(control);

if (res)

DRM_ERROR("RAS table incorrect checksum or error:%d\n",

 res);

+

+   /* threshold = 0 means that page retirement is disabled, while

+* threshold = -1 means default behaviour

+*/

+   if (amdgpu_bad_page_threshold == -1)

+   threshold = ras->bad_page_cnt_threshold;

+   else if (amdgpu_bad_page_threshold > 0)

+   threshold = amdgpu_bad_page_threshold;

I believe we don't need this calculation here as it's already been done for us 
in amdgpu_ras_validate_threshold(), in amdgpu_ras.c.


I believe you want to use "ras->bad_page_cnt_threshold" here instead. For 
instance of this, see a bit further down in this very function this check 
including the comment, in italics:

} else if (hdr->header == RAS_TABLE_HDR_BAD &&
   amdgpu_bad_page_threshold != 0) {
res = __verify_ras_table_checksum(control);
if (res)
DRM_ERROR("RAS Table incorrect checksum or error:%d\n",
  res);
if (ras->bad_page_cnt_threshold > control->ras_num_recs) {
/* This means that, the threshold was increased since
 * the last time the system was booted, and now,
 * ras->bad_page_cnt_threshold - control->num_recs > 0,
 * so that at least one more record can be saved,
 * before the page count threshold is reached.
 */

And on the "else", a bit further down, again in italics:

} else {
*exceed_err_limit = true;
dev_err(adev->dev,
"RAS records:%d exceed threshold:%d, "
"maybe retire this GPU?",
control->ras_num_recs, ras->bad_page_cnt_threshold);
}


See how it says "records exceed threshold"--well, with this patch you want to 
say "records exceed 90% of threshold." :-) So these are the quantities we gauge 
each other to.

Clarification on this below.





+

+   /* Since multiplication is transitive, a = 9b/10 is the same
+* as 10a = 9b. Use this for our 90% limit to avoid rounding
+* as 10a = 9b. Use this for our 90% limit to avoid rounding

+*/

I really like the format of the comment. But I feel that the comment itself 
isn't necessary... at least the way it is written ("9b" may mean "9 bits" or "9 
binary". I'd avoid getting into arithmetic theory, and remove the comment 
completely. Anything else (explaining the math) really distracts from the real 
purpose of what we're doing. (After all, this is C, not a class on 
arithmetic--they who can, will figure it out.)

Perhaps something like:

/* Warn if we get past 90% of the threshold.
 */



+   if (threshold > 0 && ((control->ras_num_recs * 10) >= 

Re: [PATCH] drm/amdgpu: limit VCN instance number to 1 for NAVY_FLOUNDER

2021-10-21 Thread Alex Deucher
Thanks.  I think this patch set fixes it in a bit more future proof way:
https://patchwork.freedesktop.org/series/96132/

Alex

On Thu, Oct 21, 2021 at 9:34 AM Chen, Guchun  wrote:
>
> Additionally, in sienna_cichlid_dpm_set_vcn_enable, we also use num_vcn_inst 
> to set dpm for VCN1 if it's > 1.
> The main problem here is VCN harvest info is not set correctly, so 
> vcn.harvest_config is not reliable in this case.
>
> if (smu_cmn_feature_is_enabled(smu, SMU_FEATURE_MM_DPM_PG_BIT)) {
> ret = smu_cmn_send_smc_msg_with_param(smu, 
> SMU_MSG_PowerUpVcn, 0, NULL);
> if (ret)
> return ret;
> if (adev->vcn.num_vcn_inst > 1) {
> ret = smu_cmn_send_smc_msg_with_param(smu, 
> SMU_MSG_PowerUpVcn,
>   0x1, 
> NULL);
> if (ret)
> return ret;
> }
> }
>
> Regards,
> Guchun
>
> -Original Message-
> From: Chen, Guchun
> Sent: Thursday, October 21, 2021 9:14 PM
> To: Alex Deucher 
> Cc: amd-gfx list ; Koenig, Christian 
> ; Pan, Xinhui ; Deucher, 
> Alexander ; Liu, Leo 
> Subject: RE: [PATCH] drm/amdgpu: limit VCN instance number to 1 for 
> NAVY_FLOUNDER
>
> Hi Alex,
>
> No, it does not help.
>
> adev->vcn.harvest_config is 0 after retrieving harvest info from VBIOS. It looks 
> like the harvest info in VBIOS does not reflect the case that VCN1 is power gated.
>
> I checked several navy flounders SKUs, the observation is the same, so this 
> is likely a common case. Perhaps we need to check with VBIOS/SMU guys.
>
> Regards,
> Guchun
>
> -Original Message-
> From: Alex Deucher 
> Sent: Thursday, October 21, 2021 9:06 PM
> To: Chen, Guchun 
> Cc: amd-gfx list ; Koenig, Christian 
> ; Pan, Xinhui ; Deucher, 
> Alexander ; Liu, Leo 
> Subject: Re: [PATCH] drm/amdgpu: limit VCN instance number to 1 for 
> NAVY_FLOUNDER
>
> On Thu, Oct 21, 2021 at 3:15 AM Guchun Chen  wrote:
> >
> > VCN instance 1 is power gated permanently by SMU.
> >
> > Bug:
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitl
> > ab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1743data=04%7C01%7C
> > guchun.chen%40amd.com%7Cda80a308a28049d543ad08d99493847d%7C3dd8961fe48
> > 84e608e11a82d994e183d%7C0%7C0%7C637704183581593964%7CUnknown%7CTWFpbGZ
> > sb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3
> > D%7C1000sdata=2vNLj9bXE2oV97rxBiUOiaFNpKopVSJefL%2BMcQE%2BSfo%3D&
> > amp;reserved=0
> >
> > Fixes: f6b6d7d6bc2d("drm/amdgpu/vcn: remove manual instance setting")
> > Signed-off-by: Guchun Chen 
>
> Doesn't this patch effectively do the same thing?
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.freedesktop.org%2Fpatch%2F460329%2Fdata=04%7C01%7Cguchun.chen%40amd.com%7Cda80a308a28049d543ad08d99493847d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637704183581593964%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=jPu3Yh%2B6OHR4F1BS5MWL3VyZ3pui6c0dP97Zl7yBJKY%3Dreserved=0
> Where else is num_vcn_inst used that it causes a problem?  Or is the VCN 
> harvesting not set correctly on some navy flounders?
>
> Alex
>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 9 +
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> > b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> > index dbfd92984655..4848922667f2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> > @@ -103,6 +103,15 @@ static int vcn_v3_0_early_init(void *handle)
> > adev->vcn.num_enc_rings = 0;
> > else
> > adev->vcn.num_enc_rings = 2;
> > +
> > +   /*
> > +* Fix ME.
> > +* VCN instance number is limited to 1 for the ASIC below because
> > +* VCN instance 1 is permanently power gated.
> > +*/
> > +   if ((adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 0, 0)) 
> > &&
> > +   (adev->ip_versions[GC_HWIP][0] == IP_VERSION(10, 3, 
> > 2)))
> > +   adev->vcn.num_vcn_inst = 1;
> > }
> >
> > vcn_v3_0_set_dec_ring_funcs(adev);
> > --
> > 2.17.1
> >


[PATCH 2/2] drm/amdgpu/swsmu: handle VCN harvesting for VCN PG control

2021-10-21 Thread Alex Deucher
Check if VCN instances are harvested when controlling
VCN power gating.

Fixes: 1b592d00b4ac83 ("drm/amdgpu/vcn: remove manual instance setting")
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1743
Signed-off-by: Alex Deucher 
---
 .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c   | 28 +--
 1 file changed, 7 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
index 15e66e1912de..9326547fe5fb 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -902,32 +902,18 @@ static int sienna_cichlid_set_default_dpm_table(struct 
smu_context *smu)
 static int sienna_cichlid_dpm_set_vcn_enable(struct smu_context *smu, bool 
enable)
 {
struct amdgpu_device *adev = smu->adev;
-   int ret = 0;
+   int i, ret = 0;
 
-   if (enable) {
+   for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
+   if (adev->vcn.harvest_config & (1 << i))
+   continue;
/* vcn dpm on is a prerequisite for vcn power gate messages */
if (smu_cmn_feature_is_enabled(smu, SMU_FEATURE_MM_DPM_PG_BIT)) 
{
-   ret = smu_cmn_send_smc_msg_with_param(smu, 
SMU_MSG_PowerUpVcn, 0, NULL);
+   ret = smu_cmn_send_smc_msg_with_param(smu, enable ?
+ 
SMU_MSG_PowerUpVcn : SMU_MSG_PowerDownVcn,
+ 0x1 * i, 
NULL);
if (ret)
return ret;
-   if (adev->vcn.num_vcn_inst > 1) {
-   ret = smu_cmn_send_smc_msg_with_param(smu, 
SMU_MSG_PowerUpVcn,
- 0x1, 
NULL);
-   if (ret)
-   return ret;
-   }
-   }
-   } else {
-   if (smu_cmn_feature_is_enabled(smu, SMU_FEATURE_MM_DPM_PG_BIT)) 
{
-   ret = smu_cmn_send_smc_msg_with_param(smu, 
SMU_MSG_PowerDownVcn, 0, NULL);
-   if (ret)
-   return ret;
-   if (adev->vcn.num_vcn_inst > 1) {
-   ret = smu_cmn_send_smc_msg_with_param(smu, 
SMU_MSG_PowerDownVcn,
- 0x1, 
NULL);
-   if (ret)
-   return ret;
-   }
}
}
 
-- 
2.31.1



[PATCH 1/2] drm/amdgpu: Workaround harvesting info for some navy flounder boards

2021-10-21 Thread Alex Deucher
Some navy flounder boards do not properly mark harvested
VCN instances.  Fix that here.

Fixes: 1b592d00b4ac83 ("drm/amdgpu/vcn: remove manual instance setting")
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1743
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index dfb92f229748..c2852ec1ade2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -507,6 +507,9 @@ void amdgpu_discovery_harvest_ip(struct amdgpu_device *adev)
break;
}
}
+   /* some IP discovery tables on NF don't have this set correctly */
+   if (adev->asic_type == CHIP_NAVY_FLOUNDER)
+   adev->vcn.harvest_config |= AMDGPU_VCN_HARVEST_VCN1;
if (vcn_harvest_count == adev->vcn.num_vcn_inst) {
adev->harvest_ip_mask |= AMD_HARVEST_IP_VCN_MASK;
adev->harvest_ip_mask |= AMD_HARVEST_IP_JPEG_MASK;
-- 
2.31.1



RE: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case

2021-10-21 Thread Russell, Kent
[AMD Official Use Only]



> -Original Message-
> From: Tuikov, Luben 
> Sent: Wednesday, October 20, 2021 6:01 PM
> To: Kuehling, Felix ; Russell, Kent 
> ;
> amd-gfx@lists.freedesktop.org
> Cc: Joshi, Mukul 
> Subject: Re: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case
> 
> On 2021-10-20 17:54, Felix Kuehling wrote:
> > On 2021-10-20 12:35 p.m., Kent Russell wrote:
> >> If the bad_page_threshold kernel parameter is set to -2,
> >> continue to post the GPU. Print a warning to dmesg that this action has
> >> been done, and that page retirement will obviously not work for said GPU
> > I'd squash patch 2 and 3. The squashed patch is
> >
> > Acked-by: Felix Kuehling 
> 
> I was just thinking the same thing. Keep the title and text of patch 2 and 
> add the description
> of 3 to 2. With that done:
> 
> Reviewed-by: Luben Tuikov 

Sounds good, thanks. I was on the fence about combining them from when I had 
the separate kernel param, and it was easier to squash it at review time than 
to separate it. I'll still need to work on patch #1 but thanks for the reviews 
here!

 Kent

> 
> Regards,
> Luben
> 
> >
> >
> >> Cc: Luben Tuikov 
> >> Cc: Mukul Joshi 
> >> Signed-off-by: Kent Russell 
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 13 +
> >>   1 file changed, 9 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >> index 1ede0f0d6f55..31852330c1db 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >> @@ -1115,11 +1115,16 @@ int amdgpu_ras_eeprom_init(struct
> amdgpu_ras_eeprom_control *control,
> >>res = amdgpu_ras_eeprom_correct_header_tag(control,
> >>   
> >> RAS_TABLE_HDR_VAL);
> >>} else {
> >> -  *exceed_err_limit = true;
> >> -  dev_err(adev->dev,
> >> -  "RAS records:%d exceed threshold:%d, "
> >> -  "GPU will not be initialized. Replace this GPU 
> >> or increase the
> threshold",
> >> +  dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
> >>control->ras_num_recs, 
> >> ras->bad_page_cnt_threshold);
> >> +  if (amdgpu_bad_page_threshold == -2) {
> >> +  dev_warn(adev->dev, "GPU will be initialized 
> >> due to
> bad_page_threshold = -2.");
> >> +  dev_warn(adev->dev, "Page retirement will not 
> >> work for
> this GPU in this state.");
> >> +  res = 0;
> >> +  } else {
> >> +  *exceed_err_limit = true;
> >> +  dev_err(adev->dev, "GPU will not be 
> >> initialized. Replace this
> GPU or increase the threshold.");
> >> +  }
> >>}
> >>} else {
> >>DRM_INFO("Creating a new EEPROM table");


RE: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case

2021-10-21 Thread Russell, Kent
[AMD Official Use Only]



> -Original Message-
> From: Lazar, Lijo 
> Sent: Thursday, October 21, 2021 1:25 AM
> To: Russell, Kent ; amd-gfx@lists.freedesktop.org
> Cc: Tuikov, Luben ; Joshi, Mukul 
> Subject: Re: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case
> 
> 
> 
> On 10/20/2021 10:05 PM, Kent Russell wrote:
> > If the bad_page_threshold kernel parameter is set to -2,
> > continue to post the GPU. Print a warning to dmesg that this action has
> > been done, and that page retirement will obviously not work for said GPU
> >
> > Cc: Luben Tuikov 
> > Cc: Mukul Joshi 
> > Signed-off-by: Kent Russell 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 13 +
> >   1 file changed, 9 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index 1ede0f0d6f55..31852330c1db 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > @@ -1115,11 +1115,16 @@ int amdgpu_ras_eeprom_init(struct
> amdgpu_ras_eeprom_control *control,
> > res = amdgpu_ras_eeprom_correct_header_tag(control,
> >
> > RAS_TABLE_HDR_VAL);
> > } else {
> > -   *exceed_err_limit = true;
> > -   dev_err(adev->dev,
> > -   "RAS records:%d exceed threshold:%d, "
> > -   "GPU will not be initialized. Replace this GPU 
> > or increase the
> threshold",
> > +   dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
> > control->ras_num_recs, 
> > ras->bad_page_cnt_threshold);
> > +   if (amdgpu_bad_page_threshold == -2) {
> > +   dev_warn(adev->dev, "GPU will be initialized 
> > due to
> bad_page_threshold = -2.");
> > +   dev_warn(adev->dev, "Page retirement will not 
> > work for
> this GPU in this state.");
> 
> Now, this looks as good as booting with 0 = disable bad page retirement.
> I thought page retirement will work as long as EEPROM has space, but it
> won't bother about the threshold. If the intent is to ignore bad page
> retirement, then 0 is good enough and -2 is not required.
> 
> Also, when user passes threshold=-2, what is the threshold being
> compared against to say *exceed_err_limit = true?

My thought on having the -2 option is that we'll still enable page retirement, 
we just won't shut the GPU down when it hits the threshold. The bad pages will 
still be retired as we hit them, it will just never disable the GPU. The 
comment about retirement not working is definitely incorrect now (leftover from 
previous local patches), so I'll remove that. In this case, I don't think we'd 
ever exceed the error limit. exceed_err_limit is only really used when we are 
disabling the GPU, so we wouldn't want to set that to true. Otherwise we 
wouldn't be loading the bad pages in gpu_recovery_init, and we'll still return 
0 from gpu_recovery_init.

 Kent
> 
> Thanks,
> Lijo
> 
> > +   res = 0;
> > +   } else {
> > +   *exceed_err_limit = true;
> > +   dev_err(adev->dev, "GPU will not be 
> > initialized. Replace this
> GPU or increase the threshold.");
> > +   }
> > }
> > } else {
> > DRM_INFO("Creating a new EEPROM table");
> >


Re: [PATCH v3 13/13] drm/i915: replace drm_detect_hdmi_monitor() with drm_display_info.is_hdmi

2021-10-21 Thread Ville Syrjälä
On Wed, Oct 20, 2021 at 12:51:21AM +0200, Claudio Suarez wrote:
> drm_get_edid() internally calls to drm_connector_update_edid_property()
> and then drm_add_display_info(), which parses the EDID.
> This happens in the function intel_hdmi_set_edid() and
> intel_sdvo_tmds_sink_detect() (via intel_sdvo_get_edid()).
> 
> Once EDID is parsed, the monitor HDMI support information is available
> through drm_display_info.is_hdmi. Retrieving the same information with
> drm_detect_hdmi_monitor() is less efficient. Change to
> drm_display_info.is_hdmi

I meant we need to examine all call chains that can lead to
.detect() to make sure all of them do in fact update the
display_info beforehand.

> 
> This is a TODO task in Documentation/gpu/todo.rst
> 
> Signed-off-by: Claudio Suarez 
> ---
>  drivers/gpu/drm/i915/display/intel_hdmi.c | 2 +-
>  drivers/gpu/drm/i915/display/intel_sdvo.c | 3 ++-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/display/intel_hdmi.c 
> b/drivers/gpu/drm/i915/display/intel_hdmi.c
> index b04685bb6439..008e5b0ba408 100644
> --- a/drivers/gpu/drm/i915/display/intel_hdmi.c
> +++ b/drivers/gpu/drm/i915/display/intel_hdmi.c
> @@ -2355,7 +2355,7 @@ intel_hdmi_set_edid(struct drm_connector *connector)
>   to_intel_connector(connector)->detect_edid = edid;
>   if (edid && edid->input & DRM_EDID_INPUT_DIGITAL) {
>   intel_hdmi->has_audio = drm_detect_monitor_audio(edid);
> - intel_hdmi->has_hdmi_sink = drm_detect_hdmi_monitor(edid);
> + intel_hdmi->has_hdmi_sink = connector->display_info.is_hdmi;
>  
>   connected = true;
>   }
> diff --git a/drivers/gpu/drm/i915/display/intel_sdvo.c 
> b/drivers/gpu/drm/i915/display/intel_sdvo.c
> index 6cb27599ea03..b4065e4df644 100644
> --- a/drivers/gpu/drm/i915/display/intel_sdvo.c
> +++ b/drivers/gpu/drm/i915/display/intel_sdvo.c
> @@ -2060,8 +2060,9 @@ intel_sdvo_tmds_sink_detect(struct drm_connector 
> *connector)
>   if (edid->input & DRM_EDID_INPUT_DIGITAL) {
>   status = connector_status_connected;
>   if (intel_sdvo_connector->is_hdmi) {
> - intel_sdvo->has_hdmi_monitor = 
> drm_detect_hdmi_monitor(edid);
>   intel_sdvo->has_hdmi_audio = 
> drm_detect_monitor_audio(edid);
> + intel_sdvo->has_hdmi_monitor =
> + 
> connector->display_info.is_hdmi;
>   }
>   } else
>   status = connector_status_disconnected;
> -- 
> 2.33.0
> 
> 

-- 
Ville Syrjälä
Intel


Re: [PATCH 4/4] drm/amdgpu/vcn3.0: remove intermediate variable

2021-10-21 Thread Leo Liu

The series are:

Reviewed-by: Leo Liu 

On 2021-10-19 4:10 p.m., Alex Deucher wrote:

No need to use the id variable, just use the constant
plus instance offset directly.

Signed-off-by: Alex Deucher 
---
  drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 11 ++-
  1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
index 57b62fb04750..da11ceba0698 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
@@ -60,11 +60,6 @@ static int amdgpu_ih_clientid_vcns[] = {
SOC15_IH_CLIENTID_VCN1
  };
  
-static int amdgpu_ucode_id_vcns[] = {

-   AMDGPU_UCODE_ID_VCN,
-   AMDGPU_UCODE_ID_VCN1
-};
-
  static int vcn_v3_0_start_sriov(struct amdgpu_device *adev);
  static void vcn_v3_0_set_dec_ring_funcs(struct amdgpu_device *adev);
  static void vcn_v3_0_set_enc_ring_funcs(struct amdgpu_device *adev);
@@ -1278,7 +1273,6 @@ static int vcn_v3_0_start_sriov(struct amdgpu_device 
*adev)
uint32_t param, resp, expected;
uint32_t offset, cache_size;
uint32_t tmp, timeout;
-   uint32_t id;
  
  	struct amdgpu_mm_table *table = >virt.mm_table;

uint32_t *table_loc;
@@ -1322,13 +1316,12 @@ static int vcn_v3_0_start_sriov(struct amdgpu_device 
*adev)
cache_size = AMDGPU_GPU_PAGE_ALIGN(adev->vcn.fw->size + 4);
  
  		if (adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) {

-   id = amdgpu_ucode_id_vcns[i];
MMSCH_V3_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(VCN, i,
mmUVD_LMI_VCPU_CACHE_64BIT_BAR_LOW),
-   adev->firmware.ucode[id].tmr_mc_addr_lo);
+   adev->firmware.ucode[AMDGPU_UCODE_ID_VCN + 
i].tmr_mc_addr_lo);
MMSCH_V3_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(VCN, i,
mmUVD_LMI_VCPU_CACHE_64BIT_BAR_HIGH),
-   adev->firmware.ucode[id].tmr_mc_addr_hi);
+   adev->firmware.ucode[AMDGPU_UCODE_ID_VCN + 
i].tmr_mc_addr_hi);
offset = 0;
MMSCH_V3_0_INSERT_DIRECT_WT(SOC15_REG_OFFSET(VCN, i,
mmUVD_VCPU_CACHE_OFFSET0),


RE: [PATCH] drm/amdgpu: limit VCN instance number to 1 for NAVY_FLOUNDER

2021-10-21 Thread Chen, Guchun
Additionally, in sienna_cichlid_dpm_set_vcn_enable, we also use num_vcn_inst to 
set dpm for VCN1 if it's > 1.
The main problem here is VCN harvest info is not set correctly, so 
vcn.harvest_config is not reliable in this case.

if (smu_cmn_feature_is_enabled(smu, SMU_FEATURE_MM_DPM_PG_BIT)) {
ret = smu_cmn_send_smc_msg_with_param(smu, 
SMU_MSG_PowerUpVcn, 0, NULL);
if (ret)
return ret;
if (adev->vcn.num_vcn_inst > 1) {
ret = smu_cmn_send_smc_msg_with_param(smu, 
SMU_MSG_PowerUpVcn,
  0x1, 
NULL);
if (ret)
return ret;
}
}

Regards,
Guchun

-Original Message-
From: Chen, Guchun 
Sent: Thursday, October 21, 2021 9:14 PM
To: Alex Deucher 
Cc: amd-gfx list ; Koenig, Christian 
; Pan, Xinhui ; Deucher, 
Alexander ; Liu, Leo 
Subject: RE: [PATCH] drm/amdgpu: limit VCN instance number to 1 for 
NAVY_FLOUNDER

Hi Alex,

No, it does not help.

adev->vcn.harvest_config is 0 after retrieving harvest info from VBIOS. It looks 
like the harvest info in VBIOS does not reflect the case that VCN1 is power gated.

I checked several navy flounders SKUs, the observation is the same, so this is 
likely a common case. Perhaps we need to check with VBIOS/SMU guys.

Regards,
Guchun

-Original Message-
From: Alex Deucher  
Sent: Thursday, October 21, 2021 9:06 PM
To: Chen, Guchun 
Cc: amd-gfx list ; Koenig, Christian 
; Pan, Xinhui ; Deucher, 
Alexander ; Liu, Leo 
Subject: Re: [PATCH] drm/amdgpu: limit VCN instance number to 1 for 
NAVY_FLOUNDER

On Thu, Oct 21, 2021 at 3:15 AM Guchun Chen  wrote:
>
> VCN instance 1 is power gated permanently by SMU.
>
> Bug: 
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitl
> ab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1743data=04%7C01%7C
> guchun.chen%40amd.com%7Cda80a308a28049d543ad08d99493847d%7C3dd8961fe48
> 84e608e11a82d994e183d%7C0%7C0%7C637704183581593964%7CUnknown%7CTWFpbGZ
> sb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3
> D%7C1000sdata=2vNLj9bXE2oV97rxBiUOiaFNpKopVSJefL%2BMcQE%2BSfo%3D&
> amp;reserved=0
>
> Fixes: f6b6d7d6bc2d("drm/amdgpu/vcn: remove manual instance setting")
> Signed-off-by: Guchun Chen 

Doesn't this patch effectively do the same thing?
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.freedesktop.org%2Fpatch%2F460329%2Fdata=04%7C01%7Cguchun.chen%40amd.com%7Cda80a308a28049d543ad08d99493847d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637704183581593964%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=jPu3Yh%2B6OHR4F1BS5MWL3VyZ3pui6c0dP97Zl7yBJKY%3Dreserved=0
Where else is num_vcn_inst used that it causes a problem?  Or is the VCN 
harvesting not set correctly on some navy flounders?

Alex

> ---
>  drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c 
> b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> index dbfd92984655..4848922667f2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> @@ -103,6 +103,15 @@ static int vcn_v3_0_early_init(void *handle)
> adev->vcn.num_enc_rings = 0;
> else
> adev->vcn.num_enc_rings = 2;
> +
> +   /*
> +* Fix ME.
> +* VCN instance number is limited to 1 for below ASIC due to
> +* VCN instance 1 is permanently power gated.
> +*/
> +   if ((adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 0, 0)) &&
> +   (adev->ip_versions[GC_HWIP][0] == IP_VERSION(10, 3, 
> 2)))
> +   adev->vcn.num_vcn_inst = 1;
> }
>
> vcn_v3_0_set_dec_ring_funcs(adev);
> --
> 2.17.1
>


RE: [PATCH] drm/amdgpu: limit VCN instance number to 1 for NAVY_FLOUNDER

2021-10-21 Thread Chen, Guchun
Hi Alex,

No, it does not help.

adev->vcn.harvest_config is 0 after retrieving harvest info from VBIOS. It looks 
like the harvest info in VBIOS does not reflect the case that VCN1 is power gated.

I checked several navy flounders SKUs, the observation is the same, so this is 
likely a common case. Perhaps we need to check with VBIOS/SMU guys.

Regards,
Guchun

-Original Message-
From: Alex Deucher  
Sent: Thursday, October 21, 2021 9:06 PM
To: Chen, Guchun 
Cc: amd-gfx list ; Koenig, Christian 
; Pan, Xinhui ; Deucher, 
Alexander ; Liu, Leo 
Subject: Re: [PATCH] drm/amdgpu: limit VCN instance number to 1 for 
NAVY_FLOUNDER

On Thu, Oct 21, 2021 at 3:15 AM Guchun Chen  wrote:
>
> VCN instance 1 is power gated permanently by SMU.
>
> Bug: 
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitl
> ab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1743data=04%7C01%7C
> guchun.chen%40amd.com%7Cda80a308a28049d543ad08d99493847d%7C3dd8961fe48
> 84e608e11a82d994e183d%7C0%7C0%7C637704183581593964%7CUnknown%7CTWFpbGZ
> sb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3
> D%7C1000sdata=2vNLj9bXE2oV97rxBiUOiaFNpKopVSJefL%2BMcQE%2BSfo%3D&
> amp;reserved=0
>
> Fixes: f6b6d7d6bc2d("drm/amdgpu/vcn: remove manual instance setting")
> Signed-off-by: Guchun Chen 

Doesn't this patch effectively do the same thing?
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.freedesktop.org%2Fpatch%2F460329%2Fdata=04%7C01%7Cguchun.chen%40amd.com%7Cda80a308a28049d543ad08d99493847d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637704183581593964%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=jPu3Yh%2B6OHR4F1BS5MWL3VyZ3pui6c0dP97Zl7yBJKY%3Dreserved=0
Where else is num_vcn_inst used that it causes a problem?  Or is the VCN 
harvesting not set correctly on some navy flounders?

Alex

> ---
>  drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c 
> b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> index dbfd92984655..4848922667f2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> @@ -103,6 +103,15 @@ static int vcn_v3_0_early_init(void *handle)
> adev->vcn.num_enc_rings = 0;
> else
> adev->vcn.num_enc_rings = 2;
> +
> +   /*
> +* Fix ME.
> +* VCN instance number is limited to 1 for below ASIC due to
> +* VCN instance 1 is permanently power gated.
> +*/
> +   if ((adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 0, 0)) &&
> +   (adev->ip_versions[GC_HWIP][0] == IP_VERSION(10, 3, 
> 2)))
> +   adev->vcn.num_vcn_inst = 1;
> }
>
> vcn_v3_0_set_dec_ring_funcs(adev);
> --
> 2.17.1
>


Re: [PATCH] drm/amdgpu: limit VCN instance number to 1 for NAVY_FLOUNDER

2021-10-21 Thread Alex Deucher
On Thu, Oct 21, 2021 at 3:15 AM Guchun Chen  wrote:
>
> VCN instance 1 is power gated permanently by SMU.
>
> Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1743
>
> Fixes: f6b6d7d6bc2d("drm/amdgpu/vcn: remove manual instance setting")
> Signed-off-by: Guchun Chen 

Doesn't this patch effectively do the same thing?
https://patchwork.freedesktop.org/patch/460329/
Where else is num_vcn_inst used that it causes a problem?  Or is the
VCN harvesting not set correctly on some navy flounders?

Alex

> ---
>  drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c 
> b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> index dbfd92984655..4848922667f2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> @@ -103,6 +103,15 @@ static int vcn_v3_0_early_init(void *handle)
> adev->vcn.num_enc_rings = 0;
> else
> adev->vcn.num_enc_rings = 2;
> +
> +   /*
> +* Fix ME.
> +* VCN instance number is limited to 1 for below ASIC due to
> +  * VCN instnace 1 is permanently power gated.
> +*/
> +   if ((adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 0, 0)) &&
> +   (adev->ip_versions[GC_HWIP][0] == IP_VERSION(10, 3, 
> 2)))
> +   adev->vcn.num_vcn_inst = 1;
> }
>
> vcn_v3_0_set_dec_ring_funcs(adev);
> --
> 2.17.1
>


RE: [PATCH] drm/amdgpu: limit VCN instance number to 1 for NAVY_FLOUNDER

2021-10-21 Thread Chen, Guchun
Re: But the logic applied in this fix tells that anything in IP discovery 
(version table or harvest table) doesn't solve the problem. This is equivalent 
to an ASIC specific logic similar to old ASIC enum checks.

Exactly, this is the challenge.

Regards,
Guchun

-Original Message-
From: Lazar, Lijo  
Sent: Thursday, October 21, 2021 8:56 PM
To: Chen, Guchun ; Koenig, Christian 
; Pan, Xinhui ; Deucher, 
Alexander ; Liu, Leo ; 
amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: limit VCN instance number to 1 for 
NAVY_FLOUNDER



On 10/21/2021 6:10 PM, Chen, Guchun wrote:
> Hi Lijo,
> 
> Alex has a following fix "85db7fcb2e53 drm/amdgpu: get VCN harvest 
> information from IP discovery table" to fix that logic.

But the logic applied in this fix tells that anything in IP discovery (version 
table or harvest table) doesn't solve the problem. This is equivalent to an 
ASIC specific logic similar to old ASIC enum checks.

> 
> For other ASICs like DIMGREY_CAVEFISH and BEIGE_GOBY, the instance count is 1, 
> matching the VBIOS discovery table. So there is no need to handle them.
> 

Thanks for the clarification! It looks good to me, will leave it to 
Alex/Leo/James.

Thanks,
Lijo

> Regards,
> Guchun
> 
> -Original Message-
> From: Lazar, Lijo 
> Sent: Thursday, October 21, 2021 5:45 PM
> To: Chen, Guchun ; amd-gfx@lists.freedesktop.org; 
> Koenig, Christian ; Pan, Xinhui 
> ; Deucher, Alexander ; 
> Liu, Leo 
> Subject: Re: [PATCH] drm/amdgpu: limit VCN instance number to 1 for 
> NAVY_FLOUNDER
> 
> 
> 
> On 10/21/2021 12:45 PM, Guchun Chen wrote:
>> VCN instance 1 is power gated permanently by SMU.
>>
>> Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1743
>>
>> Fixes: f6b6d7d6bc2d("drm/amdgpu/vcn: remove manual instance setting")
> 
> Nice find. Looking at the fix, the logic is already broken by
> 5e26e52adb46("drm/amdgpu/vcn3.0: convert to IP version checking")
> 
> Any ASIC other than Sienna which has same VCN IP version (3.0.0) may be 
> broken. Any more extra checks?
> 
> Thanks,
> Lijo
> 
>> Signed-off-by: Guchun Chen 
>> ---
>>drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 9 +
>>1 file changed, 9 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
>> b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
>> index dbfd92984655..4848922667f2 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
>> @@ -103,6 +103,15 @@ static int vcn_v3_0_early_init(void *handle)
>>  adev->vcn.num_enc_rings = 0;
>>  else
>>  adev->vcn.num_enc_rings = 2;
>> +
>> +/*
>> + * Fix ME.
>> + * VCN instance number is limited to 1 for below ASIC due to
>> + * VCN instance 1 is permanently power gated.
>> + */
>> +if ((adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 0, 0)) &&
>> +(adev->ip_versions[GC_HWIP][0] == IP_VERSION(10, 3, 2)))
>> +adev->vcn.num_vcn_inst = 1;
>>  }
>>
>>  vcn_v3_0_set_dec_ring_funcs(adev);
>>


Re: [PATCH] drm/amdgpu: limit VCN instance number to 1 for NAVY_FLOUNDER

2021-10-21 Thread Lazar, Lijo




On 10/21/2021 6:10 PM, Chen, Guchun wrote:

Hi Lijo,

Alex has a following fix "85db7fcb2e53 drm/amdgpu: get VCN harvest information from 
IP discovery table" to fix that logic.


But the logic applied in this fix tells that anything in IP discovery 
(version table or harvest table) doesn't solve the problem. This is 
equivalent to an ASIC specific logic similar to old ASIC enum checks.




For other ASICs like DIMGREY_CAVEFISH and BEIGE_GOBY, the instance count is 1, 
matching the VBIOS discovery table. So there is no need to handle them.



Thanks for the clarification! It looks good to me, will leave it to 
Alex/Leo/James.


Thanks,
Lijo


Regards,
Guchun

-Original Message-
From: Lazar, Lijo 
Sent: Thursday, October 21, 2021 5:45 PM
To: Chen, Guchun ; amd-gfx@lists.freedesktop.org; Koenig, Christian 
; Pan, Xinhui ; Deucher, Alexander 
; Liu, Leo 
Subject: Re: [PATCH] drm/amdgpu: limit VCN instance number to 1 for 
NAVY_FLOUNDER



On 10/21/2021 12:45 PM, Guchun Chen wrote:

VCN instance 1 is power gated permanently by SMU.

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1743

Fixes: f6b6d7d6bc2d("drm/amdgpu/vcn: remove manual instance setting")


Nice find. Looking at the fix, the logic is already broken by
5e26e52adb46("drm/amdgpu/vcn3.0: convert to IP version checking")

Any ASIC other than Sienna which has same VCN IP version (3.0.0) may be broken. 
Any more extra checks?

Thanks,
Lijo


Signed-off-by: Guchun Chen 
---
   drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 9 +
   1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
index dbfd92984655..4848922667f2 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
@@ -103,6 +103,15 @@ static int vcn_v3_0_early_init(void *handle)
adev->vcn.num_enc_rings = 0;
else
adev->vcn.num_enc_rings = 2;
+
+   /*
+* Fix ME.
+* VCN instance number is limited to 1 for below ASIC due to
+* VCN instance 1 is permanently power gated.
+*/
+   if ((adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 0, 0)) &&
+   (adev->ip_versions[GC_HWIP][0] == IP_VERSION(10, 3, 2)))
+   adev->vcn.num_vcn_inst = 1;
}
   
   	vcn_v3_0_set_dec_ring_funcs(adev);




RE: [PATCH] drm/amdgpu: limit VCN instance number to 1 for NAVY_FLOUNDER

2021-10-21 Thread Chen, Guchun
Hi Lijo,

Alex has a following fix "85db7fcb2e53 drm/amdgpu: get VCN harvest information 
from IP discovery table" to fix that logic.

For other ASICs like DIMGREY_CAVEFISH and BEIGE_GOBY, the instance count is 1, 
matching the VBIOS discovery table. So there is no need to handle them.

Regards,
Guchun

-Original Message-
From: Lazar, Lijo  
Sent: Thursday, October 21, 2021 5:45 PM
To: Chen, Guchun ; amd-gfx@lists.freedesktop.org; Koenig, 
Christian ; Pan, Xinhui ; 
Deucher, Alexander ; Liu, Leo 
Subject: Re: [PATCH] drm/amdgpu: limit VCN instance number to 1 for 
NAVY_FLOUNDER



On 10/21/2021 12:45 PM, Guchun Chen wrote:
> VCN instance 1 is power gated permanently by SMU.
> 
> Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1743
> 
> Fixes: f6b6d7d6bc2d("drm/amdgpu/vcn: remove manual instance setting")

Nice find. Looking at the fix, the logic is already broken by
5e26e52adb46("drm/amdgpu/vcn3.0: convert to IP version checking")

Any ASIC other than Sienna which has same VCN IP version (3.0.0) may be broken. 
Any more extra checks?

Thanks,
Lijo

> Signed-off-by: Guchun Chen 
> ---
>   drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 9 +
>   1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c 
> b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> index dbfd92984655..4848922667f2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
> @@ -103,6 +103,15 @@ static int vcn_v3_0_early_init(void *handle)
>   adev->vcn.num_enc_rings = 0;
>   else
>   adev->vcn.num_enc_rings = 2;
> +
> + /*
> +  * Fix ME.
> +  * VCN instance number is limited to 1 for below ASIC due to
> +  * VCN instance 1 is permanently power gated.
> +  */
> + if ((adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 0, 0)) &&
> + (adev->ip_versions[GC_HWIP][0] == IP_VERSION(10, 3, 2)))
> + adev->vcn.num_vcn_inst = 1;
>   }
>   
>   vcn_v3_0_set_dec_ring_funcs(adev);
> 


RE: [PATCH v2] drm/amd/amdgpu: add dummy_page_addr to sriov msg

2021-10-21 Thread Chen, Horace
[AMD Official Use Only]

Reviewed-by: Horace Chen 

-Original Message-
From: Chen, JingWen 
Sent: Thursday, October 21, 2021 6:12 PM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Monk ; Chen, Horace ; Chen, 
JingWen 
Subject: [PATCH v2] drm/amd/amdgpu: add dummy_page_addr to sriov msg

Add dummy_page_addr to sriov msg for host driver to set
GCVM_L2_PROTECTION_DEFAULT_ADDR* registers correctly.

v2:
should update vf2pf msg instead
Signed-off-by: Jingwen Chen 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c| 1 +
 drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 88c4177b708a..99c149397aae 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -584,6 +584,7 @@ static int amdgpu_virt_write_vf2pf_data(struct 
amdgpu_device *adev)
vf2pf_info->encode_usage = 0;
vf2pf_info->decode_usage = 0;

+   vf2pf_info->dummy_page_addr = (uint64_t)adev->dummy_page_addr;
vf2pf_info->checksum =
amd_sriov_msg_checksum(
vf2pf_info, vf2pf_info->header.size, 0, 0); diff --git 
a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h 
b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
index 995899191288..7326b6c1b71c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
@@ -261,9 +261,10 @@ struct amd_sriov_msg_vf2pf_info {
uint8_t  id;
uint32_t version;
} ucode_info[AMD_SRIOV_MSG_RESERVE_UCODE];
+   uint64_t dummy_page_addr;

/* reserved */
-   uint32_t reserved[256-68];
+   uint32_t reserved[256-70];
 };

 /* mailbox message send from guest to host  */
--
2.30.2



[PATCH v2] drm/amd/amdgpu: add dummy_page_addr to sriov msg

2021-10-21 Thread Jingwen Chen
Add dummy_page_addr to sriov msg for host driver to set
GCVM_L2_PROTECTION_DEFAULT_ADDR* registers correctly.

v2:
should update vf2pf msg instead
Signed-off-by: Jingwen Chen 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c| 1 +
 drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 88c4177b708a..99c149397aae 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -584,6 +584,7 @@ static int amdgpu_virt_write_vf2pf_data(struct 
amdgpu_device *adev)
vf2pf_info->encode_usage = 0;
vf2pf_info->decode_usage = 0;
 
+   vf2pf_info->dummy_page_addr = (uint64_t)adev->dummy_page_addr;
vf2pf_info->checksum =
amd_sriov_msg_checksum(
vf2pf_info, vf2pf_info->header.size, 0, 0);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h 
b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
index 995899191288..7326b6c1b71c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
@@ -261,9 +261,10 @@ struct amd_sriov_msg_vf2pf_info {
uint8_t  id;
uint32_t version;
} ucode_info[AMD_SRIOV_MSG_RESERVE_UCODE];
+   uint64_t dummy_page_addr;
 
/* reserved */
-   uint32_t reserved[256-68];
+   uint32_t reserved[256-70];
 };
 
 /* mailbox message send from guest to host  */
-- 
2.30.2



Re: [RFC PATCH 0/4] drm/dp: Use DP2.0 DPCD 248h updated register/field names for DP PHY CTS

2021-10-21 Thread Jani Nikula
On Wed, 20 Oct 2021, Khaled Almahallawy  wrote:
> This series updates DPCD 248h register name and PHY test patterns names to 
> follow DP 2.0 Specs.
> Also updates the DP PHY CTS codes of the affected drivers (i915, amd, msm)
> No functional changes expected.
>  
> Reference: “DPCD 248h/10Bh/10Ch/10Dh/10Eh Name/Description Consistency”
> https://groups.vesa.org/wg/AllMem/documentComment/2738

You can't do renames like this piece by piece. Every commit must build.

Incidentally, this is one of the reasons we often don't bother with
renames to follow spec changes, but rather stick to the original names.

However, in this case you could switch all drivers to the different test
pattern macros piece by piece, as they're already there.


BR,
Jani.


>
> Khaled Almahallawy (4):
>   drm/dp: Rename DPCD 248h according to DP 2.0 specs
>   drm/i915/dp: Use DP 2.0 LINK_QUAL_PATTERN_* Phy test pattern
> definitions
>   drm/amd/dc: Use DPCD 248h DP 2.0 new name
>   drm/msm/dp: Use DPCD 248h DP 2.0 new names/definitions
>
>  drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c |  2 +-
>  drivers/gpu/drm/drm_dp_helper.c  |  6 +++---
>  drivers/gpu/drm/i915/display/intel_dp.c  | 12 ++--
>  drivers/gpu/drm/msm/dp/dp_catalog.c  | 12 ++--
>  drivers/gpu/drm/msm/dp/dp_ctrl.c | 12 ++--
>  drivers/gpu/drm/msm/dp/dp_link.c | 16 
>  include/drm/drm_dp_helper.h  | 13 +++--
>  7 files changed, 33 insertions(+), 40 deletions(-)

-- 
Jani Nikula, Intel Open Source Graphics Center


Re: [Intel-gfx] [RFC PATCH 1/4] drm/dp: Rename DPCD 248h according to DP 2.0 specs

2021-10-21 Thread Jani Nikula
On Wed, 20 Oct 2021, Khaled Almahallawy  wrote:
> DPCD 248h name was changed from “PHY_TEST_PATTERN” in DP 1.4 to 
> “LINK_QUAL_PATTERN_SELECT” in DP 2.0.

Please use ASCII double quotes ". Please reflow the commit message to
limit line lenghts to about 72 characters.

> Also, DPCD 248h [6:0] is the same as DPCDs 10Bh/10Ch/10Dh/10Eh [6:0]. So 
> removed the repeated definition of PHY patterns.
>
> Reference: “DPCD 248h/10Bh/10Ch/10Dh/10Eh Name/Description Consistency”
> https://groups.vesa.org/wg/AllMem/documentComment/2738
>
> Signed-off-by: Khaled Almahallawy 
> ---
>  drivers/gpu/drm/drm_dp_helper.c |  6 +++---
>  include/drm/drm_dp_helper.h | 13 +++--
>  2 files changed, 6 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_dp_helper.c b/drivers/gpu/drm/drm_dp_helper.c
> index ada0a1ff262d..c9c928c08026 100644
> --- a/drivers/gpu/drm/drm_dp_helper.c
> +++ b/drivers/gpu/drm/drm_dp_helper.c
> @@ -2489,19 +2489,19 @@ int drm_dp_get_phy_test_pattern(struct drm_dp_aux 
> *aux,
>   if (lanes & DP_ENHANCED_FRAME_CAP)
>   data->enhanced_frame_cap = true;
>  
> - err = drm_dp_dpcd_readb(aux, DP_PHY_TEST_PATTERN, >phy_pattern);
> + err = drm_dp_dpcd_readb(aux, DP_LINK_QUAL_PATTERN_SELECT, 
> >phy_pattern);
>   if (err < 0)
>   return err;
>  
>   switch (data->phy_pattern) {
> - case DP_PHY_TEST_PATTERN_80BIT_CUSTOM:
> + case DP_LINK_QUAL_PATTERN_80BIT_CUSTOM:
>   err = drm_dp_dpcd_read(aux, DP_TEST_80BIT_CUSTOM_PATTERN_7_0,
>  >custom80, sizeof(data->custom80));
>   if (err < 0)
>   return err;
>  
>   break;
> - case DP_PHY_TEST_PATTERN_CP2520:
> + case DP_LINK_QUAL_PATTERN_CP2520_PAT_1:
>   err = drm_dp_dpcd_read(aux, DP_TEST_HBR2_SCRAMBLER_RESET,
>  >hbr2_reset,
>  sizeof(data->hbr2_reset));
> diff --git a/include/drm/drm_dp_helper.h b/include/drm/drm_dp_helper.h
> index afdf7f4183f9..ef915bb75bb4 100644
> --- a/include/drm/drm_dp_helper.h
> +++ b/include/drm/drm_dp_helper.h
> @@ -862,16 +862,9 @@ struct drm_panel;
>  # define DP_TEST_CRC_SUPPORTED   (1 << 5)
>  # define DP_TEST_COUNT_MASK  0xf
>  
> -#define DP_PHY_TEST_PATTERN 0x248
> -# define DP_PHY_TEST_PATTERN_SEL_MASK   0x7
> -# define DP_PHY_TEST_PATTERN_NONE   0x0
> -# define DP_PHY_TEST_PATTERN_D10_2  0x1
> -# define DP_PHY_TEST_PATTERN_ERROR_COUNT0x2
> -# define DP_PHY_TEST_PATTERN_PRBS7  0x3
> -# define DP_PHY_TEST_PATTERN_80BIT_CUSTOM   0x4
> -# define DP_PHY_TEST_PATTERN_CP2520 0x5
> -
> -#define DP_PHY_SQUARE_PATTERN0x249
> +#define DP_LINK_QUAL_PATTERN_SELECT 0x248

Please add a comment here referencing where the values are. There are
examples in the file.

> +
> +#define DP_PHY_SQUARE_PATTERN   0x249
>  
>  #define DP_TEST_HBR2_SCRAMBLER_RESET0x24A
>  #define DP_TEST_80BIT_CUSTOM_PATTERN_7_00x250

-- 
Jani Nikula, Intel Open Source Graphics Center


Re: [PATCH] drm/amdgpu: limit VCN instance number to 1 for NAVY_FLOUNDER

2021-10-21 Thread Lazar, Lijo




On 10/21/2021 12:45 PM, Guchun Chen wrote:

VCN instance 1 is power gated permanently by SMU.

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1743

Fixes: f6b6d7d6bc2d("drm/amdgpu/vcn: remove manual instance setting")


Nice find. Looking at the fix, the logic is already broken by
5e26e52adb46("drm/amdgpu/vcn3.0: convert to IP version checking")

Any ASIC other than Sienna which has same VCN IP version (3.0.0) may be 
broken. Any more extra checks?


Thanks,
Lijo


Signed-off-by: Guchun Chen 
---
  drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 9 +
  1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
index dbfd92984655..4848922667f2 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
@@ -103,6 +103,15 @@ static int vcn_v3_0_early_init(void *handle)
adev->vcn.num_enc_rings = 0;
else
adev->vcn.num_enc_rings = 2;
+
+   /*
+* Fix ME.
+* VCN instance number is limited to 1 for below ASIC due to
+* VCN instance 1 is permanently power gated.
+*/
+   if ((adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 0, 0)) &&
+   (adev->ip_versions[GC_HWIP][0] == IP_VERSION(10, 3, 2)))
+   adev->vcn.num_vcn_inst = 1;
}
  
  	vcn_v3_0_set_dec_ring_funcs(adev);




Re: [PATCH] drm/amd/amdgpu: add dummy_page_addr to sriov msg

2021-10-21 Thread Christian König

Am 21.10.21 um 10:50 schrieb Jingwen Chen:

Add dummy_page_addr to sriov msg for host driver to set
GCVM_L2_PROTECTION_DEFAULT_ADDR* registers correctly.

Signed-off-by: Jingwen Chen 


Acked-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c| 1 +
  drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h | 4 +++-
  2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 88c4177b708a..99c149397aae 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -584,6 +584,7 @@ static int amdgpu_virt_write_vf2pf_data(struct 
amdgpu_device *adev)
vf2pf_info->encode_usage = 0;
vf2pf_info->decode_usage = 0;
  
+	vf2pf_info->dummy_page_addr = (uint64_t)adev->dummy_page_addr;

vf2pf_info->checksum =
amd_sriov_msg_checksum(
vf2pf_info, vf2pf_info->header.size, 0, 0);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h 
b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
index 995899191288..5e3d8ecfa968 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
@@ -206,8 +206,10 @@ struct amd_sriov_msg_pf2vf_info {
struct amd_sriov_msg_uuid_info uuid_info;
/* pcie atomic Ops info */
uint32_t pcie_atomic_ops_enabled_flags;
+   /* dummy page addr */
+   uint64_t dummy_page_addr;
/* reserved */
-   uint32_t reserved[256 - 48];
+   uint32_t reserved[256 - 50];
  };
  
  struct amd_sriov_msg_vf2pf_info_header {




[PATCH] drm/amd/amdgpu: add dummy_page_addr to sriov msg

2021-10-21 Thread Jingwen Chen
Add dummy_page_addr to sriov msg for host driver to set
GCVM_L2_PROTECTION_DEFAULT_ADDR* registers correctly.

Signed-off-by: Jingwen Chen 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c| 1 +
 drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 88c4177b708a..99c149397aae 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -584,6 +584,7 @@ static int amdgpu_virt_write_vf2pf_data(struct 
amdgpu_device *adev)
vf2pf_info->encode_usage = 0;
vf2pf_info->decode_usage = 0;
 
+   vf2pf_info->dummy_page_addr = (uint64_t)adev->dummy_page_addr;
vf2pf_info->checksum =
amd_sriov_msg_checksum(
vf2pf_info, vf2pf_info->header.size, 0, 0);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h 
b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
index 995899191288..5e3d8ecfa968 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgv_sriovmsg.h
@@ -206,8 +206,10 @@ struct amd_sriov_msg_pf2vf_info {
struct amd_sriov_msg_uuid_info uuid_info;
/* pcie atomic Ops info */
uint32_t pcie_atomic_ops_enabled_flags;
+   /* dummy page addr */
+   uint64_t dummy_page_addr;
/* reserved */
-   uint32_t reserved[256 - 48];
+   uint32_t reserved[256 - 50];
 };
 
 struct amd_sriov_msg_vf2pf_info_header {
-- 
2.30.2



[RFC PATCH 3/4] drm/amd/dc: Use DPCD 248h DP 2.0 new name

2021-10-21 Thread Khaled Almahallawy
Use the new definition of DPCD 248h (DP_LINK_QUAL_PATTERN_SELECT)
No functional changes.

Cc: Harry Wentland 
Cc: Alex Deucher 
Signed-off-by: Khaled Almahallawy 
---
 drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c 
b/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c
index 54662d74c65a..d34187bb42dd 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c
@@ -3604,7 +3604,7 @@ static void dp_test_send_phy_test_pattern(struct dc_link 
*link)
/* get phy test pattern and pattern parameters from DP receiver */
core_link_read_dpcd(
link,
-   DP_PHY_TEST_PATTERN,
+   DP_LINK_QUAL_PATTERN_SELECT,
_test_pattern.raw,
sizeof(dpcd_test_pattern));
core_link_read_dpcd(
-- 
2.25.1



[RFC PATCH 1/4] drm/dp: Rename DPCD 248h according to DP 2.0 specs

2021-10-21 Thread Khaled Almahallawy
DPCD 248h name was changed from “PHY_TEST_PATTERN” in DP 1.4 to 
“LINK_QUAL_PATTERN_SELECT” in DP 2.0.

Also, DPCD 248h [6:0] is the same as DPCDs 10Bh/10Ch/10Dh/10Eh [6:0]. So 
removed the repeated definition of PHY patterns.

Reference: “DPCD 248h/10Bh/10Ch/10Dh/10Eh Name/Description Consistency”
https://groups.vesa.org/wg/AllMem/documentComment/2738

Signed-off-by: Khaled Almahallawy 
---
 drivers/gpu/drm/drm_dp_helper.c |  6 +++---
 include/drm/drm_dp_helper.h | 13 +++--
 2 files changed, 6 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/drm_dp_helper.c b/drivers/gpu/drm/drm_dp_helper.c
index ada0a1ff262d..c9c928c08026 100644
--- a/drivers/gpu/drm/drm_dp_helper.c
+++ b/drivers/gpu/drm/drm_dp_helper.c
@@ -2489,19 +2489,19 @@ int drm_dp_get_phy_test_pattern(struct drm_dp_aux *aux,
if (lanes & DP_ENHANCED_FRAME_CAP)
data->enhanced_frame_cap = true;
 
-   err = drm_dp_dpcd_readb(aux, DP_PHY_TEST_PATTERN, >phy_pattern);
+   err = drm_dp_dpcd_readb(aux, DP_LINK_QUAL_PATTERN_SELECT, 
>phy_pattern);
if (err < 0)
return err;
 
switch (data->phy_pattern) {
-   case DP_PHY_TEST_PATTERN_80BIT_CUSTOM:
+   case DP_LINK_QUAL_PATTERN_80BIT_CUSTOM:
err = drm_dp_dpcd_read(aux, DP_TEST_80BIT_CUSTOM_PATTERN_7_0,
   >custom80, sizeof(data->custom80));
if (err < 0)
return err;
 
break;
-   case DP_PHY_TEST_PATTERN_CP2520:
+   case DP_LINK_QUAL_PATTERN_CP2520_PAT_1:
err = drm_dp_dpcd_read(aux, DP_TEST_HBR2_SCRAMBLER_RESET,
   >hbr2_reset,
   sizeof(data->hbr2_reset));
diff --git a/include/drm/drm_dp_helper.h b/include/drm/drm_dp_helper.h
index afdf7f4183f9..ef915bb75bb4 100644
--- a/include/drm/drm_dp_helper.h
+++ b/include/drm/drm_dp_helper.h
@@ -862,16 +862,9 @@ struct drm_panel;
 # define DP_TEST_CRC_SUPPORTED (1 << 5)
 # define DP_TEST_COUNT_MASK0xf
 
-#define DP_PHY_TEST_PATTERN 0x248
-# define DP_PHY_TEST_PATTERN_SEL_MASK   0x7
-# define DP_PHY_TEST_PATTERN_NONE   0x0
-# define DP_PHY_TEST_PATTERN_D10_2  0x1
-# define DP_PHY_TEST_PATTERN_ERROR_COUNT0x2
-# define DP_PHY_TEST_PATTERN_PRBS7  0x3
-# define DP_PHY_TEST_PATTERN_80BIT_CUSTOM   0x4
-# define DP_PHY_TEST_PATTERN_CP2520 0x5
-
-#define DP_PHY_SQUARE_PATTERN  0x249
+#define DP_LINK_QUAL_PATTERN_SELECT 0x248
+
+#define DP_PHY_SQUARE_PATTERN   0x249
 
 #define DP_TEST_HBR2_SCRAMBLER_RESET0x24A
 #define DP_TEST_80BIT_CUSTOM_PATTERN_7_00x250
-- 
2.25.1



[RFC PATCH 2/4] drm/i915/dp: Use DP 2.0 LINK_QUAL_PATTERN_* Phy test pattern definitions

2021-10-21 Thread Khaled Almahallawy
Update selected phy test pattern names to use the new names/definitions of DPCD 
248h in DP2.0/drm_dp_helpers.h
No functional changes

Cc: Manasi Navare 
CC: Jani Nikula 
Cc: Imre Deak 
Signed-off-by: Khaled Almahallawy 
---
 drivers/gpu/drm/i915/display/intel_dp.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/display/intel_dp.c 
b/drivers/gpu/drm/i915/display/intel_dp.c
index f5dc2126d140..931e8083e54a 100644
--- a/drivers/gpu/drm/i915/display/intel_dp.c
+++ b/drivers/gpu/drm/i915/display/intel_dp.c
@@ -3367,27 +3367,27 @@ static void intel_dp_phy_pattern_update(struct intel_dp 
*intel_dp,
u32 pattern_val;
 
switch (data->phy_pattern) {
-   case DP_PHY_TEST_PATTERN_NONE:
+   case DP_LINK_QUAL_PATTERN_DISABLE:
DRM_DEBUG_KMS("Disable Phy Test Pattern\n");
intel_de_write(dev_priv, DDI_DP_COMP_CTL(pipe), 0x0);
break;
-   case DP_PHY_TEST_PATTERN_D10_2:
+   case DP_LINK_QUAL_PATTERN_D10_2:
DRM_DEBUG_KMS("Set D10.2 Phy Test Pattern\n");
intel_de_write(dev_priv, DDI_DP_COMP_CTL(pipe),
   DDI_DP_COMP_CTL_ENABLE | DDI_DP_COMP_CTL_D10_2);
break;
-   case DP_PHY_TEST_PATTERN_ERROR_COUNT:
+   case DP_LINK_QUAL_PATTERN_ERROR_RATE:
DRM_DEBUG_KMS("Set Error Count Phy Test Pattern\n");
intel_de_write(dev_priv, DDI_DP_COMP_CTL(pipe),
   DDI_DP_COMP_CTL_ENABLE |
   DDI_DP_COMP_CTL_SCRAMBLED_0);
break;
-   case DP_PHY_TEST_PATTERN_PRBS7:
+   case DP_LINK_QUAL_PATTERN_PRBS7:
DRM_DEBUG_KMS("Set PRBS7 Phy Test Pattern\n");
intel_de_write(dev_priv, DDI_DP_COMP_CTL(pipe),
   DDI_DP_COMP_CTL_ENABLE | DDI_DP_COMP_CTL_PRBS7);
break;
-   case DP_PHY_TEST_PATTERN_80BIT_CUSTOM:
+   case DP_LINK_QUAL_PATTERN_80BIT_CUSTOM:
/*
 * FIXME: Ideally pattern should come from DPCD 0x250. As
 * current firmware of DPR-100 could not set it, so hardcoding
@@ -3404,7 +3404,7 @@ static void intel_dp_phy_pattern_update(struct intel_dp 
*intel_dp,
   DDI_DP_COMP_CTL_ENABLE |
   DDI_DP_COMP_CTL_CUSTOM80);
break;
-   case DP_PHY_TEST_PATTERN_CP2520:
+   case DP_LINK_QUAL_PATTERN_CP2520_PAT_1:
/*
 * FIXME: Ideally pattern should come from DPCD 0x24A. As
 * current firmware of DPR-100 could not set it, so hardcoding
-- 
2.25.1



[RFC PATCH 0/4] drm/dp: Use DP2.0 DPCD 248h updated register/field names for DP PHY CTS

2021-10-21 Thread Khaled Almahallawy
This series updates DPCD 248h register name and PHY test patterns names to 
follow DP 2.0 Specs.
Also updates the DP PHY CTS codes of the affected drivers (i915, amd, msm)
No functional changes expected.
 
Reference: “DPCD 248h/10Bh/10Ch/10Dh/10Eh Name/Description Consistency”
https://groups.vesa.org/wg/AllMem/documentComment/2738

Khaled Almahallawy (4):
  drm/dp: Rename DPCD 248h according to DP 2.0 specs
  drm/i915/dp: Use DP 2.0 LINK_QUAL_PATTERN_* Phy test pattern
definitions
  drm/amd/dc: Use DPCD 248h DP 2.0 new name
  drm/msm/dp: Use DPCD 248h DP 2.0 new names/definitions

 drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c |  2 +-
 drivers/gpu/drm/drm_dp_helper.c  |  6 +++---
 drivers/gpu/drm/i915/display/intel_dp.c  | 12 ++--
 drivers/gpu/drm/msm/dp/dp_catalog.c  | 12 ++--
 drivers/gpu/drm/msm/dp/dp_ctrl.c | 12 ++--
 drivers/gpu/drm/msm/dp/dp_link.c | 16 
 include/drm/drm_dp_helper.h  | 13 +++--
 7 files changed, 33 insertions(+), 40 deletions(-)

-- 
2.25.1



[RFC PATCH 4/4] drm/msm/dp: Use DPCD 248h DP 2.0 new names/definitions

2021-10-21 Thread Khaled Almahallawy
Use DP 2.0 DPCD 248h new name (LINK_QUAL_PATTERN_SELECT) and rename selected 
phy test patterns to LINK_QUAL_PATTERN_*

Note: TPS4 LT pattern is CP2520 Pattern 3 (refer to DP2.0 specs Table 3-11, 
DPCD 00248h
LINK_QUAL_PATTERN_SELECT, and DP PHY 1.4 CTS - Appendix A - Compliance EYE 
Pattern(CP2520; Normative))
That is why the change from DP_PHY_TEST_PATTERN_SEL_MASK to 
DP_LINK_QUAL_PATTERN_CP2520_PAT_3
No functional changes

Cc: Chandan Uddaraju 
Cc: Kuogee Hsieh 
Signed-off-by: Khaled Almahallawy 
---
 drivers/gpu/drm/msm/dp/dp_catalog.c | 12 ++--
 drivers/gpu/drm/msm/dp/dp_ctrl.c| 12 ++--
 drivers/gpu/drm/msm/dp/dp_link.c| 16 
 3 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/msm/dp/dp_catalog.c 
b/drivers/gpu/drm/msm/dp/dp_catalog.c
index cc2bb8295329..2076439ac2a2 100644
--- a/drivers/gpu/drm/msm/dp/dp_catalog.c
+++ b/drivers/gpu/drm/msm/dp/dp_catalog.c
@@ -690,11 +690,11 @@ void dp_catalog_ctrl_send_phy_pattern(struct dp_catalog 
*dp_catalog,
 
DRM_DEBUG_DP("pattern: %#x\n", pattern);
switch (pattern) {
-   case DP_PHY_TEST_PATTERN_D10_2:
+   case DP_LINK_QUAL_PATTERN_D10_2:
dp_write_link(catalog, REG_DP_STATE_CTRL,
DP_STATE_CTRL_LINK_TRAINING_PATTERN1);
break;
-   case DP_PHY_TEST_PATTERN_ERROR_COUNT:
+   case DP_LINK_QUAL_PATTERN_ERROR_RATE:
value &= ~(1 << 16);
dp_write_link(catalog, REG_DP_HBR2_COMPLIANCE_SCRAMBLER_RESET,
value);
@@ -706,11 +706,11 @@ void dp_catalog_ctrl_send_phy_pattern(struct dp_catalog 
*dp_catalog,
dp_write_link(catalog, REG_DP_STATE_CTRL,
DP_STATE_CTRL_LINK_SYMBOL_ERR_MEASURE);
break;
-   case DP_PHY_TEST_PATTERN_PRBS7:
+   case DP_LINK_QUAL_PATTERN_PRBS7:
dp_write_link(catalog, REG_DP_STATE_CTRL,
DP_STATE_CTRL_LINK_PRBS7);
break;
-   case DP_PHY_TEST_PATTERN_80BIT_CUSTOM:
+   case DP_LINK_QUAL_PATTERN_80BIT_CUSTOM:
dp_write_link(catalog, REG_DP_STATE_CTRL,
DP_STATE_CTRL_LINK_TEST_CUSTOM_PATTERN);
/* 00101010 */
@@ -723,7 +723,7 @@ void dp_catalog_ctrl_send_phy_pattern(struct dp_catalog 
*dp_catalog,
dp_write_link(catalog, REG_DP_TEST_80BIT_CUSTOM_PATTERN_REG2,
0xF83E);
break;
-   case DP_PHY_TEST_PATTERN_CP2520:
+   case DP_LINK_QUAL_PATTERN_CP2520_PAT_1:
value = dp_read_link(catalog, REG_DP_MAINLINK_CTRL);
value &= ~DP_MAINLINK_CTRL_SW_BYPASS_SCRAMBLER;
dp_write_link(catalog, REG_DP_MAINLINK_CTRL, value);
@@ -742,7 +742,7 @@ void dp_catalog_ctrl_send_phy_pattern(struct dp_catalog 
*dp_catalog,
value |= DP_MAINLINK_CTRL_ENABLE;
dp_write_link(catalog, REG_DP_MAINLINK_CTRL, value);
break;
-   case DP_PHY_TEST_PATTERN_SEL_MASK:
+   case DP_LINK_QUAL_PATTERN_CP2520_PAT_3:
dp_write_link(catalog, REG_DP_MAINLINK_CTRL,
DP_MAINLINK_CTRL_ENABLE);
dp_write_link(catalog, REG_DP_STATE_CTRL,
diff --git a/drivers/gpu/drm/msm/dp/dp_ctrl.c b/drivers/gpu/drm/msm/dp/dp_ctrl.c
index 62e75dc8afc6..a97f9dd03a8c 100644
--- a/drivers/gpu/drm/msm/dp/dp_ctrl.c
+++ b/drivers/gpu/drm/msm/dp/dp_ctrl.c
@@ -1553,25 +1553,25 @@ static bool dp_ctrl_send_phy_test_pattern(struct 
dp_ctrl_private *ctrl)
switch (pattern_sent) {
case MR_LINK_TRAINING1:
success = (pattern_requested ==
-   DP_PHY_TEST_PATTERN_D10_2);
+   DP_LINK_QUAL_PATTERN_D10_2);
break;
case MR_LINK_SYMBOL_ERM:
success = ((pattern_requested ==
-   DP_PHY_TEST_PATTERN_ERROR_COUNT) ||
+   DP_LINK_QUAL_PATTERN_ERROR_RATE) ||
(pattern_requested ==
-   DP_PHY_TEST_PATTERN_CP2520));
+   DP_LINK_QUAL_PATTERN_CP2520_PAT_1));
break;
case MR_LINK_PRBS7:
success = (pattern_requested ==
-   DP_PHY_TEST_PATTERN_PRBS7);
+   DP_LINK_QUAL_PATTERN_PRBS7);
break;
case MR_LINK_CUSTOM80:
success = (pattern_requested ==
-   DP_PHY_TEST_PATTERN_80BIT_CUSTOM);
+   DP_LINK_QUAL_PATTERN_80BIT_CUSTOM);
break;
case MR_LINK_TRAINING4:
success = (pattern_requested ==
-   DP_PHY_TEST_PATTERN_SEL_MASK);
+   

RE: [PATCH 1/3] drm/amdgpu: fix a potential memory leak in amdgpu_device_fini_sw()

2021-10-21 Thread Yu, Lang
[AMD Official Use Only]



>-Original Message-
>From: Koenig, Christian 
>Sent: Thursday, October 21, 2021 3:27 PM
>To: Yu, Lang ; Grodzovsky, Andrey
>
>Cc: Deucher, Alexander ; Huang, Ray
>
>Subject: Re: [PATCH 1/3] drm/amdgpu: fix a potential memory leak in
>amdgpu_device_fini_sw()
>
>Is there any reason you are sending that around only internally and not to the
>public mailing list?

Sorry, I missed that. It’s a mistake.

Regards,
Lang

>Christian.
>
>Am 21.10.21 um 09:17 schrieb Lang Yu:
>> amdgpu_fence_driver_sw_fini() should be executed before
>> amdgpu_device_ip_fini(), otherwise fence driver resource won't be
>> properly freed as adev->rings have been torn down.
>>
>> Fixes: 72c8c97b1522 ("drm/amdgpu: Split amdgpu_device_fini into early
>> and late")
>>
>> Signed-off-by: Lang Yu 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 41ce86244144..5654c4790773 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3843,8 +3843,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device
>> *adev)
>>
>>   void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>>   {
>> -amdgpu_device_ip_fini(adev);
>>  amdgpu_fence_driver_sw_fini(adev);
>> +amdgpu_device_ip_fini(adev);
>>  release_firmware(adev->firmware.gpu_info_fw);
>>  adev->firmware.gpu_info_fw = NULL;
>>  adev->accel_working = false;


FW: [PATCH 2/3] drm/amdgpu: use some wrapper functions in amdgpu_device_fini_sw()

2021-10-21 Thread Yu, Lang
[AMD Official Use Only]



>-Original Message-
>From: Yu, Lang 
>Sent: Thursday, October 21, 2021 3:18 PM
>To: Grodzovsky, Andrey 
>Cc: Deucher, Alexander ; Koenig, Christian
>; Huang, Ray ; Yu, Lang
>
>Subject: [PATCH 2/3] drm/amdgpu: use some wrapper functions in
>amdgpu_device_fini_sw()
>
>Add some wrapper functions to make amdgpu_device_fini_sw() more clear.
>
>Fix an error handling in amdgpu_device_parse_gpu_info_fw().
>
>Signed-off-by: Lang Yu 
>---
> drivers/gpu/drm/amd/amdgpu/amdgpu.h| 10 +++
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 32 --
> 2 files changed, 34 insertions(+), 8 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>index d58e37fd01f4..5df194259e15 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>@@ -372,6 +372,11 @@ int amdgpu_device_ip_block_add(struct amdgpu_device
>*adev,
>  */
> bool amdgpu_get_bios(struct amdgpu_device *adev);  bool
>amdgpu_read_bios(struct amdgpu_device *adev);
>+static inline void amdgpu_free_bios(struct amdgpu_device *adev) {
>+  kfree(adev->bios);
>+  adev->bios = NULL;
>+}
>
> /*
>  * Clocks
>@@ -1440,6 +1445,11 @@ void amdgpu_pci_resume(struct pci_dev *pdev);
>
> bool amdgpu_device_cache_pci_state(struct pci_dev *pdev);  bool
>amdgpu_device_load_pci_state(struct pci_dev *pdev);
>+static inline void amdgpu_device_free_pci_state(struct amdgpu_device
>+*adev) {
>+  kfree(adev->pci_state);
>+  adev->pci_state = NULL;
>+}
>
> bool amdgpu_device_skip_hw_access(struct amdgpu_device *adev);
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>index 5654c4790773..be64861ed19a 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>@@ -1871,6 +1871,19 @@ static void
>amdgpu_device_enable_virtual_display(struct amdgpu_device *adev)
>   }
> }
>
>+/**
>+ * amdgpu_device_release_gpu_info_fw - release gpu info firmware
>+ *
>+ * @adev: amdgpu_device pointer
>+ *
>+ *  Wrapper to release gpu info firmware  */ static inline void
>+amdgpu_device_release_gpu_info_fw(struct amdgpu_device *adev) {
>+  release_firmware(adev->firmware.gpu_info_fw);
>+  adev->firmware.gpu_info_fw = NULL;
>+}
>+
> /**
>  * amdgpu_device_parse_gpu_info_fw - parse gpu info firmware
>  *
>@@ -1987,7 +2000,7 @@ static int amdgpu_device_parse_gpu_info_fw(struct
>amdgpu_device *adev)
>   dev_err(adev->dev,
>   "Failed to validate gpu_info firmware \"%s\"\n",
>   fw_name);
>-  goto out;
>+  goto release_fw;
>   }
>
>   hdr = (const struct gpu_info_firmware_header_v1_0 *)adev-
>>firmware.gpu_info_fw->data;
>@@ -2051,8 +2064,12 @@ static int amdgpu_device_parse_gpu_info_fw(struct
>amdgpu_device *adev)
>   dev_err(adev->dev,
>   "Unsupported gpu_info table %d\n", hdr-
>>header.ucode_version);
>   err = -EINVAL;
>-  goto out;
>+  goto release_fw;
>   }
>+
>+  return 0;
>+release_fw:
>+  amdgpu_device_release_gpu_info_fw(adev);
> out:
>   return err;
> }
>@@ -3845,8 +3862,8 @@ void amdgpu_device_fini_sw(struct amdgpu_device
>*adev)  {
>   amdgpu_fence_driver_sw_fini(adev);
>   amdgpu_device_ip_fini(adev);
>-  release_firmware(adev->firmware.gpu_info_fw);
>-  adev->firmware.gpu_info_fw = NULL;
>+  amdgpu_device_release_gpu_info_fw(adev);
>+
>   adev->accel_working = false;
>
>   amdgpu_reset_fini(adev);
>@@ -3858,8 +3875,8 @@ void amdgpu_device_fini_sw(struct amdgpu_device
>*adev)
>   if (amdgpu_emu_mode != 1)
>   amdgpu_atombios_fini(adev);
>
>-  kfree(adev->bios);
>-  adev->bios = NULL;
>+  amdgpu_free_bios(adev);
>+
>   if (amdgpu_device_supports_px(adev_to_drm(adev))) {
>   vga_switcheroo_unregister_client(adev->pdev);
>   vga_switcheroo_fini_domain_pm_ops(adev->dev);
>@@ -3872,8 +3889,7 @@ void amdgpu_device_fini_sw(struct amdgpu_device
>*adev)
>   if (adev->mman.discovery_bin)
>   amdgpu_discovery_fini(adev);
>
>-  kfree(adev->pci_state);
>-
>+  amdgpu_device_free_pci_state(adev);
> }
>
> /**
>--
>2.25.1


FW: [PATCH 3/3] drm/amdgpu: remove unnecessary NULL check in amdgpu_device.c

2021-10-21 Thread Yu, Lang
[AMD Official Use Only]



>-Original Message-
>From: Yu, Lang 
>Sent: Thursday, October 21, 2021 3:18 PM
>To: Grodzovsky, Andrey 
>Cc: Deucher, Alexander ; Koenig, Christian
>; Huang, Ray ; Yu, Lang
>
>Subject: [PATCH 3/3] drm/amdgpu: remove unnecessary NULL check in
>amdgpu_device.c
>
>NULL is safe for these functions.
>
>Signed-off-by: Lang Yu 
>---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 +++--
> 1 file changed, 7 insertions(+), 10 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>index be64861ed19a..dd979db93399 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>@@ -1091,12 +1091,9 @@ static void amdgpu_device_doorbell_fini(struct
>amdgpu_device *adev)
>  */
> static void amdgpu_device_wb_fini(struct amdgpu_device *adev)  {
>-  if (adev->wb.wb_obj) {
>-  amdgpu_bo_free_kernel(>wb.wb_obj,
>->wb.gpu_addr,
>-(void **)>wb.wb);
>-  adev->wb.wb_obj = NULL;
>-  }
>+  amdgpu_bo_free_kernel(>wb.wb_obj,
>+>wb.gpu_addr,
>+(void **)>wb.wb);
> }
>
> /**
>@@ -3794,8 +3791,8 @@ static void amdgpu_device_unmap_mmio(struct
>amdgpu_device *adev)
>
>   iounmap(adev->rmmio);
>   adev->rmmio = NULL;
>-  if (adev->mman.aper_base_kaddr)
>-  iounmap(adev->mman.aper_base_kaddr);
>+
>+  iounmap(adev->mman.aper_base_kaddr);
>   adev->mman.aper_base_kaddr = NULL;
>
>   /* Memory manager related */
>@@ -3886,8 +3883,8 @@ void amdgpu_device_fini_sw(struct amdgpu_device
>*adev)
>
>   if (IS_ENABLED(CONFIG_PERF_EVENTS))
>   amdgpu_pmu_fini(adev);
>-  if (adev->mman.discovery_bin)
>-  amdgpu_discovery_fini(adev);
>+
>+  amdgpu_discovery_fini(adev);
>
>   amdgpu_device_free_pci_state(adev);
> }
>--
>2.25.1


FW: [PATCH 1/3] drm/amdgpu: fix a potential memory leak in amdgpu_device_fini_sw()

2021-10-21 Thread Yu, Lang
[AMD Official Use Only]



>-Original Message-
>From: Yu, Lang 
>Sent: Thursday, October 21, 2021 3:18 PM
>To: Grodzovsky, Andrey 
>Cc: Deucher, Alexander ; Koenig, Christian
>; Huang, Ray ; Yu, Lang
>
>Subject: [PATCH 1/3] drm/amdgpu: fix a potential memory leak in
>amdgpu_device_fini_sw()
>
>amdgpu_fence_driver_sw_fini() should be executed before
>amdgpu_device_ip_fini(), otherwise fence driver resource won't be properly 
>freed
>as adev->rings have been torn down.
>
>Fixes: 72c8c97b1522 ("drm/amdgpu: Split amdgpu_device_fini into early and 
>late")
>
>Signed-off-by: Lang Yu 
>---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>index 41ce86244144..5654c4790773 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>@@ -3843,8 +3843,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device
>*adev)
>
> void amdgpu_device_fini_sw(struct amdgpu_device *adev)  {
>-  amdgpu_device_ip_fini(adev);
>   amdgpu_fence_driver_sw_fini(adev);
>+  amdgpu_device_ip_fini(adev);
>   release_firmware(adev->firmware.gpu_info_fw);
>   adev->firmware.gpu_info_fw = NULL;
>   adev->accel_working = false;
>--
>2.25.1


[PATCH] drm/amdgpu: limit VCN instance number to 1 for NAVY_FLOUNDER

2021-10-21 Thread Guchun Chen
VCN instance 1 is power gated permanently by SMU.

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1743

Fixes: f6b6d7d6bc2d ("drm/amdgpu/vcn: remove manual instance setting")
Signed-off-by: Guchun Chen 
---
 drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
index dbfd92984655..4848922667f2 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
@@ -103,6 +103,15 @@ static int vcn_v3_0_early_init(void *handle)
adev->vcn.num_enc_rings = 0;
else
adev->vcn.num_enc_rings = 2;
+
+   /*
+* Fix ME.
+* VCN instance number is limited to 1 for below ASIC due to
+* VCN instnace 1 is permanently power gated.
+*/
+   if ((adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 0, 0)) &&
+   (adev->ip_versions[GC_HWIP][0] == IP_VERSION(10, 3, 2)))
+   adev->vcn.num_vcn_inst = 1;
}
 
vcn_v3_0_set_dec_ring_funcs(adev);
-- 
2.17.1



Re: [PATCH 1/1] drm/amdgpu: fix BO leak after successful move test

2021-10-21 Thread Christian König




Am 20.10.21 um 14:55 schrieb Das, Nirmoy:


On 10/20/2021 1:51 PM, Christian König wrote:

Am 20.10.21 um 13:50 schrieb Christian König:



Am 13.10.21 um 17:09 schrieb Nirmoy Das:

GTT BO cleanup code is with in the test for loop and
we would skip cleaning up GTT BO on success.

Reported-by: zhang 
Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_test.c | 25 


  1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_test.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_test.c

index 909d830b513e..5fe7ff680c29 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_test.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_test.c
@@ -35,6 +35,7 @@ static void amdgpu_do_test_moves(struct 
amdgpu_device *adev)

  struct amdgpu_bo *vram_obj = NULL;
  struct amdgpu_bo **gtt_obj = NULL;
  struct amdgpu_bo_param bp;
+    struct dma_fence *fence = NULL;
  uint64_t gart_addr, vram_addr;
  unsigned n, size;
  int i, r;
@@ -82,7 +83,6 @@ static void amdgpu_do_test_moves(struct 
amdgpu_device *adev)

  void *gtt_map, *vram_map;
  void **gart_start, **gart_end;
  void **vram_start, **vram_end;
-    struct dma_fence *fence = NULL;
    bp.domain = AMDGPU_GEM_DOMAIN_GTT;
  r = amdgpu_bo_create(adev, , gtt_obj + i);
@@ -212,24 +212,23 @@ static void amdgpu_do_test_moves(struct 
amdgpu_device *adev)
    DRM_INFO("Tested GTT->VRAM and VRAM->GTT copy for GTT 
offset 0x%llx\n",

   gart_addr - adev->gmc.gart_start);
-    continue;
+    }
  +    --i;
  out_lclean_unpin:
-    amdgpu_bo_unpin(gtt_obj[i]);
+    amdgpu_bo_unpin(gtt_obj[i]);
  out_lclean_unres:
-    amdgpu_bo_unreserve(gtt_obj[i]);
+    amdgpu_bo_unreserve(gtt_obj[i]);
  out_lclean_unref:
-    amdgpu_bo_unref(_obj[i]);
+    amdgpu_bo_unref(_obj[i]);
  out_lclean:
-    for (--i; i >= 0; --i) {
-    amdgpu_bo_unpin(gtt_obj[i]);
-    amdgpu_bo_unreserve(gtt_obj[i]);
-    amdgpu_bo_unref(_obj[i]);
-    }
-    if (fence)
-    dma_fence_put(fence);
-    break;
+    for (--i; i >= 0; --i) {


The usual idiom for cleanups like that is "while (i--)..." because 
that also works with an unsigned i.


Apart from that looks good to me.


But I'm not sure that we would want to keep the in kernel tests 
around anyway.


We now have my amdgpu_stress tool to test memory bandwidth and mesa 
has an option for that for a long time as well.



Shall I then remove amdgpu_test.c ?


Please double check if the amdgpu_stress utility gives you the same 
functionality, if yes we should probably remove this test here.


Thanks,
Christian.




Nirmoy




Christian.



Christian.


+    amdgpu_bo_unpin(gtt_obj[i]);
+    amdgpu_bo_unreserve(gtt_obj[i]);
+    amdgpu_bo_unref(_obj[i]);
  }
+    if (fence)
+    dma_fence_put(fence);
    amdgpu_bo_unpin(vram_obj);
  out_unres:








Re: [PATCH 1/1] drm/amdgpu: fix BO leak after successful move test

2021-10-21 Thread Christian König

Am 21.10.21 um 04:07 schrieb zhang:


On 2021/10/20 19:51, Christian König wrote:

Am 20.10.21 um 13:50 schrieb Christian König:



Am 13.10.21 um 17:09 schrieb Nirmoy Das:

GTT BO cleanup code is with in the test for loop and
we would skip cleaning up GTT BO on success.

Reported-by: zhang 
Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_test.c | 25 


  1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_test.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_test.c

index 909d830b513e..5fe7ff680c29 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_test.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_test.c
@@ -35,6 +35,7 @@ static void amdgpu_do_test_moves(struct 
amdgpu_device *adev)

  struct amdgpu_bo *vram_obj = NULL;
  struct amdgpu_bo **gtt_obj = NULL;
  struct amdgpu_bo_param bp;
+    struct dma_fence *fence = NULL;
  uint64_t gart_addr, vram_addr;
  unsigned n, size;
  int i, r;
@@ -82,7 +83,6 @@ static void amdgpu_do_test_moves(struct 
amdgpu_device *adev)

  void *gtt_map, *vram_map;
  void **gart_start, **gart_end;
  void **vram_start, **vram_end;
-    struct dma_fence *fence = NULL;
    bp.domain = AMDGPU_GEM_DOMAIN_GTT;
  r = amdgpu_bo_create(adev, , gtt_obj + i);
@@ -212,24 +212,23 @@ static void amdgpu_do_test_moves(struct 
amdgpu_device *adev)
    DRM_INFO("Tested GTT->VRAM and VRAM->GTT copy for GTT 
offset 0x%llx\n",

   gart_addr - adev->gmc.gart_start);
-    continue;
+    }
  +    --i;
  out_lclean_unpin:
-    amdgpu_bo_unpin(gtt_obj[i]);
+    amdgpu_bo_unpin(gtt_obj[i]);
  out_lclean_unres:
-    amdgpu_bo_unreserve(gtt_obj[i]);
+    amdgpu_bo_unreserve(gtt_obj[i]);
  out_lclean_unref:
-    amdgpu_bo_unref(_obj[i]);
+    amdgpu_bo_unref(_obj[i]);
  out_lclean:
-    for (--i; i >= 0; --i) {
-    amdgpu_bo_unpin(gtt_obj[i]);
-    amdgpu_bo_unreserve(gtt_obj[i]);
-    amdgpu_bo_unref(_obj[i]);
-    }
-    if (fence)
-    dma_fence_put(fence);
-    break;
+    for (--i; i >= 0; --i) {


The usual idiom for cleanups like that is "while (i--)..." because 
that also works with an unsigned i.


Apart from that looks good to me.


But I'm not sure that we would want to keep the in kernel tests 
around anyway.


We now have my amdgpu_stress tool to test memory bandwidth and mesa 
has an option for that for a long time as well.


Christian.

  I found a  testsuit about "bo eviction Test"  for amdgpu . in 
libdrm  tests.


But I couldn't find the amdgpu_stress tool to test memory bandwidth anywhere


That was merged just yesterday. See upstream libdrm.

Christian.





Christian.


+    amdgpu_bo_unpin(gtt_obj[i]);
+    amdgpu_bo_unreserve(gtt_obj[i]);
+    amdgpu_bo_unref(_obj[i]);
  }
+    if (fence)
+    dma_fence_put(fence);
    amdgpu_bo_unpin(vram_obj);
  out_unres:








Re: Lockdep spalt on killing a processes

2021-10-21 Thread Christian König




Am 20.10.21 um 21:32 schrieb Andrey Grodzovsky:

On 2021-10-04 4:14 a.m., Christian König wrote:


The problem is a bit different.

The callback is on the dependent fence, while we need to signal the 
scheduler fence.


Daniel is right that this needs an irq_work struct to handle this 
properly.


Christian.



So we had some discussions with Christian regarding irq_work and 
agreed I should look into doing it but stepping back for a sec -


Why we insist on calling the dma_fence_cb  with fence->lock locked ? 
Is it because of dma_fence_add_callback ?
Because we first test for DMA_FENCE_FLAG_SIGNALED_BIT and only after 
that lock the fence->lock ? If so, can't we
move DMA_FENCE_FLAG_SIGNALED_BIT  check inside the locked section ? 
Because if in theory
we could call the cb with unlocked fence->lock (i.e. this kind of 
iteration 
https://elixir.bootlin.com/linux/v5.15-rc6/source/drivers/gpu/drm/ttm/ttm_resource.c#L117)

we wouldn't have the lockdep splat. And in general, is it really
the correct approach to call a third party code from a call back with 
locked spinlock ? We don't know what the cb does inside
and I don't see any explicit restrictions in documentation of 
dma_fence_func_t what can and cannot be done there.


Yeah, that's exactly what I meant with using the irq_work directly in 
the fence code.


The problem is dma_fence_signal_locked() which is used by quite a number 
of drivers to signal the fence while holding the lock.


Otherwise we could indeed simplify the fence handling a lot.

Christian.



Andrey




Am 01.10.21 um 17:10 schrieb Andrey Grodzovsky:
From what I see here you supposed to have actual deadlock and not 
only warning, sched_fence->finished is  first signaled from within
hw fence done callback (drm_sched_job_done_cb) but then again from 
within it's own callback (drm_sched_entity_kill_jobs_cb) and so
looks like same fence  object is recursively signaled twice. This 
leads to attempt to lock fence->lock second time while it's already
locked. I don't see a need to call drm_sched_fence_finished from 
within drm_sched_entity_kill_jobs_cb as this callback already 
registered
on sched_fence->finished fence (entity->last_scheduled == 
s_fence->finished) and hence the signaling already took place.


Andrey

On 2021-10-01 6:50 a.m., Christian König wrote:

Hey, Andrey.

while investigating some memory management problems I've got the 
lockdep splat below.


Looks like something is wrong with drm_sched_entity_kill_jobs_cb(), 
can you investigate?


Thanks,
Christian.

[11176.741052] 
[11176.741056] WARNING: possible recursive locking detected
[11176.741060] 5.15.0-rc1-00031-g9d546d600800 #171 Not tainted
[11176.741066] 
[11176.741070] swapper/12/0 is trying to acquire lock:
[11176.741074] 9c337ed175a8 (>lock){-.-.}-{3:3}, at: 
dma_fence_signal+0x28/0x80

[11176.741088]
   but task is already holding lock:
[11176.741092] 9c337ed172a8 (>lock){-.-.}-{3:3}, at: 
dma_fence_signal+0x28/0x80

[11176.741100]
   other info that might help us debug this:
[11176.741104]  Possible unsafe locking scenario:

[11176.741108]    CPU0
[11176.741110]    
[11176.741113]   lock(>lock);
[11176.741118]   lock(>lock);
[11176.741122]
    *** DEADLOCK ***

[11176.741125]  May be due to missing lock nesting notation

[11176.741128] 2 locks held by swapper/12/0:
[11176.741133]  #0: 9c339c30f768 
(>fence_drv.lock){-.-.}-{3:3}, at: dma_fence_signal+0x28/0x80
[11176.741142]  #1: 9c337ed172a8 (>lock){-.-.}-{3:3}, 
at: dma_fence_signal+0x28/0x80

[11176.741151]
   stack backtrace:
[11176.741155] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 
5.15.0-rc1-00031-g9d546d600800 #171
[11176.741160] Hardware name: System manufacturer System Product 
Name/PRIME X399-A, BIOS 0808 10/12/2018

[11176.741165] Call Trace:
[11176.741169]  
[11176.741173]  dump_stack_lvl+0x5b/0x74
[11176.741181]  dump_stack+0x10/0x12
[11176.741186]  __lock_acquire.cold+0x208/0x2df
[11176.741197]  lock_acquire+0xc6/0x2d0
[11176.741204]  ? dma_fence_signal+0x28/0x80
[11176.741212]  _raw_spin_lock_irqsave+0x4d/0x70
[11176.741219]  ? dma_fence_signal+0x28/0x80
[11176.741225]  dma_fence_signal+0x28/0x80
[11176.741230]  drm_sched_fence_finished+0x12/0x20 [gpu_sched]
[11176.741240]  drm_sched_entity_kill_jobs_cb+0x1c/0x50 [gpu_sched]
[11176.741248]  dma_fence_signal_timestamp_locked+0xac/0x1a0
[11176.741254]  dma_fence_signal+0x3b/0x80
[11176.741260]  drm_sched_fence_finished+0x12/0x20 [gpu_sched]
[11176.741268]  drm_sched_job_done.isra.0+0x7f/0x1a0 [gpu_sched]
[11176.741277]  drm_sched_job_done_cb+0x12/0x20 [gpu_sched]
[11176.741284]  dma_fence_signal_timestamp_locked+0xac/0x1a0
[11176.741290]  dma_fence_signal+0x3b/0x80
[11176.741296]  amdgpu_fence_process+0xd1/0x140 [amdgpu]
[11176.741504]  sdma_v4_0_process_trap_irq+0x8c/0xb0 [amdgpu]
[11176.741731]  amdgpu_irq_dispatch+0xce/0x250 [amdgpu]