Re: [PATCH] drm/amdgpu: invert the logic in amdgpu_device_should_recover_gpu()

2022-01-12 Thread Christian König

On 11.01.22 at 23:45, Alex Deucher wrote:

Rather than opting into GPU recovery support, default to on, and
opt out if it's not working on a particular GPU.  This avoids the
need to add new asics to this list since this is a core feature.

Signed-off-by: Alex Deucher 


Reviewed-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 44 +-
  1 file changed, 17 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f33e43018616..32ad50b86248 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4453,34 +4453,24 @@ bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev)
  
  	if (amdgpu_gpu_recovery == -1) {

switch (adev->asic_type) {
-   case CHIP_BONAIRE:
-   case CHIP_HAWAII:
-   case CHIP_TOPAZ:
-   case CHIP_TONGA:
-   case CHIP_FIJI:
-   case CHIP_POLARIS10:
-   case CHIP_POLARIS11:
-   case CHIP_POLARIS12:
-   case CHIP_VEGAM:
-   case CHIP_VEGA20:
-   case CHIP_VEGA10:
-   case CHIP_VEGA12:
-   case CHIP_RAVEN:
-   case CHIP_ARCTURUS:
-   case CHIP_RENOIR:
-   case CHIP_NAVI10:
-   case CHIP_NAVI14:
-   case CHIP_NAVI12:
-   case CHIP_SIENNA_CICHLID:
-   case CHIP_NAVY_FLOUNDER:
-   case CHIP_DIMGREY_CAVEFISH:
-   case CHIP_BEIGE_GOBY:
-   case CHIP_VANGOGH:
-   case CHIP_ALDEBARAN:
-   case CHIP_YELLOW_CARP:
-   break;
-   default:
+#ifdef CONFIG_DRM_AMDGPU_SI
+   case CHIP_VERDE:
+   case CHIP_TAHITI:
+   case CHIP_PITCAIRN:
+   case CHIP_OLAND:
+   case CHIP_HAINAN:
+#endif
+#ifdef CONFIG_DRM_AMDGPU_CIK
+   case CHIP_KAVERI:
+   case CHIP_KABINI:
+   case CHIP_MULLINS:
+#endif
+   case CHIP_CARRIZO:
+   case CHIP_STONEY:
+   case CHIP_CYAN_SKILLFISH:
goto disabled;
+   default:
+   break;
}
}
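
A minimal userspace sketch of the inverted logic in the hunk above, to show the default-on behaviour in isolation. The enum values, the sketch_ names, and the reduction of the function to just the module parameter and the ASIC switch are assumptions made for illustration; the real amdgpu_device_should_recover_gpu() may perform further checks that this hunk does not show.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a few asic_type values; not the kernel enum. */
enum sketch_asic {
	SKETCH_CHIP_TAHITI,
	SKETCH_CHIP_CARRIZO,
	SKETCH_CHIP_VEGA10,
	SKETCH_CHIP_FUTURE,	/* an ASIC added after the list was written */
};

/*
 * recovery_param models the amdgpu_gpu_recovery module parameter
 * (assumed here: -1 = auto, 0 = off, anything else = on).
 */
bool sketch_should_recover(int recovery_param, enum sketch_asic asic)
{
	if (recovery_param == 0)
		return false;

	if (recovery_param == -1) {
		switch (asic) {
		case SKETCH_CHIP_TAHITI:
		case SKETCH_CHIP_CARRIZO:
			return false;	/* explicitly opted out */
		default:
			break;		/* everything else defaults to on */
		}
	}
	return true;
}

/* Exercise the sketch; call this from a test program's main(). */
void sketch_recover_selftest(void)
{
	assert(sketch_should_recover(-1, SKETCH_CHIP_FUTURE));
	assert(!sketch_should_recover(-1, SKETCH_CHIP_CARRIZO));
}
```

With this shape, an ASIC that nobody has listed (SKETCH_CHIP_FUTURE here) gets recovery in auto mode without anyone touching the switch, which is the point made in the commit message.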
  




RE: [PATCH 1/2] drm/amdgpu: Add a filter condition to restrict the SW ras function to be registered only by asics whose hardware supports the ras function

2022-01-12 Thread Zhou1, Tao
[AMD Official Use Only]



> -----Original Message-----
> From: Chai, Thomas 
> Sent: Wednesday, January 12, 2022 3:48 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chai, Thomas ; Zhang, Hawking
> ; Zhou1, Tao ; Clements,
> John ; Chai, Thomas 
> Subject: [PATCH 1/2] drm/amdgpu: Add a filter condition to restrict the SW ras
> function to be registered only by asics whose hardware supports the ras 
> function

[Tao] The subject is too long; I think "add ras supported check for 
register_ras_block" is enough.

> 
> Add a filter condition to restrict the SW ras function to be registered only 
> by
> asics whose hardware supports the ras function.
> 
> Signed-off-by: yipechai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index b1bedfd4febc..62be0b4909b3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2754,7 +2754,7 @@ int amdgpu_ras_reset_gpu(struct amdgpu_device *adev)
>  int amdgpu_ras_register_ras_block(struct amdgpu_device *adev,
>   struct amdgpu_ras_block_object* ras_block_obj)
>  {
> - if (!adev || !ras_block_obj)
> + if (!adev || !amdgpu_ras_asic_supported(adev) || !ras_block_obj)
>   return -EINVAL;

[Tao] Can we return 0 if !amdgpu_ras_asic_supported(adev)? It's not an error.

> 
>   INIT_LIST_HEAD(&ras_block_obj->node);
> --
> 2.25.1


RE: [PATCH 2/2] drm/amdgpu: No longer insert ras blocks into ras_list if it already exists in ras_list

2022-01-12 Thread Zhou1, Tao
[AMD Official Use Only]



> -----Original Message-----
> From: Chai, Thomas 
> Sent: Wednesday, January 12, 2022 3:48 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chai, Thomas ; Zhang, Hawking
> ; Zhou1, Tao ; Clements,
> John ; Chai, Thomas 
> Subject: [PATCH 2/2] drm/amdgpu: No longer insert ras blocks into ras_list if 
> it
> already exists in ras_list
> 
> No longer insert ras blocks into ras_list if it already exists in ras_list.
> 
> Signed-off-by: yipechai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 62be0b4909b3..e6d3bb4b56e4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2754,9 +2754,17 @@ int amdgpu_ras_reset_gpu(struct amdgpu_device *adev)
>  int amdgpu_ras_register_ras_block(struct amdgpu_device *adev,
>   struct amdgpu_ras_block_object* ras_block_obj)
>  {
> + struct amdgpu_ras_block_object *obj, *tmp;
>   if (!adev || !amdgpu_ras_asic_supported(adev) || !ras_block_obj)
>   return -EINVAL;
> 
> + /* If the ras object had been in ras_list, doesn't add it to ras_list 
> again */
[Tao] How about "If the ras object is in ras_list, don't add it again"

> + list_for_each_entry_safe(obj, tmp, &adev->ras_list, node) {
> + if (obj == ras_block_obj) {
> + return 0;
> + }
> + }

[Tao] The patch is OK with me for now, but I think the root cause is that we 
initialize adev->gmc.xgmi.ras in gmc_ras_late_init. The initialization should 
be called only at the modprobe stage, and we can create a general 
gmc_early_init for it.

> +
>   INIT_LIST_HEAD(&ras_block_obj->node);
>   list_add_tail(&ras_block_obj->node, &adev->ras_list);
> 
> --
> 2.25.1
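
The behaviour discussed in the reviews of patches 1/2 and 2/2 can be sketched in plain C as follows. This is a simplified model, not the driver code: the sketch_ structures, the singly linked list, and the ras_supported flag (standing in for amdgpu_ras_asic_supported()) are assumptions made for illustration. It returns 0 rather than -EINVAL on unsupported hardware, as suggested in the 1/2 review, and skips objects that are already registered.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the amdgpu structures. */
struct sketch_ras_block {
	struct sketch_ras_block *next;
};

struct sketch_dev {
	bool ras_supported;		/* models amdgpu_ras_asic_supported() */
	struct sketch_ras_block *head;	/* models adev->ras_list */
};

int sketch_register_ras_block(struct sketch_dev *dev,
			      struct sketch_ras_block *blk)
{
	struct sketch_ras_block **pos;

	if (!dev || !blk)
		return -EINVAL;

	/* Unsupported hardware is not an error: there is nothing to do. */
	if (!dev->ras_supported)
		return 0;

	/* If the object is already in the list, don't add it again. */
	for (pos = &dev->head; *pos; pos = &(*pos)->next)
		if (*pos == blk)
			return 0;

	blk->next = NULL;
	*pos = blk;	/* pos now points at the tail link: append */
	return 0;
}

size_t sketch_ras_count(const struct sketch_dev *dev)
{
	const struct sketch_ras_block *b;
	size_t n = 0;

	for (b = dev->head; b; b = b->next)
		n++;
	return n;
}

/* Exercise the sketch; call this from a test program's main(). */
void sketch_ras_selftest(void)
{
	struct sketch_dev dev = { .ras_supported = true, .head = NULL };
	struct sketch_ras_block a = { NULL }, b = { NULL };

	assert(sketch_register_ras_block(&dev, &a) == 0);
	assert(sketch_register_ras_block(&dev, &b) == 0);
	assert(sketch_register_ras_block(&dev, &a) == 0);	/* duplicate */
	assert(sketch_ras_count(&dev) == 2);
}
```

A read-only walk is enough for the duplicate check because nothing is removed during the traversal; the _safe list iterator is only required when entries are deleted while iterating.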


RE: [PATCH Review 1/1] drm/amdgpu: handle denied inject error into critical regions

2022-01-12 Thread Zhou1, Tao
[AMD Official Use Only]



> -----Original Message-----
> From: Stanley.Yang 
> Sent: Wednesday, January 12, 2022 9:43 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Clements, John
> ; Zhou1, Tao ; Yang,
> Stanley 
> Subject: [PATCH Review 1/1] drm/amdgpu: handle denied inject error into
> critical regions
> 
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 10 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c |  2 +-
> drivers/gpu/drm/amd/amdgpu/ta_ras_if.h  |  3 ++-
>  3 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index c742d1aacf5a..8e0ea582b9c7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -1309,6 +1309,12 @@ static void psp_ras_ta_check_status(struct
> psp_context *psp)
>   break;
>   case TA_RAS_STATUS__SUCCESS:
>   break;
> + case TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED:
> + if (ras_cmd->cmd_id == TA_RAS_COMMAND__TRIGGER_ERROR) {
> + dev_info(psp->adev->dev,
[Tao] Is dev_warn better? But either way is OK for me.

> + "RAS INFO: Inject error to critical
> region is not allowed\n");
> + }
[Tao] The {} can be removed.

> + break;
>   default:
>   dev_warn(psp->adev->dev,
>   "RAS WARNING: ras status = 0x%X\n",
> ras_cmd->ras_status);
> @@ -1521,7 +1527,9 @@ int psp_ras_trigger_error(struct psp_context *psp,
>   if (amdgpu_ras_intr_triggered())
>   return 0;
> 
> - if (ras_cmd->ras_status)
> + if (ras_cmd->ras_status ==
> TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED)
> + return -EACCES;
> + else if (ras_cmd->ras_status)
>   return -EINVAL;
> 
>   return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index e674dbed3615..8bdc2e85cb20 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -449,7 +449,7 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file
> *f,
>   }
> 
>   if (ret)
> - return -EINVAL;
> + return ret;
> 
>   return size;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> b/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> index 5093826a43d1..509d8a1945eb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> +++ b/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> @@ -64,7 +64,8 @@ enum ta_ras_status {
>   TA_RAS_STATUS__ERROR_PCS_STATE_ERROR= 0xA016,
>   TA_RAS_STATUS__ERROR_PCS_STATE_HANG = 0xA017,
>   TA_RAS_STATUS__ERROR_PCS_STATE_UNKNOWN  = 0xA018,
> - TA_RAS_STATUS__ERROR_UNSUPPORTED_ERROR_INJ  = 0xA019
> + TA_RAS_STATUS__ERROR_UNSUPPORTED_ERROR_INJ  = 0xA019,
> + TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED  = 0xA01A
>  };
> 
>  enum ta_ras_block {
> --
> 2.17.1
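
The error propagation in this patch can be sketched outside the kernel as below. The two 0xA0xx values are taken from the ta_ras_if.h hunk; the zero success value, the sketch_ names, and the flattened call chain are assumptions made for illustration. The point is that the access-denied status maps to -EACCES and that the debugfs write path passes that code through instead of collapsing every failure into -EINVAL.

```c
#include <assert.h>
#include <errno.h>

enum sketch_ta_ras_status {
	SKETCH_STATUS_SUCCESS			= 0x0,	/* assumed value */
	SKETCH_STATUS_UNSUPPORTED_ERROR_INJ	= 0xA019,
	SKETCH_STATUS_TEE_ERROR_ACCESS_DENIED	= 0xA01A,
};

/* Models the tail of psp_ras_trigger_error(): status to errno. */
int sketch_trigger_error_result(unsigned int ras_status)
{
	if (ras_status == SKETCH_STATUS_TEE_ERROR_ACCESS_DENIED)
		return -EACCES;
	else if (ras_status)
		return -EINVAL;

	return 0;
}

/* Models the tail of the debugfs write: propagate ret, else report size. */
long sketch_ctrl_write(unsigned int ras_status, long size)
{
	int ret = sketch_trigger_error_result(ras_status);

	if (ret)
		return ret;	/* previously this was always -EINVAL */

	return size;
}

/* Exercise the sketch; call this from a test program's main(). */
void sketch_status_selftest(void)
{
	assert(sketch_ctrl_write(SKETCH_STATUS_TEE_ERROR_ACCESS_DENIED, 16) == -EACCES);
	assert(sketch_ctrl_write(SKETCH_STATUS_SUCCESS, 16) == 16);
}
```

A userspace tool writing to the control file could then tell a denied injection apart from a malformed request by checking errno.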


RE: [PATCH 1/2] drm/amdgpu: Add a filter condition to restrict the SW ras function to be registered only by asics whose hardware supports the ras function

2022-01-12 Thread Chai, Thomas



-----Original Message-----
From: Zhou1, Tao  
Sent: Wednesday, January 12, 2022 4:28 PM
To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Clements, John 

Subject: RE: [PATCH 1/2] drm/amdgpu: Add a filter condition to restrict the SW 
ras function to be registered only by asics whose hardware supports the ras 
function

[AMD Official Use Only]



> -----Original Message-----
> From: Chai, Thomas 
> Sent: Wednesday, January 12, 2022 3:48 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chai, Thomas ; Zhang, Hawking 
> ; Zhou1, Tao ; Clements, 
> John ; Chai, Thomas 
> Subject: [PATCH 1/2] drm/amdgpu: Add a filter condition to restrict 
> the SW ras function to be registered only by asics whose hardware 
> supports the ras function

>[Tao] The subject is too long, I think "add ras supported check for 
>register_ras_block" is enough.
[Thomas] Ok.

> 
> Add a filter condition to restrict the SW ras function to be 
> registered only by asics whose hardware supports the ras function.
> 
> Signed-off-by: yipechai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index b1bedfd4febc..62be0b4909b3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2754,7 +2754,7 @@ int amdgpu_ras_reset_gpu(struct amdgpu_device *adev)
>  int amdgpu_ras_register_ras_block(struct amdgpu_device *adev,
>   struct amdgpu_ras_block_object* ras_block_obj)
>  {
> - if (!adev || !ras_block_obj)
> + if (!adev || !amdgpu_ras_asic_supported(adev) || !ras_block_obj)
>   return -EINVAL;

>[Tao] Can we return 0 if !amdgpu_ras_asic_supported(adev)? It's not an error.
[Thomas] OK.

> 
>   INIT_LIST_HEAD(&ras_block_obj->node);
> --
> 2.25.1


RE: [PATCH 2/2] drm/amdgpu: No longer insert ras blocks into ras_list if it already exists in ras_list

2022-01-12 Thread Chai, Thomas



-----Original Message-----
From: Zhou1, Tao  
Sent: Wednesday, January 12, 2022 4:37 PM
To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Clements, John 

Subject: RE: [PATCH 2/2] drm/amdgpu: No longer insert ras blocks into ras_list 
if it already exists in ras_list

[AMD Official Use Only]



> -----Original Message-----
> From: Chai, Thomas 
> Sent: Wednesday, January 12, 2022 3:48 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chai, Thomas ; Zhang, Hawking 
> ; Zhou1, Tao ; Clements, 
> John ; Chai, Thomas 
> Subject: [PATCH 2/2] drm/amdgpu: No longer insert ras blocks into 
> ras_list if it already exists in ras_list
> 
> No longer insert ras blocks into ras_list if it already exists in ras_list.
> 
> Signed-off-by: yipechai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 62be0b4909b3..e6d3bb4b56e4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2754,9 +2754,17 @@ int amdgpu_ras_reset_gpu(struct amdgpu_device *adev)
>  int amdgpu_ras_register_ras_block(struct amdgpu_device *adev,
>   struct amdgpu_ras_block_object* ras_block_obj)
>  {
> + struct amdgpu_ras_block_object *obj, *tmp;
>   if (!adev || !amdgpu_ras_asic_supported(adev) || !ras_block_obj)
>   return -EINVAL;
> 
> + /* If the ras object had been in ras_list, doesn't add it to 
> +ras_list again */
>[Tao] How about "If the ras object is in ras_list, don't add it again"

[Thomas] OK

> + list_for_each_entry_safe(obj, tmp, &adev->ras_list, node) {
> + if (obj == ras_block_obj) {
> + return 0;
> + }
> + }

>[Tao] The patch is OK for me currently, but I think the root cause is we 
>initialize adev->gmc.xgmi.ras in gmc_ras_late_init, the initialization should 
>be called only in modprobe stage and we can create a general gmc_early_init 
>for it.

[Thomas] We can create a new task for that.

> +
>   INIT_LIST_HEAD(&ras_block_obj->node);
>   list_add_tail(&ras_block_obj->node, &adev->ras_list);
> 
> --
> 2.25.1


Re: [PATCH] Revert "i2c: core: support bus regulator controlling in adapter"

2022-01-12 Thread Wolfram Sang
Hi everyone,

On Thu, Jan 06, 2022 at 01:24:52PM +0100, Wolfram Sang wrote:
> This largely reverts commit 5a7b95fb993ec399c8a685552aa6a8fc995c40bd. It
> breaks suspend with AMD GPUs, and we couldn't incrementally fix it. So,
> let's remove the code and go back to the drawing board. We keep the
> header extension to not break drivers already populating the regulator.
> We expect to re-add the code handling it soon.
> 
> Reported-by: "Tareque Md.Hanif" 
> Link: https://lore.kernel.org/r/1295184560.182511.1639075777...@mail.yahoo.com
> Reported-by: Konstantin Kharlamov 
> Link: 
> https://lore.kernel.org/r/7143a7147978f4104171072d9f5225d2ce355ec1.ca...@yandex.ru
> BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1850
> Signed-off-by: Wolfram Sang 

So, it has been reverted now. Is any of the original patch
submitters interested in re-adding it? And would the reporters of the
regression be available for further testing?

Thanks and happy hacking,

   Wolfram





[PATCH V2 1/2] drm/amdgpu: Add ras supported check for register_ras_block

2022-01-12 Thread yipechai
Add ras supported check for register_ras_block.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index b1bedfd4febc..614ae8455c9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2757,6 +2757,9 @@ int amdgpu_ras_register_ras_block(struct amdgpu_device 
*adev,
if (!adev || !ras_block_obj)
return -EINVAL;
 
+   if (!amdgpu_ras_asic_supported(adev))
+   return 0;
+
INIT_LIST_HEAD(&ras_block_obj->node);
list_add_tail(&ras_block_obj->node, &adev->ras_list);
 
-- 
2.25.1



[PATCH V2 2/2] drm/amdgpu: No longer insert ras blocks into ras_list if it already exists in ras_list

2022-01-12 Thread yipechai
No longer insert ras blocks into ras_list if it already exists in ras_list.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 614ae8455c9f..d208fde509de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2754,12 +2754,20 @@ int amdgpu_ras_reset_gpu(struct amdgpu_device *adev)
 int amdgpu_ras_register_ras_block(struct amdgpu_device *adev,
struct amdgpu_ras_block_object* ras_block_obj)
 {
+   struct amdgpu_ras_block_object *obj, *tmp;
if (!adev || !ras_block_obj)
return -EINVAL;
 
if (!amdgpu_ras_asic_supported(adev))
return 0;
 
+   /* If the ras object is in ras_list, don't add it again */
+   list_for_each_entry_safe(obj, tmp, &adev->ras_list, node) {
+   if (obj == ras_block_obj) {
+   return 0;
+   }
+   }
+
INIT_LIST_HEAD(&ras_block_obj->node);
list_add_tail(&ras_block_obj->node, &adev->ras_list);
 
-- 
2.25.1



Re: [PATCH 3/3] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset

2022-01-12 Thread Sharma, Shashank




On 1/11/2022 12:26 PM, Christian König wrote:

On 11.01.22 at 08:12, Somalapuram Amaranath wrote:

AMDGPURESET uevent added to notify userspace,
collect dump_stack and amdgpu_reset_reg_dumps

Signed-off-by: Somalapuram Amaranath 
---
  drivers/gpu/drm/amd/amdgpu/nv.c | 31 +++
  1 file changed, 31 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 2ec1ffb36b1f..41a2c37e825f 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -529,10 +529,41 @@ nv_asic_reset_method(struct amdgpu_device *adev)
  }
  }
+/**
+ * drm_sysfs_reset_event - generate a DRM uevent
+ * @dev: DRM device
+ *
+ * Send a uevent for the DRM device specified by @dev.  Currently we only
+ * set AMDGPURESET=1 in the uevent environment, but this could be expanded to
+ * deal with other types of events.
+ *
+ * Any new uapi should be using the drm_sysfs_connector_status_event()
+ * for uevents on connector status change.
+ */
+void drm_sysfs_reset_event(struct drm_device *dev)
+{
+    char *event_string = "AMDGPURESET=1";
+    char *envp[2] = { event_string, NULL };
+
+    kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);


That won't work like this.

kobject_uevent_env() needs to allocate memory to send the event to 
userspace and that is not allowed while we do a reset. The Intel guys 
fell into the same trap already.


What we could maybe do is to teach kobject_uevent_env() gfp flags and 
make all allocations from the atomic pool.


Regards,
Christian.


Hi Amar,

I see another problem here,

We are sending the event at the GPU reset, but we are collecting the 
register values only when the corresponding userspace agent calls a 
read() on the respective sysfs entry.


There is a fair chance that the register values have been reset by 
the HW by then, and that we are reading re-programmed values. At the 
very least there will be a race.


I think we should change this design as follows:
1. Get into gpu_reset().
2. Collect the register values and save this context into a separate 
file/node. Sending a trace_event here would probably be the easiest way.
3. Send the drm event to the userspace client.
4. The client reads from the trace file and gets the data.

- Shashank




+}
+
+void amdgpu_reset_dumps(struct amdgpu_device *adev)
+{
+    struct drm_device *ddev = adev_to_drm(adev);
+    /* original raven doesn't have full asic reset */
+    if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
+    !(adev->apu_flags & AMD_APU_IS_RAVEN2))
+    return;
+    drm_sysfs_reset_event(ddev);
+    dump_stack();
+}
+
  static int nv_asic_reset(struct amdgpu_device *adev)
  {
  int ret = 0;
+    amdgpu_reset_dumps(adev);
  switch (nv_asic_reset_method(adev)) {
  case AMD_RESET_METHOD_PCI:
  dev_info(adev->dev, "PCI reset\n");
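
The ordering proposed in the reply above (capture first, notify second, read the saved copy last) can be modelled in userspace C. Everything here is hypothetical: the sketch_gpu structure, the four-register file, and the event_sent flag standing in for the uevent are assumptions made for illustration, and no kernel API is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NUM_REGS 4

struct sketch_gpu {
	uint32_t regs[SKETCH_NUM_REGS];		/* live registers */
	uint32_t snapshot[SKETCH_NUM_REGS];	/* copy taken at reset time */
	bool snapshot_valid;
	bool event_sent;			/* stands in for the uevent */
};

void sketch_gpu_reset(struct sketch_gpu *gpu)
{
	/* Step 2: capture the state before the hardware loses it. */
	memcpy(gpu->snapshot, gpu->regs, sizeof(gpu->snapshot));
	gpu->snapshot_valid = true;

	/* The reset itself reprograms the live registers. */
	memset(gpu->regs, 0, sizeof(gpu->regs));

	/* Step 3: only now notify the client. */
	gpu->event_sent = true;
}

/* Step 4: the client reads the saved copy, never the live registers. */
bool sketch_read_dump(const struct sketch_gpu *gpu,
		      uint32_t out[SKETCH_NUM_REGS])
{
	if (!gpu->snapshot_valid)
		return false;

	memcpy(out, gpu->snapshot, sizeof(gpu->snapshot));
	return true;
}

/* Exercise the sketch; call this from a test program's main(). */
void sketch_reset_selftest(void)
{
	struct sketch_gpu gpu = { .regs = { 0x11, 0x22, 0x33, 0x44 } };
	uint32_t dump[SKETCH_NUM_REGS];

	assert(!sketch_read_dump(&gpu, dump));	/* nothing captured yet */
	sketch_gpu_reset(&gpu);
	assert(gpu.event_sent);
	assert(gpu.regs[0] == 0);		/* live value is gone */
	assert(sketch_read_dump(&gpu, dump));
	assert(dump[0] == 0x11 && dump[3] == 0x44);
}
```

Because the reader only ever sees the snapshot, it does not matter how long the client takes to react to the event, which removes the race described above.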




Re: [PATCH v3 00/10] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

2022-01-12 Thread David Hildenbrand
On 10.01.22 23:31, Alex Sierra wrote:
> This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
> owned by a device that can be mapped into CPU page tables like
> MEMORY_DEVICE_GENERIC and can also be migrated like
> MEMORY_DEVICE_PRIVATE.
> 
> Christoph, the suggestion to incorporate Ralph Campbell’s refcount
> cleanup patch into our hardware page migration patchset originally came
> from you, but it proved impractical to do things in that order because
> the refcount cleanup introduced a bug with wide ranging structural
> implications. Instead, we amended Ralph’s patch so that it could be
> applied after merging the migration work. As we saw from the recent
> discussion, merging the refcount work is going to take some time and
> cooperation between multiple development groups, while the migration
> work is ready now and is needed now. So we propose to merge this
> patchset first and continue to work with Ralph and others to merge the
> refcount cleanup separately, when it is ready.
> 
> This patch series is mostly self-contained except for a few places where
> it needs to update other subsystems to handle the new memory type.
> System stability and performance are not affected according to our
> ongoing testing, including xfstests.
> 
> How it works: The system BIOS advertises the GPU device memory
> (aka VRAM) as SPM (special purpose memory) in the UEFI system address
> map.
> 
> The amdgpu driver registers the memory with devmap as
> MEMORY_DEVICE_COHERENT using devm_memremap_pages. The initial user for
> this hardware page migration capability is the Frontier supercomputer
> project. This functionality is not AMD-specific. We expect other GPU
> vendors to find this functionality useful, and possibly other hardware
> types in the future.
> 
> Our test nodes in the lab are similar to the Frontier configuration,
> with .5 TB of system memory plus 256 GB of device memory split across
> 4 GPUs, all in a single coherent address space. Page migration is
> expected to improve application efficiency significantly. We will
> report empirical results as they become available.

Hi,

might be a dumb question because I'm not too familiar with
MEMORY_DEVICE_COHERENT, but who's in charge of migrating *to* that
memory? Or how does a process ever get hold of such pages?

And where does migration come into play? I assume migration is only
required to migrate off of that device memory to ordinary system RAM
when required because the device memory has to be freed up, correct?

(a high-level description of how this is exploited from user space
would be great)

-- 
Thanks,

David / dhildenb



Re: [PATCH v3 00/10] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

2022-01-12 Thread Alistair Popple
I have been looking at this in relation to the migration code and noticed we
have the following in try_to_migrate():

if (is_zone_device_page(page) && !is_device_private_page(page))
return;

Which, if I'm understanding correctly, means that migration of device coherent
pages will always fail. Given that, I do wonder how hmm-tests are passing, but
I assume you must always be hitting this fast path in
migrate_vma_collect_pmd():

/*
 * Optimize for the common case where page is only mapped once
 * in one process. If we can lock the page, then we can safely
 * set up a special migration page table entry now.
 */

Meaning that try_to_migrate() never gets called from migrate_vma_unmap(). So
you will also need some changes to try_to_migrate() and possibly
try_to_migrate_one() to make this reliable.

 - Alistair
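
To make the concern concrete, here is a small userspace model of the quoted early-return check. The enum, the sketch_ helpers, and in particular the relaxed variant are assumptions made for illustration; the thread only says that some change to try_to_migrate() is needed, not what form it should take.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page classification; not the kernel's representation. */
enum sketch_page_kind {
	SKETCH_PAGE_SYSTEM,
	SKETCH_PAGE_DEVICE_PRIVATE,
	SKETCH_PAGE_DEVICE_COHERENT,
	SKETCH_PAGE_DEVICE_GENERIC,
};

/* Assumes every device-owned kind above is a ZONE_DEVICE page. */
bool sketch_is_zone_device(enum sketch_page_kind kind)
{
	return kind != SKETCH_PAGE_SYSTEM;
}

bool sketch_is_device_private(enum sketch_page_kind kind)
{
	return kind == SKETCH_PAGE_DEVICE_PRIVATE;
}

/* Mirrors the quoted check: true means the function returns early. */
bool sketch_bails_out_today(enum sketch_page_kind kind)
{
	return sketch_is_zone_device(kind) && !sketch_is_device_private(kind);
}

/* One possible relaxed check that also lets coherent pages through. */
bool sketch_bails_out_relaxed(enum sketch_page_kind kind)
{
	return sketch_is_zone_device(kind) &&
	       !sketch_is_device_private(kind) &&
	       kind != SKETCH_PAGE_DEVICE_COHERENT;
}

/* Exercise the sketch; call this from a test program's main(). */
void sketch_migrate_selftest(void)
{
	/* With the quoted check, coherent pages never get past the guard. */
	assert(sketch_bails_out_today(SKETCH_PAGE_DEVICE_COHERENT));
	assert(!sketch_bails_out_relaxed(SKETCH_PAGE_DEVICE_COHERENT));
}
```

In this model the existing guard rejects coherent pages exactly as it rejects generic device pages, so any path that actually reaches it would fail for the new memory type.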

On Tuesday, 11 January 2022 9:31:51 AM AEDT Alex Sierra wrote:
> This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
> owned by a device that can be mapped into CPU page tables like
> MEMORY_DEVICE_GENERIC and can also be migrated like
> MEMORY_DEVICE_PRIVATE.
> 
> Christoph, the suggestion to incorporate Ralph Campbell’s refcount
> cleanup patch into our hardware page migration patchset originally came
> from you, but it proved impractical to do things in that order because
> the refcount cleanup introduced a bug with wide ranging structural
> implications. Instead, we amended Ralph’s patch so that it could be
> applied after merging the migration work. As we saw from the recent
> discussion, merging the refcount work is going to take some time and
> cooperation between multiple development groups, while the migration
> work is ready now and is needed now. So we propose to merge this
> patchset first and continue to work with Ralph and others to merge the
> refcount cleanup separately, when it is ready.
> 
> This patch series is mostly self-contained except for a few places where
> it needs to update other subsystems to handle the new memory type.
> System stability and performance are not affected according to our
> ongoing testing, including xfstests.
> 
> How it works: The system BIOS advertises the GPU device memory
> (aka VRAM) as SPM (special purpose memory) in the UEFI system address
> map.
> 
> The amdgpu driver registers the memory with devmap as
> MEMORY_DEVICE_COHERENT using devm_memremap_pages. The initial user for
> this hardware page migration capability is the Frontier supercomputer
> project. This functionality is not AMD-specific. We expect other GPU
> vendors to find this functionality useful, and possibly other hardware
> types in the future.
> 
> Our test nodes in the lab are similar to the Frontier configuration,
> with .5 TB of system memory plus 256 GB of device memory split across
> 4 GPUs, all in a single coherent address space. Page migration is
> expected to improve application efficiency significantly. We will
> report empirical results as they become available.
> 
> We extended hmm_test to cover migration of MEMORY_DEVICE_COHERENT. This
> patch set builds on HMM and our SVM memory manager already merged in
> 5.15.
> 
> v2:
> - test_hmm is now able to create private and coherent device mirror
> instances in the same driver probe. This adds more usability to the hmm
> test by not having to remove the kernel module for each device type
> test (private/coherent type). This is done by passing the module
> parameters spm_addr_dev0 & spm_addr_dev1. In this case, it will create
> four instances of device_mirror. The first two correspond to private
> device type, the last two to coherent type. Then, they can be easily
> accessed from user space through /dev/hmm_mirror. Usually
> num_device 0 and 1 are for private, and 2 and 3 for coherent types.
> 
> - Coherent device type pages at gup are now migrated back to system
> memory if they have been long term pinned (FOLL_LONGTERM). The reason
> is these pages could eventually interfere with their own device memory
> manager. A new hmm_gup_test has been added to the hmm-test to test this
> functionality. It makes use of the gup_test module to long term pin
> user pages that have been migrated to device memory first.
> 
> - Other patch corrections made by Felix, Alistair and Christoph.
> 
> v3:
> - Based on last v2 feedback we got from Alistair, we've decided to
> remove migration logic for FOLL_LONGTERM coherent device type pages at
> gup for now. Ideally, this should be done through the kernel mm,
> instead of calling the device driver to do it. Currently, there's no
> support for migrating device pages based on pfn, mainly because
> migrate_pages() relies on pages being LRU pages. Alistair mentioned, he
> has started to work on adding this migrate device pages logic. For now,
> we fail on get_user_pages call with FOLL_LONGTERM for DEVICE_COHERENT
> pages.
> 
> - Also, hmm_gup_test has been removed from hmm-test. We plan to

Re: [PATCH] Revert "i2c: core: support bus regulator controlling in adapter"

2022-01-12 Thread Hsin-Yi Wang
On Wed, Jan 12, 2022 at 6:58 PM Hsin-Yi Wang  wrote:
>
> hi Konstantin and Tareque,
>
> Can you help provide logs if we apply
> 5a7b95fb993ec399c8a685552aa6a8fc995c40bd but revert
> 8d35a2596164c1c9d34d4656fd42b445cd1e247f?
>
Another thing that might be helpful to test:

After applying 5a7b95fb993ec399c8a685552aa6a8fc995c40bd:
1. Delete SET_LATE_SYSTEM_SLEEP_PM_OPS(i2c_suspend_late,
i2c_resume_early) and the functions i2c_suspend_late() and
i2c_resume_early().
2. Delete SET_RUNTIME_PM_OPS(i2c_runtime_suspend, i2c_runtime_resume,
NULL) and the functions i2c_runtime_suspend() and i2c_runtime_resume().

Does it still fail if we do 1 or 2?

Sorry, we don't have a platform with an Intel CPU and AMD GPU
combination to test with.


> Thanks
>
> On Wed, Jan 12, 2022 at 6:02 PM Tareque Md Hanif
>  wrote:
> >
> >
> > On 1/12/22 15:51, Wolfram Sang wrote:
> > > would the reporters of the
> > > regression be available for further testing?
> > Sure. I am available.


[PATCH] drm/amd/display: Force link_rate as LINK_RATE_RBR2 for 2018 15" Apple Retina panels

2022-01-12 Thread Aditya Garg
From: Aun-Ali Zaidi 
 
The eDP link rate reported by the DP_MAX_LINK_RATE dpcd register (0xa) is
contradictory to the highest rate supported reported by
EDID (0xc = LINK_RATE_RBR2). The effects of this compounded with commit
'4a8ca46bae8a ("drm/amd/display: Default max bpc to 16 for eDP")' results
in no display modes being found and a dark panel.

For now, simply force the maximum supported link rate for the eDP attached
2018 15" Apple Retina panels.

Additionally, we must also check the firmware revision since the device ID
reported by the DPCD is identical to that of the more capable 16,1,
incorrectly quirking it. We also use said firmware check to quirk the
refreshed 15,1 models with Vega graphics as they use a slightly newer
firmware version.

Tested-by: Aun-Ali Zaidi 
Signed-off-by: Aun-Ali Zaidi 
Signed-off-by: Aditya Garg 
---
 .../gpu/drm/amd/display/dc/core/dc_link_dp.c  | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c 
b/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c
index 05e216524..17939ad17 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c
@@ -5597,6 +5597,25 @@ static bool retrieve_link_cap(struct dc_link *link)
dp_hw_fw_revision.ieee_fw_rev,
sizeof(dp_hw_fw_revision.ieee_fw_rev));
 
+   /* Quirk for Apple MBP 2018 15" Retina panels: wrong DP_MAX_LINK_RATE */
+   {
+   uint8_t str_mbp_2018[] = { 101, 68, 21, 103, 98, 97 };
+   uint8_t fwrev_mbp_2018[] = { 7, 4 };
+   uint8_t fwrev_mbp_2018_vega[] = { 8, 4 };
+
+   // We also check for the firmware revision as 16,1 models have an
+   // identical device id and are incorrectly quirked otherwise.
+   if ((link->dpcd_caps.sink_dev_id == 0x0010fa) &&
+   !memcmp(link->dpcd_caps.sink_dev_id_str, str_mbp_2018,
+    sizeof(str_mbp_2018)) &&
+   (!memcmp(link->dpcd_caps.sink_fw_revision, fwrev_mbp_2018,
+    sizeof(fwrev_mbp_2018)) ||
+   !memcmp(link->dpcd_caps.sink_fw_revision, fwrev_mbp_2018_vega,
+    sizeof(fwrev_mbp_2018_vega)))) {
+   link->reported_link_cap.link_rate = LINK_RATE_RBR2;
+   }
+   }
+
memset(&link->dpcd_caps.dsc_caps, '\0',
sizeof(link->dpcd_caps.dsc_caps));
memset(&link->dpcd_caps.fec_cap, '\0', sizeof(link->dpcd_caps.fec_cap));
-- 
2.25.1
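
The matching condition of the quirk can be read more easily as a standalone predicate. The sketch below uses the device ID, ID string bytes, and firmware revisions from the patch; the sketch_dpcd_caps structure and its field sizes are assumptions made for illustration and are not the layout of the driver's dpcd_caps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical subset of the sink capabilities the quirk looks at. */
struct sketch_dpcd_caps {
	uint32_t sink_dev_id;
	uint8_t sink_dev_id_str[6];
	uint8_t sink_fw_revision[2];
};

/* True when the panel matches the 2018 15" MBP signature in the patch. */
bool sketch_is_mbp_2018_panel(const struct sketch_dpcd_caps *caps)
{
	static const uint8_t id_str[] = { 101, 68, 21, 103, 98, 97 };
	static const uint8_t fwrev[] = { 7, 4 };
	static const uint8_t fwrev_vega[] = { 8, 4 };

	if (caps->sink_dev_id != 0x0010fa)
		return false;
	if (memcmp(caps->sink_dev_id_str, id_str, sizeof(id_str)))
		return false;

	/* The firmware revision separates these panels from other models
	 * that report the same device ID. */
	return !memcmp(caps->sink_fw_revision, fwrev, sizeof(fwrev)) ||
	       !memcmp(caps->sink_fw_revision, fwrev_vega, sizeof(fwrev_vega));
}

/* Exercise the sketch; call this from a test program's main(). */
void sketch_quirk_selftest(void)
{
	struct sketch_dpcd_caps caps = {
		.sink_dev_id = 0x0010fa,
		.sink_dev_id_str = { 101, 68, 21, 103, 98, 97 },
		.sink_fw_revision = { 7, 4 },
	};

	assert(sketch_is_mbp_2018_panel(&caps));

	/* Same ID and string, different firmware: must not be quirked. */
	caps.sink_fw_revision[0] = 9;
	assert(!sketch_is_mbp_2018_panel(&caps));
}
```

A caller would apply the link-rate override only when the predicate returns true, which keeps the three-part match in one place.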




Re: [PATCH] Revert "i2c: core: support bus regulator controlling in adapter"

2022-01-12 Thread Konstantin Kharlamov
On Wed, 2022-01-12 at 10:32 +0100, Wolfram Sang wrote:
> Hi everyone,
> 
> On Thu, Jan 06, 2022 at 01:24:52PM +0100, Wolfram Sang wrote:
> > This largely reverts commit 5a7b95fb993ec399c8a685552aa6a8fc995c40bd. It
> > breaks suspend with AMD GPUs, and we couldn't incrementally fix it. So,
> > let's remove the code and go back to the drawing board. We keep the
> > header extension to not break drivers already populating the regulator.
> > We expect to re-add the code handling it soon.
> > 
> > Reported-by: "Tareque Md.Hanif" 
> > Link:
> > https://lore.kernel.org/r/1295184560.182511.1639075777...@mail.yahoo.com
> > Reported-by: Konstantin Kharlamov 
> > Link:
> > https://lore.kernel.org/r/7143a7147978f4104171072d9f5225d2ce355ec1.ca...@yandex.ru
> > BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1850
> > Signed-off-by: Wolfram Sang 
> 
> So, it has been reverted now. Is someone of the original patch
> submitters interested in re-adding it? And would the reporters of the
> regression be available for further testing?

I am available for further testing.

> Thanks and happy hacking,
> 
>    Wolfram
> 



Re: [PATCH] Revert "i2c: core: support bus regulator controlling in adapter"

2022-01-12 Thread Hsin-Yi Wang
hi Konstantin and Tareque,

Can you help provide logs if we apply
5a7b95fb993ec399c8a685552aa6a8fc995c40bd but revert
8d35a2596164c1c9d34d4656fd42b445cd1e247f?

Thanks

On Wed, Jan 12, 2022 at 6:02 PM Tareque Md Hanif
 wrote:
>
>
> On 1/12/22 15:51, Wolfram Sang wrote:
> > would the reporters of the
> > regression be available for further testing?
> Sure. I am available.


Re: [PATCH V2 1/2] drm/amdgpu: Add ras supported check for register_ras_block

2022-01-12 Thread Lazar, Lijo




On 1/12/2022 4:08 PM, yipechai wrote:

Add ras supported check for register_ras_block.

Signed-off-by: yipechai 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index b1bedfd4febc..614ae8455c9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2757,6 +2757,9 @@ int amdgpu_ras_register_ras_block(struct amdgpu_device *adev,
 	if (!adev || !ras_block_obj)
 		return -EINVAL;
 
+	if (!amdgpu_ras_asic_supported(adev))
+		return 0;
+


Why do this check here? It can be done earlier, and the IP's ras
block can be set to NULL so that this function is not called at all.


Thanks,
Lijo


INIT_LIST_HEAD(&ras_block_obj->node);
list_add_tail(&ras_block_obj->node, &adev->ras_list);
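
The suggestion above can be sketched as follows. This is only an
illustration under stated assumptions: the structures and function names
(struct device_model, gfx_early_init, gfx_sw_init) are hypothetical
stand-ins, not the real amdgpu types, and the ras_supported flag stands in
for whatever amdgpu_ras_asic_supported() reports.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures; not the real amdgpu types. */
struct ras_block {
	const char *name;
};

struct device_model {
	bool ras_supported;        /* stands in for amdgpu_ras_asic_supported() */
	struct ras_block *gfx_ras; /* per-IP RAS block pointer */
	int registered;            /* number of blocks put on the RAS list */
};

/* Registration keeps only its argument validation. */
static int register_ras_block(struct device_model *dev, struct ras_block *blk)
{
	if (!dev || !blk)
		return -1; /* -EINVAL in the kernel */
	dev->registered++;
	return 0;
}

/* IP init decides once whether a RAS block exists at all. */
static void gfx_early_init(struct device_model *dev, struct ras_block *blk)
{
	dev->gfx_ras = dev->ras_supported ? blk : NULL;
}

/* Callers skip registration when the IP has no RAS block. */
static int gfx_sw_init(struct device_model *dev)
{
	if (!dev->gfx_ras)
		return 0;
	return register_ras_block(dev, dev->gfx_ras);
}
```

With this split, the support check lives in one place per IP, and the
registration helper is never reached on hardware without RAS.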
  



RE: [PATCH] drm/amdgpu: improve debug VRAM access performance using sdma

2022-01-12 Thread Kim, Jonathan
[Public]

Thanks Christian.  I've already merged based on Felix's review.
I'll send your suggested cleanup for review out soon.

Jon

> -----Original Message-----
> From: Koenig, Christian 
> Sent: January 12, 2022 2:33 AM
> To: Kim, Jonathan ; amd-
> g...@lists.freedesktop.org
> Cc: Kuehling, Felix 
> Subject: Re: [PATCH] drm/amdgpu: improve debug VRAM access
> performance using sdma
>
> Am 04.01.22 um 20:12 schrieb Jonathan Kim:
> > For better performance during VRAM access for debugged processes, do
> > read/write copies over SDMA.
> >
> > To support post-mortem debugging on a broken device, fall back
> > to stable MMIO access when GPU recovery is disabled or when job
> > submission timeouts are set to max.  Failed SDMA access should
> > automatically fall back to MMIO access.
> >
> > Use a pre-allocated GTT bounce buffer pre-mapped into GART to avoid
> > page-table updates and TLB flushes on access.
> >
> > Signed-off-by: Jonathan Kim 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 78
> +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h |  5 +-
> >   2 files changed, 82 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > index 367abed1d6e6..512df4c09772 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > @@ -48,6 +48,7 @@
> >   #include 
> >
> >   #include 
> > +#include 
> >
> >   #include "amdgpu.h"
> >   #include "amdgpu_object.h"
> > @@ -1429,6 +1430,70 @@ static void
> amdgpu_ttm_vram_mm_access(struct amdgpu_device *adev, loff_t pos,
> > }
> >   }
> >
> > +static int amdgpu_ttm_access_memory_sdma(struct ttm_buffer_object
> *bo,
> > +   unsigned long offset, void *buf, int
> len, int write) {
> > +   struct amdgpu_bo *abo = ttm_to_amdgpu_bo(bo);
> > +   struct amdgpu_device *adev = amdgpu_ttm_adev(abo->tbo.bdev);
> > +   struct amdgpu_job *job;
> > +   struct dma_fence *fence;
> > +   uint64_t src_addr, dst_addr;
> > +   unsigned int num_dw;
> > +   int r, idx;
> > +
> > +   if (len != PAGE_SIZE)
> > +   return -EINVAL;
> > +
> > +   if (!adev->mman.sdma_access_ptr)
> > +   return -EACCES;
> > +
> > +   r = drm_dev_enter(adev_to_drm(adev), &idx);
> > +   if (r)
> > +   return r;
> > +
> > +   if (write)
> > +   memcpy(adev->mman.sdma_access_ptr, buf, len);
> > +
> > +   num_dw = ALIGN(adev->mman.buffer_funcs->copy_num_dw, 8);
> > +   r = amdgpu_job_alloc_with_ib(adev, num_dw * 4,
> AMDGPU_IB_POOL_DELAYED, &job);
> > +   if (r)
> > +   goto out;
> > +
> > +   src_addr = write ? amdgpu_bo_gpu_offset(adev-
> >mman.sdma_access_bo) :
> > +   amdgpu_bo_gpu_offset(abo);
> > +   dst_addr = write ? amdgpu_bo_gpu_offset(abo) :
> > +   amdgpu_bo_gpu_offset(adev-
> >mman.sdma_access_bo);
>
> I suggest writing this as
>
> src_addr = a;
> dst_addr = b;
> if (write)
>  swap(src_addr, dst_addr);
>
> This way we are not duplicating getting the different offsets.
>
> > +   amdgpu_emit_copy_buffer(adev, &job->ibs[0], src_addr, dst_addr,
> > +PAGE_SIZE, false);
> > +
> > +   amdgpu_ring_pad_ib(adev->mman.buffer_funcs_ring, &job-
> >ibs[0]);
> > +   WARN_ON(job->ibs[0].length_dw > num_dw);
> > +
> > +   r = amdgpu_job_submit(job, &adev->mman.entity,
> AMDGPU_FENCE_OWNER_UNDEFINED, &fence);
> > +   if (r) {
> > +   amdgpu_job_free(job);
> > +   goto out;
> > +   }
> > +
> > +   if (!dma_fence_wait_timeout(fence, false, adev->sdma_timeout))
> > +   r = -ETIMEDOUT;
> > +   dma_fence_put(fence);
> > +
> > +   if (!(r || write))
> > +   memcpy(buf, adev->mman.sdma_access_ptr, len);
> > +out:
> > +   drm_dev_exit(idx);
> > +   return r;
> > +}
> > +
> > +static inline bool amdgpu_ttm_allow_post_mortem_debug(struct
> > +amdgpu_device *adev) {
> > +   return amdgpu_gpu_recovery == 0 ||
> > +   adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
> > +   adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
> > +   adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT ||
> > +   adev->video_timeout == MAX_SCHEDULE_TIMEOUT; }
>
> This should probably be inside amdgpu_device.c
>
> > +
> >   /**
> >* amdgpu_ttm_access_memory - Read or Write memory that backs a
> buffer object.
> >*
> > @@ -1453,6 +1518,10 @@ static int amdgpu_ttm_access_memory(struct
> ttm_buffer_object *bo,
> > if (bo->resource->mem_type != TTM_PL_VRAM)
> > return -EIO;
> >
> > +   if (!amdgpu_ttm_allow_post_mortem_debug(adev) &&
> > +   !amdgpu_ttm_access_memory_sdma(bo, offset,
> buf, len, write))
> > +   return len;
> > +
> > amdgpu_res_first(bo->resource, offset, len, &cursor);
> > while (cursor.remaining) {
> > size_t count, size = cursor.size;
> > @@ -1793,6 +1862,12 @@ int amdgpu_ttm_init(struct amdgpu_device
> *adev)
> > return r;
> > }
> >

[PATCH 0/2] Fixing bad merge in OTG synchronization logic

2022-01-12 Thread Harry Wentland
A bad merge of
1abaa75bae9e ("drm/amd/display: Fix for otg synchronization logic")
caused Linus to see a lot of underflow on his two 4k displays.

This set pulls his revert and fixes up the original patch.

Linus Torvalds (1):
  Revert "drm/amd/display: Fix for otg synchronization logic"

Meenakshikumar Somasundaram (1):
  drm/amd/display: Fix for otg synchronization logic

 drivers/gpu/drm/amd/display/dc/core/dc.c  | 15 +++
 drivers/gpu/drm/amd/display/dc/inc/resource.h |  1 +
 2 files changed, 12 insertions(+), 4 deletions(-)

--
2.34.1



[PATCH 1/2] Revert "drm/amd/display: Fix for otg synchronization logic"

2022-01-12 Thread Harry Wentland
From: Linus Torvalds 

This reverts commit a896f870f8a5f23ec961d16baffd3fda1f8be57c.

It causes odd flickering on my Radeon RX580 (PCI ID 1002:67df rev e7,
subsystem ID 1da2:e353).

Bisected right to this commit, and reverting it fixes things.

Link: 
https://lore.kernel.org/all/CAHk-=wg9hDde_L3bK9tAfdJ4N=TJJ+SjO3ZDONqH5=bvoy_...@mail.gmail.com/
Cc: Alex Deucher 
Cc: Daniel Vetter 
Cc: Harry Wentland 
Cc: Dave Airlie 
Cc: Christian Koenig 
Cc: Jun Lei 
Cc: Mustapha Ghaddar 
Cc: Bhawanpreet Lakha 
Cc: meenakshikumar somasundaram 
Cc: Daniel Wheeler 
Signed-off-by: Linus Torvalds 
---
 drivers/gpu/drm/amd/display/dc/core/dc.c  | 35 +---
 .../gpu/drm/amd/display/dc/core/dc_resource.c | 54 ---
 drivers/gpu/drm/amd/display/dc/dc.h   |  1 -
 .../display/dc/dce110/dce110_hw_sequencer.c   |  8 ---
 .../drm/amd/display/dc/dcn31/dcn31_resource.c |  3 --
 .../gpu/drm/amd/display/dc/inc/core_types.h   |  1 -
 drivers/gpu/drm/amd/display/dc/inc/resource.h | 10 
 7 files changed, 14 insertions(+), 98 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c 
b/drivers/gpu/drm/amd/display/dc/core/dc.c
index 91c4874473d6..01c8849b9db2 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
@@ -1404,29 +1404,22 @@ static void program_timing_sync(
status->timing_sync_info.master = false;
 
}
+   /* remove any other unblanked pipes as they have already been 
synced */
+   for (j = j + 1; j < group_size; j++) {
+   bool is_blanked;
 
-   /* remove any other pipes that are already been synced */
-   if (dc->config.use_pipe_ctx_sync_logic) {
-   /* check pipe's syncd to decide which pipe to be 
removed */
-   for (j = 1; j < group_size; j++) {
-   if (pipe_set[j]->pipe_idx_syncd == 
pipe_set[0]->pipe_idx_syncd) {
-   group_size--;
-   pipe_set[j] = pipe_set[group_size];
-   j--;
-   } else
-   /* link slave pipe's syncd with master 
pipe */
-   pipe_set[j]->pipe_idx_syncd = 
pipe_set[0]->pipe_idx_syncd;
+   if (pipe_set[j]->stream_res.opp->funcs->dpg_is_blanked)
+   is_blanked =
+   
pipe_set[j]->stream_res.opp->funcs->dpg_is_blanked(pipe_set[j]->stream_res.opp);
+   else
+   is_blanked =
+   
pipe_set[j]->stream_res.tg->funcs->is_blanked(pipe_set[j]->stream_res.tg);
+   if (!is_blanked) {
+   group_size--;
+   pipe_set[j] = pipe_set[group_size];
+   j--;
}
-   } else {
-   /* remove any other pipes by checking valid plane */
-   for (j = j + 1; j < group_size; j++) {
-   if (pipe_set[j]->plane_state) {
-   group_size--;
-   pipe_set[j] = pipe_set[group_size];
-   j--;
-   }
-   }
-   }
+   }
 
if (group_size > 1) {
if (sync_type == TIMING_SYNCHRONIZABLE) {
diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_resource.c 
b/drivers/gpu/drm/amd/display/dc/core/dc_resource.c
index b3912ff9dc91..d4ff6cc6b8d9 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc_resource.c
@@ -3217,60 +3217,6 @@ struct hpo_dp_link_encoder 
*resource_get_hpo_dp_link_enc_for_det_lt(
 }
 #endif
 
-void reset_syncd_pipes_from_disabled_pipes(struct dc *dc,
-   struct dc_state *context)
-{
-   int i, j;
-   struct pipe_ctx *pipe_ctx_old, *pipe_ctx, *pipe_ctx_syncd;
-
-   /* If pipe backend is reset, need to reset pipe syncd status */
-   for (i = 0; i < dc->res_pool->pipe_count; i++) {
-   pipe_ctx_old =  &dc->current_state->res_ctx.pipe_ctx[i];
-   pipe_ctx = &context->res_ctx.pipe_ctx[i];
-
-   if (!pipe_ctx_old->stream)
-   continue;
-
-   if (pipe_ctx_old->top_pipe || pipe_ctx_old->prev_odm_pipe)
-   continue;
-
-   if (!pipe_ctx->stream ||
-   pipe_need_reprogram(pipe_ctx_old, pipe_ctx)) {
-
-   /* Reset all the syncd pipes from the disabled pipe */
-   for (j = 0; j < dc->res_pool->pipe_count; j++) {
-   pipe_ctx_sy

[PATCH 2/2] drm/amd/display: Fix for otg synchronization logic

2022-01-12 Thread Harry Wentland
From: Meenakshikumar Somasundaram 

[Why]
During otg sync trigger, plane states are used to decide whether the otg
is already synchronized or not. There are scenarios where otgs are
disabled without the plane state getting disabled, and in such cases the
otg is excluded from synchronization.

[How]
Introduce pipe_idx_syncd in pipe_ctx, which tracks each otg's master pipe.
When an otg is disabled/enabled, pipe_idx_syncd is reset to itself.
On sync trigger, pipe_idx_syncd is checked to decide whether an otg is
already synchronized, and the otg is then included in or excluded from
synchronization accordingly.
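
As a rough illustration of the filtering step described above (a sketch
only: struct pipe_model and filter_synced are hypothetical names, and the
structure is reduced to the single field this patch adds), the loop drops
pipes that already share the master's pipe_idx_syncd by overwriting them
with the last entry of the group, and links the remaining pipes to the
master:

```c
#include <stddef.h>

/* Hypothetical, reduced version of pipe_ctx: only the field the patch adds. */
struct pipe_model {
	int pipe_idx_syncd; /* index of the master pipe this pipe is synced to */
};

/*
 * Drop every pipe already synced to pipe_set[0]'s master and link the
 * remaining ones to it. Returns the new group size.
 */
static int filter_synced(struct pipe_model **pipe_set, int group_size)
{
	int j;

	for (j = 1; j < group_size; j++) {
		if (pipe_set[j]->pipe_idx_syncd == pipe_set[0]->pipe_idx_syncd) {
			/* Swap-remove: overwrite with the last entry, then re-check slot j. */
			group_size--;
			pipe_set[j] = pipe_set[group_size];
			j--;
		} else {
			/* Not yet synced: record the master it is about to be synced to. */
			pipe_set[j]->pipe_idx_syncd = pipe_set[0]->pipe_idx_syncd;
		}
	}
	return group_size;
}
```

The j-- after the swap matters: the entry moved into slot j has not been
examined yet, so the same slot has to be checked again.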

v2:
  Don't drop is_blanked logic

Reviewed-by: Jun Lei 
Reviewed-by: Mustapha Ghaddar 
Acked-by: Bhawanpreet Lakha 
Signed-off-by: meenakshikumar somasundaram 
Tested-by: Daniel Wheeler 
Signed-off-by: Alex Deucher 
Signed-off-by: Harry Wentland 
Cc: torva...@linux-foundation.org
---
 drivers/gpu/drm/amd/display/dc/core/dc.c  | 40 +-
 .../gpu/drm/amd/display/dc/core/dc_resource.c | 54 +++
 drivers/gpu/drm/amd/display/dc/dc.h   |  1 +
 .../display/dc/dce110/dce110_hw_sequencer.c   |  8 +++
 .../drm/amd/display/dc/dcn31/dcn31_resource.c |  3 ++
 .../gpu/drm/amd/display/dc/inc/core_types.h   |  1 +
 drivers/gpu/drm/amd/display/dc/inc/resource.h | 11 
 7 files changed, 105 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c 
b/drivers/gpu/drm/amd/display/dc/core/dc.c
index 01c8849b9db2..6f5528d34093 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
@@ -1404,20 +1404,34 @@ static void program_timing_sync(
status->timing_sync_info.master = false;
 
}
-   /* remove any other unblanked pipes as they have already been 
synced */
-   for (j = j + 1; j < group_size; j++) {
-   bool is_blanked;
 
-   if (pipe_set[j]->stream_res.opp->funcs->dpg_is_blanked)
-   is_blanked =
-   
pipe_set[j]->stream_res.opp->funcs->dpg_is_blanked(pipe_set[j]->stream_res.opp);
-   else
-   is_blanked =
-   
pipe_set[j]->stream_res.tg->funcs->is_blanked(pipe_set[j]->stream_res.tg);
-   if (!is_blanked) {
-   group_size--;
-   pipe_set[j] = pipe_set[group_size];
-   j--;
+   /* remove any other pipes that are already been synced */
+   if (dc->config.use_pipe_ctx_sync_logic) {
+   /* check pipe's syncd to decide which pipe to be 
removed */
+   for (j = 1; j < group_size; j++) {
+   if (pipe_set[j]->pipe_idx_syncd == 
pipe_set[0]->pipe_idx_syncd) {
+   group_size--;
+   pipe_set[j] = pipe_set[group_size];
+   j--;
+   } else
+   /* link slave pipe's syncd with master 
pipe */
+   pipe_set[j]->pipe_idx_syncd = 
pipe_set[0]->pipe_idx_syncd;
+   }
+   } else {
+   for (j = j + 1; j < group_size; j++) {
+   bool is_blanked;
+
+   if 
(pipe_set[j]->stream_res.opp->funcs->dpg_is_blanked)
+   is_blanked =
+   
pipe_set[j]->stream_res.opp->funcs->dpg_is_blanked(pipe_set[j]->stream_res.opp);
+   else
+   is_blanked =
+   
pipe_set[j]->stream_res.tg->funcs->is_blanked(pipe_set[j]->stream_res.tg);
+   if (!is_blanked) {
+   group_size--;
+   pipe_set[j] = pipe_set[group_size];
+   j--;
+   }
}
}
 
diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_resource.c 
b/drivers/gpu/drm/amd/display/dc/core/dc_resource.c
index d4ff6cc6b8d9..b3912ff9dc91 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc_resource.c
@@ -3217,6 +3217,60 @@ struct hpo_dp_link_encoder 
*resource_get_hpo_dp_link_enc_for_det_lt(
 }
 #endif
 
+void reset_syncd_pipes_from_disabled_pipes(struct dc *dc,
+   struct dc_state *context)
+{
+   int i, j;
+   struct pipe_ctx *pipe_ctx_old, *pipe_ctx, *pipe_ctx_syncd;
+
+   /* If pipe backend is reset, need to reset pipe syncd status */
+   for (i = 0; i < dc->res_pool->pipe_count; i++) {
+   pipe_ctx

Re: [PATCH] drm/amdgpu: improve debug VRAM access performance using sdma

2022-01-12 Thread Christian König

Yeah, that's basically my fault.

I haven't even worked my way through all the mails that piled up during
the Christmas break :(


Christian.

Am 12.01.22 um 15:21 schrieb Kim, Jonathan:

[Public]

Thanks Christian.  I've already merged based on Felix's review.
I'll send your suggested cleanup for review out soon.

Jon


-----Original Message-----
From: Koenig, Christian 
Sent: January 12, 2022 2:33 AM
To: Kim, Jonathan ; amd-
g...@lists.freedesktop.org
Cc: Kuehling, Felix 
Subject: Re: [PATCH] drm/amdgpu: improve debug VRAM access
performance using sdma

Am 04.01.22 um 20:12 schrieb Jonathan Kim:

For better performance during VRAM access for debugged processes, do
read/write copies over SDMA.

To support post-mortem debugging on a broken device, fall back
to stable MMIO access when GPU recovery is disabled or when job
submission timeouts are set to max.  Failed SDMA access should
automatically fall back to MMIO access.

Use a pre-allocated GTT bounce buffer pre-mapped into GART to avoid
page-table updates and TLB flushes on access.
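
The fast-path/fallback behaviour described above can be sketched like
this. It is a simplified model, not the driver code: struct dev_model, its
fields, and the access_* helpers are all hypothetical, PAGE_SZ stands in
for PAGE_SIZE, and plain memcpy() stands in for both the SDMA copy and the
MMIO access.

```c
#include <stdbool.h>
#include <string.h>

#define PAGE_SZ 4096 /* stand-in for the kernel's PAGE_SIZE */

/* Hypothetical model of the device; none of these fields are real amdgpu members. */
struct dev_model {
	bool post_mortem;            /* recovery disabled or a timeout set to max */
	bool sdma_ok;                /* whether the simulated SDMA copy succeeds */
	unsigned char vram[PAGE_SZ]; /* one page of simulated VRAM */
	int mmio_calls;              /* counts slow-path accesses */
};

static void copy_page(struct dev_model *d, void *buf, int len, int write)
{
	if (write)
		memcpy(d->vram, buf, (size_t)len);
	else
		memcpy(buf, d->vram, (size_t)len);
}

/* Fast path: handles exactly one page, returns 0 on success. */
static int access_sdma(struct dev_model *d, void *buf, int len, int write)
{
	if (len != PAGE_SZ)
		return -1;
	if (!d->sdma_ok)
		return -2;
	copy_page(d, buf, len, write);
	return 0;
}

/* Slow path: always available, returns the number of bytes copied. */
static int access_mmio(struct dev_model *d, void *buf, int len, int write)
{
	if (len < 0 || len > PAGE_SZ)
		return -1;
	d->mmio_calls++;
	copy_page(d, buf, len, write);
	return len;
}

/* Try the fast path unless post-mortem debugging is wanted; otherwise fall through. */
static int access_memory(struct dev_model *d, void *buf, int len, int write)
{
	if (!d->post_mortem && !access_sdma(d, buf, len, write))
		return len;
	return access_mmio(d, buf, len, write);
}
```

Any failure of the fast path, including a request that is not exactly one
page, simply falls through to the slow path, so callers never see the
SDMA error.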

Signed-off-by: Jonathan Kim 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 78

+

   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h |  5 +-
   2 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 367abed1d6e6..512df4c09772 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -48,6 +48,7 @@
   #include 

   #include 
+#include 

   #include "amdgpu.h"
   #include "amdgpu_object.h"
@@ -1429,6 +1430,70 @@ static void

amdgpu_ttm_vram_mm_access(struct amdgpu_device *adev, loff_t pos,

 }
   }

+static int amdgpu_ttm_access_memory_sdma(struct ttm_buffer_object

*bo,

+   unsigned long offset, void *buf, int

len, int write) {

+   struct amdgpu_bo *abo = ttm_to_amdgpu_bo(bo);
+   struct amdgpu_device *adev = amdgpu_ttm_adev(abo->tbo.bdev);
+   struct amdgpu_job *job;
+   struct dma_fence *fence;
+   uint64_t src_addr, dst_addr;
+   unsigned int num_dw;
+   int r, idx;
+
+   if (len != PAGE_SIZE)
+   return -EINVAL;
+
+   if (!adev->mman.sdma_access_ptr)
+   return -EACCES;
+
+   r = drm_dev_enter(adev_to_drm(adev), &idx);
+   if (r)
+   return r;
+
+   if (write)
+   memcpy(adev->mman.sdma_access_ptr, buf, len);
+
+   num_dw = ALIGN(adev->mman.buffer_funcs->copy_num_dw, 8);
+   r = amdgpu_job_alloc_with_ib(adev, num_dw * 4,

AMDGPU_IB_POOL_DELAYED, &job);

+   if (r)
+   goto out;
+
+   src_addr = write ? amdgpu_bo_gpu_offset(adev-
mman.sdma_access_bo) :
+   amdgpu_bo_gpu_offset(abo);
+   dst_addr = write ? amdgpu_bo_gpu_offset(abo) :
+   amdgpu_bo_gpu_offset(adev-
mman.sdma_access_bo);

I suggest writing this as

src_addr = a;
dst_addr = b;
if (write)
  swap(src_addr, dst_addr);

This way we are not duplicating getting the different offsets.


+   amdgpu_emit_copy_buffer(adev, &job->ibs[0], src_addr, dst_addr,
+PAGE_SIZE, false);
+
+   amdgpu_ring_pad_ib(adev->mman.buffer_funcs_ring, &job-
ibs[0]);
+   WARN_ON(job->ibs[0].length_dw > num_dw);
+
+   r = amdgpu_job_submit(job, &adev->mman.entity,

AMDGPU_FENCE_OWNER_UNDEFINED, &fence);

+   if (r) {
+   amdgpu_job_free(job);
+   goto out;
+   }
+
+   if (!dma_fence_wait_timeout(fence, false, adev->sdma_timeout))
+   r = -ETIMEDOUT;
+   dma_fence_put(fence);
+
+   if (!(r || write))
+   memcpy(buf, adev->mman.sdma_access_ptr, len);
+out:
+   drm_dev_exit(idx);
+   return r;
+}
+
+static inline bool amdgpu_ttm_allow_post_mortem_debug(struct
+amdgpu_device *adev) {
+   return amdgpu_gpu_recovery == 0 ||
+   adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
+   adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
+   adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT ||
+   adev->video_timeout == MAX_SCHEDULE_TIMEOUT; }

This should probably be inside amdgpu_device.c


+
   /**
* amdgpu_ttm_access_memory - Read or Write memory that backs a

buffer object.

*
@@ -1453,6 +1518,10 @@ static int amdgpu_ttm_access_memory(struct

ttm_buffer_object *bo,

 if (bo->resource->mem_type != TTM_PL_VRAM)
 return -EIO;

+   if (!amdgpu_ttm_allow_post_mortem_debug(adev) &&
+   !amdgpu_ttm_access_memory_sdma(bo, offset,

buf, len, write))

+   return len;
+
 amdgpu_res_first(bo->resource, offset, len, &cursor);
 while (cursor.remaining) {
 size_t count, size = cursor.size;
@@ -1793,6 +1862,12 @@ int amdgpu_ttm_init(struct amdgpu_device

*adev)

 return r;
 }

+   if (amdgpu_bo_create_kernel(adev, PAGE_SIZE, PAGE_SIZE,
+   AMDGPU_GEM_DOMAIN_GTT,
+   &adev->mman.sdma_access_bo, NULL,
+   adev->mman.sdma_access_ptr))
+   DRM_WARN("Debug VRAM access will use slowpath 

RE: [PATCH] drm/amdgpu: Add interface to load SRIOV cap FW

2022-01-12 Thread Chen, Guchun
+   int err = 0;
+   const struct psp_firmware_header_v1_0 *cap_hdr_v1_0;
+   struct amdgpu_firmware_info *info = NULL;

Please put the short variable declarations last. With this fixed, the patch is:
Acked-by: Guchun Chen 

Regards,
Guchun

-----Original Message-----
From: amd-gfx  On Behalf Of Bokun Zhang
Sent: Tuesday, January 11, 2022 4:46 AM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Bokun ; Liu, Monk 
Subject: [PATCH] drm/amdgpu: Add interface to load SRIOV cap FW

- Add interface to load SRIOV cap FW. If the FW does not
  exist, simply skip this FW loading routine.
  This FW will only be loaded under SRIOV. Other driver
  setup will not be affected.
  Adding this interface makes it easier to
  prepare the SRIOV Linux guest driver for different users.

- Update sysfs interface to read cap FW version.

- Refactor PSP FW loading routine under SRIOV to use a
  unified switch statement instead of if statements

- Remove redundant amdgpu_sriov_vf() check in FW loading
  routine

Acked-by: Monk Liu 
Signed-off-by: Bokun Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c   |  14 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c   | 108 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h   |   9 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.h |   1 +
 drivers/gpu/drm/amd/amdgpu/psp_gfx_if.h   |   1 +
 drivers/gpu/drm/amd/amdgpu/psp_v11_0.c|   4 +-
 drivers/gpu/drm/amd/amdgpu/psp_v3_1.c |   1 +
 include/uapi/drm/amdgpu_drm.h |   2 +
 8 files changed, 125 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 35bee9dabe1c..dc7f24a98c44 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -400,6 +400,10 @@ static int amdgpu_firmware_info(struct 
drm_amdgpu_info_firmware *fw_info,
fw_info->ver = adev->psp.toc.fw_version;
fw_info->feature = adev->psp.toc.feature_version;
break;
+   case AMDGPU_INFO_FW_CAP:
+   fw_info->ver = adev->psp.cap_fw_version;
+   fw_info->feature = adev->psp.cap_feature_version;
+   break;
default:
return -EINVAL;
}
@@ -1665,6 +1669,16 @@ static int amdgpu_debugfs_firmware_info_show(struct 
seq_file *m, void *unused)
seq_printf(m, "TOC feature version: %u, firmware version: 0x%08x\n",
   fw_info.feature, fw_info.ver);
 
+   /* CAP */
+   if (adev->psp.cap_fw) {
+   query_fw.fw_type = AMDGPU_INFO_FW_CAP;
+   ret = amdgpu_firmware_info(&fw_info, &query_fw, adev);
+   if (ret)
+   return ret;
+   seq_printf(m, "CAP feature version: %u, firmware version: 
0x%08x\n",
+   fw_info.feature, fw_info.ver);
+   }
+
seq_printf(m, "VBIOS version: %s\n", ctx->vbios_version);
 
return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 07d563c6641f..2095866bdf81 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -259,6 +259,32 @@ static bool psp_get_runtime_db_entry(struct amdgpu_device 
*adev,
return ret;
 }
 
+static int psp_init_sriov_microcode(struct psp_context *psp) {
+   struct amdgpu_device *adev = psp->adev;
+   int ret = 0;
+
+   switch (adev->ip_versions[MP0_HWIP][0]) {
+   case IP_VERSION(9, 0, 0):
+   ret = psp_init_cap_microcode(psp, "vega10");
+   break;
+   case IP_VERSION(11, 0, 9):
+   ret = psp_init_cap_microcode(psp, "navi12");
+   break;
+   case IP_VERSION(11, 0, 7):
+   ret = psp_init_cap_microcode(psp, "sienna_cichlid");
+   break;
+   case IP_VERSION(13, 0, 2):
+   ret = psp_init_ta_microcode(psp, "aldebaran");
+   break;
+   default:
+   BUG();
+   break;
+   }
+
+   return ret;
+}
+
 static int psp_sw_init(void *handle)
 {
	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
@@ -273,19 +299,13 @@ static int psp_sw_init(void *handle)
ret = -ENOMEM;
}
 
-   if (!amdgpu_sriov_vf(adev)) {
+   if (amdgpu_sriov_vf(adev))
+   ret = psp_init_sriov_microcode(psp);
+   else
ret = psp_init_microcode(psp);
-   if (ret) {
-   DRM_ERROR("Failed to load psp firmware!\n");
-   return ret;
-   }
-   } else if (amdgpu_sriov_vf(adev) &&
-  adev->ip_versions[MP0_HWIP][0] == IP_VERSION(13, 0, 2)) {
-   ret = psp_init_ta_microcode(psp, "aldebaran");
-   if (ret) {
-   DRM_ERROR("Failed to initialize ta microcode!\n");
-   return ret;
-   }
+   if (ret) {
+   

Re: [PATCH Review 1/1] drm/amdgpu: handle denied inject error into critical regions

2022-01-12 Thread Lazar, Lijo




On 1/12/2022 7:12 AM, Stanley.Yang wrote:

Signed-off-by: Stanley.Yang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 10 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c |  2 +-
  drivers/gpu/drm/amd/amdgpu/ta_ras_if.h  |  3 ++-
  3 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index c742d1aacf5a..8e0ea582b9c7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -1309,6 +1309,12 @@ static void psp_ras_ta_check_status(struct psp_context 
*psp)
break;
case TA_RAS_STATUS__SUCCESS:
break;
+   case TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED:
+   if (ras_cmd->cmd_id == TA_RAS_COMMAND__TRIGGER_ERROR) {
+   dev_info(psp->adev->dev,
+   "RAS INFO: Inject error to critical region 
is not allowed\n");
+   }


Instead of doing this, why not print the message in psp_ras_trigger_error()?
That is, the caller interprets the error code and prints the appropriate
message. I guess that is the single entry point for sending the
TRIGGER_ERROR command.


Thanks,
Lijo


+   break;
default:
dev_warn(psp->adev->dev,
"RAS WARNING: ras status = 0x%X\n", 
ras_cmd->ras_status);
@@ -1521,7 +1527,9 @@ int psp_ras_trigger_error(struct psp_context *psp,
if (amdgpu_ras_intr_triggered())
return 0;
  
-	if (ras_cmd->ras_status)

+   if (ras_cmd->ras_status == TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED)
+   return -EACCES;
+   else if (ras_cmd->ras_status)
return -EINVAL;
  
  	return 0;

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index e674dbed3615..8bdc2e85cb20 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -449,7 +449,7 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f,
}
  
  	if (ret)

-   return -EINVAL;
+   return ret;
  
  	return size;

  }
diff --git a/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h 
b/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
index 5093826a43d1..509d8a1945eb 100644
--- a/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
+++ b/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
@@ -64,7 +64,8 @@ enum ta_ras_status {
TA_RAS_STATUS__ERROR_PCS_STATE_ERROR= 0xA016,
TA_RAS_STATUS__ERROR_PCS_STATE_HANG = 0xA017,
TA_RAS_STATUS__ERROR_PCS_STATE_UNKNOWN  = 0xA018,
-   TA_RAS_STATUS__ERROR_UNSUPPORTED_ERROR_INJ  = 0xA019
+   TA_RAS_STATUS__ERROR_UNSUPPORTED_ERROR_INJ  = 0xA019,
+   TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED  = 0xA01A
  };
  
  enum ta_ras_block {
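
A minimal sketch of the alternative raised in the review, where the caller
maps the TA status to an errno and owns the message. The names here
(enum ras_status_model, trigger_error_result, trigger_error) are
hypothetical; only the 0xA019 and 0xA01A values come from the patch, and
the success value of 0 is an assumption.

```c
#include <errno.h>
#include <stdio.h>

/* Reduced status enum; only 0xA019 and 0xA01A are taken from the patch. */
enum ras_status_model {
	RAS_STATUS_SUCCESS               = 0x0,    /* assumed value */
	RAS_STATUS_UNSUPPORTED_ERROR_INJ = 0xA019,
	RAS_STATUS_ACCESS_DENIED         = 0xA01A, /* value added by the patch */
};

/* Translate the TA status of a trigger-error command into an errno. */
static int trigger_error_result(enum ras_status_model status)
{
	if (status == RAS_STATUS_ACCESS_DENIED)
		return -EACCES;
	if (status != RAS_STATUS_SUCCESS)
		return -EINVAL;
	return 0;
}

/* The single entry point interprets the result and prints the message. */
static int trigger_error(enum ras_status_model status)
{
	int ret = trigger_error_result(status);

	if (ret == -EACCES)
		fprintf(stderr,
			"RAS INFO: Inject error to critical region is not allowed\n");
	return ret;
}
```

Keeping the message next to the errno mapping means the generic status
checker does not need to know which command was sent.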




Re: [git pull] drm for 5.17-rc1 (pre-merge window pull)

2022-01-12 Thread Harry Wentland



On 2022-01-11 15:51, Linus Torvalds wrote:
> On Tue, Jan 11, 2022 at 7:38 AM Harry Wentland  wrote:
>>
>> Attached is a v2 of the buggy patch that should get this right.
>> If you have a chance to try it out let us know
> 
> I can confirm that I do not see the horribly flickering behavior with
> this patch.

Thanks for testing it. The patch is up for review on the amd-gfx
mailing list.

Harry


Re: [PATCH 2/2] drm/amd/display: Fix for otg synchronization logic

2022-01-12 Thread Alex Deucher
On Wed, Jan 12, 2022 at 9:28 AM Harry Wentland  wrote:
>
> From: Meenakshikumar Somasundaram 
>
> [Why]
> During otg sync trigger, plane states are used to decide whether the otg
> is already synchronized or not. There are scenarios where otgs are
> disabled without the plane state getting disabled, and in such cases the
> otg is excluded from synchronization.
>
> [How]
> Introduce pipe_idx_syncd in pipe_ctx, which tracks each otg's master pipe.
> When an otg is disabled/enabled, pipe_idx_syncd is reset to itself.
> On sync trigger, pipe_idx_syncd is checked to decide whether an otg is
> already synchronized, and the otg is then included in or excluded from
> synchronization accordingly.
>
> v2:
>   Don't drop is_blanked logic
>
> Reviewed-by: Jun Lei 
> Reviewed-by: Mustapha Ghaddar 
> Acked-by: Bhawanpreet Lakha 
> Signed-off-by: meenakshikumar somasundaram 
> 
> Tested-by: Daniel Wheeler 
> Signed-off-by: Alex Deucher 
> Signed-off-by: Harry Wentland 
> Cc: torva...@linux-foundation.org

Series is:
Reviewed-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/display/dc/core/dc.c  | 40 +-
>  .../gpu/drm/amd/display/dc/core/dc_resource.c | 54 +++
>  drivers/gpu/drm/amd/display/dc/dc.h   |  1 +
>  .../display/dc/dce110/dce110_hw_sequencer.c   |  8 +++
>  .../drm/amd/display/dc/dcn31/dcn31_resource.c |  3 ++
>  .../gpu/drm/amd/display/dc/inc/core_types.h   |  1 +
>  drivers/gpu/drm/amd/display/dc/inc/resource.h | 11 
>  7 files changed, 105 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c 
> b/drivers/gpu/drm/amd/display/dc/core/dc.c
> index 01c8849b9db2..6f5528d34093 100644
> --- a/drivers/gpu/drm/amd/display/dc/core/dc.c
> +++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
> @@ -1404,20 +1404,34 @@ static void program_timing_sync(
> status->timing_sync_info.master = false;
>
> }
> -   /* remove any other unblanked pipes as they have already been 
> synced */
> -   for (j = j + 1; j < group_size; j++) {
> -   bool is_blanked;
>
> -   if 
> (pipe_set[j]->stream_res.opp->funcs->dpg_is_blanked)
> -   is_blanked =
> -   
> pipe_set[j]->stream_res.opp->funcs->dpg_is_blanked(pipe_set[j]->stream_res.opp);
> -   else
> -   is_blanked =
> -   
> pipe_set[j]->stream_res.tg->funcs->is_blanked(pipe_set[j]->stream_res.tg);
> -   if (!is_blanked) {
> -   group_size--;
> -   pipe_set[j] = pipe_set[group_size];
> -   j--;
> +   /* remove any other pipes that are already been synced */
> +   if (dc->config.use_pipe_ctx_sync_logic) {
> +   /* check pipe's syncd to decide which pipe to be 
> removed */
> +   for (j = 1; j < group_size; j++) {
> +   if (pipe_set[j]->pipe_idx_syncd == 
> pipe_set[0]->pipe_idx_syncd) {
> +   group_size--;
> +   pipe_set[j] = pipe_set[group_size];
> +   j--;
> +   } else
> +   /* link slave pipe's syncd with 
> master pipe */
> +   pipe_set[j]->pipe_idx_syncd = 
> pipe_set[0]->pipe_idx_syncd;
> +   }
> +   } else {
> +   for (j = j + 1; j < group_size; j++) {
> +   bool is_blanked;
> +
> +   if 
> (pipe_set[j]->stream_res.opp->funcs->dpg_is_blanked)
> +   is_blanked =
> +   
> pipe_set[j]->stream_res.opp->funcs->dpg_is_blanked(pipe_set[j]->stream_res.opp);
> +   else
> +   is_blanked =
> +   
> pipe_set[j]->stream_res.tg->funcs->is_blanked(pipe_set[j]->stream_res.tg);
> +   if (!is_blanked) {
> +   group_size--;
> +   pipe_set[j] = pipe_set[group_size];
> +   j--;
> +   }
> }
> }
>
> diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_resource.c 
> b/drivers/gpu/drm/amd/display/dc/core/dc_resource.c
> index d4ff6cc6b8d9..b3912ff9dc91 100644
> --- a/drivers/gpu/drm/amd/display/dc/core/dc_resource.c
> +++ b/drivers/gpu/drm/amd/display/dc/core/dc_resource.c
> @@ -3217,6 +3217,60 @@ struct hpo_dp_link_encoder 
> *resource_get_hpo_dp_link_enc_for_det_lt(
>  }
>  #endif
>
> +void reset_syncd_pipes_from_disab

[PATCH] drm/amdgpu: cleanup ttm debug sdma vram access function

2022-01-12 Thread Jonathan Kim
Some suggested cleanups to declutter ttm when doing debug VRAM access over
SDMA.

Signed-off-by: Jonathan Kim 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h |  9 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 23 +++
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index a675dde81ce0..4d77842f2183 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1448,6 +1448,15 @@ int amdgpu_device_set_cg_state(struct amdgpu_device 
*adev,
 int amdgpu_device_set_pg_state(struct amdgpu_device *adev,
   enum amd_powergating_state state);
 
+static inline bool amdgpu_allow_post_mortem_debug(struct amdgpu_device *adev)
+{
+   return amdgpu_gpu_recovery == 0 ||
+   adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
+   adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
+   adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT ||
+   adev->video_timeout == MAX_SCHEDULE_TIMEOUT;
+}
+
 #include "amdgpu_object.h"
 
 static inline bool amdgpu_is_tmz(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 33781509838c..02515f1ea5fa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1460,10 +1460,11 @@ static int amdgpu_ttm_access_memory_sdma(struct 
ttm_buffer_object *bo,
if (r)
goto out;
 
-   src_addr = write ? amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo) :
-   amdgpu_bo_gpu_offset(abo);
-   dst_addr = write ? amdgpu_bo_gpu_offset(abo) :
-   amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo);
+   src_addr = amdgpu_bo_gpu_offset(abo);
+   dst_addr = amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo);
+   if (write)
+   swap(src_addr, dst_addr);
+
amdgpu_emit_copy_buffer(adev, &job->ibs[0], src_addr, dst_addr, 
PAGE_SIZE, false);
 
amdgpu_ring_pad_ib(adev->mman.buffer_funcs_ring, &job->ibs[0]);
@@ -1486,15 +1487,6 @@ static int amdgpu_ttm_access_memory_sdma(struct 
ttm_buffer_object *bo,
return r;
 }
 
-static inline bool amdgpu_ttm_allow_post_mortem_debug(struct amdgpu_device 
*adev)
-{
-   return amdgpu_gpu_recovery == 0 ||
-   adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
-   adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
-   adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT ||
-   adev->video_timeout == MAX_SCHEDULE_TIMEOUT;
-}
-
 /**
  * amdgpu_ttm_access_memory - Read or Write memory that backs a buffer object.
  *
@@ -1519,7 +1511,7 @@ static int amdgpu_ttm_access_memory(struct 
ttm_buffer_object *bo,
if (bo->resource->mem_type != TTM_PL_VRAM)
return -EIO;
 
-   if (!amdgpu_ttm_allow_post_mortem_debug(adev) &&
+   if (!amdgpu_allow_post_mortem_debug(adev) &&
!amdgpu_ttm_access_memory_sdma(bo, offset, buf, len, 
write))
return len;
 
@@ -1909,8 +1901,7 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)
ttm_range_man_fini(&adev->mman.bdev, AMDGPU_PL_OA);
ttm_device_fini(&adev->mman.bdev);
adev->mman.initialized = false;
-   if (adev->mman.sdma_access_ptr)
-   amdgpu_bo_free_kernel(&adev->mman.sdma_access_bo, NULL,
+   amdgpu_bo_free_kernel(&adev->mman.sdma_access_bo, NULL,
&adev->mman.sdma_access_ptr);
DRM_INFO("amdgpu: ttm finalized\n");
 }
-- 
2.25.1
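
The src/dst change in amdgpu_ttm_access_memory_sdma() is a behavior-preserving refactor. As a minimal userspace sketch of why the two forms agree (the struct and function names below are hypothetical and not part of the driver, and the kernel's swap() macro is replaced by a manual exchange):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pair of copy endpoints; not a driver type. */
struct copy_addrs {
	uint64_t src;
	uint64_t dst;
};

/* Original form: one ternary per address. */
static struct copy_addrs pick_ternary(bool write, uint64_t bo, uint64_t bounce)
{
	struct copy_addrs a;

	a.src = write ? bounce : bo;
	a.dst = write ? bo : bounce;
	return a;
}

/* Patched form: assume a read, then exchange the two for a write. */
static struct copy_addrs pick_swap(bool write, uint64_t bo, uint64_t bounce)
{
	struct copy_addrs a = { .src = bo, .dst = bounce };

	if (write) {
		uint64_t tmp = a.src;

		a.src = a.dst;
		a.dst = tmp;
	}
	return a;
}
```

Here "bo" stands in for the offset of the buffer object being accessed and "bounce" for the offset of sdma_access_bo. A read copies from the buffer object into the bounce buffer, and a write goes the other way.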



Re: [PATCH] drm/amdgpu: cleanup ttm debug sdma vram access function

2022-01-12 Thread Christian König




On 12.01.22 at 16:59, Jonathan Kim wrote:

Some suggested cleanups to declutter ttm when doing debug VRAM access over
SDMA.

Signed-off-by: Jonathan Kim 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h |  9 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 23 +++
  2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index a675dde81ce0..4d77842f2183 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1448,6 +1448,15 @@ int amdgpu_device_set_cg_state(struct amdgpu_device 
*adev,
  int amdgpu_device_set_pg_state(struct amdgpu_device *adev,
   enum amd_powergating_state state);
  
+static inline bool amdgpu_allow_post_mortem_debug(struct amdgpu_device *adev)


Give that a better name, something like 
amdgpu_device_are_timeouts_enabled().


Apart from that looks good to me,
Christian.


+{
+   return amdgpu_gpu_recovery == 0 ||
+   adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
+   adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
+   adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT ||
+   adev->video_timeout == MAX_SCHEDULE_TIMEOUT;
+}
+
  #include "amdgpu_object.h"
  
  static inline bool amdgpu_is_tmz(struct amdgpu_device *adev)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 33781509838c..02515f1ea5fa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1460,10 +1460,11 @@ static int amdgpu_ttm_access_memory_sdma(struct 
ttm_buffer_object *bo,
if (r)
goto out;
  
-	src_addr = write ? amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo) :

-   amdgpu_bo_gpu_offset(abo);
-   dst_addr = write ? amdgpu_bo_gpu_offset(abo) :
-   amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo);
+   src_addr = amdgpu_bo_gpu_offset(abo);
+   dst_addr = amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo);
+   if (write)
+   swap(src_addr, dst_addr);
+
amdgpu_emit_copy_buffer(adev, &job->ibs[0], src_addr, dst_addr, 
PAGE_SIZE, false);
  
  	amdgpu_ring_pad_ib(adev->mman.buffer_funcs_ring, &job->ibs[0]);

@@ -1486,15 +1487,6 @@ static int amdgpu_ttm_access_memory_sdma(struct 
ttm_buffer_object *bo,
return r;
  }
  
-static inline bool amdgpu_ttm_allow_post_mortem_debug(struct amdgpu_device *adev)

-{
-   return amdgpu_gpu_recovery == 0 ||
-   adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
-   adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
-   adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT ||
-   adev->video_timeout == MAX_SCHEDULE_TIMEOUT;
-}
-
  /**
   * amdgpu_ttm_access_memory - Read or Write memory that backs a buffer object.
   *
@@ -1519,7 +1511,7 @@ static int amdgpu_ttm_access_memory(struct 
ttm_buffer_object *bo,
if (bo->resource->mem_type != TTM_PL_VRAM)
return -EIO;
  
-	if (!amdgpu_ttm_allow_post_mortem_debug(adev) &&

+   if (!amdgpu_allow_post_mortem_debug(adev) &&
!amdgpu_ttm_access_memory_sdma(bo, offset, buf, len, 
write))
return len;
  
@@ -1909,8 +1901,7 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)

ttm_range_man_fini(&adev->mman.bdev, AMDGPU_PL_OA);
ttm_device_fini(&adev->mman.bdev);
adev->mman.initialized = false;
-   if (adev->mman.sdma_access_ptr)
-   amdgpu_bo_free_kernel(&adev->mman.sdma_access_bo, NULL,
+   amdgpu_bo_free_kernel(&adev->mman.sdma_access_bo, NULL,
&adev->mman.sdma_access_ptr);
DRM_INFO("amdgpu: ttm finalized\n");
  }




Re: [PATCH v3 00/10] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

2022-01-12 Thread Felix Kuehling
On 2022-01-12 at 6:16 a.m., David Hildenbrand wrote:
> On 10.01.22 23:31, Alex Sierra wrote:
>> This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
>> owned by a device that can be mapped into CPU page tables like
>> MEMORY_DEVICE_GENERIC and can also be migrated like
>> MEMORY_DEVICE_PRIVATE.
>>
>> Christoph, the suggestion to incorporate Ralph Campbell’s refcount
>> cleanup patch into our hardware page migration patchset originally came
>> from you, but it proved impractical to do things in that order because
>> the refcount cleanup introduced a bug with wide ranging structural
>> implications. Instead, we amended Ralph’s patch so that it could be
>> applied after merging the migration work. As we saw from the recent
>> discussion, merging the refcount work is going to take some time and
>> cooperation between multiple development groups, while the migration
>> work is ready now and is needed now. So we propose to merge this
>> patchset first and continue to work with Ralph and others to merge the
>> refcount cleanup separately, when it is ready.
>>
>> This patch series is mostly self-contained except for a few places where
>> it needs to update other subsystems to handle the new memory type.
>> System stability and performance are not affected according to our
>> ongoing testing, including xfstests.
>>
>> How it works: The system BIOS advertises the GPU device memory
>> (aka VRAM) as SPM (special purpose memory) in the UEFI system address
>> map.
>>
>> The amdgpu driver registers the memory with devmap as
>> MEMORY_DEVICE_COHERENT using devm_memremap_pages. The initial user for
>> this hardware page migration capability is the Frontier supercomputer
>> project. This functionality is not AMD-specific. We expect other GPU
>> vendors to find this functionality useful, and possibly other hardware
>> types in the future.
>>
>> Our test nodes in the lab are similar to the Frontier configuration,
>> with .5 TB of system memory plus 256 GB of device memory split across
>> 4 GPUs, all in a single coherent address space. Page migration is
>> expected to improve application efficiency significantly. We will
>> report empirical results as they become available.
> Hi,
>
> might be a dumb question because I'm not too familiar with
> MEMORY_DEVICE_COHERENT, but who's in charge of migrating *to* that
> memory? Or how does a process ever get a grab on such pages?

Device memory management and migration to device memory work the same as
MEMORY_DEVICE_PRIVATE. The device driver is in charge of managing the
memory and migrating data to it in response to application requests
(e.g. hipMemPrefetchAsync) or device page faults.

The nice thing about MEMORY_DEVICE_COHERENT is that the CPU or a 3rd-party
device (e.g. a NIC) can access the memory without migrations
disrupting execution of high performance application code on the GPU.


>
> And where does migration come into play? I assume migration is only
> required to migrate off of that device memory to ordinary system RAM
> when required because the device memory has to be freed up, correct?

That's one case. For example memory pressure can force the GPU driver to
evict some device-coherent memory back to system memory. Also,
applications can request a migration to system memory explicitly (again
with something like hipMemPrefetchAsync).

Regards,
  Felix


>
> (a high level description on how this is exploited from user space
> would be great)
>


Re: [PATCH 2/2] drm/amd/display: Fix for otg synchronization logic

2022-01-12 Thread Harry Wentland



On 2022-01-12 10:53, Alex Deucher wrote:
> On Wed, Jan 12, 2022 at 9:28 AM Harry Wentland  wrote:
>>
>> From: Meenakshikumar Somasundaram 
>>
>> [Why]
>> During otg sync trigger, plane states are used to decide whether the otg
>> is already synchronized or not. There are scenarios when otgs are
>> disabled without plane state getting disabled and in such case the otg is
>> excluded from synchronization.
>>
>> [How]
>> Introduced pipe_idx_syncd in pipe_ctx that tracks each otg's master pipe.
>> When an otg is disabled/enabled, pipe_idx_syncd is reset to itself.
>> On sync trigger, pipe_idx_syncd is checked to decide whether an otg is
>> already synchronized and the otg is further included or excluded from
>> synchronization.
>>
>> v2:
>>   Don't drop is_blanked logic
>>
>> Reviewed-by: Jun Lei 
>> Reviewed-by: Mustapha Ghaddar 
>> Acked-by: Bhawanpreet Lakha 
>> Signed-off-by: meenakshikumar somasundaram 
>> 
>> Tested-by: Daniel Wheeler 
>> Signed-off-by: Alex Deucher 
>> Signed-off-by: Harry Wentland 
>> Cc: torva...@linux-foundation.org
> 
> Series is:
> Reviewed-by: Alex Deucher 
> 

And merged. Thanks.

Harry



Re: [PATCH 2/2] drm/amdgpu: No longer insert ras blocks into ras_list if it already exists in ras_list

2022-01-12 Thread Felix Kuehling


On 2022-01-12 at 2:48 a.m., yipechai wrote:
> No longer insert ras blocks into ras_list if it already exists in ras_list.
>
> Signed-off-by: yipechai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 62be0b4909b3..e6d3bb4b56e4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2754,9 +2754,17 @@ int amdgpu_ras_reset_gpu(struct amdgpu_device *adev)
>  int amdgpu_ras_register_ras_block(struct amdgpu_device *adev,
>   struct amdgpu_ras_block_object* ras_block_obj)
>  {
> + struct amdgpu_ras_block_object *obj, *tmp;
>   if (!adev || !amdgpu_ras_asic_supported(adev) || !ras_block_obj)
>   return -EINVAL;
>  
> + /* If the ras object had been in ras_list, doesn't add it to ras_list 
> again */
> + list_for_each_entry_safe(obj, tmp, &adev->ras_list, node) {
> + if (obj == ras_block_obj) {
Instead of a loop, can't this be done more efficiently with "if
(!list_empty(&ras_block_obj->node))"?

Of course this would require that you move the INIT_LIST_HEAD to some
earlier stage so that list_empty is reliable.

Regards,
  Felix


> + return 0;
> + }
> + }
> +
>   INIT_LIST_HEAD(&ras_block_obj->node);
>   list_add_tail(&ras_block_obj->node, &adev->ras_list);
>  
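
The list_empty() suggestion above relies on a node that was initialized but never linked pointing at itself. A minimal userspace sketch of that idea follows; struct list_node and its helpers are stand-ins written for this example and are assumed to mirror the kernel's struct list_head semantics. The register function returns 1 or 0 only to make the two paths visible, whereas the patch returns 0 in both cases.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (assumption). */
struct list_node {
	struct list_node *next, *prev;
};

/* Like INIT_LIST_HEAD(): an unlinked node points at itself. */
static void node_init(struct list_node *n)
{
	n->next = n;
	n->prev = n;
}

/* Like list_empty() applied to an entry: true if it is on no list. */
static bool node_unlinked(const struct list_node *n)
{
	return n->next == n;
}

/* Like list_add_tail(): link n just before head. */
static void list_append(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Hypothetical block object; only the embedded node matters here. */
struct ras_block {
	struct list_node node;
};

/* O(1) duplicate check: returns 1 if newly added, 0 if already linked. */
static int register_block(struct list_node *ras_list, struct ras_block *blk)
{
	if (!node_unlinked(&blk->node))
		return 0;
	list_append(&blk->node, ras_list);
	return 1;
}

/* Walk the list and count entries, for checking the result. */
static size_t list_length(const struct list_node *head)
{
	size_t n = 0;

	for (const struct list_node *p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```

This only works if every node is initialized once, before the first registration, and never re-initialized inside the register call. That is the reason for moving INIT_LIST_HEAD to an earlier stage.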


[PATCH] drm/amdgpu: cleanup ttm debug sdma vram access function

2022-01-12 Thread Jonathan Kim
Some suggested cleanups to declutter ttm when doing debug VRAM access over
SDMA.

v2: rename post_mortem_allowed func to has_timeouts_enabled.

Signed-off-by: Jonathan Kim 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h |  9 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 23 +++
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index a675dde81ce0..747d310aa72f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1448,6 +1448,15 @@ int amdgpu_device_set_cg_state(struct amdgpu_device 
*adev,
 int amdgpu_device_set_pg_state(struct amdgpu_device *adev,
   enum amd_powergating_state state);
 
+static inline bool amdgpu_device_has_timeouts_enabled(struct amdgpu_device 
*adev)
+{
+   return amdgpu_gpu_recovery != 0 &&
+   adev->gfx_timeout != MAX_SCHEDULE_TIMEOUT &&
+   adev->compute_timeout != MAX_SCHEDULE_TIMEOUT &&
+   adev->sdma_timeout != MAX_SCHEDULE_TIMEOUT &&
+   adev->video_timeout != MAX_SCHEDULE_TIMEOUT;
+}
+
 #include "amdgpu_object.h"
 
 static inline bool amdgpu_is_tmz(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 33781509838c..b489cd8abe31 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1460,10 +1460,11 @@ static int amdgpu_ttm_access_memory_sdma(struct 
ttm_buffer_object *bo,
if (r)
goto out;
 
-   src_addr = write ? amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo) :
-   amdgpu_bo_gpu_offset(abo);
-   dst_addr = write ? amdgpu_bo_gpu_offset(abo) :
-   amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo);
+   src_addr = amdgpu_bo_gpu_offset(abo);
+   dst_addr = amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo);
+   if (write)
+   swap(src_addr, dst_addr);
+
amdgpu_emit_copy_buffer(adev, &job->ibs[0], src_addr, dst_addr, 
PAGE_SIZE, false);
 
amdgpu_ring_pad_ib(adev->mman.buffer_funcs_ring, &job->ibs[0]);
@@ -1486,15 +1487,6 @@ static int amdgpu_ttm_access_memory_sdma(struct 
ttm_buffer_object *bo,
return r;
 }
 
-static inline bool amdgpu_ttm_allow_post_mortem_debug(struct amdgpu_device 
*adev)
-{
-   return amdgpu_gpu_recovery == 0 ||
-   adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
-   adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
-   adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT ||
-   adev->video_timeout == MAX_SCHEDULE_TIMEOUT;
-}
-
 /**
  * amdgpu_ttm_access_memory - Read or Write memory that backs a buffer object.
  *
@@ -1519,7 +1511,7 @@ static int amdgpu_ttm_access_memory(struct 
ttm_buffer_object *bo,
if (bo->resource->mem_type != TTM_PL_VRAM)
return -EIO;
 
-   if (!amdgpu_ttm_allow_post_mortem_debug(adev) &&
+   if (amdgpu_device_has_timeouts_enabled(adev) &&
!amdgpu_ttm_access_memory_sdma(bo, offset, buf, len, 
write))
return len;
 
@@ -1909,8 +1901,7 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)
ttm_range_man_fini(&adev->mman.bdev, AMDGPU_PL_OA);
ttm_device_fini(&adev->mman.bdev);
adev->mman.initialized = false;
-   if (adev->mman.sdma_access_ptr)
-   amdgpu_bo_free_kernel(&adev->mman.sdma_access_bo, NULL,
+   amdgpu_bo_free_kernel(&adev->mman.sdma_access_bo, NULL,
&adev->mman.sdma_access_ptr);
DRM_INFO("amdgpu: ttm finalized\n");
 }
-- 
2.25.1
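
The v2 rename also inverts the predicate, so the call site drops its negation. By De Morgan's law the new helper is the exact negation of the old one. The sketch below checks this over every on/off combination; struct timeouts, NO_TIMEOUT and the function names are stand-ins for this example (in the driver, amdgpu_gpu_recovery is a module parameter and the sentinel is MAX_SCHEDULE_TIMEOUT).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Stand-in for MAX_SCHEDULE_TIMEOUT (assumption for this sketch). */
#define NO_TIMEOUT LONG_MAX

/* Hypothetical bundle of the values the two helpers look at. */
struct timeouts {
	int gpu_recovery;
	long gfx, compute, sdma, video;
};

/* v1 predicate: true if recovery is off or any ring never times out. */
static bool allow_post_mortem_debug(const struct timeouts *t)
{
	return t->gpu_recovery == 0 ||
	       t->gfx == NO_TIMEOUT ||
	       t->compute == NO_TIMEOUT ||
	       t->sdma == NO_TIMEOUT ||
	       t->video == NO_TIMEOUT;
}

/* v2 predicate: true only if recovery is on and every ring can time out. */
static bool has_timeouts_enabled(const struct timeouts *t)
{
	return t->gpu_recovery != 0 &&
	       t->gfx != NO_TIMEOUT &&
	       t->compute != NO_TIMEOUT &&
	       t->sdma != NO_TIMEOUT &&
	       t->video != NO_TIMEOUT;
}

/* Try all 32 on/off combinations and count disagreements. */
static int count_mismatches(void)
{
	int mismatches = 0;

	for (unsigned int bits = 0; bits < 32; bits++) {
		struct timeouts t = {
			.gpu_recovery = (bits & 1) ? 0 : 1,
			.gfx = (bits & 2) ? NO_TIMEOUT : 10000,
			.compute = (bits & 4) ? NO_TIMEOUT : 60000,
			.sdma = (bits & 8) ? NO_TIMEOUT : 10000,
			.video = (bits & 16) ? NO_TIMEOUT : 10000,
		};

		if (has_timeouts_enabled(&t) != !allow_post_mortem_debug(&t))
			mismatches++;
	}
	return mismatches;
}
```

The timeout values used for the "enabled" cases are arbitrary placeholders. Only the comparison against the sentinel matters.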



Re: [PATCH] drm/amdgpu: cleanup ttm debug sdma vram access function

2022-01-12 Thread Das, Nirmoy

LGTM acked-by: Nirmoy Das 


On 1/12/2022 7:52 PM, Jonathan Kim wrote:

Some suggested cleanups to declutter ttm when doing debug VRAM access over
SDMA.

v2: rename post_mortem_allowed func to has_timeouts_enabled.

Signed-off-by: Jonathan Kim 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h |  9 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 23 +++
  2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index a675dde81ce0..747d310aa72f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1448,6 +1448,15 @@ int amdgpu_device_set_cg_state(struct amdgpu_device 
*adev,
  int amdgpu_device_set_pg_state(struct amdgpu_device *adev,
   enum amd_powergating_state state);
  
+static inline bool amdgpu_device_has_timeouts_enabled(struct amdgpu_device *adev)

+{
+   return amdgpu_gpu_recovery != 0 &&
+   adev->gfx_timeout != MAX_SCHEDULE_TIMEOUT &&
+   adev->compute_timeout != MAX_SCHEDULE_TIMEOUT &&
+   adev->sdma_timeout != MAX_SCHEDULE_TIMEOUT &&
+   adev->video_timeout != MAX_SCHEDULE_TIMEOUT;
+}
+
  #include "amdgpu_object.h"
  
  static inline bool amdgpu_is_tmz(struct amdgpu_device *adev)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 33781509838c..b489cd8abe31 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1460,10 +1460,11 @@ static int amdgpu_ttm_access_memory_sdma(struct 
ttm_buffer_object *bo,
if (r)
goto out;
  
-	src_addr = write ? amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo) :

-   amdgpu_bo_gpu_offset(abo);
-   dst_addr = write ? amdgpu_bo_gpu_offset(abo) :
-   amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo);
+   src_addr = amdgpu_bo_gpu_offset(abo);
+   dst_addr = amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo);
+   if (write)
+   swap(src_addr, dst_addr);
+
amdgpu_emit_copy_buffer(adev, &job->ibs[0], src_addr, dst_addr, 
PAGE_SIZE, false);
  
  	amdgpu_ring_pad_ib(adev->mman.buffer_funcs_ring, &job->ibs[0]);

@@ -1486,15 +1487,6 @@ static int amdgpu_ttm_access_memory_sdma(struct 
ttm_buffer_object *bo,
return r;
  }
  
-static inline bool amdgpu_ttm_allow_post_mortem_debug(struct amdgpu_device *adev)

-{
-   return amdgpu_gpu_recovery == 0 ||
-   adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
-   adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
-   adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT ||
-   adev->video_timeout == MAX_SCHEDULE_TIMEOUT;
-}
-
  /**
   * amdgpu_ttm_access_memory - Read or Write memory that backs a buffer object.
   *
@@ -1519,7 +1511,7 @@ static int amdgpu_ttm_access_memory(struct 
ttm_buffer_object *bo,
if (bo->resource->mem_type != TTM_PL_VRAM)
return -EIO;
  
-	if (!amdgpu_ttm_allow_post_mortem_debug(adev) &&

+   if (amdgpu_device_has_timeouts_enabled(adev) &&
!amdgpu_ttm_access_memory_sdma(bo, offset, buf, len, 
write))
return len;
  
@@ -1909,8 +1901,7 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)

ttm_range_man_fini(&adev->mman.bdev, AMDGPU_PL_OA);
ttm_device_fini(&adev->mman.bdev);
adev->mman.initialized = false;
-   if (adev->mman.sdma_access_ptr)
-   amdgpu_bo_free_kernel(&adev->mman.sdma_access_bo, NULL,
+   amdgpu_bo_free_kernel(&adev->mman.sdma_access_bo, NULL,
&adev->mman.sdma_access_ptr);
DRM_INFO("amdgpu: ttm finalized\n");
  }


[PATCH] drm/amdgpu/swsmu: make sienna cichlid function static

2022-01-12 Thread Alex Deucher
Unused outside of the file.

Reported-by: kernel test robot 
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
index 9766870987db..4e37cd8025ed 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -3810,9 +3810,9 @@ static void sienna_cichlid_stb_init(struct smu_context 
*smu)
 
 }
 
-int sienna_cichlid_stb_get_data_direct(struct smu_context *smu,
-  void *buf,
-  uint32_t size)
+static int sienna_cichlid_stb_get_data_direct(struct smu_context *smu,
+ void *buf,
+ uint32_t size)
 {
uint32_t *p = buf;
struct amdgpu_device *adev = smu->adev;
-- 
2.34.1



[PATCH] drm/amdkfd: use proper interrupt handling for gfx10

2022-01-12 Thread Jonathan Kim
GFX10 has the following changes when handling interrupts in the KFD:
- no pasid workaround required
- SQ interrupt auto has different events
- SQ interrupt word is contiguous and only has 23-bit data.
Also SH is labelled as SA and workgroup id replaces CU id.
- SQ interrupt word is contiguous and only has 23 bits for err type and
err details.
- Sienna Cichlid uses a different client ID for SDMA3
(see soc15_ih_clients).

Signed-off-by: Jonathan Kim 
---
 drivers/gpu/drm/amd/amdkfd/Makefile   |   1 +
 drivers/gpu/drm/amd/amdkfd/kfd_device.c   |   4 +-
 .../gpu/drm/amd/amdkfd/kfd_int_process_v10.c  | 313 ++
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h |   1 +
 4 files changed, 318 insertions(+), 1 deletion(-)
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c

diff --git a/drivers/gpu/drm/amd/amdkfd/Makefile 
b/drivers/gpu/drm/amd/amdkfd/Makefile
index c4f3aff11072..87851840e9bd 100644
--- a/drivers/gpu/drm/amd/amdkfd/Makefile
+++ b/drivers/gpu/drm/amd/amdkfd/Makefile
@@ -51,6 +51,7 @@ AMDKFD_FILES  := $(AMDKFD_PATH)/kfd_module.o \
$(AMDKFD_PATH)/kfd_events.o \
$(AMDKFD_PATH)/cik_event_interrupt.o \
$(AMDKFD_PATH)/kfd_int_process_v9.o \
+   $(AMDKFD_PATH)/kfd_int_process_v10.o \
$(AMDKFD_PATH)/kfd_dbgdev.o \
$(AMDKFD_PATH)/kfd_dbgmgr.o \
$(AMDKFD_PATH)/kfd_smi_events.o \
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 5a47f437b455..7926e3b5a3e1 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -107,6 +107,8 @@ static void 
kfd_device_info_set_event_interrupt_class(struct kfd_dev *kfd)
case IP_VERSION(9, 4, 0): /* VEGA20 */
case IP_VERSION(9, 4, 1): /* ARCTURUS */
case IP_VERSION(9, 4, 2): /* ALDEBARAN */
+   kfd->device_info.event_interrupt_class = 
&event_interrupt_class_v9;
+   break;
case IP_VERSION(10, 3, 1): /* VANGOGH */
case IP_VERSION(10, 3, 3): /* YELLOW_CARP */
case IP_VERSION(10, 1, 3): /* CYAN_SKILLFISH */
@@ -117,7 +119,7 @@ static void 
kfd_device_info_set_event_interrupt_class(struct kfd_dev *kfd)
case IP_VERSION(10, 3, 2): /* NAVY_FLOUNDER */
case IP_VERSION(10, 3, 4): /* DIMGREY_CAVEFISH */
case IP_VERSION(10, 3, 5): /* BEIGE_GOBY */
-   kfd->device_info.event_interrupt_class = 
&event_interrupt_class_v9;
+   kfd->device_info.event_interrupt_class = 
&event_interrupt_class_v10;
break;
default:
dev_warn(kfd_device, "v9 event interrupt handler is set due to "
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c
new file mode 100644
index ..c9475f07dddf
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v10.c
@@ -0,0 +1,313 @@
+/*
+ * Copyright 2022 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ */
+
+#include "kfd_priv.h"
+#include "kfd_events.h"
+#include "soc15_int.h"
+#include "kfd_device_queue_manager.h"
+#include "kfd_smi_events.h"
+
+enum SQ_INTERRUPT_WORD_ENCODING {
+   SQ_INTERRUPT_WORD_ENCODING_AUTO = 0x0,
+   SQ_INTERRUPT_WORD_ENCODING_INST,
+   SQ_INTERRUPT_WORD_ENCODING_ERROR,
+};
+
+enum SQ_INTERRUPT_ERROR_TYPE {
+   SQ_INTERRUPT_ERROR_TYPE_EDC_FUE = 0x0,
+   SQ_INTERRUPT_ERROR_TYPE_ILLEGAL_INST,
+   SQ_INTERRUPT_ERROR_TYPE_MEMVIOL,
+   SQ_INTERRUPT_ERROR_TYPE_EDC_FED,
+};
+
+/* SQ_INTERRUPT_WORD_AUTO_CTXID */
+#define SQ_INTERRUPT_WORD_AUTO_CTXID__THREAD_TRACE__SHIFT 0
+#define SQ_INTERRUPT_WORD_AUTO_CTXID__WLT__SHIFT 1
+#define SQ_INTERRUPT_WORD_AUTO_CTXID__THREAD_TRACE_BUF0_FULL__SHIFT 2
+#define SQ_INTERRUPT_WORD_AUTO_CTXID__THREAD_TRACE_BUF1_FULL__SHIFT 3
+#define SQ_INTERRUPT_WORD_AUTO_CTXID

[PATCH Review 1/1] drm/amdgpu: handle denied inject error into critical regions v2

2022-01-12 Thread Stanley . Yang
Changed from v1:
remove unused brace

Signed-off-by: Stanley.Yang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 9 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/ta_ras_if.h  | 3 ++-
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index c742d1aacf5a..144176779f9e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -1309,6 +1309,11 @@ static void psp_ras_ta_check_status(struct psp_context 
*psp)
break;
case TA_RAS_STATUS__SUCCESS:
break;
+   case TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED:
+   if (ras_cmd->cmd_id == TA_RAS_COMMAND__TRIGGER_ERROR)
+   dev_warn(psp->adev->dev,
+   "RAS INFO: Inject error to critical 
region is not allowed\n");
+   break;
default:
dev_warn(psp->adev->dev,
"RAS WARNING: ras status = 0x%X\n", 
ras_cmd->ras_status);
@@ -1521,7 +1526,9 @@ int psp_ras_trigger_error(struct psp_context *psp,
if (amdgpu_ras_intr_triggered())
return 0;
 
-   if (ras_cmd->ras_status)
+   if (ras_cmd->ras_status == TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED)
+   return -EACCES;
+   else if (ras_cmd->ras_status)
return -EINVAL;
 
return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index e674dbed3615..8bdc2e85cb20 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -449,7 +449,7 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f,
}
 
if (ret)
-   return -EINVAL;
+   return ret;
 
return size;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h 
b/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
index 5093826a43d1..509d8a1945eb 100644
--- a/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
+++ b/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
@@ -64,7 +64,8 @@ enum ta_ras_status {
TA_RAS_STATUS__ERROR_PCS_STATE_ERROR= 0xA016,
TA_RAS_STATUS__ERROR_PCS_STATE_HANG = 0xA017,
TA_RAS_STATUS__ERROR_PCS_STATE_UNKNOWN  = 0xA018,
-   TA_RAS_STATUS__ERROR_UNSUPPORTED_ERROR_INJ  = 0xA019
+   TA_RAS_STATUS__ERROR_UNSUPPORTED_ERROR_INJ  = 0xA019,
+   TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED  = 0xA01A
 };
 
 enum ta_ras_block {
-- 
2.17.1
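
The effect of the patch on the error path can be sketched in userspace as a status-to-errno mapping plus the debugfs write result. The function names below are hypothetical; the 0xA01A value comes from the ta_ras_if.h hunk, and treating 0 as success is an assumption based on the existing "if (ras_cmd->ras_status)" check.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Value taken from the ta_ras_if.h hunk in the patch. */
#define STATUS_TEE_ERROR_ACCESS_DENIED 0xA01Au

/*
 * Mirrors the new tail of psp_ras_trigger_error(): access denied gets
 * its own errno, any other non-zero status stays -EINVAL.
 */
static int ras_status_to_errno(uint32_t ras_status)
{
	if (ras_status == STATUS_TEE_ERROR_ACCESS_DENIED)
		return -EACCES;
	else if (ras_status)
		return -EINVAL;

	return 0;
}

/*
 * Mirrors the new tail of amdgpu_ras_debugfs_ctrl_write(): the error
 * code is passed through instead of being flattened to -EINVAL.
 */
static long ctrl_write_result(int ret, long size)
{
	if (ret)
		return ret;

	return size;
}
```

With this, a write that tries to inject an error into a critical region would surface to user space as EACCES rather than EINVAL, assuming the error code propagates unchanged through the intermediate callers.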



RE: [PATCH] drm/amdgpu/swsmu: make sienna cichlid function static

2022-01-12 Thread Quan, Evan
Reviewed-by: Evan Quan 

> -Original Message-
> From: amd-gfx  On Behalf Of Alex
> Deucher
> Sent: Thursday, January 13, 2022 5:26 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; kernel test robot
> 
> Subject: [PATCH] drm/amdgpu/swsmu: make sienna cichlid function static
> 
> Unused outside of the file.
> 
> Reported-by: kernel test robot 
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> index 9766870987db..4e37cd8025ed 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> @@ -3810,9 +3810,9 @@ static void sienna_cichlid_stb_init(struct
> smu_context *smu)
> 
>  }
> 
> -int sienna_cichlid_stb_get_data_direct(struct smu_context *smu,
> -void *buf,
> -uint32_t size)
> +static int sienna_cichlid_stb_get_data_direct(struct smu_context *smu,
> +   void *buf,
> +   uint32_t size)
>  {
>   uint32_t *p = buf;
>   struct amdgpu_device *adev = smu->adev;
> --
> 2.34.1


[PATCH -next 1/2] drm/amdgpu: remove unneeded semicolon

2022-01-12 Thread Yang Li
Eliminate the following coccicheck warning:
./drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2725:16-17: Unneeded semicolon

Reported-by: Abaci Robot 
Signed-off-by: Yang Li 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index d4d9b9ea8bbd..7d9d99e581da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2722,7 +2722,7 @@ struct amdgpu_ras* amdgpu_ras_get_context(struct 
amdgpu_device *adev)
 int amdgpu_ras_set_context(struct amdgpu_device *adev, struct amdgpu_ras* 
ras_con)
 {
if (!adev)
-   return -EINVAL;;
+   return -EINVAL;
 
adev->psp.ras_context.ras = ras_con;
return 0;
-- 
2.20.1.7.g153144c



[PATCH -next 2/2] drm/amdgpu: clean up some inconsistent indenting

2022-01-12 Thread Yang Li
Eliminate the following smatch warnings:
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:3504 amdgpu_device_init()
warn: inconsistent indenting
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1716
amdgpu_ras_error_status_query() warn: if statement not indented
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2724 amdgpu_ras_set_context()
warn: if statement not indented
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1058 amdgpu_ras_error_inject()
warn: inconsistent indenting

Reported-by: Abaci Robot 
Signed-off-by: Yang Li 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c| 10 ++
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 33388041c354..64d6c0af4c76 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3499,7 +3499,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
mutex_init(&adev->notifier_lock);
mutex_init(&adev->pm.stable_pstate_ctx_lock);
 
-amdgpu_device_init_apu_flags(adev);
+   amdgpu_device_init_apu_flags(adev);
 
r = amdgpu_device_check_arguments(adev);
if (r)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 7d9d99e581da..6d84749698c8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1055,8 +1055,10 @@ int amdgpu_ras_error_inject(struct amdgpu_device *adev,
.address = info->address,
.value = info->value,
};
-int ret = -EINVAL;
-struct amdgpu_ras_block_object* block_obj = amdgpu_ras_get_ras_block(adev, 
info->head.block, info->head.sub_block_index);
+   int ret = -EINVAL;
+   struct amdgpu_ras_block_object *block_obj = 
amdgpu_ras_get_ras_block(adev,
+   info->head.block,
+   
info->head.sub_block_index);
 
if (!obj)
return -EINVAL;
@@ -1714,7 +1716,7 @@ static void amdgpu_ras_error_status_query(struct 
amdgpu_device *adev,
}
 
if (block_obj->hw_ops->query_ras_error_status)
-   block_obj->hw_ops->query_ras_error_status(adev);
+   block_obj->hw_ops->query_ras_error_status(adev);
 
 }
 
@@ -2722,7 +2724,7 @@ struct amdgpu_ras* amdgpu_ras_get_context(struct 
amdgpu_device *adev)
 int amdgpu_ras_set_context(struct amdgpu_device *adev, struct amdgpu_ras* 
ras_con)
 {
if (!adev)
-   return -EINVAL;
+   return -EINVAL;
 
adev->psp.ras_context.ras = ras_con;
return 0;
-- 
2.20.1.7.g153144c



RE: [PATCH 2/2] drm/amdgpu: No longer insert ras blocks into ras_list if it already exists in ras_list

2022-01-12 Thread Chai, Thomas
Hi Felix:
 amdgpu_ras_register_ras_block is called by all IP ras blocks, and every
IP also has different ras versions. Doing the common work in one place
reduces the chance of the ras function going wrong.

-Original Message-
From: Kuehling, Felix  
Sent: Thursday, January 13, 2022 12:39 AM
To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
Cc: Zhou1, Tao ; Zhang, Hawking ; 
Clements, John ; Chai, Thomas 
Subject: Re: [PATCH 2/2] drm/amdgpu: No longer insert ras blocks into ras_list 
if it already exists in ras_list


On 2022-01-12 at 2:48 a.m., yipechai wrote:
> No longer insert ras blocks into ras_list if it already exists in ras_list.
>
> Signed-off-by: yipechai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 62be0b4909b3..e6d3bb4b56e4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2754,9 +2754,17 @@ int amdgpu_ras_reset_gpu(struct amdgpu_device 
> *adev)  int amdgpu_ras_register_ras_block(struct amdgpu_device *adev,
>   struct amdgpu_ras_block_object* ras_block_obj)  {
> + struct amdgpu_ras_block_object *obj, *tmp;
>   if (!adev || !amdgpu_ras_asic_supported(adev) || !ras_block_obj)
>   return -EINVAL;
>  
> + /* If the ras object had been in ras_list, doesn't add it to ras_list 
> again */
> + list_for_each_entry_safe(obj, tmp, &adev->ras_list, node) {
> + if (obj == ras_block_obj) {
Instead of a loop, can't this be done more efficiently with "if 
(!list_empty(&ras_block_obj->node))"?

Of course this would require that you move the INIT_LIST_HEAD to some earlier 
stage so that list_empty is reliable.

Regards,
  Felix


> + return 0;
> + }
> + }
> +
>   INIT_LIST_HEAD(&ras_block_obj->node);
>   list_add_tail(&ras_block_obj->node, &adev->ras_list);
>  
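
Felix's suggestion can be sketched in userspace C as follows. This is only an illustration under stated assumptions: the list helpers are minimal stand-ins for the kernel's INIT_LIST_HEAD, list_empty and list_add_tail, and struct ras_block, ras_block_init and ras_block_register are hypothetical names, not the driver's. The point is that the node has to be initialized once, before the first registration attempt, so that an empty-node test reliably means "not on any list yet".

```c
#include <assert.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static int list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_to_tail(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Hypothetical block object; only the list node matters here. */
struct ras_block {
	struct list_head node;
};

/* Must run once, before any registration attempt. */
void ras_block_init(struct ras_block *blk)
{
	init_list_head(&blk->node);
}

/* Returns 1 if the block was added, 0 if it was already on a list. */
int ras_block_register(struct list_head *ras_list, struct ras_block *blk)
{
	if (!list_is_empty(&blk->node))
		return 0;	/* O(1) duplicate check, no list walk */

	list_add_to_tail(&blk->node, ras_list);
	return 1;
}

/* Small self-check: the second registration of the same block is a no-op. */
void ras_block_register_demo(void)
{
	struct list_head ras_list;
	struct ras_block a;

	init_list_head(&ras_list);
	ras_block_init(&a);
	assert(ras_block_register(&ras_list, &a) == 1);
	assert(ras_block_register(&ras_list, &a) == 0);
}
```

Note that the check only works if nothing re-initializes the node between registrations, which is why the INIT_LIST_HEAD call cannot stay inside the register function.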


[pull] amdgpu, amdkfd drm-next-5.17

2022-01-12 Thread Alex Deucher
Hi Dave, Daniel,

Fixes for 5.17.

The following changes since commit cb6846fbb83b574c85c2a80211b402a6347b60b1:

  Merge tag 'amd-drm-next-5.17-2021-12-30' of 
ssh://gitlab.freedesktop.org/agd5f/linux into drm-next (2021-12-31 10:59:17 
+1000)

are available in the Git repository at:

  https://gitlab.freedesktop.org/agd5f/linux.git 
tags/amd-drm-next-5.17-2022-01-12

for you to fetch changes up to 5eb877b282fecc8b8a6ac6d4ce0d5057f9d3bad0:

  drm/amdkfd: Fix ASIC name typos (2022-01-11 15:44:28 -0500)


amd-drm-next-5.17-2022-01-12:

amdgpu:
- SR-IOV fixes
- Suspend/resume fixes
- Display fixes
- DMCUB fixes
- DP alt mode fixes
- RAS fixes
- UBSAN fix
- Navy Flounder VCN fix
- ttm resource manager cleanup
- default_groups change for kobj_type
- vkms fix
- Aldebaran fixes

amdkfd:
- SDMA ECC interrupt fix
- License clarification
- Pointer check fix
- DQM fixes for hawaii
- default_groups change for kobj_type
- Typo fixes


Charlene Liu (1):
  drm/amd/display: Add check for forced_clocks debug option

Evan Quan (1):
  drm/amd/pm: keep the BACO feature enabled for suspend

Felix Kuehling (3):
  drm/amdkfd: Use prange->list head for insert_list
  drm/amdkfd: Use prange->update_list head for remove_list
  drm/amdkfd: Fix DQM asserts on Hawaii

Greg Kroah-Hartman (2):
  drm/amdgpu: use default_groups in kobj_type
  drm/amdkfd: use default_groups in kobj_type

Guchun Chen (1):
  drm/amdgpu: use spin_lock_irqsave to avoid deadlock by local interrupt

Harry Wentland (1):
  drm/amdgpu: Use correct VIEWPORT_DIMENSION for DCN2

James Yao (1):
  drm/amdgpu: add dummy event6 for vega10

Jiasheng Jiang (1):
  drm/amdkfd: Check for null pointer after calling kmemdup

Jiawei Gu (1):
  drm/amdgpu: Clear garbage data in err_data before usage

José Expósito (1):
  drm/amd/display: invalid parameter check in dmub_hpd_callback

Kent Russell (1):
  drm/amdkfd: Fix ASIC name typos

Leslie Shi (1):
  drm/amdgpu: Unmap MMIO mappings when device is not unplugged

Lukas Bulwahn (1):
  drm/amdkfd: make SPDX License expression more sound

Mario Limonciello (4):
  drm/amdgpu: explicitly check for s0ix when evicting resources
  drm/amdgpu: don't set s3 and s0ix at the same time
  drm/amd/display: explicitly set is_dsc_supported to false before use
  drm/amd/display: reset dcn31 SMU mailbox on failures

Mikita Lipski (1):
  drm/amd/display: introduce mpo detection flags

Nicholas Kazlauskas (2):
  drm/amd/display: Don't reinitialize DMCUB on s0ix resume
  drm/amd/display: Add version check before using DP alt query interface

Nirmoy Das (4):
  drm/amdgpu: do not pass ttm_resource_manager to gtt_mgr
  drm/amdkfd: remove unused function
  drm/amdgpu: do not pass ttm_resource_manager to vram_mgr
  drm/amdgpu: recover gart table at resume

Peng Ju Zhou (1):
  drm/amdgpu: Enable second VCN for certain Navy Flounder.

Prike Liang (1):
  drm/amdgpu: not return error on the init_apu_flags

Rajneesh Bhardwaj (1):
  Revert "drm/amdgpu: Don't inherit GEM object VMAs in child process"

Tao Zhou (1):
  drm/amd/pm: only send GmiPwrDnControl msg on master die (v3)

Tom St Denis (1):
  drm/amd/amdgpu: Add pcie indirect support to amdgpu_mm_wreg_mmio_rlc()

Wenjing Liu (1):
  drm/amd/display: unhard code link to phy idx mapping in dc link and clean 
up

Yi-Ling Chen (1):
  drm/amd/display: Fix underflow for fused display pipes case

yipechai (1):
  drm/amdkfd: enable sdma ecc interrupt event can be handled by 
event_interrupt_wq_v9

 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |   7 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |   1 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c |   5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  36 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c  |   3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|   6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c   |  84 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c|   3 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c|  17 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c|  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  12 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c|   7 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c|  11 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h|  12 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   |   9 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c   |   5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c   |  40 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |   3 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c |   3 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c  |   3 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c   

RE: [PATCH V2 1/2] drm/amdgpu: Add ras supported check for register_ras_block

2022-01-12 Thread Zhou1, Tao
[AMD Official Use Only]

The series is:

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Chai, Thomas 
> Sent: Wednesday, January 12, 2022 6:39 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chai, Thomas ; Zhang, Hawking
> ; Zhou1, Tao ; Clements,
> John ; Chai, Thomas 
> Subject: [PATCH V2 1/2] drm/amdgpu: Add ras supported check for
> register_ras_block
> 
> Add ras supported check for register_ras_block.
> 
> Signed-off-by: yipechai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index b1bedfd4febc..614ae8455c9f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2757,6 +2757,9 @@ int amdgpu_ras_register_ras_block(struct
> amdgpu_device *adev,
>   if (!adev || !ras_block_obj)
>   return -EINVAL;
> 
> + if (!amdgpu_ras_asic_supported(adev))
> + return 0;
> +
>   INIT_LIST_HEAD(&ras_block_obj->node);
>   list_add_tail(&ras_block_obj->node, &adev->ras_list);
> 
> --
> 2.25.1


RE: [PATCH Review 1/1] drm/amdgpu: handle denied inject error into critical regions v2

2022-01-12 Thread Zhou1, Tao
[AMD Official Use Only]

Since you use dev_warn, "RAS WARNING" is better than "RAS INFO" in the print 
message, with this fixed the patch is:

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Stanley.Yang 
> Sent: Thursday, January 13, 2022 9:28 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhang, Hawking ; Clements, John
> ; Zhou1, Tao ; Yang,
> Stanley 
> Subject: [PATCH Review 1/1] drm/amdgpu: handle denied inject error into
> critical regions v2
> 
> Changed from v1:
> remove unused brace
> 
> Signed-off-by: Stanley.Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 9 -
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/ta_ras_if.h  | 3 ++-
>  3 files changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index c742d1aacf5a..144176779f9e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -1309,6 +1309,11 @@ static void psp_ras_ta_check_status(struct
> psp_context *psp)
>   break;
>   case TA_RAS_STATUS__SUCCESS:
>   break;
> + case TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED:
> + if (ras_cmd->cmd_id == TA_RAS_COMMAND__TRIGGER_ERROR)
> + dev_warn(psp->adev->dev,
> + "RAS INFO: Inject error to critical
> region is not allowed\n");
> + break;
>   default:
>   dev_warn(psp->adev->dev,
>   "RAS WARNING: ras status = 0x%X\n",
> ras_cmd->ras_status);
> @@ -1521,7 +1526,9 @@ int psp_ras_trigger_error(struct psp_context *psp,
>   if (amdgpu_ras_intr_triggered())
>   return 0;
> 
> - if (ras_cmd->ras_status)
> + if (ras_cmd->ras_status ==
> TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED)
> + return -EACCES;
> + else if (ras_cmd->ras_status)
>   return -EINVAL;
> 
>   return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index e674dbed3615..8bdc2e85cb20 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -449,7 +449,7 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file
> *f,
>   }
> 
>   if (ret)
> - return -EINVAL;
> + return ret;
> 
>   return size;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> b/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> index 5093826a43d1..509d8a1945eb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> +++ b/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> @@ -64,7 +64,8 @@ enum ta_ras_status {
>   TA_RAS_STATUS__ERROR_PCS_STATE_ERROR= 0xA016,
>   TA_RAS_STATUS__ERROR_PCS_STATE_HANG = 0xA017,
>   TA_RAS_STATUS__ERROR_PCS_STATE_UNKNOWN  = 0xA018,
> - TA_RAS_STATUS__ERROR_UNSUPPORTED_ERROR_INJ  = 0xA019
> + TA_RAS_STATUS__ERROR_UNSUPPORTED_ERROR_INJ  = 0xA019,
> + TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED  = 0xA01A
>  };
> 
>  enum ta_ras_block {
> --
> 2.17.1
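
The error path this patch sets up can be sketched in plain C. This is a simplified illustration, not the driver code: the enum and function names are hypothetical, and only the two status values (0xA019, 0xA01A) are taken from the patch. It shows why the debugfs write handler has to return ret rather than a fixed -EINVAL: otherwise the distinct -EACCES would be flattened before it reaches the caller.

```c
#include <assert.h>
#include <errno.h>

/* Status values copied from the patch; 0 means success. */
enum ras_status_sketch {
	RAS_STATUS_SUCCESS = 0,
	RAS_STATUS_UNSUPPORTED_ERROR_INJ = 0xA019,
	RAS_STATUS_ACCESS_DENIED = 0xA01A,
};

/* Mirrors the tail of psp_ras_trigger_error() after the patch. */
int trigger_error_result(unsigned int ras_status)
{
	if (ras_status == RAS_STATUS_ACCESS_DENIED)
		return -EACCES;
	else if (ras_status)
		return -EINVAL;

	return 0;
}

/* Mirrors the tail of the debugfs write handler: pass the error through. */
long ctrl_write_result(unsigned int ras_status, long size)
{
	int ret = trigger_error_result(ras_status);

	if (ret)
		return ret;	/* was: return -EINVAL */

	return size;
}

/* Small self-check of the three outcomes. */
void ctrl_write_demo(void)
{
	assert(ctrl_write_result(RAS_STATUS_SUCCESS, 16) == 16);
	assert(ctrl_write_result(RAS_STATUS_ACCESS_DENIED, 16) == -EACCES);
	assert(ctrl_write_result(RAS_STATUS_UNSUPPORTED_ERROR_INJ, 16) == -EINVAL);
}
```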


RE: [PATCH -next 1/2] drm/amdgpu: remove unneeded semicolon

2022-01-12 Thread Chen, Guchun
Thanks for your patch, Yang. Can you please also fix the original indentation 
problem?

if (!adev)
-   return -EINVAL;;
+   return -EINVAL;

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Yang Li
Sent: Thursday, January 13, 2022 9:22 AM
To: airl...@linux.ie
Cc: Pan, Xinhui ; Abaci Robot ; 
linux-ker...@vger.kernel.org; dri-de...@lists.freedesktop.org; Yang Li 
; amd-gfx@lists.freedesktop.org; dan...@ffwll.ch; 
Deucher, Alexander ; Koenig, Christian 

Subject: [PATCH -next 1/2] drm/amdgpu: remove unneeded semicolon

Eliminate the following coccicheck warning:
./drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2725:16-17: Unneeded semicolon

Reported-by: Abaci Robot 
Signed-off-by: Yang Li 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index d4d9b9ea8bbd..7d9d99e581da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2722,7 +2722,7 @@ struct amdgpu_ras* amdgpu_ras_get_context(struct 
amdgpu_device *adev)
 int amdgpu_ras_set_context(struct amdgpu_device *adev, struct amdgpu_ras* 
ras_con)
 {
if (!adev)
-   return -EINVAL;;
+   return -EINVAL;
 
adev->psp.ras_context.ras = ras_con;
return 0;
--
2.20.1.7.g153144c



[PATCH] drm/amdgpu: don't do resets on APUs which don't support it

2022-01-12 Thread Alex Deucher
It can cause a hang.  This is normally not enabled for GPU
hangs on these asics, but was recently enabled for handling
aborted suspends.  This causes hangs on some platforms
on suspend.

Fixes: daf8de0874ab5b ("drm/amdgpu: always reset the asic in suspend (v2)")
Cc: sta...@vger.kernel.org
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1858
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/cik.c | 4 
 drivers/gpu/drm/amd/amdgpu/vi.c  | 4 
 2 files changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/cik.c b/drivers/gpu/drm/amd/amdgpu/cik.c
index 54f28c075f21..f10ce740a29c 100644
--- a/drivers/gpu/drm/amd/amdgpu/cik.c
+++ b/drivers/gpu/drm/amd/amdgpu/cik.c
@@ -1428,6 +1428,10 @@ static int cik_asic_reset(struct amdgpu_device *adev)
 {
int r;
 
+   /* APUs don't have full asic reset */
+   if (adev->flags & AMD_IS_APU)
+   return 0;
+
if (cik_asic_reset_method(adev) == AMD_RESET_METHOD_BACO) {
dev_info(adev->dev, "BACO reset\n");
r = amdgpu_dpm_baco_reset(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c
index fe9a7cc8d9eb..6645ebbd2696 100644
--- a/drivers/gpu/drm/amd/amdgpu/vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/vi.c
@@ -956,6 +956,10 @@ static int vi_asic_reset(struct amdgpu_device *adev)
 {
int r;
 
+   /* APUs don't have full asic reset */
+   if (adev->flags & AMD_IS_APU)
+   return 0;
+
if (vi_asic_reset_method(adev) == AMD_RESET_METHOD_BACO) {
dev_info(adev->dev, "BACO reset\n");
r = amdgpu_dpm_baco_reset(adev);
-- 
2.34.1



RE: [PATCH] drm/amdgpu: don't do resets on APUs which don't support it

2022-01-12 Thread Chen, Guchun
Acked-by: Guchun Chen 

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Thursday, January 13, 2022 12:01 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; sta...@vger.kernel.org
Subject: [PATCH] drm/amdgpu: don't do resets on APUs which don't support it

It can cause a hang.  This is normally not enabled for GPU hangs on these 
asics, but was recently enabled for handling aborted suspends.  This causes 
hangs on some platforms on suspend.

Fixes: daf8de0874ab5b ("drm/amdgpu: always reset the asic in suspend (v2)")
Cc: sta...@vger.kernel.org
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1858
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/cik.c | 4 
 drivers/gpu/drm/amd/amdgpu/vi.c  | 4 
 2 files changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/cik.c b/drivers/gpu/drm/amd/amdgpu/cik.c
index 54f28c075f21..f10ce740a29c 100644
--- a/drivers/gpu/drm/amd/amdgpu/cik.c
+++ b/drivers/gpu/drm/amd/amdgpu/cik.c
@@ -1428,6 +1428,10 @@ static int cik_asic_reset(struct amdgpu_device *adev)
 {
int r;
 
+   /* APUs don't have full asic reset */
+   if (adev->flags & AMD_IS_APU)
+   return 0;
+
if (cik_asic_reset_method(adev) == AMD_RESET_METHOD_BACO) {
dev_info(adev->dev, "BACO reset\n");
r = amdgpu_dpm_baco_reset(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c 
index fe9a7cc8d9eb..6645ebbd2696 100644
--- a/drivers/gpu/drm/amd/amdgpu/vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/vi.c
@@ -956,6 +956,10 @@ static int vi_asic_reset(struct amdgpu_device *adev)
 {
int r;
 
+   /* APUs don't have full asic reset */
+   if (adev->flags & AMD_IS_APU)
+   return 0;
+
if (vi_asic_reset_method(adev) == AMD_RESET_METHOD_BACO) {
dev_info(adev->dev, "BACO reset\n");
r = amdgpu_dpm_baco_reset(adev);
--
2.34.1



[bug report] drm/amdgpu: Modify gfx block to fit for the unified ras block data and ops

2022-01-12 Thread Dan Carpenter
Hello yipechai,

The patch d51ce4db0747: "drm/amdgpu: Modify gfx block to fit for the
unified ras block data and ops" from Jan 4, 2022, leads to the
following Smatch static checker warning:

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1058 amdgpu_ras_error_inject()
warn: inconsistent indenting

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
1047 int amdgpu_ras_error_inject(struct amdgpu_device *adev,
1048 struct ras_inject_if *info)
1049 {
1050 struct ras_manager *obj = amdgpu_ras_find_obj(adev, 
&info->head);
1051 struct ta_ras_trigger_error_input block_info = {
1052 .block_id =  amdgpu_ras_block_to_ta(info->head.block),
1053 .inject_error_type = 
amdgpu_ras_error_to_ta(info->head.type),
1054 .sub_block_index = info->head.sub_block_index,
1055 .address = info->address,
1056 .value = info->value,
1057 };
--> 1058 int ret = -EINVAL;
1059 struct amdgpu_ras_block_object* block_obj = 
amdgpu_ras_get_ras_block(adev, info->head.block, info->head.sub_block_index);

Really?  You can't be bothered to run checkpatch on your code?  AMD drm
code is uniquely bad in this regards.  It's the only place outside of
drivers/staging/ where you see stuff like this.

In theory, it's admirable to be this informal and free from bureaucracy
and rules.  But in another way, this kind of code is like plumber crack.
You might be a good plumber but it's not attractive.  And we might not
point it out, but we all see it.

1060 
1061 if (!obj)
1062 return -EINVAL;
1063 
1064 if (!block_obj || !block_obj->hw_ops){
1065 dev_info(adev->dev, "%s doesn't config ras function 
\n", get_ras_block_str(&info->head));
1066 return -EINVAL;
1067 }

regards,
dan carpenter


[PATCH] drm/amdgpu: Indent some if statements

2022-01-12 Thread Dan Carpenter
These if statements need to be indented.

Signed-off-by: Dan Carpenter 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index d4d9b9ea8bbd..777def770dc8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1714,8 +1714,7 @@ static void amdgpu_ras_error_status_query(struct 
amdgpu_device *adev,
}
 
if (block_obj->hw_ops->query_ras_error_status)
-   block_obj->hw_ops->query_ras_error_status(adev);
-
+   block_obj->hw_ops->query_ras_error_status(adev);
 }
 
 static void amdgpu_ras_query_err_status(struct amdgpu_device *adev)
@@ -2722,7 +2721,7 @@ struct amdgpu_ras* amdgpu_ras_get_context(struct 
amdgpu_device *adev)
 int amdgpu_ras_set_context(struct amdgpu_device *adev, struct amdgpu_ras* 
ras_con)
 {
if (!adev)
-   return -EINVAL;;
+   return -EINVAL;
 
adev->psp.ras_context.ras = ras_con;
return 0;
-- 
2.20.1



RE: [PATCH Review 1/1] drm/amdgpu: handle denied inject error into critical regions v2

2022-01-12 Thread Yang, Stanley
[AMD Official Use Only]

Thanks, will update before submit.

Regards,
Stanley
> -Original Message-
> From: Zhou1, Tao 
> Sent: Thursday, January 13, 2022 11:29 AM
> To: Yang, Stanley ; amd-
> g...@lists.freedesktop.org
> Cc: Zhang, Hawking ; Clements, John
> ; Yang, Stanley 
> Subject: RE: [PATCH Review 1/1] drm/amdgpu: handle denied inject error into
> critical regions v2
> 
> [AMD Official Use Only]
> 
> Since you use dev_warn, "RAS WARNING" is better than "RAS INFO" in the
> print message, with this fixed the patch is:
> 
> Reviewed-by: Tao Zhou 
> 
> > -Original Message-
> > From: Stanley.Yang 
> > Sent: Thursday, January 13, 2022 9:28 AM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Zhang, Hawking ; Clements, John
> > ; Zhou1, Tao ; Yang,
> Stanley
> > 
> > Subject: [PATCH Review 1/1] drm/amdgpu: handle denied inject error
> > into critical regions v2
> >
> > Changed from v1:
> > remove unused brace
> >
> > Signed-off-by: Stanley.Yang 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 9 -
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
> > drivers/gpu/drm/amd/amdgpu/ta_ras_if.h  | 3 ++-
> >  3 files changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > index c742d1aacf5a..144176779f9e 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > @@ -1309,6 +1309,11 @@ static void psp_ras_ta_check_status(struct
> > psp_context *psp)
> > break;
> > case TA_RAS_STATUS__SUCCESS:
> > break;
> > +   case TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED:
> > +   if (ras_cmd->cmd_id ==
> TA_RAS_COMMAND__TRIGGER_ERROR)
> > +   dev_warn(psp->adev->dev,
> > +   "RAS INFO: Inject error to critical
> > region is not allowed\n");
> > +   break;
> > default:
> > dev_warn(psp->adev->dev,
> > "RAS WARNING: ras status = 0x%X\n",
> > ras_cmd->ras_status);
> > @@ -1521,7 +1526,9 @@ int psp_ras_trigger_error(struct psp_context *psp,
> > if (amdgpu_ras_intr_triggered())
> > return 0;
> >
> > -   if (ras_cmd->ras_status)
> > +   if (ras_cmd->ras_status ==
> > TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED)
> > +   return -EACCES;
> > +   else if (ras_cmd->ras_status)
> > return -EINVAL;
> >
> > return 0;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index e674dbed3615..8bdc2e85cb20 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -449,7 +449,7 @@ static ssize_t
> > amdgpu_ras_debugfs_ctrl_write(struct file *f,
> > }
> >
> > if (ret)
> > -   return -EINVAL;
> > +   return ret;
> >
> > return size;
> >  }
> > diff --git a/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> > b/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> > index 5093826a43d1..509d8a1945eb 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
> > @@ -64,7 +64,8 @@ enum ta_ras_status {
> > TA_RAS_STATUS__ERROR_PCS_STATE_ERROR= 0xA016,
> > TA_RAS_STATUS__ERROR_PCS_STATE_HANG = 0xA017,
> > TA_RAS_STATUS__ERROR_PCS_STATE_UNKNOWN  = 0xA018,
> > -   TA_RAS_STATUS__ERROR_UNSUPPORTED_ERROR_INJ  = 0xA019
> > +   TA_RAS_STATUS__ERROR_UNSUPPORTED_ERROR_INJ  = 0xA019,
> > +   TA_RAS_STATUS__TEE_ERROR_ACCESS_DENIED  = 0xA01A
> >  };
> >
> >  enum ta_ras_block {
> > --
> > 2.17.1


Re: [PATCH] drm/amdgpu: don't do resets on APUs which don't support it

2022-01-12 Thread Lazar, Lijo

Hi Alex,

What about something like this?

bool amdgpu_device_reset_on_suspend(struct amdgpu_device *adev)
{
if (adev->in_s0ix || adev->gmc.xgmi.num_physical_nodes > 1)
return false;

switch (amdgpu_asic_reset_method(adev)) {
case AMD_RESET_METHOD_BACO:
case AMD_RESET_METHOD_MODE1:
case AMD_RESET_METHOD_MODE2:
return true;
}

return false;
}

Thanks,
Lijo

On 1/13/2022 9:31 AM, Alex Deucher wrote:

It can cause a hang.  This is normally not enabled for GPU
hangs on these asics, but was recently enabled for handling
aborted suspends.  This causes hangs on some platforms
on suspend.

Fixes: daf8de0874ab5b ("drm/amdgpu: always reset the asic in suspend (v2)")
Cc: sta...@vger.kernel.org
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1858
Signed-off-by: Alex Deucher 
---
  drivers/gpu/drm/amd/amdgpu/cik.c | 4 
  drivers/gpu/drm/amd/amdgpu/vi.c  | 4 
  2 files changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/cik.c b/drivers/gpu/drm/amd/amdgpu/cik.c
index 54f28c075f21..f10ce740a29c 100644
--- a/drivers/gpu/drm/amd/amdgpu/cik.c
+++ b/drivers/gpu/drm/amd/amdgpu/cik.c
@@ -1428,6 +1428,10 @@ static int cik_asic_reset(struct amdgpu_device *adev)
  {
int r;
  
+   /* APUs don't have full asic reset */
+   if (adev->flags & AMD_IS_APU)
+   return 0;
+
if (cik_asic_reset_method(adev) == AMD_RESET_METHOD_BACO) {
dev_info(adev->dev, "BACO reset\n");
r = amdgpu_dpm_baco_reset(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c
index fe9a7cc8d9eb..6645ebbd2696 100644
--- a/drivers/gpu/drm/amd/amdgpu/vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/vi.c
@@ -956,6 +956,10 @@ static int vi_asic_reset(struct amdgpu_device *adev)
  {
int r;
  
+   /* APUs don't have full asic reset */
+   if (adev->flags & AMD_IS_APU)
+   return 0;
+
if (vi_asic_reset_method(adev) == AMD_RESET_METHOD_BACO) {
dev_info(adev->dev, "BACO reset\n");
r = amdgpu_dpm_baco_reset(adev);



[PATCH] drm/amdgpu: fix null ptr access

2022-01-12 Thread Flora Cui
Check for a null pointer before accessing its members.

Signed-off-by: Flora Cui 
---
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 2 +-
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index f0daa66f5b3d..5fc33893a68c 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -463,7 +463,7 @@ int amdgpu_pm_load_smu_firmware(struct amdgpu_device *adev, 
uint32_t *smu_versio
const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
int r = 0;
 
-   if (!pp_funcs->load_firmware)
+   if (!pp_funcs || !pp_funcs->load_firmware)
return 0;
 
mutex_lock(&adev->pm.mutex);
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index 828cb932f6a9..aa640a9c6137 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -3281,7 +3281,7 @@ void amdgpu_smu_stb_debug_fs_init(struct amdgpu_device 
*adev)
 
struct smu_context *smu = adev->powerplay.pp_handle;
 
-   if (!smu->stb_context.stb_buf_size)
+   if (!smu || !smu->stb_context.stb_buf_size)
return;
 
debugfs_create_file_size("amdgpu_smu_stb_dump",
-- 
2.25.1



Re: [PATCH] drm/amdgpu: cleanup ttm debug sdma vram access function

2022-01-12 Thread Christian König

Am 12.01.22 um 19:52 schrieb Jonathan Kim:

Some suggested cleanups to declutter ttm when doing debug VRAM access over
SDMA.

v2: rename post_mortem_allowed func to has_timeouts_enable.

Signed-off-by: Jonathan Kim 


Reviewed-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h |  9 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 23 +++
  2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index a675dde81ce0..747d310aa72f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1448,6 +1448,15 @@ int amdgpu_device_set_cg_state(struct amdgpu_device 
*adev,
  int amdgpu_device_set_pg_state(struct amdgpu_device *adev,
   enum amd_powergating_state state);
  
+static inline bool amdgpu_device_has_timeouts_enabled(struct amdgpu_device *adev)
+{
+   return amdgpu_gpu_recovery != 0 &&
+   adev->gfx_timeout != MAX_SCHEDULE_TIMEOUT &&
+   adev->compute_timeout != MAX_SCHEDULE_TIMEOUT &&
+   adev->sdma_timeout != MAX_SCHEDULE_TIMEOUT &&
+   adev->video_timeout != MAX_SCHEDULE_TIMEOUT;
+}
+
  #include "amdgpu_object.h"
  
  static inline bool amdgpu_is_tmz(struct amdgpu_device *adev)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 33781509838c..b489cd8abe31 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1460,10 +1460,11 @@ static int amdgpu_ttm_access_memory_sdma(struct 
ttm_buffer_object *bo,
if (r)
goto out;
  
-   src_addr = write ? amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo) :
-   amdgpu_bo_gpu_offset(abo);
-   dst_addr = write ? amdgpu_bo_gpu_offset(abo) :
-   amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo);
+   src_addr = amdgpu_bo_gpu_offset(abo);
+   dst_addr = amdgpu_bo_gpu_offset(adev->mman.sdma_access_bo);
+   if (write)
+   swap(src_addr, dst_addr);
+
amdgpu_emit_copy_buffer(adev, &job->ibs[0], src_addr, dst_addr, 
PAGE_SIZE, false);
  
  	amdgpu_ring_pad_ib(adev->mman.buffer_funcs_ring, &job->ibs[0]);

@@ -1486,15 +1487,6 @@ static int amdgpu_ttm_access_memory_sdma(struct 
ttm_buffer_object *bo,
return r;
  }
  
-static inline bool amdgpu_ttm_allow_post_mortem_debug(struct amdgpu_device *adev)
-{
-   return amdgpu_gpu_recovery == 0 ||
-   adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
-   adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
-   adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT ||
-   adev->video_timeout == MAX_SCHEDULE_TIMEOUT;
-}
-
  /**
   * amdgpu_ttm_access_memory - Read or Write memory that backs a buffer object.
   *
@@ -1519,7 +1511,7 @@ static int amdgpu_ttm_access_memory(struct 
ttm_buffer_object *bo,
if (bo->resource->mem_type != TTM_PL_VRAM)
return -EIO;
  
-   if (!amdgpu_ttm_allow_post_mortem_debug(adev) &&
+   if (amdgpu_device_has_timeouts_enabled(adev) &&
!amdgpu_ttm_access_memory_sdma(bo, offset, buf, len, 
write))
return len;
  
@@ -1909,8 +1901,7 @@ void amdgpu_ttm_fini(struct amdgpu_device *adev)

ttm_range_man_fini(&adev->mman.bdev, AMDGPU_PL_OA);
ttm_device_fini(&adev->mman.bdev);
adev->mman.initialized = false;
-   if (adev->mman.sdma_access_ptr)
-   amdgpu_bo_free_kernel(&adev->mman.sdma_access_bo, NULL,
+   amdgpu_bo_free_kernel(&adev->mman.sdma_access_bo, NULL,
&adev->mman.sdma_access_ptr);
DRM_INFO("amdgpu: ttm finalized\n");
  }
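
The rename in this patch relies on the new helper being the exact logical negation of the old one (De Morgan: a chain of "== ... ||" negated becomes a chain of "!= ... &&"). A minimal userspace sketch that checks this exhaustively follows. The struct, the NO_TIMEOUT macro (standing in for MAX_SCHEDULE_TIMEOUT) and the function names are assumptions made for illustration; only the shape of the two predicates comes from the diff.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Stand-in for the kernel's MAX_SCHEDULE_TIMEOUT ("never time out"). */
#define NO_TIMEOUT LONG_MAX

/* Hypothetical flattened view of the fields the two helpers read. */
struct timeout_cfg {
	int gpu_recovery;	/* 0 means recovery is disabled */
	long gfx, compute, sdma, video;
};

/* Shape of the removed helper. */
bool allow_post_mortem_debug(const struct timeout_cfg *c)
{
	return c->gpu_recovery == 0 ||
	       c->gfx == NO_TIMEOUT ||
	       c->compute == NO_TIMEOUT ||
	       c->sdma == NO_TIMEOUT ||
	       c->video == NO_TIMEOUT;
}

/* Shape of the added helper. */
bool has_timeouts_enabled(const struct timeout_cfg *c)
{
	return c->gpu_recovery != 0 &&
	       c->gfx != NO_TIMEOUT &&
	       c->compute != NO_TIMEOUT &&
	       c->sdma != NO_TIMEOUT &&
	       c->video != NO_TIMEOUT;
}

/* Walk all 32 on/off combinations; returns how many disagree. */
int count_mismatches(void)
{
	int mismatches = 0;

	for (unsigned int bits = 0; bits < 32; bits++) {
		struct timeout_cfg c = {
			.gpu_recovery = (bits & 1) ? 1 : 0,
			.gfx = (bits & 2) ? 10000 : NO_TIMEOUT,
			.compute = (bits & 4) ? 60000 : NO_TIMEOUT,
			.sdma = (bits & 8) ? 10000 : NO_TIMEOUT,
			.video = (bits & 16) ? 10000 : NO_TIMEOUT,
		};

		if (has_timeouts_enabled(&c) != !allow_post_mortem_debug(&c))
			mismatches++;
	}

	return mismatches;
}
```

With the two predicates equivalent in this sense, replacing `!allow_post_mortem_debug(adev)` by `has_timeouts_enabled(adev)` at the call site leaves behavior unchanged.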




[PATCH 2/3] drm/amdgpu: Fix compile warnings

2022-01-12 Thread yipechai
Fix compile warnings.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 394a18e3c6af..7afeec4255bd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -869,7 +869,8 @@ static int amdgpu_ras_enable_all_features(struct 
amdgpu_device *adev,
 }
 /* feature ctl end */
 
-int amdgpu_ras_block_match_default(struct amdgpu_ras_block_object* block_obj, 
enum amdgpu_ras_block block)
+static int amdgpu_ras_block_match_default(struct amdgpu_ras_block_object 
*block_obj,
+   enum amdgpu_ras_block block)
 {
if(!block_obj)
return -EINVAL;
-- 
2.25.1



[PATCH 1/3] drm/amdgpu: Use ARRAY_SIZE to get array length

2022-01-12 Thread yipechai
Use ARRAY_SIZE to get array length.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 23f4290b2fde..394a18e3c6af 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -89,7 +89,8 @@ const char *get_ras_block_str(struct ras_common_if *ras_block)
return ras_block_string[ras_block->block];
 }
 
-#define ras_block_str(_BLOCK_)  (((_BLOCK_) < 
(sizeof(*ras_block_string)/sizeof(const char*))) ? ras_block_string[_BLOCK_] : 
"Out Of Range")
+#define ras_block_str(_BLOCK_) \
+   (((_BLOCK_) < ARRAY_SIZE(ras_block_string)) ? ras_block_string[_BLOCK_] 
: "Out Of Range")
 
 #define ras_err_str(i) (ras_error_string[ffs(i)])
 
-- 
2.25.1
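
A short userspace sketch of why the old bound was wrong, assuming (as the sizeof(const char*) divisor suggests) that ras_block_string is an array of const char *. Dereferencing the array yields a single pointer, so the old expression divides one pointer size by another and always evaluates to 1. The table contents and function names below are hypothetical, and the ARRAY_SIZE here is simplified; the kernel's version also rejects non-array arguments at build time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the kernel macro. */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical name table standing in for ras_block_string. */
static const char *block_names[] = { "umc", "sdma", "gfx", "mmhub" };

/* Old bound: sizeof(*block_names) is the size of one pointer, so this is 1. */
const char *block_str_old(size_t i)
{
	return i < (sizeof(*block_names) / sizeof(const char *)) ?
		block_names[i] : "Out Of Range";
}

/* New bound: the real element count. */
const char *block_str_new(size_t i)
{
	return i < ARRAY_SIZE(block_names) ? block_names[i] : "Out Of Range";
}

/* Small self-check showing where the two forms differ. */
void block_str_demo(void)
{
	assert(strcmp(block_str_old(0), "umc") == 0);
	assert(strcmp(block_str_old(2), "Out Of Range") == 0);	/* valid index rejected */
	assert(strcmp(block_str_new(2), "gfx") == 0);
	assert(strcmp(block_str_new(4), "Out Of Range") == 0);
}
```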



[PATCH 3/3] drm/amdgpu: Adjust the code format

2022-01-12 Thread yipechai
Adjust the code format.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 7afeec4255bd..54d807b021fe 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2764,9 +2764,8 @@ int amdgpu_ras_register_ras_block(struct amdgpu_device 
*adev,
 
/* If the ras object is in ras_list, don't add it again */
list_for_each_entry_safe(obj, tmp, &adev->ras_list, node) {
-   if (obj == ras_block_obj) {
+   if (obj == ras_block_obj)
return 0;
-   }
}
 
INIT_LIST_HEAD(&ras_block_obj->node);
-- 
2.25.1



[PATCH 1/2] drm/admgpu: add data struct for vram check

2022-01-12 Thread Xiaojian Du
Add the data struct members used by the VRAM check.

Signed-off-by: Xiaojian Du 
Reviewed-by: Huang Rui 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 64cd80d050eb..13196e50a98a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -928,6 +928,11 @@ struct amdgpu_device {
uint32_tbios_scratch_reg_offset;
uint32_tbios_scratch[AMDGPU_BIOS_NUM_SCRATCH];
 
+   /* vram check */
+   struct amdgpu_bo*vram_bo;
+   uint64_tvram_gpu;
+   void*vram_ptr;
+
/* Direct GMA */
struct amdgpu_direct_gmadirect_gma;
/* SSG */
-- 
2.25.1



[PATCH 2/2] drm/amdgpu: add vram check function for GMC

2022-01-12 Thread Xiaojian Du
Add a VRAM check function for GMC; it covers GMC v8, v9 and v10.

Signed-off-by: Xiaojian Du 
Reviewed-by: Huang Rui 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 42 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h |  1 +
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c  |  4 +++
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c   |  6 +++-
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c   |  8 -
 5 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 83f26bca7dac..dbc0de89d7e4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -833,3 +833,45 @@ void amdgpu_gmc_get_reserved_allocation(struct amdgpu_device *adev)
break;
}
 }
+
+int amdgpu_gmc_vram_checking(struct amdgpu_device *adev)
+{
+   int ret, size = 0x100000;
+   uint8_t cptr[10];
+
+   ret = amdgpu_bo_create_kernel(adev, size, PAGE_SIZE,
+   AMDGPU_GEM_DOMAIN_VRAM,
+   &adev->vram_bo,
+   &adev->vram_gpu,
+   &adev->vram_ptr);
+   if (ret)
+   return ret;
+
+   memset(adev->vram_ptr, 0x86, size);
+   memset(cptr, 0x86, 10);
+
+   /*
+    * Check that the start, the middle, and the end of the buffer all
+    * hold the pattern 0x86. If they do, we assume the VRAM BO is usable.
+    *
+    * Note: checking every byte of the whole 1M BO would take too long,
+    * so we sample just these three parts.
+    */
+   ret = memcmp(adev->vram_ptr, cptr, 10);
+   if (ret)
+   goto out;
+
+   ret = memcmp(adev->vram_ptr + (size / 2), cptr, 10);
+   if (ret)
+   goto out;
+
+   ret = memcmp(adev->vram_ptr + size - 10, cptr, 10);
+
+out:
+   /* Free the test BO on both the success and the failure path. */
+   amdgpu_bo_free_kernel(&adev->vram_bo, &adev->vram_gpu,
+   &adev->vram_ptr);
+
+   return ret;
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
index 82ec665b366c..f06af61378ef 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -343,4 +343,5 @@ void amdgpu_gmc_init_pdb0(struct amdgpu_device *adev);
 uint64_t amdgpu_gmc_vram_mc2pa(struct amdgpu_device *adev, uint64_t mc_addr);
 uint64_t amdgpu_gmc_vram_pa(struct amdgpu_device *adev, struct amdgpu_bo *bo);
 uint64_t amdgpu_gmc_vram_cpu_pa(struct amdgpu_device *adev, struct amdgpu_bo *bo);
+int amdgpu_gmc_vram_checking(struct amdgpu_device *adev);
 #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index 3915ba837596..5e407c88c8d0 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -1048,6 +1048,10 @@ static int gmc_v10_0_hw_init(void *handle)
if (r)
return r;
 
+   r = amdgpu_gmc_vram_checking(adev);
+   if (r)
+   return r;
+
if (adev->umc.funcs && adev->umc.funcs->init_registers)
adev->umc.funcs->init_registers(adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
index 9a3fc0926903..6c94a9712a3a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
@@ -1241,7 +1241,11 @@ static int gmc_v8_0_hw_init(void *handle)
if (r)
return r;
 
-   return r;
+   r = amdgpu_gmc_vram_checking(adev);
+   if (r)
+   return r;
+
+   return 0;
 }
 
 static int gmc_v8_0_hw_fini(void *handle)
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index ce7d438eeabe..1ea18b4ff63f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1771,8 +1771,14 @@ static int gmc_v9_0_hw_init(void *handle)
adev->umc.funcs->init_registers(adev);
 
r = gmc_v9_0_gart_enable(adev);
+   if (r)
+   return r;
 
-   return r;
+   r = amdgpu_gmc_vram_checking(adev);
+   if (r)
+   return r;
+
+   return 0;
 }
 
 /**
-- 
2.25.1
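
For readers who want to try the sampling idea from this patch outside the kernel, here is a minimal userspace sketch. It is an illustration under stated assumptions, not the driver code: malloc() and free() stand in for amdgpu_bo_create_kernel() and amdgpu_bo_free_kernel(), and the names check_samples(), buffer_check(), SAMPLE_LEN, FILL_PATTERN, and the corrupt_at parameter are hypothetical. It fills a buffer with 0x86, compares 10 bytes at the start, middle, and end, and releases the buffer on a single exit path.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SAMPLE_LEN   10    /* bytes compared at each sample point */
#define FILL_PATTERN 0x86  /* same fill byte as in the patch */

/* Compare SAMPLE_LEN bytes at the start, middle, and end of buf.
 * Returns 0 if all three samples hold FILL_PATTERN, -1 otherwise. */
static int check_samples(const uint8_t *buf, size_t size)
{
	uint8_t expect[SAMPLE_LEN];
	size_t offsets[3];
	size_t i;

	/* The middle sample must not run past the end of the buffer. */
	assert(size >= 2 * SAMPLE_LEN);

	offsets[0] = 0;
	offsets[1] = size / 2;
	offsets[2] = size - SAMPLE_LEN;

	memset(expect, FILL_PATTERN, sizeof(expect));

	for (i = 0; i < 3; i++) {
		if (memcmp(buf + offsets[i], expect, SAMPLE_LEN))
			return -1;
	}
	return 0;
}

/* Allocate a buffer, fill it, optionally clear one byte to simulate
 * bad memory (hypothetical test hook; pass -1 to skip), run the
 * sampled check, and free the buffer before returning. */
static int buffer_check(size_t size, long corrupt_at)
{
	uint8_t *buf = malloc(size);
	int ret;

	if (!buf)
		return -1;

	memset(buf, FILL_PATTERN, size);
	if (corrupt_at >= 0 && (size_t)corrupt_at < size)
		buf[corrupt_at] = 0;

	ret = check_samples(buf, size);

	free(buf);	/* released on both success and failure */
	return ret;
}
```

Calling buffer_check(0x100000, -1) exercises the clean case, while a corrupt_at value inside one of the three sampled windows makes the check fail. Corruption outside those windows goes unnoticed, which is the trade-off of sampling instead of scanning the whole buffer.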



Re: [PATCH 1/2] drm/admgpu: add data struct for vram check

2022-01-12 Thread Huang Rui
On Thu, Jan 13, 2022 at 03:45:25PM +0800, Du, Xiaojian wrote:
> Add the data structures needed for the VRAM check.
> 

The subject has a typo: admgpu -> amdgpu

> Signed-off-by: Xiaojian Du 
> Reviewed-by: Huang Rui 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 64cd80d050eb..13196e50a98a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -928,6 +928,11 @@ struct amdgpu_device {
>   uint32_t bios_scratch_reg_offset;
>   uint32_t bios_scratch[AMDGPU_BIOS_NUM_SCRATCH];
>  
> + /* vram check */
> + struct amdgpu_bo *vram_bo;
> + uint64_t vram_gpu;
> + void *vram_ptr;
> +
>   /* Direct GMA */
>   struct amdgpu_direct_gma direct_gma;
>   /* SSG */
> -- 
> 2.25.1
>