On 2/4/2022 5:50 PM, Lazar, Lijo wrote:
[AMD Official Use Only]

To explain more -
        It's an unconditional reset done by the kernel on every suspend 
(S3/S4). In such a case which process is going to receive the trace events?

Most likely use case would be related to gpu recovery. Triggering a trace on 
every reset doesn't look like a good idea.


If you observe carefully, we are just providing an infrastructure; the application's intention is unknown to us. In my opinion it's not a good idea to apply a filter in the kernel based on our interpretation of that intention.

For example, if an app just wants to count how many resets happen due to S3/S4 transitions, this infra would become useless with such a filter. It would be a better idea for the app to learn about, and ignore, the scenarios it is not interested in.

This could eventually just be a difference in design philosophy, maybe :)

- Shashank

Thanks,
Lijo

-----Original Message-----
From: Sharma, Shashank <shashank.sha...@amd.com>
Sent: Friday, February 4, 2022 10:09 PM
To: Lazar, Lijo <lijo.la...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <alexander.deuc...@amd.com>; Somalapuram, Amaranath 
<amaranath.somalapu...@amd.com>; Koenig, Christian <christian.koe...@amd.com>
Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler

Hey Lijo,
I somehow missed responding to this comment, please find my reply inline:

Regards
Shashank

On 1/22/2022 7:42 AM, Lazar, Lijo wrote:


On 1/22/2022 2:04 AM, Sharma, Shashank wrote:
 From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 2001
From: Somalapuram Amaranath <amaranath.somalapu...@amd.com>
Date: Fri, 21 Jan 2022 14:19:42 +0530
Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler

This patch adds a GPU reset handler for the Navi ASIC family, which
typically dumps some of the registers and sends a trace event.

V2: Accommodated call to work function to send uevent

Signed-off-by: Somalapuram Amaranath <amaranath.somalapu...@amd.com>
Signed-off-by: Shashank Sharma <shashank.sha...@amd.com>
---
   drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++
   1 file changed, 28 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 01efda4398e5..ada35d4c5245 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device *adev)
       }
   }

+static void amdgpu_reset_dumps(struct amdgpu_device *adev)
+{
+    int r = 0, i;
+
+    /* original raven doesn't have full asic reset */
+    if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
+        !(adev->apu_flags & AMD_APU_IS_RAVEN2))
+        return;
+    for (i = 0; i < adev->num_ip_blocks; i++) {
+        if (!adev->ip_blocks[i].status.valid)
+            continue;
+        if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
+            continue;
+        r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
+
+        if (r)
+            DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n",
+                    adev->ip_blocks[i].version->funcs->name, r);
+    }
+
+    /* Schedule work to send uevent */
+    if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
+        DRM_ERROR("failed to add GPU reset work\n");
+
+    dump_stack();
+}
+
   static int nv_asic_reset(struct amdgpu_device *adev)
   {
       int ret = 0;

+    amdgpu_reset_dumps(adev);

Had a comment on this before. Now there are different reasons (or even
no reason, like a precautionary reset) to perform a reset. A user would
be interested in a trace only if the reason is valid.

To clarify on why a work shouldn't be scheduled on every reset, check
here -

https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2188
In the example you pointed to, they have criteria to decide what is a valid 
reset in their context, on the kernel side itself. So they can decide whether 
they want to do something about it or not.

But in our case, we want to send the trace event to userspace with some 
register values on every reset, and it is actually up to the profiling app to 
interpret them (along with what it wants to call a GPU reset). So I don't 
think this is causing considerable overhead.

- Shashank



Thanks,
Lijo

       switch (nv_asic_reset_method(adev)) {
       case AMD_RESET_METHOD_PCI:
           dev_info(adev->dev, "PCI reset\n");
