amdgpu: reduce reset time

Andrey Grodzovsky Wed, 27 Jul 2022 07:37:24 -0700

On 2022-07-27 06:35, Zhao, Victor wrote:

[AMD Official Use Only - General]


Hi Andrey,

Problem with status.hang is that it is set at 
amdgpu_device_ip_check_soft_reset, which is not implemented in nv or gfx10. 
They have to be nicely implemented first.
Another option I thought is to mark status.hang or add a flag to amdgpu_gfx 
when job timeout reported on gfx/comp ring. And this will require some logic to 
map the relationship for ring and ip blocks. This way does not look good as 
well.

I don't think we need this at the ring level, its enough to know thatthe reset you are going through is because of one of rings are hanged toapply this skip logic, it's pretty easy if we add 'bool hang' flag toadev->reset_domainwhich u can set in the beginning amdgpu_job_timedout and clear in theend. No protection is required as all the resets from all origins areserialized with timeout handler in a single threaded queue.


Andrey



Thanks,
Victor



-----Original Message-----
From: Grodzovsky, Andrey <andrey.grodzov...@amd.com>
Sent: Wednesday, July 27, 2022 12:57 AM
To: Zhao, Victor <victor.z...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <alexander.deuc...@amd.com>; Deng, Emily <emily.d...@amd.com>; 
Koenig, Christian <christian.koe...@amd.com>
Subject: Re: [PATCH 5/5] drm/amdgpu: reduce reset time


On 2022-07-26 05:40, Zhao, Victor wrote:

[AMD Official Use Only - General]

Hi Andrey,

Reply inline.


Thanks,
Victor



-----Original Message-----
From: Grodzovsky, Andrey <andrey.grodzov...@amd.com>
Sent: Tuesday, July 26, 2022 5:18 AM
To: Zhao, Victor <victor.z...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <alexander.deuc...@amd.com>; Deng, Emily
<emily.d...@amd.com>; Koenig, Christian <christian.koe...@amd.com>
Subject: Re: [PATCH 5/5] drm/amdgpu: reduce reset time


On 2022-07-22 03:34, Victor Zhao wrote:

In multi container use case, reset time is important, so skip ring
tests and cp halt wait during ip suspending for reset as they are
going to fail and cost more time on reset

Why are they failing in this case ? Skipping ring tests is not the best idea as 
you loose important indicator of system's sanity. Is there any way to make them 
work ?

[Victor]: I've seen gfx ring test fail every time after a gfx engine hang. I 
thought it should be expected as gfx is in a bad state. Do you know the reason 
we have ring tests before reset? As we are going to reset the asic anyway.
Another approach could be to make the skip mode2 only or reduce the wait time 
here.


I dug down in history and according to commit 'drm/amdgpu:unmap KCQ in gfx 
hw_fini(v2)' you need to write to scratch register for completion of queue 
unmap operation so you defently don't want to just skip it. I agree in case 
that the ring is hung this has no point but remember that GPU reset can happen 
not only to a hunged ring but for other reasons (RAS, manual reset e.t.c.) in 
which case you probably want to shut down gracefully here ?
I see we have adev->ip_blocks[i].status.hang flag which you maybe can use here 
instead ?

Signed-off-by: Victor Zhao <victor.z...@amd.com>
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c |  2 +-
    drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 26 +++++++++++++++----------
    2 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 222d3d7ea076..f872495ccc3a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -477,7 +477,7 @@ int amdgpu_gfx_disable_kcq(struct amdgpu_device *adev)
                kiq->pmf->kiq_unmap_queues(kiq_ring, &adev->gfx.compute_ring[i],
                                           RESET_QUEUES, 0, 0);

- if (adev->gfx.kiq.ring.sched.ready)

+       if (adev->gfx.kiq.ring.sched.ready && !amdgpu_in_reset(adev))
                r = amdgpu_ring_test_helper(kiq_ring);
        spin_unlock(&adev->gfx.kiq.ring_lock);

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c

b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index fafbad3cf08d..9ae29023e38f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -5971,16 +5971,19 @@ static int gfx_v10_0_cp_gfx_enable(struct amdgpu_device 
*adev, bool enable)
                WREG32_SOC15(GC, 0, mmCP_ME_CNTL, tmp);
        }

- for (i = 0; i < adev->usec_timeout; i++) {

-               if (RREG32_SOC15(GC, 0, mmCP_STAT) == 0)
-                       break;
-               udelay(1);
-       }
-
-       if (i >= adev->usec_timeout)
-               DRM_ERROR("failed to %s cp gfx\n", enable ? "unhalt" : "halt");
+       if (!amdgpu_in_reset(adev)) {
+               for (i = 0; i < adev->usec_timeout; i++) {
+                       if (RREG32_SOC15(GC, 0, mmCP_STAT) == 0)
+                               break;
+                       udelay(1);
+               }

+ if (i >= adev->usec_timeout)

+                       DRM_ERROR("failed to %s cp gfx\n",
+                                 enable ? "unhalt" : "halt");
+       }
        return 0;
+
    }

This change has impact beyond container case no ? We had no issue with this 
code during regular reset cases so why we would give up on this code which 
confirms CP is idle ? What is the side effect of skipping this during all GPU 
resets ?

Andrey

[Victor]: I see "failed to halt cp gfx" with regular reset cases as well when 
doing a gfx hang test using quark. I haven't seen a side effect with Mode1 reset yet but 
maybe shorten the wait time could be better?


Same as above i guess, it would indeed time out for a hung ring but GPU reset 
happens not only because of hung rings but for other reasons.

Andrey

static int gfx_v10_0_cp_gfx_load_pfp_microcode(struct

amdgpu_device
*adev) @@ -7569,8 +7572,10 @@ static int gfx_v10_0_kiq_disable_kgq(struct 
amdgpu_device *adev)
        for (i = 0; i < adev->gfx.num_gfx_rings; i++)
                kiq->pmf->kiq_unmap_queues(kiq_ring, &adev->gfx.gfx_ring[i],
                                           PREEMPT_QUEUES, 0, 0);
-
-       return amdgpu_ring_test_helper(kiq_ring);
+       if (!amdgpu_in_reset(adev))
+               return amdgpu_ring_test_helper(kiq_ring);
+       else
+               return 0;
    }
    #endif

@@ -7610,6 +7615,7 @@ static int gfx_v10_0_hw_fini(void *handle)return 0;

        }
+
        gfx_v10_0_cp_enable(adev, false);
        gfx_v10_0_enable_gui_idle_interrupt(adev, false);

Re: [PATCH 5/5] drm/amdgpu: reduce reset time

Reply via email to