virt: Drop concurrent GPU reset protection for SRIOV

Christian König Tue, 04 Jan 2022 02:19:04 -0800

Hi Jingwen,

well what I mean is that we need to adjust the implementation in amdgputo actually match the requirements.

Could be that the reset sequence is questionable in general, but I doubtso at least for now.

See the FLR request from the hypervisor is just another source ofsignaling the need for a reset, similar to each job timeout on eachqueue. Otherwise you have a race condition between the hypervisor andthe scheduler.

Properly setting in_gpu_reset is indeed mandatory, but should happen ata central place and not in the SRIOV specific code.

In other words I strongly think that the current SRIOV resetimplementation is severely broken and what Andrey is doing is actuallyfixing it.


Regards,
Christian.

Am 04.01.22 um 10:07 schrieb JingWen Chen:

Hi Christian,
I'm not sure what do you mean by "we need to change SRIOV not the driver".

Do you mean we should change the reset sequence in SRIOV? This will be a huge 
change for our SRIOV solution.

 From my point of view, we can directly use
amdgpu_device_lock_adev and amdgpu_device_unlock_adev in flr_work instead of 
try_lock since no one will conflict with this thread with reset_domain 
introduced.
But we do need the reset_sem and adev->in_gpu_reset to keep device untouched 
via user space.

Best Regards,
Jingwen Chen

On 2022/1/3 下午6:17, Christian König wrote:

Please don't. This patch is vital to the cleanup of the reset procedure.

If SRIOV doesn't work with that we need to change SRIOV and not the driver.

Christian.

Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:

Sure, I guess i can drop this patch then.

Andrey

On 2021-12-24 4:57 a.m., JingWen Chen wrote:

I do agree with shaoyun, if the host find the gpu engine hangs first, and do 
the flr, guest side thread may not know this and still try to access HW(e.g. 
kfd is using a lot of amdgpu_in_reset and reset_sem to identify the reset 
status). And this may lead to very bad result.

On 2021/12/24 下午4:58, Deng, Emily wrote:

These patches look good to me. JingWen will pull these patches and do some 
basic TDR test on sriov environment, and give feedback.

Best wishes
Emily Deng

-----Original Message-----
From: Liu, Monk <monk....@amd.com>
Sent: Thursday, December 23, 2021 6:14 PM
To: Koenig, Christian <christian.koe...@amd.com>; Grodzovsky, Andrey
<andrey.grodzov...@amd.com>; dri-devel@lists.freedesktop.org; amd-
g...@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>; Chen,
JingWen <jingwen.ch...@amd.com>; Deng, Emily <emily.d...@amd.com>
Cc: dan...@ffwll.ch
Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
for SRIOV

[AMD Official Use Only]

@Chen, Horace @Chen, JingWen @Deng, Emily

Please take a review on Andrey's patch

Thanks
-------------------------------------------------------------------
Monk Liu | Cloud GPU & Virtualization Solution | AMD
-------------------------------------------------------------------
we are hiring software manager for CVS core team
-------------------------------------------------------------------

-----Original Message-----
From: Koenig, Christian <christian.koe...@amd.com>
Sent: Thursday, December 23, 2021 4:42 PM
To: Grodzovsky, Andrey <andrey.grodzov...@amd.com>; dri-
de...@lists.freedesktop.org; amd-...@lists.freedesktop.org
Cc: dan...@ffwll.ch; Liu, Monk <monk....@amd.com>; Chen, Horace
<horace.c...@amd.com>
Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection
for SRIOV

Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:

Since now flr work is serialized against  GPU resets there is no need
for this.

Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>

Acked-by: Christian König <christian.koe...@amd.com>

---
    drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
    drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
    2 files changed, 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 487cd654b69e..7d59a66e3988 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct

work_struct *work)

        struct amdgpu_device *adev = container_of(virt, struct

amdgpu_device, virt);

        int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;

-    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
-     * otherwise the mailbox msg will be ruined/reseted by
-     * the VF FLR.
-     */
-    if (!down_write_trylock(&adev->reset_sem))
-        return;
-
        amdgpu_virt_fini_data_exchange(adev);
-    atomic_set(&adev->in_gpu_reset, 1);

        xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);

@@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct

work_struct *work)

        } while (timeout > 1);

    flr_done:
-    atomic_set(&adev->in_gpu_reset, 0);
-    up_write(&adev->reset_sem);
-
        /* Trigger recovery for world switch failure if no TDR */
        if (amdgpu_device_should_recover_gpu(adev)
            && (!amdgpu_device_has_job_running(adev) || diff --git
a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index e3869067a31d..f82c066c8e8d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct

work_struct *work)

        struct amdgpu_device *adev = container_of(virt, struct

amdgpu_device, virt);

        int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;

-    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
-     * otherwise the mailbox msg will be ruined/reseted by
-     * the VF FLR.
-     */
-    if (!down_write_trylock(&adev->reset_sem))
-        return;
-
        amdgpu_virt_fini_data_exchange(adev);
-    atomic_set(&adev->in_gpu_reset, 1);

        xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);

@@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct

work_struct *work)

        } while (timeout > 1);

    flr_done:
-    atomic_set(&adev->in_gpu_reset, 0);
-    up_write(&adev->reset_sem);
-
        /* Trigger recovery for world switch failure if no TDR */
        if (amdgpu_device_should_recover_gpu(adev)
            && (!amdgpu_device_has_job_running(adev) ||

Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

Reply via email to