virt: Drop concurrent GPU reset protection for SRIOV

JingWen Chen Wed, 05 Jan 2022 21:00:14 -0800


On 2022/1/6 上午2:24, Andrey Grodzovsky wrote:
>
> On 2022-01-05 2:59 a.m., Christian König wrote:
>> Am 05.01.22 um 08:34 schrieb JingWen Chen:
>>> On 2022/1/5 上午12:56, Andrey Grodzovsky wrote:
>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>> [AMD Official Use Only]
>>>>>>
>>>>>>>> See the FLR request from the hypervisor is just another source of 
>>>>>>>> signaling the need for a reset, similar to each job timeout on each 
>>>>>>>> queue. Otherwise you have a race condition between the hypervisor and 
>>>>>>>> the scheduler.
>>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR 
>>>>>> is about to start or was already executed, but host will do FLR anyway 
>>>>>> without waiting for guest too long
>>>>>>
>>>>> Then we have a major design issue in the SRIOV protocol and really need 
>>>>> to question this.
>>>>>
>>>>> How do you want to prevent a race between the hypervisor resetting the 
>>>>> hardware and the client trying the same because of a timeout?
>>>>>
>>>>> As far as I can see the procedure should be:
>>>>> 1. We detect that a reset is necessary, either because of a fault a 
>>>>> timeout or signal from hypervisor.
>>>>> 2. For each of those potential reset sources a work item is send to the 
>>>>> single workqueue.
>>>>> 3. One of those work items execute first and prepares the reset.
>>>>> 4. We either do the reset our self or notify the hypervisor that we are 
>>>>> ready for the reset.
>>>>> 5. Cleanup after the reset, eventually resubmit jobs etc..
>>>>> 6. Cancel work items which might have been scheduled from other reset 
>>>>> sources.
>>>>>
>>>>> It does make sense that the hypervisor resets the hardware without 
>>>>> waiting for the clients for too long, but if we don't follow this general 
>>>>> steps we will always have a race between the different components.
>>>>
>>>> Monk, just to add to this - if indeed as you say that 'FLR from hypervisor 
>>>> is just to notify guest the hw VF FLR is about to start or was already 
>>>> executed, but host will do FLR anyway without waiting for guest too long'
>>>> and there is no strict waiting from the hypervisor for IDH_READY_TO_RESET 
>>>> to be recived from guest before starting the reset then setting 
>>>> in_gpu_reset and locking reset_sem from guest side is not really full proof
>>>> protection from MMIO accesses by the guest - it only truly helps if 
>>>> hypervisor waits for that message before initiation of HW reset.
>>>>
>>> Hi Andrey, this cannot be done. If somehow guest kernel hangs and never has 
>>> the chance to send the response back, then other VFs will have to wait it 
>>> reset. All the vfs will hang in this case. Or sometimes the mailbox has 
>>> some delay and other VFs will also wait. The user of other VFs will be 
>>> affected in this case.
>>
>> Yeah, agree completely with JingWen. The hypervisor is the one in charge 
>> here, not the guest.
>>
>> What the hypervisor should do (and it already seems to be designed that way) 
>> is to send the guest a message that a reset is about to happen and give it 
>> some time to response appropriately.
>>
>> The guest on the other hand then tells the hypervisor that all processing 
>> has stopped and it is ready to restart. If that doesn't happen in time the 
>> hypervisor should eliminate the guest probably trigger even more severe 
>> consequences, e.g. restart the whole VM etc...
>>
>> Christian.
>
>
> So what's the end conclusion here regarding dropping this particular patch ? 
> Seems to me we still need to drop it to prevent driver's MMIO access
> to the GPU during reset from various places in the code.
>
> Andrey
>
Hi Andrey & Christian,


I have ported your patch(drop the reset_sem and in_gpu_reset in flr work) and 
run some tests. If a engine hang during an OCL benchmark(using kfd), we can see 
the logs below:

[  397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
[  428.400582] [drm] clean up the vf2pf work item
[  428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 
ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread 
xgemmStandalone pid 3557)
[  428.527576] amdgpu 0000:00:07.0: amdgpu:   in page starting at address 
0x00007fc991c04000 from client 0x1b (UTCL2)
[  437.531392] amdgpu: qcm fence wait loop timeout expired
[  437.535738] amdgpu: The cp might be in an unrecoverable state due to an 
unsuccessful queues preemption
[  437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
[  438.087443] [drm] RE-INIT-early: nv_common succeeded

As kfd relies on these to check if GPU is in reset, dropping it will hit some 
page fault and fence error very easily.

>
>>
>>>> Andrey
>>>>
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>> [AMD Official Use Only]
>>>>>>
>>>>>>>> See the FLR request from the hypervisor is just another source of 
>>>>>>>> signaling the need for a reset, similar to each job timeout on each 
>>>>>>>> queue. Otherwise you have a race condition between the hypervisor and 
>>>>>>>> the scheduler.
>>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR 
>>>>>> is about to start or was already executed, but host will do FLR anyway 
>>>>>> without waiting for guest too long
>>>>>>
>>>>>>>> In other words I strongly think that the current SRIOV reset 
>>>>>>>> implementation is severely broken and what Andrey is doing is actually 
>>>>>>>> fixing it.
>>>>>> It makes the code to crash ... how could it be a fix ?
>>>>>>
>>>>>> I'm afraid the patch is NAK from me,  but it is welcome if the cleanup 
>>>>>> do not ruin the logic, Andry or jingwen can try it if needed.
>>>>>>
>>>>>> Thanks
>>>>>> -------------------------------------------------------------------
>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>> -------------------------------------------------------------------
>>>>>> we are hiring software manager for CVS core team
>>>>>> -------------------------------------------------------------------
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>> To: Chen, JingWen <jingwen.ch...@amd.com>; Christian König 
>>>>>> <ckoenig.leichtzumer...@gmail.com>; Grodzovsky, Andrey 
>>>>>> <andrey.grodzov...@amd.com>; Deng, Emily <emily.d...@amd.com>; Liu, Monk 
>>>>>> <monk....@amd.com>; dri-de...@lists.freedesktop.org; 
>>>>>> amd-gfx@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>; Chen, 
>>>>>> JingWen <jingwen.ch...@amd.com>
>>>>>> Cc: dan...@ffwll.ch
>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
>>>>>> protection for SRIOV
>>>>>>
>>>>>> Hi Jingwen,
>>>>>>
>>>>>> well what I mean is that we need to adjust the implementation in amdgpu 
>>>>>> to actually match the requirements.
>>>>>>
>>>>>> Could be that the reset sequence is questionable in general, but I doubt 
>>>>>> so at least for now.
>>>>>>
>>>>>> See the FLR request from the hypervisor is just another source of 
>>>>>> signaling the need for a reset, similar to each job timeout on each 
>>>>>> queue. Otherwise you have a race condition between the hypervisor and 
>>>>>> the scheduler.
>>>>>>
>>>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at 
>>>>>> a central place and not in the SRIOV specific code.
>>>>>>
>>>>>> In other words I strongly think that the current SRIOV reset 
>>>>>> implementation is severely broken and what Andrey is doing is actually 
>>>>>> fixing it.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>>>>>> Hi Christian,
>>>>>>> I'm not sure what do you mean by "we need to change SRIOV not the 
>>>>>>> driver".
>>>>>>>
>>>>>>> Do you mean we should change the reset sequence in SRIOV? This will be 
>>>>>>> a huge change for our SRIOV solution.
>>>>>>>
>>>>>>>    From my point of view, we can directly use amdgpu_device_lock_adev
>>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock since no 
>>>>>>> one will conflict with this thread with reset_domain introduced.
>>>>>>> But we do need the reset_sem and adev->in_gpu_reset to keep device 
>>>>>>> untouched via user space.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Jingwen Chen
>>>>>>>
>>>>>>> On 2022/1/3 下午6:17, Christian König wrote:
>>>>>>>> Please don't. This patch is vital to the cleanup of the reset 
>>>>>>>> procedure.
>>>>>>>>
>>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the 
>>>>>>>> driver.
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>> I do agree with shaoyun, if the host find the gpu engine hangs 
>>>>>>>>>> first, and do the flr, guest side thread may not know this and still 
>>>>>>>>>> try to access HW(e.g. kfd is using a lot of amdgpu_in_reset and 
>>>>>>>>>> reset_sem to identify the reset status). And this may lead to very 
>>>>>>>>>> bad result.
>>>>>>>>>>
>>>>>>>>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>>>>>>>>> These patches look good to me. JingWen will pull these patches and 
>>>>>>>>>>> do some basic TDR test on sriov environment, and give feedback.
>>>>>>>>>>>
>>>>>>>>>>> Best wishes
>>>>>>>>>>> Emily Deng
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Liu, Monk <monk....@amd.com>
>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>> To: Koenig, Christian <christian.koe...@amd.com>; Grodzovsky,
>>>>>>>>>>>> Andrey <andrey.grodzov...@amd.com>;
>>>>>>>>>>>> dri-de...@lists.freedesktop.org; amd- g...@lists.freedesktop.org;
>>>>>>>>>>>> Chen, Horace <horace.c...@amd.com>; Chen, JingWen
>>>>>>>>>>>> <jingwen.ch...@amd.com>; Deng, Emily <emily.d...@amd.com>
>>>>>>>>>>>> Cc: dan...@ffwll.ch
>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>
>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>
>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>
>>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>> -- 
>>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>> To: Grodzovsky, Andrey <andrey.grodzov...@amd.com>; dri-
>>>>>>>>>>>> de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>> Cc: dan...@ffwll.ch; Liu, Monk <monk....@amd.com>; Chen, Horace
>>>>>>>>>>>> <horace.c...@amd.com>
>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>
>>>>>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>>>>>> Since now flr work is serialized against  GPU resets there is no
>>>>>>>>>>>>> need for this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
>>>>>>>>>>>> Acked-by: Christian König <christian.koe...@amd.com>
>>>>>>>>>>>>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>       2 files changed, 22 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>           int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>
>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>> -     */
>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>> -
>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>
>>>>>>>>>>>>>           xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>
>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>>>
>>>>>>>>>>>>>       flr_done:
>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>> -
>>>>>>>>>>>>>           /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>> */
>>>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>           struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>           int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>
>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>> -     */
>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>> -
>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>
>>>>>>>>>>>>>           xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>
>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>           } while (timeout > 1);
>>>>>>>>>>>>>
>>>>>>>>>>>>>       flr_done:
>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>> -
>>>>>>>>>>>>>           /* Trigger recovery for world switch failure if no TDR
>>>>>>>>>>>>> */
>>>>>>>>>>>>>           if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>               && (!amdgpu_device_has_job_running(adev) ||
>>

Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

Reply via email to