On 2022/1/7 3:13 AM, Andrey Grodzovsky wrote:
>
> On 2022-01-06 12:18 a.m., JingWen Chen wrote:
>> On 2022/1/6 12:59 PM, JingWen Chen wrote:
>>> On 2022/1/6 2:24 AM, Andrey Grodzovsky wrote:
>>>> On 2022-01-05 2:59 a.m., Christian König wrote:
>>>>> Am 05.01.22 um 08:34 schrieb JingWen Chen:
>>>>>> On 2022/1/5 12:56 AM, Andrey Grodzovsky wrote:
>>>>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>
>>>>>>>>>>> See the FLR request from the hypervisor is just another source of 
>>>>>>>>>>> signaling the need for a reset, similar to each job timeout on each 
>>>>>>>>>>> queue. Otherwise you have a race condition between the hypervisor 
>>>>>>>>>>> and the scheduler.
>>>>>>>>> No it's not. The FLR from the hypervisor is just to notify the 
>>>>>>>>> guest that the HW VF FLR is about to start or was already executed, 
>>>>>>>>> but the host will do the FLR anyway without waiting too long for 
>>>>>>>>> the guest.
>>>>>>>>>
>>>>>>>> Then we have a major design issue in the SRIOV protocol and really 
>>>>>>>> need to question this.
>>>>>>>>
>>>>>>>> How do you want to prevent a race between the hypervisor resetting the 
>>>>>>>> hardware and the client trying the same because of a timeout?
>>>>>>>>
>>>>>>>> As far as I can see the procedure should be:
>>>>>>>> 1. We detect that a reset is necessary, either because of a fault, 
>>>>>>>> a timeout, or a signal from the hypervisor.
>>>>>>>> 2. For each of those potential reset sources a work item is sent to 
>>>>>>>> the single workqueue.
>>>>>>>> 3. One of those work items executes first and prepares the reset.
>>>>>>>> 4. We either do the reset ourselves or notify the hypervisor that we 
>>>>>>>> are ready for the reset.
>>>>>>>> 5. Clean up after the reset, eventually resubmitting jobs etc.
>>>>>>>> 6. Cancel work items which might have been scheduled from the other 
>>>>>>>> reset sources.
>>>>>>>>
>>>>>>>> It does make sense that the hypervisor resets the hardware without 
>>>>>>>> waiting too long for the clients, but if we don't follow these 
>>>>>>>> general steps we will always have a race between the different 
>>>>>>>> components.
>>>>>>> Monk, just to add to this - if indeed, as you say, 'the FLR from the 
>>>>>>> hypervisor is just to notify the guest that the HW VF FLR is about 
>>>>>>> to start or was already executed, but the host will do the FLR 
>>>>>>> anyway without waiting too long for the guest', and there is no 
>>>>>>> strict waiting by the hypervisor for IDH_READY_TO_RESET to be 
>>>>>>> received from the guest before starting the reset, then setting 
>>>>>>> in_gpu_reset and locking reset_sem on the guest side is not really 
>>>>>>> foolproof protection from MMIO accesses by the guest - it only truly 
>>>>>>> helps if the hypervisor waits for that message before initiating the 
>>>>>>> HW reset.
>>>>>>>
>>>>>> Hi Andrey, this cannot be done. If somehow the guest kernel hangs and 
>>>>>> never has the chance to send the response back, the other VFs will 
>>>>>> have to wait for its reset, and all the VFs will hang in this case. 
>>>>>> Sometimes the mailbox also has some delay and the other VFs will wait 
>>>>>> as well; the users of those VFs would be affected in this case.
>>>>> Yeah, agree completely with JingWen. The hypervisor is the one in charge 
>>>>> here, not the guest.
>>>>>
>>>>> What the hypervisor should do (and it already seems to be designed that 
>>>>> way) is send the guest a message that a reset is about to happen and 
>>>>> give it some time to respond appropriately.
>>>>>
>>>>> The guest on the other hand then tells the hypervisor that all 
>>>>> processing has stopped and it is ready to restart. If that doesn't 
>>>>> happen in time, the hypervisor should eliminate the guest and probably 
>>>>> trigger even more severe consequences, e.g. restart the whole VM etc.
>>>>>
>>>>> Christian.
>>>> So what's the end conclusion here regarding dropping this particular 
>>>> patch? It seems to me we still need to drop it to prevent the driver's 
>>>> MMIO accesses to the GPU during reset from various places in the code.
>>>>
>>>> Andrey
>>>>
>>> Hi Andrey & Christian,
>>>
>>> I have ported your patch (dropping the reset_sem and in_gpu_reset in the 
>>> flr work) and run some tests. If an engine hangs during an OCL benchmark 
>>> (using kfd), we can see the logs below:
>
>
> Did you port the entire patchset, or just 'drm/amd/virt: Drop concurrent 
> GPU reset protection for SRIOV'?
>
>
I ported the entire patchset
>>>
>>> [  397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [  428.400582] [drm] clean up the vf2pf work item
>>> [  428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 
>>> ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread 
>>> xgemmStandalone pid 3557)
>>> [  428.527576] amdgpu 0000:00:07.0: amdgpu:   in page starting at address 
>>> 0x00007fc991c04000 from client 0x1b (UTCL2)
>>> [  437.531392] amdgpu: qcm fence wait loop timeout expired
>>> [  437.535738] amdgpu: The cp might be in an unrecoverable state due to an 
>>> unsuccessful queues preemption
>>> [  437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
>>> [  438.087443] [drm] RE-INIT-early: nv_common succeeded
>>>
>>> As kfd relies on these flags to check whether the GPU is in reset, 
>>> dropping them makes it very easy to hit page faults and fence errors.
>> To be clear, we can also hit the page fault with reset_sem and 
>> in_gpu_reset in place, just not as easily as when dropping them.
>
> Are you saying that the entire patch-set, both with and without the patch 
> 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV', causes 
> this GPUVM page fault when testing an engine hang while running the 
> benchmark?
>
> Do you never observe this page fault when running this test on the 
> original tree, without the new patch-set?
>
> Andrey
>
I think this page fault issue can be seen even on the original tree; it's 
just that dropping the concurrent GPU reset protection makes it easier to hit.

We may need a new way to protect the reset in SRIOV.

>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>
>>>>>>>>>>> See the FLR request from the hypervisor is just another source of 
>>>>>>>>>>> signaling the need for a reset, similar to each job timeout on each 
>>>>>>>>>>> queue. Otherwise you have a race condition between the hypervisor 
>>>>>>>>>>> and the scheduler.
>>>>>>>>> No it's not. The FLR from the hypervisor is just to notify the 
>>>>>>>>> guest that the HW VF FLR is about to start or was already executed, 
>>>>>>>>> but the host will do the FLR anyway without waiting too long for 
>>>>>>>>> the guest.
>>>>>>>>>
>>>>>>>>>>> In other words I strongly think that the current SRIOV reset 
>>>>>>>>>>> implementation is severely broken and what Andrey is doing is 
>>>>>>>>>>> actually fixing it.
>>>>>>>>> It makes the code crash... how could that be a fix?
>>>>>>>>>
>>>>>>>>> I'm afraid the patch is a NAK from me, but the cleanup is welcome 
>>>>>>>>> as long as it does not ruin the logic; Andrey or JingWen can try 
>>>>>>>>> it if needed.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>>>>> To: Chen, JingWen <jingwen.ch...@amd.com>; Christian König 
>>>>>>>>> <ckoenig.leichtzumer...@gmail.com>; Grodzovsky, Andrey 
>>>>>>>>> <andrey.grodzov...@amd.com>; Deng, Emily <emily.d...@amd.com>; Liu, 
>>>>>>>>> Monk <monk....@amd.com>; dri-de...@lists.freedesktop.org; 
>>>>>>>>> amd-gfx@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>; 
>>>>>>>>> Chen, JingWen <jingwen.ch...@amd.com>
>>>>>>>>> Cc: dan...@ffwll.ch
>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
>>>>>>>>> protection for SRIOV
>>>>>>>>>
>>>>>>>>> Hi Jingwen,
>>>>>>>>>
>>>>>>>>> well what I mean is that we need to adjust the implementation in 
>>>>>>>>> amdgpu to actually match the requirements.
>>>>>>>>>
>>>>>>>>> Could be that the reset sequence is questionable in general, but I 
>>>>>>>>> doubt so at least for now.
>>>>>>>>>
>>>>>>>>> See the FLR request from the hypervisor is just another source of 
>>>>>>>>> signaling the need for a reset, similar to each job timeout on each 
>>>>>>>>> queue. Otherwise you have a race condition between the hypervisor and 
>>>>>>>>> the scheduler.
>>>>>>>>>
>>>>>>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen 
>>>>>>>>> at a central place and not in the SRIOV specific code.
>>>>>>>>>
>>>>>>>>> In other words I strongly think that the current SRIOV reset 
>>>>>>>>> implementation is severely broken and what Andrey is doing is 
>>>>>>>>> actually fixing it.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>>>>>>>>> Hi Christian,
>>>>>>>>>> I'm not sure what you mean by "we need to change SRIOV, not the 
>>>>>>>>>> driver".
>>>>>>>>>>
>>>>>>>>>> Do you mean we should change the reset sequence in SRIOV? That 
>>>>>>>>>> would be a huge change for our SRIOV solution.
>>>>>>>>>>
>>>>>>>>>> From my point of view, we can directly use amdgpu_device_lock_adev 
>>>>>>>>>> and amdgpu_device_unlock_adev in flr_work instead of try_lock, 
>>>>>>>>>> since no one will conflict with this thread once reset_domain is 
>>>>>>>>>> introduced. But we do need reset_sem and adev->in_gpu_reset to 
>>>>>>>>>> keep the device untouched from user space.
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>> Jingwen Chen
>>>>>>>>>>
>>>>>>>>>> On 2022/1/3 6:17 PM, Christian König wrote:
>>>>>>>>>>> Please don't. This patch is vital to the cleanup of the reset 
>>>>>>>>>>> procedure.
>>>>>>>>>>>
>>>>>>>>>>> If SRIOV doesn't work with that we need to change SRIOV and not the 
>>>>>>>>>>> driver.
>>>>>>>>>>>
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>>>>>>>>> Sure, I guess i can drop this patch then.
>>>>>>>>>>>>
>>>>>>>>>>>> Andrey
>>>>>>>>>>>>
>>>>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>>>>> I do agree with shaoyun: if the host finds the GPU engine hang 
>>>>>>>>>>>>> first and does the FLR, guest-side threads may not know this 
>>>>>>>>>>>>> and still try to access the HW (e.g. kfd uses amdgpu_in_reset 
>>>>>>>>>>>>> and reset_sem a lot to identify the reset status), and this 
>>>>>>>>>>>>> may lead to very bad results.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>>>>>>>>>>> These patches look good to me. JingWen will pull these patches 
>>>>>>>>>>>>>> and do some basic TDR test on sriov environment, and give 
>>>>>>>>>>>>>> feedback.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: Liu, Monk <monk....@amd.com>
>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>>>>> To: Koenig, Christian <christian.koe...@amd.com>; Grodzovsky,
>>>>>>>>>>>>>>> Andrey <andrey.grodzov...@amd.com>;
>>>>>>>>>>>>>>> dri-de...@lists.freedesktop.org; amd- 
>>>>>>>>>>>>>>> g...@lists.freedesktop.org;
>>>>>>>>>>>>>>> Chen, Horace <horace.c...@amd.com>; Chen, JingWen
>>>>>>>>>>>>>>> <jingwen.ch...@amd.com>; Deng, Emily <emily.d...@amd.com>
>>>>>>>>>>>>>>> Cc: dan...@ffwll.ch
>>>>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU 
>>>>>>>>>>>>>>> reset
>>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>> -- Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>> -- we are hiring software manager for CVS core team
>>>>>>>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>>>>> To: Grodzovsky, Andrey <andrey.grodzov...@amd.com>; dri-
>>>>>>>>>>>>>>> de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>> Cc: dan...@ffwll.ch; Liu, Monk <monk....@amd.com>; Chen, Horace
>>>>>>>>>>>>>>> <horace.c...@amd.com>
>>>>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU 
>>>>>>>>>>>>>>> reset
>>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>>>>>>>>> Since the flr work is now serialized against GPU resets, 
>>>>>>>>>>>>>>>> there is no
>>>>>>>>>>>>>>>> need for this.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
>>>>>>>>>>>>>>> Acked-by: Christian König <christian.koe...@amd.com>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>>>>        2 files changed, 22 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void 
>>>>>>>>>>>>>>>> xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>            struct amdgpu_device *adev = container_of(virt, 
>>>>>>>>>>>>>>>> struct
>>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>>            int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE 
>>>>>>>>>>>>>>>> received,
>>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>            xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 
>>>>>>>>>>>>>>>> 0,
>>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>            } while (timeout > 1);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>        flr_done:
>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>            /* Trigger recovery for world switch failure if no 
>>>>>>>>>>>>>>>> TDR
>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>            if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>                && (!amdgpu_device_has_job_running(adev) || diff
>>>>>>>>>>>>>>>> --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void 
>>>>>>>>>>>>>>>> xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>            struct amdgpu_device *adev = container_of(virt, 
>>>>>>>>>>>>>>>> struct
>>>>>>>>>>>>>>> amdgpu_device, virt);
>>>>>>>>>>>>>>>>            int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE 
>>>>>>>>>>>>>>>> received,
>>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>> amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>            xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 
>>>>>>>>>>>>>>>> 0,
>>>>>>>>>>>>>>>> 0, 0);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>            } while (timeout > 1);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>        flr_done:
>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>            /* Trigger recovery for world switch failure if no 
>>>>>>>>>>>>>>>> TDR
>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>            if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>                && (!amdgpu_device_has_job_running(adev) ||