Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Christian König Wed, 07 Apr 2021 03:29:08 -0700

Hi Andrey,

On 06.04.21 at 23:22, Andrey Grodzovsky wrote:

Hey Christian, Dennis, see below -

On 2021-04-06 6:34 a.m., Christian König wrote:
Hi Andrey,
well good question. My job is to watch over the implementation and design and while I always help I can't adjust anybody's schedule.
Is the patch to print a warning when the hardware is accessed without holding the locks merged yet? If not then that would probably be a good starting point.
It's merged into amd-staging-drm-next and since I work on drm-misc-next I will cherry-pick it into there.


Ok good to know, I haven't tracked that one further.

Then we would need to unify this with the SRCU to make sure that we have both the reset lock as well as block the hotplug code from reusing the MMIO space.
In my understanding there is a significant difference between handling of GPU reset and unplug - while GPU reset use case requires any HW accessing code to block and wait for the reset to finish and then proceed, hot-unplug is permanent and hence no need to wait and proceed but rather abort at once.


Yes, absolutely correct.

This is why I think that in any place we already check for device reset we should also add a check for hot-unplug, but the handling would be different
in that for hot-unplug we would abort instead of keep waiting.


Yes, that's the rough picture in my head as well.
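
The distinction above (wait out a reset, abort on unplug) can be sketched in userspace C. This is only an illustration under stated assumptions: struct sketch_dev, sketch_set_state() and sketch_wait_for_hw() are hypothetical names, not amdgpu code, and a pthread mutex/condvar stands in for whatever the driver would really use. Note that it only models the decision at the check; a real driver would also have to keep reset and removal out for the duration of the access.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device state; not the amdgpu structures. */
struct sketch_dev {
    pthread_mutex_t lock;
    pthread_cond_t  state_changed;
    bool in_reset;    /* temporary: callers sleep until it clears */
    bool unplugged;   /* permanent: callers give up immediately   */
};

#define SKETCH_DEV_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, false }

/* Update the state and wake everybody who is waiting on it. */
static void sketch_set_state(struct sketch_dev *d, bool in_reset, bool unplugged)
{
    pthread_mutex_lock(&d->lock);
    d->in_reset = in_reset;
    d->unplugged = d->unplugged || unplugged;   /* unplug is never undone */
    pthread_cond_broadcast(&d->state_changed);
    pthread_mutex_unlock(&d->lock);
}

/* Returns 0 if hardware may be accessed, -ENODEV if the device is gone. */
static int sketch_wait_for_hw(struct sketch_dev *d)
{
    int ret;

    pthread_mutex_lock(&d->lock);
    /* Reset: block and wait. Unplug: stop waiting and abort. */
    while (d->in_reset && !d->unplugged)
        pthread_cond_wait(&d->state_changed, &d->lock);
    ret = d->unplugged ? -ENODEV : 0;
    pthread_mutex_unlock(&d->lock);
    return ret;
}

/* Usage: a healthy device proceeds, an unplugged one aborts. */
static void sketch_unplug_demo(void)
{
    struct sketch_dev d = SKETCH_DEV_INIT;

    assert(sketch_wait_for_hw(&d) == 0);
    sketch_set_state(&d, false, true);
    assert(sketch_wait_for_hw(&d) == -ENODEV);
}
```

The same check serves both cases; only the outcome differs, which matches the idea of adding the unplug check wherever the reset check already exists.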

Essentially Daniel's patch of having an amdgpu_device_hwaccess_begin()/_end() was the right approach. You just can't do it in the top level IOCTL handler, but rather need it somewhere between front end and backend.

Similar to handling device reset for unplug we obviously also need to stop and block any MMIO accesses once device is unplugged and, as Daniel Vetter mentioned - we have to do it before finishing pci_remove (early device fini) and not later (when last device reference is dropped from user space) in order to prevent reuse of MMIO space we still access by other hot plugging devices. As in device reset case we need to cancel all delayed works, stop the drm scheduler, complete all unfinished fences (both HW and scheduler fences). While you stated strong objection to force signalling scheduler fences from GPU reset, quote:
"you can't signal the dma_fence waiting. Waiting for a dma_fence also means you wait for the GPU reset to finish. When we would signal the dma_fence during the GPU reset then we would run into memory corruption because the hardware jobs running after the GPU reset would access memory which is already freed."
To my understanding this is a key difference with hot-unplug: the device is gone, all those concerns are irrelevant and hence we can actually force signal scheduler fences (setting an error on them before) to force completion of any waiting clients such as possibly IOCTLs or async page flips etc.

Yes, absolutely correct. That's what I also mentioned to Daniel. When we are able to nuke the device and any memory access it might do we can also signal the fences.
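
To illustrate "set an error, then force signal" for the unplug case, here is a small userspace sketch. It assumes a hypothetical, much-simplified struct sketch_fence; the kernel's struct dma_fence and its helpers are considerably richer, and nothing here is taken from the actual patches.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much-simplified fence; not the kernel's struct dma_fence. */
struct sketch_fence {
    bool signaled;
    int  error;      /* 0 or a negative errno, meaningful once signaled */
};

/* Complete every pending fence with -ENODEV; returns how many were forced. */
static size_t sketch_force_complete(struct sketch_fence *f, size_t n)
{
    size_t forced = 0;

    for (size_t i = 0; i < n; i++) {
        if (f[i].signaled)
            continue;            /* finished normally, keep its status */
        f[i].error = -ENODEV;    /* record the error before signalling */
        f[i].signaled = true;    /* waiters can now make progress      */
        forced++;
    }
    return forced;
}

/* Usage: one fence already done, one still pending at unplug time. */
static void sketch_fence_demo(void)
{
    struct sketch_fence f[2] = { { true, 0 }, { false, 0 } };

    assert(sketch_force_complete(f, 2) == 1);
    assert(f[0].error == 0);
    assert(f[1].signaled && f[1].error == -ENODEV);
}
```

This is only safe in the unplug case discussed here, where the device can no longer touch memory; during a GPU reset the objection quoted above still applies.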

Beyond blocking all delayed works and scheduler threads we also need to guarantee no IOCTL can access MMIO post device unplug OR in flight IOCTLs are done before we finish pci_remove (amdgpu_pci_remove for us). For this I suggest we do something like what we worked on with Takashi Iwai the ALSA maintainer recently when he helped implementing PCI BARs move support for snd_hda_intel. Take a look at
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=cbaa324799718e2b828a8c7b5b001dd896748497
and
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=e36365d9ab5bbc30bdc221ab4b3437de34492440
We also had same issue there, how to prevent MMIO accesses while the BARs are migrating. What was done there is a refcount was added to count all IOCTLs in flight, for any in flight IOCTL the BAR migration handler would block for the refcount to drop to 0 before it would proceed, for any later IOCTL it stops and waits if device is in migration state. We even don't need the wait part, nothing to wait for, we just return with -ENODEV for this case.
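
The refcount scheme described above might look roughly like this in userspace C. It is a sketch under assumptions, not the snd_hda_intel or amdgpu code: struct ioctl_gate and the ioctl_gate_*() helpers are made-up names, and a mutex plus condition variable replaces the kernel primitives.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical gate counting IOCTLs in flight; names are illustrative. */
struct ioctl_gate {
    pthread_mutex_t lock;
    pthread_cond_t  drained;
    int  in_flight;   /* IOCTLs currently between enter and exit */
    bool removed;     /* set once removal has started            */
};

#define IOCTL_GATE_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, false }

/* Called at IOCTL entry: fail fast with -ENODEV after removal started. */
static int ioctl_gate_enter(struct ioctl_gate *g)
{
    int ret = 0;

    pthread_mutex_lock(&g->lock);
    if (g->removed)
        ret = -ENODEV;
    else
        g->in_flight++;
    pthread_mutex_unlock(&g->lock);
    return ret;
}

/* Called at IOCTL exit: wake the remover when the last one leaves. */
static void ioctl_gate_exit(struct ioctl_gate *g)
{
    pthread_mutex_lock(&g->lock);
    if (--g->in_flight == 0)
        pthread_cond_broadcast(&g->drained);
    pthread_mutex_unlock(&g->lock);
}

/* Called from the remove path: refuse new IOCTLs, drain the old ones. */
static void ioctl_gate_remove(struct ioctl_gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->removed = true;
    while (g->in_flight > 0)
        pthread_cond_wait(&g->drained, &g->lock);
    pthread_mutex_unlock(&g->lock);
}

/* Usage: an IOCTL before removal succeeds, one after it is rejected. */
static void ioctl_gate_demo(void)
{
    struct ioctl_gate g = IOCTL_GATE_INIT;

    assert(ioctl_gate_enter(&g) == 0);
    ioctl_gate_exit(&g);
    ioctl_gate_remove(&g);   /* nothing in flight, returns at once */
    assert(ioctl_gate_enter(&g) == -ENODEV);
}
```

Unlike the BAR migration case, late callers never wait here: once removal has begun there is nothing to wait for, so they get -ENODEV straight away.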


This is essentially what the DRM SRCU is doing as well.

For the hotplug case we could do this in the top level since we can signal the fence and don't need to block memory management.

But I'm not sure, maybe we should handle it the same way as reset or maybe we should have it at the top level.


Regards,
Christian.

The above approach should allow us to wait for all the IOCTLs in flight. Together with stopping scheduler threads and cancelling and flushing all in flight work items and timers, I think it should give us a full solution for the hot-unplug case
of preventing any MMIO accesses post device pci_remove.

Let me know what you think guys.

Andrey
And then testing, testing, testing to see if we have missed something.

Christian.

On 05.04.21 at 19:58, Andrey Grodzovsky wrote:
Dennis, Christian, are there any updates in the plan on how to move on with this? As you know I need very similar code for my up-streaming of device hot-unplug. My latest solution (https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html) was not acceptable because of low level guards on the register accessors level which was hurting performance. Basically I need a way to prevent any MMIO write accesses from kernel driver after device is removed (UMD accesses are taken care of by page faulting dummy page). We are using now hot-unplug code for Freemont program and so up-streaming became more of a priority than before. This MMIO access issue is currently my main blocker from up-streaming. Is there any way I can assist in pushing this on?
Andrey

On 2021-03-18 5:51 a.m., Christian König wrote:
On 18.03.21 at 10:30, Li, Dennis wrote:
>>> The GPU reset doesn't complete the fences we wait for. It only completes the hardware fences as part of the reset.
>>> So waiting for a fence while holding the reset lock is illegal and needs to be avoided.
I understood your concern. It is more complex for DRM GFX, therefore I abandon adding lock protection for DRM ioctls now. Maybe we can try to add all kernel dma_fence waiting in a list, and signal all in recovery threads. Do you have same concern for compute cases?
Yes, compute (KFD) is even harder to handle.
See you can't signal the dma_fence waiting. Waiting for a dma_fence also means you wait for the GPU reset to finish.
When we would signal the dma_fence during the GPU reset then we would run into memory corruption because the hardware jobs running after the GPU reset would access memory which is already freed.
>>> Lockdep also complains about this when it is used correctly. The only reason it doesn't complain here is because you use an atomic+wait_event instead of a locking primitive.
Agree. This approach will escape the monitor of lockdep. Its goal is to block other threads when GPU recovery thread start. But I couldn't find a better method to solve this problem. Do you have some suggestion?
Well, completely abandon those changes here.
What we need to do is to identify where hardware access happens and then insert taking the read side of the GPU reset lock so that we don't wait for a dma_fence or allocate memory, but still protect the hardware from concurrent access and reset.
Regards,
Christian.
Best Regards

Dennis Li

*From:* Koenig, Christian <christian.koe...@amd.com>
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis <dennis...@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <alexander.deuc...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>
*Subject:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Exactly that's what you don't seem to understand.
The GPU reset doesn't complete the fences we wait for. It only completes the hardware fences as part of the reset.
So waiting for a fence while holding the reset lock is illegal and needs to be avoided.
Lockdep also complains about this when it is used correctly. The only reason it doesn't complain here is because you use an atomic+wait_event instead of a locking primitive.
Regards,

Christian.

------------------------------------------------------------------------

*From:* Li, Dennis <dennis...@amd.com>
*Sent:* Thursday, March 18, 2021 09:28
*To:* Koenig, Christian <christian.koe...@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <alexander.deuc...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>
*Subject:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
>>> Those two steps need to be exchanged or otherwise it is possible that new delayed work items etc are started before the lock is taken.
What about adding a check for adev->in_gpu_reset in the work item? If we exchange the two steps, it may introduce a deadlock. For example, if a user thread holds the read lock and is waiting for a fence, and the recovery thread tries to take the write lock and only then complete fences, the recovery thread will be blocked forever.
Best Regards
Dennis Li
-----Original Message-----
From: Koenig, Christian <christian.koe...@amd.com>
Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis <dennis...@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <alexander.deuc...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
On 18.03.21 at 08:23, Dennis Li wrote:
> We have defined two variables, in_gpu_reset and reset_sem, in the adev object. The atomic variable in_gpu_reset is used to keep the recovery thread from being re-entered and to make lower functions return earlier when recovery starts, but it cannot block the recovery thread when it accesses hardware. The r/w semaphore reset_sem is used to solve these synchronization issues between the recovery thread and other threads.
>
> The original solution locked register access in lower functions, which introduces the following issues:
>
> 1) many lower functions are used in both the recovery thread and others. Firstly, we must harvest these functions, and it is easy to miss some. Secondly, these functions need to select which lock (read lock or write lock) will be used, according to the thread they are running in. If the thread context isn't considered, the added lock will easily introduce a deadlock. Besides that, most of the time, developers easily forget to add locks for new functions.
>
> 2) performance drop. Lower functions are called more frequently.
>
> 3) it easily introduces false positive lockdep complaints, because the write lock covers a big range in the recovery thread, while low level functions that hold the read lock may be protected by other locks in other threads.
>
> Therefore the new solution will try to add lock protection for ioctls of kfd. Its goal is that no threads except the recovery thread or its children (for xgmi) access hardware while doing GPU reset and resume. So refine the recovery thread as follows:
>
> Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
>     1). if it fails, the system already has a recovery thread running and the current thread exits directly;
>     2). if it succeeds, enter the recovery thread;
>
> Step 1: cancel all delayed works, stop the drm scheduler, complete all unreceived fences and so on. This tries to stop or pause other threads.
>
> Step 2: call down_write(&adev->reset_sem) to hold the write lock, which will block the recovery thread until other threads release their read locks.
Those two steps need to be exchanged or otherwise it is possible that new delayed work items etc are started before the lock is taken.
Just to make it clear until this is fixed the whole patch set is a NAK.
Regards,
Christian.

>
> Step 3: normally, only the recovery thread is now running and accessing hardware, so it is safe to do the gpu reset.
>
> Step 4: do post gpu reset work, such as calling all IPs' resume functions;
>
> Step 5: atomically set adev->in_gpu_reset to 0, wake up other threads and release the write lock. The recovery thread exits normally.
>
> Other threads call amdgpu_read_lock to synchronize with the recovery thread. If a thread finds that in_gpu_reset is 1, it should release the read lock if it holds one, and then block itself to wait for the recovery finished event. If a thread successfully holds the read lock and in_gpu_reset is 0, it continues. It will exit normally or be stopped by the recovery thread in step 1.
>
> Dennis Li (4):
>    drm/amdgpu: remove reset lock from low level functions
>    drm/amdgpu: refine the GPU recovery sequence
>    drm/amdgpu: instead of using down/up_read directly
>    drm/amdkfd: add reset lock protection for kfd entry functions
>
> drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   | 14 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 173 +++++++++++++-----
> .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    | 8 -
> drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c        | 4 +-
> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c         | 9 +-
> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         | 5 +-
> drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c         | 5 +-
> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 172 ++++++++++++++++-
> drivers/gpu/drm/amd/amdkfd/kfd_priv.h         | 3 +-
> drivers/gpu/drm/amd/amdkfd/kfd_process.c      | 4 +
> .../amd/amdkfd/kfd_process_queue_manager.c    | 17 ++
>   12 files changed, 345 insertions(+), 75 deletions(-)
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx