------------------------------------------------------------------
From: Felix Kuehling <felix.kuehl...@amd.com>
Sent: Tuesday, July 9, 2024, 06:40
To: 周春明(日月) <riyue....@alibaba-inc.com>; Tvrtko Ursulin <tursu...@ursulin.net>; 
dri-de...@lists.freedesktop.org <dri-de...@lists.freedesktop.org>; 
amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Dave Airlie 
<airl...@redhat.com>; Daniel Vetter <dan...@ffwll.ch>; criu <c...@openvz.org>
Cc: "Errabolu, Ramesh" <ramesh.errab...@amd.com>; "Christian König" 
<christian.koe...@amd.com>
Subject: Re: Re: Proposal to add CRIU support to DRM render nodes
On 2024-07-08 2:51, 周春明(日月) wrote:
> 
> Hi Felix,
> 
> While learning about the CRIU support you introduced in 
> https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu, 
> I came across this sentence:
> "ROCm manages memory in the form of buffer objects (BOs). We are also working 
> on a new memory management API that will be based on virtual address 
> ranges...", 
> Out of curiosity, what is this "new memory management based on virtual address 
> ranges"? Is there any introduction to it?
> Hi David,
> This refers to the SVM API that has been in the upstream driver for a while 
> now: 
> https://elixir.bootlin.com/linux/v6.9.8/source/include/uapi/linux/kfd_ioctl.h#L732
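> 
> To give a flavour of it: the API operates on virtual address ranges of the 
> process rather than on BO handles, through a single ioctl on /dev/kfd. A rough, 
> untested sketch of setting an attribute on a range (error handling trimmed):
> 
>     #include <stdlib.h>
>     #include <sys/ioctl.h>
>     #include <linux/kfd_ioctl.h>
> 
>     /* Set the preferred location of the SVM range [addr, addr+size) to the
>      * GPU identified by gpu_id. The range is just process virtual memory;
>      * no BO handle is involved. */
>     static int svm_set_preferred_loc(int kfd_fd, void *addr, __u64 size,
>                                      __u32 gpu_id)
>     {
>         struct kfd_ioctl_svm_args *args;
>         int ret;
> 
>         args = calloc(1, sizeof(*args) + sizeof(args->attrs[0]));
>         if (!args)
>             return -1;
>         args->start_addr = (__u64)(unsigned long)addr;
>         args->size = size;
>         args->op = KFD_IOCTL_SVM_OP_SET_ATTR;
>         args->nattr = 1;
>         args->attrs[0].type = KFD_IOCTL_SVM_ATTR_PREFERRED_LOC;
>         args->attrs[0].value = gpu_id;
>         ret = ioctl(kfd_fd, AMDKFD_IOC_SVM, args);
>         free(args);
>         return ret;
>     }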
[David] Can all ROCm runtime memory management switch to using the SVM APIs, 
with no need for BOs any more?
Thanks,
-David
Regards,
 Felix
> 
> Thanks,
> -David
> 
> ------------------------------------------------------------------
> From: Felix Kuehling <felix.kuehl...@amd.com>
> Sent: Friday, May 3, 2024, 22:44
> To: Tvrtko Ursulin <tursu...@ursulin.net>; dri-de...@lists.freedesktop.org 
> <dri-de...@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org 
> <amd-gfx@lists.freedesktop.org>; Dave Airlie <airl...@redhat.com>; Daniel 
> Vetter <dan...@ffwll.ch>; criu <c...@openvz.org>
> Cc: "Errabolu, Ramesh" <ramesh.errab...@amd.com>; "Christian König" 
> <christian.koe...@amd.com>
> Subject: Re: Proposal to add CRIU support to DRM render nodes
> 
> 
> 
> On 2024-04-16 10:04, Tvrtko Ursulin wrote:
> >
> > On 01/04/2024 18:58, Felix Kuehling wrote:
> >>
> >> On 2024-04-01 12:56, Tvrtko Ursulin wrote:
> >>>
> >>> On 01/04/2024 17:37, Felix Kuehling wrote:
> >>>> On 2024-04-01 11:09, Tvrtko Ursulin wrote:
> >>>>>
> >>>>> On 28/03/2024 20:42, Felix Kuehling wrote:
> >>>>>>
> >>>>>> On 2024-03-28 12:03, Tvrtko Ursulin wrote:
> >>>>>>>
> >>>>>>> Hi Felix,
> >>>>>>>
> >>>>>>> I had one more thought while browsing around the amdgpu CRIU plugin. 
> >>>>>>> It appears it relies on the KFD support being compiled in and 
> >>>>>>> /dev/kfd present, correct? AFAICT at least, it relies on that to 
> >>>>>>> figure out the amdgpu DRM node.
> >>>>>>>
> >>>>>>> It would probably be good to consider designing things without that 
> >>>>>>> dependency. So that checkpointing an application which does not use 
> >>>>>>> /dev/kfd is possible. Or if the kernel does not even have the KFD 
> >>>>>>> support compiled in.
> >>>>>>
> >>>>>> Yeah, if we want to support graphics apps that don't use KFD, we 
> >>>>>> should definitely do that. Currently we get a lot of topology 
> >>>>>> information from KFD, not even from the /dev/kfd device but from the 
> >>>>>> sysfs nodes exposed by KFD. We'd need to get GPU device info from the 
> >>>>>> render nodes instead. And if KFD is available, we may need to 
> >>>>>> integrate both sources of information.
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> It could perhaps mean no more than adding some GPU discovery code 
> >>>>>>> into CRIU, which should be flexible enough to account for things like 
> >>>>>>> re-assigned minor numbers due to driver reload.
> >>>>>>
> >>>>>> Do you mean adding GPU discovery to the core CRIU, or to the plugin? I 
> >>>>>> was thinking this is still part of the plugin.
> >>>>>
> >>>>> Yes I agree. I was only thinking about adding some DRM device discovery 
> >>>>> code in a more decoupled fashion from the current plugin, for both the 
> >>>>> reason discussed above (decoupling a bit from reliance on kfd sysfs), 
> >>>>> and then also if/when a new DRM driver might want to implement this the 
> >>>>> code could be moved to some common plugin area.
> >>>>>
> >>>>> I am not sure how feasible that would be though. The "gpu id" concept 
> >>>>> and its matching in the current kernel code and CRIU plugin - is that 
> >>>>> value tied to the physical GPU instance, or how does it work?
> >>>>
> >>>> The concept of the GPU ID is that it's stable while the system is up, 
> >>>> even when devices get added and removed dynamically. It was baked into 
> >>>> the API early on, but I don't think we ever fully validated device hot 
> >>>> plug. I think the closest we're getting is with our latest MI GPUs and 
> >>>> dynamic partition mode change.
> >>>
> >>> Doesn't it read the saved gpu id from the image file while doing restore 
> >>> and try to open the render node to match it? Maybe I am misreading the 
> >>> code.. But if it does, does it imply that in practice it could be stable 
> >>> across reboots? Or that it is not possible to restore to a different 
> >>> instance of maybe the same GPU model installed in a system?
> >>
> >> Ah, the idea is that when you restore on a different system, you may get 
> >> different GPU IDs. Or you may checkpoint an app running on GPU 1 but 
> >> restore it on GPU 2 on the same system. That's why we need to translate 
> >> GPU IDs in restored applications. User mode still uses the old GPU IDs, 
> >> but the kernel mode driver translates them to the actual GPU IDs of the 
> >> GPUs that the process was restored on.
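> >>
> >> Conceptually it is just a small per-process lookup table, something like 
> >> the following (illustrative sketch, not the actual kernel code):
> >>
> >> struct gpu_id_mapping {
> >>     __u32 user_gpu_id;   /* GPU ID the app saw when it was checkpointed */
> >>     __u32 actual_gpu_id; /* GPU ID of the device it was restored on */
> >> };
> >>
> >> /* Consulted on the ioctl paths of a restored process before looking up
> >>  * the device; for non-restored processes both IDs are identical. */
> >> static __u32 translate_gpu_id(const struct gpu_id_mapping *map, int n,
> >>                               __u32 user_gpu_id)
> >> {
> >>     for (int i = 0; i < n; i++)
> >>         if (map[i].user_gpu_id == user_gpu_id)
> >>             return map[i].actual_gpu_id;
> >>     return user_gpu_id;
> >> }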
> >
> > I see.. I think. The normal flow is ppd->user_gpu_id being set during client 
> > init, but for restored clients it gets overridden during restore so that any 
> > further ioctls don't instantly fail.
> >
> > And then in amdgpu_plugin_restore_file, when it is opening the render node, 
> > it relies on the kfd topology to have filled in (more or less) the 
> > target_gpu_id corresponding to the render node gpu id of the target GPU - 
> > the one associated with the new kfd gpu_id?
> 
> Yes.
> 
> >
> > I am digging into this because I am trying to see if some part of GPU 
> > discovery could somehow be decoupled.. to offer to work on at least that 
> > myself until you start to tackle the main body of the feature. But it looks 
> > properly tangled up.
> 
> OK. Most of the interesting plugin code should be in 
> amdgpu_plugin_topology.c. It currently has some pretty complicated logic to 
> find a set of devices that matches the topology in the checkpoint, including 
> shader ISA versions, numbers of compute units, memory sizes, firmware 
> versions and IO-Links between GPUs. This was originally done to support P2P 
> with XGMI links. I'm not sure we ever updated it to properly support PCIe P2P.
> 
> 
> >
> > Do you have any suggestions for what I could help with? Maybe developing 
> > some sort of drm device enumeration library if you see a way that would be 
> > useful in decoupling the device discovery from kfd. We would need to define 
> > what sort of information you would need to be queryable from it.
> 
> Maybe. I think a lot of device information is available with some amdgpu 
> info-ioctl. It may not cover all the things we're checking in the KFD 
> topology, though.
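> 
> For example, something along these lines with the libdrm amdgpu wrapper could 
> gather per-device info straight from a render node (rough sketch; which fields 
> would actually need to be matched is exactly what would have to be defined):
> 
>     #include <fcntl.h>
>     #include <stdint.h>
>     #include <stdio.h>
>     #include <unistd.h>
>     #include <amdgpu.h>   /* libdrm_amdgpu */
> 
>     /* Query basic device information directly from a render node,
>      * without touching /dev/kfd or the KFD sysfs topology. */
>     static int query_render_node(const char *path)
>     {
>         amdgpu_device_handle dev;
>         struct amdgpu_gpu_info info;
>         uint32_t major, minor;
>         int fd = open(path, O_RDWR);
> 
>         if (fd < 0)
>             return -1;
>         if (amdgpu_device_initialize(fd, &major, &minor, &dev)) {
>             close(fd);
>             return -1;
>         }
>         if (amdgpu_query_gpu_info(dev, &info) == 0)
>             printf("%s: family 0x%x, asic id 0x%x, %u active CUs\n",
>                    path, info.family_id, info.asic_id,
>                    info.cu_active_number);
>         amdgpu_device_deinitialize(dev);
>         close(fd);
>         return 0;
>     }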
> 
> >
> >>>> This also highlights another aspect on those spatially partitioned GPUs. 
> >>>> GPU IDs identify device partitions, not devices. Similarly, each 
> >>>> partition has its own render node, and the KFD topology info in sysfs 
> >>>> points to the render-minor number corresponding to each GPU ID.
> >>>
> >>> I am not familiar with this. This is not SR-IOV but some other kind of 
> >>> partitioning? Would you have any links where I could read more?
> >>
> >> Right, the bare-metal driver can partition a PF spatially without SRIOV. 
> >> SRIOV can also use spatial partitioning and expose each partition through 
> >> its own VF, but that's not useful for bare metal. Spatial partitioning is 
> >> new in MI300. There is some high-level info in this whitepaper: 
> >> https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf
> >
> > From the outside (userspace) this looks simply like multiple DRM render 
> > nodes or how exactly?
> 
> Yes, that's correct. Each partition has its own render node and its own node 
> in the KFD topology.
> 
> Regards,
> Felix
> 
> >
> > Regards,
> >
> > Tvrtko
> >
> >>
> >> Regards,
> >> Felix
> >>
> >>
> >>>
> >>> Regards,
> >>>
> >>> Tvrtko
> >>>
> >>>>>>> Otherwise I am eagerly waiting to hear more about the design 
> >>>>>>> specifics around dma-buf handling. And also to see how to extend this to 
> >>>>>>> other DRM related anonymous fds.
> >>>>>>
> >>>>>> I've been pretty far under-water lately. I hope I'll find time to work 
> >>>>>> on this more, but it's probably going to be at least a few weeks.
> >>>>>
> >>>>> Got it.
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>> Tvrtko
> >>>>>
> >>>>>>
> >>>>>> Regards,
> >>>>>> Felix
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>>
> >>>>>>> Tvrtko
> >>>>>>>
> >>>>>>> On 15/03/2024 18:36, Tvrtko Ursulin wrote:
> >>>>>>>>
> >>>>>>>> On 15/03/2024 02:33, Felix Kuehling wrote:
> >>>>>>>>>
> >>>>>>>>> On 2024-03-12 5:45, Tvrtko Ursulin wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 11/03/2024 14:48, Tvrtko Ursulin wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Felix,
> >>>>>>>>>>>
> >>>>>>>>>>> On 06/12/2023 21:23, Felix Kuehling wrote:
> >>>>>>>>>>>> Executive Summary: We need to add CRIU support to DRM render 
> >>>>>>>>>>>> nodes in order to maintain CRIU support for ROCm applications 
> >>>>>>>>>>>> once they start relying on render nodes for more GPU memory 
> >>>>>>>>>>>> management. In this email I'm providing some background why we 
> >>>>>>>>>>>> are doing this, and outlining some of the problems we need to 
> >>>>>>>>>>>> solve to checkpoint and restore render node state and shared 
> >>>>>>>>>>>> memory (DMABuf) state. I have some thoughts on the API design, 
> >>>>>>>>>>>> leaning on what we did for KFD, but would like to get feedback 
> >>>>>>>>>>>> from the DRI community regarding that API and to what extent 
> >>>>>>>>>>>> there is interest in making that generic.
> >>>>>>>>>>>>
> >>>>>>>>>>>> We are working on using DRM render nodes for virtual address 
> >>>>>>>>>>>> mappings in ROCm applications to implement the CUDA11-style VM 
> >>>>>>>>>>>> API and improve interoperability between graphics and compute. 
> >>>>>>>>>>>> This uses DMABufs for sharing buffer objects between KFD and 
> >>>>>>>>>>>> multiple render node devices, as well as between processes. In 
> >>>>>>>>>>>> the long run this also provides a path to moving all or most 
> >>>>>>>>>>>> memory management from the KFD ioctl API to libdrm.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Once ROCm user mode starts using render nodes for virtual 
> >>>>>>>>>>>> address management, that creates a problem for checkpointing and 
> >>>>>>>>>>>> restoring ROCm applications with CRIU. Currently there is no 
> >>>>>>>>>>>> support for checkpointing and restoring render node state, other 
> >>>>>>>>>>>> than CPU virtual address mappings. Support will be needed for 
> >>>>>>>>>>>> checkpointing GEM buffer objects and handles, their GPU virtual 
> >>>>>>>>>>>> address mappings and memory sharing relationships between 
> >>>>>>>>>>>> devices and processes.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Eventually, if full CRIU support for graphics applications is 
> >>>>>>>>>>>> desired, more state would need to be captured, including 
> >>>>>>>>>>>> scheduler contexts and BO lists. Most of this state is 
> >>>>>>>>>>>> driver-specific.
> >>>>>>>>>>>>
> >>>>>>>>>>>> After some internal discussions we decided to take our design 
> >>>>>>>>>>>> process public as this potentially touches DRM GEM and DMABuf 
> >>>>>>>>>>>> APIs and may have implications for other drivers in the future.
> >>>>>>>>>>>>
> >>>>>>>>>>>> One basic question before going into any API details: Is there a 
> >>>>>>>>>>>> desire to have CRIU support for other DRM drivers?
> >>>>>>>>>>>
> >>>>>>>>>>> This sounds like a very interesting feature overall, 
> >>>>>>>>>>> although I cannot answer the last question here.
> >>>>>>>>>>
> >>>>>>>>>> I forgot to finish this thought. I cannot answer / don't know of 
> >>>>>>>>>> any concrete plans, but I think the feature is pretty cool and if 
> >>>>>>>>>> amdgpu gets it working I wouldn't be surprised if other drivers 
> >>>>>>>>>> would get interested.
> >>>>>>>>>
> >>>>>>>>> Thanks, that's good to hear!
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> Funnily enough, it has a tiny relation to an i915 feature I 
> >>>>>>>>>>> recently implemented on Mesa's request, which is to be able to 
> >>>>>>>>>>> "upload" the GPU context from the GPU hang error state and replay 
> >>>>>>>>>>> the hanging request. It is kind of (at a stretch) a very special 
> >>>>>>>>>>> tiny subset of checkpoint and restore, so I am mentioning it merely 
> >>>>>>>>>>> as a curiosity.
> >>>>>>>>>>>
> >>>>>>>>>>> And there is also another partial conceptual intersection with the 
> >>>>>>>>>>> (at the moment not yet upstream) i915 online debugger. This part 
> >>>>>>>>>>> is in the area of discovering and enumerating GPU resources 
> >>>>>>>>>>> belonging to the client.
> >>>>>>>>>>>
> >>>>>>>>>>> I don't see any immediate design or code sharing opportunities 
> >>>>>>>>>>> though, but just mentioning it.
> >>>>>>>>>>>
> >>>>>>>>>>> I did spend some time reading your plugin and kernel 
> >>>>>>>>>>> implementation out of curiosity and have some comments and 
> >>>>>>>>>>> questions.
> >>>>>>>>>>>
> >>>>>>>>>>>> With that out of the way, some considerations for a possible DRM 
> >>>>>>>>>>>> CRIU API (either generic or AMDGPU driver specific): The API 
> >>>>>>>>>>>> goes through several phases during checkpoint and restore (a rough 
> >>>>>>>>>>>> sketch of a possible ioctl shape follows the lists below):
> >>>>>>>>>>>>
> >>>>>>>>>>>> Checkpoint:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. Process-info (enumerates objects and sizes so user mode can 
> >>>>>>>>>>>> allocate
> >>>>>>>>>>>> memory for the checkpoint, stops execution on the GPU)
> >>>>>>>>>>>> 2. Checkpoint (store object metadata for BOs, queues, etc.)
> >>>>>>>>>>>> 3. Unpause (resumes execution after the checkpoint is complete)
> >>>>>>>>>>>>
> >>>>>>>>>>>> Restore:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. Restore (restore objects, VMAs are not in the right place at 
> >>>>>>>>>>>> this time)
> >>>>>>>>>>>> 2. Resume (final fixups after the VMAs are sorted out, resume 
> >>>>>>>>>>>> execution)
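> >>>>>>>>>>>>
> >>>>>>>>>>>> For KFD we expressed these phases as operations of a single ioctl. A 
> >>>>>>>>>>>> simplified sketch of that shape (names abbreviated from what is in 
> >>>>>>>>>>>> kfd_ioctl.h today; a render-node version would differ in the details):
> >>>>>>>>>>>>
> >>>>>>>>>>>> enum criu_op {
> >>>>>>>>>>>>     CRIU_OP_PROCESS_INFO, /* object counts/sizes, stops GPU execution */
> >>>>>>>>>>>>     CRIU_OP_CHECKPOINT,   /* dump BOs, queues, ... into user buffers */
> >>>>>>>>>>>>     CRIU_OP_UNPAUSE,      /* resume execution after the checkpoint */
> >>>>>>>>>>>>     CRIU_OP_RESTORE,      /* recreate objects from the checkpoint */
> >>>>>>>>>>>>     CRIU_OP_RESUME,       /* final fixups, resume execution */
> >>>>>>>>>>>> };
> >>>>>>>>>>>>
> >>>>>>>>>>>> struct criu_args {
> >>>>>>>>>>>>     __u64 devices;        /* user pointer to device bucket array */
> >>>>>>>>>>>>     __u64 bos;            /* user pointer to BO bucket array */
> >>>>>>>>>>>>     __u64 priv_data;      /* opaque, driver-private blob */
> >>>>>>>>>>>>     __u64 priv_data_size;
> >>>>>>>>>>>>     __u32 num_devices;
> >>>>>>>>>>>>     __u32 num_bos;
> >>>>>>>>>>>>     __u32 num_objects;    /* queues, events, ... */
> >>>>>>>>>>>>     __u32 pid;            /* target PID when restoring */
> >>>>>>>>>>>>     __u32 op;             /* one of criu_op */
> >>>>>>>>>>>> };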
> >>>>>>>>>>>
> >>>>>>>>>>> Btw is check-pointing guaranteeing all relevant activity is 
> >>>>>>>>>>> idled? For instance dma_resv objects are free of fences which 
> >>>>>>>>>>> would need to be restored for things to continue executing sensibly? 
> >>>>>>>>>>> Or how is that handled?
> >>>>>>>>>
> >>>>>>>>> In our compute use cases, we suspend user mode queues. This can 
> >>>>>>>>> include CWSR (compute-wave-save-restore) where the state of 
> >>>>>>>>> in-flight waves is stored in memory and can be reloaded and resumed 
> >>>>>>>>> from memory later. We don't use any fences other than "eviction 
> >>>>>>>>> fences", that are signaled after the queues are suspended. And 
> >>>>>>>>> those fences are never handed to user mode. So we don't need to 
> >>>>>>>>> worry about any fence state in the checkpoint.
> >>>>>>>>>
> >>>>>>>>> If we extended this to support the kernel mode command submission 
> >>>>>>>>> APIs, I would expect that we'd wait for all current submissions to 
> >>>>>>>>> complete, and stop new ones from being sent to the HW before taking 
> >>>>>>>>> the checkpoint. When we take the checkpoint in the CRIU plugin, the 
> >>>>>>>>> CPU threads are already frozen and cannot submit any more work. If 
> >>>>>>>>> we wait for all currently pending submissions to drain, I think we 
> >>>>>>>>> don't need to save any fence state because all the fences will have 
> >>>>>>>>> signaled. (I may be missing some intricacies and I'm afraid it may 
> >>>>>>>>> not be that simple in reality, but that's my opening bid. ;)
> >>>>>>>>
> >>>>>>>> It feels feasible to me too, for the normally behaving clients at 
> >>>>>>>> least.
> >>>>>>>>
> >>>>>>>> Presumably, given that the whole checkpointing is not instant, it 
> >>>>>>>> would be okay to wait a second or two longer for the in-progress 
> >>>>>>>> submissions to complete. After which the kernel would need to prune all 
> >>>>>>>> signalled fences from the respective container objects before 
> >>>>>>>> checkpointing.
> >>>>>>>>
> >>>>>>>> For the "misbehaving" clients who have perhaps queued up too much 
> >>>>>>>> work, either still in the scheduler with unsatisfied dependencies, 
> >>>>>>>> or already submitted to the hardware and/or driver backend, is there 
> >>>>>>>> a timeout concept in CRIU so it would be possible to say something 
> >>>>>>>> like "try to checkpoint but if the kernel says no within time period t then 
> >>>>>>>> give up"?
> >>>>>>>>
> >>>>>>>>>>>> For some more background about our implementation in KFD, you 
> >>>>>>>>>>>> can refer to this whitepaper: 
> >>>>>>>>>>>> https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
> >>>>>>>>>>>>
> >>>>>>>>>>>> Potential objections to a KFD-style CRIU API in DRM render 
> >>>>>>>>>>>> nodes, I'll address each of them in more detail below:
> >>>>>>>>>>>>
> >>>>>>>>>>>> * Opaque information in the checkpoint data that user mode can't
> >>>>>>>>>>>> interpret or do anything with
> >>>>>>>>>>>> * A second API for creating objects (e.g. BOs) that is separate 
> >>>>>>>>>>>> from
> >>>>>>>>>>>> the regular BO creation API
> >>>>>>>>>>>> * Kernel mode would need to be involved in restoring BO sharing
> >>>>>>>>>>>> relationships rather than replaying BO creation, export and 
> >>>>>>>>>>>> import
> >>>>>>>>>>>> from user mode
> >>>>>>>>>>>>
> >>>>>>>>>>>> # Opaque information in the checkpoint
> >>>>>>>>>>>>
> >>>>>>>>>>>> This comes out of ABI compatibility considerations. Adding any 
> >>>>>>>>>>>> new objects or attributes to the driver/HW state that needs to 
> >>>>>>>>>>>> be checkpointed could potentially break the ABI of the CRIU 
> >>>>>>>>>>>> checkpoint/restore ioctl if the plugin needs to parse that 
> >>>>>>>>>>>> information. Therefore, much of the information in our KFD CRIU 
> >>>>>>>>>>>> ioctl API is opaque. It is written by kernel mode in the 
> >>>>>>>>>>>> checkpoint, it is consumed by kernel mode when restoring the 
> >>>>>>>>>>>> checkpoint, but user mode doesn't care about the contents or 
> >>>>>>>>>>>> binary layout, so there is no user mode ABI to break. This is 
> >>>>>>>>>>>> how we were able to maintain CRIU support when we added the SVM 
> >>>>>>>>>>>> API to KFD without changing the CRIU plugin and without breaking 
> >>>>>>>>>>>> our ABI.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Opaque information may also lend itself to API abstraction, if 
> >>>>>>>>>>>> this becomes a generic DRM API with driver-specific callbacks 
> >>>>>>>>>>>> that fill in HW-specific opaque data.
> >>>>>>>>>>>
> >>>>>>>>>>> This feels sound in principle to me. Fundamentally the state is 
> >>>>>>>>>>> very hardware specific, and/or driver version specific, so I don't 
> >>>>>>>>>>> see anything could be gained in practice by making it much less 
> >>>>>>>>>>> opaque. (Apart from making things more complicated.)
> >>>>>>>>>>>
> >>>>>>>>>>> I was however unsure of the current split of how you dump buffer 
> >>>>>>>>>>> objects with some data in the defined bo structure, and some in 
> >>>>>>>>>>> completely opaque private data. Is there a benefit to that split, 
> >>>>>>>>>>> or maybe in other words, is there a benefit to having part 
> >>>>>>>>>>> transparent and part opaque for buffer objects?
> >>>>>>>>>
> >>>>>>>>> Some of the buffer object state is needed by the plugin. E.g. the 
> >>>>>>>>> size and mmap offset are needed to match VMAs with BOs. I'd have to 
> >>>>>>>>> review the plugin in detail to prove that all the fields are, in 
> >>>>>>>>> fact, needed by the plugin, but that was the intent. Anything that 
> >>>>>>>>> the plugin doesn't need to know should be in the opaque data 
> >>>>>>>>> structures.
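> >>>>>>>>>
> >>>>>>>>> For reference, the transparent per-BO record is roughly the following 
> >>>>>>>>> (simplified from struct kfd_criu_bo_bucket; the comments are just my 
> >>>>>>>>> gloss), while everything else about the BO goes into the opaque priv 
> >>>>>>>>> data:
> >>>>>>>>>
> >>>>>>>>> struct criu_bo_bucket {       /* simplified sketch */
> >>>>>>>>>     __u64 addr;               /* GPU VA / mapping address */
> >>>>>>>>>     __u64 size;               /* needed to match VMAs with BOs */
> >>>>>>>>>     __u64 offset;             /* mmap offset on the device file */
> >>>>>>>>>     __u64 restored_offset;    /* filled in by the kernel on restore */
> >>>>>>>>>     __u32 gpu_id;             /* which device the BO lives on */
> >>>>>>>>>     __u32 alloc_flags;        /* VRAM vs. GTT, etc. */
> >>>>>>>>>     __u32 dmabuf_fd;          /* for shared BOs */
> >>>>>>>>>     __u32 pad;
> >>>>>>>>> };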
> >>>>>>>>
> >>>>>>>> Right, got it.
> >>>>>>>>
> >>>>>>>> Would it make sense to make the opaque data in the same block as the 
> >>>>>>>> defined one? I mean instead of separating the two in the binary 
> >>>>>>>> image for instance have struct kfd_criu_bo_bucket have a trailing 
> >>>>>>>> priv_data blob? Maybe it is too late now if the image format is not 
> >>>>>>>> versioned or something.
> >>>>>>>>
> >>>>>>>>>>> To slightly touch upon the question of whether this could become 
> >>>>>>>>>>> a generic DRM API. It feels it would be hard to do it from the 
> >>>>>>>>>>> start. What sounds more feasible is if/when generic looking 
> >>>>>>>>>>> helpers can be spotted while developing the RFC, then potentially 
> >>>>>>>>>>> structure the code so they can easily be promoted to shared/common 
> >>>>>>>>>>> code at some future point.
> >>>>>>>>>
> >>>>>>>>> Yes, that's how this usually goes, in my experience. Thanks for 
> >>>>>>>>> confirming.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> # Second API for creating objects
> >>>>>>>>>>>>
> >>>>>>>>>>>> Creating BOs and other objects when restoring a checkpoint needs 
> >>>>>>>>>>>> more information than the usual BO alloc and similar APIs 
> >>>>>>>>>>>> provide. For example, we need to restore BOs with the same GEM 
> >>>>>>>>>>>> handles so that user mode can continue using those handles after 
> >>>>>>>>>>>> resuming execution. If BOs are shared through DMABufs without 
> >>>>>>>>>>>> dynamic attachment, we need to restore pinned BOs as pinned. 
> >>>>>>>>>>>> Validation of virtual addresses and handling MMU notifiers must 
> >>>>>>>>>>>> be suspended until the virtual address space is restored. For 
> >>>>>>>>>>>> user mode queues we need to save and restore a lot of queue 
> >>>>>>>>>>>> execution state so that execution can resume cleanly.
> >>>>>>>>>>>
> >>>>>>>>>>> This also sounds justified to me. Restore-time creation of all internal 
> >>>>>>>>>>> objects is definitely special, and it sounds better to add uapi to 
> >>>>>>>>>>> create them directly with the correct properties than to add 
> >>>>>>>>>>> uapi to adjust internal properties after creation. In any case 
> >>>>>>>>>>> you would need some new uapi - at least to adjust things after 
> >>>>>>>>>>> creation - at which point you may as well have both in one. Internally 
> >>>>>>>>>>> the implementation can be split or common, whatever makes sense for a 
> >>>>>>>>>>> given object type, but new uapi definitely sounds required.
> >>>>>>>>>>>> # Restoring buffer sharing relationships
> >>>>>>>>>>>>
> >>>>>>>>>>>> Different GEM handles in different render nodes and processes 
> >>>>>>>>>>>> can refer to the same underlying shared memory, either by 
> >>>>>>>>>>>> directly pointing to the same GEM object, or by creating an 
> >>>>>>>>>>>> import attachment that may get its SG tables invalidated and 
> >>>>>>>>>>>> updated dynamically through dynamic attachment callbacks. In the 
> >>>>>>>>>>>> latter case it's obvious, who is the exporter and who is the 
> >>>>>>>>>>>> importer. In the first case, either one could be the exporter, 
> >>>>>>>>>>>> and it's not clear who would need to create the BO and who would 
> >>>>>>>>>>>> need to
> >>>>>>>>>>>
> >>>>>>>>>>> To see if I follow the former case correctly.
> >>>>>>>>>>>
> >>>>>>>>>>> This could be two clients A and B, where B has imported a dma-buf 
> >>>>>>>>>>> shared BO from A and has since closed the dma-buf fd? Which 
> >>>>>>>>>>> results in a single BO with reference count of 2 and 
> >>>>>>>>>>> obj->import_attach unset. History of who created the object is 
> >>>>>>>>>>> lost.
> >>>>>>>>>
> >>>>>>>>> Yes. In the amdgpu driver this happens when the exporter and import 
> >>>>>>>>> device are the same.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> In fact it could even be that two imported objects remain 
> >>>>>>>>>>> (clients A, B and C) and A, who originally created the BO, has 
> >>>>>>>>>>> since fully released it. So any kind of "creator" tracking if 
> >>>>>>>>>>> added wouldn't be fully reliable either.
> >>>>>>>>>
> >>>>>>>>> That's a good point.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> import it when restoring the checkpoint. To further complicate 
> >>>>>>>>>>>> things, multiple processes in a checkpoint get restored 
> >>>>>>>>>>>> concurrently. So there is no guarantee that an exporter has 
> >>>>>>>>>>>> restored a shared BO at the time an importer is trying to 
> >>>>>>>>>>>> restore its import.
> >>>>>>>>>>>>
> >>>>>>>>>>>> A proposal to deal with these problems would be to treat 
> >>>>>>>>>>>> importers and exporters the same. Whoever restores first, ends 
> >>>>>>>>>>>> up creating the BO and potentially attaching to it. The other 
> >>>>>>>>>>>> process(es) can find BOs that were already restored by another 
> >>>>>>>>>>>> process by looking it up with a unique ID that could be based on 
> >>>>>>>>>>>> the DMABuf inode number. An alternative would be a two-pass 
> >>>>>>>>>>>> approach that needs to restore BOs on two passes:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. Restore exported BOs
> >>>>>>>>>>>> 2. Restore imports
> >>>>>>>>>>>>
> >>>>>>>>>>>> With some inter-process synchronization in CRIU itself between 
> >>>>>>>>>>>> these two passes. This may require changes in the core CRIU, 
> >>>>>>>>>>>> outside our plugin. Both approaches depend on identifying BOs 
> >>>>>>>>>>>> with some unique ID that could be based on the DMABuf inode 
> >>>>>>>>>>>> number in the checkpoint. However, we would need to identify the 
> >>>>>>>>>>>> processes in the same restore session, possibly based on 
> >>>>>>>>>>>> parent/child process relationships, to create a scope where 
> >>>>>>>>>>>> those IDs are valid during restore.
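> >>>>>>>>>>>>
> >>>>>>>>>>>> In rough pseudo-C, the first approach on the restore path would look 
> >>>>>>>>>>>> something like this (illustrative only; criu_bo_lookup/criu_bo_register 
> >>>>>>>>>>>> stand for a to-be-designed kernel-side registry keyed by the 
> >>>>>>>>>>>> checkpoint-time unique ID):
> >>>>>>>>>>>>
> >>>>>>>>>>>> /* Restore one shared BO, regardless of whether this process was the
> >>>>>>>>>>>>  * exporter or an importer at checkpoint time. */
> >>>>>>>>>>>> static struct drm_gem_object *
> >>>>>>>>>>>> restore_shared_bo(struct restore_session *s, const struct bo_record *rec)
> >>>>>>>>>>>> {
> >>>>>>>>>>>>     struct drm_gem_object *obj;
> >>>>>>>>>>>>
> >>>>>>>>>>>>     /* Did another process in the same restore session get here first? */
> >>>>>>>>>>>>     obj = criu_bo_lookup(s, rec->unique_id); /* e.g. the dmabuf inode */
> >>>>>>>>>>>>     if (!obj) {
> >>>>>>>>>>>>         /* No: create the BO with its original properties ... */
> >>>>>>>>>>>>         obj = driver_create_bo(rec->size, rec->alloc_flags);
> >>>>>>>>>>>>         /* ... and publish it so later restorers can find it. */
> >>>>>>>>>>>>         criu_bo_register(s, rec->unique_id, obj);
> >>>>>>>>>>>>     }
> >>>>>>>>>>>>     /* Either way, recreate this process's GEM handle, GPU VA mappings
> >>>>>>>>>>>>      * and, if needed, its import attachment against obj. */
> >>>>>>>>>>>>     return obj;
> >>>>>>>>>>>> }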
> >>>>>>>>>>>
> >>>>>>>>>>> If my understanding above is on the right track, then I think 
> >>>>>>>>>>> this is the only thing which can be done (for all scenarios).
> >>>>>>>>>
> >>>>>>>>> I presented two alternatives. I think you're in favor of the first 
> >>>>>>>>> one, where it doesn't matter who is the importer and exporter. I 
> >>>>>>>>> think the two-pass approach requires that you can identify an 
> >>>>>>>>> exporter. And as you pointed out, the exporter may already have 
> >>>>>>>>> dropped their reference to the BO.
> >>>>>>>>
> >>>>>>>> Yep.
> >>>>>>>>
> >>>>>>>>>>> I also *think* it would be safe. At least at the moment I cannot 
> >>>>>>>>>>> think what could go wrong. Semantics are that it doesn't really 
> >>>>>>>>>>> matter who created the object.
> >>>>>>>>>
> >>>>>>>>> I would agree. What matters is that the object is recreated on the 
> >>>>>>>>> correct device, and that all the direct references and import 
> >>>>>>>>> attachments pointing to it are restored.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> Finally, we would also need to checkpoint and restore DMABuf 
> >>>>>>>>>>>> file descriptors themselves. These are anonymous file 
> >>>>>>>>>>>> descriptors. The CRIU plugin could probably be taught to 
> >>>>>>>>>>>> recreate them from the original exported BO based on the inode 
> >>>>>>>>>>>> number that could be queried with fstat in the checkpoint. It 
> >>>>>>>>>>>> would need help from the render node CRIU API to find the right 
> >>>>>>>>>>>> BO from the inode, which may be from a different process in the 
> >>>>>>>>>>>> same restore session.
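> >>>>>>>>>>>>
> >>>>>>>>>>>> Querying that inode number in the checkpoint is straightforward, e.g. 
> >>>>>>>>>>>> (sketch):
> >>>>>>>>>>>>
> >>>>>>>>>>>> #include <stdint.h>
> >>>>>>>>>>>> #include <sys/stat.h>
> >>>>>>>>>>>>
> >>>>>>>>>>>> /* Derive a stable identifier for a DMABuf fd from its anonymous-inode
> >>>>>>>>>>>>  * number during the checkpoint phase. */
> >>>>>>>>>>>> static int dmabuf_unique_id(int dmabuf_fd, uint64_t *id)
> >>>>>>>>>>>> {
> >>>>>>>>>>>>     struct stat st;
> >>>>>>>>>>>>
> >>>>>>>>>>>>     if (fstat(dmabuf_fd, &st))
> >>>>>>>>>>>>         return -1;
> >>>>>>>>>>>>     *id = (uint64_t)st.st_ino;
> >>>>>>>>>>>>     return 0;
> >>>>>>>>>>>> }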
> >>>>>>>>>>>
> >>>>>>>>>>> This part feels like it is breaking the component separation a 
> >>>>>>>>>>> bit because even for buffers fully owned by amdgpu, strictly 
> >>>>>>>>>>> speaking the dma-buf fd is not. At least my understanding from 
> >>>>>>>>>>> the above is that you propose to attempt to import the fd, from 
> >>>>>>>>>>> the kernel side, during the checkpoint process? Like:
> >>>>>>>>>>>
> >>>>>>>>>>> Checkpoint:
> >>>>>>>>>>>
> >>>>>>>>>>> CRIU for each anon fd:
> >>>>>>>>>>>   amdgpu_plugin(fd)
> >>>>>>>>>>>     -> attempt in-kernel dma-buf import (passes fd to kernel via ioctl?)
> >>>>>>>>>>>     -> is it ours? (no -> error)
> >>>>>>>>>>>     -> create a record mapping fd number to amdgpu BO
> >>>>>>>>>>>
> >>>>>>>>>>> Restore:
> >>>>>>>>>>>
> >>>>>>>>>>> for each dma-buf fd record:
> >>>>>>>>>>>   create BO if it does not exist
> >>>>>>>>>>>   export BO to the same fd
> >>>>>>>>>>>   close BO handle if not in the regular BO handle records
> >>>>>>>>>>>
> >>>>>>>>>>> Or, since you mention lookup by inode, that would need to be a 
> >>>>>>>>>>> dmabuf_plugin so it can look up inodes in the private mount space. 
> >>>>>>>>>>> However, how it would co-operate with amdgpu_plugin is not clear 
> >>>>>>>>>>> to me.
> >>>>>>>>>
> >>>>>>>>> The way I think about the ownership is, whichever driver created 
> >>>>>>>>> the underlying BO owns the checkpointing of the dmabuf. You need 
> >>>>>>>>> driver-specific information to link the dmabuf with the driver's BO 
> >>>>>>>>> and you need the right driver to recreate the BO and the dmabuf fd 
> >>>>>>>>> when restoring the checkpoint.
> >>>>>>>>>
> >>>>>>>>> It gets really interesting if you have an amdgpu plugin and an i915 
> >>>>>>>>> plugin, and they checkpoint an application that shares BOs between 
> >>>>>>>>> the two devices through DMABufs. E.g. if i915 created a BO and 
> >>>>>>>>> amdgpu imported it, then during restore, i915 needs to restore the 
> >>>>>>>>> dmabuf before the amdgpu import of it can be restored. I think that 
> >>>>>>>>> brings us back to a two-phase approach to restoring the memory 
> >>>>>>>>> sharing relationships. Uff.
> >>>>>>>>
> >>>>>>>> I think this part of the discussion somewhat depends on the previous 
> >>>>>>>> part about idling. If it is feasible to completely idle and prune, 
> >>>>>>>> and fail if that is not happening quickly enough, then maybe there 
> >>>>>>>> wouldn't be too much hierarchical state to save.
> >>>>>>>>
> >>>>>>>> Otherwise my idea was that there is a top-level drm_plugin.so which 
> >>>>>>>> understands amdgpu fds, i915, syncobj, sync_file, and uses some new 
> >>>>>>>> uapi to uniquely identify each, associate with the correct driver, 
> >>>>>>>> and then internally dispatches to amdgpu|i915|dmabuf|..._plugin.so. 
> >>>>>>>> Building the in memory representation of their relationships. As 
> >>>>>>>> long as all objects and their relationships have been recorded I 
> >>>>>>>> think everything could then be correctly restored.
> >>>>>>>>
> >>>>>>>> It is possible there is a flaw in my thinking and that something in 
> >>>>>>>> CRIU design would make this impossible? I think it would require the 
> >>>>>>>> top-level drm_plugin.so to hold all state in memory until the whole 
> >>>>>>>> checkpointing is done, and then verify something is not incomplete, 
> >>>>>>>> failing it all if it was. (For instance, one plugin discovered a 
> >>>>>>>> reference to an object which was not discovered by any other plugin, 
> >>>>>>>> or things like that.) This may need some further tweaks to CRIU common 
> >>>>>>>> code.
> >>>>>>>>
> >>>>>>>> Maybe I need to better understand how exactly you mean to query the 
> >>>>>>>> DRM driver about random anonymous fds. I see it as a problem in the 
> >>>>>>>> design, possibly even implementation, but maybe I am missing 
> >>>>>>>> something which makes it not so. I mean even with my general idea I 
> >>>>>>>> don't know how one would determine which driver to query about a 
> >>>>>>>> particular anonymous inode.
> >>>>>>>>
> >>>>>>>>>> I later also realised that I was maybe increasing the scope for 
> >>>>>>>>>> you here. :) You did state the focus is ROCm applications, which 
> >>>>>>>>>> possibly don't care about dma-resv, fences, syncobjs etc?
> >>>>>>>>>
> >>>>>>>>> That's my focus for now. But I don't want to engineer a solution 
> >>>>>>>>> that would preclude your use cases in the future.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> But I think the "how to handle dma-bufs" design question is still 
> >>>>>>>>>> relevant and interesting. For example I had this thought that 
> >>>>>>>>>> perhaps what would be needed is a CRIU plugin hierarchy.
> >>>>>>>>>>
> >>>>>>>>>> Because fundamentally we would be snapshotting a hierarchy of kernel 
> >>>>>>>>>> objects belonging to different drivers (kfd, amdgpu, dma-buf, 
> >>>>>>>>>> ...). And if one day someone were to try to handle dma fences and 
> >>>>>>>>>> drm syncobjs, the argument for a hierarchical design would be even 
> >>>>>>>>>> stronger I think.
> >>>>>>>>>>
> >>>>>>>>>> Something like a drm_plugin.so could call sub-plugins (amdgpu, 
> >>>>>>>>>> dma-buf, sync file, ...) and internally build the representation 
> >>>>>>>>>> of the whole state and the relationships between the objects.
> >>>>>>>>>
> >>>>>>>>> Maybe. I guess a structure similar to libdrm makes sense. I'm not 
> >>>>>>>>> sure it's strictly a hierarchy. Maybe more like some common code 
> >>>>>>>>> shared by multiple GPU driver plugins. I think the common 
> >>>>>>>>> checkpoint state is quite limited and restoring it requires the 
> >>>>>>>>> GPU-specific drivers anyway.
> >>>>>>>>>
> >>>>>>>>> Also the idea of building a representation of the whole state 
> >>>>>>>>> doesn't work well with the CRIU design, because "the whole state" 
> >>>>>>>>> can include multiple processes that restore themselves concurrently 
> >>>>>>>>> and only synchronize with each other in a few places in the restore 
> >>>>>>>>> process. I feel, if we can work out how to checkpoint and restore 
> >>>>>>>>> shared objects between processes, we can do the same for shared 
> >>>>>>>>> objects between drivers without imposing a strict hierarchy and 
> >>>>>>>>> some omniscient entity that needs to know "the whole state".
> >>>>>>>>
> >>>>>>>> Okay, this continues on the same problem space as above. And you 
> >>>>>>>> obviously know how CRIU works much better than me.
> >>>>>>>>
> >>>>>>>>>> With that kind of design there probably would be a need to define 
> >>>>>>>>>> some common kernel side api and uapi, so all involved objects can 
> >>>>>>>>>> be enumerated with some unique ids etc.
> >>>>>>>>>>
> >>>>>>>>>> Now.. the counter argument.. the more state from different drivers 
> >>>>>>>>>> one would want to handle, the bigger this project would get. Whether 
> >>>>>>>>>> it would even be feasible is the question, to the point that it may be 
> >>>>>>>>>> simpler to just run the workload in a VM via SR-IOV and simply 
> >>>>>>>>>> hibernate the whole guest. :)
> >>>>>>>>>
> >>>>>>>>> Well, CRIU kind of tries to do that, but with containers instead of 
> >>>>>>>>> VMs. ;)
> >>>>>>>>
> >>>>>>>> It would definitely be useful for hardware and drivers without 
> >>>>>>>> SR-IOV support, so let's hope it is doable. :)
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>>
> >>>>>>>> Tvrtko
