On 2024-04-16 10:04, Tvrtko Ursulin wrote:
> 
> On 01/04/2024 18:58, Felix Kuehling wrote:
>>
>> On 2024-04-01 12:56, Tvrtko Ursulin wrote:
>>>
>>> On 01/04/2024 17:37, Felix Kuehling wrote:
>>>> On 2024-04-01 11:09, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 28/03/2024 20:42, Felix Kuehling wrote:
>>>>>>
>>>>>> On 2024-03-28 12:03, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> Hi Felix,
>>>>>>>
>>>>>>> I had one more thought while browsing around the amdgpu CRIU plugin. It 
>>>>>>> appears it relies on the KFD support being compiled in and /dev/kfd 
>>>>>>> present, correct? AFAICT at least, it relies on that to figure out the 
>>>>>>> amdgpu DRM node.
>>>>>>>
>>>>>>> It would probably be good to consider designing things without that 
>>>>>>> dependency. So that checkpointing an application which does not use 
>>>>>>> /dev/kfd is possible. Or if the kernel does not even have the KFD 
>>>>>>> support compiled in.
>>>>>>
>>>>>> Yeah, if we want to support graphics apps that don't use KFD, we should 
>>>>>> definitely do that. Currently we get a lot of topology information from 
>>>>>> KFD, not even from the /dev/kfd device but from the sysfs nodes exposed 
>>>>>> by KFD. We'd need to get GPU device info from the render nodes instead. 
>>>>>> And if KFD is available, we may need to integrate both sources of 
>>>>>> information.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> It could perhaps mean no more than adding some GPU discovery code into 
>>>>>>> CRIU. Which should be flexible enough to account for things like 
>>>>>>> re-assigned minor numbers due to driver reload.
>>>>>>
>>>>>> Do you mean adding GPU discovery to the core CRIU, or to the plugin. I 
>>>>>> was thinking this is still part of the plugin.
>>>>>
>>>>> Yes I agree. I was only thinking about adding some DRM device discovery 
>>>>> code in a more decoupled fashion from the current plugin, for both the 
>>>>> reason discussed above (decoupling a bit from reliance on kfd sysfs), and 
>>>>> then also if/when a new DRM driver might want to implement this the code 
>>>>> could be moved to some common plugin area.
>>>>>
>>>>> I am not sure how feasible that would be though. The "gpu id" concept and 
>>>>> its matching in the current kernel code and CRIU plugin - is that value 
>>>>> tied to the physical GPU instance or how it works?
>>>>
>>>> The concept of the GPU ID is that it's stable while the system is up, even 
>>>> when devices get added and removed dynamically. It was baked into the API 
>>>> early on, but I don't think we ever fully validated device hot plug. I 
>>>> think the closest we're getting is with our latest MI GPUs and dynamic 
>>>> partition mode change.
>>>
>>> Doesn't it read the saved gpu id from the image file while doing restore 
>>> and try to open the render node to match it? Maybe I am misreading the 
>>> code.. But if it does, does it imply that in practice it could be stable 
>>> across reboots? Or that it is not possible to restore to a different 
>>> instance of maybe the same GPU model installed in a system?
>>
>> Ah, the idea is, that when you restore on a different system, you may get 
>> different GPU IDs. Or you may checkpoint an app running on GPU 1 but restore 
>> it on GPU 2 on the same system. That's why we need to translate GPU IDs in 
>> restored applications. User mode still uses the old GPU IDs, but the kernel 
>> mode driver translates them to the actual GPU IDs of the GPUs that the 
>> process was restored on.
> 
> I see.. I think. Normal flow is ppd->user_gpu_id set during client init, but 
> for restored clients it gets overridden during restore so that any further 
> ioctls do not instantly fail.
> 
> And then in amdgpu_plugin_restore_file, when it is opening the render node, 
> it relies on the kfd topology to have filled in (more or less) the 
> target_gpu_id corresponding to the render node gpu id of the target GPU - the 
> one associated with the new kfd gpu_id?

Yes.
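As a concrete illustration of that translation, the mapping kept for a restored process is conceptually just a small lookup table. This is a hedged sketch only; the names are invented and not the actual kfd data structures:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the GPU-ID translation described above: user mode keeps using
 * the gpu_ids recorded in the checkpoint, and the kernel maps them to the
 * ids of the GPUs the process was actually restored on. Illustrative only. */

struct gpu_id_map {
    uint32_t user_gpu_id;   /* id the checkpointed application knows */
    uint32_t actual_gpu_id; /* id of the GPU it was restored on */
};

/* Returns the actual gpu_id for a user-visible one, or 0 if unknown. */
static uint32_t translate_gpu_id(const struct gpu_id_map *map, size_t n,
                                 uint32_t user_gpu_id)
{
    for (size_t i = 0; i < n; i++)
        if (map[i].user_gpu_id == user_gpu_id)
            return map[i].actual_gpu_id;
    return 0;
}
```

In-kernel the lookup would of course hang off the per-process data (ppd), not a flat array; the point is only that user mode never sees the new ids.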

> 
> I am digging into this because I am trying to see if some part of GPU 
> discovery could somehow be decoupled.. to offer to work on at least that 
> until you start to tackle the main body of the feature. But it looks properly 
> tangled up.

OK. Most of the interesting plugin code should be in amdgpu_plugin_topology.c. 
It currently has some pretty complicated logic to find a set of devices that 
matches the topology in the checkpoint, including shader ISA versions, numbers 
of compute units, memory sizes, firmware versions and IO-Links between GPUs. 
This was originally done to support P2P with XGMI links. I'm not sure we ever 
updated it to properly support PCIe P2P.
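For a rough idea of what that matching involves, the per-device part reduces to a compatibility check along these lines. The field names are invented for the example; the real plugin compares more properties (firmware versions, IO-links between GPUs, and so on), as noted above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of the per-device matching amdgpu_plugin_topology.c
 * performs: a device from the checkpoint is only paired with a target
 * device whose capabilities are compatible. Field names are made up. */

struct device_props {
    uint32_t gfx_target_version; /* shader ISA version */
    uint32_t simd_count;         /* compute-unit / SIMD count */
    uint64_t vram_size;          /* local memory size in bytes */
};

static bool device_compatible(const struct device_props *ckpt,
                              const struct device_props *target)
{
    /* ISA must match exactly; the target must be at least as large. */
    return ckpt->gfx_target_version == target->gfx_target_version &&
           target->simd_count >= ckpt->simd_count &&
           target->vram_size >= ckpt->vram_size;
}
```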


> 
> Do you have any suggestions with what I could help with? Maybe developing 
> some sort of drm device enumeration library if you see a way that would be 
> useful in decoupling the device discovery from kfd. We would need to define 
> what sort of information you would need to be queryable from it.

Maybe. I think a lot of device information is available with some amdgpu 
info-ioctl. It may not cover all the things we're checking in the KFD topology, 
though.
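A minimal sketch of KFD-independent discovery, just to make the idea concrete: scan a /dev/dri-style directory for render nodes. A real library would presumably use libdrm (drmGetDevices2()) and the amdgpu info ioctl to enumerate devices and query their properties rather than doing this by hand:

```c
#include <dirent.h>
#include <stdio.h>

/* Sketch only: find DRM render minors (renderD128, renderD129, ...) by
 * scanning a /dev/dri-style directory, with no dependency on /dev/kfd or
 * the KFD sysfs topology. The directory is a parameter so the idea can be
 * exercised anywhere. */

/* Fills 'minors' with up to 'max' render minor numbers found in 'dri_dir';
 * returns how many were found, or -1 if the directory can't be opened. */
static int find_render_minors(const char *dri_dir, int *minors, int max)
{
    DIR *d = opendir(dri_dir);
    if (!d)
        return -1;

    int n = 0;
    struct dirent *ent;
    while ((ent = readdir(d)) != NULL && n < max) {
        int minor;
        /* render nodes are named renderD<minor>, minors start at 128 */
        if (sscanf(ent->d_name, "renderD%d", &minor) == 1)
            minors[n++] = minor;
    }
    closedir(d);
    return n;
}
```

Being robust against re-assigned minors after a driver reload would then be a matter of re-scanning and matching on stable properties (PCI bus id), not caching minor numbers.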

> 
>>>> This also highlights another aspect on those spatially partitioned GPUs. 
>>>> GPU IDs identify device partitions, not devices. Similarly, each partition 
>>>> has its own render node, and the KFD topology info in sysfs points to the 
>>>> render-minor number corresponding to each GPU ID.
>>>
>>> I am not familiar with this. This is not SR-IOV but some other kind of 
>>> partitioning? Would you have any links where I could read more?
>>
>> Right, the bare-metal driver can partition a PF spatially without SRIOV. 
>> SRIOV can also use spatial partitioning and expose each partition through 
>> its own VF, but that's not useful for bare metal. Spatial partitioning is 
>> new in MI300. There is some high-level info in this whitepaper: 
>> https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf.
> 
> From the outside (userspace) this looks simply like multiple DRM render nodes 
> or how exactly?

Yes, that's correct. Each partition has its own render node and its own node in 
the KFD topology.

Regards,
  Felix

> 
> Regards,
> 
> Tvrtko
> 
>>
>> Regards,
>>    Felix
>>
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>>>>> Otherwise I am eagerly awaiting to hear more about the design specifics 
>>>>>>> around dma-buf handling. And also seeing how to extend to other DRM 
>>>>>>> related anonymous fds.
>>>>>>
>>>>>> I've been pretty far under-water lately. I hope I'll find time to work 
>>>>>> on this more, but it's probably going to be at least a few weeks.
>>>>>
>>>>> Got it.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>    Felix
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Tvrtko
>>>>>>>
>>>>>>> On 15/03/2024 18:36, Tvrtko Ursulin wrote:
>>>>>>>>
>>>>>>>> On 15/03/2024 02:33, Felix Kuehling wrote:
>>>>>>>>>
>>>>>>>>> On 2024-03-12 5:45, Tvrtko Ursulin wrote:
>>>>>>>>>>
>>>>>>>>>> On 11/03/2024 14:48, Tvrtko Ursulin wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Felix,
>>>>>>>>>>>
>>>>>>>>>>> On 06/12/2023 21:23, Felix Kuehling wrote:
>>>>>>>>>>>> Executive Summary: We need to add CRIU support to DRM render nodes 
>>>>>>>>>>>> in order to maintain CRIU support for ROCm applications once they 
>>>>>>>>>>>> start relying on render nodes for more GPU memory management. In 
>>>>>>>>>>>> this email I'm providing some background why we are doing this, 
>>>>>>>>>>>> and outlining some of the problems we need to solve to checkpoint 
>>>>>>>>>>>> and restore render node state and shared memory (DMABuf) state. I 
>>>>>>>>>>>> have some thoughts on the API design, leaning on what we did for 
>>>>>>>>>>>> KFD, but would like to get feedback from the DRI community 
>>>>>>>>>>>> regarding that API and to what extent there is interest in making 
>>>>>>>>>>>> that generic.
>>>>>>>>>>>>
>>>>>>>>>>>> We are working on using DRM render nodes for virtual address 
>>>>>>>>>>>> mappings in ROCm applications to implement the CUDA11-style VM API 
>>>>>>>>>>>> and improve interoperability between graphics and compute. This 
>>>>>>>>>>>> uses DMABufs for sharing buffer objects between KFD and multiple 
>>>>>>>>>>>> render node devices, as well as between processes. In the long run 
>>>>>>>>>>>> this also provides a path to moving all or most memory management 
>>>>>>>>>>>> from the KFD ioctl API to libdrm.
>>>>>>>>>>>>
>>>>>>>>>>>> Once ROCm user mode starts using render nodes for virtual address 
>>>>>>>>>>>> management, that creates a problem for checkpointing and restoring 
>>>>>>>>>>>> ROCm applications with CRIU. Currently there is no support for 
>>>>>>>>>>>> checkpointing and restoring render node state, other than CPU 
>>>>>>>>>>>> virtual address mappings. Support will be needed for checkpointing 
>>>>>>>>>>>> GEM buffer objects and handles, their GPU virtual address mappings 
>>>>>>>>>>>> and memory sharing relationships between devices and processes.
>>>>>>>>>>>>
>>>>>>>>>>>> Eventually, if full CRIU support for graphics applications is 
>>>>>>>>>>>> desired, more state would need to be captured, including scheduler 
>>>>>>>>>>>> contexts and BO lists. Most of this state is driver-specific.
>>>>>>>>>>>>
>>>>>>>>>>>> After some internal discussions we decided to take our design 
>>>>>>>>>>>> process public as this potentially touches DRM GEM and DMABuf APIs 
>>>>>>>>>>>> and may have implications for other drivers in the future.
>>>>>>>>>>>>
>>>>>>>>>>>> One basic question before going into any API details: Is there a 
>>>>>>>>>>>> desire to have CRIU support for other DRM drivers?
>>>>>>>>>>>
>>>>>>>>>>> This sounds like a very interesting feature overall, although I 
>>>>>>>>>>> cannot answer the last question here.
>>>>>>>>>>
>>>>>>>>>> I forgot to finish this thought. I cannot answer / don't know of any 
>>>>>>>>>> concrete plans, but I think the feature is pretty cool and if amdgpu 
>>>>>>>>>> gets it working I wouldn't be surprised if other drivers would get 
>>>>>>>>>> interested.
>>>>>>>>>
>>>>>>>>> Thanks, that's good to hear!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Funnily enough, it has a tiny relation to an i915 feature I 
>>>>>>>>>>> recently implemented on Mesa's request, which is to be able to 
>>>>>>>>>>> "upload" the GPU context from the GPU hang error state and replay 
>>>>>>>>>>> the hanging request. It is kind of (at a stretch) a very special 
>>>>>>>>>>> tiny subset of checkpoint and restore, so I am mentioning it only as 
>>>>>>>>>>> a curiosity.
>>>>>>>>>>>
>>>>>>>>>>> And there is also another partial conceptual intersection with the 
>>>>>>>>>>> (at the moment not yet upstream) i915 online debugger. This part 
>>>>>>>>>>> being in the area of discovering and enumerating GPU resources 
>>>>>>>>>>> belonging to the client.
>>>>>>>>>>>
>>>>>>>>>>> I don't see immediate design or code sharing opportunities 
>>>>>>>>>>> though, just mentioning it.
>>>>>>>>>>>
>>>>>>>>>>> I did spend some time reading your plugin and kernel implementation 
>>>>>>>>>>> out of curiosity and have some comments and questions.
>>>>>>>>>>>
>>>>>>>>>>>> With that out of the way, some considerations for a possible DRM 
>>>>>>>>>>>> CRIU API (either generic or AMDGPU driver specific): The API goes 
>>>>>>>>>>>> through several phases during checkpoint and restore:
>>>>>>>>>>>>
>>>>>>>>>>>> Checkpoint:
>>>>>>>>>>>>
>>>>>>>>>>>>  1. Process-info (enumerates objects and sizes so user mode can 
>>>>>>>>>>>> allocate
>>>>>>>>>>>>     memory for the checkpoint, stops execution on the GPU)
>>>>>>>>>>>>  2. Checkpoint (store object metadata for BOs, queues, etc.)
>>>>>>>>>>>>  3. Unpause (resumes execution after the checkpoint is complete)
>>>>>>>>>>>>
>>>>>>>>>>>> Restore:
>>>>>>>>>>>>
>>>>>>>>>>>>  1. Restore (restore objects, VMAs are not in the right place at 
>>>>>>>>>>>> this time)
>>>>>>>>>>>>  2. Resume (final fixups after the VMAs are sorted out, resume 
>>>>>>>>>>>> execution)
>>>>>>>>>>>
>>>>>>>>>>> Btw is check-pointing guaranteeing all relevant activity is idled? 
>>>>>>>>>>> For instance dma_resv objects are free of fences which would need 
>>>>>>>>>>> to be restored for things to continue executing sensibly? Or how is 
>>>>>>>>>>> that handled?
>>>>>>>>>
>>>>>>>>> In our compute use cases, we suspend user mode queues. This can 
>>>>>>>>> include CWSR (compute-wave-save-restore) where the state of in-flight 
>>>>>>>>> waves is stored in memory and can be reloaded and resumed from memory 
>>>>>>>>> later. We don't use any fences other than "eviction fences", that are 
>>>>>>>>> signaled after the queues are suspended. And those fences are never 
>>>>>>>>> handed to user mode. So we don't need to worry about any fence state 
>>>>>>>>> in the checkpoint.
>>>>>>>>>
>>>>>>>>> If we extended this to support the kernel mode command submission 
>>>>>>>>> APIs, I would expect that we'd wait for all current submissions to 
>>>>>>>>> complete, and stop new ones from being sent to the HW before taking 
>>>>>>>>> the checkpoint. When we take the checkpoint in the CRIU plugin, the 
>>>>>>>>> CPU threads are already frozen and cannot submit any more work. If we 
>>>>>>>>> wait for all currently pending submissions to drain, I think we don't 
>>>>>>>>> need to save any fence state because all the fences will have 
>>>>>>>>> signaled. (I may be missing some intricacies and I'm afraid it may 
>>>>>>>>> not be that simple in reality, but that's my opening bid. ;)
>>>>>>>>
>>>>>>>> It feels feasible to me too, for the normally behaving clients at 
>>>>>>>> least.
>>>>>>>>
>>>>>>>> Presumably, given that the whole checkpointing is not instant, it 
>>>>>>>> would be okay to wait a second or two longer for the in-progress 
>>>>>>>> submissions to complete. After which the kernel would need to prune all 
>>>>>>>> signalled fences from the respective container objects before 
>>>>>>>> checkpointing.
>>>>>>>>
>>>>>>>> For the "misbehaving" clients who have perhaps queued up too much 
>>>>>>>> work, either still in the scheduler with unsatisfied dependencies, or 
>>>>>>>> already submitted to the hardware and/or driver backend, is there a 
>>>>>>>> timeout concept in CRIU so it would be possible to say something like 
>>>>>>>> "try to checkpoint, but if the kernel says no within time period t then 
>>>>>>>> give up"?
>>>>>>>>
>>>>>>>>>>>> For some more background about our implementation in KFD, you can 
>>>>>>>>>>>> refer to this whitepaper: 
>>>>>>>>>>>> https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
>>>>>>>>>>>>
>>>>>>>>>>>> Potential objections to a KFD-style CRIU API in DRM render nodes, 
>>>>>>>>>>>> I'll address each of them in more detail below:
>>>>>>>>>>>>
>>>>>>>>>>>>   * Opaque information in the checkpoint data that user mode can't
>>>>>>>>>>>>     interpret or do anything with
>>>>>>>>>>>>   * A second API for creating objects (e.g. BOs) that is separate 
>>>>>>>>>>>> from
>>>>>>>>>>>>     the regular BO creation API
>>>>>>>>>>>>   * Kernel mode would need to be involved in restoring BO sharing
>>>>>>>>>>>>     relationships rather than replaying BO creation, export and 
>>>>>>>>>>>> import
>>>>>>>>>>>>     from user mode
>>>>>>>>>>>>
>>>>>>>>>>>> # Opaque information in the checkpoint
>>>>>>>>>>>>
>>>>>>>>>>>> This comes out of ABI compatibility considerations. Adding any new 
>>>>>>>>>>>> objects or attributes to the driver/HW state that needs to be 
>>>>>>>>>>>> checkpointed could potentially break the ABI of the CRIU 
>>>>>>>>>>>> checkpoint/restore ioctl if the plugin needs to parse that 
>>>>>>>>>>>> information. Therefore, much of the information in our KFD CRIU 
>>>>>>>>>>>> ioctl API is opaque. It is written by kernel mode in the 
>>>>>>>>>>>> checkpoint, it is consumed by kernel mode when restoring the 
>>>>>>>>>>>> checkpoint, but user mode doesn't care about the contents or 
>>>>>>>>>>>> binary layout, so there is no user mode ABI to break. This is how 
>>>>>>>>>>>> we were able to maintain CRIU support when we added the SVM API to 
>>>>>>>>>>>> KFD without changing the CRIU plugin and without breaking our ABI.
>>>>>>>>>>>>
>>>>>>>>>>>> Opaque information may also lend itself to API abstraction, if 
>>>>>>>>>>>> this becomes a generic DRM API with driver-specific callbacks that 
>>>>>>>>>>>> fill in HW-specific opaque data.
>>>>>>>>>>>
>>>>>>>>>>> This feels sound in principle to me. Fundamentally the state is 
>>>>>>>>>>> very hardware specific, and/or driver version specific, so I don't 
>>>>>>>>>>> see anything could be gained in practice by making it much less 
>>>>>>>>>>> opaque. (Apart from making things more complicated.)
>>>>>>>>>>>
>>>>>>>>>>> I was however unsure of the current split of how you dump buffer 
>>>>>>>>>>> objects with some data in the defined bo structure, and some in 
>>>>>>>>>>> completely opaque private data. Is there a benefit to that split, 
>>>>>>>>>>> or maybe in other words, is there a benefit in having part 
>>>>>>>>>>> transparent and part opaque for buffer objects?
>>>>>>>>>
>>>>>>>>> Some of the buffer object state is needed by the plugin. E.g. the 
>>>>>>>>> size and mmap offset are needed to match VMAs with BOs. I'd have to 
>>>>>>>>> review the plugin in detail to prove that all the fields are, in 
>>>>>>>>> fact, needed by the plugin, but that was the intent. Anything that 
>>>>>>>>> the plugin doesn't need to know should be in the opaque data 
>>>>>>>>> structures.
>>>>>>>>
>>>>>>>> Right, got it.
>>>>>>>>
>>>>>>>> Would it make sense to make the opaque data in the same block as the 
>>>>>>>> defined one? I mean instead of separating the two in the binary image 
>>>>>>>> for instance have struct kfd_criu_bo_bucket have a trailing priv_data 
>>>>>>>> blob? Maybe it is too late now if the image format is not versioned or 
>>>>>>>> something.
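(To visualize the split being discussed, a layout loosely modeled on kfd_criu_bo_bucket, but with the trailing opaque blob suggested here; the layout and field names are hypothetical, not the current image format:)

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-BO checkpoint entry: fields the plugin itself needs
 * (size, mmap offset, handle) are explicit, everything else is an opaque
 * blob that only the kernel interprets, so there is no uapi to break. */
struct drm_criu_bo_entry {
    uint64_t size;        /* needed by the plugin to match VMAs with BOs */
    uint64_t offset;      /* mmap offset, for the same matching */
    uint32_t gem_handle;  /* must be restored to the same value */
    uint32_t priv_size;   /* length of the opaque driver data below */
    uint8_t  priv_data[]; /* opaque to user mode */
};
```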
>>>>>>>>
>>>>>>>>>>> To slightly touch upon the question of whether this could become a 
>>>>>>>>>>> generic DRM API. It feels it would be hard to do it from the start. 
>>>>>>>>>>> What sounds more feasible is if/when generic looking helpers can be 
>>>>>>>>>>> spotted while developing the RFC then potentially structure the 
>>>>>>>>>>> code they can easily be promoted to shared/common at some future 
>>>>>>>>>>> moment.
>>>>>>>>>
>>>>>>>>> Yes, that's how this usually goes, in my experience. Thanks for 
>>>>>>>>> confirming.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> # Second API for creating objects
>>>>>>>>>>>>
>>>>>>>>>>>> Creating BOs and other objects when restoring a checkpoint needs 
>>>>>>>>>>>> more information than the usual BO alloc and similar APIs provide. 
>>>>>>>>>>>> For example, we need to restore BOs with the same GEM handles so 
>>>>>>>>>>>> that user mode can continue using those handles after resuming 
>>>>>>>>>>>> execution. If BOs are shared through DMABufs without dynamic 
>>>>>>>>>>>> attachment, we need to restore pinned BOs as pinned. Validation of 
>>>>>>>>>>>> virtual addresses and handling MMU notifiers must be suspended 
>>>>>>>>>>>> until the virtual address space is restored. For user mode queues 
>>>>>>>>>>>> we need to save and restore a lot of queue execution state so that 
>>>>>>>>>>>> execution can resume cleanly.
>>>>>>>>>>>
>>>>>>>>>>> This also sounds justified to me. Restoring all internal 
>>>>>>>>>>> objects is definitely special, and it sounds better to add uapi to 
>>>>>>>>>>> create them directly with the correct properties than to add uapi 
>>>>>>>>>>> to adjust internal properties after creation. In either case you 
>>>>>>>>>>> would need some new uapi - at least to adjust after creation - at 
>>>>>>>>>>> which point you may as well have both in one. Internally the 
>>>>>>>>>>> implementation can be split or common, whatever makes sense for a 
>>>>>>>>>>> given object type, but new uapi definitely sounds required.
>>>>>>>>>>>> # Restoring buffer sharing relationships
>>>>>>>>>>>>
>>>>>>>>>>>> Different GEM handles in different render nodes and processes can 
>>>>>>>>>>>> refer to the same underlying shared memory, either by directly 
>>>>>>>>>>>> pointing to the same GEM object, or by creating an import 
>>>>>>>>>>>> attachment that may get its SG tables invalidated and updated 
>>>>>>>>>>>> dynamically through dynamic attachment callbacks. In the latter 
>>>>>>>>>>>> case it's obvious who is the exporter and who is the importer. In 
>>>>>>>>>>>> the first case, either one could be the exporter, and it's not 
>>>>>>>>>>>> clear who would need to create the BO and who would need to 
>>>>>>>>>>>
>>>>>>>>>>> To see if I follow the former case correctly.
>>>>>>>>>>>
>>>>>>>>>>> This could be two clients A and B, where B has imported a dma-buf 
>>>>>>>>>>> shared BO from A and has since closed the dma-buf fd? Which results 
>>>>>>>>>>> in a single BO with reference count of 2 and obj->import_attach 
>>>>>>>>>>> unset. History of who created the object is lost.
>>>>>>>>>
>>>>>>>>> Yes. In the amdgpu driver this happens when the exporter and import 
>>>>>>>>> device are the same.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> In fact it could even be that two imported objects remain (clients 
>>>>>>>>>>> A, B and C) and A, who originally created the BO, has since fully 
>>>>>>>>>>> released it. So any kind of "creator" tracking if added wouldn't be 
>>>>>>>>>>> fully reliable either.
>>>>>>>>>
>>>>>>>>> That's a good point.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> import it when restoring the checkpoint. To further complicate 
>>>>>>>>>>>> things, multiple processes in a checkpoint get restored 
>>>>>>>>>>>> concurrently. So there is no guarantee that an exporter has 
>>>>>>>>>>>> restored a shared BO at the time an importer is trying to restore 
>>>>>>>>>>>> its import.
>>>>>>>>>>>>
>>>>>>>>>>>> A proposal to deal with these problems would be to treat importers 
>>>>>>>>>>>> and exporters the same. Whoever restores first, ends up creating 
>>>>>>>>>>>> the BO and potentially attaching to it. The other process(es) can 
>>>>>>>>>>>> find BOs that were already restored by another process by looking 
>>>>>>>>>>>> it up with a unique ID that could be based on the DMABuf inode 
>>>>>>>>>>>> number. An alternative would be a two-pass approach that needs to 
>>>>>>>>>>>> restore BOs on two passes:
>>>>>>>>>>>>
>>>>>>>>>>>>  1. Restore exported BOs
>>>>>>>>>>>>  2. Restore imports
>>>>>>>>>>>>
>>>>>>>>>>>> With some inter-process synchronization in CRIU itself between 
>>>>>>>>>>>> these two passes. This may require changes in the core CRIU, 
>>>>>>>>>>>> outside our plugin. Both approaches depend on identifying BOs with 
>>>>>>>>>>>> some unique ID that could be based on the DMABuf inode number in 
>>>>>>>>>>>> the checkpoint. However, we would need to identify the processes 
>>>>>>>>>>>> in the same restore session, possibly based on parent/child 
>>>>>>>>>>>> process relationships, to create a scope where those IDs are valid 
>>>>>>>>>>>> during restore.
>>>>>>>>>>>
>>>>>>>>>>> If my understanding above is on the right track, then I think this 
>>>>>>>>>>> is the only thing which can be done (for all scenarios).
>>>>>>>>>
>>>>>>>>> I presented two alternatives. I think you're in favor of the first 
>>>>>>>>> one, where it doesn't matter who is the importer and exporter. I 
>>>>>>>>> think the two-pass approach requires that you can identify an 
>>>>>>>>> exporter. And as you pointed out, the exporter may already have 
>>>>>>>>> dropped their reference to the BO.
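(A sketch of the first alternative: whoever restores first creates the BO and registers it under a unique id, for example one derived from the DMABuf inode number recorded in the checkpoint; later restorers just look it up. The in-memory table here is purely illustrative; in reality this state would have to live in the kernel or otherwise be shared across the processes of one restore session:)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct restored_bo {
    uint64_t unique_id; /* e.g. dmabuf inode number from the checkpoint */
    void *bo;           /* the already-restored driver object */
};

/* Returns the existing BO for 'id', or registers 'new_bo' and returns it.
 * The caller only actually creates the BO when it turns out to be first. */
static void *restore_shared_bo(struct restored_bo *table, size_t *n,
                               uint64_t id, void *new_bo)
{
    for (size_t i = 0; i < *n; i++)
        if (table[i].unique_id == id)
            return table[i].bo; /* someone restored it first: share it */

    table[*n].unique_id = id;   /* we are first: register our BO */
    table[*n].bo = new_bo;
    (*n)++;
    return new_bo;
}
```

The scheme is symmetric, so it does not matter whether the first restorer was originally the exporter or an importer.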
>>>>>>>>
>>>>>>>> Yep.
>>>>>>>>
>>>>>>>>>>> I also *think* it would be safe. At least at the moment I cannot 
>>>>>>>>>>> think what could go wrong. Semantics are that it doesn't really 
>>>>>>>>>>> matter who created the object.
>>>>>>>>>
>>>>>>>>> I would agree. What matters is that the object is recreated on the 
>>>>>>>>> correct device, and that all the direct references and import 
>>>>>>>>> attachments pointing to it are restored.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Finally, we would also need to checkpoint and restore DMABuf file 
>>>>>>>>>>>> descriptors themselves. These are anonymous file descriptors. The 
>>>>>>>>>>>> CRIU plugin could probably be taught to recreate them from the 
>>>>>>>>>>>> original exported BO based on the inode number that could be 
>>>>>>>>>>>> queried with fstat in the checkpoint. It would need help from the 
>>>>>>>>>>>> render node CRIU API to find the right BO from the inode, which 
>>>>>>>>>>>> may be from a different process in the same restore session.
>>>>>>>>>>>
>>>>>>>>>>> This part feels like it is breaking the component separation a bit 
>>>>>>>>>>> because even for buffers fully owned by amdgpu, strictly speaking 
>>>>>>>>>>> the dma-buf fd is not. At least my understanding from the above is 
>>>>>>>>>>> that you propose to attempt to import the fd, from the kernel side, 
>>>>>>>>>>> during the checkpoint process? Like:
>>>>>>>>>>>
>>>>>>>>>>> Checkpoint:
>>>>>>>>>>>
>>>>>>>>>>> CRIU for each anon fd:
>>>>>>>>>>>    amdgpu_plugin(fd)
>>>>>>>>>>>      -> attempt in kernel dma buf import (passes fd to kernel via 
>>>>>>>>>>> ioctl?)
>>>>>>>>>>>          -> is it ours? (no -> error)
>>>>>>>>>>>              -> create a record mapping fd number to amdgpu BO
>>>>>>>>>>>
>>>>>>>>>>> Restore:
>>>>>>>>>>>
>>>>>>>>>>> for each dma-buf fd record:
>>>>>>>>>>>     create BO if it does not exist
>>>>>>>>>>>     export BO to same fd
>>>>>>>>>>>     close BO handle if not in regular BO handle records
>>>>>>>>>>>
>>>>>>>>>>> Or since you mention lookup by inode, that would need to be 
>>>>>>>>>>> dmabuf_plugin so it can look up inodes in the private mount space. 
>>>>>>>>>>> However how would it co-operate with amdgpu_plugin is not clear to 
>>>>>>>>>>> me.
>>>>>>>>>
>>>>>>>>> The way I think about the ownership is, whichever driver created the 
>>>>>>>>> underlying BO owns the checkpointing of the dmabuf. You need 
>>>>>>>>> driver-specific information to link the dmabuf with the driver's BO 
>>>>>>>>> and you need the right driver to recreate the BO and the dmabuf fd 
>>>>>>>>> when restoring the checkpoint.
>>>>>>>>>
>>>>>>>>> It gets really interesting if you have an amdgpu plugin and an i915 
>>>>>>>>> plugin, and they checkpoint an application that shares BOs between 
>>>>>>>>> the two devices through DMABufs. E.g. if i915 created a BO and amdgpu 
>>>>>>>>> imported it, then during restore, i915 needs to restore the dmabuf 
>>>>>>>>> before the amdgpu import of it can be restored. I think that brings 
>>>>>>>>> us back to a two-phase approach to restoring the memory sharing 
>>>>>>>>> relationships. Uff.
>>>>>>>>
>>>>>>>> I think this part of the discussion somewhat depends on the previous 
>>>>>>>> part about idling. If it is feasible to completely idle and prune, and 
>>>>>>>> fail if that is not happening quickly enough, then maybe there 
>>>>>>>> wouldn't be too much hierarchical state to save.
>>>>>>>>
>>>>>>>> Otherwise my idea was that there is a top-level drm_plugin.so which 
>>>>>>>> understands amdgpu fds, i915, syncobj, sync_file, and uses some new 
>>>>>>>> uapi to uniquely identify each, associate with the correct driver, and 
>>>>>>>> then internally dispatches to amdgpu|i915|dmabuf|..._plugin.so. 
>>>>>>>> Building the in memory representation of their relationships. As long 
>>>>>>>> as all objects and their relationships have been recorded I think 
>>>>>>>> everything could then be correctly restored.
>>>>>>>>
>>>>>>>> It is possible there is a flaw in my thinking and that something in CRIU 
>>>>>>>> design would make this impossible? I think it would require the 
>>>>>>>> top-level drm_plugin.so to hold all state in memory until the whole 
>>>>>>>> checkpointing is done, and then verify something is not incomplete, 
>>>>>>>> failing it all if it was. (For instance one plugin discovered a 
>>>>>>>> reference to an object which was not discovered by any other plugin or 
>>>>>>>> things like that.) May need some further tweaks to CRIU common code.
>>>>>>>>
>>>>>>>> Maybe I need to better understand how exactly you mean to query the 
>>>>>>>> DRM driver about random anonymous fds. I see it as a problem in the 
>>>>>>>> design, possibly even implementation, but maybe I am missing something 
>>>>>>>> which makes it not so. I mean even with my general idea I don't know 
>>>>>>>> how one would determine which driver to query about a particular 
>>>>>>>> anonymous inode.
>>>>>>>>
>>>>>>>>>> I later also realised that I was maybe increasing the scope for you 
>>>>>>>>>> here. :) You did state focus is ROCm applications which possibly 
>>>>>>>>>> doesn't care about dma-resv, fences, syncobjs etc?
>>>>>>>>>
>>>>>>>>> That's my focus for now. But I don't want to engineer a solution that 
>>>>>>>>> would preclude your use cases in the future.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> But I think the "how to handle dma-bufs" design question is still 
>>>>>>>>>> relevant and interesting. For example I had this thought that 
>>>>>>>>>> perhaps what would be needed is a CRIU plugin hierarchy.
>>>>>>>>>>
>>>>>>>>>> Because fundamentally we would be snapshotting a hierarchy of kernel 
>>>>>>>>>> objects belonging to different drivers (kfd, amdgpu, dma-buf, ...). 
>>>>>>>>>> And if one day someone were to try to handle dma fences and drm 
>>>>>>>>>> syncobjs, the argument for a hierarchial design would be even 
>>>>>>>>>> stronger I think.
>>>>>>>>>>
>>>>>>>>>> Something like a drm_plugin.so could call sub-plugins (amdgpu, 
>>>>>>>>>> dma-buf, sync file, ...) and internally build the representation of 
>>>>>>>>>> the whole state and the relationships between the objects.
>>>>>>>>>
>>>>>>>>> Maybe. I guess a structure similar to libdrm makes sense. I'm not 
>>>>>>>>> sure it's strictly a hierarchy. Maybe more like some common code 
>>>>>>>>> shared by multiple GPU driver plugins. I think the common checkpoint 
>>>>>>>>> state is quite limited and restoring it requires the GPU-specific 
>>>>>>>>> drivers anyway.
>>>>>>>>>
>>>>>>>>> Also the idea of building a representation of the whole state doesn't 
>>>>>>>>> work well with the CRIU design, because "the whole state" can include 
>>>>>>>>> multiple processes that restore themselves concurrently and only 
>>>>>>>>> synchronize with each other in a few places in the restore process. I 
>>>>>>>>> feel, if we can work out how to checkpoint and restore shared objects 
>>>>>>>>> between processes, we can do the same for shared objects between 
>>>>>>>>> drivers without imposing a strict hierarchy and some omniscient 
>>>>>>>>> entity that needs to know "the whole state".
>>>>>>>>
>>>>>>>> Okay, this continues on the same problem space as above. And you 
>>>>>>>> obviously know how CRIU works much better than me.
>>>>>>>>
>>>>>>>>>> With that kind of design there probably would be a need to define 
>>>>>>>>>> some common kernel side api and uapi, so all involved objects can be 
>>>>>>>>>> enumerated with some unique ids etc.
>>>>>>>>>>
>>>>>>>>>> Now.. the counter argument.. the more state from different drivers 
>>>>>>>>>> would one want to handle the bigger this project would get. Would it 
>>>>>>>>>> even be feasible is the question, to the point that it may be 
>>>>>>>>>> simpler to just run the workload in a VM via SR-IOV and simply 
>>>>>>>>>> hibernate the whole guest. :)
>>>>>>>>>
>>>>>>>>> Well, CRIU kind of tries to do that, but with containers instead of 
>>>>>>>>> VMs. ;)
>>>>>>>>
>>>>>>>> It would definitely be useful for hardware and drivers without SR-IOV 
>>>>>>>> support so let's hope it is doable. :)
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Tvrtko
