On Mon, Jun 08, 2026 at 12:59:55PM -0500, Michael Roth wrote:
> On Fri, Jun 05, 2026 at 10:57:34AM -0400, Peter Xu wrote:
> > On Thu, Jun 04, 2026 at 05:36:42PM -0500, Michael Roth wrote:
> > > > IIUC it's a matter of if we expect future property of guest-memfd
> > > > that will stop applying to memfd anymore?
> > >
> > > Yah, I think that's the main thing to consider. There's a few things
> > > in the pipeline where the options associated with guest_memfd might
> > > diverge quite a bit from memfd:
> >
> > Thanks for all these contexts.  I'll throw some random questions below,
> > some of them may not be directly related to the current discussion, but
> > please bear with me.
> >
> > >
> > > - hugetlb: yes, these could potentially use the same options memfd
> > >   uses, and I'm guessing that will end up being the case, but one
> > >   large gap there is that shared memory is always split to 4K, which
> > >   we've accepted for now, but if you consider use-cases like DPDK
> > >   there can still be major performance bottlenecks that would drive
> > >   us to try to enable larger mappings for the shared ranges, and then
> > >   we'd end up with guest-memfd-specific parameters intermixed with
> > >   normal memfd options, and our related documentation would need to
> > >   cover these differences case by case
> >
> > The first thing I thought about is mTHP and how it can also be similarly
> > applied to normal memfd (now, or in the future, that I'm not sure).
>
> That might work, though for some architectures the shared pages might
> benefit from a wider range of hugepage granularities than private.
> For instance with SNP it might make sense to expose 1GB hugepages but then
> limit them to 2MB granularity for private (since private pages get 2MB
> TLB entries anyway so rebuilding from 2MB->1GB is a waste)
>
> https://github.com/AMDESE/amdsev#prepare-host
>
> But IIUC that sort of support will likely depend on some mm/ changes
> around refcounting that probably won't be completed any time soon so it's
> hard to anticipate what that'll end up looking like.
>
> >
> > Before that.. shouldn't the whole concept of private mem / gmem about
> > reducing the area of mapping the host (including dpdk, if we're talking
> > about things like OpenVswitch)?  Can you roughly describe how huge mapping
> > is expected to be allowed in such case?  Does it mean the guest driver
> > should also be aware to allocate huge continuous physical mem for DMA only?
>
> Guest drivers would generally end up either going through SWIOTLB to get
> access to shared memory, or convert their pages to shared directly prior
> to use. Optimizations for hugepages would work roughly the same as any
> optimizations that have already been done for the non-confidential case.
> Shared guest allocations/conversions at granularities smaller than the
> backing page/hugepage size are where confidential VMs start to pay an
> extra tax in guest_memfd and potentially with the security architecture
> (e.g. extra RMP table checks for SNP, not just the TLB/TLB misses).
>
> It seems to be in everyone's best interest to have a common shared
> memory pool that gets initialized/converted/replenished at hugepage-sized
> granularities. Here's one patchset[1] that takes the natural choice of
> doing this through SWIOTLB. Not sure that's ultimately what it will look
> like but I think it's safe to expect some level of optimization.
[sorry for a late response]

OK, yes this sounds reasonable in general.

>
> So with all these pieces in play, what I would expect is that DPDK
> applications could access these shared buffers using hugepages both in
> the guest and host-side as they do today, since the whole hugepage/folio

So even if physical continuous chunks (each 2M in size) are preserved, it's
still part of gmem page cache (based on either 2M or larger huge pages).
Then I assume if we want to have OVS-DPDK be able to hugely map it, we'll
need to support in gmem's fault handler when seeing some shared huge 2M and
do the job right.

There seem to be some details missing (e.g. I wonder if it's still safer
to only map 4K by default..), but it sounds workable indeed.

> will be homogenous with the above guest optimizations... but that still
> requires the above-mentioned mm/ rework for refcounting to allow hugepages
> for shared ranges so this is another thing that we probably won't need to
> deal with any time soon, but I think that's roughly what we could expect
> it to look like eventually.
>
> [1] https://lkml.org/lkml/2024/1/12/65
>
> >
> > > - DAX-like stuff: there are some proposals for making device memory
> > >   available to use as private guest memory, and since 'guest-memfd'
> > >   is generally responsible for managing private memory, it will
> > >   likely end up being extended to handle this at some point. One
> > >   proposal/PoC[1] would involve at least needing additional options
> > >   for the /dev/dax path, but there have also been discussions about
> > >   having a general notion of custom allocators that can be plugged
> > >   into guest_memfd, and some of these might have overlapping options
> > >   WRT things like hugepages/etc. But at a high-level, DAX would map
> > >   more to memory-backend-file than memory-backend-memfd, so we'd
> > >   already be crossing up some wires there.
> >
> > I have no deep understanding on this, but IIUC we used to stick with
> > memory-backend-file for dax.
> > Why switch to memory-backend-guest-memfd?
> > Are we still exposing a dax via a file path ultimately, even with CoCo?
>
> I touched on this a bit below, but I don't necessarily think
> memory-backend-guest-memfd should handle DAX, it's just one example
> where we clearly need to think beyond 'memfd', but are still potentially
> in the realm of 'guest_memfd' depending on what the API ends up looking
> like.
>
> But I agree with your below point that we don't need the backend to
> match up with implementation details of how guest_memfd works
> internally, and that the core point that memory-backend-file might still
> end up seeming like the most appropriate way for a QEMU user to specify
> a DAX path, even if internally it's still using guest_memfd.
>
> Though going that route, we'd still have a
> memory-backend-file,...,guest_memfd=on that brings 'memfd's' back into
> the discussion. We could take it a step further and rename the
> 'guest_memfd' backend option to 'securable', but maybe this ends up
> being the right level of balance between 'i need to open a file that
> can do guest_memfd-related stuff', and 'i need to create a guest_memfd
> instance that can handle a DAX path'.

What I have in mind is, when used as CoCo where we will need to apply the
machine property first saying "this is a CoCo VM", we shouldn't need to
specify "guest-memfd=on" all over the place anymore.  QEMU should be able
to identify it's required to initialize everything to use guest-memfd, or
fail when it's not supported.

I introduced the "guest-memfd" parameter in this series because this series
still hasn't gone as far: unlike CoCo+in-place-conversion or the old CoCo
with two-layered backends, init-shared is an opt-in feature, and there's no
machine property to imply anything.  It's different from CoCo VMs.

I wonder if you think that's a good idea.
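
To make the above a bit more concrete, here is a minimal Python sketch of
the decision being proposed.  It is not QEMU code, and every name in it
(MachineConfig, BackendConfig, backend_uses_guest_memfd, the is_coco and
guest_memfd fields) is hypothetical and only there for illustration.  It
assumes that a machine-level "this is a CoCo VM" setting implies guest-memfd
for every backend, that the per-backend option remains only as the opt-in
for init-shared, and that setup fails when the host lacks support.

```python
from dataclasses import dataclass


class GuestMemfdUnsupported(Exception):
    """Raised when guest-memfd is required but the host cannot provide it."""


@dataclass
class MachineConfig:
    # Hypothetical stand-in for the machine property saying
    # "this is a CoCo VM".
    is_coco: bool = False


@dataclass
class BackendConfig:
    # Hypothetical stand-in for the per-backend "guest-memfd" option,
    # only needed for the opt-in init-shared (non-CoCo) case.
    guest_memfd: bool = False


def backend_uses_guest_memfd(machine: MachineConfig,
                             backend: BackendConfig,
                             host_supports_gmem: bool) -> bool:
    """Decide whether a memory backend must be backed by guest-memfd."""
    # A CoCo machine implies guest-memfd for every backend; otherwise the
    # backend has to opt in explicitly.
    wanted = machine.is_coco or backend.guest_memfd
    if wanted and not host_supports_gmem:
        # Fail instead of silently falling back to a plain memfd.
        raise GuestMemfdUnsupported("guest-memfd required but not supported")
    return wanted
```

With something like this, the same "-object memory-backend-memfd,id=..."
would resolve to guest-memfd on a CoCo machine and to a plain memfd
otherwise, with no extra per-backend option on the command line.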

So I found it redundant to specify guest-memfd=on for CoCo VMs for all such
backends (unless we decide to go with memory-backend-guest-memfd, then
it's fine).

Say, after in-place conversion is ready, we should specify this:

  $QEMU -M $SOME_COCO_PARAMS,in-place-conversion=on -object memory-backend-memfd,id=*
  (without guest-memfd=on)

Then that memory-backend-memfd should automatically notice "ok this is a
CoCo VM with in-place conversion enabled, we don't do dual-layer, this must
be a gmemfd".

>
> My thinking was that since the hugepage PoC already implements the notion
> of custom allocators in the uAPI, and that there's been talk of 'pluggable'
> backends for guest_memfd, that the kernel would also need to do a reasonable
> job in creating a consistent uAPI/documentation, such that the hugepage/DAX
> cases would end up looking something like:
>
>   memory-backend-guest-memfd,allocator=hugetlb,pagesize=2M,...
>   memory-backend-guest-memfd,allocator=dax,path=/dev/daxX,pagesize=2M,...

So far I'm still a bit conservative about introducing a path= parameter to
a memfd backend.. but we can hold the discussion until the idea becomes
clearer.

>
> which is firmly on the 'i need to create a guest_memfd instance that's
> backed by a DAX path' end of the spectrum, compared to the more abstracted
> approach you're suggesting, and so for the most part we'd be passing
> through the kernel options/documentation to users vs. abstracting it and
> then touching on it case-by-case in 'memfd'/'file'/etc. documentation.

If you agree with my above idea, we don't need to document anything special
for CoCo.  We support all things internally, no doc change needed except
telling people how to enable CoCo.

Say, if we have a command line that works for QEMU without CoCo:

  $QEMU -M ... -object memory-backend-memfd,id=...

Then my imagination is if we want to boot that same VM but with CoCo, the
cmdline should be:

  $QEMU $COCO_PARAMS1 -M ...,$COCO_PARAMS2 -object memory-backend-memfd,id=...
  (again, no guest-memfd=on anywhere)

I wish it will just work all like before, except that it now boots a
confidential VM, so safety guaranteed.

I agree that's the ideal case, and I may overlook things..  Please correct
me if I missed some.

>
> Personally, I'm not sure at this point which approach will end up being
> the more workable one. But it is harder/more-confusing to start with
> memory-backend-guest-memfd, and then go back to e.g.
> memory-backend-file,guest_memfd=on later for future extensions. So I'm
> starting to lean toward doing the minimal
> '<existing_backend>,guest_memfd=on' thing for now, and then just
> deprecating it if we really feel like we need a more direct interface
> like memory-backend-guest-memfd down the road.
>
> Does that seem reasonable for a starting point? I feel like we'll be
> better positioned to make a better long-term decision once some of these
> patchsets are further along.

The more I read into it, the more I feel like we should settle this
early..  It might be a good time to discuss this.

Again, I don't actually have any strong opinion right now, say, duplicating
the current memfd backend file isn't a huge deal.  Now, I'm slightly more
uncertain about the future, and whether that's the right start to future's
QEMU cmdline interfacing with CoCo.

>
> >
> > Note, here I want to differentiate two concepts: QEMU interfacing and
> > kernel/KVM interfacing.  I mean, I have a gut feeling that for coco dax we
> > could still stick with memory-backend-file, even if internally we can still
> > use new KVM ioctls to set them up: there's no rule to say only
> > memory-backend-guest-memfd can use the KVM ioctl.  IMHO they're different
> > stories, and here I'm focused more on the QEMU interfacing that we're
> > discussing here.
> >
> > IMHO for QEMU's interfacing, any memory-backend should play one solo role
> > which is to point to QEMU (as a hypervisor) a backing store for some piece
> > of resource that can be used as guest memory backend.
> > It doesn't need to
> > have any implication on how we implement that backend internally.
> >
> > > - live update: there's work[2] on enabling preservation of confidential
> > >   guest memory across kexec by preserving it through guest_memfd. This
> > >   one is still a bit mind-blowing to me but I could see us needing
> > >   some additional options here that would really make no sense for
> > >   memfd.
> >
> > Could you elaborate what kind of parameter you would expect?
>
> I was thinking stuff like the metadata that would be needed to rebuild a
> KVM instance with the same GPA->HPA mappings to the pages previously
> allocated by guest_memfd. It makes sense that each backend has its own
> associated metadata so that each can be restored in-turn, but yes there
> would also need to be some common state like KVM itself that needs to be
> serialized, and this would probably have separate options. So in theory it
> wouldn't need to be tied to the backend, but IMO it feels very natural
> to imagine the options looking something like that.

So live update should ideally work both for CoCo and non-CoCo, and work for
all memory-backends, am I right?

I'm not sure if it means gmem is not special in this case, because they
should all be reserved on save(), and recovered on load().

If you put everything above into this picture, live update (or CPR) only
means: let's do a migration with this VM, no matter if it's CoCo or not.
It hopefully shouldn't care about some parameter the memfd (gmem or not)
has.

>
> >
> > I'm not sure if you have investigated QEMU's CPR approach, now memfd
> > backend is really the core of supporting such infrastructure, where fds can
> > be persisted.  For live update, it'll be persisted across kexec and kernel
> > switchover.  For CPR, it actually also works when with cpr-reboot with its
> > own tricky way to persist memory.
> >
> > In general, what I want to say is, I really think they should play the same
> > in terms of live update case too: if we need to register some fd for
> > persistency, we need to register gmem, kvm, but also memfd if some of them
> > are attached to the current VM, right?
>
> I definitely need to look into this more (and intra-host live update for
> guest_memfd/in-place conversion in general), but for guest memory
> persistence it seems like we'd generally be relying on
> memory-backend-file=<path> as a target/src for serializing guest memory
> to persistent storage for normal/'memfd' case.
>
> But for confidential VMs we don't just need the data for a particular GPA,
> but the original HPA and maybe details like the associated shared/private
> memory attributes, which is why I'm thinking we might need something like a
> separate path argument for that, or maybe QEMU abstracting this out into its
> own user-configurable format.
>
> >
> > > - directmap removal: these[3] patches allow a new guest_memfd flag to
> > >   be set to unmap guest_memfd pages from kernel directmap to help
> > >   mitigate speculative attacks, probably would involve a new option
> > >   as well that wouldn't be applicable to normal memfds
> >
> > Now the question is, do we want to remove directmap for "some" memory
> > backend, or do we want to remove it per-VM?
> >
> > This is another thing I want to make sure we're on the same page: I want to
> > make sure we don't introduce per-VM setup for memory backends.
> >
> > Say, "init-shared" or "in-place CoCo", what should we use for one gmem fd?
> > IMHO it shouldn't be a parameter in the memory-backend.  It should be a
> > parameter for the -machine or some similar per-vm setup, which will apply
> > to all gmemfd across the current VM.
> >
> > My understanding is directmap removal is similar in this case, which seems
> > to be a per-VM (rather than per-memory-backend) attribute?
> > We can still
> > operate on that per-memory-backend, but then it'll be internally, the
> > backends need to understand the VM setup and do things properly, IMHO.
>
> I think all-or-nothing would be most common, but it's completely
> controlled at the guest_memfd inode level so it would support that sort
> of flexibility if needed. One side effect is that setting it currently
> sets AS_NO_DIRECT_MAP which can have some performance downsides...
> maaaaaybe that's enough for someone to want to fine-tune 'isolated' vs.
> 'non-isolated' GPA ranges?

Yes, it sounds possible, but it sounds non-trivial to do any fine tuning..
and maybe the admin will just go take or not take the penalty. :)

I'll see how you think about my above comments in general, the summary is
whether we can rely on a machine prop to define the VM completely on all
internal things, rather than needing to touch everything.  If that's a good
idea, then maybe we don't need a new backend for the same reason, because
memfd backend "must" be guest-memfd backend for CoCo.

This also made me think about when CPR was introduced: not relevant to the
"live update" perspective, but similarly we need something special all over
the place, and what CPR does was exactly requiring one global -M parameter
with:

  -machine aux-ram-share=on

That enables all internal RAMs allocated (like, ROMs) to be fully fd-based.
All the rest of the QEMU cmdline stays untouched to support CPR.

CPR hid quite some tricky checks all over the place.  Sometimes I don't
like it, but I think what I don't like is not the interfacing part, only
when it hides things too deep in stack..  Logically the idea still makes
sense and it's similar here on one knob controlling everything.

https://www.qemu.org/docs/master/devel/migration/CPR.html

Thanks again for all the details!

>
> So I think it's pretty safe to say we don't *need* to expose this
> functionality per-backend/inode initially, and if we end up preferring a
> global option then that's probably fine too.
> So we can probably set this
> example aside for now.
>
> >
> > >
> > > It could also end up that even memory-backend-guest-memfd is too
> > > generic, and that some of these would involve a more specialized memory
> > > backend where may they can share a common base class for some of the
> > > core guest_memfd stuff but otherwise be separate backends with their
> > > own specific options. So to me, starting off building up
> > > memory-backend-memfd seems like a potential misstep, whereas we don't
> > > really lose much to start with a clean slate.
> > >
> > > [1] DAX:
> > >     https://lwn.net/ml/all/[email protected]/
> > > [2] LUO:
> > >     https://lore.kernel.org/all/[email protected]/#r
> > > [3] directmap removal:
> > >     https://lore.kernel.org/kvm/[email protected]/
> > >
> > > >
> > > > >
> > > > > I also saw you were open to having someone pick up these patches if
> > > > > you don't think you'll have a chance to get to them near-term, so
> > > > > I'd be happy to pick them up if that's preferable.
> > > >
> > > > Sure!  Indeed I don't have bandwidth to keep working on this one in the
> > > > near future.  Please feel free to pick whatever needed into your series.
> > >
> > > Ok, sounds good, I'll pick these up for my next posting and incorporate
> > > any changes/comments that might still be pending at that time.
> > >
> > > Thanks for getting things to this stage!
> >
> > Thanks for picking it up!  Juraj in our team may have some future
> > exploration on gmem over 1G for postcopy on init-shared, so it's great the
> > code is moving closer to that direction.
>
> Nice, lots of interesting work ahead it seems :)
>
> Thanks,
>
> Mike
>
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >
>

--
Peter Xu
