On Fri, Jun 05, 2026 at 10:57:34AM -0400, Peter Xu wrote:
> On Thu, Jun 04, 2026 at 05:36:42PM -0500, Michael Roth wrote:
> > > IIUC it's a matter of if we expect future property of guest-memfd that
> > > will stop applying to memfd anymore?
> >
> > Yah, I think that's the main thing to consider. There's a few things in the
> > pipeline where the options associated with guest_memfd might diverge
> > quite a bit from memfd:
>
> Thanks for all these contexts.  I'll throw some random questions below,
> some of them may not be directly related to the current discussion, but
> please bear with me.
>
> >
> > - hugetlb: yes, these could potentially use the same options memfd
> >   uses, and I'm guessing that will end up being the case, but one
> >   large gap there is that shared memory is always split to 4K, which
> >   we've accepted for now, but if you consider use-cases like DPDK
> >   there can still be major performance bottlenecks that would drive
> >   us to try to enable larger mappings for the shared ranges, and then
> >   we'd end up with guest-memfd-specific parameters intermixed with
> >   normal memfd options, and our related documentation would need to
> >   cover these differences case by case
>
> The first thing I thought about is mTHP and how it can also be similarly
> applied to normal memfd (now, or in the future, that I'm not sure).

That might work, though for some architectures the shared pages might
benefit from a wider range of hugepage granularities than private. For
instance with SNP it might make sense to expose 1GB hugepages but then
limit them to 2MB granularity for private (since private pages get 2MB
TLB entries anyway, so rebuilding from 2MB->1GB is a waste).

https://github.com/AMDESE/amdsev#prepare-host

But IIUC that sort of support will likely depend on some mm/ changes
around refcounting that probably won't be completed any time soon, so
it's hard to anticipate what that'll end up looking like.

>
> Before that.. shouldn't the whole concept of private mem / gmem about
> reducing the area of mapping the host (including dpdk, if we're talking
> about things like OpenVswitch)?  Can you roughly describe how huge mapping
> is expected to be allowed in such case?  Does it mean the guest driver
> should also be aware to allocate huge continuous physical mem for DMA only?

Guest drivers would generally end up either going through SWIOTLB to get
access to shared memory, or converting their pages to shared directly
prior to use. Optimizations for hugepages would work roughly the same as
any optimizations that have already been done for the non-confidential
case.

Shared guest allocations/conversions at granularities smaller than the
backing page/hugepage size are where confidential VMs start to pay an
extra tax, both in guest_memfd and potentially in the security
architecture (e.g. extra RMP table checks for SNP, not just the TLB/TLB
misses). It seems to be in everyone's best interest to have a common
shared memory pool that gets initialized/converted/replenished at
hugepage-sized granularities.

Here's one patchset[1] that takes the natural choice of doing this
through SWIOTLB. Not sure that's ultimately what it will look like, but I
think it's safe to expect some level of optimization.
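
To make the granularity point a bit more concrete, here is a rough
back-of-the-envelope sketch in Python. It is only an illustration: the
function name and the "mixed hugepage" model are my own assumptions and
not taken from any of the patchsets. It treats any 2MB region holding
both shared and private 4K pages as one that cannot be handled as a
single hugepage, and it ignores RMP/TLB details entirely.

```python
PAGE_4K = 4096
HUGE_2M = 2 * 1024 * 1024
PAGES_PER_HUGE = HUGE_2M // PAGE_4K  # 512 4K pages per 2MB hugepage


def classify_hugepages(shared_gfns, total_hugepages):
    """Return (fully_private, fully_shared, mixed) 2MB-region counts.

    shared_gfns: iterable of 4K guest frame numbers converted to shared.
    A region is 'mixed' when only some of its 512 pages are shared.
    """
    shared_per_huge = {}
    for gfn in set(shared_gfns):
        idx = gfn // PAGES_PER_HUGE
        shared_per_huge[idx] = shared_per_huge.get(idx, 0) + 1

    fully_shared = sum(1 for n in shared_per_huge.values()
                       if n == PAGES_PER_HUGE)
    mixed = len(shared_per_huge) - fully_shared
    fully_private = total_hugepages - len(shared_per_huge)
    return fully_private, fully_shared, mixed


# Hypothetical guest with 16 hugepages (32MB) of memory.
# Case 1: eight scattered 4K conversions, one per hugepage.
scattered = [i * PAGES_PER_HUGE for i in range(8)]
# Case 2: a pooled approach that converts two whole hugepages.
pooled = list(range(2 * PAGES_PER_HUGE))

print(classify_hugepages(scattered, 16))
print(classify_hugepages(pooled, 16))
```

In this toy model the scattered case leaves half of the hugepages mixed
even though only 32KB was shared, while the pooled case shares 4MB and
leaves every hugepage homogeneous.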

So with all these pieces in play, what I would expect is that DPDK
applications could access these shared buffers using hugepages both in
the guest and host-side as they do today, since the whole hugepage/folio
will be homogeneous with the above guest optimizations... but that still
requires the above-mentioned mm/ rework for refcounting to allow
hugepages for shared ranges, so this is another thing that we probably
won't need to deal with any time soon, but I think that's roughly what we
could expect it to look like eventually.

[1] https://lkml.org/lkml/2024/1/12/65

>
> > - DAX-like stuff: there are some proposals for making device memory
> >   available to use as private guest memory, and since 'guest-memfd'
> >   is generally responsible for managing private memory, it will
> >   likely end up being extended to handle this at some point. One
> >   proposal/PoC[1] would involve at least needing additional options
> >   for the /dev/dax path, but there have also been discussions about
> >   having a general notion of custom allocators that can be plugged
> >   into guest_memfd, and some of these might have overlapping options
> >   WRT things like hugepages/etc. But at a high-level, DAX would map
> >   more to memory-backend-file than memory-backend-memfd, so we'd
> >   already be crossing up some wires there.
>
> I have no deep understanding on this, but IIUC we used to stick with
> memory-backend-file for dax.  Why switch to memory-backend-guest-memfd?
> Are we still exposing a dax via a file path ultimately, even with CoCo?

I touched on this a bit below, but I don't necessarily think
memory-backend-guest-memfd should handle DAX, it's just one example where
we clearly need to think beyond 'memfd', but are still potentially in the
realm of 'guest_memfd' depending on what the API ends up looking like.

But I agree with your below point that we don't need the backend to match
up with implementation details of how guest_memfd works internally, and
with the core point that memory-backend-file might still end up seeming
like the most appropriate way for a QEMU user to specify a DAX path, even
if internally it's still using guest_memfd.

Though going that route, we'd still have a
memory-backend-file,...,guest_memfd=on that brings 'memfd's' back into
the discussion. We could take it a step further and rename the
'guest_memfd' backend option to 'securable', but maybe this ends up
being the right level of balance between 'i need to open a file that can
do guest_memfd-related stuff', and 'i need to create a guest_memfd
instance that can handle a DAX path'.

My thinking was that since the hugepage PoC already implements the notion
of custom allocators in the uAPI, and that there's been talk of
'pluggable' backends for guest_memfd, the kernel would also need to do a
reasonable job in creating a consistent uAPI/documentation, such that the
hugepage/DAX cases would end up looking something like:

  memory-backend-guest-memfd,allocator=hugetlb,pagesize=2M,...
  memory-backend-guest-memfd,allocator=dax,path=/dev/daxX,pagesize=2M,...

which is firmly on the 'i need to create a guest_memfd instance that's
backed by a DAX path' end of the spectrum, compared to the more
abstracted approach you're suggesting, and so for the most part we'd be
passing through the kernel options/documentation to users vs. abstracting
it and then touching on it case-by-case in 'memfd'/'file'/etc.
documentation.

Personally, I'm not sure at this point which approach will end up being
the more workable one. But it is harder/more-confusing to start with
memory-backend-guest-memfd, and then go back to e.g.
memory-backend-file,guest_memfd=on later for future extensions.
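
As a side-by-side of the two styles, here is a small Python sketch that
just splits the option strings into a backend type plus properties. To be
clear about assumptions: 'allocator', 'path', 'pagesize' (on a
guest-memfd backend) and 'guest_memfd=on' are only the hypothetical
options floated in this thread, not existing QEMU properties, and the
parser is a simplified stand-in that does not handle QEMU's ',,' comma
escaping.

```python
def parse_object_opts(spec):
    """Split 'type,key=val,...' into (type, {key: val}).

    Simplified illustration only; not QEMU's real option parser
    (no ',,' escaping, no implied keys beyond the leading type).
    """
    backend, *pairs = spec.split(",")
    props = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        props[key] = value
    return backend, props


# Style A (hypothetical): a dedicated backend that passes the kernel's
# guest_memfd allocator options more or less straight through.
direct = parse_object_opts(
    "memory-backend-guest-memfd,id=mem0,size=4G,"
    "allocator=dax,path=/dev/daxX,pagesize=2M"
)

# Style B (hypothetical): an existing backend plus a guest_memfd toggle,
# with the DAX path given the way memory-backend-file users expect.
layered = parse_object_opts(
    "memory-backend-file,id=mem0,size=4G,"
    "mem-path=/dev/daxX,guest_memfd=on"
)

for backend, props in (direct, layered):
    print(backend, sorted(props))
```

In style A the user has to know about guest_memfd-specific knobs, whereas
in style B the only new thing relative to today's memory-backend-file is
the one toggle.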

So I'm starting to lean toward doing the minimal
'<existing_backend>,guest_memfd=on' thing for now, and then just
deprecating it if we really feel like we need a more direct interface
like memory-backend-guest-memfd down the road.

Does that seem reasonable for a starting point? I feel like we'll be
better positioned to make a better long-term decision once some of these
patchsets are further along.

>
> Note, here I want to differentiate two concepts: QEMU interfacing and
> kernel/KVM interfacing.  I mean, I have a gut feeling that for coco dax we
> could still stick with memory-backend-file, even if internally we can still
> use new KVM ioctls to set them up: there's no rule to say only
> memory-backend-guest-memfd can use the KVM ioctl.  IMHO they're different
> stories, and here I'm focused more on the QEMU interfacing that we're
> discussing here.
>
> IMHO for QEMU's interfacing, any memory-backend should play one solo role
> which is to point to QEMU (as a hypervisor) a backing store for some piece
> of resource that can be used as guest memory backend.  It doesn't need to
> have any implication on how we implement that backend internally.
>
> > - live update: there's work[2] on enabling preservation of confidential
> >   guest memory across kexec by preserving it through guest_memfd. This
> >   one is still a bit mind-blowing to me but I could see us needing
> >   some additional options here that would really make no sense for
> >   memfd.
>
> Could you elaborate what kind of parameter you would expect?

I was thinking stuff like the metadata that would be needed to rebuild a
KVM instance with the same GPA->HPA mappings to the pages previously
allocated by guest_memfd. It makes sense that each backend has its own
associated metadata so that each can be restored in turn, but yes there
would also need to be some common state like KVM itself that needs to be
serialized, and this would probably have separate options.
So in theory it wouldn't need to be tied to the backend, but IMO it feels
very natural to imagine the options looking something like that.

>
> I'm not sure if you have investigated QEMU's CPR approach, now memfd
> backend is really the core of supporting such infrastructure, where fds can
> be persisted.  For live update, it'll be persisted across kexec and kernel
> switchover.  For CPR, it actually also works when with cpr-reboot with its
> own tricky way to persist memory.
>
> In general, what I want to say is, I really think they should play the same
> in term of live update case too: if we need to register some fd for
> persistency, we need to register gmem, kvm, but also memfd if some of them
> are attached to the current VM, right?

I definitely need to look into this more (and intra-host live update for
guest_memfd/in-place conversion in general), but for guest memory
persistence it seems like we'd generally be relying on
memory-backend-file=<path> as a target/src for serializing guest memory
to persistent storage for the normal/'memfd' case. But for confidential
VMs we don't just need the data for a particular GPA, but also the
original HPA and maybe details like the associated shared/private memory
attributes, which is why I'm thinking we might need something like a
separate path argument for that, or maybe QEMU abstracting this out into
its own user-configurable format.

>
> > - directmap removal: these[3] patches allow a new guest_memfd flag to
> >   be set to unmap guest_memfd pages from kernel directmap to help
> >   mitigate speculative attacks, probably would involve a new option
> >   as well that wouldn't be applicable to normal memfds
>
> Now the question is, do we want to remove directmap for "some" memory
> backend, or do we want to remove it per-VM?
>
> This is another thing I want to make sure we're on the same page: I want to
> make sure we don't introduce per-VM setup for memory backends.
>
> Say, "init-shared" or "in-place CoCo", what should we use for one gmem fd?
> IMHO it shouldn't be a parameter in the memory-backend.  It should be a
> parameter for the -machine or some similar per-vm setup, which will apply
> to all gmemfd across the current VM.
>
> My understanding is directmap removal is similar in this case, which seems
> to be a per-VM (rather than per-memory-backend) attribute?  We can still
> operate on that per-memory-backend, but then it'll be internally, the
> backends need to understand the VM setup and do things properly, IMHO.

I think all-or-nothing would be most common, but it's completely
controlled at the guest_memfd inode level so it would support that sort
of flexibility if needed. One side effect is that setting it currently
sets AS_NO_DIRECT_MAP, which can have some performance downsides...
maaaaaybe that's enough for someone to want to fine-tune 'isolated' vs.
'non-isolated' GPA ranges?

So I think it's pretty safe to say we don't *need* to expose this
functionality per-backend/inode initially, and if we end up preferring a
global option then that's probably fine too. So we can probably set this
example aside for now.

>
> >
> > It could also end up that even memory-backend-guest-memfd is too
> > generic, and that some of these would involve a more specialized memory
> > backend where maybe they can share a common base class for some of the
> > core guest_memfd stuff but otherwise be separate backends with their
> > own specific options. So to me, starting off building up
> > memory-backend-memfd seems like a potential misstep, whereas we don't
> > really lose much to start with a clean slate.
> >
> > [1] DAX:
> >     https://lwn.net/ml/all/[email protected]/
> > [2] LUO:
> >     https://lore.kernel.org/all/[email protected]/#r
> > [3] directmap removal:
> >     https://lore.kernel.org/kvm/[email protected]/
> >
> > > > I also saw you were open to having someone pick up these patches if you
> > > > don't think you'll have a chance to get to them near-term, so I'd be
> > > > happy to pick them up if that's preferable.
> > >
> > > Sure!  Indeed I don't have bandwidth to keep working on this one in the
> > > near future.  Please feel free to pick whatever needed into your series.
> >
> > Ok, sounds good, I'll pick these up for my next posting and incorporate
> > any changes/comments that might still be pending at that time.
> >
> > Thanks for getting things to this stage!
>
> Thanks for picking it up!  Juraj in our team may have some future
> exploration on gmem over 1G for postcopy on init-shared, so it's great the
> code is moving closer to that direction.

Nice, lots of interesting work ahead it seems :)

Thanks,

Mike

>
> Thanks,
>
> --
> Peter Xu
>
