On Mon, May 13, 2024, James Gowans wrote:
> On Mon, 2024-05-13 at 10:09 -0700, Sean Christopherson wrote:
> > On Mon, May 13, 2024, James Gowans wrote:
> > > On Mon, 2024-05-13 at 08:39 -0700, Sean Christopherson wrote:
> > > > > Sean, you mentioned that you envision guest_memfd also supporting
> > > > > non-CoCo VMs.  Do you have some thoughts about how to make the
> > > > > above cases work in the guest_memfd context?
> > > >
> > > > Yes.  The hand-wavy plan is to allow selectively mmap()ing
> > > > guest_memfd().  There is a long thread[*] discussing how exactly we
> > > > want to do that.  The TL;DR is that the basic functionality is also
> > > > straightforward; the bulk of the discussion is around gup(),
> > > > reclaim, page migration, etc.
> > >
> > > I still need to read this long thread, but just a thought on the word
> > > "restricted" here: for MMIO the instruction can be anywhere and
> > > similarly the load/store MMIO data can be anywhere.  Does this mean
> > > that for running unmodified non-CoCo VMs with a guest_memfd backend
> > > we'll always need to have the whole of guest memory mmapped?
> >
> > Not necessarily, e.g. KVM could re-establish the direct map or mremap()
> > on-demand.  There are variations on that, e.g. if ASI[*] were to ever
> > make its way upstream, which is a huge if, then we could have
> > guest_memfd mapped into a KVM-only CR3.
>
> Yes, on-demand mapping-in of guest RAM pages is definitely an option.
> It sounds quite challenging to need to always go via interfaces which
> demand map/fault memory, and also potentially quite slow needing to
> unmap and flush afterwards.
>
> Not too sure what you have in mind with "guest_memfd mapped into KVM-
> only CR3" - could you expand?
Remove guest_memfd from the kernel's direct map, e.g. so that the kernel
at-large can't touch guest memory, but have a separate set of page tables
that have the direct map, userspace page tables, _and_ kernel mappings for
guest_memfd.  On KVM_RUN (or vcpu_load()?), switch to KVM's CR3 so that,
for KVM, map/unmap of guest memory are always free (literal nops).

That's an imperfect solution as IRQs and NMIs will run kernel code with
KVM's page tables, i.e. guest memory would still be exposed to the host
kernel.  And of course we'd need to get buy-in from multiple architectures
and maintainers, etc.

> > > I guess the idea is that this use case will still be subject to the
> > > normal restriction rules, but for a non-CoCo non-pKVM VM there will
> > > be no restriction in practice, and userspace will need to mmap
> > > everything always?
> > >
> > > It really seems yucky to need to have all of guest RAM mmapped all
> > > the time just for MMIO to work...  But I suppose there is no way
> > > around that for Intel x86.
> >
> > It's not just MMIO.  Nested virtualization, and more specifically
> > shadowing nested TDP, is also problematic (probably more so than
> > MMIO).  And there are more cases, i.e. we'll need a generic solution
> > for this.  As above, there are a variety of options, it's largely just
> > a matter of doing the work.  I'm not saying it's a trivial amount of
> > work/effort, but it's far from an unsolvable problem.
>
> I didn't even think of nested virt, but that will absolutely be an even
> bigger problem too.  MMIO was just the first roadblock which illustrated
> the problem.
>
> Overall what I'm trying to figure out is whether there is any sane path
> here other than needing to mmap all guest RAM all the time.  Trying to
> get nested virt and MMIO and whatever else needs access to guest RAM
> working by doing just-in-time (aka: on-demand) mappings and unmappings
> of guest RAM sounds like a painful game of whack-a-mole, potentially
> really bad for performance too.

It's a whack-a-mole game that KVM already plays, e.g. for dirty tracking,
post-copy demand paging, etc.  There is still plenty of room for
improvement, e.g. to reduce the number of touchpoints and thus the
potential for missed cases.  But KVM more or less needs to solve this
basic problem no matter what, so I don't think that guest_memfd adds much,
if any, burden.

> Do you think we should look at doing this on-demand mapping, or, for
> now, simply require that all guest RAM is mmapped all the time and KVM
> be given a valid virtual addr for the memslots?

I don't think "map everything into userspace" is a viable approach,
precisely because it requires reflecting that back into KVM's memslots,
which in turn means guest_memfd needs to allow gup().  And I don't think
we want to allow gup(), because that opens a rather large can of worms
(see the long thread I linked).

Hmm, a slightly crazy idea (ok, maybe wildly crazy) would be to support
mapping all of guest_memfd into kernel address space, but as USER=1
mappings.  I.e. don't require a carve-out from userspace, but do require
CLAC/STAC when accessing guest memory from the kernel.  I think/hope that
would provide the speculative execution mitigation properties you're
looking for?  Userspace would still have access to guest memory, but it
would take a truly malicious userspace for that to matter.  And when CPUs
that support LASS come along, userspace would be completely unable to
access guest memory through KVM's magic mapping.
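
To make the slightly-crazy idea a bit more concrete, here's a minimal
sketch of what an access helper could look like.  This is purely
illustrative, kvm_gmem_magic_vaddr() is an invented placeholder and none
of this exists in KVM today:

  #include <linux/kvm_host.h>
  #include <asm/smap.h>

  /*
   * Illustrative only: read guest memory through a fixed, USER=1 kernel
   * mapping of guest_memfd.  kvm_gmem_magic_vaddr() is a made-up helper
   * that would return the address of the gfn's backing page within that
   * mapping.  A real implementation would also want exception handling,
   * a la copy_from_user().
   */
  static int kvm_gmem_read(struct kvm *kvm, gfn_t gfn, void *dst, int len)
  {
  	void __user *src = kvm_gmem_magic_vaddr(kvm, gfn);

  	if (!src)
  		return -EFAULT;

  	/*
  	 * Set EFLAGS.AC so that SMAP allows the access; any touch of the
  	 * USER=1 mapping outside of a stac()/clac() pair would fault.
  	 */
  	stac();
  	memcpy(dst, (__force void *)src, len);
  	clac();

  	return 0;
  }

I.e. essentially the same pattern as the existing uaccess helpers, just
against a KVM-owned mapping instead of a userspace pointer.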
This too would require a decent amount of buy-in from outside of KVM,
e.g. to carve out the virtual address range in the kernel.  But the
performance overhead would be identical to the status quo.  And there
could be advantages to being able to identify accesses to guest memory
based purely on kernel virtual address.
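
FWIW, the "identify by kernel virtual address" bit could be as simple as
a range check against whatever region gets carved out.  Again purely
illustrative, the constants and helper name below are made up, not a
proposal for where such a region would actually live:

  #include <linux/types.h>

  /* Placeholder bounds for a hypothetical guest_memfd-only VA region. */
  #define KVM_GMEM_VADDR_START	0xffffeb0000000000UL
  #define KVM_GMEM_VADDR_END	0xffffec0000000000UL

  /* "Is this pointer guest memory?" becomes a trivial range check. */
  static inline bool kvm_addr_is_gmem(unsigned long addr)
  {
  	return addr >= KVM_GMEM_VADDR_START && addr < KVM_GMEM_VADDR_END;
  }

That sort of check isn't possible with direct map addresses, since guest
pages are interleaved with everything else in the direct map.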