Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
Hi Ackerley,

On Thu, 2024-10-03 at 22:32 +0100, Ackerley Tng wrote:
> Elliot Berman writes:
>
>> On Tue, Sep 10, 2024 at 11:44:01PM +, Ackerley Tng wrote:
>>> Since guest_memfd now supports mmap(), folios have to be prepared
>>> before they are faulted into userspace.
>>>
>>> When memory attributes are switched between shared and private, the
>>> up-to-date flags will be cleared.
>>>
>>> Use the folio's up-to-date flag to indicate readiness for guest
>>> usage; it can mark whether the folio is ready for shared OR
>>> private use.
>>
>> Clearing the up-to-date flag also means that the page gets zero'd out
>> whenever it transitions between shared and private (either direction).
>> pKVM (Android) hypervisor policy can allow in-place conversion between
>> shared/private.
>>
>> I believe the important thing is that sev_gmem_prepare() needs to be
>> called prior to giving the page to the guest. In my series, I had made
>> a ->prepare_inaccessible() callback where KVM would only do this part.
>> When transitioning to inaccessible, only that callback would be made,
>> besides the bookkeeping. The folio zeroing happens once when allocating
>> the folio, if the folio is initially accessible (faultable).
>>
>> From the x86 CoCo perspective, I think it also makes sense not to zero
>> the folio when changing faultability from private to shared:
>>  - If the guest is sharing some data with the host, you've wiped the
>>    data and the guest has to copy it again.
>>  - Or, if SEV/TDX enforces that the page is zero'd between transitions,
>>    Linux has duplicated work that the trusted entity has already done.
>>
>> Fuad and I can help add some details for the conversion. Hopefully we
>> can figure out some of the plan at Plumbers this week.
>
> Zeroing the page prevents leaking host data (see the function docstring
> for kvm_gmem_prepare_folio() introduced in [1]), so we definitely don't
> want to introduce a kernel data leak bug here.
>
> In-place conversion does require preservation of data, so for
> conversions, shall we zero depending on VM type?
>
> + Gunyah: don't zero, since ->prepare_inaccessible() is a no-op
> + pKVM: don't zero
> + TDX: don't zero
> + SEV: AMD Architecture Programmer's Manual 7.10.6 says there is no
>   automatic encryption and implies no zeroing, hence perform zeroing
> + KVM_X86_SW_PROTECTED_VM: doesn't have a formal definition, so I guess
>   we could require zeroing on transition?

Maybe for KVM_X86_SW_PROTECTED_VM we could make zeroing configurable via
some CREATE_GUEST_MEMFD flag, instead of forcing one specific behavior.
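Purely as a sketch of the per-VM-type policy above — the helper is
hypothetical, KVM_X86_PKVM_VM and KVM_X86_TDX_VM are placeholders here,
and the Gunyah case wouldn't go through x86 code at all:

	/*
	 * Hypothetical helper, only to illustrate the per-VM-type
	 * zeroing policy from the list above; not part of this series.
	 */
	static bool kvm_gmem_should_zero_on_conversion(struct kvm *kvm)
	{
		switch (kvm->arch.vm_type) {
		case KVM_X86_SEV_VM:		/* no automatic scrubbing (APM 7.10.6) */
			return true;
		case KVM_X86_SW_PROTECTED_VM:	/* or gate this on a CREATE_GUEST_MEMFD flag? */
			return true;
		case KVM_X86_TDX_VM:		/* placeholder: trusted entity scrubs */
		case KVM_X86_PKVM_VM:		/* placeholder: in-place conversion, no zeroing */
			return false;
		default:
			return true;		/* fail closed: never leak host data */
		}
	}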
For the "non-CoCo with direct map entries removed" VMs that we at AWS
are going for, we'd like a VM type with host-controlled in-place
conversions which doesn't zero on transitions, so if
KVM_X86_SW_PROTECTED_VM ends up zeroing, we'd need to add another new
VM type for that.

Somewhat related sidenote: for VMs that allow in-place conversions and
do not zero, we do not need to zap the stage-2 mappings on memory
attribute changes, right?

> This way, the uptodate flag means that the folio has been prepared (as
> in sev_gmem_prepare()), and zeroed if required by the VM type.
>
> Regarding flushing the dcache/tlb in your other question [2]: if we
> don't use folio_zero_user(), can we rely on unmapping within core-mm
> to flush after shared use, and on unmapping within KVM to flush after
> private use?
>
> Or should flush_dcache_folio() be explicitly called on kvm_gmem_fault()?
>
> clear_highpage(), used in the non-hugetlb (original) path, doesn't
> flush the dcache. Was that intended?
>
>> Thanks,
>> Elliot
>
> [1] https://lore.kernel.org/all/20240726185157.72821-8-pbonz...@redhat.com/
> [2] https://lore.kernel.org/all/diqz34ldszp3@ackerleytng-ctop.c.googlers.com/

Best,
Patrick
Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
On Fri, 2024-10-18 at 00:16 +0100, Ackerley Tng wrote:
> Patrick Roy writes:
>
>> On Tue, 2024-10-08 at 20:56 +0100, Sean Christopherson wrote:
>>> On Tue, Oct 08, 2024, Ackerley Tng wrote:
>>>> Patrick Roy writes:
>>>>> For the "non-CoCo with direct map entries removed" VMs that we at
>>>>> AWS are going for, we'd like a VM type with host-controlled
>>>>> in-place conversions which doesn't zero on transitions,
>>>
>>> Hmm, your use case shouldn't need conversions _for KVM_, as there's
>>> no need for KVM to care if userspace or the guest _wants_ a page to
>>> be shared vs. private. Userspace is fully trusted to manage things;
>>> KVM simply reacts to the current state of things.
>>>
>>> And more importantly, whether or not the direct map is zapped needs
>>> to be a property of the guest_memfd inode, i.e. can't be associated
>>> with a struct kvm. I forget who got volunteered to do the work,
>>
>> I think me? At least we talked about it briefly
>>
>>> but we're going to need similar functionality for tracking the state
>>> of individual pages in a huge folio, as folio_mark_uptodate() is too
>>> coarse-grained. I.e. at some point, I expect that guest_memfd will
>>> make it easy-ish to determine whether or not the direct map has been
>>> obliterated.
>>>
>>> The shared vs. private attributes tracking in KVM is still needed (I
>>> think), as it communicates what userspace _wants_, whereas the
>>> guest_memfd machinery will track what the state _is_.
>>
>> If I'm understanding this patch series correctly, the approach taken
>> here is to force the KVM memory attributes and the internal
>> guest_memfd state to be in-sync, because the VMA from mmap()ing
>> guest_memfd is reflected back into the userspace_addr of the memslot.
>
> In this patch series, we're also using guest_memfd state (faultability
> xarray) to prevent any future faults before checking that there are no
> mappings. Further explanation at [1].
>
> Reason (a) at [1] is what Sean describes above to be what userspace
> _wants_ vs what the state _is_.

Ah, I was missing that detail about faultability being disabled, yet
mem_attrs not being updated until all pins are actually gone. Thanks!

Mh, I'm probably not seeing it because of my lack of experience with
CoCo setups, but how would pKVM not trusting userspace about conversions
cause mem_attrs and faultability to go out of sync? Or generally, if the
guest and userspace have different ideas about what is shared, and
userspace's idea is stored in mem_attrs (or rather, the part where they
can agree is stored in mem_attrs?), where do we store the guest's view
of it? Guest page tables?

>> So, to me, in this world, "direct map zapped iff
>> kvm_has_mem_attributes(KVM_MEMORY_ATTRIBUTES_PRIVATE)", with memory
>> attribute changes forcing the corresponding gmem state change. That's
>> why I was talking about conversions above.
>
> I think if we do continue to have state in guest_memfd, then direct
> map removal should be based on guest_memfd's state, rather than
> KVM_MEMORY_ATTRIBUTE_PRIVATE in mem_attr_array.

I am not trying to argue against tracking it in guest_memfd, I'm just
wondering if mem attributes and direct map state would ever disagree.
But probably that's also just because of my confusion above :)

>> I've played around with this locally, and since KVM seems to generally
>> use copy_from_user and friends to access the userspace_addr VMA (aka
>> private mem that's reflected back into memslots here), with this,
>> things like MMIO emulation can be oblivious to gmem's existence, since
>> copy_from_user and co don't require GUP or presence of direct map
>> entries (well, "oblivious" in the sense that things like
>> kvm_read_guest currently ignore memory attributes and unconditionally
>> access userspace_addr, which I suppose is not really wanted for VMs
>> where userspace_addr and guest_memfd aren't short-circuited like
>> this). The exception is kvm_clock, where the pv_time page would need
>> to be explicitly converted to shared to restore the direct map entry,
>> although I think we could just let userspace deal with making sure
>> this page is shared (and then, if gmem supports GUP on shared memory,
>> even the gfn_to_pfn_caches could work without gmem knowledge. Without
>> GUP, we'd still need a tiny hack in the uhva->pfn translation
>> somewhere to handle gmem vmas, but iirc you did mention that having
>> kvm-clock be special might be fine).
Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
On Thu, 2024-10-17 at 20:18 +0100, Jason Gunthorpe wrote:
> On Thu, Oct 17, 2024 at 03:11:10PM -0400, Peter Xu wrote:
>> On Thu, Oct 17, 2024 at 02:10:10PM -0300, Jason Gunthorpe wrote:
>>>> If so, maybe that's a non-issue for non-CoCo, where the VM object /
>>>> gmemfd object (when created) can have a flag marking that it's
>>>> always shared and can never be converted to private for any page
>>>> within.
>>>
>>> What is non-CoCo? Does it include the private/shared concept?
>>
>> I used that to represent the possible gmemfd use cases outside
>> confidential computing.
>>
>> So the private/shared things should still be around as a fundamental
>> property of gmemfd, but it should be always shared and no conversion
>> needed for the whole lifecycle of the gmemfd when marked !CoCo.
>
> But what does private mean in this context?
>
> Is it just like a bit of additional hypervisor security that the page
> is not mapped anyplace except the KVM stage 2, and the hypervisor can
> cause it to become mapped/shared at any time? But the guest has no
> idea about this?
>
> Jason

Yes, this is pretty much exactly what I'm after when I say "non-CoCo":
no direct map entries, to provide defense-in-depth for guests against
various speculative execution issues, but not a full confidential
computing setup (e.g. the guest should be completely oblivious to this,
and not require any modifications).
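To make "no direct map entries" concrete: what I have in mind is
essentially what mm/secretmem.c already does today. A minimal sketch —
the gmem-side wrapper name is made up, and error unwinding is omitted:

	#include <linux/mm.h>
	#include <linux/set_memory.h>
	#include <asm/tlbflush.h>

	/*
	 * Hypothetical gmem-side wrapper; set_direct_map_*_noflush()
	 * are the real primitives that secretmem uses.
	 */
	static int gmem_folio_zap_direct_map(struct folio *folio)
	{
		unsigned long start = (unsigned long)folio_address(folio);
		long i;

		for (i = 0; i < folio_nr_pages(folio); i++) {
			int r = set_direct_map_invalid_noflush(folio_page(folio, i));

			if (r)
				return r; /* sketch only: real code must undo prior pages */
		}

		/* The PTEs are gone; drop any stale kernel TLB entries. */
		flush_tlb_kernel_range(start, start + folio_size(folio));
		return 0;
	}

Restoring the mapping on conversion back to shared would mirror this
with set_direct_map_default_noflush().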
Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
On Fri, 2024-10-18 at 08:50 +0100, David Hildenbrand wrote:
> On 18.10.24 09:15, Patrick Roy wrote:
>>
>> On Thu, 2024-10-17 at 20:18 +0100, Jason Gunthorpe wrote:
>>> On Thu, Oct 17, 2024 at 03:11:10PM -0400, Peter Xu wrote:
>>>> On Thu, Oct 17, 2024 at 02:10:10PM -0300, Jason Gunthorpe wrote:
>>>>>> If so, maybe that's a non-issue for non-CoCo, where the VM object /
>>>>>> gmemfd object (when created) can have a flag marking that it's
>>>>>> always shared and can never be converted to private for any page
>>>>>> within.
>>>>>
>>>>> What is non-CoCo? Does it include the private/shared concept?
>>>>
>>>> I used that to represent the possible gmemfd use cases outside
>>>> confidential computing.
>>>>
>>>> So the private/shared things should still be around as a fundamental
>>>> property of gmemfd, but it should be always shared and no conversion
>>>> needed for the whole lifecycle of the gmemfd when marked !CoCo.
>>>
>>> But what does private mean in this context?
>>>
>>> Is it just like a bit of additional hypervisor security that the page
>>> is not mapped anyplace except the KVM stage 2, and the hypervisor can
>>> cause it to become mapped/shared at any time? But the guest has no
>>> idea about this?
>>>
>>> Jason
>>
>> Yes, this is pretty much exactly what I'm after when I say "non-CoCo".
>
> It's likely not what Peter meant, though.
>
> I think there are three scenarios:
>
> (a) Secure CoCo VMs: private is protected by HW
> (b) Semi-secured non-CoCo VMs: private is removed from the direct map
> (c) Non-CoCo VMs: only shared memory
>
> Does that match what you have in mind? Are there other cases?

Yeah, I'm after your case (b). I suppose I will not call it just
"non-CoCo" anymore then :)

> --
> Cheers,
>
> David / dhildenb
Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
On Tue, 2024-10-08 at 20:56 +0100, Sean Christopherson wrote:
> On Tue, Oct 08, 2024, Ackerley Tng wrote:
>> Patrick Roy writes:
>>> For the "non-CoCo with direct map entries removed" VMs that we at AWS
>>> are going for, we'd like a VM type with host-controlled in-place
>>> conversions which doesn't zero on transitions,
>
> Hmm, your use case shouldn't need conversions _for KVM_, as there's no
> need for KVM to care if userspace or the guest _wants_ a page to be
> shared vs. private. Userspace is fully trusted to manage things; KVM
> simply reacts to the current state of things.
>
> And more importantly, whether or not the direct map is zapped needs to
> be a property of the guest_memfd inode, i.e. can't be associated with a
> struct kvm. I forget who got volunteered to do the work,

I think me? At least we talked about it briefly

> but we're going to need similar functionality for tracking the state
> of individual pages in a huge folio, as folio_mark_uptodate() is too
> coarse-grained. I.e. at some point, I expect that guest_memfd will
> make it easy-ish to determine whether or not the direct map has been
> obliterated.
>
> The shared vs. private attributes tracking in KVM is still needed (I
> think), as it communicates what userspace _wants_, whereas the
> guest_memfd machinery will track what the state _is_.

If I'm understanding this patch series correctly, the approach taken
here is to force the KVM memory attributes and the internal guest_memfd
state to be in-sync, because the VMA from mmap()ing guest_memfd is
reflected back into the userspace_addr of the memslot. So, to me, in
this world, "direct map zapped iff
kvm_has_mem_attributes(KVM_MEMORY_ATTRIBUTES_PRIVATE)", with memory
attribute changes forcing the corresponding gmem state change. That's
why I was talking about conversions above.

I've played around with this locally, and since KVM seems to generally
use copy_from_user and friends to access the userspace_addr VMA (aka
private mem that's reflected back into memslots here), with this,
things like MMIO emulation can be oblivious to gmem's existence, since
copy_from_user and co don't require GUP or presence of direct map
entries (well, "oblivious" in the sense that things like kvm_read_guest
currently ignore memory attributes and unconditionally access
userspace_addr, which I suppose is not really wanted for VMs where
userspace_addr and guest_memfd aren't short-circuited like this). The
exception is kvm_clock, where the pv_time page would need to be
explicitly converted to shared to restore the direct map entry,
although I think we could just let userspace deal with making sure this
page is shared (and then, if gmem supports GUP on shared memory, even
the gfn_to_pfn_caches could work without gmem knowledge. Without GUP,
we'd still need a tiny hack in the uhva->pfn translation somewhere to
handle gmem vmas, but iirc you did mention that having kvm-clock be
special might be fine).
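To illustrate the access path I mean — a stripped-down version of what
kvm_read_guest() effectively does today (the helpers are real, but
memslot handling and error paths are heavily simplified, and this
assumes the access doesn't cross a memslot boundary):

	/*
	 * Simplified sketch of reading guest memory through the
	 * memslot's userspace_addr: copy_from_user() walks the *user*
	 * page tables, so it works even when the direct map entry for
	 * the page is gone.
	 */
	static int sketch_read_guest(struct kvm *kvm, gpa_t gpa, void *data,
				     unsigned long len)
	{
		struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gpa_to_gfn(gpa));
		unsigned long uhva;

		if (!slot)
			return -EFAULT;

		uhva = slot->userspace_addr +
		       ((gpa_to_gfn(gpa) - slot->base_gfn) << PAGE_SHIFT) +
		       offset_in_page(gpa);

		return copy_from_user(data, (void __user *)uhva, len) ? -EFAULT : 0;
	}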
I guess it does come down to what you note below, answering the
question of "how does KVM internally access guest_memfd for non-CoCo
VMs". Is there any way we can make uaccesses like above work?

I've finally gotten around to re-running some performance benchmarks of
my on-demand reinsertion patches with all the needed TLB flushes added,
and my fio benchmark on a virtio-blk device suffers a ~50% throughput
regression, which does not necessarily spark joy. And I think James H.
mentioned at LPC that making the userfault stuff work with my patches
would be quite hard.

All this in addition to you also not necessarily sounding too keen on
it either :D

>>> so if KVM_X86_SW_PROTECTED_VM ends up zeroing, we'd need to add
>>> another new VM type for that.
>
> Maybe we should sneak in a s/KVM_X86_SW_PROTECTED_VM/KVM_X86_SW_HARDENED_VM
> rename?
>
> The original thought behind "software protected VM" was to do a slow
> build of something akin to pKVM, but realistically I don't think that
> idea is going anywhere.

Ah, admittedly I've thought of KVM_X86_SW_PROTECTED_VM as a bit of a
playground where various configurations other VM types enforce can be
mixed and matched (e.g. zero on conversions yes/no, direct map removal
yes/no), so more of a KVM_X86_GMEM_VM, but am happy to update my
understanding :)

> Alternatively, depending on how KVM accesses guest memory that's been
> removed from the direct map, another solution would be to allow
> "regular" VMs to bind memslots to guest_memfd, i.e. if the non-CoCo
> use case needs/wants to bind all memory to guest_memfd, not just
> "private" mappings.
>
> That's probably the biggest topic of discussion: how do we wa
[PATCH] kvm: selftest: fix noop test in guest_memfd_test.c
The loop in test_create_guest_memfd_invalid() that is supposed to test
that nothing is accepted as a valid flag to KVM_CREATE_GUEST_MEMFD was
initializing `flag` as 0 instead of BIT(0). This caused the loop to
exit immediately instead of iterating over BIT(0), BIT(1), ... .

Fixes: 8a89efd43423 ("KVM: selftests: Add basic selftest for guest_memfd()")
Signed-off-by: Patrick Roy
---
 tools/testing/selftests/kvm/guest_memfd_test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index ba0c8e9960358..ce687f8d248fc 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -134,7 +134,7 @@ static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
 			    size);
 	}
 
-	for (flag = 0; flag; flag <<= 1) {
+	for (flag = BIT(0); flag; flag <<= 1) {
 		fd = __vm_create_guest_memfd(vm, page_size, flag);
 		TEST_ASSERT(fd == -1 && errno == EINVAL,
 			    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
-- 
2.47.0
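For reference, the behavioral difference is easy to demonstrate in
isolation — a standalone userspace sketch, with BIT() defined locally
since this is outside the kernel tree:

	#include <stdio.h>

	#define BIT(n) (1UL << (n))

	int main(void)
	{
		unsigned long flag;
		int iterations;

		/* Broken variant: the condition `flag` is false on entry,
		 * so the body never runs and nothing is tested. */
		iterations = 0;
		for (flag = 0; flag; flag <<= 1)
			iterations++;
		printf("flag = 0:      %d iterations\n", iterations); /* prints 0 */

		/* Fixed variant: visits BIT(0), BIT(1), ... until the set
		 * bit is shifted out of the word and flag wraps to 0. */
		iterations = 0;
		for (flag = BIT(0); flag; flag <<= 1)
			iterations++;
		printf("flag = BIT(0): %d iterations\n", iterations); /* 64 on LP64 */

		return 0;
	}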