On Tue, May 06, 2025 at 12:22:47PM -0700, Ackerley Tng wrote:
> Yan Zhao <[email protected]> writes:
>
> >> > <snip>
> >> >
> >> > What options does userspace have in this scenario?
> >> > It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the
> >> > gmem.pgoff isn't ideal either.
> >> >
> >> > What about something similar as below?
> >> >
> >> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> >> > index d2feacd14786..87c33704a748 100644
> >> > --- a/virt/kvm/guest_memfd.c
> >> > +++ b/virt/kvm/guest_memfd.c
> >> > @@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
> >> >  	}
> >> >
> >> >  	*pfn = folio_file_pfn(folio, index);
> >> > -	if (max_order)
> >> > -		*max_order = folio_order(folio);
> >> > +	if (max_order) {
> >> > +		int order;
> >> > +
> >> > +		order = folio_order(folio);
> >> > +
> >> > +		while (order > 0 &&
> >> > +		       ((slot->base_gfn ^ slot->gmem.pgoff) & ((1 << order) - 1)))
> >> > +			order--;
> >> > +
> >> > +		*max_order = order;
> >> > +	}
> >> >
> >> >  	*is_prepared = folio_test_uptodate(folio);
> >> >  	return folio;
> >> >
> >>
> >> Vishal was wondering how this was working before guest_memfd was
> >> introduced, for other backing memory like HugeTLB.
> >>
> >> I then poked around and found this [1]. I will be adding a similar check
> >> for any slot where kvm_slot_can_be_private(slot).
> >>
> >> Yan, that should work, right?
> > No, I don't think the checking of ugfn [1] would work.
> >
> > 1. Even for slots bound to in-place-conversion guest_memfd (i.e. shared
> >    memory is allocated from guest_memfd), slot->userspace_addr does not
> >    necessarily have the same offset as slot->gmem.pgoff. Even if we
> >    audit the offset in kvm_gmem_bind(), userspace could invoke munmap()
> >    and mmap() afterwards, causing slot->userspace_addr to point to a
> >    different offset.
> >
> > 2. For slots bound to a guest_memfd that does not support in-place
> >    conversion, shared memory is allocated from a different backend, so
> >    checking "slot->base_gfn ^ slot->gmem.pgoff" is required for private
> >    memory. That check is currently absent because guest_memfd supports
> >    4K only.
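
(As an aside, to make the misalignment in 1./2. above concrete: below is
a minimal standalone sketch of the order-reduction loop quoted above.
Illustration only; the constants are made up.)

#include <stdio.h>

/*
 * Standalone version of the loop in __kvm_gmem_get_pfn() above: reduce
 * the mapping order until gfn and guest_memfd offset are aligned wrt
 * each other at that order.
 */
static int clamp_order(unsigned long base_gfn, unsigned long pgoff,
		       int order)
{
	while (order > 0 && ((base_gfn ^ pgoff) & ((1UL << order) - 1)))
		order--;
	return order;
}

int main(void)
{
	/* 2MB folio (order 9 in 4K pages): base_gfn is 2MB-aligned, but
	 * gmem.pgoff is only 64K-aligned wrt base_gfn, so no 2MB mapping. */
	printf("%d\n", clamp_order(0x200, 0x210, 9));	/* prints 4 (64K) */

	/* gfn and offset aligned wrt each other at 2MB: order is kept. */
	printf("%d\n", clamp_order(0x200, 0x400, 9));	/* prints 9 (2MB) */
	return 0;
}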
>
> Let me clarify, I meant these changes:
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4b64ab3..d0dccf1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12938,6 +12938,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages)
>  	return 0;
>  }
>
> +static inline bool kvm_is_level_aligned(u64 value, int level)
> +{
> +	return IS_ALIGNED(value, KVM_PAGES_PER_HPAGE(level));
> +}
> +
>  static int kvm_alloc_memslot_metadata(struct kvm *kvm,
>  				      struct kvm_memory_slot *slot)
>  {
> @@ -12971,16 +12976,20 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
>
>  		slot->arch.lpage_info[i - 1] = linfo;
>
> -		if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
> +		if (!kvm_is_level_aligned(slot->base_gfn, level))
>  			linfo[0].disallow_lpage = 1;
> -		if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
> +		if (!kvm_is_level_aligned(slot->base_gfn + npages, level))
>  			linfo[lpages - 1].disallow_lpage = 1;
>  		ugfn = slot->userspace_addr >> PAGE_SHIFT;
>  		/*
> -		 * If the gfn and userspace address are not aligned wrt each
> -		 * other, disable large page support for this slot.
> +		 * If the gfn and userspace address are not aligned or if gfn
> +		 * and guest_memfd offset are not aligned wrt each other,
> +		 * disable large page support for this slot.
>  		 */
> -		if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
> +		if (!kvm_is_level_aligned(slot->base_gfn ^ ugfn, level) ||
> +		    (kvm_slot_can_be_private(slot) &&
> +		     !kvm_is_level_aligned(slot->base_gfn ^ slot->gmem.pgoff,
> +					   level))) {
>  			unsigned long j;
>
>  			for (j = 0; j < lpages; ++j)
>
> This does not rely on the ugfn check, but adds a similar check for
> gmem.pgoff.

In the case where shared memory is not allocated from guest_memfd (e.g.
with the current upstream code), this gmem.pgoff check will disallow
huge pages for shared memory even when "slot->base_gfn ^ ugfn" is
aligned.
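
(To illustrate with a made-up scenario; the constants are invented and
this is not a patch. Here ugfn is 2MB-aligned wrt base_gfn, but
gmem.pgoff is not.)

#include <stdbool.h>
#include <stdio.h>

#define PAGES_PER_2M	512UL	/* KVM_PAGES_PER_HPAGE() for 2MB, in 4K pages */

static bool aligned_2m(unsigned long pages)
{
	return (pages & (PAGES_PER_2M - 1)) == 0;
}

int main(void)
{
	unsigned long base_gfn = 0x80000;	/* 2MB-aligned gfn */
	unsigned long ugfn     = 0x40000;	/* userspace_addr >> PAGE_SHIFT */
	unsigned long pgoff    = 0x10;		/* 64K-aligned fd offset */

	/*
	 * The binding-time check in the diff above yields one verdict
	 * for the whole slot, so shared (HVA-based) mappings lose 2MB
	 * pages although base_gfn ^ ugfn is aligned.
	 */
	bool disallow = !aligned_2m(base_gfn ^ ugfn) ||
			!aligned_2m(base_gfn ^ pgoff);
	printf("slot-wide disallow_lpage = %d\n", disallow);	/* 1 */

	/*
	 * The fault-time clamp in __kvm_gmem_get_pfn() above only
	 * reduces the private fd+offset path; a shared fault through
	 * the still-aligned HVA could keep its 2MB mapping.
	 */
	printf("shared 2MB still possible = %d\n",
	       aligned_2m(base_gfn ^ ugfn));			/* 1 */
	return 0;
}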
> I think this should take care of case (1.) for guest_memfds going to be
> used for both shared and private memory. Userspace can't update
> slot->userspace_addr, since guest_memfd memslots cannot be updated and
> can only be deleted.
>
> If userspace re-uses slot->userspace_addr for some other memory address
> without deleting and re-adding a memslot,
>
> + KVM's access to memory should still be fine, since after the recent
>   discussion at the guest_memfd upstream call, KVM's guest faults will
>   always go via fd+offset, so KVM's access won't be disrupted there.
>   Whatever checking was done at memslot binding time will still be
>   valid.

Could the offset of shared memory and the offset of private memory
become different if userspace re-uses slot->userspace_addr without
deleting and re-adding a memslot? Then even though the two offsets are
validated as equal in kvm_gmem_bind(), they may differ later on (a
sketch of such a sequence follows at the end of this mail).

> + Host's access and other accesses (e.g. instruction emulation, which
>   uses slot->userspace_addr) to guest memory will be broken, but I
>   think there's nothing protecting against that. The same breakage
>   would happen for a non-guest_memfd memslot.

Why would host access be broken in the non-guest_memfd case? The HVA is
still a valid one in QEMU's mmap-ed address space.

> p.s. I will be adding the validation as you suggested [1], though that
> shouldn't make a difference here, since the above check directly
> validates against gmem.pgoff.
>
> Regarding 2., this check validates against gmem.pgoff and should handle
> that case as well.
>
> [1] https://lore.kernel.org/all/[email protected]/
>
> I prefer checking at binding time because it aligns with the ugfn check
> that is already there, and avoids having to check at every fault.
>
> >> [1]
> >> https://github.com/torvalds/linux/blob/b6ea1680d0ac0e45157a819c41b46565f4616186/arch/x86/kvm/x86.c#L12996
> >>
> >> >> >> Adding checks at binding time will allow hugepage-unaligned
> >> >> >> offsets (to be at parity with non-guest_memfd backing memory)
> >> >> >> but still fix this issue.
> >> >> >>
> >> >> >> lpage_info will make sure that ranges near the bounds will be
> >> >> >> fragmented, but the hugepages in the middle will still be
> >> >> >> mappable as hugepages.
> >> >> >>
> >> >> >> [1]
> >> >> >> https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg
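
To illustrate the re-use scenario asked about above, here is a hedged
userspace sketch. It assumes an in-place-conversion guest_memfd whose
fd supports mmap() (a design still under discussion, not upstream API);
gmem_fd, hva and len are placeholders.

#include <sys/mman.h>
#include <sys/types.h>

static void reuse_hva(int gmem_fd, void *hva, size_t len)
{
	/*
	 * At bind time: userspace_addr == hva, which maps gmem_fd at
	 * offset 0, matching gmem.pgoff == 0, so an equality check in
	 * kvm_gmem_bind() passes.
	 */

	/*
	 * Later, without deleting the memslot, userspace re-uses the
	 * same HVA range for a different offset of the same fd:
	 */
	munmap(hva, len);
	mmap(hva, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
	     gmem_fd, (off_t)len);

	/*
	 * Now the shared view under userspace_addr maps offset len,
	 * while private faults still use gmem.pgoff == 0: the offsets
	 * validated as equal at bind time have diverged.
	 */
}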

