Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-05-13 Thread Ackerley Tng
Yan Zhao  writes:

> On Tue, May 06, 2025 at 12:22:47PM -0700, Ackerley Tng wrote:
>> Yan Zhao  writes:
>> 
>> >> > 
>> >> >
>> >> > What options does userspace have in this scenario?
>> >> > It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the 
>> >> > gmem.pgoff
>> >> > isn't ideal either.
>> >> >
>> >> > What about something similar as below?
>> >> >
>> >> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> >> > index d2feacd14786..87c33704a748 100644
>> >> > --- a/virt/kvm/guest_memfd.c
>> >> > +++ b/virt/kvm/guest_memfd.c
>> >> > @@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct 
>> >> > kvm_memory_slot *slot,
>> >> > }
>> >> >
>> >> > *pfn = folio_file_pfn(folio, index);
>> >> > -   if (max_order)
>> >> > -   *max_order = folio_order(folio);
>> >> > +   if (max_order) {
>> >> > +   int order;
>> >> > +
>> >> > +   order = folio_order(folio);
>> >> > +
>> >> > +   while (order > 0 && ((slot->base_gfn ^ 
>> >> > slot->gmem.pgoff) & ((1 << order) - 1)))
>> >> > +   order--;
>> >> > +
>> >> > +   *max_order = order;
>> >> > +   }
>> >> >
>> >> > *is_prepared = folio_test_uptodate(folio);
>> >> > return folio;
>> >> >
>> >> 
>> >> Vishal was wondering how this is working before guest_memfd was
>> >> introduced, for other backing memory like HugeTLB.
>> >> 
>> >> I then poked around and found this [1]. I will be adding a similar check
>> >> for any slot where kvm_slot_can_be_private(slot).
>> >>
>> >> Yan, that should work, right?
>> > No, I don't think the checking of ugfn [1] should work.
>> >
>> > 1. Even for slots bound to in-place-conversion guest_memfd (i.e. shared 
>> > memory
>> > are allocated from guest_memfd), the slot->userspace_addr does not 
>> > necessarily
>> > have the same offset as slot->gmem.pgoff. Even if we audit the offset in
>> > kvm_gmem_bind(), userspace could invoke munmap() and mmap() afterwards, 
>> > causing
>> > slot->userspace_addr to point to a different offset.
>> >
>> > 2. for slots bound to guest_memfd that do not support in-place-conversion,
>> > shared memory is allocated from a different backend. Therefore, checking
>> > "slot->base_gfn ^ slot->gmem.pgoff" is required for private memory. The 
>> > check is
>> > currently absent because guest_memfd supports 4K only.
>> >
>> >
>> 
>> Let me clarify, I meant these changes:
>> 
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 4b64ab3..d0dccf1 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -12938,6 +12938,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot 
>> *slot, unsigned long npages)
>> return 0;
>>  }
>>  
>> +static inline bool kvm_is_level_aligned(u64 value, int level)
>> +{
>> +   return IS_ALIGNED(value, KVM_PAGES_PER_HPAGE(level));
>> +}
>> +
>>  static int kvm_alloc_memslot_metadata(struct kvm *kvm,
>>   struct kvm_memory_slot *slot)
>>  {
>> @@ -12971,16 +12976,20 @@ static int kvm_alloc_memslot_metadata(struct kvm 
>> *kvm,
>>  
>> slot->arch.lpage_info[i - 1] = linfo;
>>  
>> -   if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
>> +   if (!kvm_is_level_aligned(slot->base_gfn, level))
>> linfo[0].disallow_lpage = 1;
>> -   if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) 
>> - 1))
>> +   if (!kvm_is_level_aligned(slot->base_gfn + npages, level))
>> linfo[lpages - 1].disallow_lpage = 1;
>> ugfn = slot->userspace_addr >> PAGE_SHIFT;
>> /*
>> -* If the gfn and userspace address are not aligned wrt each
>> -* other, disable large page support for this slot.
>> +* If the gfn and userspace address are not aligned or if gfn
>> +* and guest_memfd offset are not aligned wrt each other,
>> +* disable large page support for this slot.
>>  */
>> -   if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 
>> 1)) {
>> +   if (!kvm_is_level_aligned(slot->base_gfn ^ ugfn, level) ||
>> +   (kvm_slot_can_be_private(slot) &&
>> +!kvm_is_level_aligned(slot->base_gfn ^ slot->gmem.pgoff,
>> +  level))) {
>> unsigned long j;
>>  
>> for (j = 0; j < lpages; ++j)
>> 
>> This does not rely on the ugfn check, but adds a similar check for 
>> gmem.pgoff.
> In the case of shared memory is not allocated from guest_memfd, (e.g. with the
> current upstream code), the checking of gmem.pgoff here will disallow huge 
> page
> of shared memory even if "slot->base_gfn ^ ugfn" is aligned.
>

Thanks, I get it now. What you mean is that the memslot could have been
set up such that

+ slot->user

Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-05-06 Thread Yan Zhao
On Tue, May 06, 2025 at 12:22:47PM -0700, Ackerley Tng wrote:
> Yan Zhao  writes:
> 
> >> > 
> >> >
> >> > What options does userspace have in this scenario?
> >> > It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the 
> >> > gmem.pgoff
> >> > isn't ideal either.
> >> >
> >> > What about something similar as below?
> >> >
> >> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> >> > index d2feacd14786..87c33704a748 100644
> >> > --- a/virt/kvm/guest_memfd.c
> >> > +++ b/virt/kvm/guest_memfd.c
> >> > @@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct 
> >> > kvm_memory_slot *slot,
> >> > }
> >> >
> >> > *pfn = folio_file_pfn(folio, index);
> >> > -   if (max_order)
> >> > -   *max_order = folio_order(folio);
> >> > +   if (max_order) {
> >> > +   int order;
> >> > +
> >> > +   order = folio_order(folio);
> >> > +
> >> > +   while (order > 0 && ((slot->base_gfn ^ slot->gmem.pgoff) 
> >> > & ((1 << order) - 1)))
> >> > +   order--;
> >> > +
> >> > +   *max_order = order;
> >> > +   }
> >> >
> >> > *is_prepared = folio_test_uptodate(folio);
> >> > return folio;
> >> >
> >> 
> >> Vishal was wondering how this is working before guest_memfd was
> >> introduced, for other backing memory like HugeTLB.
> >> 
> >> I then poked around and found this [1]. I will be adding a similar check
> >> for any slot where kvm_slot_can_be_private(slot).
> >>
> >> Yan, that should work, right?
> > No, I don't think the checking of ugfn [1] should work.
> >
> > 1. Even for slots bound to in-place-conversion guest_memfd (i.e. shared 
> > memory
> > are allocated from guest_memfd), the slot->userspace_addr does not 
> > necessarily
> > have the same offset as slot->gmem.pgoff. Even if we audit the offset in
> > kvm_gmem_bind(), userspace could invoke munmap() and mmap() afterwards, 
> > causing
> > slot->userspace_addr to point to a different offset.
> >
> > 2. for slots bound to guest_memfd that do not support in-place-conversion,
> > shared memory is allocated from a different backend. Therefore, checking
> > "slot->base_gfn ^ slot->gmem.pgoff" is required for private memory. The 
> > check is
> > currently absent because guest_memfd supports 4K only.
> >
> >
> 
> Let me clarify, I meant these changes:
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4b64ab3..d0dccf1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12938,6 +12938,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, 
> unsigned long npages)
> return 0;
>  }
>  
> +static inline bool kvm_is_level_aligned(u64 value, int level)
> +{
> +   return IS_ALIGNED(value, KVM_PAGES_PER_HPAGE(level));
> +}
> +
>  static int kvm_alloc_memslot_metadata(struct kvm *kvm,
>   struct kvm_memory_slot *slot)
>  {
> @@ -12971,16 +12976,20 @@ static int kvm_alloc_memslot_metadata(struct kvm 
> *kvm,
>  
> slot->arch.lpage_info[i - 1] = linfo;
>  
> -   if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
> +   if (!kvm_is_level_aligned(slot->base_gfn, level))
> linfo[0].disallow_lpage = 1;
> -   if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 
> 1))
> +   if (!kvm_is_level_aligned(slot->base_gfn + npages, level))
> linfo[lpages - 1].disallow_lpage = 1;
> ugfn = slot->userspace_addr >> PAGE_SHIFT;
> /*
> -* If the gfn and userspace address are not aligned wrt each
> -* other, disable large page support for this slot.
> +* If the gfn and userspace address are not aligned or if gfn
> +* and guest_memfd offset are not aligned wrt each other,
> +* disable large page support for this slot.
>  */
> -   if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 
> 1)) {
> +   if (!kvm_is_level_aligned(slot->base_gfn ^ ugfn, level) ||
> +   (kvm_slot_can_be_private(slot) &&
> +!kvm_is_level_aligned(slot->base_gfn ^ slot->gmem.pgoff,
> +  level))) {
> unsigned long j;
>  
> for (j = 0; j < lpages; ++j)
> 
> This does not rely on the ugfn check, but adds a similar check for gmem.pgoff.
In the case of shared memory is not allocated from guest_memfd, (e.g. with the
current upstream code), the checking of gmem.pgoff here will disallow huge page
of shared memory even if "slot->base_gfn ^ ugfn" is aligned.

> I think this should take care of case (1.), for guest_memfds going to be
> used for both shared and private memory. Userspace can't update
> slot->userspace_addr, since guest_memfd memslots cannot be updated and
> can only be deleted.
> 
> If userspa

Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-05-06 Thread Ackerley Tng
Yan Zhao  writes:

>> > 
>> >
>> > What options does userspace have in this scenario?
>> > It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the 
>> > gmem.pgoff
>> > isn't ideal either.
>> >
>> > What about something similar as below?
>> >
>> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> > index d2feacd14786..87c33704a748 100644
>> > --- a/virt/kvm/guest_memfd.c
>> > +++ b/virt/kvm/guest_memfd.c
>> > @@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct 
>> > kvm_memory_slot *slot,
>> > }
>> >
>> > *pfn = folio_file_pfn(folio, index);
>> > -   if (max_order)
>> > -   *max_order = folio_order(folio);
>> > +   if (max_order) {
>> > +   int order;
>> > +
>> > +   order = folio_order(folio);
>> > +
>> > +   while (order > 0 && ((slot->base_gfn ^ slot->gmem.pgoff) & 
>> > ((1 << order) - 1)))
>> > +   order--;
>> > +
>> > +   *max_order = order;
>> > +   }
>> >
>> > *is_prepared = folio_test_uptodate(folio);
>> > return folio;
>> >
>> 
>> Vishal was wondering how this is working before guest_memfd was
>> introduced, for other backing memory like HugeTLB.
>> 
>> I then poked around and found this [1]. I will be adding a similar check
>> for any slot where kvm_slot_can_be_private(slot).
>>
>> Yan, that should work, right?
> No, I don't think the checking of ugfn [1] should work.
>
> 1. Even for slots bound to in-place-conversion guest_memfd (i.e. shared memory
> are allocated from guest_memfd), the slot->userspace_addr does not necessarily
> have the same offset as slot->gmem.pgoff. Even if we audit the offset in
> kvm_gmem_bind(), userspace could invoke munmap() and mmap() afterwards, 
> causing
> slot->userspace_addr to point to a different offset.
>
> 2. for slots bound to guest_memfd that do not support in-place-conversion,
> shared memory is allocated from a different backend. Therefore, checking
> "slot->base_gfn ^ slot->gmem.pgoff" is required for private memory. The check 
> is
> currently absent because guest_memfd supports 4K only.
>
>

Let me clarify, I meant these changes:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4b64ab3..d0dccf1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12938,6 +12938,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, 
unsigned long npages)
return 0;
 }
 
+static inline bool kvm_is_level_aligned(u64 value, int level)
+{
+   return IS_ALIGNED(value, KVM_PAGES_PER_HPAGE(level));
+}
+
 static int kvm_alloc_memslot_metadata(struct kvm *kvm,
  struct kvm_memory_slot *slot)
 {
@@ -12971,16 +12976,20 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 
slot->arch.lpage_info[i - 1] = linfo;
 
-   if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
+   if (!kvm_is_level_aligned(slot->base_gfn, level))
linfo[0].disallow_lpage = 1;
-   if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 
1))
+   if (!kvm_is_level_aligned(slot->base_gfn + npages, level))
linfo[lpages - 1].disallow_lpage = 1;
ugfn = slot->userspace_addr >> PAGE_SHIFT;
/*
-* If the gfn and userspace address are not aligned wrt each
-* other, disable large page support for this slot.
+* If the gfn and userspace address are not aligned or if gfn
+* and guest_memfd offset are not aligned wrt each other,
+* disable large page support for this slot.
 */
-   if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) 
{
+   if (!kvm_is_level_aligned(slot->base_gfn ^ ugfn, level) ||
+   (kvm_slot_can_be_private(slot) &&
+!kvm_is_level_aligned(slot->base_gfn ^ slot->gmem.pgoff,
+  level))) {
unsigned long j;
 
for (j = 0; j < lpages; ++j)

This does not rely on the ugfn check, but adds a similar check for gmem.pgoff.

I think this should take care of case (1.), for guest_memfds going to be
used for both shared and private memory. Userspace can't update
slot->userspace_addr, since guest_memfd memslots cannot be updated and
can only be deleted.

If userspace re-uses slot->userspace_addr for some other memory address
without deleting and re-adding a memslot,

+ KVM's access to memory should still be fine, since after the recent
  discussion at guest_memfd upstream call, KVM's guest faults will
  always go via fd+offset and KVM's access won't be disrupted
  there. Whatever checking done at memslot binding time will still be
  valid.
+ Host's access and other accesses (e.g. instruction emulation, which
  uses slot->userspace_addr) to guest memory will be broken, but I think
  t

Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-05-05 Thread Yan Zhao
On Wed, Apr 30, 2025 at 01:09:33PM -0700, Ackerley Tng wrote:
> Yan Zhao  writes:
> 
> > On Fri, Apr 25, 2025 at 03:45:20PM -0700, Ackerley Tng wrote:
> >> Yan Zhao  writes:
> >> 
> >> > On Thu, Apr 24, 2025 at 11:15:11AM -0700, Ackerley Tng wrote:
> >> >> Vishal Annapurve  writes:
> >> >> 
> >> >> > On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao  wrote:
> >> >> >>
> >> >> >> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
> >> >> >> >
> >> >> >> >
> >> >> >> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
> >> >> >> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
> >> >> >> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
> >> >> >> > >>> Yan Zhao  writes:
> >> >> >> > >>>
> >> >> >> >  On Tue, Sep 10, 2024 at 11:44:10PM +, Ackerley Tng wrote:
> >> >> >> > > +/*
> >> >> >> > > + * Allocates and then caches a folio in the filemap. 
> >> >> >> > > Returns a folio with
> >> >> >> > > + * refcount of 2: 1 after allocation, and 1 taken by the 
> >> >> >> > > filemap.
> >> >> >> > > + */
> >> >> >> > > +static struct folio 
> >> >> >> > > *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
> >> >> >> > > +   
> >> >> >> > > pgoff_t index)
> >> >> >> > > +{
> >> >> >> > > +   struct kvm_gmem_hugetlb *hgmem;
> >> >> >> > > +   pgoff_t aligned_index;
> >> >> >> > > +   struct folio *folio;
> >> >> >> > > +   int nr_pages;
> >> >> >> > > +   int ret;
> >> >> >> > > +
> >> >> >> > > +   hgmem = kvm_gmem_hgmem(inode);
> >> >> >> > > +   folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, 
> >> >> >> > > hgmem->spool);
> >> >> >> > > +   if (IS_ERR(folio))
> >> >> >> > > +   return folio;
> >> >> >> > > +
> >> >> >> > > +   nr_pages = 1UL << huge_page_order(hgmem->h);
> >> >> >> > > +   aligned_index = round_down(index, nr_pages);
> >> >> >> >  Maybe a gap here.
> >> >> >> > 
> >> >> >> >  When a guest_memfd is bound to a slot where slot->base_gfn is 
> >> >> >> >  not aligned to
> >> >> >> >  2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G 
> >> >> >> >  aligned, the
> >> >> >> >  corresponding GFN is not 2M/1G aligned.
> >> >> >> > >>>
> >> >> >> > >>> Thanks for looking into this.
> >> >> >> > >>>
> >> >> >> > >>> In 1G page support for guest_memfd, the offset and size are 
> >> >> >> > >>> always
> >> >> >> > >>> hugepage aligned to the hugepage size requested at guest_memfd 
> >> >> >> > >>> creation
> >> >> >> > >>> time, and it is true that when binding to a memslot, 
> >> >> >> > >>> slot->base_gfn and
> >> >> >> > >>> slot->npages may not be hugepage aligned.
> >> >> >> > >>>
> >> >> >> > 
> >> >> >> >  However, TDX requires that private huge pages be 2M aligned 
> >> >> >> >  in GFN.
> >> >> >> > 
> >> >> >> > >>>
> >> >> >> > >>> IIUC other factors also contribute to determining the mapping 
> >> >> >> > >>> level in
> >> >> >> > >>> the guest page tables, like lpage_info and 
> >> >> >> > >>> .private_max_mapping_level()
> >> >> >> > >>> in kvm_x86_ops.
> >> >> >> > >>>
> >> >> >> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, 
> >> >> >> > >>> lpage_info
> >> >> >> > >>> will track that and not allow faulting into guest page tables 
> >> >> >> > >>> at higher
> >> >> >> > >>> granularity.
> >> >> >> > >>
> >> >> >> > >> lpage_info only checks the alignments of slot->base_gfn and
> >> >> >> > >> slot->base_gfn + npages. e.g.,
> >> >> >> > >>
> >> >> >> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
> >> >> >> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 
> >> >> >> > >> 2M+8K);
> >> >> >> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 
> >> >> >> > >> 4M+8K);
> >> >> >> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 
> >> >> >> > >> 6M+8K);
> >> >> >> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 
> >> >> >> > >> 8M+8K);
> >> >> >> >
> >> >> >> > Should it be?
> >> >> >> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
> >> >> >> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
> >> >> >> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
> >> >> >> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
> >> >> >> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
> >> >> >> Right. Good catch. Thanks!
> >> >> >>
> >> >> >> Let me update the example as below:
> >> >> >> slot->base_gfn is 2 (for GPA 8KB), npages 2000 (for a 8MB range)
> >> >> >>
> >> >> >> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
> >> >> >> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
> >> >> >> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
> >> >> >> lpage_info[2M][3].di

Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-04-30 Thread Ackerley Tng
Yan Zhao  writes:

> On Fri, Apr 25, 2025 at 03:45:20PM -0700, Ackerley Tng wrote:
>> Yan Zhao  writes:
>> 
>> > On Thu, Apr 24, 2025 at 11:15:11AM -0700, Ackerley Tng wrote:
>> >> Vishal Annapurve  writes:
>> >> 
>> >> > On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao  wrote:
>> >> >>
>> >> >> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
>> >> >> >
>> >> >> >
>> >> >> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
>> >> >> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
>> >> >> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
>> >> >> > >>> Yan Zhao  writes:
>> >> >> > >>>
>> >> >> >  On Tue, Sep 10, 2024 at 11:44:10PM +, Ackerley Tng wrote:
>> >> >> > > +/*
>> >> >> > > + * Allocates and then caches a folio in the filemap. Returns 
>> >> >> > > a folio with
>> >> >> > > + * refcount of 2: 1 after allocation, and 1 taken by the 
>> >> >> > > filemap.
>> >> >> > > + */
>> >> >> > > +static struct folio 
>> >> >> > > *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
>> >> >> > > +   
>> >> >> > > pgoff_t index)
>> >> >> > > +{
>> >> >> > > +   struct kvm_gmem_hugetlb *hgmem;
>> >> >> > > +   pgoff_t aligned_index;
>> >> >> > > +   struct folio *folio;
>> >> >> > > +   int nr_pages;
>> >> >> > > +   int ret;
>> >> >> > > +
>> >> >> > > +   hgmem = kvm_gmem_hgmem(inode);
>> >> >> > > +   folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, 
>> >> >> > > hgmem->spool);
>> >> >> > > +   if (IS_ERR(folio))
>> >> >> > > +   return folio;
>> >> >> > > +
>> >> >> > > +   nr_pages = 1UL << huge_page_order(hgmem->h);
>> >> >> > > +   aligned_index = round_down(index, nr_pages);
>> >> >> >  Maybe a gap here.
>> >> >> > 
>> >> >> >  When a guest_memfd is bound to a slot where slot->base_gfn is 
>> >> >> >  not aligned to
>> >> >> >  2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G 
>> >> >> >  aligned, the
>> >> >> >  corresponding GFN is not 2M/1G aligned.
>> >> >> > >>>
>> >> >> > >>> Thanks for looking into this.
>> >> >> > >>>
>> >> >> > >>> In 1G page support for guest_memfd, the offset and size are 
>> >> >> > >>> always
>> >> >> > >>> hugepage aligned to the hugepage size requested at guest_memfd 
>> >> >> > >>> creation
>> >> >> > >>> time, and it is true that when binding to a memslot, 
>> >> >> > >>> slot->base_gfn and
>> >> >> > >>> slot->npages may not be hugepage aligned.
>> >> >> > >>>
>> >> >> > 
>> >> >> >  However, TDX requires that private huge pages be 2M aligned in 
>> >> >> >  GFN.
>> >> >> > 
>> >> >> > >>>
>> >> >> > >>> IIUC other factors also contribute to determining the mapping 
>> >> >> > >>> level in
>> >> >> > >>> the guest page tables, like lpage_info and 
>> >> >> > >>> .private_max_mapping_level()
>> >> >> > >>> in kvm_x86_ops.
>> >> >> > >>>
>> >> >> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, 
>> >> >> > >>> lpage_info
>> >> >> > >>> will track that and not allow faulting into guest page tables at 
>> >> >> > >>> higher
>> >> >> > >>> granularity.
>> >> >> > >>
>> >> >> > >> lpage_info only checks the alignments of slot->base_gfn and
>> >> >> > >> slot->base_gfn + npages. e.g.,
>> >> >> > >>
>> >> >> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
>> >> >> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 
>> >> >> > >> 2M+8K);
>> >> >> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 
>> >> >> > >> 4M+8K);
>> >> >> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 
>> >> >> > >> 6M+8K);
>> >> >> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 
>> >> >> > >> 8M+8K);
>> >> >> >
>> >> >> > Should it be?
>> >> >> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
>> >> >> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
>> >> >> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
>> >> >> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
>> >> >> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
>> >> >> Right. Good catch. Thanks!
>> >> >>
>> >> >> Let me update the example as below:
>> >> >> slot->base_gfn is 2 (for GPA 8KB), npages 2000 (for a 8MB range)
>> >> >>
>> >> >> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
>> >> >> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
>> >> >> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
>> >> >> lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
>> >> >> lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
>> >> >>
>> >> >> lpage_info indicates that a 2MB mapping is alllowed to cover GPA 4MB 
>> >> >> and GPA
>> >> >> 4MB+16KB. However, their aligned_in

Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-04-28 Thread Vishal Annapurve
On Sun, Apr 27, 2025 at 6:08 PM Yan Zhao  wrote:
>
> On Fri, Apr 25, 2025 at 03:45:20PM -0700, Ackerley Tng wrote:
> > Yan Zhao  writes:
> > ...
> > >
> > > For some memory region, e.g., "pc.ram", it's divided into 2 parts:
> > > - one with offset 0, size 0x8000(2G),
> > >   positioned at GPA 0, which is below GPA 4G;
> > > - one with offset 0x8000(2G), size 0x8000(2G),
> > >   positioned at GPA 0x1(4G), which is above GPA 4G.
> > >
> > > For the second part, its slot->base_gfn is 0x1, while 
> > > slot->gmem.pgoff
> > > is 0x8000.
> > >
> >
> > Nope I don't mean to enforce that they are equal, we just need the
> > offsets within the page to be equal.
> >
> > I edited Vishal's code snippet, perhaps it would help explain better:
> >
> > page_size is the size of the hugepage, so in our example,
> >
> >   page_size = SZ_2M;
> >   page_mask = ~(page_size - 1);
> page_mask = page_size - 1  ?
>
> >   offset_within_page = slot->gmem.pgoff & page_mask;
> >   gfn_within_page = (slot->base_gfn << PAGE_SHIFT) & page_mask;
> >
> > We will enforce that
> >
> >   offset_within_page == gfn_within_page;
> For "pc.ram", if it has 2.5G below 4G, it would be configured as follows
> - slot 1: slot->gmem.pgoff=0, base GPA 0, size=2.5G
> - slot 2: slot->gmem.pgoff=2.5G, base GPA 4G, size=1.5G
>
> When binding these two slots to the same guest_memfd created with flag
> KVM_GUEST_MEMFD_HUGE_1GB:
> - binding the 1st slot will succeed;
> - binding the 2nd slot will fail.
>
> What options does userspace have in this scenario?

Userspace can create new gmem files that have aligned offsets. But I
see your point, enforcing alignment at binding time will lead to
wastage of memory. i.e. Your example above could be reworked to have:
- slot 1: slot->gmem.pgoff=0, base GPA 0, size=2.5G, gmem_fd = x, gmem_size = 3G
- slot 2: slot->gmem.pgoff=0, base GPA 4G, size=1.5G, gmem_fd = y,
gmem_size = 2G

This will waste 1G of memory as gmem files will have to be hugepage aligned.

> It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the gmem.pgoff
> isn't ideal either.
>
> What about something similar as below?
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index d2feacd14786..87c33704a748 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct 
> kvm_memory_slot *slot,
> }
>
> *pfn = folio_file_pfn(folio, index);
> -   if (max_order)
> -   *max_order = folio_order(folio);
> +   if (max_order) {
> +   int order;
> +
> +   order = folio_order(folio);
> +
> +   while (order > 0 && ((slot->base_gfn ^ slot->gmem.pgoff) & 
> ((1 << order) - 1)))

This sounds better. Userspace will need to avoid this in general or
keep such ranges short so that most of the guest memory ranges can be
mapped at hugepage granularity. So maybe a pr_warn could be spewed
during binding that the alignment is not optimal.

> +   order--;
> +
> +   *max_order = order;
> +   }
>
> *is_prepared = folio_test_uptodate(folio);
> return folio;
>
>
> > >> Adding checks at binding time will allow hugepage-unaligned offsets (to
> > >> be at parity with non-guest_memfd backing memory) but still fix this
> > >> issue.
> > >>
> > >> lpage_info will make sure that ranges near the bounds will be
> > >> fragmented, but the hugepages in the middle will still be mappable as
> > >> hugepages.
> > >>
> > >> [1] 
> > >> https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg



Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-04-27 Thread Yan Zhao
On Fri, Apr 25, 2025 at 03:45:20PM -0700, Ackerley Tng wrote:
> Yan Zhao  writes:
> 
> > On Thu, Apr 24, 2025 at 11:15:11AM -0700, Ackerley Tng wrote:
> >> Vishal Annapurve  writes:
> >> 
> >> > On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao  wrote:
> >> >>
> >> >> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
> >> >> >
> >> >> >
> >> >> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
> >> >> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
> >> >> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
> >> >> > >>> Yan Zhao  writes:
> >> >> > >>>
> >> >> >  On Tue, Sep 10, 2024 at 11:44:10PM +, Ackerley Tng wrote:
> >> >> > > +/*
> >> >> > > + * Allocates and then caches a folio in the filemap. Returns a 
> >> >> > > folio with
> >> >> > > + * refcount of 2: 1 after allocation, and 1 taken by the 
> >> >> > > filemap.
> >> >> > > + */
> >> >> > > +static struct folio 
> >> >> > > *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
> >> >> > > +   
> >> >> > > pgoff_t index)
> >> >> > > +{
> >> >> > > +   struct kvm_gmem_hugetlb *hgmem;
> >> >> > > +   pgoff_t aligned_index;
> >> >> > > +   struct folio *folio;
> >> >> > > +   int nr_pages;
> >> >> > > +   int ret;
> >> >> > > +
> >> >> > > +   hgmem = kvm_gmem_hgmem(inode);
> >> >> > > +   folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, 
> >> >> > > hgmem->spool);
> >> >> > > +   if (IS_ERR(folio))
> >> >> > > +   return folio;
> >> >> > > +
> >> >> > > +   nr_pages = 1UL << huge_page_order(hgmem->h);
> >> >> > > +   aligned_index = round_down(index, nr_pages);
> >> >> >  Maybe a gap here.
> >> >> > 
> >> >> >  When a guest_memfd is bound to a slot where slot->base_gfn is 
> >> >> >  not aligned to
> >> >> >  2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G 
> >> >> >  aligned, the
> >> >> >  corresponding GFN is not 2M/1G aligned.
> >> >> > >>>
> >> >> > >>> Thanks for looking into this.
> >> >> > >>>
> >> >> > >>> In 1G page support for guest_memfd, the offset and size are always
> >> >> > >>> hugepage aligned to the hugepage size requested at guest_memfd 
> >> >> > >>> creation
> >> >> > >>> time, and it is true that when binding to a memslot, 
> >> >> > >>> slot->base_gfn and
> >> >> > >>> slot->npages may not be hugepage aligned.
> >> >> > >>>
> >> >> > 
> >> >> >  However, TDX requires that private huge pages be 2M aligned in 
> >> >> >  GFN.
> >> >> > 
> >> >> > >>>
> >> >> > >>> IIUC other factors also contribute to determining the mapping 
> >> >> > >>> level in
> >> >> > >>> the guest page tables, like lpage_info and 
> >> >> > >>> .private_max_mapping_level()
> >> >> > >>> in kvm_x86_ops.
> >> >> > >>>
> >> >> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, 
> >> >> > >>> lpage_info
> >> >> > >>> will track that and not allow faulting into guest page tables at 
> >> >> > >>> higher
> >> >> > >>> granularity.
> >> >> > >>
> >> >> > >> lpage_info only checks the alignments of slot->base_gfn and
> >> >> > >> slot->base_gfn + npages. e.g.,
> >> >> > >>
> >> >> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
> >> >> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
> >> >> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 
> >> >> > >> 4M+8K);
> >> >> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 
> >> >> > >> 6M+8K);
> >> >> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 
> >> >> > >> 8M+8K);
> >> >> >
> >> >> > Should it be?
> >> >> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
> >> >> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
> >> >> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
> >> >> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
> >> >> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
> >> >> Right. Good catch. Thanks!
> >> >>
> >> >> Let me update the example as below:
> >> >> slot->base_gfn is 2 (for GPA 8KB), npages 2000 (for a 8MB range)
> >> >>
> >> >> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
> >> >> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
> >> >> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
> >> >> lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
> >> >> lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
> >> >>
> >> >> lpage_info indicates that a 2MB mapping is alllowed to cover GPA 4MB 
> >> >> and GPA
> >> >> 4MB+16KB. However, their aligned_index values lead guest_memfd to 
> >> >> allocate two
> >> >> 2MB folios, whose physical addresses may not be contiguous.
> >> >>
> >> >> Additionally, if the g

Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-04-25 Thread Ackerley Tng
Yan Zhao  writes:

> On Thu, Apr 24, 2025 at 11:15:11AM -0700, Ackerley Tng wrote:
>> Vishal Annapurve  writes:
>> 
>> > On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao  wrote:
>> >>
>> >> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
>> >> >
>> >> >
>> >> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
>> >> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
>> >> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
>> >> > >>> Yan Zhao  writes:
>> >> > >>>
>> >> >  On Tue, Sep 10, 2024 at 11:44:10PM +, Ackerley Tng wrote:
>> >> > > +/*
>> >> > > + * Allocates and then caches a folio in the filemap. Returns a 
>> >> > > folio with
>> >> > > + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
>> >> > > + */
>> >> > > +static struct folio 
>> >> > > *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
>> >> > > +   
>> >> > > pgoff_t index)
>> >> > > +{
>> >> > > +   struct kvm_gmem_hugetlb *hgmem;
>> >> > > +   pgoff_t aligned_index;
>> >> > > +   struct folio *folio;
>> >> > > +   int nr_pages;
>> >> > > +   int ret;
>> >> > > +
>> >> > > +   hgmem = kvm_gmem_hgmem(inode);
>> >> > > +   folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, 
>> >> > > hgmem->spool);
>> >> > > +   if (IS_ERR(folio))
>> >> > > +   return folio;
>> >> > > +
>> >> > > +   nr_pages = 1UL << huge_page_order(hgmem->h);
>> >> > > +   aligned_index = round_down(index, nr_pages);
>> >> >  Maybe a gap here.
>> >> > 
>> >> >  When a guest_memfd is bound to a slot where slot->base_gfn is not 
>> >> >  aligned to
>> >> >  2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G 
>> >> >  aligned, the
>> >> >  corresponding GFN is not 2M/1G aligned.
>> >> > >>>
>> >> > >>> Thanks for looking into this.
>> >> > >>>
>> >> > >>> In 1G page support for guest_memfd, the offset and size are always
>> >> > >>> hugepage aligned to the hugepage size requested at guest_memfd 
>> >> > >>> creation
>> >> > >>> time, and it is true that when binding to a memslot, slot->base_gfn 
>> >> > >>> and
>> >> > >>> slot->npages may not be hugepage aligned.
>> >> > >>>
>> >> > 
>> >> >  However, TDX requires that private huge pages be 2M aligned in GFN.
>> >> > 
>> >> > >>>
>> >> > >>> IIUC other factors also contribute to determining the mapping level 
>> >> > >>> in
>> >> > >>> the guest page tables, like lpage_info and 
>> >> > >>> .private_max_mapping_level()
>> >> > >>> in kvm_x86_ops.
>> >> > >>>
>> >> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, 
>> >> > >>> lpage_info
>> >> > >>> will track that and not allow faulting into guest page tables at 
>> >> > >>> higher
>> >> > >>> granularity.
>> >> > >>
>> >> > >> lpage_info only checks the alignments of slot->base_gfn and
>> >> > >> slot->base_gfn + npages. e.g.,
>> >> > >>
>> >> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
>> >> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
>> >> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 
>> >> > >> 4M+8K);
>> >> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 
>> >> > >> 6M+8K);
>> >> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 
>> >> > >> 8M+8K);
>> >> >
>> >> > Should it be?
>> >> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
>> >> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
>> >> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
>> >> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
>> >> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
>> >> Right. Good catch. Thanks!
>> >>
>> >> Let me update the example as below:
>> >> slot->base_gfn is 2 (for GPA 8KB), npages 2000 (for a 8MB range)
>> >>
>> >> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
>> >> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
>> >> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
>> >> lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
>> >> lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
>> >>
>> >> lpage_info indicates that a 2MB mapping is alllowed to cover GPA 4MB and 
>> >> GPA
>> >> 4MB+16KB. However, their aligned_index values lead guest_memfd to 
>> >> allocate two
>> >> 2MB folios, whose physical addresses may not be contiguous.
>> >>
>> >> Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 
>> >> 4MB,
>> >> KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 
>> >> 6MB).
>> >> However, guest_memfd just allocates the same 2MB folio for both faults.
>> >>
>> >>
>> >> >
>> >> > >>
>> >> > >>   --

Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-04-24 Thread Yan Zhao
On Thu, Apr 24, 2025 at 11:15:11AM -0700, Ackerley Tng wrote:
> Vishal Annapurve  writes:
> 
> > On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao  wrote:
> >>
> >> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
> >> >
> >> >
> >> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
> >> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
> >> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
> >> > >>> Yan Zhao  writes:
> >> > >>>
> >> >  On Tue, Sep 10, 2024 at 11:44:10PM +, Ackerley Tng wrote:
> >> > > +/*
> >> > > + * Allocates and then caches a folio in the filemap. Returns a 
> >> > > folio with
> >> > > + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
> >> > > + */
> >> > > +static struct folio 
> >> > > *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
> >> > > +   
> >> > > pgoff_t index)
> >> > > +{
> >> > > +   struct kvm_gmem_hugetlb *hgmem;
> >> > > +   pgoff_t aligned_index;
> >> > > +   struct folio *folio;
> >> > > +   int nr_pages;
> >> > > +   int ret;
> >> > > +
> >> > > +   hgmem = kvm_gmem_hgmem(inode);
> >> > > +   folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, 
> >> > > hgmem->spool);
> >> > > +   if (IS_ERR(folio))
> >> > > +   return folio;
> >> > > +
> >> > > +   nr_pages = 1UL << huge_page_order(hgmem->h);
> >> > > +   aligned_index = round_down(index, nr_pages);
> >> >  Maybe a gap here.
> >> > 
> >> >  When a guest_memfd is bound to a slot where slot->base_gfn is not 
> >> >  aligned to
> >> >  2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, 
> >> >  the
> >> >  corresponding GFN is not 2M/1G aligned.
> >> > >>>
> >> > >>> Thanks for looking into this.
> >> > >>>
> >> > >>> In 1G page support for guest_memfd, the offset and size are always
> >> > >>> hugepage aligned to the hugepage size requested at guest_memfd 
> >> > >>> creation
> >> > >>> time, and it is true that when binding to a memslot, slot->base_gfn 
> >> > >>> and
> >> > >>> slot->npages may not be hugepage aligned.
> >> > >>>
> >> > 
> >> >  However, TDX requires that private huge pages be 2M aligned in GFN.
> >> > 
> >> > >>>
> >> > >>> IIUC other factors also contribute to determining the mapping level 
> >> > >>> in
> >> > >>> the guest page tables, like lpage_info and 
> >> > >>> .private_max_mapping_level()
> >> > >>> in kvm_x86_ops.
> >> > >>>
> >> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, 
> >> > >>> lpage_info
> >> > >>> will track that and not allow faulting into guest page tables at 
> >> > >>> higher
> >> > >>> granularity.
> >> > >>
> >> > >> lpage_info only checks the alignments of slot->base_gfn and
> >> > >> slot->base_gfn + npages. e.g.,
> >> > >>
> >> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
> >> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
> >> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
> >> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
> >> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
> >> >
> >> > Should it be?
> >> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
> >> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
> >> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
> >> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
> >> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
> >> Right. Good catch. Thanks!
> >>
> >> Let me update the example as below:
> >> slot->base_gfn is 2 (for GPA 8KB), npages 2000 (for a 8MB range)
> >>
> >> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
> >> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
> >> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
> >> lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
> >> lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
> >>
> >> lpage_info indicates that a 2MB mapping is alllowed to cover GPA 4MB and 
> >> GPA
> >> 4MB+16KB. However, their aligned_index values lead guest_memfd to allocate 
> >> two
> >> 2MB folios, whose physical addresses may not be contiguous.
> >>
> >> Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 
> >> 4MB,
> >> KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 
> >> 6MB).
> >> However, guest_memfd just allocates the same 2MB folio for both faults.
> >>
> >>
> >> >
> >> > >>
> >> > >>   -
> >> > >>   |  |  |  |  |  |  |  |  |
> >> > >>   8K2M 2M+8K  4M  4M+8K 6M  6M+8K 8M  8M+8K
> >> 

Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-04-24 Thread Ackerley Tng
Vishal Annapurve  writes:

> On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao  wrote:
>>
>> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
>> >
>> >
>> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
>> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
>> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
>> > >>> Yan Zhao  writes:
>> > >>>
>> >  On Tue, Sep 10, 2024 at 11:44:10PM +, Ackerley Tng wrote:
>> > > +/*
>> > > + * Allocates and then caches a folio in the filemap. Returns a 
>> > > folio with
>> > > + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
>> > > + */
>> > > +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct 
>> > > inode *inode,
>> > > +   pgoff_t 
>> > > index)
>> > > +{
>> > > +   struct kvm_gmem_hugetlb *hgmem;
>> > > +   pgoff_t aligned_index;
>> > > +   struct folio *folio;
>> > > +   int nr_pages;
>> > > +   int ret;
>> > > +
>> > > +   hgmem = kvm_gmem_hgmem(inode);
>> > > +   folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
>> > > +   if (IS_ERR(folio))
>> > > +   return folio;
>> > > +
>> > > +   nr_pages = 1UL << huge_page_order(hgmem->h);
>> > > +   aligned_index = round_down(index, nr_pages);
>> >  Maybe a gap here.
>> > 
>> >  When a guest_memfd is bound to a slot where slot->base_gfn is not 
>> >  aligned to
>> >  2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, 
>> >  the
>> >  corresponding GFN is not 2M/1G aligned.
>> > >>>
>> > >>> Thanks for looking into this.
>> > >>>
>> > >>> In 1G page support for guest_memfd, the offset and size are always
>> > >>> hugepage aligned to the hugepage size requested at guest_memfd creation
>> > >>> time, and it is true that when binding to a memslot, slot->base_gfn and
>> > >>> slot->npages may not be hugepage aligned.
>> > >>>
>> > 
>> >  However, TDX requires that private huge pages be 2M aligned in GFN.
>> > 
>> > >>>
>> > >>> IIUC other factors also contribute to determining the mapping level in
>> > >>> the guest page tables, like lpage_info and .private_max_mapping_level()
>> > >>> in kvm_x86_ops.
>> > >>>
>> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
>> > >>> will track that and not allow faulting into guest page tables at higher
>> > >>> granularity.
>> > >>
>> > >> lpage_info only checks the alignments of slot->base_gfn and
>> > >> slot->base_gfn + npages. e.g.,
>> > >>
>> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
>> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
>> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
>> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
>> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
>> >
>> > Should it be?
>> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
>> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
>> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
>> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
>> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
>> Right. Good catch. Thanks!
>>
>> Let me update the example as below:
>> slot->base_gfn is 2 (for GPA 8KB), npages 2000 (for a 8MB range)
>>
>> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
>> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
>> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
>> lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
>> lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
>>
>> lpage_info indicates that a 2MB mapping is alllowed to cover GPA 4MB and GPA
>> 4MB+16KB. However, their aligned_index values lead guest_memfd to allocate 
>> two
>> 2MB folios, whose physical addresses may not be contiguous.
>>
>> Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 4MB,
>> KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 6MB).
>> However, guest_memfd just allocates the same 2MB folio for both faults.
>>
>>
>> >
>> > >>
>> > >>   -
>> > >>   |  |  |  |  |  |  |  |  |
>> > >>   8K2M 2M+8K  4M  4M+8K 6M  6M+8K 8M  8M+8K
>> > >>
>> > >> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
>> > >> page is allowed. Also, they have the same aligned_index 2 in 
>> > >> guest_memfd.
>> > >> So, guest_memfd allocates the same huge folio of 2M order for them.
>> > > Sorry, sent too fast this morning. The example is not right. The correct
>> > > one is:
>> > >
>> > > For GFN 4M and

Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-04-24 Thread Vishal Annapurve
On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao  wrote:
>
> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
> >
> >
> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
> > >>> Yan Zhao  writes:
> > >>>
> >  On Tue, Sep 10, 2024 at 11:44:10PM +, Ackerley Tng wrote:
> > > +/*
> > > + * Allocates and then caches a folio in the filemap. Returns a folio 
> > > with
> > > + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
> > > + */
> > > +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct 
> > > inode *inode,
> > > +   pgoff_t 
> > > index)
> > > +{
> > > +   struct kvm_gmem_hugetlb *hgmem;
> > > +   pgoff_t aligned_index;
> > > +   struct folio *folio;
> > > +   int nr_pages;
> > > +   int ret;
> > > +
> > > +   hgmem = kvm_gmem_hgmem(inode);
> > > +   folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
> > > +   if (IS_ERR(folio))
> > > +   return folio;
> > > +
> > > +   nr_pages = 1UL << huge_page_order(hgmem->h);
> > > +   aligned_index = round_down(index, nr_pages);
> >  Maybe a gap here.
> > 
> >  When a guest_memfd is bound to a slot where slot->base_gfn is not 
> >  aligned to
> >  2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
> >  corresponding GFN is not 2M/1G aligned.
> > >>>
> > >>> Thanks for looking into this.
> > >>>
> > >>> In 1G page support for guest_memfd, the offset and size are always
> > >>> hugepage aligned to the hugepage size requested at guest_memfd creation
> > >>> time, and it is true that when binding to a memslot, slot->base_gfn and
> > >>> slot->npages may not be hugepage aligned.
> > >>>
> > 
> >  However, TDX requires that private huge pages be 2M aligned in GFN.
> > 
> > >>>
> > >>> IIUC other factors also contribute to determining the mapping level in
> > >>> the guest page tables, like lpage_info and .private_max_mapping_level()
> > >>> in kvm_x86_ops.
> > >>>
> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
> > >>> will track that and not allow faulting into guest page tables at higher
> > >>> granularity.
> > >>
> > >> lpage_info only checks the alignments of slot->base_gfn and
> > >> slot->base_gfn + npages. e.g.,
> > >>
> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
> >
> > Should it be?
> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
> Right. Good catch. Thanks!
>
> Let me update the example as below:
> slot->base_gfn is 2 (for GPA 8KB), npages 2000 (for a 8MB range)
>
> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
> lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
> lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
>
> lpage_info indicates that a 2MB mapping is alllowed to cover GPA 4MB and GPA
> 4MB+16KB. However, their aligned_index values lead guest_memfd to allocate two
> 2MB folios, whose physical addresses may not be contiguous.
>
> Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 4MB,
> KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 6MB).
> However, guest_memfd just allocates the same 2MB folio for both faults.
>
>
> >
> > >>
> > >>   -
> > >>   |  |  |  |  |  |  |  |  |
> > >>   8K2M 2M+8K  4M  4M+8K 6M  6M+8K 8M  8M+8K
> > >>
> > >> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
> > >> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
> > >> So, guest_memfd allocates the same huge folio of 2M order for them.
> > > Sorry, sent too fast this morning. The example is not right. The correct
> > > one is:
> > >
> > > For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
> > > KVM will create a 2M mapping for them.
> > >
> > > However, in guest_memfd, GFN 4M and GFN 4M+16

Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-04-24 Thread Yan Zhao
On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
> 
> 
> On 4/24/2025 12:25 PM, Yan Zhao wrote:
> > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
> >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
> >>> Yan Zhao  writes:
> >>>
>  On Tue, Sep 10, 2024 at 11:44:10PM +, Ackerley Tng wrote:
> > +/*
> > + * Allocates and then caches a folio in the filemap. Returns a folio 
> > with
> > + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
> > + */
> > +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct 
> > inode *inode,
> > +   pgoff_t 
> > index)
> > +{
> > +   struct kvm_gmem_hugetlb *hgmem;
> > +   pgoff_t aligned_index;
> > +   struct folio *folio;
> > +   int nr_pages;
> > +   int ret;
> > +
> > +   hgmem = kvm_gmem_hgmem(inode);
> > +   folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
> > +   if (IS_ERR(folio))
> > +   return folio;
> > +
> > +   nr_pages = 1UL << huge_page_order(hgmem->h);
> > +   aligned_index = round_down(index, nr_pages);
>  Maybe a gap here.
> 
>  When a guest_memfd is bound to a slot where slot->base_gfn is not 
>  aligned to
>  2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
>  corresponding GFN is not 2M/1G aligned.
> >>>
> >>> Thanks for looking into this.
> >>>
> >>> In 1G page support for guest_memfd, the offset and size are always
> >>> hugepage aligned to the hugepage size requested at guest_memfd creation
> >>> time, and it is true that when binding to a memslot, slot->base_gfn and
> >>> slot->npages may not be hugepage aligned.
> >>>
> 
>  However, TDX requires that private huge pages be 2M aligned in GFN.
> 
> >>>
> >>> IIUC other factors also contribute to determining the mapping level in
> >>> the guest page tables, like lpage_info and .private_max_mapping_level()
> >>> in kvm_x86_ops.
> >>>
> >>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
> >>> will track that and not allow faulting into guest page tables at higher
> >>> granularity.
> >>  
> >> lpage_info only checks the alignments of slot->base_gfn and
> >> slot->base_gfn + npages. e.g.,
> >>
> >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
> >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
> >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
> >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
> >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
> 
> Should it be?
> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
> lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
> lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
Right. Good catch. Thanks!

Let me update the example as below:
slot->base_gfn is 2 (for GPA 8KB), npages 2000 (for a 8MB range)

lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);

lpage_info indicates that a 2MB mapping is alllowed to cover GPA 4MB and GPA
4MB+16KB. However, their aligned_index values lead guest_memfd to allocate two
2MB folios, whose physical addresses may not be contiguous.

Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 4MB,
KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 6MB).
However, guest_memfd just allocates the same 2MB folio for both faults.


> 
> >>
> >>   -
> >>   |  |  |  |  |  |  |  |  |
> >>   8K2M 2M+8K  4M  4M+8K 6M  6M+8K 8M  8M+8K
> >>
> >> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
> >> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
> >> So, guest_memfd allocates the same huge folio of 2M order for them.
> > Sorry, sent too fast this morning. The example is not right. The correct
> > one is:
> > 
> > For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
> > KVM will create a 2M mapping for them.
> > 
> > However, in guest_memfd, GFN 4M and GFN 4M+16K do not correspond to the
> > same 2M folio and physical addresses may not be contiguous.
> > 
> > 
> >> However, for TDX, GFN 6M and GFN 6M+4K should not belong to the same folio.
> >> It's also weird for a 2M mapping in KVM to stride across 2 huge foli

Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-04-23 Thread Chenyi Qiang



On 4/24/2025 12:25 PM, Yan Zhao wrote:
> On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
>> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
>>> Yan Zhao  writes:
>>>
 On Tue, Sep 10, 2024 at 11:44:10PM +, Ackerley Tng wrote:
> +/*
> + * Allocates and then caches a folio in the filemap. Returns a folio with
> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
> + */
> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode 
> *inode,
> + pgoff_t index)
> +{
> + struct kvm_gmem_hugetlb *hgmem;
> + pgoff_t aligned_index;
> + struct folio *folio;
> + int nr_pages;
> + int ret;
> +
> + hgmem = kvm_gmem_hgmem(inode);
> + folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
> + if (IS_ERR(folio))
> + return folio;
> +
> + nr_pages = 1UL << huge_page_order(hgmem->h);
> + aligned_index = round_down(index, nr_pages);
 Maybe a gap here.

 When a guest_memfd is bound to a slot where slot->base_gfn is not aligned 
 to
 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
 corresponding GFN is not 2M/1G aligned.
>>>
>>> Thanks for looking into this.
>>>
>>> In 1G page support for guest_memfd, the offset and size are always
>>> hugepage aligned to the hugepage size requested at guest_memfd creation
>>> time, and it is true that when binding to a memslot, slot->base_gfn and
>>> slot->npages may not be hugepage aligned.
>>>

 However, TDX requires that private huge pages be 2M aligned in GFN.

>>>
>>> IIUC other factors also contribute to determining the mapping level in
>>> the guest page tables, like lpage_info and .private_max_mapping_level()
>>> in kvm_x86_ops.
>>>
>>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
>>> will track that and not allow faulting into guest page tables at higher
>>> granularity.
>>  
>> lpage_info only checks the alignments of slot->base_gfn and
>> slot->base_gfn + npages. e.g.,
>>
>> if slot->base_gfn is 8K, npages is 8M, then for this slot,
>> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
>> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
>> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
>> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);

Should it be?
lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);

>>
>>   -
>>   |  |  |  |  |  |  |  |  |
>>   8K2M 2M+8K  4M  4M+8K 6M  6M+8K 8M  8M+8K
>>
>> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
>> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
>> So, guest_memfd allocates the same huge folio of 2M order for them.
> Sorry, sent too fast this morning. The example is not right. The correct
> one is:
> 
> For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
> KVM will create a 2M mapping for them.
> 
> However, in guest_memfd, GFN 4M and GFN 4M+16K do not correspond to the
> same 2M folio and physical addresses may not be contiguous.
> 
> 
>> However, for TDX, GFN 6M and GFN 6M+4K should not belong to the same folio.
>> It's also weird for a 2M mapping in KVM to stride across 2 huge folios.
>>
>>> Hence I think it is okay to leave it to KVM to fault pages into the
>>> guest correctly. For guest_memfd will just maintain the invariant that
>>> offset and size are hugepage aligned, but not require that
>>> slot->base_gfn and slot->npages are hugepage aligned. This behavior will
>>> be consistent with other backing memory for guests like regular shmem or
>>> HugeTLB.
>>>
> + ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, folio,
> +  aligned_index,
> +  htlb_alloc_mask(hgmem->h));
> + WARN_ON(ret);
> +
>   spin_lock(&inode->i_lock);
>   inode->i_blocks += blocks_per_huge_page(hgmem->h);
>   spin_unlock(&inode->i_lock);
>  
> - return page_folio(requested_page);
> + return folio;
> +}
> 




Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-04-23 Thread Yan Zhao
On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
> > Yan Zhao  writes:
> > 
> > > On Tue, Sep 10, 2024 at 11:44:10PM +, Ackerley Tng wrote:
> > >> +/*
> > >> + * Allocates and then caches a folio in the filemap. Returns a folio 
> > >> with
> > >> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
> > >> + */
> > >> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct 
> > >> inode *inode,
> > >> +pgoff_t 
> > >> index)
> > >> +{
> > >> +struct kvm_gmem_hugetlb *hgmem;
> > >> +pgoff_t aligned_index;
> > >> +struct folio *folio;
> > >> +int nr_pages;
> > >> +int ret;
> > >> +
> > >> +hgmem = kvm_gmem_hgmem(inode);
> > >> +folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
> > >> +if (IS_ERR(folio))
> > >> +return folio;
> > >> +
> > >> +nr_pages = 1UL << huge_page_order(hgmem->h);
> > >> +aligned_index = round_down(index, nr_pages);
> > > Maybe a gap here.
> > >
> > > When a guest_memfd is bound to a slot where slot->base_gfn is not aligned 
> > > to
> > > 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
> > > corresponding GFN is not 2M/1G aligned.
> > 
> > Thanks for looking into this.
> > 
> > In 1G page support for guest_memfd, the offset and size are always
> > hugepage aligned to the hugepage size requested at guest_memfd creation
> > time, and it is true that when binding to a memslot, slot->base_gfn and
> > slot->npages may not be hugepage aligned.
> > 
> > >
> > > However, TDX requires that private huge pages be 2M aligned in GFN.
> > >
> > 
> > IIUC other factors also contribute to determining the mapping level in
> > the guest page tables, like lpage_info and .private_max_mapping_level()
> > in kvm_x86_ops.
> >
> > If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
> > will track that and not allow faulting into guest page tables at higher
> > granularity.
>  
> lpage_info only checks the alignments of slot->base_gfn and
> slot->base_gfn + npages. e.g.,
> 
> if slot->base_gfn is 8K, npages is 8M, then for this slot,
> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
> 
>   -
>   |  |  |  |  |  |  |  |  |
>   8K2M 2M+8K  4M  4M+8K 6M  6M+8K 8M  8M+8K
> 
> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
> So, guest_memfd allocates the same huge folio of 2M order for them.
Sorry, sent too fast this morning. The example is not right. The correct
one is:

For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
KVM will create a 2M mapping for them.

However, in guest_memfd, GFN 4M and GFN 4M+16K do not correspond to the
same 2M folio and physical addresses may not be contiguous.


> However, for TDX, GFN 6M and GFN 6M+4K should not belong to the same folio.
> It's also weird for a 2M mapping in KVM to stride across 2 huge folios.
> 
> > Hence I think it is okay to leave it to KVM to fault pages into the
> > guest correctly. For guest_memfd will just maintain the invariant that
> > offset and size are hugepage aligned, but not require that
> > slot->base_gfn and slot->npages are hugepage aligned. This behavior will
> > be consistent with other backing memory for guests like regular shmem or
> > HugeTLB.
> > 
> > >> +ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, 
> > >> folio,
> > >> + aligned_index,
> > >> + 
> > >> htlb_alloc_mask(hgmem->h));
> > >> +WARN_ON(ret);
> > >> +
> > >>  spin_lock(&inode->i_lock);
> > >>  inode->i_blocks += blocks_per_huge_page(hgmem->h);
> > >>  spin_unlock(&inode->i_lock);
> > >>  
> > >> -return page_folio(requested_page);
> > >> +return folio;
> > >> +}



Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-04-23 Thread Yan Zhao
On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
> Yan Zhao  writes:
> 
> > On Tue, Sep 10, 2024 at 11:44:10PM +, Ackerley Tng wrote:
> >> +/*
> >> + * Allocates and then caches a folio in the filemap. Returns a folio with
> >> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
> >> + */
> >> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode 
> >> *inode,
> >> +  pgoff_t index)
> >> +{
> >> +  struct kvm_gmem_hugetlb *hgmem;
> >> +  pgoff_t aligned_index;
> >> +  struct folio *folio;
> >> +  int nr_pages;
> >> +  int ret;
> >> +
> >> +  hgmem = kvm_gmem_hgmem(inode);
> >> +  folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
> >> +  if (IS_ERR(folio))
> >> +  return folio;
> >> +
> >> +  nr_pages = 1UL << huge_page_order(hgmem->h);
> >> +  aligned_index = round_down(index, nr_pages);
> > Maybe a gap here.
> >
> > When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
> > 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
> > corresponding GFN is not 2M/1G aligned.
> 
> Thanks for looking into this.
> 
> In 1G page support for guest_memfd, the offset and size are always
> hugepage aligned to the hugepage size requested at guest_memfd creation
> time, and it is true that when binding to a memslot, slot->base_gfn and
> slot->npages may not be hugepage aligned.
> 
> >
> > However, TDX requires that private huge pages be 2M aligned in GFN.
> >
> 
> IIUC other factors also contribute to determining the mapping level in
> the guest page tables, like lpage_info and .private_max_mapping_level()
> in kvm_x86_ops.
>
> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
> will track that and not allow faulting into guest page tables at higher
> granularity.
 
lpage_info only checks the alignments of slot->base_gfn and
slot->base_gfn + npages. e.g.,

if slot->base_gfn is 8K, npages is 8M, then for this slot,
lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);

  -
  |  |  |  |  |  |  |  |  |
  8K2M 2M+8K  4M  4M+8K 6M  6M+8K 8M  8M+8K

For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
So, guest_memfd allocates the same huge folio of 2M order for them.

However, for TDX, GFN 6M and GFN 6M+4K should not belong to the same folio.
It's also weird for a 2M mapping in KVM to stride across 2 huge folios.

> Hence I think it is okay to leave it to KVM to fault pages into the
> guest correctly. For guest_memfd will just maintain the invariant that
> offset and size are hugepage aligned, but not require that
> slot->base_gfn and slot->npages are hugepage aligned. This behavior will
> be consistent with other backing memory for guests like regular shmem or
> HugeTLB.
> 
> >> +  ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, folio,
> >> +   aligned_index,
> >> +   htlb_alloc_mask(hgmem->h));
> >> +  WARN_ON(ret);
> >> +
> >>spin_lock(&inode->i_lock);
> >>inode->i_blocks += blocks_per_huge_page(hgmem->h);
> >>spin_unlock(&inode->i_lock);
> >>  
> >> -  return page_folio(requested_page);
> >> +  return folio;
> >> +}



Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-04-23 Thread Ackerley Tng
Yan Zhao  writes:

> On Tue, Sep 10, 2024 at 11:44:10PM +, Ackerley Tng wrote:
>> +/*
>> + * Allocates and then caches a folio in the filemap. Returns a folio with
>> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
>> + */
>> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode 
>> *inode,
>> +pgoff_t index)
>> +{
>> +struct kvm_gmem_hugetlb *hgmem;
>> +pgoff_t aligned_index;
>> +struct folio *folio;
>> +int nr_pages;
>> +int ret;
>> +
>> +hgmem = kvm_gmem_hgmem(inode);
>> +folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
>> +if (IS_ERR(folio))
>> +return folio;
>> +
>> +nr_pages = 1UL << huge_page_order(hgmem->h);
>> +aligned_index = round_down(index, nr_pages);
> Maybe a gap here.
>
> When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
> 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
> corresponding GFN is not 2M/1G aligned.

Thanks for looking into this.

In 1G page support for guest_memfd, the offset and size are always
hugepage aligned to the hugepage size requested at guest_memfd creation
time, and it is true that when binding to a memslot, slot->base_gfn and
slot->npages may not be hugepage aligned.

>
> However, TDX requires that private huge pages be 2M aligned in GFN.
>

IIUC other factors also contribute to determining the mapping level in
the guest page tables, like lpage_info and .private_max_mapping_level()
in kvm_x86_ops.

If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
will track that and not allow faulting into guest page tables at higher
granularity.

Hence I think it is okay to leave it to KVM to fault pages into the
guest correctly. For guest_memfd will just maintain the invariant that
offset and size are hugepage aligned, but not require that
slot->base_gfn and slot->npages are hugepage aligned. This behavior will
be consistent with other backing memory for guests like regular shmem or
HugeTLB.

>> +ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, folio,
>> + aligned_index,
>> + htlb_alloc_mask(hgmem->h));
>> +WARN_ON(ret);
>> +
>>  spin_lock(&inode->i_lock);
>>  inode->i_blocks += blocks_per_huge_page(hgmem->h);
>>  spin_unlock(&inode->i_lock);
>>  
>> -return page_folio(requested_page);
>> +return folio;
>> +}



Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

2025-04-03 Thread Yan Zhao
On Tue, Sep 10, 2024 at 11:44:10PM +, Ackerley Tng wrote:
> +/*
> + * Allocates and then caches a folio in the filemap. Returns a folio with
> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
> + */
> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode 
> *inode,
> + pgoff_t index)
> +{
> + struct kvm_gmem_hugetlb *hgmem;
> + pgoff_t aligned_index;
> + struct folio *folio;
> + int nr_pages;
> + int ret;
> +
> + hgmem = kvm_gmem_hgmem(inode);
> + folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
> + if (IS_ERR(folio))
> + return folio;
> +
> + nr_pages = 1UL << huge_page_order(hgmem->h);
> + aligned_index = round_down(index, nr_pages);
Maybe a gap here.

When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
corresponding GFN is not 2M/1G aligned.

However, TDX requires that private huge pages be 2M aligned in GFN.

> + ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, folio,
> +  aligned_index,
> +  htlb_alloc_mask(hgmem->h));
> + WARN_ON(ret);
> +
>   spin_lock(&inode->i_lock);
>   inode->i_blocks += blocks_per_huge_page(hgmem->h);
>   spin_unlock(&inode->i_lock);
>  
> - return page_folio(requested_page);
> + return folio;
> +}