Yan Zhao <[email protected]> writes:

> On Thu, Jun 25, 2026 at 05:07:23PM -0700, Ackerley Tng wrote:
>> Yan Zhao <[email protected]> writes:
>>
>> > On Wed, Jun 24, 2026 at 04:00:32PM -0700, Ackerley Tng wrote:
>> >> Sean Christopherson <[email protected]> writes:
>> >>
>> >> > On Tue, Jun 23, 2026, Yan Zhao wrote:
>> >> >> On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
>> >> >> > On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
>> >> >> > > On Mon, Jun 22, 2026, Yan Zhao wrote:
>> >> >> > > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 
>> >> >> > > > Relay wrote:
>> >> >> > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> >> >> > > > > index ffe9d0db58c59..56d10333c61a7 100644
>> >> >> > > > > --- a/arch/x86/kvm/vmx/tdx.c
>> >> >> > > > > +++ b/arch/x86/kvm/vmx/tdx.c
>> >> >> > > > > @@ -3198,8 +3198,12 @@ static int 
>> >> >> > > > > tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t 
>> >> >> > > > > pfn,
>> >> >> > > > >        if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
>> >> >> > > > >                return -EIO;
>> >> >> > > > >
>> >> >> > > > > -      if (!src_page)
>> >> >> > > > > -              return -EOPNOTSUPP;
>> >> >> > > > > +      if (!src_page) {
>> >> >> > > > > +              if (!gmem_in_place_conversion)
>> >> >> > > > When userspace turns on gmem_in_place_conversion while creating 
>> >> >> > > > guest_memfd
>> >> >> > > > without the MMAP flag, the absence of src_page should still be 
>> >> >> > > > treated as an
>> >> >> > > > error.
>> >> >> > >
>> >> >> > > Why MMAP?
>> >> >> > Hmm, I was showing a scenario that in-place conversion couldn't 
>> >> >> > occur.
>> >> >> > I didn't mean that with the MMAP flag, mmap() and user write must 
>> >> >> > occur.
>> >> >> >
>> >> >> > > Shouldn't this be a general "if (!src_page && !up-to-date)"?  Just
>> >> >> > > because userspace _can_ mmap() the memory doesn't mean userspace 
>> >> >> > > _has_ mmap()'d
>> >> >> > > and written memory.  And when write() lands, MMAP wouldn't be 
>> >> >> > > necessary to
>> >> >> > > initialize the memory.
>> >> >> > Do you mean using up-to-date flag as below?
>> >> >
>> >> > Yes?  I didn't actually look at the implementation details.
>> >> >
>> >> >> > if (!src_page) {
>> >> >> >      src_page = pfn_to_page(pfn);
>> >> >> >      if (!folio_test_uptodate(page_folio(src_page)))
>> >> >> >              return -EOPNOTSUPP;
>> >> >> > }
>> >>
>> >> Yan is right that with the earlier patch "Zero page while getting pfn",
>> >> folio_test_uptodate() here will always return true.
>> >>
>> >> Actually, this is an alternative fix for the issue Sashiko pointed out
>> >> on v7 where userspace can do a populate() (either TDX or SNP) without
>> >> first allocating the page, with src_address == NULL, and leak
>> >> uninitialized memory into the guest.
>> >>
>> >> Advantage of using the uptodate check in populate: if the host never
>> >> allocates the page, populate doesn't incur zeroing before writing the
>> >> page anyway in populate().
>> >>
>> >> Disadvantage: Both TDX and SNP will have to implement this uptodate
>> >> check. guest_memfd can't check centrally because for SNP, for a
>> >> PAGE_TYPE_ZERO, !src_page should be allowed with a !uptodate page since
>> >> firmware will zero and there's no leakage of uninitialized host memory?
>> > Another disadvantage: the uptodate flag is per-folio. What if the folio
>> > is only partially initialized by the userspace especially after huge page 
>> > is
>> > supported?
>> >
>>
>> Good point on huge pages!
>>
>> The uptodate flag on the folio in guest_memfd means "this folio has been
>> written to". As of now (before patch at [1]), this happens when
>>
>> + folio is zeroed on first use by userspace
>> + folio is zeroed on first use of the guest
>> + folio is populated
>>
>> When huge pages are supported, the folio can't partially be initialized?
>>
>> On allocation, if any part is shared, we split the page. The parts are
>> separate folios that have their own uptodate flags.
>>
>> On splitting, if the huge page is uptodate, the split pages will also be
>> uptodate. If the huge page is not uptodate, the split pages won't be
>> uptodate, but that's ok since they will be marked uptodate on first use.
>>
>> On merging, the non-uptodate parts have to be zeroed and then marked
> If that's true, it would be good.
>
>> uptodate. Any parts that are in use would have been marked uptodate
>> already, so there's no overwriting data that is in use. I'll need to
>> think more about when it's safe to zero.
>>
>> I'm still on the fence between the two options
>>
>> 1. Using uptodate check in populate to reject src_pages that have never
>>    been written to or
>> 2. Always zero before populate
> 2 does not work?
> The flow is
> 1. mmap gmem_fd, make GFN shared, and write initial content.
> 2. convert GFN to private
> 3. invoke ioctl to trigger populate.
>

This flow is correct, is what users of in-place conversion should do.

"Always" is the wrong word, I should have said "zero if not uptodate
before populate", as in, with patch at [1].

By doing the zeroing in __kvm_gmem_get_pfn instead, by the time populate
gets the pfn, the page would be zeroed, either because userspace faulted
it in, and the zeroing happened in kvm_gmem_fault_user_mapping(), or if
userspace never faulted it in, the zeroing would happen because
populate() allocated the page.

>> but whether the uptodate flag is per-folio or not doesn't affect these
>> two options in terms of fixing the leak of uninitialized host memory,
>> right?
> yes, provided "On merging, the non-uptodate parts have to be zeroed and then
> marked uptodate".
>

Thank you so much for bringing this up, I hadn't considered this
before. I'll do that when I get to guest_memfd hugepage restructuring.

>> >
>> >> >> Another concern with this fix is that:
>> >> >> commit "KVM: guest_memfd: Zero page while getting pfn" [1] always 
>> >> >> marks the
>> >> >> folio uptodate before reaching post_populate().
>> >> >>
>> >> >> [1] 
>> >> >> https://lore.kernel.org/all/[email protected]/
>> >> >>
>> >> >> > One concern is that TDX now does not much care about the up-to-date 
>> >> >> > flag since
>> >> >> > TDX doesn't rely on the flag to clear pages on conversions.
>> >> >> > I'm not sure if the flag can be reliably checked in this case. e.g.,
>> >> >> > now the whole folio is marked up-to-date even if only part of it is 
>> >> >> > faulted by
>> >> >> > user access.
>> >> >> > Ensuring that the up-to-date flag works correctly with huge page 
>> >> >> > support seems
>> >> >> > to have more effort than introducing a dedicated flag for TDX.
>> >> >> >
>> >> >> > > > Additionally, to properly enable in-place copying for the TDX 
>> >> >> > > > initial memory
>> >> >> > > > region, userspace must not only specify source_addr to NULL, but 
>> >> >> > > > also follow
>> >> >> > > > a specific sequence (where steps 1/2/3/7 are required only for 
>> >> >> > > > in-place copy):
>> >> >> > > > 1. create guest_memfd with MMAP flag
>> >> >> > > > 2. mmap the guest_memfd.
>> >> >> > > > 3. convert the initial memory range to shared.
>> >> >> > > > 4. copy initial content to the source page.
>> >> >> > > > 5. convert the initial memory range to private
>> >> >> > > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
>> >> >> > > > 7. do not unmap the source backend.
>> >> >> > > >
>> >> >> > > > So, would it be reasonable to introduce a dedicated flag that 
>> >> >> > > > allows userspace
>> >> >> > > > to explicitly opt into the in-place copy functionality? e.g.,
>> >> >> > >
>> >> >> > > Why?  It's userspace's responsibility to get the above right.  If 
>> >> >> > > userspace fails
>> >> >> > > to provide a src_page when it doesn't want in-place copy, that's a 
>> >> >> > > userspace bug.
>> >>
>> >> Yan, is your concern that userspace forgot to update the code and
>> >> forgets to provide a src_page, and if we keep the "Zero page while
>> > Yes. Previously, it would be rejected after GUP fails.
>> >
>>
>> I see, didn't realize previously it would be rejected because GUP
>> fails. GUP failed because it wasn't faulted into the host?
> GUP fails if 0 is not a valid user address.
> But GUP would not fail if 0 is a valid address. e.g., in below scenario:
>
> #include <sys/mman.h>
> #include <stdio.h>
> int main(void)
> {
>         void *p=mmap((void*)0,4096,PROT_READ|PROT_WRITE, 
> MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS,-1,0);
>         if (p==MAP_FAILED) {
>                 perror("mmap");
>                 return 1;
>         }
>         *(char*)0='Y';
>         printf("addr0=%p val=%c\n",p,*(char*)0);
>         return 0;
> }
>
>
>> That's kind of orthogonal, I don't think GUP fail leading to rejecting
>> populate was meant to help userspace catch these issues. GUP would also
>> fail if the user did mmap(), write to it, unmap using
>> madvise(MADV_DONTNEED), then forget and pass 0 as src_address.
> The original uAPI did not explicitly define 0 as an invalid uaddr. Whether 0 
> was
> rejected depended on whether the user mmap()'d address 0. If 0 was a valid
> mapping, populate() could proceed.
>
> commit 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating
> guest memory") changed the behavior though. It would return -EOPNOTSUPP for a > 0
> uaddr.
>

I see, I only looked at this after commit 2a62345b3052.

> But if a user configures 0 uaddr as valid, writes to it, and then passes 0 as
> source_addr(not from gmem), I'm not sure if it's good for the kernel to 
> silently
> treat 0 uaddr as an identifier for in-place copy from the private PFN in gmem.
>

I'd say the original uAPI perhaps just didn't document 0 as an
unsupported uaddr. Given that commit 2a62345b3052 already merged, uAPI
was perhaps accidentally changed and no customer complained, I think we
can move forward with 0 as an invalid src_address? I wouldn't think
anyone relies on 0 intentionally being a valid address.

I could document that, if it helps?

>
>> >> getting pfn" patch, ends up with the guest silently having a zero page?
>> >> I think that would be found quite early in userspace VMM testing...
>> > I actually encountered this during testing this patch.
>> > I update most code path to follow this sequence. However, still some 
>> > corner ones
>> > for TDVF HOB, which are less obvious and harder to update.
>> > The TD just booted up and hang silently.
>> >
>>
>> I think this is just the life of a close-to-hardware software engineer
>> :P no errors, got stuck somewhere, root cause is some unitialized
>> thing.
>>
>> >> >> > I mean if userspace specifies a NULL source_addr by mistake, it's 
>> >> >> > better for
>> >> >> > kernel to detect this mistake, similar to how it validates whether 
>> >> >> > source_addr
>> >> >> > is PAGE_ALIGNED.
>> >> >
>> >> > The alignment case is different.  If userspace provides an unaligned 
>> >> > value, KVM
>> >> > *can't* do what userspace is asking because hardware and thus KVM only 
>> >> > supports
>> >> > converting on page boundaries.
>> >> >
>> >> > For a NULL source, KVM can still do what userspace is asking.  
>> >> > Rejecting userspace's
>> >> > request would then be making assumptions about what userspace wants.
>> >> >
>> >>
>> >> Also, +1 on this, what if userspace, knowing that pages are zeroed on
>> >> allocation, actually wants to rely on that to get a zero page in the 
>> >> guest?
>> > What if 0 uaddr is a valid address? :)
>> >
>> >> >> > Since userspace already needs to perform additional steps to enable 
>> >> >> > in-place
>> >> >> > copy, specifying a dedicated flag to indicate that the NULL 
>> >> >> > source_addr is
>> >> >> > intentional seems like a reasonable burden.
>> >> >
>> >> > I don't see how it adds any value.  I wouldn't be at all surprised if 
>> >> > most VMMs
>> >> > just wen up with code that does:
>> >> >
>> >> >         if (in-place) {
>> >> >                 src = NULL;
>> >> >                 flags |= KVM_TDX_IN_PLACE_COPY_INITIAL_MEMORY_REGION;
>> >> >         }
>> >>

Reply via email to