Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

2022-06-24 Thread Michael Roth
On Fri, Jun 24, 2022 at 04:54:26PM +0800, Chao Peng wrote:
> On Thu, Jun 23, 2022 at 05:59:49PM -0500, Michael Roth wrote:
> > On Fri, May 20, 2022 at 06:31:02PM +, Sean Christopherson wrote:
> > > On Fri, May 20, 2022, Andy Lutomirski wrote:
> > > > The alternative would be to have some kind of separate table or bitmap 
> > > > (part
> > > > of the memslot?) that tells KVM whether a GPA should map to the fd.
> > > > 
> > > > What do you all think?
> > > 
> > > My original proposal was to have explicit shared vs. private memslots, and
> > > punch holes in KVM's memslots on conversion, but due to the way KVM (and
> > > userspace) handle memslot updates, conversions would be painfully slow.
> > > That's how we ended up with the current proposal.
> > > 
> > > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to 
> > > implement
> > > and wouldn't necessarily even need to interact with the memslots.  It 
> > > could be a
> > > consumer of memslots, e.g. if we wanted to disallow registering regions 
> > > without an
> > > associated memslot, but I think we'd want to avoid even that because 
> > > things will
> > > get messy during memslot updates, e.g. if dirty logging is toggled or a 
> > > shared
> > > memory region is temporarily removed then we wouldn't want to destroy the 
> > > tracking.
> > > 
> > > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, 
> > > XArray
> > > should be far more efficient.
> > > 
> > > One benefit to explicitly tracking this in KVM is that it might be useful 
> > > for
> > > software-only protected VMs, e.g. KVM could mark a region in the XArray 
> > > as "pending"
> > > based on guest hypercalls to share/unshare memory, and then complete the 
> > > transaction
> > > when userspace invokes the ioctl() to complete the share/unshare.
> > 
> > Another upside to implementing a KVM ioctl is basically the reverse of the
> > discussion around avoiding double-allocations: *supporting* 
> > double-allocations.
> > 
> > One thing I noticed while testing SNP+UPM support is a fairly dramatic
> > slow-down with how it handles OVMF, which does some really nasty stuff
> > with DMA where it takes 1 or 2 pages and flips them between
> > shared/private on every transaction. Obviously that's not ideal and
> > should be fixed directly at some point, but it's something that exists in 
> > the
> > wild and might not be the only such instance where we need to deal with that
> > sort of usage pattern. 
> > 
> > With the current implementation, one option I had to address this was to
> > disable hole-punching in QEMU when doing shared->private conversions:
> > 
> > Boot time from 1GB guest:
> >   SNP:                              32s
> >   SNP+UPM:                          1m43s
> >   SNP+UPM (disable shared discard): 1m08s
> > 
> > Of course, we don't have the option of disabling discard/hole-punching
> > for private memory to see if we get similar gains there, since that also
> > doubles as the interface for doing private->shared conversions.
> 
> Private should be the same: minus the time consumed for private memory, the
> data should be close to the SNP case. You can't try that in the current version
> because we rely on the existence of the private page to tell whether a page is
> private.
> 
> > A separate
> > KVM ioctl to decouple these 2 things would allow for that, and allow for a
> > way for userspace to implement things like batched/lazy-discard of
> > previously-converted pages to deal with cases like these.
> 
> The planned ioctl includes two responsibilities:
>   - Mark the range as private/shared
>   - Zap the existing SLPT mapping for the range
> 
> Whether or not to do the hole-punching on the fd is unrelated to this
> ioctl; userspace is free to do that or not. Since we don't rely on
> the fact that private memory has already been allocated, we can support
> lazy faulting and don't need an explicit fallocate(). That means whether
> the memory is discarded or not in the memory backing store is not
> required by KVM, but is a userspace option.

Nice, that sounds promising.

> 
> > 
> > Another motivator for this separate ioctl is that, since we're considering
> > 'out-of-band' interactions with the private memfd where userspace might
> > erroneously/inadvertently do things like double allocations, another thing it
> > might do is pre-allocate pages in the private memfd prior to associating
> > the memfd with a private memslot. Since the notifiers aren't registered until
> > that point, any associated callbacks that would normally need to be done as
> > part of those fallocate() notifications would be missed unless we do something
> > like 'replay' all the notifications once the private memslot is registered and
> > associated with a memfile notifier. But that seems a bit ugly, and I'm not
> > sure how well that would work. This also seems to hint at this additional
> > 'conversion' state being something that should be owned and managed directly
> > by KVM rather than hooking into the allocations.

Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

2022-06-24 Thread Chao Peng
On Thu, Jun 23, 2022 at 05:59:49PM -0500, Michael Roth wrote:
> On Fri, May 20, 2022 at 06:31:02PM +, Sean Christopherson wrote:
> > On Fri, May 20, 2022, Andy Lutomirski wrote:
> > > The alternative would be to have some kind of separate table or bitmap 
> > > (part
> > > of the memslot?) that tells KVM whether a GPA should map to the fd.
> > > 
> > > What do you all think?
> > 
> > My original proposal was to have explicit shared vs. private memslots, and
> > punch holes in KVM's memslots on conversion, but due to the way KVM (and
> > userspace) handle memslot updates, conversions would be painfully slow.
> > That's how we ended up with the current proposal.
> > 
> > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to 
> > implement
> > and wouldn't necessarily even need to interact with the memslots.  It could 
> > be a
> > consumer of memslots, e.g. if we wanted to disallow registering regions 
> > without an
> > associated memslot, but I think we'd want to avoid even that because things 
> > will
> > get messy during memslot updates, e.g. if dirty logging is toggled or a 
> > shared
> > memory region is temporarily removed then we wouldn't want to destroy the 
> > tracking.
> > 
> > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, 
> > XArray
> > should be far more efficient.
> > 
> > One benefit to explicitly tracking this in KVM is that it might be useful 
> > for
> > software-only protected VMs, e.g. KVM could mark a region in the XArray as 
> > "pending"
> > based on guest hypercalls to share/unshare memory, and then complete the 
> > transaction
> > when userspace invokes the ioctl() to complete the share/unshare.
> 
> Another upside to implementing a KVM ioctl is basically the reverse of the
> discussion around avoiding double-allocations: *supporting* 
> double-allocations.
> 
> One thing I noticed while testing SNP+UPM support is a fairly dramatic
> slow-down with how it handles OVMF, which does some really nasty stuff
> with DMA where it takes 1 or 2 pages and flips them between
> shared/private on every transaction. Obviously that's not ideal and
> should be fixed directly at some point, but it's something that exists in the
> wild and might not be the only such instance where we need to deal with that
> sort of usage pattern. 
> 
> With the current implementation, one option I had to address this was to
> disable hole-punching in QEMU when doing shared->private conversions:
> 
> Boot time from 1GB guest:
>   SNP:                              32s
>   SNP+UPM:                          1m43s
>   SNP+UPM (disable shared discard): 1m08s
> 
> Of course, we don't have the option of disabling discard/hole-punching
> for private memory to see if we get similar gains there, since that also
> doubles as the interface for doing private->shared conversions.

Private should be the same: minus the time consumed for private memory, the
data should be close to the SNP case. You can't try that in the current version
because we rely on the existence of the private page to tell whether a page is
private.

> A separate
> KVM ioctl to decouple these 2 things would allow for that, and allow for a
> way for userspace to implement things like batched/lazy-discard of
> previously-converted pages to deal with cases like these.

The planned ioctl includes two responsibilities:
  - Mark the range as private/shared
  - Zap the existing SLPT mapping for the range

Whether or not to do the hole-punching on the fd is unrelated to this
ioctl; userspace is free to do that or not. Since we don't rely on
the fact that private memory has already been allocated, we can support
lazy faulting and don't need an explicit fallocate(). That means whether
the memory is discarded or not in the memory backing store is not
required by KVM, but is a userspace option.
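For illustration only, a rough sketch of how a handler for that ioctl could
combine the two steps; kvm->mem_attr_array, KVM_MEM_ATTR_PRIVATE and the
function name are placeholders invented for this example, not code from the
posted series:

  /*
   * Illustrative sketch only.  Record the new attribute for every gfn in the
   * range in a KVM-owned xarray (a NULL entry means "shared", the default),
   * then zap the existing second-level mappings so the next fault is resolved
   * against the new state.
   */
  static int kvm_set_mem_attr(struct kvm *kvm, gfn_t start, gfn_t end,
                              bool is_private)
  {
          void *entry = is_private ? xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
          gfn_t gfn;
          int r;

          for (gfn = start; gfn < end; gfn++) {
                  r = xa_err(xa_store(&kvm->mem_attr_array, gfn, entry,
                                      GFP_KERNEL_ACCOUNT));
                  if (r)
                          return r;
          }

          /* Responsibility #2: zap the stale SLPT mappings for the range. */
          kvm_zap_gfn_range(kvm, start, end);
          return 0;
  }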

> 
> Another motivator for this separate ioctl is that, since we're considering
> 'out-of-band' interactions with the private memfd where userspace might
> erroneously/inadvertently do things like double allocations, another thing it
> might do is pre-allocate pages in the private memfd prior to associating
> the memfd with a private memslot. Since the notifiers aren't registered until
> that point, any associated callbacks that would normally need to be done as
> part of those fallocate() notifications would be missed unless we do something
> like 'replay' all the notifications once the private memslot is registered and
> associated with a memfile notifier. But that seems a bit ugly, and I'm not
> sure how well that would work. This also seems to hint at this additional
> 'conversion' state being something that should be owned and managed directly
> by KVM rather than hooking into the allocations.

Right, once we move the private/shared state into KVM then we don't rely
on those callbacks so the 'replay' thing is unneeded. fallocate()
notification is useless for sure, invalidate() is likely still needed,
just like the 

Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

2022-06-23 Thread Michael Roth
On Fri, May 20, 2022 at 06:31:02PM +, Sean Christopherson wrote:
> On Fri, May 20, 2022, Andy Lutomirski wrote:
> > The alternative would be to have some kind of separate table or bitmap (part
> > of the memslot?) that tells KVM whether a GPA should map to the fd.
> > 
> > What do you all think?
> 
> My original proposal was to have explicit shared vs. private memslots, and punch
> holes in KVM's memslots on conversion, but due to the way KVM (and userspace)
> handle memslot updates, conversions would be painfully slow.  That's how we
> ended up with the current proposal.
> 
> But a dedicated KVM ioctl() to add/remove shared ranges would be easy to 
> implement
> and wouldn't necessarily even need to interact with the memslots.  It could 
> be a
> consumer of memslots, e.g. if we wanted to disallow registering regions 
> without an
> associated memslot, but I think we'd want to avoid even that because things 
> will
> get messy during memslot updates, e.g. if dirty logging is toggled or a shared
> memory region is temporarily removed then we wouldn't want to destroy the 
> tracking.
> 
> I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
> should be far more efficient.
> 
> One benefit to explicitly tracking this in KVM is that it might be useful for
> software-only protected VMs, e.g. KVM could mark a region in the XArray as 
> "pending"
> based on guest hypercalls to share/unshare memory, and then complete the 
> transaction
> when userspace invokes the ioctl() to complete the share/unshare.

Another upside to implementing a KVM ioctl is basically the reverse of the
discussion around avoiding double-allocations: *supporting* double-allocations.

One thing I noticed while testing SNP+UPM support is a fairly dramatic
slow-down with how it handles OVMF, which does some really nasty stuff
with DMA where it takes 1 or 2 pages and flips them between
shared/private on every transaction. Obviously that's not ideal and
should be fixed directly at some point, but it's something that exists in the
wild and might not be the only such instance where we need to deal with that
sort of usage pattern. 

With the current implementation, one option I had to address this was to
disable hole-punching in QEMU when doing shared->private conversions:

Boot time from 1GB guest:
  SNP:                              32s
  SNP+UPM:                          1m43s
  SNP+UPM (disable shared discard): 1m08s

Of course, we don't have the option of disabling discard/hole-punching
for private memory to see if we get similar gains there, since that also
doubles as the interface for doing private->shared conversions. A separate
KVM ioctl to decouple these 2 things would allow for that, and allow for a
way for userspace to implement things like batched/lazy-discard of
previously-converted pages to deal with cases like these.
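For illustration, a very rough userspace-side sketch of such a batched/lazy
discard; apart from fallocate(FALLOC_FL_PUNCH_HOLE) itself, all names and the
flush policy are made up for this example:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <linux/falloc.h>

  struct discard_range { off_t offset, len; };

  static struct discard_range pending[128];  /* no overflow handling, sketch only */
  static int npending;

  /* Remember a converted range instead of discarding it immediately. */
  static void queue_discard(off_t offset, off_t len)
  {
          pending[npending].offset = offset;
          pending[npending++].len = len;
  }

  /* Punch all queued holes in one pass, e.g. from a timer or a threshold. */
  static void flush_discards(int private_fd)
  {
          for (int i = 0; i < npending; i++)
                  fallocate(private_fd,
                            FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                            pending[i].offset, pending[i].len);
          npending = 0;
  }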

Another motivator for this separate ioctl is that, since we're considering
'out-of-band' interactions with the private memfd where userspace might
erroneously/inadvertently do things like double allocations, another thing it
might do is pre-allocate pages in the private memfd prior to associating
the memfd with a private memslot. Since the notifiers aren't registered until
that point, any associated callbacks that would normally need to be done as
part of those fallocate() notifications would be missed unless we do something
like 'replay' all the notifications once the private memslot is registered and
associated with a memfile notifier. But that seems a bit ugly, and I'm not
sure how well that would work. This also seems to hint at this additional
'conversion' state being something that should be owned and managed directly
by KVM rather than hooking into the allocations.

It would also nicely solve the question of how to handle in-place
encryption, since unlike userspace, KVM is perfectly capable of copying
data from shared->private prior to conversion / guest start, and
disallowing such things afterward. Would just need an extra flag basically.



Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

2022-06-20 Thread Chao Peng
On Fri, Jun 17, 2022 at 09:27:25PM +, Sean Christopherson wrote:
> On Fri, Jun 17, 2022, Sean Christopherson wrote:
> > > @@ -110,6 +133,7 @@ struct kvm_userspace_memory_region {
> > >   */
> > >  #define KVM_MEM_LOG_DIRTY_PAGES  (1UL << 0)
> > >  #define KVM_MEM_READONLY (1UL << 1)
> > > +#define KVM_MEM_PRIVATE  (1UL << 2)
> > 
> > Hmm, KVM_MEM_PRIVATE is technically wrong now that a "private" memslot maps 
> > private
> > and/or shared memory.  Strictly speaking, we don't actually need a new 
> > flag.  Valid
> > file descriptors must be >=0, so the logic for specifying a memslot that 
> > can be
> > converted between private and shared could be that "(int)private_fd < 0" 
> > means
> > "not convertible", i.e. derive the flag from private_fd.
> > 
> > And looking at the two KVM consumers of the flag, via 
> > kvm_slot_is_private(), they're
> > both wrong.  Both kvm_faultin_pfn() and kvm_mmu_max_mapping_level() should 
> > operate
> > on the _fault_, not the slot.  So it would actually be a positive to not 
> > have an easy
> > way to query if a slot supports conversion.
> 
> I take that back, the usage in kvm_faultin_pfn() is correct, but the name ends
> up being confusing because it suggests that it always faults in a private pfn.

Makes sense, will change the naming, thanks.

> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b6d75016e48c..e1008f00609d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4045,7 +4045,7 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, 
> struct kvm_page_fault *fault)
> return RET_PF_EMULATE;
> }
> 
> -   if (fault->is_private) {
> +   if (kvm_slot_can_be_private(slot)) {
> r = kvm_faultin_pfn_private(vcpu, fault);
> if (r != RET_PF_CONTINUE)
> return r == RET_PF_FIXED ? RET_PF_CONTINUE : r;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 31f704c83099..c5126190fb71 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -583,9 +583,9 @@ struct kvm_memory_slot {
> struct kvm *kvm;
>  };
> 
> -static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
> +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot 
> *slot)
>  {
> -   return slot && (slot->flags & KVM_MEM_PRIVATE);
> +   return slot && !!slot->private_file;
>  }
> 
>  static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot 
> *slot)



Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

2022-06-20 Thread Chao Peng
On Fri, Jun 17, 2022 at 08:52:15PM +, Sean Christopherson wrote:
> On Thu, May 19, 2022, Chao Peng wrote:
> > @@ -653,12 +662,12 @@ struct kvm_irq_routing_table {
> >  };
> >  #endif
> >  
> > -#ifndef KVM_PRIVATE_MEM_SLOTS
> > -#define KVM_PRIVATE_MEM_SLOTS 0
> > +#ifndef KVM_INTERNAL_MEM_SLOTS
> > +#define KVM_INTERNAL_MEM_SLOTS 0
> >  #endif
> 
> This rename belongs in a separate patch.

Will separate it out, thanks.

> 
> >  #define KVM_MEM_SLOTS_NUM SHRT_MAX
> > -#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_PRIVATE_MEM_SLOTS)
> > +#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
> >  
> >  #ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
> >  static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
> > @@ -1087,9 +1096,9 @@ enum kvm_mr_change {
> >  };
> >  
> >  int kvm_set_memory_region(struct kvm *kvm,
> > - const struct kvm_userspace_memory_region *mem);
> > + const struct kvm_user_mem_region *mem);
> >  int __kvm_set_memory_region(struct kvm *kvm,
> > -   const struct kvm_userspace_memory_region *mem);
> > +   const struct kvm_user_mem_region *mem);
> >  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
> >  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
> >  int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index e10d131edd80..28cacd3656d4 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -103,6 +103,29 @@ struct kvm_userspace_memory_region {
> > __u64 userspace_addr; /* start of the userspace allocated memory */
> >  };
> >  
> > +struct kvm_userspace_memory_region_ext {
> > +   struct kvm_userspace_memory_region region;
> > +   __u64 private_offset;
> > +   __u32 private_fd;
> > +   __u32 pad1;
> > +   __u64 pad2[14];
> > +};
> > +
> > +#ifdef __KERNEL__
> > +/* Internal helper, the layout must match above user visible structures */
> 
> It's worth explicitly calling out which structures this aliases.  And rather than
> adding a comment about the layout needing to match, enforce it in code. I
> personally wouldn't bother with an explicit comment about the layout, IMO that's
> a fairly obvious implication of aliasing.
> 
> /*
>  * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
>  * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
>  * all fields from the top-level "extended" region.
>  */
> 

Thanks.

> 
> And I think it's in this patch that you missed a conversion to the alias, in 
> the
> prototype for check_memory_region_flags() (looks like it gets fixed up later 
> in
> the series).
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0f81bf0407be..8765b334477d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1466,7 +1466,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
> }
>  }
> 
> -static int check_memory_region_flags(const struct 
> kvm_userspace_memory_region *mem)
> +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
>  {
> u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> 
> @@ -4514,6 +4514,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
> return fd;
>  }
> 
> +#define SANITY_CHECK_MEM_REGION_FIELD(field)					\
> +do {										\
> +	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=		\
> +		     offsetof(struct kvm_userspace_memory_region, field));	\
> +	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=	\
> +		     sizeof_field(struct kvm_userspace_memory_region, field));	\
> +} while (0)
> +
> +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)				\
> +do {										\
> +	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=		\
> +		     offsetof(struct kvm_userspace_memory_region_ext, field));	\
> +	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=	\
> +		     sizeof_field(struct kvm_userspace_memory_region_ext, field)); \
> +} while (0)
> +
> +static void kvm_sanity_check_user_mem_region_alias(void)
> +{
> +   SANITY_CHECK_MEM_REGION_FIELD(slot);
> +   SANITY_CHECK_MEM_REGION_FIELD(flags);
> +   SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
> +   SANITY_CHECK_MEM_REGION_FIELD(memory_size);
> +   SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
> +   SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset);
> +   SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd);
> +}
> +
>  static long kvm_vm_ioctl(struct file *filp,

Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

2022-06-17 Thread Sean Christopherson
On Fri, Jun 17, 2022, Sean Christopherson wrote:
> > @@ -110,6 +133,7 @@ struct kvm_userspace_memory_region {
> >   */
> >  #define KVM_MEM_LOG_DIRTY_PAGES(1UL << 0)
> >  #define KVM_MEM_READONLY   (1UL << 1)
> > +#define KVM_MEM_PRIVATE(1UL << 2)
> 
> Hmm, KVM_MEM_PRIVATE is technically wrong now that a "private" memslot maps 
> private
> and/or shared memory.  Strictly speaking, we don't actually need a new flag.  
> Valid
> file descriptors must be >=0, so the logic for specifying a memslot that can 
> be
> converted between private and shared could be that "(int)private_fd < 0" means
> "not convertible", i.e. derive the flag from private_fd.
> 
> And looking at the two KVM consumers of the flag, via kvm_slot_is_private(), 
> they're
> both wrong.  Both kvm_faultin_pfn() and kvm_mmu_max_mapping_level() should 
> operate
> on the _fault_, not the slot.  So it would actually be a positive to not have 
> an easy
> way to query if a slot supports conversion.

I take that back, the usage in kvm_faultin_pfn() is correct, but the name ends
up being confusing because it suggests that it always faults in a private pfn.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b6d75016e48c..e1008f00609d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4045,7 +4045,7 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct 
kvm_page_fault *fault)
return RET_PF_EMULATE;
}

-   if (fault->is_private) {
+   if (kvm_slot_can_be_private(slot)) {
r = kvm_faultin_pfn_private(vcpu, fault);
if (r != RET_PF_CONTINUE)
return r == RET_PF_FIXED ? RET_PF_CONTINUE : r;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 31f704c83099..c5126190fb71 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -583,9 +583,9 @@ struct kvm_memory_slot {
struct kvm *kvm;
 };

-static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
+static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
 {
-   return slot && (slot->flags & KVM_MEM_PRIVATE);
+   return slot && !!slot->private_file;
 }

 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot 
*slot)




Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

2022-06-17 Thread Sean Christopherson
On Thu, May 19, 2022, Chao Peng wrote:
> @@ -653,12 +662,12 @@ struct kvm_irq_routing_table {
>  };
>  #endif
>  
> -#ifndef KVM_PRIVATE_MEM_SLOTS
> -#define KVM_PRIVATE_MEM_SLOTS 0
> +#ifndef KVM_INTERNAL_MEM_SLOTS
> +#define KVM_INTERNAL_MEM_SLOTS 0
>  #endif

This rename belongs in a separate patch.

>  #define KVM_MEM_SLOTS_NUM SHRT_MAX
> -#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_PRIVATE_MEM_SLOTS)
> +#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
>  
>  #ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
>  static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
> @@ -1087,9 +1096,9 @@ enum kvm_mr_change {
>  };
>  
>  int kvm_set_memory_region(struct kvm *kvm,
> -   const struct kvm_userspace_memory_region *mem);
> +   const struct kvm_user_mem_region *mem);
>  int __kvm_set_memory_region(struct kvm *kvm,
> - const struct kvm_userspace_memory_region *mem);
> + const struct kvm_user_mem_region *mem);
>  void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
>  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
>  int kvm_arch_prepare_memory_region(struct kvm *kvm,
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e10d131edd80..28cacd3656d4 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -103,6 +103,29 @@ struct kvm_userspace_memory_region {
>   __u64 userspace_addr; /* start of the userspace allocated memory */
>  };
>  
> +struct kvm_userspace_memory_region_ext {
> + struct kvm_userspace_memory_region region;
> + __u64 private_offset;
> + __u32 private_fd;
> + __u32 pad1;
> + __u64 pad2[14];
> +};
> +
> +#ifdef __KERNEL__
> +/* Internal helper, the layout must match above user visible structures */

It's worth explicitly calling out which structures this aliases.  And rather than
adding a comment about the layout needing to match, enforce it in code. I
personally wouldn't bother with an explicit comment about the layout, IMO that's
a fairly obvious implication of aliasing.

/*
 * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext
 * that "unpacks" kvm_userspace_memory_region so that KVM can directly access
 * all fields from the top-level "extended" region.
 */


And I think it's in this patch that you missed a conversion to the alias, in the
prototype for check_memory_region_flags() (looks like it gets fixed up later in
the series).

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0f81bf0407be..8765b334477d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1466,7 +1466,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
}
 }

-static int check_memory_region_flags(const struct kvm_userspace_memory_region 
*mem)
+static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
 {
u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;

@@ -4514,6 +4514,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm)
return fd;
 }

+#define SANITY_CHECK_MEM_REGION_FIELD(field)					\
+do {										\
+	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=		\
+		     offsetof(struct kvm_userspace_memory_region, field));	\
+	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=	\
+		     sizeof_field(struct kvm_userspace_memory_region, field));	\
+} while (0)
+
+#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field)				\
+do {										\
+	BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) !=		\
+		     offsetof(struct kvm_userspace_memory_region_ext, field));	\
+	BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) !=	\
+		     sizeof_field(struct kvm_userspace_memory_region_ext, field)); \
+} while (0)
+
+static void kvm_sanity_check_user_mem_region_alias(void)
+{
+   SANITY_CHECK_MEM_REGION_FIELD(slot);
+   SANITY_CHECK_MEM_REGION_FIELD(flags);
+   SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr);
+   SANITY_CHECK_MEM_REGION_FIELD(memory_size);
+   SANITY_CHECK_MEM_REGION_FIELD(userspace_addr);
+   SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset);
+   SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd);
+}
+
 static long kvm_vm_ioctl(struct file *filp,
   unsigned int ioctl, unsigned long arg)
 {
@@ -4541,6 +4568,8 @@ static long kvm_vm_ioctl(struct file *filp,
unsigned long size;
u32 flags;

+   kvm_sanity_check_user_mem_region_alias();
+
memset(&mem, 0, sizeof(mem));

r = -EFAULT;

> +struct 

Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

2022-06-14 Thread Chao Peng
On Fri, Jun 10, 2022 at 04:14:21PM +, Sean Christopherson wrote:
> On Mon, May 30, 2022, Chao Peng wrote:
> > On Mon, May 23, 2022 at 03:22:32PM +, Sean Christopherson wrote:
> > > Actually, if the semantics are that userspace declares memory as private, 
> > > then we
> > > can reuse KVM_MEMORY_ENCRYPT_REG_REGION and 
> > > KVM_MEMORY_ENCRYPT_UNREG_REGION.  It'd
> > > be a little gross because we'd need to slightly redefine the semantics 
> > > for TDX, SNP,
> > > and software-protected VM types, e.g. the ioctls() currently require a
> > > pre-existing memslot.  But I think it'd work...
> > 
> > These existing ioctls look good for TDX and probably SNP as well. For
> > software-protected VM types, they may not be enough. Maybe for the first
> > step we can reuse them for all hardware-based solutions and invent a new
> > interface when a software-protected solution gets really supported.
> > 
> > There is a semantics difference for fd-based private memory. Currently the
> > above two ioctls() use a userspace address (hva), while for fd-based memory
> > it should be fd+offset, and it's probably better to use a gpa in this case.
> > Then we would need to change the existing semantics and break backward
> > compatibility.
> 
> My thought was to keep the existing semantics for VMs with type==0, i.e. SEV 
> and
> SEV-ES VMs.  It's a bit gross, but the pinning behavior is a dead end for SNP 
> and
> TDX, so it effectively needs to be deprecated anyways. 

Yes agreed.

> I'm definitely not opposed
> to a new ioctl if Paolo or others think this is too awful, but burning an 
> ioctl
> for this seems wasteful.

Yes, I also feel comfortable if it's acceptable to reuse kvm_enc_region
to pass _gpa_ range for this new type.

> 
> Then generic KVM can do something like:
> 
>   case KVM_MEMORY_ENCRYPT_REG_REGION:
>   case KVM_MEMORY_ENCRYPT_UNREG_REGION:
>   struct kvm_enc_region region;
> 
>   if (!kvm_arch_vm_supports_private_memslots(kvm))
>   goto arch_vm_ioctl;
> 
>   r = -EFAULT;
>   if (copy_from_user(&region, argp, sizeof(region)))
>   goto out;
> 
>   r = kvm_set_encrypted_region(ioctl, &region);
>   break;
>   default:
> arch_vm_ioctl:
>   r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> 
> 
> where common KVM provides
> 
>   __weak bool kvm_arch_vm_supports_private_memslots(struct kvm *kvm)
>   {
>   return false;
>   }

I already had kvm_arch_private_mem_supported() introduced in patch-07
so that can be reused.

> 
> and x86 overrides that to
> 
>   bool kvm_arch_vm_supports_private_memslots(struct kvm *kvm)
>   {
>   /* I can't remember what we decided on calling type '0' VMs. */
>   return !!kvm->vm_type;
>   }
> 
> and if someone ever wants to enable private memslot for SEV/SEV-ES guests we 
> can
> always add a capability or even a new VM type.
> 
> pKVM on arm can then obviously implement 
> kvm_arch_vm_supports_private_memslots()
> to grab whatever identifies a pKVM VM.



Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

2022-06-10 Thread Sean Christopherson
On Mon, May 30, 2022, Chao Peng wrote:
> On Mon, May 23, 2022 at 03:22:32PM +, Sean Christopherson wrote:
> > Actually, if the semantics are that userspace declares memory as private, 
> > then we
> > can reuse KVM_MEMORY_ENCRYPT_REG_REGION and 
> > KVM_MEMORY_ENCRYPT_UNREG_REGION.  It'd
> > be a little gross because we'd need to slightly redefine the semantics for 
> > TDX, SNP,
> > and software-protected VM types, e.g. the ioctls() currently require a
> > pre-existing memslot.  But I think it'd work...
> 
> These existing ioctls look good for TDX and probably SNP as well. For
> software-protected VM types, they may not be enough. Maybe for the first
> step we can reuse them for all hardware-based solutions and invent a new
> interface when a software-protected solution gets really supported.
> 
> There is a semantics difference for fd-based private memory. Currently the
> above two ioctls() use a userspace address (hva), while for fd-based memory
> it should be fd+offset, and it's probably better to use a gpa in this case.
> Then we would need to change the existing semantics and break backward
> compatibility.

My thought was to keep the existing semantics for VMs with type==0, i.e. SEV and
SEV-ES VMs.  It's a bit gross, but the pinning behavior is a dead end for SNP 
and
TDX, so it effectively needs to be deprecated anyways.  I'm definitely not 
opposed
to a new ioctl if Paolo or others think this is too awful, but burning an ioctl
for this seems wasteful.

Then generic KVM can do something like:

case KVM_MEMORY_ENCRYPT_REG_REGION:
case KVM_MEMORY_ENCRYPT_UNREG_REGION:
struct kvm_enc_region region;

if (!kvm_arch_vm_supports_private_memslots(kvm))
goto arch_vm_ioctl;

r = -EFAULT;
if (copy_from_user(&region, argp, sizeof(region)))
goto out;

r = kvm_set_encrypted_region(ioctl, &region);
break;
default:
arch_vm_ioctl:
r = kvm_arch_vm_ioctl(filp, ioctl, arg);


where common KVM provides

  __weak bool kvm_arch_vm_supports_private_memslots(struct kvm *kvm)
  {
return false;
  }

and x86 overrides that to

  bool kvm_arch_vm_supports_private_memslots(struct kvm *kvm)
  {
/* I can't remember what we decided on calling type '0' VMs. */
return !!kvm->vm_type;
  }

and if someone ever wants to enable private memslot for SEV/SEV-ES guests we can
always add a capability or even a new VM type.

pKVM on arm can then obviously implement kvm_arch_vm_supports_private_memslots()
to grab whatever identifies a pKVM VM.



Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

2022-05-30 Thread Chao Peng
On Mon, May 23, 2022 at 03:22:32PM +, Sean Christopherson wrote:
> On Mon, May 23, 2022, Chao Peng wrote:
> > On Fri, May 20, 2022 at 06:31:02PM +, Sean Christopherson wrote:
> > > On Fri, May 20, 2022, Andy Lutomirski wrote:
> > > > The alternative would be to have some kind of separate table or bitmap 
> > > > (part
> > > > of the memslot?) that tells KVM whether a GPA should map to the fd.
> > > > 
> > > > What do you all think?
> > > 
> > > My original proposal was to have explicit shared vs. private memslots, and
> > > punch holes in KVM's memslots on conversion, but due to the way KVM (and
> > > userspace) handle memslot updates, conversions would be painfully slow.
> > > That's how we ended up with the current proposal.
> > > 
> > > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to 
> > > implement
> > > and wouldn't necessarily even need to interact with the memslots.  It 
> > > could be a
> > > consumer of memslots, e.g. if we wanted to disallow registering regions 
> > > without an
> > > associated memslot, but I think we'd want to avoid even that because 
> > > things will
> > > get messy during memslot updates, e.g. if dirty logging is toggled or a 
> > > shared
> > > memory region is temporarily removed then we wouldn't want to destroy the 
> > > tracking.
> > 
> > Even if we don't tie that to memslots, that info can only be effective
> > for a private memslot, right? Applying this ioctl to memory ranges defined
> > in traditional non-private memslots just makes no sense; I guess we can
> > note that in the API document.
> 
> Hrm, applying it universally would be funky, e.g. emulated MMIO would need to 
> be
> declared "shared".  But, applying it selectively would arguably be worse, e.g.
> letting userspace map memory into the guest as shared for a region that's 
> registered
> as private...
> 
> One option to deal with that mess would be to make memory shared by default, and so
> userspace
> must declare regions that are private.  Then there's no weirdness with 
> emulated MMIO
> or "legacy" memslots.
> 
> On page fault, KVM does a lookup to see if the GPA is shared or private.  If 
> the
> GPA is private, but there is no memslot or the memslot doesn't have a private 
> fd,
> KVM exits to userspace.  If there's a memslot with a private fd, the 
> shared/private
> flag is used to resolve the 
> 
> And to handle the ioctl(), KVM can use kvm_zap_gfn_range(), which will bump 
> the
> notifier sequence, i.e. force the page fault to retry if the GPA may have been
> (un)registered between checking the type and acquiring mmu_lock.

Yeah, that makes sense.

> 
> > > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, 
> > > XArray
> > > should be far more efficient.
> > 
> > What about a misbehaving guest? I don't want to design for the worst
> > case, but people may raise concerns about attacks from such a guest.
> 
> That's why cgroups exist.  E.g. a malicious/broken L1 can similarly abuse 
> nested
> EPT/NPT to generate a large number of shadow page tables.

I haven't seen that we have that in KVM. Is there any plan/discussion to add it?

> 
> > > One benefit to explicitly tracking this in KVM is that it might be useful 
> > > for
> > > software-only protected VMs, e.g. KVM could mark a region in the XArray 
> > > as "pending"
> > > based on guest hypercalls to share/unshare memory, and then complete the 
> > > transaction
> > > when userspace invokes the ioctl() to complete the share/unshare.
> > 
> > OK, then this can be another field of states/flags/attributes. Let me
> > dig into a certain level of detail:
> > 
> > First, introduce below KVM ioctl
> > 
> > KVM_SET_MEMORY_ATTR
> 
> Actually, if the semantics are that userspace declares memory as private, 
> then we
> can reuse KVM_MEMORY_ENCRYPT_REG_REGION and KVM_MEMORY_ENCRYPT_UNREG_REGION.  
> It'd
> be a little gross because we'd need to slightly redefine the semantics for 
> TDX, SNP,
> and software-protected VM types, e.g. the ioctls() currently require a
> pre-existing memslot.  But I think it'd work...

These existing ioctls look good for TDX and probably SNP as well. For
software-protected VM types, they may not be enough. Maybe for the first
step we can reuse them for all hardware-based solutions and invent a new
interface when a software-protected solution gets really supported.

There is a semantics difference for fd-based private memory. Currently the
above two ioctls() use a userspace address (hva), while for fd-based memory
it should be fd+offset, and it's probably better to use a gpa in this case.
Then we would need to change the existing semantics and break backward
compatibility.

Chao

> 
> I'll think more on this...



Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

2022-05-23 Thread Sean Christopherson
On Mon, May 23, 2022, Chao Peng wrote:
> On Fri, May 20, 2022 at 06:31:02PM +, Sean Christopherson wrote:
> > On Fri, May 20, 2022, Andy Lutomirski wrote:
> > > The alternative would be to have some kind of separate table or bitmap 
> > > (part
> > > of the memslot?) that tells KVM whether a GPA should map to the fd.
> > > 
> > > What do you all think?
> > 
> > My original proposal was to have explicit shared vs. private memslots, and
> > punch holes in KVM's memslots on conversion, but due to the way KVM (and
> > userspace) handle memslot updates, conversions would be painfully slow.
> > That's how we ended up with the current proposal.
> > 
> > But a dedicated KVM ioctl() to add/remove shared ranges would be easy to 
> > implement
> > and wouldn't necessarily even need to interact with the memslots.  It could 
> > be a
> > consumer of memslots, e.g. if we wanted to disallow registering regions 
> > without an
> > associated memslot, but I think we'd want to avoid even that because things 
> > will
> > get messy during memslot updates, e.g. if dirty logging is toggled or a 
> > shared
> > memory region is temporarily removed then we wouldn't want to destroy the 
> > tracking.
> 
> Even if we don't tie that to memslots, that info can only be effective
> for a private memslot, right? Applying this ioctl to memory ranges defined
> in traditional non-private memslots just makes no sense; I guess we can
> note that in the API document.

Hrm, applying it universally would be funky, e.g. emulated MMIO would need to be
declared "shared".  But, applying it selectively would arguably be worse, e.g.
letting userspace map memory into the guest as shared for a region that's 
registered
as private...

One option to deal with that mess would be to make memory shared by default, and so
userspace
must declare regions that are private.  Then there's no weirdness with emulated 
MMIO
or "legacy" memslots.

On page fault, KVM does a lookup to see if the GPA is shared or private.  If the
GPA is private, but there is no memslot or the memslot doesn't have a private 
fd,
KVM exits to userspace.  If there's a memslot with a private fd, the 
shared/private
flag is used to resolve the 

And to handle the ioctl(), KVM can use kvm_zap_gfn_range(), which will bump the
notifier sequence, i.e. force the page fault to retry if the GPA may have been
(un)registered between checking the type and acquiring mmu_lock.
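Roughly, the fault-path check described above could look like the sketch
below; kvm_gpa_is_private() stands in for the XArray lookup and the
exit-to-userspace handling is heavily simplified, so treat it as an
illustration only:

  static int kvm_resolve_fault_backing(struct kvm_vcpu *vcpu,
                                       struct kvm_page_fault *fault)
  {
          struct kvm_memory_slot *slot = fault->slot;

          /* Shared (the default): take the normal hva-based path. */
          if (!kvm_gpa_is_private(vcpu->kvm, fault->gfn))
                  return RET_PF_CONTINUE;

          /* Private, but no memslot or no private fd: punt to userspace. */
          if (!slot || !kvm_slot_can_be_private(slot))
                  return -EFAULT;

          /* Private and convertible: fault the pfn in from the private fd. */
          return kvm_faultin_pfn_private(vcpu, fault);
  }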

> > I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, 
> > XArray
> > should be far more efficient.
> 
> What about a misbehaving guest? I don't want to design for the worst
> case, but people may raise concerns about attacks from such a guest.

That's why cgroups exist.  E.g. a malicious/broken L1 can similarly abuse nested
EPT/NPT to generate a large number of shadow page tables.

> > One benefit to explicitly tracking this in KVM is that it might be useful 
> > for
> > software-only protected VMs, e.g. KVM could mark a region in the XArray as 
> > "pending"
> > based on guest hypercalls to share/unshare memory, and then complete the 
> > transaction
> > when userspace invokes the ioctl() to complete the share/unshare.
> 
> OK, then this can be another field of states/flags/attributes. Let me
> dig into a certain level of detail:
> 
> First, introduce below KVM ioctl
> 
> KVM_SET_MEMORY_ATTR

Actually, if the semantics are that userspace declares memory as private, then 
we
can reuse KVM_MEMORY_ENCRYPT_REG_REGION and KVM_MEMORY_ENCRYPT_UNREG_REGION.  
It'd
be a little gross because we'd need to slightly redefine the semantics for TDX, 
SNP,
and software-protected VM types, e.g. the ioctls() currently require a
pre-existing memslot.  But I think it'd work...

I'll think more on this...



Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

2022-05-23 Thread Chao Peng
On Fri, May 20, 2022 at 06:31:02PM +, Sean Christopherson wrote:
> On Fri, May 20, 2022, Andy Lutomirski wrote:
> > The alternative would be to have some kind of separate table or bitmap (part
> > of the memslot?) that tells KVM whether a GPA should map to the fd.
> > 
> > What do you all think?
> 
> My original proposal was to have explicit shared vs. private memslots, and punch
> holes in KVM's memslots on conversion, but due to the way KVM (and userspace)
> handle memslot updates, conversions would be painfully slow.  That's how we
> ended up with the current proposal.
> 
> But a dedicated KVM ioctl() to add/remove shared ranges would be easy to 
> implement
> and wouldn't necessarily even need to interact with the memslots.  It could 
> be a
> consumer of memslots, e.g. if we wanted to disallow registering regions 
> without an
> associated memslot, but I think we'd want to avoid even that because things 
> will
> get messy during memslot updates, e.g. if dirty logging is toggled or a shared
> memory region is temporarily removed then we wouldn't want to destroy the 
> tracking.

Even if we don't tie that to memslots, that info can only be effective
for a private memslot, right? Applying this ioctl to memory ranges defined
in traditional non-private memslots just makes no sense; I guess we can
note that in the API document.

> 
> I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
> should be far more efficient.

What about a misbehaving guest? I don't want to design for the worst
case, but people may raise concerns about attacks from such a guest.

> 
> One benefit to explicitly tracking this in KVM is that it might be useful for
> software-only protected VMs, e.g. KVM could mark a region in the XArray as 
> "pending"
> based on guest hypercalls to share/unshare memory, and then complete the 
> transaction
> when userspace invokes the ioctl() to complete the share/unshare.

OK, then this can be another field of states/flags/attributes. Let me
dig into a certain level of detail:

First, introduce below KVM ioctl

KVM_SET_MEMORY_ATTR

struct kvm_memory_attr {
__u64 addr; /* page aligned */
__u64 size; /* page aligned */
#define KVM_MEMORY_ATTR_SHARED  (1 << 0)
#define KVM_MEMORY_ATTR_PRIVATE (1 << 1)
__u64 flags;
}

Second, check the KVM-maintained guest memory attributes in the page fault
handler (instead of checking memory existence in the private fd).

Third, the memfile_notifier_ops (populate/invalidate) will be removed
from the current code; the old mapping zapping can be handled directly in
this new KVM ioctl().
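
For illustration, a hypothetical userspace caller of the proposed ioctl could
look like this (the GPA/size values are placeholders and KVM_SET_MEMORY_ATTR
does not exist yet):

  struct kvm_memory_attr attr = {
          .addr  = 0x100000000ULL,        /* page-aligned GPA, placeholder */
          .size  = 0x200000ULL,           /* page-aligned length, placeholder */
          .flags = KVM_MEMORY_ATTR_PRIVATE,
  };

  /* vm_fd is an open VM file descriptor obtained from KVM_CREATE_VM. */
  if (ioctl(vm_fd, KVM_SET_MEMORY_ATTR, &attr) < 0)
          perror("KVM_SET_MEMORY_ATTR");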
 
Thought?

This info is stored in KVM, which I think is reasonable. But for
other potential memfile_notifier users like VFIO, some KVM-to-VFIO APIs
might be needed, depending on the implementation.

It is also possible to maintain this info purely in userspace. The only
tricky bit is implicit conversion support, which has to be checked in the
KVM page fault handler and is in the fast path.

Thanks,
Chao



Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

2022-05-21 Thread Andy Lutomirski



On Fri, May 20, 2022, at 11:31 AM, Sean Christopherson wrote:

> But a dedicated KVM ioctl() to add/remove shared ranges would be easy 
> to implement
> and wouldn't necessarily even need to interact with the memslots.  It 
> could be a
> consumer of memslots, e.g. if we wanted to disallow registering regions 
> without an
> associated memslot, but I think we'd want to avoid even that because 
> things will
> get messy during memslot updates, e.g. if dirty logging is toggled or a 
> shared
> memory region is temporarily removed then we wouldn't want to destroy 
> the tracking.
>
> I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
> should be far more efficient.
>
> One benefit to explicitly tracking this in KVM is that it might be 
> useful for
> software-only protected VMs, e.g. KVM could mark a region in the XArray 
> as "pending"
> based on guest hypercalls to share/unshare memory, and then complete 
> the transaction
> when userspace invokes the ioctl() to complete the share/unshare.

That makes sense.

If KVM goes this route, perhaps the allowed states for a GPA should
include private, shared, and also private-and-shared.  Then anyone who wanted 
to use the same masked GPA for shared and private on TDX could do so if they 
wanted to.



Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

2022-05-20 Thread Sean Christopherson
On Fri, May 20, 2022, Andy Lutomirski wrote:
> The alternative would be to have some kind of separate table or bitmap (part
> of the memslot?) that tells KVM whether a GPA should map to the fd.
> 
> What do you all think?

My original proposal was to have explicit shared vs. private memslots, and punch
holes in KVM's memslots on conversion, but due to the way KVM (and userspace)
handle memslot updates, conversions would be painfully slow.  That's how we ended
up with the current proposal.

But a dedicated KVM ioctl() to add/remove shared ranges would be easy to 
implement
and wouldn't necessarily even need to interact with the memslots.  It could be a
consumer of memslots, e.g. if we wanted to disallow registering regions without 
an
associated memslot, but I think we'd want to avoid even that because things will
get messy during memslot updates, e.g. if dirty logging is toggled or a shared
memory region is temporarily removed then we wouldn't want to destroy the 
tracking.

I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
should be far more efficient.

One benefit to explicitly tracking this in KVM is that it might be useful for
software-only protected VMs, e.g. KVM could mark a region in the XArray as 
"pending"
based on guest hypercalls to share/unshare memory, and then complete the 
transaction
when userspace invokes the ioctl() to complete the share/unshare.
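
As a purely illustrative sketch of that "pending" idea, not code from the
series (the state names, kvm->mem_attr_array and the helpers are all made up
for this example):

  enum kvm_gpa_state {
          KVM_GPA_SHARED,                 /* 0: an absent XArray entry reads as this */
          KVM_GPA_PRIVATE,
          KVM_GPA_PENDING_PRIVATE,        /* guest asked, userspace not done yet */
          KVM_GPA_PENDING_SHARED,
  };

  /* Guest share/unshare hypercall: only record the request. */
  static void kvm_mark_pending(struct kvm *kvm, gfn_t gfn, bool to_private)
  {
          unsigned long s = to_private ? KVM_GPA_PENDING_PRIVATE
                                       : KVM_GPA_PENDING_SHARED;

          xa_store(&kvm->mem_attr_array, gfn, xa_mk_value(s), GFP_KERNEL_ACCOUNT);
  }

  /* Userspace ioctl: complete the transaction and finalize the state. */
  static void kvm_complete_conversion(struct kvm *kvm, gfn_t gfn)
  {
          unsigned long s = xa_to_value(xa_load(&kvm->mem_attr_array, gfn));

          if (s == KVM_GPA_PENDING_PRIVATE)
                  s = KVM_GPA_PRIVATE;
          else if (s == KVM_GPA_PENDING_SHARED)
                  s = KVM_GPA_SHARED;

          xa_store(&kvm->mem_attr_array, gfn, xa_mk_value(s), GFP_KERNEL_ACCOUNT);
  }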



Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

2022-05-20 Thread Andy Lutomirski

On 5/19/22 08:37, Chao Peng wrote:

Extend the memslot definition to provide guest private memory through a
file descriptor(fd) instead of userspace_addr(hva). Such guest private
memory(fd) may never be mapped into userspace so no userspace_addr(hva)
can be used. Instead, add two new fields
(private_fd/private_offset), plus the existing memory_size, to represent
the private memory range. Such a memslot can still have the existing
userspace_addr(hva). When used, a single memslot can maintain both
private memory through the private fd (private_fd/private_offset) and shared
memory through the hva (userspace_addr). A GPA is considered private by KVM
if the memslot has a private fd and the corresponding page in the private
fd is populated; otherwise, it's shared.




So this is a strange API and, IMO, a layering violation.  I want to make 
sure that we're all actually on board with making this a permanent part 
of the Linux API.  Specifically, we end up with a multiplexing situation 
as you have described. For a given GPA, there are *two* possible host 
backings: an fd-backed one (from the fd, which is private for now but might
end up potentially shared depending on future extensions) and a
VMA-backed one.  The selection of which one backs the address is made 
internally by whatever backs the fd.


This is, IMO, a clear layering violation.  Normally, an fd has an
associated address space, and pages in that address space can have 
contents, can be holes that appear to contain all zeros, or could have 
holes that are inaccessible.  If you try to access a hole, you get 
whatever is in the hole.


But now, with this patchset, the fd is more of an overlay and you get 
*something else* if you try to access through the hole.


This results in operations on the fd bubbling up to the KVM mapping in 
what is, IMO, a strange way.  If the user punches a hole, KVM has to 
modify its mappings such that the GPA goes to whatever VMA may be there. 
 (And update the RMP, the hypervisor's tables, or whatever else might 
actually control privateness.)  Conversely, if the user does fallocate 
to fill a hole, the guest mapping *to an unrelated page* has to be 
zapped so that the fd's page shows up.  And the RMP needs updating, etc.


I am lukewarm on this for a few reasons.

1. This is weird.  AFAIK nothing else works like this.  Obviously this
is subjective, but "weird" and "layering violation" sometimes translate
to "problematic locking".


2. fd-backed private memory can't have normal holes.  If I make a memfd, 
punch a hole in it, and mmap(MAP_SHARED) it, I end up with a page that 
reads as zero.  If I write to it, the page gets allocated.  But with 
this new mechanism, if I punch a hole and put it in a memslot, reads and 
writes go somewhere else.  So what if I actually wanted lazily allocated 
private zeros?


2b. For a hypothetical future extension in which an fd can also have 
shared pages (for conversion, for example, or simply because the fd 
backing might actually be more efficient than indirecting through VMAs 
and therefore get used for shared memory or entirely-non-confidential 
VMs), lazy fd-backed zeros sound genuinely useful.


3. TDX hardware capability is not fully exposed.  TDX can have a private 
page and a shared page at GPAs that differ only by the private bit. 
Sure, no one plans to use this today, but baking this into the user ABI 
throws away half the potential address space.


3b. Any software solution that works like TDX (which IMO seems like an 
eminently reasonable design to me) has the same issue.
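
For contrast with point 2 above, the plain-memfd behaviour being described is
roughly the following self-contained sketch, using only ordinary Linux APIs:

  #define _GNU_SOURCE
  #include <assert.h>
  #include <fcntl.h>
  #include <linux/falloc.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = memfd_create("demo", MFD_CLOEXEC);
          char *p;

          ftruncate(fd, 4096);
          /* Punch a hole: the page is freed but the range stays readable. */
          fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, 4096);

          p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          assert(p[0] == 0);      /* the hole reads back as zeros */
          p[0] = 1;               /* writing allocates a fresh page */
          return 0;
  }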



The alternative would be to have some kind of separate table or bitmap 
(part of the memslot?) that tells KVM whether a GPA should map to the fd.


What do you all think?



[PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

2022-05-19 Thread Chao Peng
Extend the memslot definition to provide guest private memory through a
file descriptor(fd) instead of userspace_addr(hva). Such guest private
memory(fd) may never be mapped into userspace so no userspace_addr(hva)
can be used. Instead, add two new fields
(private_fd/private_offset), plus the existing memory_size, to represent
the private memory range. Such a memslot can still have the existing
userspace_addr(hva). When used, a single memslot can maintain both
private memory through the private fd (private_fd/private_offset) and shared
memory through the hva (userspace_addr). A GPA is considered private by KVM
if the memslot has a private fd and the corresponding page in the private
fd is populated; otherwise, it's shared.

Since there is no userspace mapping for the private fd, we cannot
rely on get_user_pages() to get the pfn in KVM; instead we add a new
memfile_notifier to the memslot and rely on it to get the pfn by invoking
its callbacks into the memory backing store with the fd/offset.

This new extension is indicated by a new flag KVM_MEM_PRIVATE. At
compile time, a new config HAVE_KVM_PRIVATE_MEM is added and right now
it is selected on X86_64 for Intel TDX usage.

To keep the KVM code simple, internally we use a binary-compatible struct
kvm_user_mem_region to handle both the normal and the '_ext' variants.

Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Signed-off-by: Chao Peng 
---
 Documentation/virt/kvm/api.rst   | 38 ++--
 arch/mips/include/asm/kvm_host.h |  2 +-
 arch/x86/include/asm/kvm_host.h  |  2 +-
 arch/x86/kvm/Kconfig |  2 ++
 arch/x86/kvm/x86.c   |  2 +-
 include/linux/kvm_host.h | 19 +++-
 include/uapi/linux/kvm.h | 24 
 virt/kvm/Kconfig |  3 +++
 virt/kvm/kvm_main.c  | 33 +--
 9 files changed, 103 insertions(+), 22 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 23baf7fce038..b959445b64cc 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1311,7 +1311,7 @@ yet and must be cleared on entry.
 :Capability: KVM_CAP_USER_MEMORY
 :Architectures: all
 :Type: vm ioctl
-:Parameters: struct kvm_userspace_memory_region (in)
+:Parameters: struct kvm_userspace_memory_region(_ext) (in)
 :Returns: 0 on success, -1 on error
 
 ::
@@ -1324,9 +1324,18 @@ yet and must be cleared on entry.
__u64 userspace_addr; /* start of the userspace allocated memory */
   };
 
+  struct kvm_userspace_memory_region_ext {
+   struct kvm_userspace_memory_region region;
+   __u64 private_offset;
+   __u32 private_fd;
+   __u32 pad1;
+   __u64 pad2[14];
+};
+
   /* for kvm_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES  (1UL << 0)
   #define KVM_MEM_READONLY (1UL << 1)
+  #define KVM_MEM_PRIVATE  (1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1357,12 +1366,27 @@ It is recommended that the lower 21 bits of 
guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region
+fields. It also includes additional fields for some specific features. See the
+description of the flags field below for more information. It's recommended to
+use kvm_userspace_memory_region_ext in new userspace code.
+
+The flags field supports below flags:
+
+- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to
+  memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to use it.
+
+- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to
+  make a new slot read-only.  In this case, writes to this memory will be posted
+  to userspace as KVM_EXIT_MMIO exits.
+
+- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by
+  a file descriptor(fd) and the content of the private memory is invisible to
+  userspace. In this case, userspace should use private_fd/private_offset in
+  kvm_userspace_memory_region_ext to instruct KVM to provide private memory to
+  the guest. Userspace should guarantee not to map the same pfn indicated by
+  private_fd/private_offset to different gfns with multiple memslots. Failure to
+  do this may result in undefined behavior.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region
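
To illustrate the proposed API from the userspace side, registering a
convertible slot might look roughly like the sketch below; how private_fd is
created and how shared_hva is mapped are covered by other patches in the
series and are simply assumed here:

  struct kvm_userspace_memory_region_ext ext = {
          .region = {
                  .slot            = 0,
                  .flags           = KVM_MEM_PRIVATE,
                  .guest_phys_addr = 0x0,
                  .memory_size     = 0x40000000,          /* 1 GiB */
                  .userspace_addr  = (__u64)shared_hva,   /* for the shared part */
          },
          .private_fd     = private_fd,   /* fd from the private backing store */
          .private_offset = 0,
  };

  if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext) < 0)
          perror("KVM_SET_USER_MEMORY_REGION");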