Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-12-02 Thread Andy Lutomirski

On 11/19/21 05:47, Chao Peng wrote:

From: "Kirill A. Shutemov" 

The new seal type provides semantics required for KVM guest private
memory support. A file descriptor with the seal set is going to be used
as source of guest memory in confidential computing environments such as
Intel TDX and AMD SEV.

F_SEAL_GUEST can only be set on empty memfd. After the seal is set
userspace cannot read, write or mmap the memfd.


I don't have a strong objection here, but, given that you're only 
supporting it for memfd, would a memfd_create() flag be more 
straightforward?  If nothing else, it would avoid any possible locking 
issue.


I'm also very very slightly nervous about a situation in which one 
program sends a memfd to an untrusted other process and that process 
truncates the memfd and then F_SEAL_GUESTs it.  This could be mostly 
mitigated by also requiring that no other seals be set when F_SEAL_GUEST 
happens, but the alternative MFD_GUEST would eliminate this issue too.




Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-23 Thread Jason Gunthorpe
On Tue, Nov 23, 2021 at 10:06:02AM +0100, Paolo Bonzini wrote:

> I think it's great that memfd hooks are usable by more than one subsystem,
> OTOH it's fair that whoever needs it does the work---and VFIO does not need
> it for confidential VMs, yet, so it should be fine for now to have a single
> user.

I think adding a new interface to a core kernel subsystem should come
with a greater requirement to work out something generally useful and
not be overly wedded to a single use case (e.g. F_SEAL_GUEST).

Especially if something like 'single user' is not just a small
implementation artifact but a key design tenet of the whole eventual
solution.

Jason



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-23 Thread David Hildenbrand
On 23.11.21 10:06, Paolo Bonzini wrote:
> On 11/19/21 16:39, David Hildenbrand wrote:
>>> If qemu can put all the guest memory in a memfd and not map it, then
>>> I'd also like to see that the IOMMU can use this interface too so we
>>> can have VFIO working in this configuration.
>>
>> In QEMU we usually want to (and must) be able to access guest memory
>> from user space, with the current design we wouldn't even be able to
>> temporarily mmap it -- which makes sense for encrypted memory only. The
>> corner case really is encrypted memory. So I don't think we'll see a
>> broad use of this feature outside of encrypted VMs in QEMU. I might be
>> wrong, most probably I am:)
> 
> It's not _that_ crazy an idea, but it's going to be some work to teach 
> KVM that it has to kmap/kunmap around all memory accesses.

I'm also concerned about userspace access. But you sound like you have a
plan :)

-- 
Thanks,

David / dhildenb




Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-23 Thread Chao Peng
On Tue, Nov 23, 2021 at 10:06:02AM +0100, Paolo Bonzini wrote:
> On 11/19/21 16:39, David Hildenbrand wrote:
> > > If qemu can put all the guest memory in a memfd and not map it, then
> > > I'd also like to see that the IOMMU can use this interface too so we
> > > can have VFIO working in this configuration.
> > 
> > In QEMU we usually want to (and must) be able to access guest memory
> > from user space, with the current design we wouldn't even be able to
> > temporarily mmap it -- which makes sense for encrypted memory only. The
> > corner case really is encrypted memory. So I don't think we'll see a
> > broad use of this feature outside of encrypted VMs in QEMU. I might be
> > wrong, most probably I am:)
> 
> It's not _that_ crazy an idea, but it's going to be some work to teach KVM
> that it has to kmap/kunmap around all memory accesses.
> 
> I think it's great that memfd hooks are usable by more than one subsystem,
> OTOH it's fair that whoever needs it does the work---and VFIO does not need
> it for confidential VMs, yet, so it should be fine for now to have a single
> user.
> 
> On the other hand, as I commented already, the lack of locking in the
> register/unregister functions has to be fixed even with a single user.
> Another thing we can do already is change the guest_ops/guest_mem_ops to
> something like memfd_falloc_notifier_ops/memfd_pfn_ops, and the
> register/unregister functions to memfd_register/unregister_falloc_notifier.

I'm satisfied with this naming ;)

> 
> Chao, can you also put this under a new CONFIG such as "bool MEMFD_OPS", and
> select it from KVM?

Yes, reasonable.

> 
> Thanks,
> 
> Paolo



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-23 Thread Paolo Bonzini

On 11/19/21 16:39, David Hildenbrand wrote:

If qemu can put all the guest memory in a memfd and not map it, then
I'd also like to see that the IOMMU can use this interface too so we
can have VFIO working in this configuration.


In QEMU we usually want to (and must) be able to access guest memory
from user space, with the current design we wouldn't even be able to
temporarily mmap it -- which makes sense for encrypted memory only. The
corner case really is encrypted memory. So I don't think we'll see a
broad use of this feature outside of encrypted VMs in QEMU. I might be
wrong, most probably I am:)


It's not _that_ crazy an idea, but it's going to be some work to teach 
KVM that it has to kmap/kunmap around all memory accesses.


I think it's great that memfd hooks are usable by more than one 
subsystem, OTOH it's fair that whoever needs it does the work---and VFIO 
does not need it for confidential VMs, yet, so it should be fine for now 
to have a single user.


On the other hand, as I commented already, the lack of locking in the 
register/unregister functions has to be fixed even with a single user. 
Another thing we can do already is change the guest_ops/guest_mem_ops to 
something like memfd_falloc_notifier_ops/memfd_pfn_ops, and the 
register/unregister functions to memfd_register/unregister_falloc_notifier.


Chao, can you also put this under a new CONFIG such as "bool MEMFD_OPS", 
and select it from KVM?
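Concretely, the memfd.h side could look roughly like the sketch below -- only
an illustration of the renaming plus the config symbol suggested above; the
contents of the ops structures stay whatever the series ends up defining:

#ifdef CONFIG_MEMFD_OPS
int memfd_register_falloc_notifier(struct inode *inode, void *owner,
                                   const struct memfd_falloc_notifier_ops *ops,
                                   const struct memfd_pfn_ops **pfn_ops);
#else
static inline int
memfd_register_falloc_notifier(struct inode *inode, void *owner,
                               const struct memfd_falloc_notifier_ops *ops,
                               const struct memfd_pfn_ops **pfn_ops)
{
        return -EINVAL;
}
#endif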


Thanks,

Paolo



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-23 Thread Paolo Bonzini

On 11/19/21 14:47, Chao Peng wrote:

+static void guest_invalidate_page(struct inode *inode,
+                                  struct page *page, pgoff_t start, pgoff_t end)
+{
+       struct shmem_inode_info *info = SHMEM_I(inode);
+
+       if (!info->guest_ops || !info->guest_ops->invalidate_page_range)
+               return;
+
+       start = max(start, page->index);
+       end = min(end, page->index + thp_nr_pages(page)) - 1;
+
+       info->guest_ops->invalidate_page_range(inode, info->guest_owner,
+                                              start, end);
+}


The lack of protection makes the API quite awkward to use;
the usual way to do this is with refcount_inc_not_zero (aka 
kvm_get_kvm_safe).


Can you use the shmem_inode_info spinlock to protect against this?  If 
register/unregister take the spinlock, the invalidate and fallocate can 
take a reference under the same spinlock, like this:


if (!info->guest_ops)
        return;

spin_lock(&info->lock);
ops = info->guest_ops;
if (!ops) {
        spin_unlock(&info->lock);
        return;
}

/* Calls kvm_get_kvm_safe.  */
r = ops->get_guest_owner(info->guest_owner);
spin_unlock(&info->lock);
if (r < 0)
        return;

start = max(start, page->index);
end = min(end, page->index + thp_nr_pages(page)) - 1;

ops->invalidate_page_range(inode, info->guest_owner,
                           start, end);
ops->put_guest_owner(info->guest_owner);

Considering that you have to take a mutex anyway in patch 13, and that 
the critical section here is very small, the extra indirect calls are 
cheaper than walking the vm_list; and it makes the API clearer.
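For completeness, the registration side could then look roughly like the
sketch below -- this is only my reading of the suggestion, not code from the
series, and "shmem_guest_mem_ops" stands in for whatever the static
guest_mem_ops instance in shmem.c is actually called:

int shmem_register_guest(struct inode *inode, void *owner,
                         const struct guest_ops *guest_ops,
                         const struct guest_mem_ops **guest_mem_ops)
{
        struct shmem_inode_info *info = SHMEM_I(inode);
        int ret = 0;

        if (!owner)
                return -EINVAL;

        spin_lock(&info->lock);
        if (info->guest_owner && info->guest_owner != owner) {
                ret = -EPERM;
        } else {
                info->guest_owner = owner;
                info->guest_ops = guest_ops;
                *guest_mem_ops = &shmem_guest_mem_ops;
        }
        spin_unlock(&info->lock);

        return ret;
}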


Paolo



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-22 Thread David Hildenbrand
On 22.11.21 16:09, Jason Gunthorpe wrote:
> On Mon, Nov 22, 2021 at 03:57:17PM +0100, David Hildenbrand wrote:
>> On 22.11.21 15:01, Jason Gunthorpe wrote:
>>> On Mon, Nov 22, 2021 at 02:35:49PM +0100, David Hildenbrand wrote:
>>>> On 22.11.21 14:31, Jason Gunthorpe wrote:
>>>>> On Mon, Nov 22, 2021 at 10:26:12AM +0100, David Hildenbrand wrote:
>>>>>
>>>>>> I do wonder if we want to support sharing such memfds between processes
>>>>>> in all cases ... we most certainly don't want to be able to share
>>>>>> encrypted memory between VMs (I heard that the kernel has to forbid
>>>>>> that). It would make sense in the use case you describe, though.
>>>>>
>>>>> If there is a F_SEAL_XX that blocks every kind of new access, who
>>>>> cares if userspace passes the FD around or not?
>>>> I was imagining that you actually would want to do some kind of "change
>>>> ownership". But yeah, the intended semantics and all use cases we have
>>>> in mind are not fully clear to me yet. If it's really "no new access"
>>>> (side note: is "access" the right word?) then sure, we can pass the fd
>>>> around.
>>>
>>> What is "ownership" in a world where kvm and iommu are reading pages
>>> out of the same fd?
>>
>> In the world of encrypted memory / TDX, KVM somewhat "owns" that memory
>> IMHO (for example, only it can migrate or swap out these pages; it
>> might be debatable whether the TDX module or KVM actually "owns" these pages).
> 
> Sounds like it is a swap provider more than an owner?

Yes, I think we can phrase it that way, + "migrate provider"


-- 
Thanks,

David / dhildenb




Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-22 Thread Jason Gunthorpe
On Mon, Nov 22, 2021 at 03:57:17PM +0100, David Hildenbrand wrote:
> On 22.11.21 15:01, Jason Gunthorpe wrote:
> > On Mon, Nov 22, 2021 at 02:35:49PM +0100, David Hildenbrand wrote:
> >> On 22.11.21 14:31, Jason Gunthorpe wrote:
> >>> On Mon, Nov 22, 2021 at 10:26:12AM +0100, David Hildenbrand wrote:
> >>>
> >>>> I do wonder if we want to support sharing such memfds between processes
> >>>> in all cases ... we most certainly don't want to be able to share
> >>>> encrypted memory between VMs (I heard that the kernel has to forbid
> >>>> that). It would make sense in the use case you describe, though.
> >>>
> >>> If there is a F_SEAL_XX that blocks every kind of new access, who
> >>> cares if userspace passes the FD around or not?
> >> I was imagining that you actually would want to do some kind of "change
> >> ownership". But yeah, the intended semantics and all use cases we have
> >> in mind are not fully clear to me yet. If it's really "no new access"
> >> (side note: is "access" the right word?) then sure, we can pass the fd
> >> around.
> > 
> > What is "ownership" in a world where kvm and iommu are reading pages
> > out of the same fd?
> 
> In the world of encrypted memory / TDX, KVM somewhat "owns" that memory
> IMHO (for example, only it can migrate or swap out these pages; it
> might be debatable whether the TDX module or KVM actually "owns" these pages).

Sounds like it is a swap provider more than an owner?

Jason



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-22 Thread David Hildenbrand
On 22.11.21 15:01, Jason Gunthorpe wrote:
> On Mon, Nov 22, 2021 at 02:35:49PM +0100, David Hildenbrand wrote:
>> On 22.11.21 14:31, Jason Gunthorpe wrote:
>>> On Mon, Nov 22, 2021 at 10:26:12AM +0100, David Hildenbrand wrote:
>>>
>>>> I do wonder if we want to support sharing such memfds between processes
>>>> in all cases ... we most certainly don't want to be able to share
>>>> encrypted memory between VMs (I heard that the kernel has to forbid
>>>> that). It would make sense in the use case you describe, though.
>>>
>>> If there is a F_SEAL_XX that blocks every kind of new access, who
>>> cares if userspace passes the FD around or not?
>> I was imagining that you actually would want to do some kind of "change
>> ownership". But yeah, the intended semantics and all use cases we have
>> in mind are not fully clear to me yet. If it's really "no new access"
>> (side note: is "access" the right word?) then sure, we can pass the fd
>> around.
> 
> What is "ownership" in a world where kvm and iommu are reading pages
> out of the same fd?

In the world of encrypted memory / TDX, KVM somewhat "owns" that memory
IMHO (for example, only it can migrate or swap out these pages; it
might be debatable whether the TDX module or KVM actually "owns" these pages).

-- 
Thanks,

David / dhildenb




Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-22 Thread Jason Gunthorpe
On Mon, Nov 22, 2021 at 02:35:49PM +0100, David Hildenbrand wrote:
> On 22.11.21 14:31, Jason Gunthorpe wrote:
> > On Mon, Nov 22, 2021 at 10:26:12AM +0100, David Hildenbrand wrote:
> > 
> >> I do wonder if we want to support sharing such memfds between processes
> >> in all cases ... we most certainly don't want to be able to share
> >> encrypted memory between VMs (I heard that the kernel has to forbid
> >> that). It would make sense in the use case you describe, though.
> > 
> > If there is a F_SEAL_XX that blocks every kind of new access, who
> > cares if userspace passes the FD around or not?
> I was imagining that you actually would want to do some kind of "change
> ownership". But yeah, the intended semantics and all use cases we have
> in mind are not fully clear to me yet. If it's really "no new access"
> (side note: is "access" the right word?) then sure, we can pass the fd
> around.

What is "ownership" in a world where kvm and iommu are reading pages
out of the same fd?

"No new access" makes sense to me: we have access through
read/write/mmap/splice/etc. and access to pages through the private
in-kernel interface (kvm, iommu).

Jason



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-22 Thread Kirill A. Shutemov
On Fri, Nov 19, 2021 at 02:51:11PM +0100, David Hildenbrand wrote:
> On 19.11.21 14:47, Chao Peng wrote:
> > From: "Kirill A. Shutemov" 
> > 
> > The new seal type provides semantics required for KVM guest private
> > memory support. A file descriptor with the seal set is going to be used
> > as source of guest memory in confidential computing environments such as
> > Intel TDX and AMD SEV.
> > 
> > F_SEAL_GUEST can only be set on empty memfd. After the seal is set
> > userspace cannot read, write or mmap the memfd.
> > 
> > Userspace is in charge of the guest memory lifecycle: it can allocate the
> > memory with fallocate() or punch a hole to free memory from the guest.
> > 
> > The file descriptor is passed down to KVM as the guest memory backend. KVM
> > registers itself as the owner of the memfd via memfd_register_guest().
> > 
> > KVM provides callbacks that need to be called on fallocate and punch
> > hole.
> > 
> > memfd_register_guest() returns callbacks that need to be used for
> > requesting a new page from the memfd.
> > 
> 
> Repeating the feedback I already shared in a private mail thread:
> 
> 
> As long as page migration / swapping is not supported, these pages
> behave like any longterm pinned pages (e.g., VFIO) or secretmem pages.
> 
> 1. These pages are not MOVABLE. They must not end up on ZONE_MOVABLE or
> MIGRATE_CMA.
> 
> That should be easy to handle, you have to adjust the gfp_mask to
>   mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> just as mm/secretmem.c:secretmem_file_create() does.

Okay, fair enough. mapping_set_unevictable() also makes sense.

> 2. These pages behave like mlocked pages and should be accounted as such.
> 
> This is probably where the accounting "fun" starts, but maybe it's
> easier than I think to handle.
> 
> See mm/secretmem.c:secretmem_mmap(), where we account the pages as
> VM_LOCKED and will consequently check per-process mlock limits. As we
> don't mmap(), the same approach cannot be reused.
> 
> See drivers/vfio/vfio_iommu_type1.c:vfio_pin_map_dma() and
> vfio_pin_pages_remote() on how to manually account via mm->locked_vm .
> 
> But it's a bit hairy because these pages are not actually mapped into
> the page tables of the MM, so it might need some thought. Similarly,
> these pages actually behave like "pinned" (as in mm->pinned_vm), but we
> just don't increase the refcount AFAIR. Again, accounting really is a
> bit hairy ...

Accounting is fun indeed. Non-mapped mlocked memory is going to be
confusing. Hm...

I will look closer.

-- 
 Kirill A. Shutemov



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-22 Thread David Hildenbrand
On 22.11.21 14:31, Jason Gunthorpe wrote:
> On Mon, Nov 22, 2021 at 10:26:12AM +0100, David Hildenbrand wrote:
> 
>> I do wonder if we want to support sharing such memfds between processes
>> in all cases ... we most certainly don't want to be able to share
>> encrypted memory between VMs (I heard that the kernel has to forbid
>> that). It would make sense in the use case you describe, though.
> 
> If there is a F_SEAL_XX that blocks every kind of new access, who
> cares if userspace passes the FD around or not?
I was imagining that you actually would want to do some kind of "change
ownership". But yeah, the intended semantics and all use cases we have
in mind are not fully clear to me yet. If it's really "no new access"
(side note: is "access" the right word?) then sure, we can pass the fd
around.

-- 
Thanks,

David / dhildenb




Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-22 Thread Jason Gunthorpe
On Mon, Nov 22, 2021 at 10:26:12AM +0100, David Hildenbrand wrote:

> I do wonder if we want to support sharing such memfds between processes
> in all cases ... we most certainly don't want to be able to share
> encrypted memory between VMs (I heard that the kernel has to forbid
> that). It would make sense in the use case you describe, though.

If there is a F_SEAL_XX that blocks every kind of new access, who
cares if userspace passes the FD around or not?

Jason



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-22 Thread David Hildenbrand
On 19.11.21 17:00, Jason Gunthorpe wrote:
> On Fri, Nov 19, 2021 at 04:39:15PM +0100, David Hildenbrand wrote:
> 
>>> If qemu can put all the guest memory in a memfd and not map it, then
>>> I'd also like to see that the IOMMU can use this interface too so we
>>> can have VFIO working in this configuration.
>>
>> In QEMU we usually want to (and must) be able to access guest memory
>> from user space, with the current design we wouldn't even be able to
>> temporarily mmap it -- which makes sense for encrypted memory only. The
>> corner case really is encrypted memory. So I don't think we'll see a
>> broad use of this feature outside of encrypted VMs in QEMU. I might be
>> wrong, most probably I am :)
> 
> Interesting..
> 
> The non-encrypted case I had in mind is the horrible flow in VFIO to
> support qemu re-execing itself (VFIO_DMA_UNMAP_FLAG_VADDR).

Thanks for sharing!

> 
> Here VFIO is connected to a VA in a mm_struct that will become invalid
> during the kexec period, but VFIO needs to continue to access it. For
> IOMMU cases this is OK because the memory is already pinned, but for
> the 'emulated iommu' used by mdevs pages are pinned dynamically. qemu
> needs to ensure that VFIO can continue to access the pages across the
> kexec, even though there is nothing to pin_user_pages() on.
> 
> This flow would work a lot better if VFIO was connected to the memfd
> that is storing the guest memory. Then it naturally doesn't get
> disrupted by exec() and we don't need the mess in the kernel..

I do wonder if we want to support sharing such memfds between processes
in all cases ... we most certainly don't want to be able to share
encrypted memory between VMs (I heard that the kernel has to forbid
that). It would make sense in the use case you describe, though.

> 
> I was wondering if we could get here using the direct_io APIs but this
> would do the job too.
> 
>> Apart from the special "encrypted memory" semantics, I assume nothing
>> speaks against allowing for mmaping these memfds, for example, for any
>> other VFIO use cases.
> 
> We will eventually have VFIO with "encrypted memory". There was a talk
> in LPC about the enabling work for this.

Yes, I heard about that as well. In the foreseeable future, we'll have
shared memory only visible for VFIO devices.

> 
> So, if the plan is to put fully encrypted memory inside a memfd, then
> we still will eventually need a way to pull the pfns it into the
> IOMMU, presumably along with the access control parameters needed to
> pass to the secure monitor to join a PCI device to the secure memory.

Long-term, agreed.

-- 
Thanks,

David / dhildenb




Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-20 Thread Jason Gunthorpe
On Sat, Nov 20, 2021 at 01:23:16AM +, Sean Christopherson wrote:
> On Fri, Nov 19, 2021, Jason Gunthorpe wrote:
> > On Fri, Nov 19, 2021 at 10:21:39PM +, Sean Christopherson wrote:
> > > On Fri, Nov 19, 2021, Jason Gunthorpe wrote:
> > > > On Fri, Nov 19, 2021 at 07:18:00PM +, Sean Christopherson wrote:
> > > > > No ideas for the kernel API, but that's also less concerning since
> > > > > it's not set in stone.  I'm also not sure that dedicated APIs for
> > > > > each high-ish level use case would be a bad thing, as the semantics
> > > > > are unlikely to be different to some extent.  E.g. for the KVM use
> > > > > case, there can be at most one guest associated with the fd, but
> > > > > there can be any number of VFIO devices attached to the fd.
> > > > 
> > > > Even the kvm thing is not a hard restriction when you take away
> > > > confidential compute.
> > > > 
> > > > Why can't we have multiple KVMs linked to the same FD if the memory
> > > > isn't encrypted? Sure it isn't actually useful but it should work
> > > > fine.
> > > 
> > > Hmm, true, but I want the KVM semantics to be 1:1 even if memory
> > > isn't encrypted.
> > 
> > That is policy and it doesn't belong hardwired into the kernel.
> 
> Agreed.  I had a blurb typed up about that policy just being an "exclusive"
> flag in the kernel API that KVM would set when creating a confidential VM,

I still think that is policy in the kernel, what is wrong with
userspace doing it?

> > Your explanation makes me think that the F_SEAL_XX isn't defined
> > properly. It should be a userspace trap door to prevent any new
> > external accesses, including establishing new kvms, iommu's, rdmas,
> > mmaps, read/write, etc.
> 
> Hmm, the way I was thinking of it is that the F_SEAL_XX itself would prevent
> mapping/accessing it from userspace, and that any policy beyond that would be
> done via kernel APIs and thus handled by whatever in-kernel agent can access
> the memory.  E.g. in the confidential VM case, without support for trusted
> devices, KVM would require that it be the sole owner of the file.

And how would kvm know if there is support for trusted devices?
Again seems like policy choices that should be left in userspace.

Especially for what could be a general in-kernel mechanism with many
users and not tightly linked to KVM as imagined here.

Jason



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-19 Thread Sean Christopherson
On Fri, Nov 19, 2021, Jason Gunthorpe wrote:
> On Fri, Nov 19, 2021 at 10:21:39PM +, Sean Christopherson wrote:
> > On Fri, Nov 19, 2021, Jason Gunthorpe wrote:
> > > On Fri, Nov 19, 2021 at 07:18:00PM +, Sean Christopherson wrote:
> > > > No ideas for the kernel API, but that's also less concerning since
> > > > it's not set in stone.  I'm also not sure that dedicated APIs for
> > > > each high-ish level use case would be a bad thing, as the semantics
> > > > are unlikely to be different to some extent.  E.g. for the KVM use
> > > > case, there can be at most one guest associated with the fd, but
> > > > there can be any number of VFIO devices attached to the fd.
> > > 
> > > Even the kvm thing is not a hard restriction when you take away
> > > confidential compute.
> > > 
> > > Why can't we have multiple KVMs linked to the same FD if the memory
> > > isn't encrypted? Sure it isn't actually useful but it should work
> > > fine.
> > 
> > Hmm, true, but I want the KVM semantics to be 1:1 even if memory
> > isn't encrypted.
> 
> That is policy and it doesn't belong hardwired into the kernel.

Agreed.  I had a blurb typed up about that policy just being an "exclusive" flag
in the kernel API that KVM would set when creating a confidential VM, but
deleted it and forgot to restore it when I went down the tangent of removing
userspace from the TCB without an assist from hardware/firmware.

> Your explanation makes me think that the F_SEAL_XX isn't defined
> properly. It should be a userspace trap door to prevent any new
> external accesses, including establishing new kvms, iommu's, rdmas,
> mmaps, read/write, etc.

Hmm, the way I was thinking of it is that the F_SEAL_XX itself would prevent
mapping/accessing it from userspace, and that any policy beyond that would be
done via kernel APIs and thus handled by whatever in-kernel agent can access the
memory.  E.g. in the confidential VM case, without support for trusted devices,
KVM would require that it be the sole owner of the file.
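To make the "exclusive" idea concrete, it could be as small as the sketch
below -- purely hypothetical, nothing like this exists in the series -- where
the policy bit lives in the registration call and KVM only sets it when
creating a confidential VM, so nothing is hardwired in the kernel:

#define MEMFD_GUEST_EXCLUSIVE   (1UL << 0)      /* hypothetical flag */

int memfd_register_guest(struct inode *inode, void *owner,
                         const struct guest_ops *guest_ops,
                         const struct guest_mem_ops **guest_mem_ops,
                         unsigned long flags);

/*
 * KVM call site for a confidential VM (sketch only):
 *   memfd_register_guest(inode, kvm, &kvm_guest_ops, &mem_ops,
 *                        MEMFD_GUEST_EXCLUSIVE);
 * With the flag set, a second registration would be refused; without it,
 * multiple consumers could be allowed.
 */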



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-19 Thread Jason Gunthorpe
On Fri, Nov 19, 2021 at 10:21:39PM +, Sean Christopherson wrote:
> On Fri, Nov 19, 2021, Jason Gunthorpe wrote:
> > On Fri, Nov 19, 2021 at 07:18:00PM +, Sean Christopherson wrote:
> > > No ideas for the kernel API, but that's also less concerning since
> > > it's not set in stone.  I'm also not sure that dedicated APIs for
> > > each high-ish level use case would be a bad thing, as the semantics
> > > are unlikely to be different to some extent.  E.g. for the KVM use
> > > case, there can be at most one guest associated with the fd, but
> > > there can be any number of VFIO devices attached to the fd.
> > 
> > Even the kvm thing is not a hard restriction when you take away
> > confidential compute.
> > 
> > Why can't we have multiple KVMs linked to the same FD if the memory
> > isn't encrypted? Sure it isn't actually useful but it should work
> > fine.
> 
> Hmm, true, but I want the KVM semantics to be 1:1 even if memory
> isn't encrypted.

That is policy and it doesn't belong hardwired into the kernel.

Your explanation makes me think that the F_SEAL_XX isn't defined
properly. It should be a userspace trap door to prevent any new
external accesses, including establishing new kvms, iommu's, rdmas,
mmaps, read/write, etc.

> It's not just avoiding the linked list, there's a trust element as
> well.  E.g. in the scenario where a device can access a confidential
> VM's encrypted private memory, the guest is still the "owner" of the
> memory and needs to explicitly grant access to a third party,
> e.g. the device or perhaps another VM.

Authorization is some other issue - the internal kAPI should be able
to indicate it is secured memory and the API user should do whatever
dance to gain access to it. Eg for VFIO ask the realm manager to
associate the pci_device with the owner realm.

Jason



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-19 Thread Sean Christopherson
On Fri, Nov 19, 2021, Jason Gunthorpe wrote:
> On Fri, Nov 19, 2021 at 07:18:00PM +, Sean Christopherson wrote:
> > No ideas for the kernel API, but that's also less concerning since
> > it's not set in stone.  I'm also not sure that dedicated APIs for
> > each high-ish level use case would be a bad thing, as the semantics
> > are unlikely to be different to some extent.  E.g. for the KVM use
> > case, there can be at most one guest associated with the fd, but
> > there can be any number of VFIO devices attached to the fd.
> 
> Even the kvm thing is not a hard restriction when you take away
> confidential compute.
> 
> Why can't we have multiple KVMs linked to the same FD if the memory
> isn't encrypted? Sure it isn't actually useful but it should work
> fine.

Hmm, true, but I want the KVM semantics to be 1:1 even if memory isn't
encrypted.  Encrypting memory with a key that isn't available to the host is
necessary to (mostly) remove the host kernel from the guest's TCB, but it's
not necessary to remove host userspace from the TCB.  KVM absolutely can and
should be able to do that without relying on additional hardware/firmware.
Ignoring attestation and whether or not the guest fully trusts the host kernel,
there's value in preventing a buggy or compromised userspace from
attacking/corrupting the guest by remapping guest memory or by mapping the
same memory into multiple guests.

> Supporting only one thing is just a way to avoid having a linked list
> of clients to broadcast invalidations too - for instance by using a
> standard notifier block...

It's not just avoiding the linked list, there's a trust element as well.
E.g. in the scenario where a device can access a confidential VM's encrypted
private memory, the guest is still the "owner" of the memory and needs to
explicitly grant access to a third party, e.g. the device or perhaps another VM.

That said, I'm certainly not dead set on having "guest" in the name, nor am I
opposed to implementing multi-consumer support from the get-go so we don't end
up with a mess later on.

> Also, how does dirty tracking work on this memory?

For KVM usage, KVM would provide the dirty bit info.  No idea how VFIO or other
use cases would work.



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-19 Thread Jason Gunthorpe
On Fri, Nov 19, 2021 at 07:18:00PM +, Sean Christopherson wrote:
> On Fri, Nov 19, 2021, David Hildenbrand wrote:
> > On 19.11.21 16:19, Jason Gunthorpe wrote:
> > > As designed the above looks useful to import a memfd to a VFIO
> > > container but could you consider some more generic naming than calling
> > > this 'guest' ?
> > 
> > +1 the guest terminology is somewhat sub-optimal.
> 
> For the F_SEAL part, maybe F_SEAL_UNMAPPABLE?

Perhaps INACCESSIBLE?

> No ideas for the kernel API, but that's also less concerning since
> it's not set in stone.  I'm also not sure that dedicated APIs for
> each high-ish level use case would be a bad thing, as the semantics
> are unlikely to be different to some extent.  E.g. for the KVM use
> case, there can be at most one guest associated with the fd, but
> there can be any number of VFIO devices attached to the fd.

Even the kvm thing is not a hard restriction when you take away
confidential compute.

Why can't we have multiple KVMs linked to the same FD if the memory
isn't encrypted? Sure it isn't actually useful but it should work
fine.

Supporting only one thing is just a way to avoid having a linked list
of clients to broadcast invalidations too - for instance by using a
standard notifier block...
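For illustration, a minimal sketch of that alternative (hypothetical -- the
per-inode notifier head does not exist in the series; it just reuses the
standard <linux/notifier.h> machinery so any number of consumers, KVM, VFIO,
RDMA, ..., can subscribe to invalidations):

#include <linux/notifier.h>

struct memfd_invalidate_range {
        struct inode *inode;
        pgoff_t start, end;
};

/* e.g. a "struct blocking_notifier_head guest_notifiers" in shmem_inode_info */

static void memfd_notify_invalidate(struct blocking_notifier_head *nh,
                                    struct memfd_invalidate_range *range)
{
        /* Broadcast the range to every registered consumer. */
        blocking_notifier_call_chain(nh, 0, range);
}

/* Each consumer registers its own notifier_block, e.g.:
 *   blocking_notifier_chain_register(&info->guest_notifiers, &kvm_nb);
 */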

Also, how does dirty tracking work on this memory?

Jason



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-19 Thread Sean Christopherson
On Fri, Nov 19, 2021, David Hildenbrand wrote:
> On 19.11.21 16:19, Jason Gunthorpe wrote:
> > As designed the above looks useful to import a memfd to a VFIO
> > container but could you consider some more generic naming than calling
> > this 'guest' ?
> 
> +1 the guest terminology is somewhat sub-optimal.

For the F_SEAL part, maybe F_SEAL_UNMAPPABLE?

No ideas for the kernel API, but that's also less concerning since it's not set
in stone.  I'm also not sure that dedicated APIs for each high-ish level use
case would be a bad thing, as the semantics are unlikely to be different to
some extent.  E.g. for the KVM use case, there can be at most one guest
associated with the fd, but there can be any number of VFIO devices attached
to the fd.



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-19 Thread Jason Gunthorpe
On Fri, Nov 19, 2021 at 04:39:15PM +0100, David Hildenbrand wrote:

> > If qemu can put all the guest memory in a memfd and not map it, then
> > I'd also like to see that the IOMMU can use this interface too so we
> > can have VFIO working in this configuration.
> 
> In QEMU we usually want to (and must) be able to access guest memory
> from user space, with the current design we wouldn't even be able to
> temporarily mmap it -- which makes sense for encrypted memory only. The
> corner case really is encrypted memory. So I don't think we'll see a
> broad use of this feature outside of encrypted VMs in QEMU. I might be
> wrong, most probably I am :)

Interesting..

The non-encrypted case I had in mind is the horrible flow in VFIO to
support qemu re-execing itself (VFIO_DMA_UNMAP_FLAG_VADDR).

Here VFIO is connected to a VA in a mm_struct that will become invalid
during the kexec period, but VFIO needs to continue to access it. For
IOMMU cases this is OK because the memory is already pinned, but for
the 'emulated iommu' used by mdevs pages are pinned dynamically. qemu
needs to ensure that VFIO can continue to access the pages across the
kexec, even though there is nothing to pin_user_pages() on.

This flow would work a lot better if VFIO was connected to the memfd
that is storing the guest memory. Then it naturally doesn't get
disrupted by exec() and we don't need the mess in the kernel..

I was wondering if we could get here using the direct_io APIs but this
would do the job too.

> Apart from the special "encrypted memory" semantics, I assume nothing
> speaks against allowing for mmaping these memfds, for example, for any
> other VFIO use cases.

We will eventually have VFIO with "encrypted memory". There was a talk
in LPC about the enabling work for this.

So, if the plan is to put fully encrypted memory inside a memfd, then
we still will eventually need a way to pull the pfns it into the
IOMMU, presumably along with the access control parameters needed to
pass to the secure monitor to join a PCI device to the secure memory.

Jason



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-19 Thread David Hildenbrand
On 19.11.21 16:19, Jason Gunthorpe wrote:
> On Fri, Nov 19, 2021 at 09:47:27PM +0800, Chao Peng wrote:
>> From: "Kirill A. Shutemov" 
>>
>> The new seal type provides semantics required for KVM guest private
>> memory support. A file descriptor with the seal set is going to be used
>> as source of guest memory in confidential computing environments such as
>> Intel TDX and AMD SEV.
>>
>> F_SEAL_GUEST can only be set on empty memfd. After the seal is set
>> userspace cannot read, write or mmap the memfd.
>>
>> Userspace is in charge of the guest memory lifecycle: it can allocate the
>> memory with fallocate() or punch a hole to free memory from the guest.
>>
>> The file descriptor is passed down to KVM as the guest memory backend. KVM
>> registers itself as the owner of the memfd via memfd_register_guest().
>>
>> KVM provides callbacks that need to be called on fallocate and punch
>> hole.
>>
>> memfd_register_guest() returns callbacks that need to be used for
>> requesting a new page from the memfd.
>>
>> Signed-off-by: Kirill A. Shutemov 
>> Signed-off-by: Chao Peng 
>>  include/linux/memfd.h  |  24 
>>  include/linux/shmem_fs.h   |   9 +++
>>  include/uapi/linux/fcntl.h |   1 +
>>  mm/memfd.c |  33 +-
>>  mm/shmem.c | 123 -
>>  5 files changed, 186 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
>> index 4f1600413f91..ff920ef28688 100644
>> +++ b/include/linux/memfd.h
>> @@ -4,13 +4,37 @@
>>  
>>  #include 
>>  
>> +struct guest_ops {
>> +void (*invalidate_page_range)(struct inode *inode, void *owner,
>> +  pgoff_t start, pgoff_t end);
>> +void (*fallocate)(struct inode *inode, void *owner,
>> +  pgoff_t start, pgoff_t end);
>> +};
>> +
>> +struct guest_mem_ops {
>> +unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset,
>> +  bool alloc, int *order);
>> +void (*put_unlock_pfn)(unsigned long pfn);
>> +
>> +};
> 
> Ignoring confidential compute for a moment
> 
> If qemu can put all the guest memory in a memfd and not map it, then
> I'd also like to see that the IOMMU can use this interface too so we
> can have VFIO working in this configuration.

In QEMU we usually want to (and must) be able to access guest memory
from user space, with the current design we wouldn't even be able to
temporarily mmap it -- which makes sense for encrypted memory only. The
corner case really is encrypted memory. So I don't think we'll see a
broad use of this feature outside of encrypted VMs in QEMU. I might be
wrong, most probably I am :)

> 
> As designed the above looks useful to import a memfd to a VFIO
> container but could you consider some more generic naming than calling
> this 'guest' ?

+1 the guest terminology is somewhat sub-optimal.

> 
> Along the same lines, to support fast migration, we'd want to be able
> to send these things to the RDMA subsystem as well so we can do data
> xfer. Very similar to VFIO.
> 
> Also, shouldn't this be two patches? F_SEAL is not really related to
> these accessors, is it?


Apart from the special "encrypted memory" semantics, I assume nothing
speaks against allowing for mmaping these memfds, for example, for any
other VFIO use cases.

-- 
Thanks,

David / dhildenb




Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-19 Thread Jason Gunthorpe
On Fri, Nov 19, 2021 at 09:47:27PM +0800, Chao Peng wrote:
> From: "Kirill A. Shutemov" 
> 
> The new seal type provides semantics required for KVM guest private
> memory support. A file descriptor with the seal set is going to be used
> as source of guest memory in confidential computing environments such as
> Intel TDX and AMD SEV.
> 
> F_SEAL_GUEST can only be set on empty memfd. After the seal is set
> userspace cannot read, write or mmap the memfd.
> 
> Userspace is in charge of the guest memory lifecycle: it can allocate the
> memory with fallocate() or punch a hole to free memory from the guest.
> 
> The file descriptor is passed down to KVM as the guest memory backend. KVM
> registers itself as the owner of the memfd via memfd_register_guest().
> 
> KVM provides callbacks that need to be called on fallocate and punch
> hole.
> 
> memfd_register_guest() returns callbacks that need to be used for
> requesting a new page from the memfd.
> 
> Signed-off-by: Kirill A. Shutemov 
> Signed-off-by: Chao Peng 
>  include/linux/memfd.h  |  24 
>  include/linux/shmem_fs.h   |   9 +++
>  include/uapi/linux/fcntl.h |   1 +
>  mm/memfd.c |  33 +-
>  mm/shmem.c | 123 -
>  5 files changed, 186 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> index 4f1600413f91..ff920ef28688 100644
> +++ b/include/linux/memfd.h
> @@ -4,13 +4,37 @@
>  
>  #include 
>  
> +struct guest_ops {
> + void (*invalidate_page_range)(struct inode *inode, void *owner,
> +   pgoff_t start, pgoff_t end);
> + void (*fallocate)(struct inode *inode, void *owner,
> +   pgoff_t start, pgoff_t end);
> +};
> +
> +struct guest_mem_ops {
> + unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset,
> +   bool alloc, int *order);
> + void (*put_unlock_pfn)(unsigned long pfn);
> +
> +};

Ignoring confidential compute for a moment

If qemu can put all the guest memory in a memfd and not map it, then
I'd also like to see that the IOMMU can use this interface too so we
can have VFIO working in this configuration.

As designed the above looks useful to import a memfd to a VFIO
container but could you consider some more generic naming than calling
this 'guest' ?

Along the same lines, to support fast migration, we'd want to be able
to send these things to the RDMA subsystem as well so we can do data
xfer. Very similar to VFIO.

Also, shouldn't this be two patches? F_SEAL is not really related to
these accessors, is it?

> +extern inline int memfd_register_guest(struct inode *inode, void *owner,
> +const struct guest_ops *guest_ops,
> +const struct guest_mem_ops **guest_mem_ops);

Why does this take an inode and not a file *?

> +int shmem_register_guest(struct inode *inode, void *owner,
> +  const struct guest_ops *guest_ops,
> +  const struct guest_mem_ops **guest_mem_ops)
> +{
> + struct shmem_inode_info *info = SHMEM_I(inode);
> +
> + if (!owner)
> + return -EINVAL;
> +
> + if (info->guest_owner && info->guest_owner != owner)
> + return -EPERM;

And this looks like it means only a single subsytem can use this API
at once, not so nice..

Jason



Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-19 Thread David Hildenbrand
On 19.11.21 14:47, Chao Peng wrote:
> From: "Kirill A. Shutemov" 
> 
> The new seal type provides semantics required for KVM guest private
> memory support. A file descriptor with the seal set is going to be used
> as source of guest memory in confidential computing environments such as
> Intel TDX and AMD SEV.
> 
> F_SEAL_GUEST can only be set on empty memfd. After the seal is set
> userspace cannot read, write or mmap the memfd.
> 
> Userspace is in charge of the guest memory lifecycle: it can allocate the
> memory with fallocate() or punch a hole to free memory from the guest.
> 
> The file descriptor is passed down to KVM as the guest memory backend. KVM
> registers itself as the owner of the memfd via memfd_register_guest().
> 
> KVM provides callbacks that need to be called on fallocate and punch
> hole.
> 
> memfd_register_guest() returns callbacks that need to be used for
> requesting a new page from the memfd.
> 

Repeating the feedback I already shared in a private mail thread:


As long as page migration / swapping is not supported, these pages
behave like any longterm pinned pages (e.g., VFIO) or secretmem pages.

1. These pages are not MOVABLE. They must not end up on ZONE_MOVABLE or
MIGRATE_CMA.

That should be easy to handle, you have to adjust the gfp_mask to
mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
just as mm/secretmem.c:secretmem_file_create() does.

2. These pages behave like mlocked pages and should be accounted as such.

This is probably where the accounting "fun" starts, but maybe it's
easier than I think to handle.

See mm/secretmem.c:secretmem_mmap(), where we account the pages as
VM_LOCKED and will consequently check per-process mlock limits. As we
don't mmap(), the same approach cannot be reused.

See drivers/vfio/vfio_iommu_type1.c:vfio_pin_map_dma() and
vfio_pin_pages_remote() on how to manually account via mm->locked_vm .

But it's a bit hairy because these pages are not actually mapped into
the page tables of the MM, so it might need some thought. Similarly,
these pages actually behave like "pinned" (as in mm->pinned_vm), but we
just don't increase the refcount AFAIR. Again, accounting really is a
bit hairy ...
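Point 1 would look roughly like the sketch below, mirroring
mm/secretmem.c:secretmem_file_create(); the function name and its placement
(wherever the seal is set) are made up here, only the two calls matter.
Point 2, the mlock/pinned accounting, is the part that still needs real thought.

#include <linux/pagemap.h>      /* mapping_set_gfp_mask(), mapping_set_unevictable() */
#include <linux/gfp.h>          /* GFP_HIGHUSER */

static void shmem_seal_guest_prepare(struct inode *inode)
{
        /* No __GFP_MOVABLE: pages must never land on ZONE_MOVABLE/MIGRATE_CMA. */
        mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
        /* Keep the pages off the LRU, like secretmem does. */
        mapping_set_unevictable(inode->i_mapping);
}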



-- 
Thanks,

David / dhildenb




[RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-11-19 Thread Chao Peng
From: "Kirill A. Shutemov" 

The new seal type provides semantics required for KVM guest private
memory support. A file descriptor with the seal set is going to be used
as source of guest memory in confidential computing environments such as
Intel TDX and AMD SEV.

F_SEAL_GUEST can only be set on empty memfd. After the seal is set
userspace cannot read, write or mmap the memfd.

Userspace is in charge of the guest memory lifecycle: it can allocate the
memory with fallocate() or punch a hole to free memory from the guest.

The file descriptor is passed down to KVM as the guest memory backend. KVM
registers itself as the owner of the memfd via memfd_register_guest().

KVM provides callbacks that need to be called on fallocate and punch
hole.

memfd_register_guest() returns callbacks that need to be used for
requesting a new page from the memfd.

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Chao Peng 
---
 include/linux/memfd.h  |  24 
 include/linux/shmem_fs.h   |   9 +++
 include/uapi/linux/fcntl.h |   1 +
 mm/memfd.c |  33 +-
 mm/shmem.c | 123 -
 5 files changed, 186 insertions(+), 4 deletions(-)

diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..ff920ef28688 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -4,13 +4,37 @@
 
 #include 
 
+struct guest_ops {
+   void (*invalidate_page_range)(struct inode *inode, void *owner,
+ pgoff_t start, pgoff_t end);
+   void (*fallocate)(struct inode *inode, void *owner,
+ pgoff_t start, pgoff_t end);
+};
+
+struct guest_mem_ops {
+   unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset,
+ bool alloc, int *order);
+   void (*put_unlock_pfn)(unsigned long pfn);
+
+};
+
 #ifdef CONFIG_MEMFD_CREATE
 extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
+
+extern inline int memfd_register_guest(struct inode *inode, void *owner,
+  const struct guest_ops *guest_ops,
+  const struct guest_mem_ops **guest_mem_ops);
 #else
 static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
 {
return -EINVAL;
 }
+static inline int memfd_register_guest(struct inode *inode, void *owner,
+  const struct guest_ops *guest_ops,
+  const struct guest_mem_ops **guest_mem_ops)
+{
+   return -EINVAL;
+}
 #endif
 
 #endif /* __LINUX_MEMFD_H */
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 166158b6e917..8280c918775a 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -12,6 +12,9 @@
 
 /* inode in-kernel data */
 
+struct guest_ops;
+struct guest_mem_ops;
+
 struct shmem_inode_info {
spinlock_t  lock;
unsigned intseals;  /* shmem seals */
@@ -25,6 +28,8 @@ struct shmem_inode_info {
struct simple_xattrsxattrs; /* list of xattrs */
atomic_tstop_eviction;  /* hold when working on inode */
struct inodevfs_inode;
+   void*guest_owner;
+   const struct guest_ops  *guest_ops;
 };
 
 struct shmem_sb_info {
@@ -96,6 +101,10 @@ extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
 extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
pgoff_t start, pgoff_t end);
 
+extern int shmem_register_guest(struct inode *inode, void *owner,
+   const struct guest_ops *guest_ops,
+   const struct guest_mem_ops **guest_mem_ops);
+
 /* Flag allocation requirements to shmem_getpage */
 enum sgp_type {
SGP_READ,   /* don't exceed i_size, don't allocate page */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 2f86b2ad6d7e..c79bc8572721 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -43,6 +43,7 @@
 #define F_SEAL_GROW0x0004  /* prevent file from growing */
 #define F_SEAL_WRITE   0x0008  /* prevent writes */
 #define F_SEAL_FUTURE_WRITE0x0010  /* prevent future writes while mapped */
+#define F_SEAL_GUEST   0x0020
 /* (1U << 31) is reserved for signed error codes */
 
 /*
diff --git a/mm/memfd.c b/mm/memfd.c
index 081dd33e6a61..a98b30bcf982 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -130,11 +130,25 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
return NULL;
 }
 
+int memfd_register_guest(struct inode *inode, void *owner,
+const struct guest_ops *guest_ops,
+const struct guest_mem_ops **guest_mem_ops)
+{
+   if (shmem_mapping(inode->i_mapping)) {
+   return shmem_register_guest(inode, owner,
+