Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

2022-11-17 Thread Chao Peng
On Wed, Nov 16, 2022 at 09:40:23AM +, Alex Bennée wrote:
> 
> Chao Peng  writes:
> 
> > On Mon, Nov 14, 2022 at 11:43:37AM +, Alex Bennée wrote:
> >> 
> >> Chao Peng  writes:
> >> 
> >> 
> >> > Introduction
> >> > 
> >> > KVM userspace being able to crash the host is horrible. Under current
> >> > KVM architecture, all guest memory is inherently accessible from KVM
> >> > userspace and is exposed to the mentioned crash issue. The goal of this
> >> > series is to provide a solution to align mm and KVM, on a userspace
> >> > inaccessible approach of exposing guest memory. 
> >> >
> >> > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> >> > virtual address (hva) from core mm page table (e.g. x86 userspace page
> >> > table). This requires guest memory being mmaped into KVM userspace, but
> >> > this is also the source where the mentioned crash issue can happen. In
> >> > theory, apart from those 'shared' memory for device emulation etc, guest
> >> > memory doesn't have to be mmaped into KVM userspace.
> >> >
> >> > This series introduces fd-based guest memory which will not be mmaped
> >> > into KVM userspace. KVM populates secondary page table by using a
> >> > fd/offset pair backed by a memory file system. The fd can be created
> >> > from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
> >> > directly interact with them with newly introduced in-kernel interface,
> >> > therefore remove the KVM userspace from the path of accessing/mmaping
> >> > the guest memory. 
> >> >
> >> > Kirill had a patch [2] to address the same issue in a different way. It
> >> > tracks guest encrypted memory at the 'struct page' level and relies on
> >> > HWPOISON to reject the userspace access. The patch has been discussed in
> >> > several online and offline threads and resulted in a design document [3]
> >> > which is also the original proposal for this series. Later this patch
> >> > series evolved as more comments received in community but the major
> >> > concepts in [3] still hold true so recommend reading.
> >> >
> >> > The patch series may also be useful for other usages, for example, pure
> >> > software approach may use it to harden itself against unintentional
> >> > access to guest memory. This series is designed with these usages in
> >> > mind but doesn't have code directly support them and extension might be
> >> > needed.
> >> 
> >> There are a couple of additional use cases where having a consistent
> >> memory interface with the kernel would be useful.
> >
> > Thanks very much for the info. But I'm not so confident that the current
> > memfd_restricted() implementation can be useful for all these usages. 
> >
> >> 
> >>   - Xen DomU guests providing other domains with VirtIO backends
> >> 
> >>   Xen by default doesn't give other domains special access to a domains
> >>   memory. The guest can grant access to regions of its memory to other
> >>   domains for this purpose. 
> >
> > I'm trying to form my understanding on how this could work and what's
> > the benefit for a DomU guest to provide memory through memfd_restricted().
> > AFAICS, memfd_restricted() can help to hide the memory from DomU userspace,
> > but I assume VirtIO backends are still in DomU userspace and need access to
> > that memory, right?
> 
> They need access to parts of the memory. At the moment you run your
> VirtIO domains in the Dom0 and give them access to the whole of a DomU's
> address space - however, in the Xen model a guest's memory is by default
> inaccessible to other domains on the system. The DomU guest uses the Xen
> grant model to expose portions of its address space to other domains -
> namely for the VirtIO queues themselves and any pages containing buffers
> involved in the VirtIO transaction. My thought was that looks like a
> guest memory interface which is mostly inaccessible (private) with some
> holes in it where memory is being explicitly shared with other domains.

Yes, similar in concept. For KVM, memfd_restricted() is used by the host
OS, and the guest issues conversions between private and shared for its
memory ranges. This is similar to a Xen DomU guest granting its memory to
other domains. Similarly, I guess that for memfd_restricted() to be really
useful for Xen, it should run in the VirtIO backend domain (i.e. the
equivalent of the host position for KVM).

> 
> What I want to achieve is a common userspace API with defined semantics
> for what happens when private and shared regions are accessed. Because
> having each hypervisor/confidential computing architecture define its
> own special API for accessing this memory is just a recipe for
> fragmentation and makes sharing common VirtIO backends impossible.

Yes, I agree. That's interesting to explore.

> 
> >
> >> 
> >>   - pKVM on ARM
> >> 
> >>   Similar to Xen, pKVM moves the management of the page tables into the
> >>   hypervisor and again doesn't allow those domains to share memory by
> >>   default.
> >
> > 

Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

2022-11-16 Thread Alex Bennée


Chao Peng  writes:

> On Mon, Nov 14, 2022 at 11:43:37AM +, Alex Bennée wrote:
>> 
>> Chao Peng  writes:
>> 
>> 
>> > Introduction
>> > 
>> > KVM userspace being able to crash the host is horrible. Under current
>> > KVM architecture, all guest memory is inherently accessible from KVM
>> > userspace and is exposed to the mentioned crash issue. The goal of this
>> > series is to provide a solution to align mm and KVM, on a userspace
>> > inaccessible approach of exposing guest memory. 
>> >
>> > Normally, KVM populates secondary page table (e.g. EPT) by using a host
>> > virtual address (hva) from core mm page table (e.g. x86 userspace page
>> > table). This requires guest memory being mmaped into KVM userspace, but
>> > this is also the source where the mentioned crash issue can happen. In
>> > theory, apart from those 'shared' memory for device emulation etc, guest
>> > memory doesn't have to be mmaped into KVM userspace.
>> >
>> > This series introduces fd-based guest memory which will not be mmaped
>> > into KVM userspace. KVM populates secondary page table by using a
>> > fd/offset pair backed by a memory file system. The fd can be created
>> > from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
>> > directly interact with them with newly introduced in-kernel interface,
>> > therefore remove the KVM userspace from the path of accessing/mmaping
>> > the guest memory. 
>> >
>> > Kirill had a patch [2] to address the same issue in a different way. It
>> > tracks guest encrypted memory at the 'struct page' level and relies on
>> > HWPOISON to reject the userspace access. The patch has been discussed in
>> > several online and offline threads and resulted in a design document [3]
>> > which is also the original proposal for this series. Later this patch
>> > series evolved as more comments received in community but the major
>> > concepts in [3] still hold true so recommend reading.
>> >
>> > The patch series may also be useful for other usages, for example, pure
>> > software approach may use it to harden itself against unintentional
>> > access to guest memory. This series is designed with these usages in
>> > mind but doesn't have code directly support them and extension might be
>> > needed.
>> 
>> There are a couple of additional use cases where having a consistent
>> memory interface with the kernel would be useful.
>
> Thanks very much for the info. But I'm not so confident that the current
> memfd_restricted() implementation can be useful for all these usages. 
>
>> 
>>   - Xen DomU guests providing other domains with VirtIO backends
>> 
>>   Xen by default doesn't give other domains special access to a domains
>>   memory. The guest can grant access to regions of its memory to other
>>   domains for this purpose. 
>
> I'm trying to form my understanding on how this could work and what's
> the benefit for a DomU guest to provide memory through memfd_restricted().
> AFAICS, memfd_restricted() can help to hide the memory from DomU userspace,
> but I assume VirtIO backends are still in DomU userspace and need access to
> that memory, right?

They need access to parts of the memory. At the moment you run your
VirtIO domains in the Dom0 and give them access to the whole of a DomU's
address space - however, in the Xen model a guest's memory is by default
inaccessible to other domains on the system. The DomU guest uses the Xen
grant model to expose portions of its address space to other domains -
namely for the VirtIO queues themselves and any pages containing buffers
involved in the VirtIO transaction. My thought was that looks like a
guest memory interface which is mostly inaccessible (private) with some
holes in it where memory is being explicitly shared with other domains.

What I want to achieve is a common userspace API with defined semantics
for what happens when private and shared regions are accessed. Because
having each hypervisor/confidential computing architecture define its
own special API for accessing this memory is just a recipe for
fragmentation and makes sharing common VirtIO backends impossible.

>
>> 
>>   - pKVM on ARM
>> 
>>   Similar to Xen, pKVM moves the management of the page tables into the
>>   hypervisor and again doesn't allow those domains to share memory by
>>   default.
>
> Right, we already had some discussions on this in the past versions.
>
>> 
>>   - VirtIO loopback
>> 
>>   This allows for VirtIO devices for the host kernel to be serviced by
>>   backends running in userspace. Obviously the memory userspace is
>>   allowed to access is strictly limited to the buffers and queues
>>   because giving userspace unrestricted access to the host kernel would
>>   have consequences.
>
> Okay, but normal memfd_create() should work for it, right? And
> memfd_restricted() instead may not work as it unmaps the memory from
> userspace.
>
>> 
>> All of these VirtIO backends work with vhost-user which uses memfds to
>> pass references to guest memory 

Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

2022-11-15 Thread Chao Peng
On Mon, Nov 14, 2022 at 11:43:37AM +, Alex Bennée wrote:
> 
> Chao Peng  writes:
> 
> 
> > Introduction
> > 
> > KVM userspace being able to crash the host is horrible. Under current
> > KVM architecture, all guest memory is inherently accessible from KVM
> > userspace and is exposed to the mentioned crash issue. The goal of this
> > series is to provide a solution to align mm and KVM, on a userspace
> > inaccessible approach of exposing guest memory. 
> >
> > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> > virtual address (hva) from core mm page table (e.g. x86 userspace page
> > table). This requires guest memory being mmaped into KVM userspace, but
> > this is also the source where the mentioned crash issue can happen. In
> > theory, apart from those 'shared' memory for device emulation etc, guest
> > memory doesn't have to be mmaped into KVM userspace.
> >
> > This series introduces fd-based guest memory which will not be mmaped
> > into KVM userspace. KVM populates secondary page table by using a
> > fd/offset pair backed by a memory file system. The fd can be created
> > from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
> > directly interact with them with newly introduced in-kernel interface,
> > therefore remove the KVM userspace from the path of accessing/mmaping
> > the guest memory. 
> >
> > Kirill had a patch [2] to address the same issue in a different way. It
> > tracks guest encrypted memory at the 'struct page' level and relies on
> > HWPOISON to reject the userspace access. The patch has been discussed in
> > several online and offline threads and resulted in a design document [3]
> > which is also the original proposal for this series. Later this patch
> > series evolved as more comments received in community but the major
> > concepts in [3] still hold true so recommend reading.
> >
> > The patch series may also be useful for other usages, for example, pure
> > software approach may use it to harden itself against unintentional
> > access to guest memory. This series is designed with these usages in
> > mind but doesn't have code directly support them and extension might be
> > needed.
> 
> There are a couple of additional use cases where having a consistent
> memory interface with the kernel would be useful.

Thanks very much for the info. But I'm not so confident that the current
memfd_restricted() implementation can be useful for all these usages. 

> 
>   - Xen DomU guests providing other domains with VirtIO backends
> 
>   Xen by default doesn't give other domains special access to a domains
>   memory. The guest can grant access to regions of its memory to other
>   domains for this purpose. 

I'm trying to form my understanding on how this could work and what's
the benefit for a DomU guest to provide memory through memfd_restricted().
AFAICS, memfd_restricted() can help to hide the memory from DomU userspace,
but I assume VirtIO backends are still in DomU userspace and need access to
that memory, right?

> 
>   - pKVM on ARM
> 
>   Similar to Xen, pKVM moves the management of the page tables into the
>   hypervisor and again doesn't allow those domains to share memory by
>   default.

Right, we already had some discussions on this in the past versions.

> 
>   - VirtIO loopback
> 
>   This allows for VirtIO devices for the host kernel to be serviced by
>   backends running in userspace. Obviously the memory userspace is
>   allowed to access is strictly limited to the buffers and queues
>   because giving userspace unrestricted access to the host kernel would
>   have consequences.

Okay, but normal memfd_create() should work for it, right? And
memfd_restricted() instead may not work as it unmaps the memory from
userspace.

> 
> All of these VirtIO backends work with vhost-user which uses memfds to
> pass references to guest memory from the VMM to the backend
> implementation.

It sounds to me like these are the places where normal memfd_create() fits.
VirtIO backends work on mmap-ed memory, which memfd_restricted() currently
does not provide. memfd_restricted() has a different design purpose: it
unmaps the memory from userspace and provides kernel callbacks so that
other kernel modules can use the memory through those callbacks instead of
through a userspace virtual address.

Chao
> 
> > mm change
> > =========
> > Introduces a new memfd_restricted system call which can create memory
> > file that is restricted from userspace access via normal MMU operations
> > like read(), write() or mmap() etc and the only way to use it is
> > passing it to a third kernel module like KVM and relying on it to
> > access the fd through the newly added restrictedmem kernel interface.
> > The restrictedmem interface bridges the memory file subsystems
> > (tmpfs/hugetlbfs etc) and their users (KVM in this case) and provides
> > bi-directional communication between them. 
> >
> >
> > KVM change
> > ==========
> > Extends the KVM memslot to 

Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

2022-11-15 Thread Kirill A. Shutemov
On Wed, Nov 09, 2022 at 06:54:04PM +0300, Kirill A. Shutemov wrote:
> On Mon, Nov 07, 2022 at 04:41:41PM -0800, Isaku Yamahata wrote:
> > On Thu, Nov 03, 2022 at 05:43:52PM +0530,
> > Vishal Annapurve  wrote:
> > 
> > > > On Tue, Oct 25, 2022 at 8:48 PM Chao Peng wrote:
> > > >
> > > > This patch series implements KVM guest private memory for confidential
> > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > TDX-protected guest memory, machine check can happen which can further
> > > > crash the running host system, this is terrible for multi-tenant
> > > > configurations. The host accesses include those from KVM userspace like
> > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > via a fd-based approach, but it can never access the guest memory
> > > > content.
> > > >
> > > > The patch series touches both core mm and KVM code. I appreciate
> > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > reviews are always welcome.
> > > >   - 01: mm change, target for mm tree
> > > >   - 02-08: KVM change, target for KVM tree
> > > >
> > > > Given KVM is the only current user for the mm part, I have chatted with
> > > > Paolo and he is OK to merge the mm change through KVM tree, but
> > > > reviewed-by/acked-by is still expected from the mm people.
> > > >
> > > > The patches have been verified in Intel TDX environment, but Vishal has
> > > > done an excellent work on the selftests[4] which are dedicated for this
> > > > series, making it possible to test this series without innovative
> > > > hardware and fancy steps of building a VM environment. See Test section
> > > > below for more info.
> > > >
> > > >
> > > > Introduction
> > > > 
> > > > KVM userspace being able to crash the host is horrible. Under current
> > > > KVM architecture, all guest memory is inherently accessible from KVM
> > > > userspace and is exposed to the mentioned crash issue. The goal of this
> > > > series is to provide a solution to align mm and KVM, on a userspace
> > > > inaccessible approach of exposing guest memory.
> > > >
> > > > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> > > > virtual address (hva) from core mm page table (e.g. x86 userspace page
> > > > table). This requires guest memory being mmaped into KVM userspace, but
> > > > this is also the source where the mentioned crash issue can happen. In
> > > > theory, apart from those 'shared' memory for device emulation etc, guest
> > > > memory doesn't have to be mmaped into KVM userspace.
> > > >
> > > > This series introduces fd-based guest memory which will not be mmaped
> > > > into KVM userspace. KVM populates secondary page table by using a
> > > 
> > > With no mappings in place for userspace VMM, IIUC, looks like the host
> > > kernel will not be able to find the culprit userspace process in case
> > > of Machine check error on guest private memory. As implemented in
> > > hwpoison_user_mappings, host kernel tries to look at the processes
> > > which have mapped the pfns with hardware error.
> > > 
> > > Is there a modification needed in mce handling logic of the host
> > > kernel to immediately send a signal to the vcpu thread accessing
> > > faulting pfn backing guest private memory?
> > 
> > mce_register_decode_chain() can be used.  MCE physical address(p->mce_addr)
> > > includes host key id in addition to real physical address.  By searching used
> > > hkid by KVM, we can determine if the page is assigned to guest TD or not. If
> > yes, send SIGBUS.
> > 
> > kvm_machine_check() can be enhanced for KVM specific use.  This is before
> > memory_failure() is called, though.
> > 
> > any other ideas?
> 
> That's too KVM-centric. It will not work for other possible user of
> restricted memfd.
> 
> I tried to find a way to get it right: we need to get restricted memfd
> code info about corrupted page so it can invalidate its users. On the next
> request of the page the user will see an error. In case of KVM, the error
> will likely escalate to SIGBUS.
> 
> The problem is that core-mm code that handles memory failure knows nothing
> about restricted memfd. It only sees that the page belongs to a normal
> memfd.
> 
> AFAICS, there's no way to get it intercepted from the shim level. shmem
> code has to be patched. shmem_error_remove_page() has to call into
> restricted memfd code.
> 
> Hugh, are you okay with this? Or maybe you have a better idea?

Okay, here is what I've come up with. It doesn't touch shmem code, but
hooks up directly into memory-failure.c. It is still ugly, but should be
tolerable.

restrictedmem_error_page() loops over all restrictedmem inodes. It is
slow, but memory failure is not a hot path (I hope).

Only build-tested. Chao, could you hook up ->error for KVM and get it
tested?
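
For readers following along, the idea can be modelled in a few lines of
Python. This is only a toy model of the approach described above (scan
every restricted file, mark the bad page, call each user's ->error hook),
not the actual patch. The RestrictedFile class, the error_page function
and the notifier list are hypothetical names used for illustration.

```python
class RestrictedFile:
    """Toy stand-in for one restrictedmem inode."""

    def __init__(self):
        self.pages = {}       # file offset -> page identifier
        self.poisoned = set()  # offsets that hit a memory failure
        self.notifiers = []   # callables standing in for the ->error hook

    def get_page(self, offset):
        # The next request for a corrupted page sees an error.
        if offset in self.poisoned:
            raise OSError("page has a hardware error")
        return self.pages[offset]


def error_page(files, page):
    """Scan every restricted file for the page and notify the file's users."""
    hits = 0
    for f in files:  # linear scan: slow, but memory failure is a rare event
        for offset, candidate in f.pages.items():
            if candidate == page:
                f.poisoned.add(offset)
                for notify in f.notifiers:
                    notify(offset)  # e.g. KVM would invalidate its mapping here
                hits += 1
    return hits
```

What a user such as KVM does inside its hook is left out of the model. The
only behaviour it tries to capture is that later lookups of the page fail.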

diff --git a/include/linux/restrictedmem.h 

Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

2022-11-14 Thread Alex Bennée


Chao Peng  writes:


> Introduction
> 
> KVM userspace being able to crash the host is horrible. Under current
> KVM architecture, all guest memory is inherently accessible from KVM
> userspace and is exposed to the mentioned crash issue. The goal of this
> series is to provide a solution to align mm and KVM, on a userspace
> inaccessible approach of exposing guest memory. 
>
> Normally, KVM populates secondary page table (e.g. EPT) by using a host
> virtual address (hva) from core mm page table (e.g. x86 userspace page
> table). This requires guest memory being mmaped into KVM userspace, but
> this is also the source where the mentioned crash issue can happen. In
> theory, apart from those 'shared' memory for device emulation etc, guest
> memory doesn't have to be mmaped into KVM userspace.
>
> This series introduces fd-based guest memory which will not be mmaped
> into KVM userspace. KVM populates secondary page table by using a
> fd/offset pair backed by a memory file system. The fd can be created
> from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
> directly interact with them with newly introduced in-kernel interface,
> therefore remove the KVM userspace from the path of accessing/mmaping
> the guest memory. 
>
> Kirill had a patch [2] to address the same issue in a different way. It
> tracks guest encrypted memory at the 'struct page' level and relies on
> HWPOISON to reject the userspace access. The patch has been discussed in
> several online and offline threads and resulted in a design document [3]
> which is also the original proposal for this series. Later this patch
> series evolved as more comments received in community but the major
> concepts in [3] still hold true so recommend reading.
>
> The patch series may also be useful for other usages, for example, pure
> software approach may use it to harden itself against unintentional
> access to guest memory. This series is designed with these usages in
> mind but doesn't have code directly support them and extension might be
> needed.

There are a couple of additional use cases where having a consistent
memory interface with the kernel would be useful.

  - Xen DomU guests providing other domains with VirtIO backends

  Xen by default doesn't give other domains special access to a domains
  memory. The guest can grant access to regions of its memory to other
  domains for this purpose. 

  - pKVM on ARM

  Similar to Xen, pKVM moves the management of the page tables into the
  hypervisor and again doesn't allow those domains to share memory by
  default.

  - VirtIO loopback

  This allows for VirtIO devices for the host kernel to be serviced by
  backends running in userspace. Obviously the memory userspace is
  allowed to access is strictly limited to the buffers and queues
  because giving userspace unrestricted access to the host kernel would
  have consequences.

All of these VirtIO backends work with vhost-user which uses memfds to
pass references to guest memory from the VMM to the backend
implementation.

> mm change
> =========
> Introduces a new memfd_restricted system call which can create memory
> file that is restricted from userspace access via normal MMU operations
> like read(), write() or mmap() etc and the only way to use it is
> passing it to a third kernel module like KVM and relying on it to
> access the fd through the newly added restrictedmem kernel interface.
> The restrictedmem interface bridges the memory file subsystems
> (tmpfs/hugetlbfs etc) and their users (KVM in this case) and provides
> bi-directional communication between them. 
>
>
> KVM change
> ==========
> Extends the KVM memslot to provide guest private (encrypted) memory from
> a fd. With this extension, a single memslot can maintain both private
> memory through private fd (restricted_fd/restricted_offset) and shared
> (unencrypted) memory through userspace mmaped host virtual address
> (userspace_addr). For a particular guest page, the corresponding page in
> KVM memslot can be only either private or shared and only one of the
> shared/private parts of the memslot is visible to guest. For how this
> new extension is used in QEMU, please refer to kvm_set_phys_mem() in
> below TDX-enabled QEMU repo.
>
> Introduces new KVM_EXIT_MEMORY_FAULT exit to allow userspace to get the
> chance on decision-making for shared <-> private memory conversion. The
> exit can be an implicit conversion in KVM page fault handler or an
> explicit conversion from guest OS.
>
> Extends existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to
> convert a guest page between private <-> shared. The data maintained in
> these ioctls tells the truth whether a guest page is private or shared
> and this information will be used in KVM page fault handler to decide
> whether the private or the shared part of the memslot is visible to
> guest.
>
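
As an aid to reading the memslot description quoted above, here is a rough
Python model of it. It is an illustration only, not code from the series
or from QEMU. The field names userspace_addr, restricted_fd and
restricted_offset come from the cover letter. Everything else (Memslot,
resolve, handle_fault, the set of private gfns, the 4 KiB page size) is
assumed for the example, and the mismatch rule in handle_fault is my
reading of the "implicit conversion" case rather than a statement of what
the patches do.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size for the example


@dataclass(frozen=True)
class Memslot:
    base_gfn: int           # first guest frame number covered by the slot
    npages: int
    userspace_addr: int     # hva backing the shared part
    restricted_fd: int      # fd backing the private part
    restricted_offset: int  # byte offset of the slot inside restricted_fd


def resolve(slot, gfn, private_gfns):
    """Return the backing that is visible to the guest for one gfn."""
    if not slot.base_gfn <= gfn < slot.base_gfn + slot.npages:
        raise ValueError("gfn outside memslot")
    delta = (gfn - slot.base_gfn) * PAGE_SIZE
    if gfn in private_gfns:
        return ("private", slot.restricted_fd, slot.restricted_offset + delta)
    return ("shared", slot.userspace_addr + delta)


def handle_fault(slot, gfn, wants_private, private_gfns):
    """Assumed policy: a private/shared mismatch is handed to userspace."""
    if wants_private != (gfn in private_gfns):
        return ("KVM_EXIT_MEMORY_FAULT", gfn, wants_private)
    return resolve(slot, gfn, private_gfns)
```

A given gfn resolves to exactly one of the two backings, which matches the
statement that only one of the shared/private parts is visible to the
guest at a time.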


-- 
Alex Bennée



Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

2022-11-09 Thread Kirill A. Shutemov
On Mon, Nov 07, 2022 at 04:41:41PM -0800, Isaku Yamahata wrote:
> On Thu, Nov 03, 2022 at 05:43:52PM +0530,
> Vishal Annapurve  wrote:
> 
> > > On Tue, Oct 25, 2022 at 8:48 PM Chao Peng wrote:
> > >
> > > This patch series implements KVM guest private memory for confidential
> > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > TDX-protected guest memory, machine check can happen which can further
> > > crash the running host system, this is terrible for multi-tenant
> > > configurations. The host accesses include those from KVM userspace like
> > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > via a fd-based approach, but it can never access the guest memory
> > > content.
> > >
> > > The patch series touches both core mm and KVM code. I appreciate
> > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > reviews are always welcome.
> > >   - 01: mm change, target for mm tree
> > >   - 02-08: KVM change, target for KVM tree
> > >
> > > Given KVM is the only current user for the mm part, I have chatted with
> > > Paolo and he is OK to merge the mm change through KVM tree, but
> > > reviewed-by/acked-by is still expected from the mm people.
> > >
> > > The patches have been verified in Intel TDX environment, but Vishal has
> > > done an excellent work on the selftests[4] which are dedicated for this
> > > series, making it possible to test this series without innovative
> > > hardware and fancy steps of building a VM environment. See Test section
> > > below for more info.
> > >
> > >
> > > Introduction
> > > 
> > > KVM userspace being able to crash the host is horrible. Under current
> > > KVM architecture, all guest memory is inherently accessible from KVM
> > > userspace and is exposed to the mentioned crash issue. The goal of this
> > > series is to provide a solution to align mm and KVM, on a userspace
> > > inaccessible approach of exposing guest memory.
> > >
> > > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> > > virtual address (hva) from core mm page table (e.g. x86 userspace page
> > > table). This requires guest memory being mmaped into KVM userspace, but
> > > this is also the source where the mentioned crash issue can happen. In
> > > theory, apart from those 'shared' memory for device emulation etc, guest
> > > memory doesn't have to be mmaped into KVM userspace.
> > >
> > > This series introduces fd-based guest memory which will not be mmaped
> > > into KVM userspace. KVM populates secondary page table by using a
> > 
> > With no mappings in place for userspace VMM, IIUC, looks like the host
> > kernel will not be able to find the culprit userspace process in case
> > of Machine check error on guest private memory. As implemented in
> > hwpoison_user_mappings, host kernel tries to look at the processes
> > which have mapped the pfns with hardware error.
> > 
> > Is there a modification needed in mce handling logic of the host
> > kernel to immediately send a signal to the vcpu thread accessing
> > faulting pfn backing guest private memory?
> 
> mce_register_decode_chain() can be used.  MCE physical address(p->mce_addr)
> includes host key id in addition to real physical address.  By searching used
> hkid by KVM, we can determine if the page is assigned to guest TD or not. If
> yes, send SIGBUS.
> 
> kvm_machine_check() can be enhanced for KVM specific use.  This is before
> memory_failure() is called, though.
> 
> any other ideas?

That's too KVM-centric. It will not work for other possible users of
restricted memfd.

I tried to find a way to get it right: we need to give the restricted memfd
code information about the corrupted page so it can invalidate its users.
On the next request for the page the user will see an error. In the case of
KVM, the error will likely escalate to SIGBUS.

The problem is that core-mm code that handles memory failure knows nothing
about restricted memfd. It only sees that the page belongs to a normal
memfd.

AFAICS, there's no way to get it intercepted from the shim level. shmem
code has to be patched. shmem_error_remove_page() has to call into
restricted memfd code.

Hugh, are you okay with this? Or maybe you have a better idea?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

2022-11-07 Thread Isaku Yamahata
On Thu, Nov 03, 2022 at 05:43:52PM +0530,
Vishal Annapurve  wrote:

> On Tue, Oct 25, 2022 at 8:48 PM Chao Peng  wrote:
> >
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > TDX-protected guest memory, machine check can happen which can further
> > crash the running host system, this is terrible for multi-tenant
> > configurations. The host accesses include those from KVM userspace like
> > QEMU. This series addresses KVM userspace induced crash by introducing
> > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > via a fd-based approach, but it can never access the guest memory
> > content.
> >
> > The patch series touches both core mm and KVM code. I appreciate
> > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > reviews are always welcome.
> >   - 01: mm change, target for mm tree
> >   - 02-08: KVM change, target for KVM tree
> >
> > Given KVM is the only current user for the mm part, I have chatted with
> > Paolo and he is OK to merge the mm change through KVM tree, but
> > reviewed-by/acked-by is still expected from the mm people.
> >
> > The patches have been verified in Intel TDX environment, but Vishal has
> > done an excellent work on the selftests[4] which are dedicated for this
> > series, making it possible to test this series without innovative
> > hardware and fancy steps of building a VM environment. See Test section
> > below for more info.
> >
> >
> > Introduction
> > 
> > KVM userspace being able to crash the host is horrible. Under current
> > KVM architecture, all guest memory is inherently accessible from KVM
> > userspace and is exposed to the mentioned crash issue. The goal of this
> > series is to provide a solution to align mm and KVM, on a userspace
> > inaccessible approach of exposing guest memory.
> >
> > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> > virtual address (hva) from core mm page table (e.g. x86 userspace page
> > table). This requires guest memory being mmaped into KVM userspace, but
> > this is also the source where the mentioned crash issue can happen. In
> > theory, apart from those 'shared' memory for device emulation etc, guest
> > memory doesn't have to be mmaped into KVM userspace.
> >
> > This series introduces fd-based guest memory which will not be mmaped
> > into KVM userspace. KVM populates secondary page table by using a
> 
> With no mappings in place for userspace VMM, IIUC, looks like the host
> kernel will not be able to find the culprit userspace process in case
> of Machine check error on guest private memory. As implemented in
> hwpoison_user_mappings, host kernel tries to look at the processes
> which have mapped the pfns with hardware error.
> 
> Is there a modification needed in mce handling logic of the host
> kernel to immediately send a signal to the vcpu thread accessing
> faulting pfn backing guest private memory?

mce_register_decode_chain() can be used.  The MCE physical address
(p->mce_addr) includes the host key ID (hkid) in addition to the real
physical address.  By searching the hkids in use by KVM, we can determine
whether the page is assigned to a guest TD.  If it is, send SIGBUS.
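
A rough sketch of the lookup described above, as a userspace toy rather
than kernel code.  Everything named toy_* is hypothetical, and the
position and width of the key ID field are passed in as parameters
because they are assumptions here, not values taken from this thread or
from any specification.  The sketch only shows the two steps: split the
reported address into key ID and real physical address, then check the
key ID against the set KVM has handed out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of where the key ID sits in a physical address. */
struct toy_keyid_layout {
    unsigned shift; /* bit position of the lowest key ID bit (assumed < 64) */
    unsigned bits;  /* number of key ID bits (assumed < 64) */
};

/* Extract the key ID field from a reported physical address. */
uint64_t toy_addr_keyid(uint64_t addr, struct toy_keyid_layout l)
{
    uint64_t mask = (1ULL << l.bits) - 1;
    return (addr >> l.shift) & mask;
}

/* Clear the key ID field, leaving the real physical address. */
uint64_t toy_addr_strip_keyid(uint64_t addr, struct toy_keyid_layout l)
{
    uint64_t mask = (1ULL << l.bits) - 1;
    return addr & ~(mask << l.shift);
}

/* Linear search of the key IDs currently assigned to guests. */
bool toy_keyid_in_use(const uint64_t *used, size_t n, uint64_t keyid)
{
    for (size_t i = 0; i < n; i++) {
        if (used[i] == keyid)
            return true;
    }
    return false;
}
```

If toy_keyid_in_use() returns true for the faulting address, that would
be the point at which the SIGBUS is sent.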

kvm_machine_check() can be enhanced for KVM-specific use.  That happens
before memory_failure() is called, though.

Any other ideas?
-- 
Isaku Yamahata 



Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

2022-11-03 Thread Vishal Annapurve
On Tue, Oct 25, 2022 at 8:48 PM Chao Peng  wrote:
>
> This patch series implements KVM guest private memory for confidential
> computing scenarios like Intel TDX[1]. If a TDX host accesses
> TDX-protected guest memory, machine check can happen which can further
> crash the running host system, this is terrible for multi-tenant
> configurations. The host accesses include those from KVM userspace like
> QEMU. This series addresses KVM userspace induced crash by introducing
> new mm and KVM interfaces so KVM userspace can still manage guest memory
> via a fd-based approach, but it can never access the guest memory
> content.
>
> The patch series touches both core mm and KVM code. I appreciate
> Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> reviews are always welcome.
>   - 01: mm change, target for mm tree
>   - 02-08: KVM change, target for KVM tree
>
> Given KVM is the only current user for the mm part, I have chatted with
> Paolo and he is OK to merge the mm change through KVM tree, but
> reviewed-by/acked-by is still expected from the mm people.
>
> The patches have been verified in Intel TDX environment, but Vishal has
> done an excellent work on the selftests[4] which are dedicated for this
> series, making it possible to test this series without innovative
> hardware and fancy steps of building a VM environment. See Test section
> below for more info.
>
>
> Introduction
> 
> KVM userspace being able to crash the host is horrible. Under current
> KVM architecture, all guest memory is inherently accessible from KVM
> userspace and is exposed to the mentioned crash issue. The goal of this
> series is to provide a solution to align mm and KVM, on a userspace
> inaccessible approach of exposing guest memory.
>
> Normally, KVM populates secondary page table (e.g. EPT) by using a host
> virtual address (hva) from core mm page table (e.g. x86 userspace page
> table). This requires guest memory being mmaped into KVM userspace, but
> this is also the source where the mentioned crash issue can happen. In
> theory, apart from those 'shared' memory for device emulation etc, guest
> memory doesn't have to be mmaped into KVM userspace.
>
> This series introduces fd-based guest memory which will not be mmaped
> into KVM userspace. KVM populates secondary page table by using a

With no mappings in place for the userspace VMM, IIUC, the host kernel
will not be able to find the culprit userspace process in case of a
machine check error on guest private memory. As implemented in
hwpoison_user_mappings(), the host kernel looks for the processes that
have mapped the pfns with the hardware error.

Is a modification needed in the MCE handling logic of the host kernel
to immediately send a signal to the vcpu thread accessing the faulting
pfn that backs guest private memory?


> fd/offset pair backed by a memory file system. The fd can be created
> from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
> directly interact with them with newly introduced in-kernel interface,
> therefore remove the KVM userspace from the path of accessing/mmaping
> the guest memory.
>
> Kirill had a patch [2] to address the same issue in a different way. It
> tracks guest encrypted memory at the 'struct page' level and relies on
> HWPOISON to reject the userspace access. The patch has been discussed in
> several online and offline threads and resulted in a design document [3]
> which is also the original proposal for this series. Later this patch
> series evolved as more comments received in community but the major
> concepts in [3] still hold true so recommend reading.
>
> The patch series may also be useful for other usages, for example, pure
> software approach may use it to harden itself against unintentional
> access to guest memory. This series is designed with these usages in
> mind but doesn't have code directly support them and extension might be
> needed.
>
>
> mm change
> =
> Introduces a new memfd_restricted system call which can create memory
> file that is restricted from userspace access via normal MMU operations
> like read(), write() or mmap() etc and the only way to use it is
> passing it to a third kernel module like KVM and relying on it to
> access the fd through the newly added restrictedmem kernel interface.
> The restrictedmem interface bridges the memory file subsystems
> (tmpfs/hugetlbfs etc) and their users (KVM in this case) and provides
> bi-directional communication between them.
>
>
> KVM change
> ==
> Extends the KVM memslot to provide guest private (encrypted) memory from
> a fd. With this extension, a single memslot can maintain both private
> memory through private fd (restricted_fd/restricted_offset) and shared
> (unencrypted) memory through userspace mmaped host virtual address
> (userspace_addr). For a particular guest page, the corresponding page in
> KVM memslot can be only either private or shared and only one of the
> shared/private parts 

[PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

2022-10-25 Thread Chao Peng
This patch series implements KVM guest private memory for confidential
computing scenarios like Intel TDX[1]. If a TDX host accesses
TDX-protected guest memory, a machine check can happen, which can in
turn crash the running host system. This is terrible for multi-tenant
configurations. The host accesses include those from KVM userspace like
QEMU. This series addresses KVM userspace induced crashes by introducing
new mm and KVM interfaces so that KVM userspace can still manage guest
memory via a fd-based approach, but can never access the guest memory
content.

The patch series touches both core mm and KVM code. I would appreciate
it if Andrew/Hugh and Paolo/Sean could review and pick up these patches.
Any other reviews are always welcome.
  - 01: mm change, target for mm tree
  - 02-08: KVM change, target for KVM tree

Given KVM is the only current user for the mm part, I have chatted with
Paolo and he is OK to merge the mm change through KVM tree, but
reviewed-by/acked-by is still expected from the mm people.

The patches have been verified in an Intel TDX environment, and Vishal
has done excellent work on the selftests[4] dedicated to this series,
making it possible to test the series without special hardware or the
elaborate steps of building a VM environment. See the Test section below
for more info.


Introduction
============
KVM userspace being able to crash the host is horrible. Under the
current KVM architecture, all guest memory is inherently accessible from
KVM userspace and is therefore exposed to the crash issue described
above. The goal of this series is to align mm and KVM on an approach
that exposes guest memory without making it accessible to userspace.

Normally, KVM populates the secondary page table (e.g. EPT) by using a
host virtual address (hva) from the core mm page table (e.g. the x86
userspace page table). This requires guest memory to be mmaped into KVM
userspace, but that mapping is also where the crash issue described
above can originate. In theory, apart from the 'shared' memory used for
device emulation etc, guest memory doesn't have to be mmaped into KVM
userspace.

This series introduces fd-based guest memory which will not be mmaped
into KVM userspace. KVM populates the secondary page table by using a
fd/offset pair backed by a memory file system. The fd can be created
from a supported memory filesystem like tmpfs/hugetlbfs, and KVM can
interact with it directly through a newly introduced in-kernel
interface, thereby removing KVM userspace from the path of
accessing/mmaping the guest memory.

Kirill had a patch [2] to address the same issue in a different way. It
tracks guest encrypted memory at the 'struct page' level and relies on
HWPOISON to reject userspace access. The patch has been discussed in
several online and offline threads and resulted in a design document [3]
which is also the original proposal for this series. This patch series
has since evolved as more comments were received from the community, but
the major concepts in [3] still hold true, so reading it is recommended.

The patch series may also be useful for other usages. For example, a
pure software approach may use it to harden itself against unintentional
access to guest memory. This series is designed with these usages in
mind but doesn't have code that directly supports them, and extensions
might be needed.


mm change
=========
Introduces a new memfd_restricted system call which creates a memory
file that is restricted from userspace access via normal MMU operations
like read(), write() or mmap() etc. The only way to use it is to pass
it to a third kernel module like KVM and rely on that module to access
the fd through the newly added restrictedmem kernel interface. The
restrictedmem interface bridges the memory file subsystems
(tmpfs/hugetlbfs etc) and their users (KVM in this case) and provides
bi-directional communication between them.


KVM change
==========
Extends the KVM memslot to provide guest private (encrypted) memory from
a fd. With this extension, a single memslot can maintain both private
memory through a private fd (restricted_fd/restricted_offset) and shared
(unencrypted) memory through a userspace mmaped host virtual address
(userspace_addr). For a particular guest page, the corresponding page in
the KVM memslot can be either private or shared but not both, and only
one of the shared/private parts of the memslot is visible to the guest.
For how this new extension is used in QEMU, please refer to
kvm_set_phys_mem() in the TDX-enabled QEMU repo below.
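
To make the two backings concrete, here is a minimal userspace sketch
of how a guest frame number could resolve to either backing within one
slot. It is a toy model, not the kernel's memslot structure: struct
toy_memslot, struct toy_backing, toy_slot_resolve() and the 4K page
size are assumptions made for illustration, and only the field names
restricted_fd, restricted_offset and userspace_addr come from the text
above.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12 /* assumed 4K pages */

/* Toy stand-in for a memslot carrying both backings. */
struct toy_memslot {
    uint64_t base_gfn;          /* first guest frame covered by the slot */
    uint64_t npages;            /* number of guest pages in the slot */
    uint64_t userspace_addr;    /* hva backing the shared part */
    int      restricted_fd;     /* fd backing the private part */
    uint64_t restricted_offset; /* byte offset of the slot within that fd */
};

/* Where a given guest page currently lives. */
struct toy_backing {
    bool     is_private;
    int      fd;     /* meaningful only when is_private */
    uint64_t offset; /* byte offset into fd, only when is_private */
    uint64_t hva;    /* meaningful only when !is_private */
};

/*
 * Resolve a gfn to exactly one backing. The private/shared state is
 * supplied by the caller; a page is never reported as both.
 * Returns false if the gfn is outside the slot.
 */
bool toy_slot_resolve(const struct toy_memslot *slot, uint64_t gfn,
                      bool gfn_is_private, struct toy_backing *out)
{
    if (gfn < slot->base_gfn || gfn - slot->base_gfn >= slot->npages)
        return false;

    uint64_t delta = (gfn - slot->base_gfn) << TOY_PAGE_SHIFT;

    out->is_private = gfn_is_private;
    if (gfn_is_private) {
        out->fd = slot->restricted_fd;
        out->offset = slot->restricted_offset + delta;
        out->hva = 0;
    } else {
        out->fd = -1;
        out->offset = 0;
        out->hva = slot->userspace_addr + delta;
    }
    return true;
}
```

The same page index selects either an fd offset or an hva, which is why
a single slot can carry both backings while the guest sees only one.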

Introduces a new KVM_EXIT_MEMORY_FAULT exit to give userspace a chance
to make decisions about shared <-> private memory conversion. The exit
can be caused by an implicit conversion in the KVM page fault handler or
by an explicit conversion requested by the guest OS.
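
The userspace side of such an exit could be sketched roughly as
follows. This is a toy model only: struct toy_memory_fault,
TOY_FAULT_FLAG_PRIVATE and toy_handle_memory_fault() are hypothetical
names and do not reproduce the uAPI added by this series, and the
per-page bool array merely stands in for whatever private/shared
bookkeeping the VMM and KVM really keep. A 4K page size and a tiny
guest are assumed.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12               /* assumed 4K pages */
#define TOY_NPAGES 64                   /* size of the toy guest */
#define TOY_FAULT_FLAG_PRIVATE (1u << 0) /* hypothetical: fault wants private */

/* Hypothetical payload handed to userspace with the exit. */
struct toy_memory_fault {
    uint32_t flags;
    uint64_t gpa;  /* start of the faulting range */
    uint64_t size; /* length of the range in bytes */
};

/* Toy VM state: one private/shared bit per guest page. */
struct toy_vm {
    bool page_private[TOY_NPAGES];
};

/*
 * Accept the conversion requested by the exit and record the new state
 * for every page in the range. Returns 0 on success, -1 if the range is
 * empty or falls outside the toy guest.
 */
int toy_handle_memory_fault(struct toy_vm *vm,
                            const struct toy_memory_fault *f)
{
    uint64_t first = f->gpa >> TOY_PAGE_SHIFT;
    uint64_t count = f->size >> TOY_PAGE_SHIFT;
    bool to_private = (f->flags & TOY_FAULT_FLAG_PRIVATE) != 0;

    if (count == 0 || first >= TOY_NPAGES || count > TOY_NPAGES - first)
        return -1;

    for (uint64_t i = 0; i < count; i++)
        vm->page_private[first + i] = to_private;
    return 0;
}
```

A real VMM would also be free to refuse the conversion at this point,
which is the decision-making the exit is meant to allow.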

Extends the existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to
convert a guest page between private <-> shared. The data maintained by
these ioctls is the source of truth for whether a guest page is private
or shared, and this information will be used in the KVM page fault
handler to decide whether the private or the