Re: [RFC PATCH 00/19] hugetlb support for KVM guest_mem

2023-06-21 Thread Vishal Annapurve
On Fri, Jun 16, 2023 at 11:28 AM Mike Kravetz  wrote:
>
> On 06/06/23 19:03, Ackerley Tng wrote:
> > Hello,
> >
> > This patchset builds upon a soon-to-be-published WIP patchset that Sean
> > published at https://github.com/sean-jc/linux/tree/x86/kvm_gmem_solo,
> > mentioned at [1].
> >
> > The tree can be found at:
> > https://github.com/googleprodkernel/linux-cc/tree/gmem-hugetlb-rfc-v1
> >
> > In this patchset, hugetlb support for KVM's guest_mem (aka gmem) is
> > introduced, allowing VM private memory (for confidential computing) to
> > be backed by hugetlb pages.
> >
> > guest_mem provides userspace with a handle, with which userspace can
> > allocate and deallocate memory for confidential VMs without mapping the
> > memory into userspace.
>
> Hello Ackerley,
>
> I am not sure if you are aware of, or have been following, the hugetlb HGM
> discussion in this thread:
> https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey/
>
> There we are trying to decide if HGM should be added to hugetlb, or if
> perhaps a new filesystem/driver/allocator should be created.  The concern
> is added complexity to hugetlb as well as core mm special casing.  Note
> that HGM is addressing issues faced by existing hugetlb users.
>
> Your proposal here suggests modifying hugetlb so that it can be used in
> a new way (use case) by KVM's guest_mem.  As such it really seems like
> something that should be done in a separate filesystem/driver/allocator.
> You will likely not get much support for modifying hugetlb.
>
> --
> Mike Kravetz
>

IIUC, mm/hugetlb.c implements the memory manager for hugetlb pages and
fs/hugetlbfs/inode.c implements the filesystem logic for hugetlbfs.

This series implements a new filesystem, parallel to hugetlbfs but with a
limited set of operations, that tries to reuse the hugetlb memory manager.
The effort here is not to add any new features to the hugetlb memory
manager, but to clean it up so that it can be used by a new filesystem.

guest_mem warrants a new filesystem since it supports only limited
operations on the underlying files, but it places no additional
restrictions on the underlying memory management. One could argue, though,
that memory management for guest_mem files could be kept very simple, in
line with the limited operations on the files.

If this series were to go the separate route of implementing a new memory
manager, one immediate requirement that might spring up would be converting
memory from hugetlb-managed memory to memory managed by this newly
introduced memory manager, and vice versa, at runtime, since there could be
a mix of VMs on the same platform using guest_mem and hugetlb.
Maybe this could be satisfied by having a separate global reservation pool
that is consumed by both, which, in my understanding, would need more
changes.

Using guest_mem for all VMs by default would be future work, contingent on
all existing use cases/requirements being satisfied.

Regards,
Vishal



Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

2023-01-10 Thread Vishal Annapurve
On Tue, Jan 10, 2023 at 1:19 AM Chao Peng  wrote:
> >
> > Regarding the userspace side of things, please include Vishal's selftests
> > in v11, it's impossible to properly review the uAPI changes without
> > seeing the userspace side of things.  I'm in the process of reviewing
> > Vishal's v2[*], I'll try to massage it into a set of patches that you can
> > incorporate into your series.
>
> Previously I included Vishal's selftests in the github repo, but did not
> include them in this patch series. It's OK for me to incorporate them
> directly into this series and review together if Vishal is fine.
>

Yeah, I am ok with incorporating selftest patches into this series and
reviewing them together.

Regards,
Vishal

> Chao
> >
> > [*] https://lore.kernel.org/all/20221205232341.4131240-1-vannapu...@google.com



Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE

2023-01-05 Thread Vishal Annapurve
On Thu, Dec 1, 2022 at 10:20 PM Chao Peng  wrote:
>
> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
> +                                        pgoff_t start, pgoff_t end,
> +                                        gfn_t *gfn_start, gfn_t *gfn_end)
> +{
> +   unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
> +
> +   if (start > base_pgoff)
> +   *gfn_start = slot->base_gfn + start - base_pgoff;

There should be a check for overflow here in case start is a very large
value. An additional check could look like:
if (start >= base_pgoff + slot->npages)
        return false;

> +   else
> +   *gfn_start = slot->base_gfn;
> +
> +   if (end < base_pgoff + slot->npages)
> +   *gfn_end = slot->base_gfn + end - base_pgoff;

If "end" is smaller than base_pgoff, this can cause overflow and
return the range as valid. There should be additional check:
if (end < base_pgoff)
 return false;


> +   else
> +   *gfn_end = slot->base_gfn + slot->npages;
> +
> +   if (*gfn_start >= *gfn_end)
> +   return false;
> +
> +   return true;
> +}
> +
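For illustration, the helper with both of the above checks folded in could
look like this (same field names as the quoted patch; a sketch only, not
the actual next revision):

static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
                                         pgoff_t start, pgoff_t end,
                                         gfn_t *gfn_start, gfn_t *gfn_end)
{
        unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;

        /* Reject file ranges that lie entirely outside the memslot. */
        if (start >= base_pgoff + slot->npages)
                return false;
        if (end < base_pgoff)
                return false;

        if (start > base_pgoff)
                *gfn_start = slot->base_gfn + start - base_pgoff;
        else
                *gfn_start = slot->base_gfn;

        if (end < base_pgoff + slot->npages)
                *gfn_end = slot->base_gfn + end - base_pgoff;
        else
                *gfn_end = slot->base_gfn + slot->npages;

        return *gfn_start < *gfn_end;
}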



Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

2022-12-01 Thread Vishal Annapurve
On Tue, Oct 25, 2022 at 8:18 AM Chao Peng  wrote:
>
> From: "Kirill A. Shutemov" 
>
> Introduce 'memfd_restricted' system call with the ability to create
> memory areas that are restricted from userspace access through ordinary
> MMU operations (e.g. read/write/mmap). The memory content is expected to
> be used through a new in-kernel interface by a third kernel module.
>
> memfd_restricted() is useful for scenarios where a file descriptor(fd)
> can be used as an interface into mm but want to restrict userspace's
> ability on the fd. Initially it is designed to provide protections for
> KVM encrypted guest memory.
>
> Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
> (e.g. QEMU) and then using the mmaped virtual address to setup the
> mapping in the KVM secondary page table (e.g. EPT). With confidential
> computing technologies like Intel TDX, the memfd memory may be encrypted
> with special key for special software domain (e.g. KVM guest) and is not
> expected to be directly accessed by userspace. Precisely, userspace
> access to such encrypted memory may lead to host crash so should be
> prevented.
>
> memfd_restricted() provides semantics required for KVM guest encrypted
> memory support that a fd created with memfd_restricted() is going to be
> used as the source of guest memory in confidential computing environment
> and KVM can directly interact with core-mm without the need to expose
> the memory content into KVM userspace.
>
> KVM userspace is still in charge of the lifecycle of the fd. It should
> pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> obtain the physical memory page and then uses it to populate the KVM
> secondary page table entries.
>
> The userspace restricted memfd can be fallocate-ed or hole-punched
> from userspace. When these operations happen, KVM can get notified
> through restrictedmem_notifier, it then gets chance to remove any
> mapped entries of the range in the secondary page tables.
>
> memfd_restricted() itself is implemented as a shim layer on top of real
> memory file systems (currently tmpfs). Pages in restrictedmem are marked
> as unmovable and unevictable, this is required for current confidential
> usage. But in future this might be changed.
>
> By default memfd_restricted() prevents userspace read, write and mmap.
> By defining new bit in the 'flags', it can be extended to support other
> restricted semantics in the future.
>
> The system call is currently wired up for x86 arch.
>
> Signed-off-by: Kirill A. Shutemov 
> Signed-off-by: Chao Peng 
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/restrictedmem.h  |  62 ++
>  include/linux/syscalls.h   |   1 +
>  include/uapi/asm-generic/unistd.h  |   5 +-
>  include/uapi/linux/magic.h |   1 +
>  kernel/sys_ni.c|   3 +
>  mm/Kconfig |   4 +
>  mm/Makefile|   1 +
>  mm/restrictedmem.c | 250 +
>  10 files changed, 328 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/restrictedmem.h
>  create mode 100644 mm/restrictedmem.c
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 320480a8db4f..dc70ba90247e 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -455,3 +455,4 @@
>  448i386process_mreleasesys_process_mrelease
>  449i386futex_waitv sys_futex_waitv
>  450i386set_mempolicy_home_node sys_set_mempolicy_home_node
> +451i386memfd_restrictedsys_memfd_restricted
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index c84d12608cd2..06516abc8318 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -372,6 +372,7 @@
>  448common  process_mreleasesys_process_mrelease
>  449common  futex_waitv sys_futex_waitv
>  450common  set_mempolicy_home_node sys_set_mempolicy_home_node
> +451common  memfd_restrictedsys_memfd_restricted
>
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> new file mode 100644
> index ..9c37c3ea3180
> --- /dev/null
> +++ b/include/linux/restrictedmem.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _LINUX_RESTRICTEDMEM_H
> +
> +#include 
> +#include 
> +#include 
> +
> +struct restrictedmem_notifier;
> +
> +struct restrictedmem_notifier_ops {
> +   void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> +pgoff_t start, pgoff_t end);
> +   void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> +                          pgoff_t start, pgoff_t end);

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-11-03 Thread Vishal Annapurve
On Mon, Oct 24, 2022 at 8:30 PM Kirill A. Shutemov wrote:
>
> On Fri, Oct 21, 2022 at 04:18:14PM +, Sean Christopherson wrote:
> > On Fri, Oct 21, 2022, Chao Peng wrote:
> > > >
> > > > In the context of userspace inaccessible memfd, what would be a
> > > > suggested way to enforce NUMA memory policy for physical memory
> > > > allocation? mbind[1] won't work here in absence of virtual address
> > > > range.
> > >
> > > How about set_mempolicy():
> > > https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html
> >
> > Andy Lutomirski brought this up in an off-list discussion way back when
> > the whole private-fd thing was first being proposed.
> >
> >   : The current Linux NUMA APIs (mbind, move_pages) work on virtual
> >   : addresses.  If we want to support them for TDX private memory, we
> >   : either need TDX private memory to have an HVA or we need file-based
> >   : equivalents. Arguably we should add fmove_pages and fbind syscalls
> >   : anyway, since the current API is quite awkward even for tools like
> >   : numactl.
>
> Yeah, we definitely have gaps in the API wrt NUMA, but I don't think it
> can be addressed in the initial submission.
>
> BTW, it is not a regression compared to old KVM slots, if the memory is
> backed by a memfd or other file:
>
> MBIND(2)
>        The specified policy will be ignored for any MAP_SHARED mappings in
>        the specified memory range.  Rather the pages will be allocated
>        according to the memory policy of the thread that caused the page to
>        be allocated.  Again, this may not be the thread that called mbind().
>
> It is not clear how to define fbind(2) semantics, considering that multiple
> processes may compete for the same region of page cache.
>
> Should it be per-inode or per-fd? Or maybe per-range in inode/fd?
>

David's analysis of mempolicy with shmem seems to be right: set_policy on
a virtual address range does seem to change the shared policy for the
inode irrespective of the mapping type.

Maybe having a way to set a NUMA policy per-range in the inode would be on
par with what we can do today via mbind on virtual address ranges.
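For illustration, a per-range, file-based equivalent along the lines Andy
suggested could take a shape like this (hypothetical syscall, sketched here
only to make the idea concrete; it does not exist in any kernel):

/* hypothetical: apply a NUMA policy to a byte range of a file */
long fbind(int fd, loff_t offset, size_t len, int mode,
           const unsigned long *nodemask, unsigned long maxnode);

The per-inode vs per-fd vs per-range question above would then show up as
the choice of where this policy gets recorded and which users it binds.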



> fmove_pages(2) should be relatively straightforward, since it is
> best-effort and does not guarantee that the page will not be moved
> somewhere else just after return from the syscall.
>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-11-03 Thread Vishal Annapurve
On Tue, Oct 25, 2022 at 8:48 PM Chao Peng  wrote:
>
> This patch series implements KVM guest private memory for confidential
> computing scenarios like Intel TDX[1]. If a TDX host accesses
> TDX-protected guest memory, machine check can happen which can further
> crash the running host system, this is terrible for multi-tenant
> configurations. The host accesses include those from KVM userspace like
> QEMU. This series addresses KVM userspace induced crash by introducing
> new mm and KVM interfaces so KVM userspace can still manage guest memory
> via a fd-based approach, but it can never access the guest memory
> content.
>
> The patch series touches both core mm and KVM code. I appreciate
> Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> reviews are always welcome.
>   - 01: mm change, target for mm tree
>   - 02-08: KVM change, target for KVM tree
>
> Given KVM is the only current user for the mm part, I have chatted with
> Paolo and he is OK to merge the mm change through KVM tree, but
> reviewed-by/acked-by is still expected from the mm people.
>
> The patches have been verified in Intel TDX environment, but Vishal has
> done an excellent work on the selftests[4] which are dedicated for this
> series, making it possible to test this series without innovative
> hardware and fancy steps of building a VM environment. See Test section
> below for more info.
>
>
> Introduction
> 
> KVM userspace being able to crash the host is horrible. Under current
> KVM architecture, all guest memory is inherently accessible from KVM
> userspace and is exposed to the mentioned crash issue. The goal of this
> series is to provide a solution to align mm and KVM, on a userspace
> inaccessible approach of exposing guest memory.
>
> Normally, KVM populates secondary page table (e.g. EPT) by using a host
> virtual address (hva) from core mm page table (e.g. x86 userspace page
> table). This requires guest memory being mmaped into KVM userspace, but
> this is also the source where the mentioned crash issue can happen. In
> theory, apart from those 'shared' memory for device emulation etc, guest
> memory doesn't have to be mmaped into KVM userspace.
>
> This series introduces fd-based guest memory which will not be mmaped
> into KVM userspace. KVM populates secondary page table by using a

With no mappings in place for the userspace VMM, IIUC, it looks like the
host kernel will not be able to find the culprit userspace process in case
of a machine check error on guest private memory. As implemented in
hwpoison_user_mappings(), the host kernel tries to look at the processes
which have mapped the PFNs with the hardware error.

Is there a modification needed in the MCE handling logic of the host
kernel to immediately send a signal to the vcpu thread accessing the
faulting PFN backing guest private memory?


> fd/offset pair backed by a memory file system. The fd can be created
> from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
> directly interact with them with newly introduced in-kernel interface,
> therefore remove the KVM userspace from the path of accessing/mmaping
> the guest memory.
>
> Kirill had a patch [2] to address the same issue in a different way. It
> tracks guest encrypted memory at the 'struct page' level and relies on
> HWPOISON to reject the userspace access. The patch has been discussed in
> several online and offline threads and resulted in a design document [3]
> which is also the original proposal for this series. Later this patch
> series evolved as more comments received in community but the major
> concepts in [3] still hold true so recommend reading.
>
> The patch series may also be useful for other usages, for example, pure
> software approach may use it to harden itself against unintentional
> access to guest memory. This series is designed with these usages in
> mind but doesn't have code directly support them and extension might be
> needed.
>
>
> mm change
> =
> Introduces a new memfd_restricted system call which can create memory
> file that is restricted from userspace access via normal MMU operations
> like read(), write() or mmap() etc and the only way to use it is
> passing it to a third kernel module like KVM and relying on it to
> access the fd through the newly added restrictedmem kernel interface.
> The restrictedmem interface bridges the memory file subsystems
> (tmpfs/hugetlbfs etc) and their users (KVM in this case) and provides
> bi-directional communication between them.
>
>
> KVM change
> ==
> Extends the KVM memslot to provide guest private (encrypted) memory from
> a fd. With this extension, a single memslot can maintain both private
> memory through private fd (restricted_fd/restricted_offset) and shared
> (unencrypted) memory through userspace mmaped host virtual address
> (userspace_addr). For a particular guest page, the corresponding page in
> KVM memslot can be only either private or shared and only one of the
> shared/private parts 

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-19 Thread Vishal Annapurve
On Thu, Sep 15, 2022 at 8:04 PM Chao Peng  wrote:
>
> From: "Kirill A. Shutemov" 
>
> KVM can use memfd-provided memory for guest memory. For normal userspace
> accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> virtual address space and then tells KVM to use the virtual address to
> setup the mapping in the secondary page table (e.g. EPT).
>
> With confidential computing technologies like Intel TDX, the
> memfd-provided memory may be encrypted with special key for special
> software domain (e.g. KVM guest) and is not expected to be directly
> accessed by userspace. Precisely, userspace access to such encrypted
> memory may lead to host crash so it should be prevented.
>
> This patch introduces userspace inaccessible memfd (created with
> MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> in-kernel interface so KVM can directly interact with core-mm without
> the need to map the memory into KVM userspace.
>
> It provides semantics required for KVM guest private(encrypted) memory
> support that a file descriptor with this flag set is going to be used as
> the source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV.
>
> KVM userspace is still in charge of the lifecycle of the memfd. It
> should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> in this patch to obtain the physical memory address and then populate
> the secondary page table entries.
>
> The userspace inaccessible memfd can be fallocate-ed and hole-punched
> from userspace. When hole-punching happens, KVM can get notified through
> inaccessible_notifier it then gets chance to remove any mapped entries
> of the range in the secondary page tables.
>
> The userspace inaccessible memfd itself is implemented as a shim layer
> on top of real memory file systems like tmpfs/hugetlbfs but this patch
> only implemented tmpfs. The allocated memory is currently marked as
> unmovable and unevictable, this is required for current confidential
> usage. But in future this might be changed.
>
> Signed-off-by: Kirill A. Shutemov 
> Signed-off-by: Chao Peng 
> ---
>  include/linux/memfd.h  |  24 
>  include/uapi/linux/magic.h |   1 +
>  include/uapi/linux/memfd.h |   1 +
>  mm/Makefile|   2 +-
>  mm/memfd.c |  25 -
>  mm/memfd_inaccessible.c| 219 +
>  6 files changed, 270 insertions(+), 2 deletions(-)
>  create mode 100644 mm/memfd_inaccessible.c
>
> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> index 4f1600413f91..334ddff08377 100644
> --- a/include/linux/memfd.h
> +++ b/include/linux/memfd.h
> @@ -3,6 +3,7 @@
>  #define __LINUX_MEMFD_H
>
>  #include 
> +#include 
>
>  #ifdef CONFIG_MEMFD_CREATE
> extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
> @@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
>  }
>  #endif
>
> +struct inaccessible_notifier;
> +
> +struct inaccessible_notifier_ops {
> +   void (*invalidate)(struct inaccessible_notifier *notifier,
> +  pgoff_t start, pgoff_t end);
> +};
> +
> +struct inaccessible_notifier {
> +   struct list_head list;
> +   const struct inaccessible_notifier_ops *ops;
> +};
> +
> +void inaccessible_register_notifier(struct file *file,
> +   struct inaccessible_notifier *notifier);
> +void inaccessible_unregister_notifier(struct file *file,
> + struct inaccessible_notifier *notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +int *order);
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn);
> +
> +struct file *memfd_mkinaccessible(struct file *memfd);
> +
>  #endif /* __LINUX_MEMFD_H */
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..9d066be3d7e8 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
>  #define DMA_BUF_MAGIC  0x444d4142  /* "DMAB" */
>  #define DEVMEM_MAGIC   0x454d444d  /* "DMEM" */
>  #define SECRETMEM_MAGIC0x5345434d  /* "SECM" */
> +#define INACCESSIBLE_MAGIC 0x494e4143  /* "INAC" */
>
>  #endif /* __LINUX_MAGIC_H__ */
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> index 7a8a26751c23..48750474b904 100644
> --- a/include/uapi/linux/memfd.h
> +++ b/include/uapi/linux/memfd.h
> @@ -8,6 +8,7 @@
>  #define MFD_CLOEXEC0x0001U
>  #define MFD_ALLOW_SEALING  0x0002U
>  #define MFD_HUGETLB0x0004U
> +#define MFD_INACCESSIBLE   0x0008U
>
>  /*
>   * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> diff --git a/mm/Makefile b/mm/Makefile
> index 9a564f836403..f82e5d4b4388 100644
> --- 

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-18 Thread Vishal Annapurve
On Tue, Oct 18, 2022 at 3:27 AM Kirill A. Shutemov wrote:
>
> On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > From: "Kirill A. Shutemov" 
> > > > >
> > > > > KVM can use memfd-provided memory for guest memory. For normal 
> > > > > userspace
> > > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > > > virtual address space and then tells KVM to use the virtual address to
> > > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > >
> > > > > With confidential computing technologies like Intel TDX, the
> > > > > memfd-provided memory may be encrypted with special key for special
> > > > > software domain (e.g. KVM guest) and is not expected to be directly
> > > > > accessed by userspace. Precisely, userspace access to such encrypted
> > > > > memory may lead to host crash so it should be prevented.
> > > > >
> > > > > This patch introduces userspace inaccessible memfd (created with
> > > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > > in-kernel interface so KVM can directly interact with core-mm without
> > > > > the need to map the memory into KVM userspace.
> > > > >
> > > > > It provides semantics required for KVM guest private(encrypted) memory
> > > > > support that a file descriptor with this flag set is going to be used 
> > > > > as
> > > > > the source of guest memory in confidential computing environments such
> > > > > as Intel TDX/AMD SEV.
> > > > >
> > > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > > > > in this patch to obtain the physical memory address and then populate
> > > > > the secondary page table entries.
> > > > >
> > > > > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > > > > from userspace. When hole-punching happens, KVM can get notified 
> > > > > through
> > > > > inaccessible_notifier it then gets chance to remove any mapped entries
> > > > > of the range in the secondary page tables.
> > > > >
> > > > > The userspace inaccessible memfd itself is implemented as a shim layer
> > > > > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > > > > only implemented tmpfs. The allocated memory is currently marked as
> > > > > unmovable and unevictable, this is required for current confidential
> > > > > usage. But in future this might be changed.
> > > > >
> > > > > Signed-off-by: Kirill A. Shutemov 
> > > > > Signed-off-by: Chao Peng 
> > > > > ---
> > > >
> > > > ...
> > > >
> > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > +  loff_t offset, loff_t len)
> > > > > +{
> > > > > +   struct inaccessible_data *data = file->f_mapping->private_data;
> > > > > +   struct file *memfd = data->memfd;
> > > > > +   int ret;
> > > > > +
> > > > > +   if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > +   if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > +   return -EINVAL;
> > > > > +   }
> > > > > +
> > > > > +   ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > +   inaccessible_notifier_invalidate(data, offset, offset + len);
> > > >
> > > > Wonder if invalidate should precede the actual hole punch, otherwise
> > > > we open a window where the page tables point to memory no longer
> > > > valid?
> > >
> > > Yes, you are right. Thanks for catching this.
> >
> > I also noticed this. But then thought the memory would be anyways zeroed
> > (hole punched) before this call?
>
> Hole punching can free pages, given that offset/len covers full page.
>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov

I think moving this notifier_invalidate before fallocate may not solve the
problem completely. Is it possible that, between the invalidate and the
fallocate, KVM tries to handle a page fault for the guest VM from another
vcpu and uses the pages about to be freed to back GPA ranges? Should hole
punching here also update mem_attr first, to say that KVM should no longer
consider the corresponding GPA ranges to be backed by the inaccessible
memfd?
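To make the ordering concrete, a sketch of the fallocate path with the
invalidation moved before the hole punch (names taken from the quoted
patch; the mem_attr update is a hypothetical extra step, sketch only):

static long inaccessible_fallocate(struct file *file, int mode,
                                   loff_t offset, loff_t len)
{
        struct inaccessible_data *data = file->f_mapping->private_data;
        struct file *memfd = data->memfd;

        if (mode & FALLOC_FL_PUNCH_HOLE) {
                if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
                        return -EINVAL;
                /*
                 * Hypothetical: first update mem_attr so page faults on
                 * other vcpus stop treating the range as private, then
                 * zap existing mappings, and only then free the pages.
                 */
                inaccessible_notifier_invalidate(data, offset, offset + len);
        }
        return memfd->f_op->fallocate(memfd, mode, offset, len);
}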



Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions

2022-08-19 Thread Vishal Annapurve
> ...
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 230c8ff9659c..bb714c2a4b06 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -914,6 +914,35 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>
>  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +#define KVM_MEM_ATTR_PRIVATE   0x0001
> +static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl,
> +                                             struct kvm_enc_region *region)
> +{
> +   unsigned long start, end;
> +   void *entry;
> +   int r;
> +
> +   if (region->size == 0 || region->addr + region->size < region->addr)
> +   return -EINVAL;
> +   if (region->addr & (PAGE_SIZE - 1) || region->size & (PAGE_SIZE - 1))
> +   return -EINVAL;
> +
> +   start = region->addr >> PAGE_SHIFT;
> +   end = (region->addr + region->size - 1) >> PAGE_SHIFT;
> +
> +   entry = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION ?
> +   xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
> +
> +   r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end,
> +                             entry, GFP_KERNEL_ACCOUNT));

xa_store_range() seems to create multi-index entries by default, and a
subsequent xa_store_range() call changes all the entries stored
previously. xa_store() needs to be used here instead of xa_store_range()
to achieve the intended behavior.
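A sketch of the per-index alternative, reusing 'entry' and 'r' from the
quoted code ('end' is inclusive here, as in the patch; illustrative only):

        unsigned long i;

        for (i = start; i <= end; i++) {
                r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
                                    GFP_KERNEL_ACCOUNT));
                if (r)
                        break;
        }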

> +
> +   kvm_zap_gfn_range(kvm, start, end + 1);
> +
> +   return r;
> +}
> +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
> +
> ...



Re: [PATCH v6 6/8] KVM: Handle page fault for private memory

2022-07-20 Thread Vishal Annapurve
> > Hmm, so a new slot->arch.page_attr array shouldn't be necessary, KVM can
> > instead update slot->arch.lpage_info on shared<->private conversions.
> > Detecting whether a given range is partially mapped could get nasty if
> > KVM defers tracking to the backing store, but if KVM itself does the
> > tracking as was previously suggested[*], then updating lpage_info should
> > be relatively straightforward, e.g. use xa_for_each_range() to see if a
> > given 2mb/1gb range is completely covered (fully shared) or not covered
> > at all (fully private).
> >
> > [*] https://lore.kernel.org/all/yofezps9yxgtp...@google.com
>
> Yes, slot->arch.page_attr was introduced to help identify whether a page
> is completely shared/private at a given level. It seems XARRAY can serve
> the same purpose, though I know nothing about it. Looking forward to
> seeing the patch using XARRAY.
>
> Yes, updating slot->arch.lpage_info is good to utilize the existing logic,
> and Isaku has applied it to slot->arch.lpage_info for the 2MB support
> patches.
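For illustration, the xa_for_each_range() check could look something like
this, assuming KVM tracks the attribute with one xarray entry per gfn
(names are illustrative; sketch only):

/* count how many gfns in [start, end) have an attribute entry */
static unsigned long range_entries(struct kvm *kvm, gfn_t start, gfn_t end)
{
        unsigned long index, n = 0;
        void *entry;

        xa_for_each_range(&kvm->mem_attr_array, index, entry, start, end - 1)
                n++;

        return n;
}

A count of (end - start) means the range is completely covered, zero means
not covered at all, and anything in between is the partially-converted case
that would have to fall back to 4K mappings.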

Chao, are you planning to implement these changes to ensure proper
handling of hugepages partially mapped as private/shared in subsequent
versions of this series?
Or is this something left to be handled by the architecture-specific code?

Regards,
Vishal



Re: [PATCH v6 6/8] KVM: Handle page fault for private memory

2022-06-30 Thread Vishal Annapurve
...
> > > /*
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index afe18d70ece7..e18460e0d743 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -2899,6 +2899,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > > if (max_level == PG_LEVEL_4K)
> > > return PG_LEVEL_4K;
> > >
> > > +   if (kvm_slot_is_private(slot))
> > > +   return max_level;
> >
> > Can you explain the rationale behind the above change?
> > AFAIU, this overrides the transparent_hugepage=never setting for both
> > shared and private mappings.
>
> As Sean pointed out, this should check against fault->is_private instead
> of the slot. For private fault, the level is retrieved and stored to
> fault->max_level in kvm_faultin_pfn_private() instead of here.
>
> For shared fault, it will continue to query host_level below. For
> private fault, the host level has already been accounted in
> kvm_faultin_pfn_private().
>
> Chao
> >

With transparent_hugepages=always setting I see issues with the
current implementation.

Scenario:
1) Guest accesses a gfn range 0x800-0xa00 as private
2) Guest calls mapgpa to convert the range 0x84d-0x86e to shared
3) Guest tries to access the recently converted memory as shared for the
first time

Guest VM shutdown is observed after step 3 -> the guest is unable to
proceed further since somehow the code section is not as expected.

Corresponding KVM trace logs after step 3:
VCPU-0-61883   [078] . 72276.115679: kvm_page_fault: address
84d000 error_code 4
VCPU-0-61883   [078] . 72276.127005: kvm_mmu_spte_requested: gfn
84d pfn 100b4a4d level 2
VCPU-0-61883   [078] . 72276.127008: kvm_tdp_mmu_spte_changed: as
id 0 gfn 800 level 2 old_spte 100b1b16827 new_spte 100b4a00ea7
VCPU-0-61883   [078] . 72276.127009: kvm_mmu_prepare_zap_page: sp
gen 0 gfn 800 l1 8-byte q0 direct wux nxe ad root 0 sync
VCPU-0-61883   [078] . 72276.127009: kvm_tdp_mmu_spte_changed: as
id 0 gfn 800 level 1 old_spte 1003eb27e67 new_spte 5a0
VCPU-0-61883   [078] . 72276.127010: kvm_tdp_mmu_spte_changed: as
id 0 gfn 801 level 1 old_spte 10056cc8e67 new_spte 5a0
VCPU-0-61883   [078] . 72276.127010: kvm_tdp_mmu_spte_changed: as
id 0 gfn 802 level 1 old_spte 10056fa2e67 new_spte 5a0
VCPU-0-61883   [078] . 72276.127010: kvm_tdp_mmu_spte_changed: as
id 0 gfn 803 level 1 old_spte 0 new_spte 5a0

 VCPU-0-61883   [078] . 72276.127089: kvm_tdp_mmu_spte_changed: as
id 0 gfn 9ff level 1 old_spte 100a43f4e67 new_spte 5a0
 VCPU-0-61883   [078] . 72276.127090: kvm_mmu_set_spte: gfn 800
spte 100b4a00ea7 (rwxu) level 2 at 10052fa5020
 VCPU-0-61883   [078] . 72276.127091: kvm_fpu: unload

Looks like with transparent hugepages enabled, KVM tried to handle the
shared memory fault on gfn 0x84d by coalescing nearby 4K pages to form a
contiguous 2MB page mapping at gfn 0x800, since level 2 was requested in
kvm_mmu_spte_requested.
This caused the private memory contents from the ranges 0x800-0x84c and
0x86e-0xa00 to get unmapped from the guest, leading to the guest VM
shutdown.

Does getting the mapping level as per the fault access type help address
the above issue? Any such coalescing should not cross private-to-shared or
shared-to-private region boundaries.

> > > host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
> > > return min(host_level, max_level);
> > >  }
> >

Regards,
Vishal



Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-06-08 Thread Vishal Annapurve
...
> With this patch series, it's actually not even possible for the userspace
> VMM to allocate a private page by a direct write, it's basically unmapped
> from there. If it really wants to, it should do something special, by
> intention, that's basically the conversion, which we should allow.
>

A VM can pass a GPA backed by private pages to the userspace VMM, and when
the userspace VMM accesses the backing HVA, pages will be allocated to
back the shared fd, causing two sets of pages to back the same guest
memory range.

> Thanks for bringing this up. But in my mind I still think userspace VMM
> can do and it's its responsibility to guarantee that, if that is hard
> required. By design, userspace VMM is the decision-maker for page
> conversion and has all the necessary information to know which page is
> shared/private. It also has the necessary knobs to allocate/free the
> physical pages for guest memory. Definitely, we should make userspace
> VMM more robust.

Making the userspace VMM robust enough to avoid double allocation can get
complex: it will have to keep track of all shared fd memory in use (by the
userspace VMM) to disallow conversion from shared to private, and will
have to ensure that all guest-supplied addresses belong to shared GPA
ranges.

A coarser but simpler alternative could be to always allow shared-to-
private conversion, unbacking the memory from the shared fd, and exit if
the VMM runs into double allocation scenarios. In either case, unbacking
shared fd memory ideally should prevent memory allocation on subsequent
write accesses, to ensure double allocation scenarios are caught early.

Regards,
Vishal



Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-06-06 Thread Vishal Annapurve
>
> Private memory map/unmap and conversion
> ---
> Userspace's map/unmap operations are done by fallocate() ioctl on the
> backing store fd.
>   - map: default fallocate() with mode=0.
>   - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
> The map/unmap will trigger above memfile_notifier_ops to let KVM map/unmap
> secondary MMU page tables.
>

> QEMU: https://github.com/chao-p/qemu/tree/privmem-v6
>
> An example QEMU command line for TDX test:
> -object tdx-guest,id=tdx \
> -object memory-backend-memfd-private,id=ram1,size=2G \
> -machine q35,kvm-type=tdx,pic=no,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1
>

There should be more discussion around double allocation scenarios when
using the private fd approach. A malicious guest or buggy userspace VMM
can cause physical memory to get allocated for both the shared (memory
accessible from the host) and private fds backing the guest memory.

The userspace VMM will need to unback the shared guest memory while
handling the conversion from shared to private in order to prevent double
allocation, even with malicious guests or bugs in the userspace VMM.

Options to unback shared guest memory seem to be (options (1) and (2) are
sketched below):
1) madvise(..., MADV_DONTNEED/MADV_REMOVE) - This option won't stop the
kernel from re-backing the shared memory on subsequent write accesses
2) fallocate(..., FALLOC_FL_PUNCH_HOLE...) - For file-backed shared guest
memory, this option is similar to madvise, since it would still allow the
shared memory to get re-backed on write accesses
3) munmap - This would give away the contiguous virtual memory region
reservation, leaving holes in the guest backing memory, which might make
guest memory management difficult
4) mprotect(..., PROT_NONE) - This would keep the virtual memory address
range backing the guest memory preserved

ram_block_discard_range_fd from the reference implementation at
https://github.com/chao-p/qemu/tree/privmem-v6 seems to be relying on
fallocate/madvise.
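For illustration, a userspace sketch of options (1) and (2) from the list
above (as noted, neither prevents the kernel from re-backing the range on
a later write access; sketch only):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>

/* Unback [offset, offset + len) of the shared backing, via fd and mapping. */
static int unback_shared_range(int shared_fd, void *hva, off_t offset,
                               size_t len)
{
        /* (2) punch a hole in the shared backing file */
        if (fallocate(shared_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      offset, len))
                return -1;
        /* (1) and zap the range through the shared mapping */
        return madvise(hva, len, MADV_DONTNEED);
}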

Any thoughts/suggestions around better ways to unback the shared
memory in order to avoid double allocation scenarios?

Regards,
Vishal



Re: [PATCH v6 3/8] mm/memfd: Introduce MFD_INACCESSIBLE flag

2022-05-31 Thread Vishal Annapurve
On Thu, May 19, 2022 at 8:41 AM Chao Peng  wrote:
>
> Introduce a new memfd_create() flag indicating the content of the
> created memfd is inaccessible from userspace through ordinary MMU
> access (e.g., read/write/mmap). However, the file content can be
> accessed via a different mechanism (e.g. KVM MMU) indirectly.
>

SEV, TDX, pKVM and software-only VMs seem to have use cases that need to
set up initial guest boot memory with the required blobs. TDX already
supports a KVM ioctl to transfer contents into private memory using the
TDX module, but the rest of the implementations will need to invent a way
to do this.

Is there a plan to support a common implementation, either allowing
initial write access from userspace to the private fd or adding a KVM
ioctl to transfer contents into such a file, as part of this series
through future revisions?
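Purely as an illustration of the second option, such an ioctl could take a
form along these lines (hypothetical uAPI, not something this series
defines):

/* hypothetical: copy initial boot blobs from userspace into private memory */
struct kvm_copy_to_private {
        __u64 gpa;      /* destination guest physical address */
        __u64 uaddr;    /* source userspace virtual address */
        __u64 size;     /* number of bytes to copy */
};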

Regards,
Vishal



Re: [PATCH v5 01/13] mm/memfd: Introduce MFD_INACCESSIBLE flag

2022-04-22 Thread Vishal Annapurve
On Thu, Mar 10, 2022 at 6:09 AM Chao Peng  wrote:
>
> From: "Kirill A. Shutemov" 
>
> Introduce a new memfd_create() flag indicating the content of the
> created memfd is inaccessible from userspace through ordinary MMU
> access (e.g., read/write/mmap). However, the file content can be
> accessed via a different mechanism (e.g. KVM MMU) indirectly.
>
> It provides semantics required for KVM guest private memory support
> that a file descriptor with this flag set is going to be used as the
> source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV but may not be accessible from host userspace.
>
> Since page migration/swapping is not yet supported for such usages,
> these pages are currently marked as UNMOVABLE and UNEVICTABLE which
> makes them behave like long-term pinned pages.
>
> The flag can not coexist with MFD_ALLOW_SEALING, future sealing is
> also impossible for a memfd created with this flag.
>
> At this time only shmem implements this flag.
>
> Signed-off-by: Kirill A. Shutemov 
> Signed-off-by: Chao Peng 
> ---
>  include/linux/shmem_fs.h   |  7 +
>  include/uapi/linux/memfd.h |  1 +
>  mm/memfd.c | 26 +++--
>  mm/shmem.c | 57 ++
>  4 files changed, 88 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index e65b80ed09e7..2dde843f28ef 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -12,6 +12,9 @@
>
>  /* inode in-kernel data */
>
> +/* shmem extended flags */
> +#define SHM_F_INACCESSIBLE 0x0001  /* prevent ordinary MMU access (e.g. read/write/mmap) to file content */
> +
>  struct shmem_inode_info {
>         spinlock_t              lock;
>         unsigned int            seals;          /* shmem seals */
> @@ -24,6 +27,7 @@ struct shmem_inode_info {
>         struct shared_policy    policy;         /* NUMA memory alloc policy */
>         struct simple_xattrs    xattrs;         /* list of xattrs */
>         atomic_t                stop_eviction;  /* hold when working on inode */
> +       unsigned int            xflags;         /* shmem extended flags */
>         struct inode            vfs_inode;
>  };
>
> @@ -61,6 +65,9 @@ extern struct file *shmem_file_setup(const char *name,
> loff_t size, unsigned long flags);
>  extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
> unsigned long flags);
> +extern struct file *shmem_file_setup_xflags(const char *name, loff_t size,
> +   unsigned long flags,
> +   unsigned int xflags);
>  extern struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt,
> const char *name, loff_t size, unsigned long flags);
>  extern int shmem_zero_setup(struct vm_area_struct *);
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> index 7a8a26751c23..48750474b904 100644
> --- a/include/uapi/linux/memfd.h
> +++ b/include/uapi/linux/memfd.h
> @@ -8,6 +8,7 @@
>  #define MFD_CLOEXEC0x0001U
>  #define MFD_ALLOW_SEALING  0x0002U
>  #define MFD_HUGETLB0x0004U
> +#define MFD_INACCESSIBLE   0x0008U
>
>  /*
>   * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 9f80f162791a..74d45a26cf5d 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -245,16 +245,20 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
>  #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
>  #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> +  MFD_INACCESSIBLE)
>
>  SYSCALL_DEFINE2(memfd_create,
> const char __user *, uname,
> unsigned int, flags)
>  {
> +   struct address_space *mapping;
> unsigned int *file_seals;
> +   unsigned int xflags;
> struct file *file;
> int fd, error;
> char *name;
> +   gfp_t gfp;
> long len;
>
> if (!(flags & MFD_HUGETLB)) {
> @@ -267,6 +271,10 @@ SYSCALL_DEFINE2(memfd_create,
> return -EINVAL;
> }
>
> +   /* Disallow sealing when MFD_INACCESSIBLE is set. */
> +   if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING)
> +   return -EINVAL;
> +
> /* length includes terminating zero */
> len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
> if (len <= 0)
> @@ -301,8 +309,11 @@ SYSCALL_DEFINE2(memfd_create,
> HUGETLB_ANONHUGE_INODE,
> (flags >> MFD_HUGE_SHIFT) &
> 

Re: [PATCH v5 11/13] KVM: Zap existing KVM mappings when pages changed in the private fd

2022-04-19 Thread Vishal Annapurve
On Thu, Mar 10, 2022 at 6:11 AM Chao Peng  wrote:
>
> KVM gets notified when memory pages changed in the memory backing store.
> When userspace allocates the memory with fallocate() or frees memory
> with fallocate(FALLOC_FL_PUNCH_HOLE), memory backing store calls into
> KVM fallocate/invalidate callbacks respectively. To ensure KVM never
> maps both the private and shared variants of a GPA into the guest, in
> the fallocate callback, we should zap the existing shared mapping and
> in the invalidate callback we should zap the existing private mapping.
>
> In the callbacks, KVM firstly converts the offset range into the
> gfn_range and then calls existing kvm_unmap_gfn_range() which will zap
> the shared or private mapping. Both callbacks pass in a memslot
> reference but we need 'kvm' so add a reference in memslot structure.
>
> Signed-off-by: Yu Zhang 
> Signed-off-by: Chao Peng 
> ---
>  include/linux/kvm_host.h |  3 ++-
>  virt/kvm/kvm_main.c  | 36 
>  2 files changed, 38 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 9b175aeca63f..186b9b981a65 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -236,7 +236,7 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t 
> cr2_or_gpa,
>  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
>  #endif
>
> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_MEMFILE_NOTIFIER)
>  struct kvm_gfn_range {
> struct kvm_memory_slot *slot;
> gfn_t start;
> @@ -568,6 +568,7 @@ struct kvm_memory_slot {
> loff_t private_offset;
> struct memfile_pfn_ops *pfn_ops;
> struct memfile_notifier notifier;
> +   struct kvm *kvm;
>  };
>
>  static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 67349421eae3..52319f49d58a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -841,8 +841,43 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>
>  #ifdef CONFIG_MEMFILE_NOTIFIER
> +static void kvm_memfile_notifier_handler(struct memfile_notifier *notifier,
> +                                        pgoff_t start, pgoff_t end)
> +{
> +   int idx;
> +   struct kvm_memory_slot *slot = container_of(notifier,
> +   struct kvm_memory_slot,
> +   notifier);
> +   struct kvm_gfn_range gfn_range = {
> +   .slot   = slot,
> +           .start          = start - (slot->private_offset >> PAGE_SHIFT),
> +           .end            = end - (slot->private_offset >> PAGE_SHIFT),
> +   .may_block  = true,
> +   };
> +   struct kvm *kvm = slot->kvm;
> +
> +   gfn_range.start = max(gfn_range.start, slot->base_gfn);

gfn_range.start seems to be the page offset within the file. Should this rather be:
gfn_range.start = slot->base_gfn + min(gfn_range.start, slot->npages);

> +   gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);
> +

Similar to the previous comment, should this rather be:
gfn_range.end = slot->base_gfn + min(gfn_range.end, slot->npages);

> +   if (gfn_range.start >= gfn_range.end)
> +   return;
> +
> +   idx = srcu_read_lock(&kvm->srcu);
> +   KVM_MMU_LOCK(kvm);
> +   kvm_unmap_gfn_range(kvm, &gfn_range);
> +   kvm_flush_remote_tlbs(kvm);
> +   KVM_MMU_UNLOCK(kvm);
> +   srcu_read_unlock(&kvm->srcu, idx);
> +}
> +
> +static struct memfile_notifier_ops kvm_memfile_notifier_ops = {
> +   .invalidate = kvm_memfile_notifier_handler,
> +   .fallocate = kvm_memfile_notifier_handler,
> +};
> +
>  static inline int kvm_memfile_register(struct kvm_memory_slot *slot)
>  {
> +   slot->notifier.ops = &kvm_memfile_notifier_ops;
> +   return memfile_register_notifier(file_inode(slot->private_file),
> +                                    &slot->notifier,
> +                                    &slot->pfn_ops);
> @@ -1963,6 +1998,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> new->private_file = file;
> new->private_offset = mem->flags & KVM_MEM_PRIVATE ?
>   region_ext->private_offset : 0;
> +   new->kvm = kvm;
>
> r = kvm_set_memslot(kvm, old, new, change);
> if (!r)
> --
> 2.17.1
>
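Putting the two suggestions above together, the clamping could look like
this (same field names as the patch; sketch only):

        /* clamp the offset-adjusted file range to the slot size */
        gfn_range.start = slot->base_gfn +
                          min_t(gfn_t, gfn_range.start, slot->npages);
        gfn_range.end = slot->base_gfn +
                        min_t(gfn_t, gfn_range.end, slot->npages);

        if (gfn_range.start >= gfn_range.end)
                return;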



Re: [PATCH v5 03/13] mm/shmem: Support memfile_notifier

2022-04-19 Thread Vishal Annapurve
On Thu, Mar 10, 2022 at 6:10 AM Chao Peng  wrote:
>
> From: "Kirill A. Shutemov" 
>
> It maintains a memfile_notifier list in shmem_inode_info structure and
> implements memfile_pfn_ops callbacks defined by memfile_notifier. It
> then exposes them to memfile_notifier via
> shmem_get_memfile_notifier_info.
>
> We use SGP_NOALLOC in shmem_get_lock_pfn since the pages should be
> allocated by userspace for private memory. If there is no pages
> allocated at the offset then error should be returned so KVM knows that
> the memory is not private memory.
>
> Signed-off-by: Kirill A. Shutemov 
> Signed-off-by: Chao Peng 
> ---
>  include/linux/shmem_fs.h |  4 +++
>  mm/shmem.c   | 76 
>  2 files changed, 80 insertions(+)
>
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 2dde843f28ef..7bb16f2d2825 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  /* inode in-kernel data */
>
> @@ -28,6 +29,9 @@ struct shmem_inode_info {
>         struct simple_xattrs    xattrs;         /* list of xattrs */
>         atomic_t                stop_eviction;  /* hold when working on inode */
>         unsigned int            xflags;         /* shmem extended flags */
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +       struct memfile_notifier_list memfile_notifiers;
> +#endif
>         struct inode            vfs_inode;
>  };
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 9b31a7056009..7b43e274c9a2 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -903,6 +903,28 @@ static struct folio *shmem_get_partial_folio(struct 
> inode *inode, pgoff_t index)
> return page ? page_folio(page) : NULL;
>  }
>
> +static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +   struct shmem_inode_info *info = SHMEM_I(inode);
> +
> +   memfile_notifier_fallocate(&info->memfile_notifiers, start, end);
> +#endif
> +}
> +
> +static void notify_invalidate_page(struct inode *inode, struct folio *folio,
> +  pgoff_t start, pgoff_t end)
> +{
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +   struct shmem_inode_info *info = SHMEM_I(inode);
> +
> +   start = max(start, folio->index);
> +   end = min(end, folio->index + folio_nr_pages(folio));
> +
> +   memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
> +#endif
> +}
> +
>  /*
>   * Remove range of pages and swap entries from page cache, and free them.
>   * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
> @@ -946,6 +968,8 @@ static void shmem_undo_range(struct inode *inode, loff_t 
> lstart, loff_t lend,
> }
> index += folio_nr_pages(folio) - 1;
>
> +   notify_invalidate_page(inode, folio, start, end);
> +
> if (!unfalloc || !folio_test_uptodate(folio))
> truncate_inode_folio(mapping, folio);
> folio_unlock(folio);
> @@ -1019,6 +1043,9 @@ static void shmem_undo_range(struct inode *inode, 
> loff_t lstart, loff_t lend,
> index--;
> break;
> }
> +
> +                           notify_invalidate_page(inode, folio, start, end);
> +

Should this be done in batches, or once for the whole range [start, end)?

> VM_BUG_ON_FOLIO(folio_test_writeback(folio),
> folio);
> truncate_inode_folio(mapping, folio);
> @@ -2279,6 +2306,9 @@ static struct inode *shmem_get_inode(struct super_block 
> *sb, const struct inode
> info->flags = flags & VM_NORESERVE;
>         INIT_LIST_HEAD(&info->shrinklist);
>         INIT_LIST_HEAD(&info->swaplist);
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +       memfile_notifier_list_init(&info->memfile_notifiers);
> +#endif
>         simple_xattrs_init(&info->xattrs);
> cache_no_acl(inode);
> mapping_set_large_folios(inode->i_mapping);
> @@ -2802,6 +2832,7 @@ static long shmem_fallocate(struct file *file, int 
> mode, loff_t offset,
> if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
> i_size_write(inode, offset + len);
> inode->i_ctime = current_time(inode);
> +   notify_fallocate(inode, start, end);
>  undone:
>         spin_lock(&inode->i_lock);
> inode->i_private = NULL;
> @@ -3909,6 +3940,47 @@ static struct file_system_type shmem_fs_type = {
> .fs_flags   = FS_USERNS_MOUNT,
>  };
>
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> +static long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset, int *order)
> +{
> +   struct page *page;
> +   int ret;
> +
> +   ret = shmem_getpage(inode, 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-08 Thread Vishal Annapurve
On Mon, Mar 28, 2022 at 10:17 AM Andy Lutomirski  wrote:
>
> On Thu, Mar 10, 2022 at 6:09 AM Chao Peng  wrote:
> >
> > This is the v5 of this series which tries to implement the fd-based KVM
> > guest private memory. The patches are based on latest kvm/queue branch
> > commit:
> >
> >   d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
>
> Can this series be run and a VM booted without TDX?  A feature like
> that might help push it forward.
>
> --Andy

I have posted a RFC series with selftests to exercise the UPM feature
with normal non-confidential VMs via
https://lore.kernel.org/kvm/20220408210545.3915712-1-vannapu...@google.com/

-- Vishal