Re: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

2023-08-03 Thread Vlastimil Babka
On 7/26/23 13:20, Nikunj A. Dadhania wrote:
> Hi Sean,
> 
> On 7/24/2023 10:30 PM, Sean Christopherson wrote:
>> On Mon, Jul 24, 2023, Nikunj A. Dadhania wrote:
>>> On 7/19/2023 5:14 AM, Sean Christopherson wrote:
>>>> This is the next iteration of implementing fd-based (instead of vma-based)
>>>> memory for KVM guests.  If you want the full background of why we are doing
>>>> this, please go read the v10 cover letter[1].
>>>>
>>>> The biggest change from v10 is to implement the backing storage in KVM
>>>> itself, and expose it via a KVM ioctl() instead of a "generic" syscall.
>>>> See link[2] for details on why we pivoted to a KVM-specific approach.
>>>>
>>>> Key word is "biggest".  Relative to v10, there are many big changes.
>>>> Highlights below (I can't remember everything that got changed at
>>>> this point).
>>>>
>>>> Tagged RFC as there are a lot of empty changelogs, and a lot of missing
>>>> documentation.  And ideally, we'll have even more tests before merging.
>>>> There are also several gaps/opens (to be discussed in tomorrow's PUCK).
>>>
>>> As per our discussion on the PUCK call, here are the memory/NUMA accounting 
>>> related observations that I had while working on SNP guest secure page 
>>> migration:
>>>
>>> * gmem allocations are currently treated as file page allocations
>>>   accounted to the kernel and not to the QEMU process.
>> 
>> We need to level set on terminology: these are all *stats*, not accounting.
>> That distinction matters because we have wiggle room on stats, e.g. we can
>> probably get away with just about any definition of how guest_memfd memory
>> impacts stats, so long as the information that is surfaced to userspace is
>> useful and expected.
>> 
>> But we absolutely need to get accounting correct, specifically the allocations
>> need to be correctly accounted in memcg.  And unless I'm missing something,
>> nothing in here shows anything related to memcg.
> 
> I tried out memcg after creating a separate cgroup for the qemu process.
> Guest memory is accounted in memcg.
> 
>   $ egrep -w "file|file_thp|unevictable" memory.stat
>   file 42978775040
>   file_thp 42949672960
>   unevictable 42953588736 
> 
> NUMA allocations are coming from the right nodes as set by numactl.
> 
>   $ egrep -w "file|file_thp|unevictable" memory.numa_stat
>   file N0=0 N1=20480 N2=21489377280 N3=21489377280
>   file_thp N0=0 N1=0 N2=21472739328 N3=21476933632
>   unevictable N0=0 N1=0 N2=21474697216 N3=21478891520
> 
>> 
>>>   Starting an SNP guest with 40G memory with memory interleave between
>>>   Node2 and Node3
>>>
>>>   $ numactl -i 2,3 ./bootg_snp.sh
>>>
>>>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>>  242179 root      20   0   40.4g  99580  51676 S  78.0   0.0   0:56.58 qemu-system-x86
>>>
>>>   -> Incorrect process resident memory and shared memory are reported
>> 
>> I don't know that I would call these "incorrect".  Shared memory definitely is
>> correct, because by definition guest_memfd isn't shared.  RSS is less clear cut;
>> gmem memory is resident in RAM, but if we show gmem in RSS then we'll end up with
>> scenarios where RSS > VIRT, which will be quite confusing for unaware users (I'm
>> assuming the 40g of VIRT here comes from QEMU mapping the shared half of gmem
>> memslots).
> 
> I am not sure why RSS will exceed VIRT; it should be at most 40G (assuming
> all the memory is private).
> 
> As per my experiments with the hack below, MM_FILEPAGES does get accounted to
> RSS/SHR in top:
> 
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    4339 root      20   0   40.4g  40.1g  40.1g S  76.7  16.0   0:13.83 qemu-system-x86
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index f456f3b5049c..5b1f48a2e714 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -166,6 +166,7 @@ void mm_trace_rss_stat(struct mm_struct *mm, int member)
>  {
> trace_rss_stat(mm, member);
>  }
> +EXPORT_SYMBOL(mm_trace_rss_stat);
> 
>  /*
>   * Note: this doesn't free the actual pages themselves. That
> diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
> index a7e926af4255..e4f268bf9ce2 100644
> --- a/virt/kvm/guest_mem.c
> +++ b/virt/kvm/guest_mem.c
> @@ -91,6 +91,10 @@ static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
> clear_highpage(folio_page(folio, i));
> }
> 
> +   /* Account only once for the first time */
> +   if (!folio_test_dirty(folio))
> +   add_mm_counter(current->mm, MM_FILEPAGES, folio_nr_pages(folio));

I think this alone would cause "Bad rss-counter" messages when the process
exits, because there's no corresponding decrement when page tables are torn
down. We would probably have to instantiate the page tables (i.e. with
PROT_NONE so userspace can't really do accesses through them) for this to
work properly.
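
For illustration, a minimal sketch of the symmetric decrement the hack above is
missing (kvm_gmem_uncharge_rss() is a made-up helper; it also assumes the owning
mm is still reachable from the gmem truncate/free path, which is itself an open
problem):

	static void kvm_gmem_uncharge_rss(struct mm_struct *mm, struct folio *folio)
	{
		/* Undo the MM_FILEPAGES bump done in kvm_gmem_get_folio(). */
		add_mm_counter(mm, MM_FILEPAGES, -folio_nr_pages(folio));
	}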

So then it wouldn't technically be "unmapped private memory" a

Re: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

2023-07-26 Thread Nikunj A. Dadhania
On 7/26/2023 7:54 PM, Sean Christopherson wrote:
> On Wed, Jul 26, 2023, Nikunj A. Dadhania wrote:
>> On 7/24/2023 10:30 PM, Sean Christopherson wrote:

>>>>   /proc//smaps
>>>>   7f528be0-7f5c8be0 rw-p  00:01 26629  /memfd:memory-backend-memfd-shared (deleted)
>>>>   7f5c9020-7f5c9022 rw-s  00:01 44033  /memfd:rom-backend-memfd-shared (deleted)
>>>>   7f5c9040-7f5c9042 rw-s  00:01 44032  /memfd:rom-backend-memfd-shared (deleted)
>>>>   7f5c9080-7f5c90b7c000 rw-s  00:01 1025   /memfd:rom-backend-memfd-shared (deleted)
>>>
>>> This is all expected, and IMO correct.  There are no userspace mappings, and so
>>> not accounting anything is working as intended.
>> Doesn't sound that correct: if 10 SNP guests are running, each using 10GB, how
>> would we know who is using 100GB of memory?
> 
> It's correct with respect to what the interfaces show, which is how much memory
> is *mapped* into userspace.
> 
> As I said (or at least tried to say) in my first reply, I am not against exposing
> memory usage to userspace via stats, only that it's not obvious to me that the
> existing VMA-based stats are the most appropriate way to surface this information.

Right, then should we think along the lines of creating a VM ioctl for querying
the current memory usage of a guest-memfd?
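
Purely as a sketch of that idea (KVM_GET_GMEM_STATS, the ioctl number, and the
struct below are made-up names, not existing or proposed uAPI), it could look
something like:

	/* Hypothetical uAPI sketch, not part of this series. */
	struct kvm_gmem_stats {
		__u32 nr_nodes;			/* in: entries in the array below */
		__u32 pad;
		__u64 bytes_allocated;		/* out: total bytes backed in guest_memfd */
		__u64 bytes_allocated_node[];	/* out: per-NUMA-node breakdown */
	};

	#define KVM_GET_GMEM_STATS	_IOWR(KVMIO, 0xd5, struct kvm_gmem_stats)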

We could use memcg for statistics, but the memory cgroup can be disabled, so
memcg isn't really a dependable option.

Do you have any ideas on how to expose the memory usage to userspace other
than via VMA-based stats?

Regards,
Nikunj


Re: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

2023-07-26 Thread Sean Christopherson
On Wed, Jul 26, 2023, Nikunj A. Dadhania wrote:
> Hi Sean,
> 
> On 7/24/2023 10:30 PM, Sean Christopherson wrote:
> >>   Starting an SNP guest with 40G memory with memory interleave between
> >>   Node2 and Node3
> >>
> >>   $ numactl -i 2,3 ./bootg_snp.sh
> >>
> >>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> >>  242179 root      20   0   40.4g  99580  51676 S  78.0   0.0   0:56.58 qemu-system-x86
> >>
> >>   -> Incorrect process resident memory and shared memory are reported
> > 
> > I don't know that I would call these "incorrect".  Shared memory definitely is
> > correct, because by definition guest_memfd isn't shared.  RSS is less clear cut;
> > gmem memory is resident in RAM, but if we show gmem in RSS then we'll end up with
> > scenarios where RSS > VIRT, which will be quite confusing for unaware users (I'm
> > assuming the 40g of VIRT here comes from QEMU mapping the shared half of gmem
> > memslots).
> 
> I am not sure why RSS will exceed VIRT; it should be at most 40G (assuming
> all the memory is private).

And also assuming that (a) userspace mmap()'d the shared side of things 1:1 with
private memory and (b) that the shared mappings have not been populated.  Those
assumptions will probably hold true for QEMU, but kernel correctness
shouldn't depend on assumptions about one specific userspace application.

> >>   /proc//smaps
> >>   7f528be0-7f5c8be0 rw-p  00:01 26629  /memfd:memory-backend-memfd-shared (deleted)
> >>   7f5c9020-7f5c9022 rw-s  00:01 44033  /memfd:rom-backend-memfd-shared (deleted)
> >>   7f5c9040-7f5c9042 rw-s  00:01 44032  /memfd:rom-backend-memfd-shared (deleted)
> >>   7f5c9080-7f5c90b7c000 rw-s  00:01 1025   /memfd:rom-backend-memfd-shared (deleted)
> > 
> > This is all expected, and IMO correct.  There are no userspace mappings, and so
> > not accounting anything is working as intended.
> Doesn't sound that correct: if 10 SNP guests are running, each using 10GB, how
> would we know who is using 100GB of memory?

It's correct with respect to what the interfaces show, which is how much memory
is *mapped* into userspace.

As I said (or at least tried to say) in my first reply, I am not against exposing
memory usage to userspace via stats, only that it's not obvious to me that the
existing VMA-based stats are the most appropriate way to surface this information.


Re: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

2023-07-26 Thread Nikunj A. Dadhania
Hi Sean,

On 7/24/2023 10:30 PM, Sean Christopherson wrote:
> On Mon, Jul 24, 2023, Nikunj A. Dadhania wrote:
>> On 7/19/2023 5:14 AM, Sean Christopherson wrote:
>>> This is the next iteration of implementing fd-based (instead of vma-based)
>>> memory for KVM guests.  If you want the full background of why we are doing
>>> this, please go read the v10 cover letter[1].
>>>
>>> The biggest change from v10 is to implement the backing storage in KVM
> itself, and expose it via a KVM ioctl() instead of a "generic" syscall.
>>> See link[2] for details on why we pivoted to a KVM-specific approach.
>>>
>>> Key word is "biggest".  Relative to v10, there are many big changes.
>>> Highlights below (I can't remember everything that got changed at
>>> this point).
>>>
>>> Tagged RFC as there are a lot of empty changelogs, and a lot of missing
>>> documentation.  And ideally, we'll have even more tests before merging.
>>> There are also several gaps/opens (to be discussed in tomorrow's PUCK).
>>
>> As per our discussion on the PUCK call, here are the memory/NUMA accounting 
>> related observations that I had while working on SNP guest secure page 
>> migration:
>>
>> * gmem allocations are currently treated as file page allocations
>>   accounted to the kernel and not to the QEMU process.
> 
> We need to level set on terminology: these are all *stats*, not accounting.
> That distinction matters because we have wiggle room on stats, e.g. we can
> probably get away with just about any definition of how guest_memfd memory
> impacts stats, so long as the information that is surfaced to userspace is
> useful and expected.
> 
> But we absolutely need to get accounting correct, specifically the allocations
> need to be correctly accounted in memcg.  And unless I'm missing something,
> nothing in here shows anything related to memcg.

I tried out memcg after creating a separate cgroup for the qemu process. Guest 
memory is accounted in memcg.

  $ egrep -w "file|file_thp|unevictable" memory.stat
  file 42978775040
  file_thp 42949672960
  unevictable 42953588736 

NUMA allocations are coming from the right nodes as set by numactl.

  $ egrep -w "file|file_thp|unevictable" memory.numa_stat
  file N0=0 N1=20480 N2=21489377280 N3=21489377280
  file_thp N0=0 N1=0 N2=21472739328 N3=21476933632
  unevictable N0=0 N1=0 N2=21474697216 N3=21478891520

> 
>>   Starting an SNP guest with 40G memory with memory interleave between
>>   Node2 and Node3
>>
>>   $ numactl -i 2,3 ./bootg_snp.sh
>>
>>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>  242179 root      20   0   40.4g  99580  51676 S  78.0   0.0   0:56.58 qemu-system-x86
>>
>>   -> Incorrect process resident memory and shared memory are reported
> 
> I don't know that I would call these "incorrect".  Shared memory definitely is
> correct, because by definition guest_memfd isn't shared.  RSS is less clear cut;
> gmem memory is resident in RAM, but if we show gmem in RSS then we'll end up with
> scenarios where RSS > VIRT, which will be quite confusing for unaware users (I'm
> assuming the 40g of VIRT here comes from QEMU mapping the shared half of gmem
> memslots).

I am not sure why RSS will exceed VIRT; it should be at most 40G (assuming
all the memory is private).

As per my experiments with the hack below, MM_FILEPAGES does get accounted to
RSS/SHR in top:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   4339 root      20   0   40.4g  40.1g  40.1g S  76.7  16.0   0:13.83 qemu-system-x86

diff --git a/mm/memory.c b/mm/memory.c
index f456f3b5049c..5b1f48a2e714 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -166,6 +166,7 @@ void mm_trace_rss_stat(struct mm_struct *mm, int member)
 {
trace_rss_stat(mm, member);
 }
+EXPORT_SYMBOL(mm_trace_rss_stat);

 /*
  * Note: this doesn't free the actual pages themselves. That
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index a7e926af4255..e4f268bf9ce2 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -91,6 +91,10 @@ static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
clear_highpage(folio_page(folio, i));
}

+   /* Account only once for the first time */
+   if (!folio_test_dirty(folio))
+   add_mm_counter(current->mm, MM_FILEPAGES, folio_nr_pages(folio));
+
folio_mark_accessed(folio);
folio_mark_dirty(folio);
folio_mark_uptodate(folio);

We can update the rss_stat appropriately to get correct reporting in userspace.

>>   Accounting of the memory happens in the host page fault handler path,
>>   but for private guest pages we will never hit that.
>>
>> * NUMA allocation does use the process mempolicy for appropriate node 
>>   allocation (Node2 and Node3), but they again do not get attributed to 
>>   the QEMU process
>>
>>   Every 1.0s: sudo numastat -m -p qemu-system-x86 | egrep -i "qemu|PID|Node|Filepage"   gomat

Re: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

2023-07-24 Thread Sean Christopherson
On Mon, Jul 24, 2023, Nikunj A. Dadhania wrote:
> On 7/19/2023 5:14 AM, Sean Christopherson wrote:
> > This is the next iteration of implementing fd-based (instead of vma-based)
> > memory for KVM guests.  If you want the full background of why we are doing
> > this, please go read the v10 cover letter[1].
> > 
> > The biggest change from v10 is to implement the backing storage in KVM
> > itself, and expose it via a KVM ioctl() instead of a "generic" syscall.
> > See link[2] for details on why we pivoted to a KVM-specific approach.
> > 
> > Key word is "biggest".  Relative to v10, there are many big changes.
> > Highlights below (I can't remember everything that got changed at
> > this point).
> > 
> > Tagged RFC as there are a lot of empty changelogs, and a lot of missing
> > documentation.  And ideally, we'll have even more tests before merging.
> > There are also several gaps/opens (to be discussed in tomorrow's PUCK).
> 
> As per our discussion on the PUCK call, here are the memory/NUMA accounting 
> related observations that I had while working on SNP guest secure page 
> migration:
> 
> * gmem allocations are currently treated as file page allocations
>   accounted to the kernel and not to the QEMU process.

We need to level set on terminology: these are all *stats*, not accounting.
That distinction matters because we have wiggle room on stats, e.g. we can
probably get away with just about any definition of how guest_memfd memory
impacts stats, so long as the information that is surfaced to userspace is
useful and expected.

But we absolutely need to get accounting correct, specifically the allocations
need to be correctly accounted in memcg.  And unless I'm missing something,
nothing in here shows anything related to memcg.

>   Starting an SNP guest with 40G memory with memory interleave between
>   Node2 and Node3
> 
>   $ numactl -i 2,3 ./bootg_snp.sh
> 
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>  242179 root      20   0   40.4g  99580  51676 S  78.0   0.0   0:56.58 qemu-system-x86
> 
>   -> Incorrect process resident memory and shared memory are reported

I don't know that I would call these "incorrect".  Shared memory definitely is
correct, because by definition guest_memfd isn't shared.  RSS is less clear cut;
gmem memory is resident in RAM, but if we show gmem in RSS then we'll end up with
scenarios where RSS > VIRT, which will be quite confusing for unaware users (I'm
assuming the 40g of VIRT here comes from QEMU mapping the shared half of gmem
memslots).

>   Accounting of the memory happens in the host page fault handler path,
>   but for private guest pages we will never hit that.
> 
> * NUMA allocation does use the process mempolicy for appropriate node 
>   allocation (Node2 and Node3), but they again do not get attributed to 
>   the QEMU process
> 
>   Every 1.0s: sudo numastat -m -p qemu-system-x86 | egrep -i "qemu|PID|Node|Filepage"   gomati: Mon Jul 24 11:51:34 2023
> 
>   Per-node process memory usage (in MBs)
>   PID                        Node 0   Node 1    Node 2    Node 3     Total
>   242179 (qemu-system-x86)    21.14     1.61     39.44     39.38    101.57
>
>   Per-node system memory usage (in MBs):
>                               Node 0   Node 1    Node 2    Node 3     Total
>   FilePages                  2475.63  2395.83  23999.46  23373.22  52244.14
> 
> 
> * Most of the memory accounting relies on the VMAs and as private-fd of 
>   gmem doesn't have a VMA (and that was the design goal), user-space fails 
>   to attribute the memory appropriately to the process.
>
>   /proc//numa_maps
>   7f528be0 interleave:2-3 file=/memfd:memory-backend-memfd-shared\040(deleted) anon=1070 dirty=1070 mapped=1987 mapmax=256 active=1956 N2=582 N3=1405 kernelpagesize_kB=4
>   7f5c9020 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted)
>   7f5c9040 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=32 active=0 N2=32 kernelpagesize_kB=4
>   7f5c9080 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=892 active=0 N2=512 N3=380 kernelpagesize_kB=4
> 
>   /proc//smaps
>   7f528be0-7f5c8be0 rw-p  00:01 26629  /memfd:memory-backend-memfd-shared (deleted)
>   7f5c9020-7f5c9022 rw-s  00:01 44033  /memfd:rom-backend-memfd-shared (deleted)
>   7f5c9040-7f5c9042 rw-s  00:01 44032  /memfd:rom-backend-memfd-shared (deleted)
>   7f5c9080-7f5c90b7c000 rw-s  00:01 1025   /memfd:rom-backend-memfd-shared (deleted)

This is all expected, and IMO correct.  There are no userspace mappings, and so
not accounting anything is working as intended.

> * QEMU based NUMA bindings will not work. Memory backend uses m

Re: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

2023-07-24 Thread Nikunj A. Dadhania
On 7/19/2023 5:14 AM, Sean Christopherson wrote:
> This is the next iteration of implementing fd-based (instead of vma-based)
> memory for KVM guests.  If you want the full background of why we are doing
> this, please go read the v10 cover letter[1].
> 
> The biggest change from v10 is to implement the backing storage in KVM
> itself, and expose it via a KVM ioctl() instead of a "generic" syscall.
> See link[2] for details on why we pivoted to a KVM-specific approach.
> 
> Key word is "biggest".  Relative to v10, there are many big changes.
> Highlights below (I can't remember everything that got changed at
> this point).
> 
> Tagged RFC as there are a lot of empty changelogs, and a lot of missing
> documentation.  And ideally, we'll have even more tests before merging.
> There are also several gaps/opens (to be discussed in tomorrow's PUCK).

As per our discussion on the PUCK call, here are the memory/NUMA accounting 
related observations that I had while working on SNP guest secure page 
migration:

* gmem allocations are currently treated as file page allocations
  accounted to the kernel and not to the QEMU process. 
  
  Starting an SNP guest with 40G memory with memory interleave between
  Node2 and Node3

  $ numactl -i 2,3 ./bootg_snp.sh

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 242179 root      20   0   40.4g  99580  51676 S  78.0   0.0   0:56.58 qemu-system-x86

  -> Incorrect process resident memory and shared memory are reported

  Accounting of the memory happens in the host page fault handler path,
  but for private guest pages we will never hit that.

* NUMA allocation does use the process mempolicy for appropriate node 
  allocation (Node2 and Node3), but they again do not get attributed to 
  the QEMU process

  Every 1.0s: sudo numastat -m -p qemu-system-x86 | egrep -i "qemu|PID|Node|Filepage"   gomati: Mon Jul 24 11:51:34 2023

  Per-node process memory usage (in MBs)
  PID                        Node 0   Node 1    Node 2    Node 3     Total
  242179 (qemu-system-x86)    21.14     1.61     39.44     39.38    101.57

  Per-node system memory usage (in MBs):
                              Node 0   Node 1    Node 2    Node 3     Total
  FilePages                  2475.63  2395.83  23999.46  23373.22  52244.14


* Most of the memory accounting relies on the VMAs and as private-fd of 
  gmem doesn't have a VMA (and that was the design goal), user-space fails 
  to attribute the memory appropriately to the process.

  /proc//numa_maps
  7f528be0 interleave:2-3 file=/memfd:memory-backend-memfd-shared\040(deleted) anon=1070 dirty=1070 mapped=1987 mapmax=256 active=1956 N2=582 N3=1405 kernelpagesize_kB=4
  7f5c9020 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted)
  7f5c9040 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=32 active=0 N2=32 kernelpagesize_kB=4
  7f5c9080 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=892 active=0 N2=512 N3=380 kernelpagesize_kB=4

  /proc//smaps
  7f528be0-7f5c8be0 rw-p  00:01 26629  /memfd:memory-backend-memfd-shared (deleted)
  7f5c9020-7f5c9022 rw-s  00:01 44033  /memfd:rom-backend-memfd-shared (deleted)
  7f5c9040-7f5c9042 rw-s  00:01 44032  /memfd:rom-backend-memfd-shared (deleted)
  7f5c9080-7f5c90b7c000 rw-s  00:01 1025   /memfd:rom-backend-memfd-shared (deleted)

* QEMU based NUMA bindings will not work. The memory backend uses mbind() 
  to set the policy for a particular virtual memory range, but the gmem 
  private-FD does not have a virtual memory range visible in the host 
  (see the sketch below).
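
For reference, a rough sketch of what a memfd-based memory backend effectively
does today (illustrative only, error handling trimmed): mbind() needs the host
virtual address range returned by mmap(), and that is exactly the step that has
no equivalent for a gmem private FD.

	#define _GNU_SOURCE
	#include <numaif.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* Interleave a regular memfd-backed guest RAM region over nodes 2-3. */
	static int bind_backend(size_t size)
	{
		unsigned long nodemask = (1UL << 2) | (1UL << 3);
		int fd = memfd_create("guest-ram", MFD_CLOEXEC);
		void *va;

		if (fd < 0 || ftruncate(fd, size))
			return -1;

		/* The policy attaches to this VA range, not to the fd itself. */
		va = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (va == MAP_FAILED)
			return -1;

		return mbind(va, size, MPOL_INTERLEAVE, &nodemask,
			     8 * sizeof(nodemask), 0);
	}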

Regards,
Nikunj


[RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

2023-07-18 Thread Sean Christopherson
This is the next iteration of implementing fd-based (instead of vma-based)
memory for KVM guests.  If you want the full background of why we are doing
this, please go read the v10 cover letter[1].

The biggest change from v10 is to implement the backing storage in KVM
itself, and expose it via a KVM ioctl() instead of a "generic" syscall.
See link[2] for details on why we pivoted to a KVM-specific approach.

Key word is "biggest".  Relative to v10, there are many big changes.
Highlights below (I can't remember everything that got changed at
this point).

Tagged RFC as there are a lot of empty changelogs, and a lot of missing
documentation.  And ideally, we'll have even more tests before merging.
There are also several gaps/opens (to be discussed in tomorrow's PUCK).

v11:
 - Test private<=>shared conversions *without* doing fallocate()
 - PUNCH_HOLE all memory between iterations of the conversion test so that
   KVM doesn't retain pages in the guest_memfd
 - Rename hugepage control to be a very generic ALLOW_HUGEPAGE, instead of
   giving it a THP or PMD specific name.
 - Fold in fixes from a lot of people (thank you!)
 - Zap SPTEs *before* updating attributes to ensure no weirdness, e.g. if
   KVM handles a page fault and looks at inconsistent attributes
 - Refactor MMU interaction with attributes updates to reuse much of KVM's
   framework for mmu_notifiers.

[1] 
https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.p...@linux.intel.com
[2] https://lore.kernel.org/all/zem5zq8oo+xna...@google.com
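
For orientation, a rough sketch of how userspace is expected to wire the new
ioctls together (names follow the proposed uAPI but field/flag spellings may
differ in detail from this RFC; vm_fd, guest_ram_size, shared_hva, gpa, len and
gmem_offset are placeholders; error handling omitted):

	/* Create the fd that backs guest-private memory for this VM. */
	struct kvm_create_guest_memfd gmem = {
		.size  = guest_ram_size,
		.flags = 0,
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	/* Bind it to a memslot alongside the mmap()'d shared backing. */
	struct kvm_userspace_memory_region2 region = {
		.slot               = 0,
		.flags              = KVM_MEM_PRIVATE,
		.guest_phys_addr    = 0,
		.memory_size        = guest_ram_size,
		.userspace_addr     = (__u64)shared_hva,
		.guest_memfd        = gmem_fd,
		.guest_memfd_offset = 0,
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);

	/* Flip a GPA range private (attributes = 0 converts back to shared). */
	struct kvm_memory_attributes attrs = {
		.address    = gpa,
		.size       = len,
		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
	};
	ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);

	/* Reclaim the private backing pages once they are no longer needed. */
	fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  gmem_offset, len);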

Ackerley Tng (1):
  KVM: selftests: Test KVM exit behavior for private memory/access

Chao Peng (7):
  KVM: Use gfn instead of hva for mmu_notifier_retry
  KVM: Add KVM_EXIT_MEMORY_FAULT exit
  KVM: Introduce per-page memory attributes
  KVM: x86: Disallow hugepages when memory attributes are mixed
  KVM: x86/mmu: Handle page fault for private memory
  KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper
  KVM: selftests: Expand set_memory_region_test to validate
guest_memfd()

Sean Christopherson (18):
  KVM: Wrap kvm_gfn_range.pte in a per-action union
  KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn
ranges
  KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER
  KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to
CONFIG_KVM_GENERIC_MMU_NOTIFIER
  KVM: Introduce KVM_SET_USER_MEMORY_REGION2
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
  security: Export security_inode_init_security_anon() for use by KVM
  KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing
memory
  KVM: Add transparent hugepage support for dedicated guest memory
  KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro
  KVM: Allow arch code to track number of memslot address spaces per VM
  KVM: x86: Add support for "protected VMs" that can utilize private
memory
  KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper
  KVM: selftests: Convert lib's mem regions to
KVM_SET_USER_MEMORY_REGION2
  KVM: selftests: Add support for creating private memslots
  KVM: selftests: Introduce VM "shape" to allow tests to specify the VM
type
  KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data
  KVM: selftests: Add basic selftest for guest_memfd()

Vishal Annapurve (3):
  KVM: selftests: Add helpers to convert guest memory b/w private and
shared
  KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls
(x86)
  KVM: selftests: Add x86-only selftest for private memory conversions

 Documentation/virt/kvm/api.rst| 114 
 arch/arm64/include/asm/kvm_host.h |   2 -
 arch/arm64/kvm/Kconfig|   2 +-
 arch/arm64/kvm/mmu.c  |   2 +-
 arch/mips/include/asm/kvm_host.h  |   2 -
 arch/mips/kvm/Kconfig |   2 +-
 arch/mips/kvm/mmu.c   |   2 +-
 arch/powerpc/include/asm/kvm_host.h   |   2 -
 arch/powerpc/kvm/Kconfig  |   8 +-
 arch/powerpc/kvm/book3s_hv.c  |   2 +-
 arch/powerpc/kvm/powerpc.c|   5 +-
 arch/riscv/include/asm/kvm_host.h |   2 -
 arch/riscv/kvm/Kconfig|   2 +-
 arch/riscv/kvm/mmu.c  |   2 +-
 arch/x86/include/asm/kvm_host.h   |  17 +-
 arch/x86/include/uapi/asm/kvm.h   |   3 +
 arch/x86/kvm/Kconfig  |  14 +-
 arch/x86/kvm/debugfs.c|   2 +-
 arch/x86/kvm/mmu/mmu.c| 287 +++-
 arch/x86/kvm/mmu/mmu_internal.h   |   4 +
 arch/x86/kvm/mmu/mmutrace.h   |   1 +
 arch/x86/kvm/mmu/tdp_mmu.c|   8 +-
 arch/x86/kvm/vmx/vmx.c|  11 +-
 arch/x86/kvm/x86.c|  24 +-
 include/linux/kvm_host.h  | 129 +++-
 include/linux/pagemap.h   |  11 +
 i