Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-20 Thread Sean Christopherson
On Tue, Apr 20, 2021, Kirill A. Shutemov wrote:
> On Mon, Apr 19, 2021 at 08:09:13PM +, Sean Christopherson wrote:
> > On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> > > The critical question is whether we ever need to translate hva->pfn after
> > > the page is added to the guest private memory. I believe we do, but I
> > > never checked. And that's the reason we need to keep hwpoison entries
> > > around, which encode pfn.
> > 
> > As proposed in the TDX RFC, KVM would "need" the hva->pfn translation if the
> > guest private EPT entry was zapped, e.g. by NUMA balancing (which will fail on
> > the backend).  But in that case, KVM still has the original PFN, the "new"
> > translation becomes a sanity check to make sure that the zapped translation
> > wasn't moved unexpectedly.
> > 
> > Regardless, I don't see what that has to do with kvm_pfn_map.  At some point,
> > gup() has to fault in the page or look at the host PTE value.  For the latter,
> > at least on x86, we can throw info into the PTE itself to tag it as guest-only.
> > No matter what implementation we settle on, I think we've failed if we end up in
> > a situation where the primary MMU has pages it doesn't know are guest-only.
> 
I'm trying to understand whether it's a problem if KVM sees a guest-only PTE
that belongs to another VM, e.g. if two VMs try to use the same tmpfs file as
guest memory. We cannot insert the pfn into two TD/SEV guests at once, but can
it cause other problems? I'm not sure.

For TDX and SNP, "firmware" will prevent assigning the same PFN to multiple VMs.

For SEV and SEV-ES, the PSP (what I'm calling "firmware") will not prevent
assigning the same page to multiple guests.  But the failure mode in that case,
assuming the guests have different ASIDs, is limited to corruption of the guest.

On the other hand, for SEV/SEV-ES it's not invalid to assign the same ASID to
multiple guests (there's an in-flight patch to do exactly that[*]), and sharing
PFNs between guests with the same ASID would also be valid.  In other words, if
we want to enforce PFN association in the kernel, I think the association should
be per-ASID, not per-KVM guest.

So, I don't think we _need_ to rely on the TDX/SNP behavior, but if leveraging
firmware to handle those checks means avoiding additional complexity in the
kernel, then I think it's worth leaning on firmware even if it means SEV/SEV-ES
don't enjoy the same level of robustness.

[*] https://lkml.kernel.org/r/20210408223214.2582277-1-na...@google.com


Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-19 Thread Kirill A. Shutemov
On Mon, Apr 19, 2021 at 08:09:13PM +, Sean Christopherson wrote:
> On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> > On Mon, Apr 19, 2021 at 06:09:29PM +, Sean Christopherson wrote:
> > > On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> > > > On Mon, Apr 19, 2021 at 04:01:46PM +, Sean Christopherson wrote:
> > > > > But fundamentally the private pages, are well, private.  They can't be shared
> > > > > across processes, so I think we could (should?) require the VMA to always be
> > > > > MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. is that
> > > > > enough to prevent userspace and unaware kernel code from acquiring a reference
> > > > > to the underlying page?
> > > > 
> > > > Shared pages should be fine too (you folks wanted tmpfs support).
> > > 
> > > Is that a conflict though?  If the private->shared conversion request is kicked
> > > out to userspace, then userspace can re-mmap() the files as MAP_SHARED, no?
> > > 
> > > Allowing MAP_SHARED for guest private memory feels wrong.  The data can't be
> > > shared, and dirty data can't be written back to the file.
> > 
> > It can be remapped, but faulting in the page would produce hwpoison entry.
> 
> It sounds like you're thinking the whole tmpfs file is poisoned.

No, the file is not poisoned. Pages get poisoned when they are faulted in
through a flagged VMA. Different VMs can use the same file as long as their
offsets do not overlap.

> My thought is that userspace would need to do something like the following for
> guest private memory:
> 
>   mmap(NULL, guest_size, PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_GUEST_ONLY, fd, 0);
> 
> The MAP_GUEST_ONLY would be used by the kernel to ensure the resulting VMA can
> only point at private/poisoned memory, e.g. on fault, the associated PFN would
> be tagged with PG_hwpoison or whatever.  @fd in this case could point at tmpfs,
> but I don't think it's a hard requirement.
> 
> On conversion to shared, userspace could then do:
> 
>   munmap(<addr>, <size>)
>   mmap(<addr>, <size>, PROT_READ|PROT_WRITE, MAP_SHARED | MAP_FIXED_NOREPLACE, fd, <offset>);
> 
> or
> 
>   mmap(<addr>, <size>, PROT_READ|PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, <offset>);
> 

I played with this variant before, but initiated from the kernel. It should
work fine.

> or
> 
>   ioctl(kvm, KVM_SET_USER_MEMORY_REGION, <delete private memslot>);
>   mmap(NULL, <size>, PROT_READ|PROT_WRITE, MAP_SHARED, fd, <offset>);
>   ioctl(kvm, KVM_SET_USER_MEMORY_REGION, <create shared memslot>);
> 
> Combinations would also work, e.g. unmap the private range and move the memslot.
> The private and shared memory regions could also be backed differently, e.g.
> tmpfs for shared memory, anonymous for private memory.

Right. The kernel has to be flexible enough to support any of these schemes.

> 
> > I don't see another way to make Google's use-case with tmpfs-backed guest
> > memory work.
> 
> The underlying use-case is to be able to access guest memory from more than one
> process, e.g. so that communication with the guest isn't limited to the VMM
> process associated with the KVM instances.  By definition, guest private memory
> can't be accessed by the host; I don't see how anyone, Google included, can have
> any real requirements about
> 
> > > > The poisoned pages must be useless outside of the process with the blessed
> > > > struct kvm. See kvm_pfn_map in the patch.
> > > 
> > > The big requirement for kernel TDX support is that the pages are useless in the
> > > host.  Regarding the guest, for TDX, the TDX Module guarantees that at most a
> > > single KVM guest can have access to a page at any given time.  I believe the RMP
> > > provides the same guarantees for SEV-SNP.
> > > 
> > > SEV/SEV-ES could still end up with corruption if multiple guests map the same
> > > private page, but that's obviously not the end of the world since it's the status
> > > quo today.  Living with that shortcoming might be a worthy tradeoff if punting
> > > mutual exclusion between guests to firmware/hardware allows us to simplify the
> > > kernel implementation.
> > 
> > The critical question is whether we ever need to translate hva->pfn after
> > the page is added to the guest private memory. I believe we do, but I
> > never checked. And that's the reason we need to keep hwpoison entries
> > around, which encode pfn.
> 
> As proposed in the TDX RFC, KVM would "need" the hva->pfn translation if the
> guest private EPT entry was zapped, e.g. by NUMA balancing (which will fail on
> the backend).  But in that case, KVM still has the original PFN, the "new"
> translation becomes a sanity check to make sure that the zapped translation
> wasn't moved unexpectedly.
> 
> Regardless, I don't see what that has to do with kvm_pfn_map.  At some point,
> gup() has to fault in the page or look at the host PTE value.  For the latter,
> at least on x86, we can throw info into the PTE itself to tag it as guest-only.
> No matter what 

Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-19 Thread Sean Christopherson
On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> On Mon, Apr 19, 2021 at 06:09:29PM +, Sean Christopherson wrote:
> > On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> > > On Mon, Apr 19, 2021 at 04:01:46PM +, Sean Christopherson wrote:
> > > > But fundamentally the private pages, are well, private.  They can't be shared
> > > > across processes, so I think we could (should?) require the VMA to always be
> > > > MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. is that
> > > > enough to prevent userspace and unaware kernel code from acquiring a reference
> > > > to the underlying page?
> > > 
> > > Shared pages should be fine too (you folks wanted tmpfs support).
> > 
> > Is that a conflict though?  If the private->shared conversion request is kicked
> > out to userspace, then userspace can re-mmap() the files as MAP_SHARED, no?
> > 
> > Allowing MAP_SHARED for guest private memory feels wrong.  The data can't be
> > shared, and dirty data can't be written back to the file.
> 
> It can be remapped, but faulting in the page would produce hwpoison entry.

It sounds like you're thinking the whole tmpfs file is poisoned.  My thought is
that userspace would need to do something like the following for guest private
memory:

mmap(NULL, guest_size, PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_GUEST_ONLY, fd, 0);

The MAP_GUEST_ONLY would be used by the kernel to ensure the resulting VMA can
only point at private/poisoned memory, e.g. on fault, the associated PFN would
be tagged with PG_hwpoison or whatever.  @fd in this case could point at tmpfs,
but I don't think it's a hard requirement.

On conversion to shared, userspace could then do:

munmap(<addr>, <size>)
mmap(<addr>, <size>, PROT_READ|PROT_WRITE, MAP_SHARED | MAP_FIXED_NOREPLACE, fd, <offset>);

or

mmap(<addr>, <size>, PROT_READ|PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, <offset>);

or

ioctl(kvm, KVM_SET_USER_MEMORY_REGION, <delete private memslot>);
mmap(NULL, <size>, PROT_READ|PROT_WRITE, MAP_SHARED, fd, <offset>);
ioctl(kvm, KVM_SET_USER_MEMORY_REGION, <create shared memslot>);

Combinations would also work, e.g. unmap the private range and move the memslot.
The private and shared memory regions could also be backed differently, e.g.
tmpfs for shared memory, anonymous for private memory.

> I don't see another way to make Google's use-case with tmpfs-backed guest
> memory work.

The underlying use-case is to be able to access guest memory from more than one
process, e.g. so that communication with the guest isn't limited to the VMM
process associated with the KVM instances.  By definition, guest private memory
can't be accessed by the host; I don't see how anyone, Google included, can have
any real requirements about

> > > The poisoned pages must be useless outside of the process with the blessed
> > > struct kvm. See kvm_pfn_map in the patch.
> > 
> > The big requirement for kernel TDX support is that the pages are useless in the
> > host.  Regarding the guest, for TDX, the TDX Module guarantees that at most a
> > single KVM guest can have access to a page at any given time.  I believe the RMP
> > provides the same guarantees for SEV-SNP.
> > 
> > SEV/SEV-ES could still end up with corruption if multiple guests map the same
> > private page, but that's obviously not the end of the world since it's the status
> > quo today.  Living with that shortcoming might be a worthy tradeoff if punting
> > mutual exclusion between guests to firmware/hardware allows us to simplify the
> > kernel implementation.
> 
> The critical question is whether we ever need to translate hva->pfn after
> the page is added to the guest private memory. I believe we do, but I
> never checked. And that's the reason we need to keep hwpoison entries
> around, which encode pfn.

As proposed in the TDX RFC, KVM would "need" the hva->pfn translation if the
guest private EPT entry was zapped, e.g. by NUMA balancing (which will fail on
the backend).  But in that case, KVM still has the original PFN, the "new"
translation becomes a sanity check to make sure that the zapped translation
wasn't moved unexpectedly.

Regardless, I don't see what that has to do with kvm_pfn_map.  At some point,
gup() has to fault in the page or look at the host PTE value.  For the latter,
at least on x86, we can throw info into the PTE itself to tag it as guest-only.
No matter what implementation we settle on, I think we've failed if we end up in
a situation where the primary MMU has pages it doesn't know are guest-only.

> If we don't, it would simplify the solution: kvm_pfn_map is not needed.
> A single bit per page would be enough.


Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-19 Thread Kirill A. Shutemov
On Mon, Apr 19, 2021 at 06:09:29PM +, Sean Christopherson wrote:
> On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> > On Mon, Apr 19, 2021 at 04:01:46PM +, Sean Christopherson wrote:
> > > But fundamentally the private pages, are well, private.  They can't be shared
> > > across processes, so I think we could (should?) require the VMA to always be
> > > MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. is that
> > > enough to prevent userspace and unaware kernel code from acquiring a reference
> > > to the underlying page?
> > 
> > Shared pages should be fine too (you folks wanted tmpfs support).
> 
> Is that a conflict though?  If the private->shared conversion request is kicked
> out to userspace, then userspace can re-mmap() the files as MAP_SHARED, no?
> 
> Allowing MAP_SHARED for guest private memory feels wrong.  The data can't be
> shared, and dirty data can't be written back to the file.

It can be remapped, but faulting in the page would produce a hwpoison entry.
I don't see another way to make Google's use-case with tmpfs-backed guest
memory work.

> > The poisoned pages must be useless outside of the process with the blessed
> > struct kvm. See kvm_pfn_map in the patch.
> 
> The big requirement for kernel TDX support is that the pages are useless in the
> host.  Regarding the guest, for TDX, the TDX Module guarantees that at most a
> single KVM guest can have access to a page at any given time.  I believe the RMP
> provides the same guarantees for SEV-SNP.
> 
> SEV/SEV-ES could still end up with corruption if multiple guests map the same
> private page, but that's obviously not the end of the world since it's the status
> quo today.  Living with that shortcoming might be a worthy tradeoff if punting
> mutual exclusion between guests to firmware/hardware allows us to simplify the
> kernel implementation.

The critical question is whether we ever need to translate hva->pfn after
the page is added to the guest private memory. I believe we do, but I
never checked. And that's the reason we need to keep hwpoison entries
around, which encode pfn.

If we don't, it would simplify the solution: kvm_pfn_map is not needed.
A single bit per page would be enough.

> > > >  - Add a new GUP flag to retrieve such pages from the userspace mapping.
> > > >Used only for private mapping population.
> > > 
> > > >  - Shared gfn ranges managed by userspace, based on hypercalls from the
> > > >guest.
> > > > 
> > > >  - Shared mappings get populated via normal VMA. Any poisoned pages here
> > > >would lead to SIGBUS.
> > > > 
> > > > So far it looks pretty straightforward.
> > > > 
> > > > The only thing that I don't understand is at what point the page gets tied
> > > > to the KVM instance. Currently we do it just before populating shadow
> > > > entries, but it would not work with the new scheme: as we poison pages
> > > > on fault, they may never get inserted into shadow entries. That's not
> > > > good, as we rely on that info to unpoison the page on free.
> > > 
> > > Can you elaborate on what you mean by "unpoison"?  If the page is never actually
> > > mapped into the guest, then its poisoned status is nothing more than a software
> > > flag, i.e. nothing extra needs to be done on free.
> > 
> > Normally, the poisoned flag is preserved for freed pages, as it usually indicates
> > a hardware issue. In this case we need to return the page to normal circulation,
> > so we need a way to differentiate the two kinds of page poison. The current patch
> > does this by adding the page's pfn to kvm_pfn_map. But this will not work if
> > we uncouple poisoning from adding to the shadow PTE.
> 
> Why use PG_hwpoison then?

Page flags are scarce. I don't want to occupy a new one until I'm sure I must.

And we can re-use existing infrastructure to SIGBUS on access to such
pages.

-- 
 Kirill A. Shutemov


Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-19 Thread David Hildenbrand

On 19.04.21 20:09, Sean Christopherson wrote:

On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:

On Mon, Apr 19, 2021 at 04:01:46PM +, Sean Christopherson wrote:

But fundamentally the private pages, are well, private.  They can't be shared
across processes, so I think we could (should?) require the VMA to always be
MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. is that
enough to prevent userspace and unaware kernel code from acquiring a reference
to the underlying page?


Shared pages should be fine too (you folks wanted tmpfs support).


Is that a conflict though?  If the private->shared conversion request is kicked
out to userspace, then userspace can re-mmap() the files as MAP_SHARED, no?

Allowing MAP_SHARED for guest private memory feels wrong.  The data can't be
shared, and dirty data can't be written back to the file.


The poisoned pages must be useless outside of the process with the blessed
struct kvm. See kvm_pfn_map in the patch.


The big requirement for kernel TDX support is that the pages are useless in the
host.  Regarding the guest, for TDX, the TDX Module guarantees that at most a
single KVM guest can have access to a page at any given time.  I believe the RMP
provides the same guarantees for SEV-SNP.

SEV/SEV-ES could still end up with corruption if multiple guests map the same
private page, but that's obviously not the end of the world since it's the status
quo today.  Living with that shortcoming might be a worthy tradeoff if punting
mutual exclusion between guests to firmware/hardware allows us to simplify the
kernel implementation.


  - Add a new GUP flag to retrieve such pages from the userspace mapping.
Used only for private mapping population.



  - Shared gfn ranges managed by userspace, based on hypercalls from the
guest.

  - Shared mappings get populated via normal VMA. Any poisoned pages here
would lead to SIGBUS.

So far it looks pretty straightforward.

The only thing that I don't understand is at what point the page gets tied
to the KVM instance. Currently we do it just before populating shadow
entries, but it would not work with the new scheme: as we poison pages
on fault, they may never get inserted into shadow entries. That's not
good, as we rely on that info to unpoison the page on free.


Can you elaborate on what you mean by "unpoison"?  If the page is never actually
mapped into the guest, then its poisoned status is nothing more than a software
flag, i.e. nothing extra needs to be done on free.


Normally, the poisoned flag is preserved for freed pages, as it usually indicates
a hardware issue. In this case we need to return the page to normal circulation,
so we need a way to differentiate the two kinds of page poison. The current patch
does this by adding the page's pfn to kvm_pfn_map. But this will not work if
we uncouple poisoning from adding to the shadow PTE.


Why use PG_hwpoison then?



I already raised that reusing PG_hwpoison is not what we want. And I
repeat, to me this all looks like a big hack; some of the things you (Sean)
propose sound cleaner, at least to me.


--
Thanks,

David / dhildenb



Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-19 Thread Sean Christopherson
On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> On Mon, Apr 19, 2021 at 04:01:46PM +, Sean Christopherson wrote:
> > But fundamentally the private pages, are well, private.  They can't be shared
> > across processes, so I think we could (should?) require the VMA to always be
> > MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. is that
> > enough to prevent userspace and unaware kernel code from acquiring a reference
> > to the underlying page?
> 
> Shared pages should be fine too (you folks wanted tmpfs support).

Is that a conflict though?  If the private->shared conversion request is kicked
out to userspace, then userspace can re-mmap() the files as MAP_SHARED, no?

Allowing MAP_SHARED for guest private memory feels wrong.  The data can't be
shared, and dirty data can't be written back to the file.

> The poisoned pages must be useless outside of the process with the blessed
> struct kvm. See kvm_pfn_map in the patch.

The big requirement for kernel TDX support is that the pages are useless in the
host.  Regarding the guest, for TDX, the TDX Module guarantees that at most a
single KVM guest can have access to a page at any given time.  I believe the RMP
provides the same guarantees for SEV-SNP.

SEV/SEV-ES could still end up with corruption if multiple guests map the same
private page, but that's obviously not the end of the world since it's the status
quo today.  Living with that shortcoming might be a worthy tradeoff if punting
mutual exclusion between guests to firmware/hardware allows us to simplify the
kernel implementation.

> > >  - Add a new GUP flag to retrieve such pages from the userspace mapping.
> > >Used only for private mapping population.
> > 
> > >  - Shared gfn ranges managed by userspace, based on hypercalls from the
> > >guest.
> > > 
> > >  - Shared mappings get populated via normal VMA. Any poisoned pages here
> > >would lead to SIGBUS.
> > > 
> > > So far it looks pretty straightforward.
> > > 
> > > The only thing that I don't understand is at what point the page gets tied
> > > to the KVM instance. Currently we do it just before populating shadow
> > > entries, but it would not work with the new scheme: as we poison pages
> > > on fault, they may never get inserted into shadow entries. That's not
> > > good, as we rely on that info to unpoison the page on free.
> > 
> > Can you elaborate on what you mean by "unpoison"?  If the page is never actually
> > mapped into the guest, then its poisoned status is nothing more than a software
> > flag, i.e. nothing extra needs to be done on free.
> 
> Normally, the poisoned flag is preserved for freed pages, as it usually indicates
> a hardware issue. In this case we need to return the page to normal circulation,
> so we need a way to differentiate the two kinds of page poison. The current patch
> does this by adding the page's pfn to kvm_pfn_map. But this will not work if
> we uncouple poisoning from adding to the shadow PTE.

Why use PG_hwpoison then?


Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-19 Thread Kirill A. Shutemov
On Mon, Apr 19, 2021 at 04:01:46PM +, Sean Christopherson wrote:
> On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> > On Fri, Apr 16, 2021 at 05:30:30PM +, Sean Christopherson wrote:
> > > I like the idea of using "special" PTE value to denote guest private memory,
> > > e.g. in this RFC, HWPOISON.  But I strongly dislike having KVM involved in the
> > > manipulation of the special flag/value.
> > > 
> > > Today, userspace owns the gfn->hva translations and the kernel effectively owns
> > > the hva->pfn translations (with input from userspace).  KVM just connects the
> > > dots.
> > > 
> > > Having KVM own the shared/private transitions means KVM is now part owner of the
> > > entire gfn->hva->pfn translation, i.e. KVM is effectively now a secondary MMU
> > > and a co-owner of the primary MMU.  This creates locking madness, e.g. KVM taking
> > > mmap_sem for write, mmu_lock under page lock, etc..., and also takes control away
> > > from userspace.  E.g. userspace strategy could be to use a separate backing/pool
> > > for shared memory and change the gfn->hva translation (memslots) in reaction to
> > > a shared/private conversion.  Automatically swizzling things in KVM takes away
> > > that option.
> > > 
> > > IMO, KVM should be entirely "passive" in this process, e.g. the guest shares or
> > > protects memory, userspace calls into the kernel to change state, and the kernel
> > > manages the page tables to prevent bad actors.  KVM simply does the plumbing for
> > > the guest page tables.
> > 
> > That's a new perspective for me. Very interesting.
> > 
> > Let's see how it can look like:
> > 
> >  - KVM only allows poisoned pages (or whatever flag we end up using for
> >protection) in the private mappings. SIGBUS otherwise.
> > 
> >  - Poisoned pages must be tied to the KVM instance to be allowed in the
> >private mappings. Like kvm->id in the current prototype. SIGBUS
> >otherwise.
> > 
> >  - Pages get poisoned on fault in if the VMA has a new vmflag set.
> > 
> >  - Fault in of a poisoned page leads to hwpoison entry. Userspace cannot
> >access such pages.
> > 
> >  - Poisoned pages produced this way get unpoisoned on free.
> > 
> >  - The new VMA flag set by userspace. mprotect(2)?
> 
> Ya, or mmap(), though I'm not entirely sure a VMA flag would suffice.  The
> notion of the page being private is tied to the PFN, which would suggest "struct
> page" needs to be involved.

PG_hwpoison will be set on the page, so it's tied to pfn.

> But fundamentally the private pages, are well, private.  They can't be shared
> across processes, so I think we could (should?) require the VMA to always be
> MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. is that
> enough to prevent userspace and unaware kernel code from acquiring a reference
> to the underlying page?

Shared pages should be fine too (you folks wanted tmpfs support).

The poisoned pages must be useless outside of the process with the blessed
struct kvm. See kvm_pfn_map in the patch.

> >  - Add a new GUP flag to retrieve such pages from the userspace mapping.
> >Used only for private mapping population.
> 
> >  - Shared gfn ranges managed by userspace, based on hypercalls from the
> >guest.
> > 
> >  - Shared mappings get populated via normal VMA. Any poisoned pages here
> >would lead to SIGBUS.
> > 
> > So far it looks pretty straightforward.
> > 
> > The only thing that I don't understand is at what point the page gets tied
> > to the KVM instance. Currently we do it just before populating shadow
> > entries, but it would not work with the new scheme: as we poison pages
> > on fault, they may never get inserted into shadow entries. That's not
> > good, as we rely on that info to unpoison the page on free.
> 
> Can you elaborate on what you mean by "unpoison"?  If the page is never actually
> mapped into the guest, then its poisoned status is nothing more than a software
> flag, i.e. nothing extra needs to be done on free.

Normally, the poisoned flag is preserved for freed pages, as it usually indicates
a hardware issue. In this case we need to return the page to normal circulation,
so we need a way to differentiate the two kinds of page poison. The current patch
does this by adding the page's pfn to kvm_pfn_map. But this will not work if
we uncouple poisoning from adding to the shadow PTE.

-- 
 Kirill A. Shutemov


Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-19 Thread Sean Christopherson
On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> On Fri, Apr 16, 2021 at 05:30:30PM +, Sean Christopherson wrote:
> > I like the idea of using "special" PTE value to denote guest private memory,
> > e.g. in this RFC, HWPOISON.  But I strongly dislike having KVM involved in the
> > manipulation of the special flag/value.
> > 
> > Today, userspace owns the gfn->hva translations and the kernel effectively owns
> > the hva->pfn translations (with input from userspace).  KVM just connects the
> > dots.
> > 
> > Having KVM own the shared/private transitions means KVM is now part owner of the
> > entire gfn->hva->pfn translation, i.e. KVM is effectively now a secondary MMU
> > and a co-owner of the primary MMU.  This creates locking madness, e.g. KVM taking
> > mmap_sem for write, mmu_lock under page lock, etc..., and also takes control away
> > from userspace.  E.g. userspace strategy could be to use a separate backing/pool
> > for shared memory and change the gfn->hva translation (memslots) in reaction to
> > a shared/private conversion.  Automatically swizzling things in KVM takes away
> > that option.
> > 
> > IMO, KVM should be entirely "passive" in this process, e.g. the guest shares or
> > protects memory, userspace calls into the kernel to change state, and the kernel
> > manages the page tables to prevent bad actors.  KVM simply does the plumbing for
> > the guest page tables.
> 
> That's a new perspective for me. Very interesting.
> 
> Let's see how it can look like:
> 
>  - KVM only allows poisoned pages (or whatever flag we end up using for
>protection) in the private mappings. SIGBUS otherwise.
> 
>  - Poisoned pages must be tied to the KVM instance to be allowed in the
>private mappings. Like kvm->id in the current prototype. SIGBUS
>otherwise.
> 
>  - Pages get poisoned on fault in if the VMA has a new vmflag set.
> 
>  - Fault in of a poisoned page leads to hwpoison entry. Userspace cannot
>access such pages.
> 
>  - Poisoned pages produced this way get unpoisoned on free.
> 
>  - The new VMA flag set by userspace. mprotect(2)?

Ya, or mmap(), though I'm not entirely sure a VMA flag would suffice.  The
notion of the page being private is tied to the PFN, which would suggest "struct
page" needs to be involved.

But fundamentally the private pages, are well, private.  They can't be shared
across processes, so I think we could (should?) require the VMA to always be
MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. is that
enough to prevent userspace and unaware kernel code from acquiring a reference
to the underlying page?

>  - Add a new GUP flag to retrieve such pages from the userspace mapping.
>Used only for private mapping population.

>  - Shared gfn ranges managed by userspace, based on hypercalls from the
>guest.
> 
>  - Shared mappings get populated via normal VMA. Any poisoned pages here
>would lead to SIGBUS.
> 
> So far it looks pretty straightforward.
> 
> The only thing that I don't understand is at what point the page gets tied
> to the KVM instance. Currently we do it just before populating shadow
> entries, but it would not work with the new scheme: as we poison pages
> on fault, they may never get inserted into shadow entries. That's not
> good, as we rely on that info to unpoison the page on free.

Can you elaborate on what you mean by "unpoison"?  If the page is never actually
mapped into the guest, then its poisoned status is nothing more than a software
flag, i.e. nothing extra needs to be done on free.  If the page is mapped into
the guest, then KVM can be made responsible for reinitializing the page with
keyid=0 when the page is removed from the guest.

The TDX Module prevents mapping the same PFN into multiple guests, so the kernel
doesn't actually have to care _which_ KVM instance(s) is associated with a page,
it only needs to prevent installing valid PTEs in the host page tables.

> Maybe we should tie VMA to the KVM instance on setting the vmflags?
> I donno.
> 
> Any comments?
> 
> -- 
>  Kirill A. Shutemov


Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-19 Thread Kirill A. Shutemov
On Fri, Apr 16, 2021 at 05:30:30PM +, Sean Christopherson wrote:
> On Fri, Apr 16, 2021, Kirill A. Shutemov wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 1b404e4d7dd8..f8183386abe7 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -8170,6 +8170,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> > kvm_sched_yield(vcpu->kvm, a0);
> > ret = 0;
> > break;
> > +   case KVM_HC_ENABLE_MEM_PROTECTED:
> > +   ret = kvm_protect_memory(vcpu->kvm);
> > +   break;
> > +   case KVM_HC_MEM_SHARE:
> > +   ret = kvm_share_memory(vcpu->kvm, a0, a1);
> 
> Can you take a look at a proposed hypercall interface for SEV live migration and
> holler if you (or anyone else) thinks it will have extensibility issues?
> 
> https://lkml.kernel.org/r/93d7f2c2888315adc48905722574d89699edde33.1618498113.git.ashish.ka...@amd.com

Will look closer. Thanks.

> > @@ -1868,11 +1874,17 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > flags |= FOLL_WRITE;
> > if (async)
> > flags |= FOLL_NOWAIT;
> > +   if (kvm->mem_protected)
> > +   flags |= FOLL_ALLOW_POISONED;
> 
> This is unsafe, only the flows that are mapping the PFN into the guest should
> use ALLOW_POISONED, e.g. __kvm_map_gfn() should fail on a poisoned page.

That's true for TDX. I prototyped with pure KVM with minimal modifications
to the guest, and we had to be more permissive for that reason. It will go
away for TDX.
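[Editor's note: the gating Sean asks for — only guest-mapping flows may pass the RFC's FOLL_ALLOW_POISONED — can be modeled stand-alone. In the sketch below, `gup_flags_for()` and the flag values are illustrative, not kernel code:]

```c
#include <stdbool.h>

/* Model flag values; the real FOLL_* constants live in include/linux/mm.h. */
#define FOLL_WRITE            0x01u
#define FOLL_NOWAIT           0x02u
#define FOLL_ALLOW_POISONED   0x04u   /* hypothetical flag from this RFC */

/* Only flows that map the PFN into the guest may tolerate poisoned pages;
 * host-side accessors such as __kvm_map_gfn() must never set the flag. */
static unsigned int gup_flags_for(bool mapping_into_guest, bool mem_protected)
{
    unsigned int flags = FOLL_WRITE;

    if (mapping_into_guest && mem_protected)
        flags |= FOLL_ALLOW_POISONED;
    return flags;
}
```

The point of the split is that a host accessor calling with `mapping_into_guest == false` can never obtain a poisoned page, regardless of the VM's protection state.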

> > -static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> > -void *data, int offset, int len)
> > +int copy_from_guest(struct kvm *kvm, void *data, unsigned long hva, int len)
> > +{
> > +   int offset = offset_in_page(hva);
> > +   struct page *page;
> > +   int npages, seg;
> > +   void *vaddr;
> > +
> > +   if (!IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) ||
> > +   !kvm->mem_protected) {
> > +   return __copy_from_user(data, (void __user *)hva, len);
> > +   }
> > +
> > +   might_fault();
> > +   kasan_check_write(data, len);
> > +   check_object_size(data, len, false);
> > +
> > +   while ((seg = next_segment(len, offset)) != 0) {
> > +   npages = get_user_pages_unlocked(hva, 1, &page,
> > +FOLL_ALLOW_POISONED);
> > +   if (npages != 1)
> > +   return -EFAULT;
> > +
> > +   if (!kvm_page_allowed(kvm, page))
> > +   return -EFAULT;
> > +
> > +   vaddr = kmap_atomic(page);
> > +   memcpy(data, vaddr + offset, seg);
> > +   kunmap_atomic(vaddr);
> 
> Why is KVM allowed to access a poisoned page?  I would expect shared pages to
> _not_ be poisoned.  Except for pure software emulation of SEV, KVM can't access
> guest private memory.

Again, it's not going to be in TDX implementation.


> I like the idea of using "special" PTE value to denote guest private memory,
> e.g. in this RFC, HWPOISON.  But I strongly dislike having KVM involved in the
> manipulation of the special flag/value.
>
> Today, userspace owns the gfn->hva translations and the kernel effectively owns
> the hva->pfn translations (with input from userspace).  KVM just connects the
> dots.
>
> Having KVM own the shared/private transitions means KVM is now part owner of the
> entire gfn->hva->pfn translation, i.e. KVM is effectively now a secondary MMU
> and a co-owner of the primary MMU.  This creates locking madness, e.g. KVM taking
> mmap_sem for write, mmu_lock under page lock, etc..., and also takes control away
> from userspace.  E.g. userspace strategy could be to use a separate backing/pool
> for shared memory and change the gfn->hva translation (memslots) in reaction to
> a shared/private conversion.  Automatically swizzling things in KVM takes away
> that option.
>
> IMO, KVM should be entirely "passive" in this process, e.g. the guest shares or
> protects memory, userspace calls into the kernel to change state, and the kernel
> manages the page tables to prevent bad actors.  KVM simply does the plumbing for
> the guest page tables.

That's a new perspective for me. Very interesting.

Let's see what it can look like:

 - KVM only allows poisoned pages (or whatever flag we end up using for
   protection) in the private mappings. SIGBUS otherwise.

 - Poisoned pages must be tied to the KVM instance to be allowed in the
   private mappings. Like kvm->id in the current prototype. SIGBUS
   otherwise.

 - Pages get poisoned on fault-in if the VMA has the new vmflag set.

 - Fault-in of a poisoned page leads to a hwpoison entry. Userspace cannot
   access such pages.

 - Poisoned pages produced this way get unpoisoned on free.

 - The new VMA flag is set by userspace. mprotect(2)?

 - Add a new GUP flag to retrieve such pages from the userspace mapping.
   Used only for private mapping population.

 - 
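[Editor's note: the lifecycle sketched in the (truncated) list above can be modeled in a few lines of user-space C. All names here — `page_model`, `fault_in_private`, `can_populate_private` — are illustrative, not proposed kernel APIs:]

```c
#include <stdbool.h>
#include <stddef.h>

/* Model of a page's state under the proposed scheme. */
enum page_state { PAGE_NORMAL, PAGE_GUEST_POISONED };

struct page_model {
    enum page_state state;
    void *kvm_id;   /* KVM instance the poisoned page is tied to */
};

/* Fault-in through a VMA with the new vmflag: poison the page and tie it
 * to the faulting KVM instance. */
static void fault_in_private(struct page_model *p, void *kvm_id)
{
    p->state = PAGE_GUEST_POISONED;
    p->kvm_id = kvm_id;
}

/* Private-mapping population succeeds only for pages poisoned for the same
 * KVM instance; anything else would be a SIGBUS under the proposal. */
static bool can_populate_private(const struct page_model *p, void *kvm_id)
{
    return p->state == PAGE_GUEST_POISONED && p->kvm_id == kvm_id;
}

/* On free, poison produced this way is cleared ("unpoisoned"). */
static void free_page_model(struct page_model *p)
{
    p->state = PAGE_NORMAL;
    p->kvm_id = NULL;
}
```

This also makes the open question above concrete: `fault_in_private()` needs a `kvm_id` at fault time, which is exactly the "when does the page get tied to the KVM instance" problem.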

Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-19 Thread Xiaoyao Li

On 4/17/2021 1:30 AM, Sean Christopherson wrote:

On Fri, Apr 16, 2021, Kirill A. Shutemov wrote:

[...]

index fadaccb95a4c..cd2374802702 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -436,6 +436,8 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
  }
  #endif
  
+#define KVM_NR_SHARED_RANGES 32
+
  /*
   * Note:
   * memslots are not sorted by id anymore, please use id_to_memslot()
@@ -513,6 +515,10 @@ struct kvm {
pid_t userspace_pid;
unsigned int max_halt_poll_ns;
u32 dirty_ring_size;
+   bool mem_protected;
+   void *id;
+   int nr_shared_ranges;
+   struct range shared_ranges[KVM_NR_SHARED_RANGES];


Hard no for me.  IMO, anything that requires KVM to track shared/pinned pages in
a separate tree/array is non-starter.  More specific to TDX #MCs, KVM should not
be the canonical reference for the state of a page.



Do you mean something in struct page to track if the page is shared or private?


Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-16 Thread Sean Christopherson
On Fri, Apr 16, 2021, Kirill A. Shutemov wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1b404e4d7dd8..f8183386abe7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8170,6 +8170,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>   kvm_sched_yield(vcpu->kvm, a0);
>   ret = 0;
>   break;
> + case KVM_HC_ENABLE_MEM_PROTECTED:
> + ret = kvm_protect_memory(vcpu->kvm);
> + break;
> + case KVM_HC_MEM_SHARE:
> + ret = kvm_share_memory(vcpu->kvm, a0, a1);

Can you take a look at a proposed hypercall interface for SEV live migration and
holler if you (or anyone else) thinks it will have extensibility issues?

https://lkml.kernel.org/r/93d7f2c2888315adc48905722574d89699edde33.1618498113.git.ashish.ka...@amd.com

> + break;
>   default:
>   ret = -KVM_ENOSYS;
>   break;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index fadaccb95a4c..cd2374802702 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -436,6 +436,8 @@ static inline int kvm_arch_vcpu_memslots_id(struct 
> kvm_vcpu *vcpu)
>  }
>  #endif
>  
> +#define KVM_NR_SHARED_RANGES 32
> +
>  /*
>   * Note:
>   * memslots are not sorted by id anymore, please use id_to_memslot()
> @@ -513,6 +515,10 @@ struct kvm {
>   pid_t userspace_pid;
>   unsigned int max_halt_poll_ns;
>   u32 dirty_ring_size;
> + bool mem_protected;
> + void *id;
> + int nr_shared_ranges;
> + struct range shared_ranges[KVM_NR_SHARED_RANGES];

Hard no for me.  IMO, anything that requires KVM to track shared/pinned pages in
a separate tree/array is non-starter.  More specific to TDX #MCs, KVM should not
be the canonical reference for the state of a page.
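[Editor's note: the lookup implied by the `shared_ranges` array — and why it makes KVM a second source of truth for page state — is easy to see in a user-space model. `kvm_sketch` and `gfn_is_shared()` are illustrative names; the sketch also treats `end` as exclusive, which is an assumption:]

```c
#include <stdbool.h>

#define KVM_NR_SHARED_RANGES 32

struct range { unsigned long start, end; };   /* [start, end) here */

/* User-space stand-in for the fields the patch adds to struct kvm. */
struct kvm_sketch {
    bool mem_protected;
    int nr_shared_ranges;
    struct range shared_ranges[KVM_NR_SHARED_RANGES];
};

/* A gfn is host-accessible if the VM is unprotected, or if the guest has
 * declared the gfn shared via hypercall.  This linear scan is the state
 * KVM would have to keep authoritative - the core of the objection above. */
static bool gfn_is_shared(const struct kvm_sketch *kvm, unsigned long gfn)
{
    if (!kvm->mem_protected)
        return true;
    for (int i = 0; i < kvm->nr_shared_ranges; i++) {
        const struct range *r = &kvm->shared_ranges[i];
        if (gfn >= r->start && gfn < r->end)
            return true;
    }
    return false;
}
```

Every consumer of this array (GUP, #MC handling, memory hotplug) would need to agree with KVM's copy of the state, which is why tracking it in the primary MMU's own structures is preferred.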

>  };
>  
>  #define kvm_err(fmt, ...) \

...

> @@ -1868,11 +1874,17 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>   flags |= FOLL_WRITE;
>   if (async)
>   flags |= FOLL_NOWAIT;
> + if (kvm->mem_protected)
> + flags |= FOLL_ALLOW_POISONED;

This is unsafe, only the flows that are mapping the PFN into the guest should
use ALLOW_POISONED, e.g. __kvm_map_gfn() should fail on a poisoned page.

>  
>   npages = get_user_pages_unlocked(addr, 1, &page, flags);
>   if (npages != 1)
>   return npages;
>  
> + if (IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) &&
> + kvm->mem_protected && !kvm_page_allowed(kvm, page))
> + return -EHWPOISON;
> +
>   /* map read fault as writable if possible */
>   if (unlikely(!write_fault) && writable) {
>   struct page *wpage;

...

> @@ -2338,19 +2350,93 @@ static int next_segment(unsigned long len, int offset)
>   return len;
>  }
>  
> -static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> -  void *data, int offset, int len)
> +int copy_from_guest(struct kvm *kvm, void *data, unsigned long hva, int len)
> +{
> + int offset = offset_in_page(hva);
> + struct page *page;
> + int npages, seg;
> + void *vaddr;
> +
> + if (!IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) ||
> + !kvm->mem_protected) {
> + return __copy_from_user(data, (void __user *)hva, len);
> + }
> +
> + might_fault();
> + kasan_check_write(data, len);
> + check_object_size(data, len, false);
> +
> + while ((seg = next_segment(len, offset)) != 0) {
> + npages = get_user_pages_unlocked(hva, 1, &page,
> +  FOLL_ALLOW_POISONED);
> + if (npages != 1)
> + return -EFAULT;
> +
> + if (!kvm_page_allowed(kvm, page))
> + return -EFAULT;
> +
> + vaddr = kmap_atomic(page);
> + memcpy(data, vaddr + offset, seg);
> + kunmap_atomic(vaddr);

Why is KVM allowed to access a poisoned page?  I would expect shared pages to
_not_ be poisoned.  Except for pure software emulation of SEV, KVM can't access
guest private memory.

> +
> + put_page(page);
> + len -= seg;
> + hva += seg;
> + data += seg;
> + offset = 0;
> + }
> +
> + return 0;
> +}
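[Editor's note: the per-page segment walk in `copy_from_guest()` above can be exercised stand-alone. The sketch below mirrors KVM's `next_segment()` and the loop's offset/len bookkeeping; `split_into_segments()` is an illustrative helper, not kernel code:]

```c
#define PAGE_SIZE 4096UL

/* Same logic as KVM's next_segment(): how many bytes of @len fall within
 * the current page, given the byte @offset into that page. */
static int next_segment(unsigned long len, int offset)
{
    if (len > PAGE_SIZE - offset)
        return (int)(PAGE_SIZE - offset);
    return (int)len;
}

/* Walk an hva range the way copy_from_guest() does, recording the size of
 * each per-page segment.  Returns the number of segments produced. */
static int split_into_segments(unsigned long hva, unsigned long len,
                               int *segs, int max_segs)
{
    int offset = (int)(hva & (PAGE_SIZE - 1));  /* offset_in_page(hva) */
    int n = 0, seg;

    while ((seg = next_segment(len, offset)) != 0 && n < max_segs) {
        segs[n++] = seg;
        len -= seg;
        hva += seg;
        offset = 0;     /* later segments start at a page boundary */
    }
    return n;
}
```

For example, a 5000-byte copy starting 3000 bytes into a page splits into a 1096-byte head segment and a 3904-byte second segment, each of which gets its own `get_user_pages_unlocked()` / `kmap_atomic()` round trip in the patch.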

...
  
> @@ -2693,6 +2775,41 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
>  
> +int kvm_protect_memory(struct kvm *kvm)
> +{
> + if (mmap_write_lock_killable(kvm->mm))
> + return -KVM_EINTR;
> +
> + kvm->mem_protected = true;
> + kvm_arch_flush_shadow_all(kvm);
> + mmap_write_unlock(kvm->mm);
> +
> + return 0;
> +}
> +
> +int kvm_share_memory(struct kvm *kvm, unsigned long gfn, unsigned long npages)
> +{
> + unsigned long end = gfn + npages;
> +
> + if (!npages || !IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY))