Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-07-25 Thread Gupta, Pankaj



I view it as a performance problem because nothing stops KVM from 
copying from
userspace into the private fd during the SEV ioctl().  What's 
missing is the
ability for userspace to directly initialze the private fd, which 
may or may not

avoid an extra memcpy() depending on how clever userspace is.

Can you please elaborate more what you see as a performance problem? And
possible ways to solve it?


Oh, I'm not saying there actually _is_ a performance problem.  What 
I'm saying is
that in-place encryption is not a functional requirement, which means 
it's purely
an optimization, and thus we should other bother supporting in-place 
encryption

_if_ it would solve a performane bottleneck.


Even if we end up having a performance problem, I think we need to 
understand the workloads that we want to optimize before getting too 
excited about designing a speedup.


In particular, there's (depending on the specific technology, perhaps, 
and also architecture) a possible tradeoff between trying to reduce 
copying and trying to reduce unmapping and the associated flushes.  If a 
user program maps an fd, populates it, and then converts it in place 
into private memory (especially if it doesn't do it in a single shot), 
then that memory needs to get unmapped both from the user mm and 
probably from the kernel direct map.  On the flip side, it's possible to 
imagine an ioctl that does copy-and-add-to-private-fd that uses a 
private mm and doesn't need any TLB IPIs.


All of this is to say that trying to optimize right now seems quite 
premature to me.


Agree to it. Thank you for explaining!

Thanks,
Pankaj






Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-07-22 Thread Andy Lutomirski

On 7/21/22 14:19, Sean Christopherson wrote:

On Thu, Jul 21, 2022, Gupta, Pankaj wrote:





I view it as a performance problem because nothing stops KVM from copying from
userspace into the private fd during the SEV ioctl().  What's missing is the
ability for userspace to directly initialze the private fd, which may or may not
avoid an extra memcpy() depending on how clever userspace is.

Can you please elaborate more what you see as a performance problem? And
possible ways to solve it?


Oh, I'm not saying there actually _is_ a performance problem.  What I'm saying 
is
that in-place encryption is not a functional requirement, which means it's 
purely
an optimization, and thus we should other bother supporting in-place encryption
_if_ it would solve a performane bottleneck.


Even if we end up having a performance problem, I think we need to 
understand the workloads that we want to optimize before getting too 
excited about designing a speedup.


In particular, there's (depending on the specific technology, perhaps, 
and also architecture) a possible tradeoff between trying to reduce 
copying and trying to reduce unmapping and the associated flushes.  If a 
user program maps an fd, populates it, and then converts it in place 
into private memory (especially if it doesn't do it in a single shot), 
then that memory needs to get unmapped both from the user mm and 
probably from the kernel direct map.  On the flip side, it's possible to 
imagine an ioctl that does copy-and-add-to-private-fd that uses a 
private mm and doesn't need any TLB IPIs.


All of this is to say that trying to optimize right now seems quite 
premature to me.




Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-07-21 Thread Gupta, Pankaj




  * The current patch should just work, but prefer to have pre-boot guest
payload/firmware population into private memory for performance.


Not just performance in the case of SEV, it's needed there because firmware
only supports in-place encryption of guest memory, there's no mechanism to
provide a separate buffer to load into guest memory at pre-boot time. I
think you're aware of this but wanted to point that out just in case.


I view it as a performance problem because nothing stops KVM from copying from
userspace into the private fd during the SEV ioctl().  What's missing is the
ability for userspace to directly initialze the private fd, which may or may not
avoid an extra memcpy() depending on how clever userspace is.

Can you please elaborate more what you see as a performance problem? And
possible ways to solve it?


Oh, I'm not saying there actually _is_ a performance problem.  What I'm saying 
is
that in-place encryption is not a functional requirement, which means it's 
purely
an optimization, and thus we should other bother supporting in-place encryption
_if_ it would solve a performane bottleneck.


Understood. Thank you!

Best regards,
Pankaj




Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-07-21 Thread Sean Christopherson
On Thu, Jul 21, 2022, Gupta, Pankaj wrote:
> 
> Hi Sean, Chao,
> 
> While attempting to solve the pre-boot guest payload/firmware population
> into private memory for SEV SNP, retrieved this thread. Have question below:
> 
> > > > Requirements & Gaps
> > > > -
> > > >- Confidential computing(CC): TDX/SEV/CCA
> > > >  * Need support both explicit/implicit conversions.
> > > >  * Need support only destructive conversion at runtime.
> > > >  * The current patch should just work, but prefer to have pre-boot 
> > > > guest
> > > >payload/firmware population into private memory for performance.
> > > 
> > > Not just performance in the case of SEV, it's needed there because 
> > > firmware
> > > only supports in-place encryption of guest memory, there's no mechanism to
> > > provide a separate buffer to load into guest memory at pre-boot time. I
> > > think you're aware of this but wanted to point that out just in case.
> > 
> > I view it as a performance problem because nothing stops KVM from copying 
> > from
> > userspace into the private fd during the SEV ioctl().  What's missing is the
> > ability for userspace to directly initialze the private fd, which may or 
> > may not
> > avoid an extra memcpy() depending on how clever userspace is.
> Can you please elaborate more what you see as a performance problem? And
> possible ways to solve it?

Oh, I'm not saying there actually _is_ a performance problem.  What I'm saying 
is
that in-place encryption is not a functional requirement, which means it's 
purely
an optimization, and thus we should other bother supporting in-place encryption
_if_ it would solve a performane bottleneck.



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-07-21 Thread Gupta, Pankaj



Hi Sean, Chao,

While attempting to solve the pre-boot guest payload/firmware population
into private memory for SEV SNP, retrieved this thread. Have question below:


Requirements & Gaps
-
   - Confidential computing(CC): TDX/SEV/CCA
 * Need support both explicit/implicit conversions.
 * Need support only destructive conversion at runtime.
 * The current patch should just work, but prefer to have pre-boot guest
   payload/firmware population into private memory for performance.


Not just performance in the case of SEV, it's needed there because firmware
only supports in-place encryption of guest memory, there's no mechanism to
provide a separate buffer to load into guest memory at pre-boot time. I
think you're aware of this but wanted to point that out just in case.


I view it as a performance problem because nothing stops KVM from copying from
userspace into the private fd during the SEV ioctl().  What's missing is the
ability for userspace to directly initialze the private fd, which may or may not
avoid an extra memcpy() depending on how clever userspace is.
Can you please elaborate more what you see as a performance problem? And 
possible ways to solve it?


Thanks,
Pankaj



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-06-10 Thread Sean Christopherson
On Fri, Jun 10, 2022, Andy Lutomirski wrote:
> On Mon, Apr 25, 2022 at 1:31 PM Sean Christopherson  wrote:
> >
> > On Mon, Apr 25, 2022, Andy Lutomirski wrote:
> > >
> > >
> > > On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
> > > > On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
> > > >>
> > >
> > > >>
> > > >> 2. Bind the memfile to a VM (or at least to a VM technology).  Now 
> > > >> it's in
> > > >> the initial state appropriate for that VM.
> > > >>
> > > >> For TDX, this completely bypasses the cases where the data is 
> > > >> prepopulated
> > > >> and TDX can't handle it cleanly.
> >
> > I believe TDX can handle this cleanly, TDH.MEM.PAGE.ADD doesn't require 
> > that the
> > source and destination have different HPAs.  There's just no pressing need 
> > to
> > support such behavior because userspace is highly motivated to keep the 
> > initial
> > image small for performance reasons, i.e. burning a few extra pages while 
> > building
> > the guest is a non-issue.
> 
> Following up on this, rather belatedly.  After re-reading the docs,
> TDX can populate guest memory using TDH.MEM.PAGE.ADD, but see Intel®
> TDX Module Base Spec v1.5, section 2.3, step D.4 substeps 1 and 2
> here:
> 
> https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1.5-base-spec-348549001.pdf
> 
> For each TD page:
> 
> 1. The host VMM specifies a TDR as a parameter and calls the
> TDH.MEM.PAGE.ADD function. It copies the contents from the TD
> image page into the target TD page which is encrypted with the TD
> ephemeral key. TDH.MEM.PAGE.ADD also extends the TD
> measurement with the page GPA.
> 
> 2. The host VMM extends the TD measurement with the contents of
> the new page by calling the TDH.MR.EXTEND function on each 256-
> byte chunk of the new TD page.
> 
> So this is a bit like SGX.  There is a specific series of operations
> that have to be done in precisely the right order to reproduce the
> intended TD measurement.  Otherwise the guest will boot and run until
> it tries to get a report and then it will have a hard time getting
> anyone to believe its report.
> 
> So I don't think the host kernel can get away with host userspace just
> providing pre-populated memory.  Userspace needs to tell the host
> kernel exactly what sequence of adds, extends, etc to perform and in
> what order, and the host kernel needs to do precisely what userspace
> asks it to do.  "Here's the contents of memory" doesn't cut it unless
> the tooling that builds the guest image matches the exact semantics
> that the host kernel provides.

For TDX, yes, a KVM ioctl() is mandatory for all intents and purposes since 
adding
non-zero memory into the guest requires a SEAMCALL.  My "idea", which I'm not 
sure
would actually work, is more than a bit contrived, and which I don't think is 
remotely
critical to support, is to let userspace fill the guest private memory directly
and then use the private page for both the source and the target to 
TDH.MEM.PAGE.ADD.

That would avoid having to double allocate memory for the initial guest image.  
But
like I said, contrived and low priority.



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-06-10 Thread Andy Lutomirski
On Mon, Apr 25, 2022 at 1:31 PM Sean Christopherson  wrote:
>
> On Mon, Apr 25, 2022, Andy Lutomirski wrote:
> >
> >
> > On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
> > > On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
> > >>
> >
> > >>
> > >> 2. Bind the memfile to a VM (or at least to a VM technology).  Now it's 
> > >> in
> > >> the initial state appropriate for that VM.
> > >>
> > >> For TDX, this completely bypasses the cases where the data is 
> > >> prepopulated
> > >> and TDX can't handle it cleanly.
>
> I believe TDX can handle this cleanly, TDH.MEM.PAGE.ADD doesn't require that 
> the
> source and destination have different HPAs.  There's just no pressing need to
> support such behavior because userspace is highly motivated to keep the 
> initial
> image small for performance reasons, i.e. burning a few extra pages while 
> building
> the guest is a non-issue.

Following up on this, rather belatedly.  After re-reading the docs,
TDX can populate guest memory using TDH.MEM.PAGE.ADD, but see Intel®
TDX Module Base Spec v1.5, section 2.3, step D.4 substeps 1 and 2
here:

https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1.5-base-spec-348549001.pdf

For each TD page:

1. The host VMM specifies a TDR as a parameter and calls the
TDH.MEM.PAGE.ADD function. It copies the contents from the TD
image page into the target TD page which is encrypted with the TD
ephemeral key. TDH.MEM.PAGE.ADD also extends the TD
measurement with the page GPA.

2. The host VMM extends the TD measurement with the contents of
the new page by calling the TDH.MR.EXTEND function on each 256-
byte chunk of the new TD page.

So this is a bit like SGX.  There is a specific series of operations
that have to be done in precisely the right order to reproduce the
intended TD measurement.  Otherwise the guest will boot and run until
it tries to get a report and then it will have a hard time getting
anyone to believe its report.

So I don't think the host kernel can get away with host userspace just
providing pre-populated memory.  Userspace needs to tell the host
kernel exactly what sequence of adds, extends, etc to perform and in
what order, and the host kernel needs to do precisely what userspace
asks it to do.  "Here's the contents of memory" doesn't cut it unless
the tooling that builds the guest image matches the exact semantics
that the host kernel provides.

--Andy



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-05-09 Thread Sean Christopherson
On Mon, May 09, 2022, Michael Roth wrote:
> On Fri, Apr 22, 2022 at 06:56:12PM +0800, Chao Peng wrote:
> > Requirements & Gaps
> > -
> >   - Confidential computing(CC): TDX/SEV/CCA
> > * Need support both explicit/implicit conversions.
> > * Need support only destructive conversion at runtime.
> > * The current patch should just work, but prefer to have pre-boot guest
> >   payload/firmware population into private memory for performance.
> 
> Not just performance in the case of SEV, it's needed there because firmware
> only supports in-place encryption of guest memory, there's no mechanism to
> provide a separate buffer to load into guest memory at pre-boot time. I
> think you're aware of this but wanted to point that out just in case.

I view it as a performance problem because nothing stops KVM from copying from
userspace into the private fd during the SEV ioctl().  What's missing is the
ability for userspace to directly initialze the private fd, which may or may not
avoid an extra memcpy() depending on how clever userspace is.

> 
> > 
> >   - pKVM
> > * Support explicit conversion only. Hard to achieve implicit conversion,
> >   does not record the guest access info (private/shared) in page fault,
> >   also makes little sense.
> > * Expect to support non-destructive conversion at runtime. Additionally
> >   in-place conversion (the underlying physical page is unchanged) is
> >   desirable since copy is not disirable. The current destructive 
> > conversion
> >   does not fit well.
> > * The current callbacks between mm/KVM is useful and reusable for pKVM.
> > * Pre-boot guest payload population is nice to have.
> > 
> > 
> > Change Proposal
> > ---
> > Since there are some divergences for pKVM from CC usages and at this time 
> > looks
> > whether we will and how we will support pKVM with this private memory 
> > patchset
> > is still not quite clear, so this proposal does not imply certain detailed 
> > pKVM
> > implementation. But from the API level, we want this can be possible to be 
> > future
> > extended for pKVM or other potential usages.
> > 
> >   - No new user APIs introduced for memory backing store, e.g. remove the
> > current MFD_INACCESSIBLE. This info will be communicated from 
> > memfile_notifier
> > consumers to backing store via the new 'flag' field in memfile_notifier
> > described below. At creation time, the fd is normal shared fd. At 
> > rumtime CC
> > usages will keep using current fallocate/FALLOC_FL_PUNCH_HOLE to do the
> > conversion, but pKVM may also possible use a different way (e.g. rely on
> > mmap/munmap or mprotect as discussed). These are all not new APIs 
> > anyway.
> 
> For SNP most of the explicit conversions are via GHCB page-state change
> requests. Each of these PSC requests can request shared/private
> conversions for up to 252 individual pages, along with whether or not
> they should be treated as 4K or 2M pages. Currently, like with
> KVM_EXIT_MEMORY_ERROR, these requests get handled in userspace and call
> back into the kernel via fallocate/PUNCH_HOLE calls.
> 
> For each fallocate(), we need to update the RMP table to mark a page as
> private, and for PUNCH_HOLE we need to mark it as shared (otherwise it
> would be freed back to the host as guest-owned/private and cause a crash if
> the host tries to re-use it for something). I needed to add some callbacks
> to the memfile_notifier to handle these RMP table updates. There might be
> some other bits of book-keeping like clflush's, and adding/removing guest
> pages from the kernel's direct map.
> 
> Not currently implemented, but the guest can also issue requests to
> "smash"/"unsmash" a 2M private range into individual 4K private ranges
> (generally in advance of flipping one of the pages to shared, or
> vice-versa) in the RMP table. Hypervisor code tries to handle this
> automatically, by determining when to smash/unsmash on it's own, but...
> 
> I'm wondering how all these things can be properly conveyed through this
> fallocate/PUNCH_HOLE interface if we ever needed to add support for all
> of this, as it seems a bit restrictive as-is. For instance, with the
> current approach, one possible scheme is:
> 
>   - explicit conversion of shared->private for 252 4K pages:
> - we could do 252 individual fallocate()'s of 4K each, and make sure the
>   kernel code will do notifier callbacks / RMP updates for each individual
>   4K page
> 
>   - shared->private for 252 2M pages:
> - we could do 252 individual fallocate()'s of 2M each, and make sure the
>   kernel code will do notifier callbacks / RMP updates for each individual
>   2M page
> 
> But for SNP most of these bulk PSC changes are when the guest switches
> *all* of it's pages from shared->private during early boot when it
> validates all of it's memory. So these pages tend to be contiguous
> ranges, and a 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-05-09 Thread Michael Roth
On Fri, Apr 22, 2022 at 06:56:12PM +0800, Chao Peng wrote:
> Great thanks for the discussions. I summarized the requirements/gaps and the
> potential changes for next step. Please help to review.

Hi Chao,

Thanks for writing this up. I've been meaning to respond, but wanted to
make a bit more progress with SNP+UPM prototype to get a better idea of
what's needed on that end. I've needed to make some changes on the KVM
and QEMU side to get things working so hopefully with your proposed
rework those changes can be dropped.

> 
> 
> Terminologies:
> --
>   - memory conversion: the action of converting guest memory between private
> and shared.
>   - explicit conversion: an enlightened guest uses a hypercall to explicitly
> request a memory conversion to VMM.
>   - implicit conversion: the conversion when VMM reacts to a page fault due
> to different guest/host memory attributes (private/shared).
>   - destructive conversion: the memory content is lost/destroyed during
> conversion.
>   - non-destructive conversion: the memory content is preserved during
> conversion.
> 
> 
> Requirements & Gaps
> -
>   - Confidential computing(CC): TDX/SEV/CCA
> * Need support both explicit/implicit conversions.
> * Need support only destructive conversion at runtime.
> * The current patch should just work, but prefer to have pre-boot guest
>   payload/firmware population into private memory for performance.

Not just performance in the case of SEV, it's needed there because firmware
only supports in-place encryption of guest memory, there's no mechanism to
provide a separate buffer to load into guest memory at pre-boot time. I
think you're aware of this but wanted to point that out just in case.

> 
>   - pKVM
> * Support explicit conversion only. Hard to achieve implicit conversion,
>   does not record the guest access info (private/shared) in page fault,
>   also makes little sense.
> * Expect to support non-destructive conversion at runtime. Additionally
>   in-place conversion (the underlying physical page is unchanged) is
>   desirable since copy is not disirable. The current destructive 
> conversion
>   does not fit well.
> * The current callbacks between mm/KVM is useful and reusable for pKVM.
> * Pre-boot guest payload population is nice to have.
> 
> 
> Change Proposal
> ---
> Since there are some divergences for pKVM from CC usages and at this time 
> looks
> whether we will and how we will support pKVM with this private memory patchset
> is still not quite clear, so this proposal does not imply certain detailed 
> pKVM
> implementation. But from the API level, we want this can be possible to be 
> future
> extended for pKVM or other potential usages.
> 
>   - No new user APIs introduced for memory backing store, e.g. remove the
> current MFD_INACCESSIBLE. This info will be communicated from 
> memfile_notifier
> consumers to backing store via the new 'flag' field in memfile_notifier
> described below. At creation time, the fd is normal shared fd. At rumtime 
> CC
> usages will keep using current fallocate/FALLOC_FL_PUNCH_HOLE to do the
> conversion, but pKVM may also possible use a different way (e.g. rely on
> mmap/munmap or mprotect as discussed). These are all not new APIs anyway.

For SNP most of the explicit conversions are via GHCB page-state change
requests. Each of these PSC requests can request shared/private
conversions for up to 252 individual pages, along with whether or not
they should be treated as 4K or 2M pages. Currently, like with
KVM_EXIT_MEMORY_ERROR, these requests get handled in userspace and call
back into the kernel via fallocate/PUNCH_HOLE calls.

For each fallocate(), we need to update the RMP table to mark a page as
private, and for PUNCH_HOLE we need to mark it as shared (otherwise it
would be freed back to the host as guest-owned/private and cause a crash if
the host tries to re-use it for something). I needed to add some callbacks
to the memfile_notifier to handle these RMP table updates. There might be
some other bits of book-keeping like clflush's, and adding/removing guest
pages from the kernel's direct map.

Not currently implemented, but the guest can also issue requests to
"smash"/"unsmash" a 2M private range into individual 4K private ranges
(generally in advance of flipping one of the pages to shared, or
vice-versa) in the RMP table. Hypervisor code tries to handle this
automatically, by determining when to smash/unsmash on it's own, but...

I'm wondering how all these things can be properly conveyed through this
fallocate/PUNCH_HOLE interface if we ever needed to add support for all
of this, as it seems a bit restrictive as-is. For instance, with the
current approach, one possible scheme is:

  - explicit conversion of shared->private for 252 4K pages:
- we could do 252 individual fallocate()'s of 4K each, and make 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-05-03 Thread Quentin Perret
On Thursday 28 Apr 2022 at 20:29:52 (+0800), Chao Peng wrote:
> 
> + Michael in case he has comment from SEV side.
> 
> On Mon, Apr 25, 2022 at 07:52:38AM -0700, Andy Lutomirski wrote:
> > 
> > 
> > On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
> > > On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
> > >> 
> > 
> > >> 
> > >> 2. Bind the memfile to a VM (or at least to a VM technology).  Now it's 
> > >> in the initial state appropriate for that VM.
> > >> 
> > >> For TDX, this completely bypasses the cases where the data is 
> > >> prepopulated and TDX can't handle it cleanly.  For SEV, it bypasses a 
> > >> situation in which data might be written to the memory before we find 
> > >> out whether that data will be unreclaimable or unmovable.
> > >
> > > This sounds a more strict rule to avoid semantics unclear.
> > >
> > > So userspace needs to know what excatly happens for a 'bind' operation.
> > > This is different when binds to different technologies. E.g. for SEV, it
> > > may imply after this call, the memfile can be accessed (through mmap or
> > > what ever) from userspace, while for current TDX this should be not 
> > > allowed.
> > 
> > I think this is actually a good thing.  While SEV, TDX, pKVM, etc achieve 
> > similar goals and have broadly similar ways of achieving them, they really 
> > are different, and having userspace be aware of the differences seems okay 
> > to me.
> > 
> > (Although I don't think that allowing userspace to mmap SEV shared pages is 
> > particularly wise -- it will result in faults or cache incoherence 
> > depending on the variant of SEV in use.)
> > 
> > >
> > > And I feel we still need a third flow/operation to indicate the
> > > completion of the initialization on the memfile before the guest's 
> > > first-time launch. SEV needs to check previous mmap-ed areas are munmap-ed
> > > and prevent future userspace access. After this point, then the memfile
> > > becomes truely private fd.
> > 
> > Even that is technology-dependent.  For TDX, this operation doesn't really 
> > exist.  For SEV, I'm not sure (I haven't read the specs in nearly enough 
> > detail).  For pKVM, I guess it does exist and isn't quite the same as a 
> > shared->private conversion.
> > 
> > Maybe this could be generalized a bit as an operation "measure and make 
> > private" that would be supported by the technologies for which it's useful.
> 
> Then I think we need callback instead of static flag field. Backing
> store implements this callback and consumers change the flags
> dynamically with this callback. This implements kind of state machine
> flow.
> 
> > 
> > 
> > >
> > >> 
> > >> 
> > >> --
> > >> 
> > >> Now I have a question, since I don't think anyone has really answered 
> > >> it: how does this all work with SEV- or pKVM-like technologies in which 
> > >> private and shared pages share the same address space?  I sounds like 
> > >> you're proposing to have a big memfile that contains private and shared 
> > >> pages and to use that same memfile as pages are converted back and 
> > >> forth.  IO and even real physical DMA could be done on that memfile.  Am 
> > >> I understanding correctly?
> > >
> > > For TDX case, and probably SEV as well, this memfile contains private 
> > > memory
> > > only. But this design at least makes it possible for usage cases like
> > > pKVM which wants both private/shared memory in the same memfile and rely
> > > on other ways like mmap/munmap or mprotect to toggle private/shared 
> > > instead
> > > of fallocate/hole punching.
> > 
> > Hmm.  Then we still need some way to get KVM to generate the correct SEV 
> > pagetables.  For TDX, there are private memslots and shared memslots, and 
> > they can overlap.  If they overlap and both contain valid pages at the same 
> > address, then the results may not be what the guest-side ABI expects, but 
> > everything will work.  So, when a single logical guest page transitions 
> > between shared and private, no change to the memslots is needed.  For SEV, 
> > this is not the case: everything is in one set of pagetables, and there 
> > isn't a natural way to resolve overlaps.
> 
> I don't see SEV has problem. Note for all the cases, both private/shared
> memory are in the same memslot. For a given GPA, if there is no private
> page, then shared page will be used to establish KVM pagetables, so this
> can guarantee there is no overlaps.
> 
> > 
> > If the memslot code becomes efficient enough, then the memslots could be 
> > fragmented.  Or the memfile could support private and shared data in the 
> > same memslot.  And if pKVM does this, I don't see why SEV couldn't also do 
> > it and hopefully reuse the same code.
> 
> For pKVM, that might be the case. For SEV, I don't think we require
> private/shared data in the same memfile. The same model that works for
> TDX should also work for SEV. Or maybe I misunderstood something here?
> 
> > 
> > >
> > 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-28 Thread Chao Peng


+ Michael in case he has comment from SEV side.

On Mon, Apr 25, 2022 at 07:52:38AM -0700, Andy Lutomirski wrote:
> 
> 
> On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
> > On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
> >> 
> 
> >> 
> >> 2. Bind the memfile to a VM (or at least to a VM technology).  Now it's in 
> >> the initial state appropriate for that VM.
> >> 
> >> For TDX, this completely bypasses the cases where the data is prepopulated 
> >> and TDX can't handle it cleanly.  For SEV, it bypasses a situation in 
> >> which data might be written to the memory before we find out whether that 
> >> data will be unreclaimable or unmovable.
> >
> > This sounds a more strict rule to avoid semantics unclear.
> >
> > So userspace needs to know what excatly happens for a 'bind' operation.
> > This is different when binds to different technologies. E.g. for SEV, it
> > may imply after this call, the memfile can be accessed (through mmap or
> > what ever) from userspace, while for current TDX this should be not allowed.
> 
> I think this is actually a good thing.  While SEV, TDX, pKVM, etc achieve 
> similar goals and have broadly similar ways of achieving them, they really 
> are different, and having userspace be aware of the differences seems okay to 
> me.
> 
> (Although I don't think that allowing userspace to mmap SEV shared pages is 
> particularly wise -- it will result in faults or cache incoherence depending 
> on the variant of SEV in use.)
> 
> >
> > And I feel we still need a third flow/operation to indicate the
> > completion of the initialization on the memfile before the guest's 
> > first-time launch. SEV needs to check previous mmap-ed areas are munmap-ed
> > and prevent future userspace access. After this point, then the memfile
> > becomes truely private fd.
> 
> Even that is technology-dependent.  For TDX, this operation doesn't really 
> exist.  For SEV, I'm not sure (I haven't read the specs in nearly enough 
> detail).  For pKVM, I guess it does exist and isn't quite the same as a 
> shared->private conversion.
> 
> Maybe this could be generalized a bit as an operation "measure and make 
> private" that would be supported by the technologies for which it's useful.

Then I think we need callback instead of static flag field. Backing
store implements this callback and consumers change the flags
dynamically with this callback. This implements kind of state machine
flow.

> 
> 
> >
> >> 
> >> 
> >> --
> >> 
> >> Now I have a question, since I don't think anyone has really answered it: 
> >> how does this all work with SEV- or pKVM-like technologies in which 
> >> private and shared pages share the same address space?  I sounds like 
> >> you're proposing to have a big memfile that contains private and shared 
> >> pages and to use that same memfile as pages are converted back and forth.  
> >> IO and even real physical DMA could be done on that memfile.  Am I 
> >> understanding correctly?
> >
> > For TDX case, and probably SEV as well, this memfile contains private memory
> > only. But this design at least makes it possible for usage cases like
> > pKVM which wants both private/shared memory in the same memfile and rely
> > on other ways like mmap/munmap or mprotect to toggle private/shared instead
> > of fallocate/hole punching.
> 
> Hmm.  Then we still need some way to get KVM to generate the correct SEV 
> pagetables.  For TDX, there are private memslots and shared memslots, and 
> they can overlap.  If they overlap and both contain valid pages at the same 
> address, then the results may not be what the guest-side ABI expects, but 
> everything will work.  So, when a single logical guest page transitions 
> between shared and private, no change to the memslots is needed.  For SEV, 
> this is not the case: everything is in one set of pagetables, and there isn't 
> a natural way to resolve overlaps.

I don't see SEV has problem. Note for all the cases, both private/shared
memory are in the same memslot. For a given GPA, if there is no private
page, then shared page will be used to establish KVM pagetables, so this
can guarantee there is no overlaps.

> 
> If the memslot code becomes efficient enough, then the memslots could be 
> fragmented.  Or the memfile could support private and shared data in the same 
> memslot.  And if pKVM does this, I don't see why SEV couldn't also do it and 
> hopefully reuse the same code.

For pKVM, that might be the case. For SEV, I don't think we require
private/shared data in the same memfile. The same model that works for
TDX should also work for SEV. Or maybe I misunderstood something here?

> 
> >
> >> 
> >> If so, I think this makes sense, but I'm wondering if the actual memslot 
> >> setup should be different.  For TDX, private memory lives in a logically 
> >> separate memslot space.  For SEV and pKVM, it doesn't.  I assume the API 
> >> can reflect this straightforwardly.
> >
> 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-25 Thread Sean Christopherson
On Mon, Apr 25, 2022, Andy Lutomirski wrote:
> 
> 
> On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
> > On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
> >> 
> 
> >> 
> >> 2. Bind the memfile to a VM (or at least to a VM technology).  Now it's in
> >> the initial state appropriate for that VM.
> >> 
> >> For TDX, this completely bypasses the cases where the data is prepopulated
> >> and TDX can't handle it cleanly.

I believe TDX can handle this cleanly, TDH.MEM.PAGE.ADD doesn't require that the
source and destination have different HPAs.  There's just no pressing need to
support such behavior because userspace is highly motivated to keep the initial
image small for performance reasons, i.e. burning a few extra pages while 
building
the guest is a non-issue.

> >> For SEV, it bypasses a situation in which data might be written to the
> >> memory before we find out whether that data will be unreclaimable or
> >> unmovable.
> >
> > This sounds a more strict rule to avoid semantics unclear.
> >
> > So userspace needs to know what excatly happens for a 'bind' operation.
> > This is different when binds to different technologies. E.g. for SEV, it
> > may imply after this call, the memfile can be accessed (through mmap or
> > what ever) from userspace, while for current TDX this should be not allowed.
> 
> I think this is actually a good thing.  While SEV, TDX, pKVM, etc achieve
> similar goals and have broadly similar ways of achieving them, they really
> are different, and having userspace be aware of the differences seems okay to
> me.

I agree, _if_ the properties of the memory are enumerated in a 
technology-agnostic
way.  The underlying mechanisms are different, but conceptually the set of sane
operations that userspace can perform/initiate are the same.  E.g. TDX and SNP 
can
support swap, they just don't because no one has requested Intel/AMD to provide
that support (no use cases for oversubscribing confidential VMs).  SNP does 
support
page migration, and TDX can add that support without too much fuss.

SEV "allows" the host to access guest private memory, but that doesn't mean it
should be deliberately supported by the kernel.  It's a bit of a moot point for
SEV/SEV-ES, as the host doesn't get any kind of notification that the guest has
"converted" a page, but the kernel shouldn't allow userspace to map memory that
is _known_ to be private.

> (Although I don't think that allowing userspace to mmap SEV shared pages is

s/shared/private?

> particularly wise -- it will result in faults or cache incoherence depending
> on the variant of SEV in use.)
>
> > And I feel we still need a third flow/operation to indicate the
> > completion of the initialization on the memfile before the guest's 
> > first-time launch. SEV needs to check previous mmap-ed areas are munmap-ed
> > and prevent future userspace access. After this point, then the memfile
> > becomes truely private fd.
> 
> Even that is technology-dependent.  For TDX, this operation doesn't really
> exist.

As above, I believe this is TDH.MEM.PAGE.ADD.

> For SEV, I'm not sure (I haven't read the specs in nearly enough detail).

QEMU+KVM does in-place conversion for SEV/SEV-ES via SNP_LAUNCH_UPDATE, AFAICT
that's still allowed for SNP.

> For pKVM, I guess it does exist and isn't quite the same as a
> shared->private conversion.
> 
> Maybe this could be generalized a bit as an operation "measure and make
> private" that would be supported by the technologies for which it's useful.
> 
> 
> >> Now I have a question, since I don't think anyone has really answered it:
> >> how does this all work with SEV- or pKVM-like technologies in which
> >> private and shared pages share the same address space?

The current proposal is to have both a private fd and a shared hva for memslot
that can be mapped private.  A GPA is considered private by KVM if the memslot
has a private fd and that corresponding page in the private fd is populated.  
KVM
will always and only map the current flavor of shared/private based on that
definition.  If userspace punches a hole in the private fd, KVM will unmap any
relevant private GPAs.  If userspace populates a range in the private fd, KVM 
will
unmap any relevant shared GPAs.

> >> I sounds like you're proposing to have a big memfile that contains private
> >> and shared pages and to use that same memfile as pages are converted back
> >> and forth.  IO and even real physical DMA could be done on that memfile.
> >> Am I understanding correctly?
> >
> > For TDX case, and probably SEV as well, this memfile contains private memory
> > only. But this design at least makes it possible for usage cases like
> > pKVM which wants both private/shared memory in the same memfile and rely
> > on other ways like mmap/munmap or mprotect to toggle private/shared instead
> > of fallocate/hole punching.
> 
> Hmm.  Then we still need some way to get KVM to generate the correct SEV
> pagetables.  For TDX, there are private 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-25 Thread Andy Lutomirski



On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
> On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
>> 

>> 
>> 2. Bind the memfile to a VM (or at least to a VM technology).  Now it's in 
>> the initial state appropriate for that VM.
>> 
>> For TDX, this completely bypasses the cases where the data is prepopulated 
>> and TDX can't handle it cleanly.  For SEV, it bypasses a situation in which 
>> data might be written to the memory before we find out whether that data 
>> will be unreclaimable or unmovable.
>
> This sounds a more strict rule to avoid semantics unclear.
>
> So userspace needs to know what excatly happens for a 'bind' operation.
> This is different when binds to different technologies. E.g. for SEV, it
> may imply after this call, the memfile can be accessed (through mmap or
> what ever) from userspace, while for current TDX this should be not allowed.

I think this is actually a good thing.  While SEV, TDX, pKVM, etc achieve 
similar goals and have broadly similar ways of achieving them, they really are 
different, and having userspace be aware of the differences seems okay to me.

(Although I don't think that allowing userspace to mmap SEV shared pages is 
particularly wise -- it will result in faults or cache incoherence depending on 
the variant of SEV in use.)

>
> And I feel we still need a third flow/operation to indicate the
> completion of the initialization on the memfile before the guest's 
> first-time launch. SEV needs to check previous mmap-ed areas are munmap-ed
> and prevent future userspace access. After this point, then the memfile
> becomes truely private fd.

Even that is technology-dependent.  For TDX, this operation doesn't really 
exist.  For SEV, I'm not sure (I haven't read the specs in nearly enough 
detail).  For pKVM, I guess it does exist and isn't quite the same as a 
shared->private conversion.

Maybe this could be generalized a bit as an operation "measure and make 
private" that would be supported by the technologies for which it's useful.


>
>> 
>> 
>> --
>> 
>> Now I have a question, since I don't think anyone has really answered it: 
>> how does this all work with SEV- or pKVM-like technologies in which private 
>> and shared pages share the same address space?  I sounds like you're 
>> proposing to have a big memfile that contains private and shared pages and 
>> to use that same memfile as pages are converted back and forth.  IO and even 
>> real physical DMA could be done on that memfile.  Am I understanding 
>> correctly?
>
> For TDX case, and probably SEV as well, this memfile contains private memory
> only. But this design at least makes it possible for usage cases like
> pKVM which wants both private/shared memory in the same memfile and rely
> on other ways like mmap/munmap or mprotect to toggle private/shared instead
> of fallocate/hole punching.

Hmm.  Then we still need some way to get KVM to generate the correct SEV 
pagetables.  For TDX, there are private memslots and shared memslots, and they 
can overlap.  If they overlap and both contain valid pages at the same address, 
then the results may not be what the guest-side ABI expects, but everything 
will work.  So, when a single logical guest page transitions between shared and 
private, no change to the memslots is needed.  For SEV, this is not the case: 
everything is in one set of pagetables, and there isn't a natural way to 
resolve overlaps.

If the memslot code becomes efficient enough, then the memslots could be 
fragmented.  Or the memfile could support private and shared data in the same 
memslot.  And if pKVM does this, I don't see why SEV couldn't also do it and 
hopefully reuse the same code.

>
>> 
>> If so, I think this makes sense, but I'm wondering if the actual memslot 
>> setup should be different.  For TDX, private memory lives in a logically 
>> separate memslot space.  For SEV and pKVM, it doesn't.  I assume the API can 
>> reflect this straightforwardly.
>
> I believe so. The flow should be similar but we do need pass different
> flags during the 'bind' to the backing store for different usages. That
> should be some new flags for pKVM but the callbacks (API here) between
> memfile_notifile and its consumers can be reused.

And also some different flag in the operation that installs the fd as a memslot?

>
>> 
>> And the corresponding TDX question: is the intent still that shared pages 
>> aren't allowed at all in a TDX memfile?  If so, that would be the most 
>> direct mapping to what the hardware actually does.
>
> Exactly. TDX will still use fallocate/hole punching to turn on/off the
> private page. Once off, the traditional shared page will become
> effective in KVM.

Works for me.

For what it's worth, I still think it should be fine to land all the TDX 
memfile bits upstream as long as we're confident that SEV, pKVM, etc can be 
added on without issues.

I think we can increase confidence in this by 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-25 Thread Chao Peng
On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
> 
> 
> On Fri, Apr 22, 2022, at 3:56 AM, Chao Peng wrote:
> > On Tue, Apr 05, 2022 at 06:03:21PM +, Sean Christopherson wrote:
> >> On Tue, Apr 05, 2022, Quentin Perret wrote:
> >> > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> > Only when the register succeeds, the fd is
> > converted into a private fd, before that, the fd is just a normal 
> > (shared)
> > one. During this conversion, the previous data is preserved so you can 
> > put
> > some initial data in guest pages (whether the architecture allows this 
> > is
> > architecture-specific and out of the scope of this patch).
> 
> I think this can be made to work, but it will be awkward.  On TDX, for 
> example, what exactly are the semantics supposed to be?  An error code if the 
> memory isn't all zero?  An error code if it has ever been written?
> 
> Fundamentally, I think this is because your proposed lifecycle for these 
> memfiles results in a lightweight API but is awkward for the intended use 
> cases.  You're proposing, roughly:
> 
> 1. Create a memfile. 
> 
> Now it's in a shared state with an unknown virt technology.  It can be read 
> and written.  Let's call this state BRAND_NEW.
> 
> 2. Bind to a VM.
> 
> Now it's an a bound state.  For TDX, for example, let's call the new state 
> BOUND_TDX.  In this state, the TDX rules are followed (private memory can't 
> be converted, etc).
> 
> The problem here is that the BOUND_NEW state allows things that are 
> nonsensical in TDX, and the binding step needs to invent some kind of 
> semantics for what happens when binding a nonempty memfile.
> 
> 
> So I would propose a somewhat different order:
> 
> 1. Create a memfile.  It's in the UNBOUND state and no operations whatsoever 
> are allowed except binding or closing.

OK, so we need invent new user API to indicate UNBOUND state. For memfd
based, it can be a new feature-neutral flag at creation time.

> 
> 2. Bind the memfile to a VM (or at least to a VM technology).  Now it's in 
> the initial state appropriate for that VM.
> 
> For TDX, this completely bypasses the cases where the data is prepopulated 
> and TDX can't handle it cleanly.  For SEV, it bypasses a situation in which 
> data might be written to the memory before we find out whether that data will 
> be unreclaimable or unmovable.

This sounds a more strict rule to avoid semantics unclear.

So userspace needs to know what excatly happens for a 'bind' operation.
This is different when binds to different technologies. E.g. for SEV, it
may imply after this call, the memfile can be accessed (through mmap or
what ever) from userspace, while for current TDX this should be not allowed.

And I feel we still need a third flow/operation to indicate the
completion of the initialization on the memfile before the guest's 
first-time launch. SEV needs to check previous mmap-ed areas are munmap-ed
and prevent future userspace access. After this point, then the memfile
becomes truely private fd.

> 
> 
> --
> 
> Now I have a question, since I don't think anyone has really answered it: how 
> does this all work with SEV- or pKVM-like technologies in which private and 
> shared pages share the same address space?  I sounds like you're proposing to 
> have a big memfile that contains private and shared pages and to use that 
> same memfile as pages are converted back and forth.  IO and even real 
> physical DMA could be done on that memfile.  Am I understanding correctly?

For TDX case, and probably SEV as well, this memfile contains private memory
only. But this design at least makes it possible for usage cases like
pKVM which wants both private/shared memory in the same memfile and rely
on other ways like mmap/munmap or mprotect to toggle private/shared instead
of fallocate/hole punching.

> 
> If so, I think this makes sense, but I'm wondering if the actual memslot 
> setup should be different.  For TDX, private memory lives in a logically 
> separate memslot space.  For SEV and pKVM, it doesn't.  I assume the API can 
> reflect this straightforwardly.

I believe so. The flow should be similar but we do need pass different
flags during the 'bind' to the backing store for different usages. That
should be some new flags for pKVM but the callbacks (API here) between
memfile_notifile and its consumers can be reused.

> 
> And the corresponding TDX question: is the intent still that shared pages 
> aren't allowed at all in a TDX memfile?  If so, that would be the most direct 
> mapping to what the hardware actually does.

Exactly. TDX will still use fallocate/hole punching to turn on/off the
private page. Once off, the traditional shared page will become
effective in KVM.

Chao
> 
> --Andy



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-24 Thread Andy Lutomirski



On Fri, Apr 22, 2022, at 3:56 AM, Chao Peng wrote:
> On Tue, Apr 05, 2022 at 06:03:21PM +, Sean Christopherson wrote:
>> On Tue, Apr 05, 2022, Quentin Perret wrote:
>> > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> Only when the register succeeds, the fd is
> converted into a private fd, before that, the fd is just a normal (shared)
> one. During this conversion, the previous data is preserved so you can put
> some initial data in guest pages (whether the architecture allows this is
> architecture-specific and out of the scope of this patch).

I think this can be made to work, but it will be awkward.  On TDX, for example, 
what exactly are the semantics supposed to be?  An error code if the memory 
isn't all zero?  An error code if it has ever been written?

Fundamentally, I think this is because your proposed lifecycle for these 
memfiles results in a lightweight API but is awkward for the intended use 
cases.  You're proposing, roughly:

1. Create a memfile. 

Now it's in a shared state with an unknown virt technology.  It can be read and 
written.  Let's call this state BRAND_NEW.

2. Bind to a VM.

Now it's an a bound state.  For TDX, for example, let's call the new state 
BOUND_TDX.  In this state, the TDX rules are followed (private memory can't be 
converted, etc).

The problem here is that the BOUND_NEW state allows things that are nonsensical 
in TDX, and the binding step needs to invent some kind of semantics for what 
happens when binding a nonempty memfile.


So I would propose a somewhat different order:

1. Create a memfile.  It's in the UNBOUND state and no operations whatsoever 
are allowed except binding or closing.

2. Bind the memfile to a VM (or at least to a VM technology).  Now it's in the 
initial state appropriate for that VM.

For TDX, this completely bypasses the cases where the data is prepopulated and 
TDX can't handle it cleanly.  For SEV, it bypasses a situation in which data 
might be written to the memory before we find out whether that data will be 
unreclaimable or unmovable.


--

Now I have a question, since I don't think anyone has really answered it: how 
does this all work with SEV- or pKVM-like technologies in which private and 
shared pages share the same address space?  I sounds like you're proposing to 
have a big memfile that contains private and shared pages and to use that same 
memfile as pages are converted back and forth.  IO and even real physical DMA 
could be done on that memfile.  Am I understanding correctly?

If so, I think this makes sense, but I'm wondering if the actual memslot setup 
should be different.  For TDX, private memory lives in a logically separate 
memslot space.  For SEV and pKVM, it doesn't.  I assume the API can reflect 
this straightforwardly.

And the corresponding TDX question: is the intent still that shared pages 
aren't allowed at all in a TDX memfile?  If so, that would be the most direct 
mapping to what the hardware actually does.

--Andy



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-24 Thread Chao Peng
On Fri, Apr 22, 2022 at 01:06:25PM +0200, Paolo Bonzini wrote:
> On 4/22/22 12:56, Chao Peng wrote:
> >  /* memfile notifier flags */
> >  #define MFN_F_USER_INACCESSIBLE   0x0001  /* memory allocated in 
> > the file is inaccessible from userspace (e.g. read/write/mmap) */
> >  #define MFN_F_UNMOVABLE   0x0002  /* memory allocated in 
> > the file is unmovable */
> >  #define MFN_F_UNRECLAIMABLE   0x0003  /* memory allocated in 
> > the file is unreclaimable (e.g. via kswapd or any other pathes) */
> 
> You probably mean BIT(0/1/2) here.

Right, it's BIT(n), Thanks.

Chao
> 
> Paolo
> 
> >  When memfile_notifier is being registered, memfile_register_notifier 
> > will
> >  need check these flags. E.g. for MFN_F_USER_INACCESSIBLE, it fails when
> >  previous mmap-ed mapping exists on the fd (I'm still unclear on how to 
> > do
> >  this). When multiple consumers are supported it also need check all
> >  registered consumers to see if any conflict (e.g. all consumers should 
> > have
> >  MFN_F_USER_INACCESSIBLE set). Only when the register succeeds, the fd 
> > is
> >  converted into a private fd, before that, the fd is just a normal 
> > (shared)
> >  one. During this conversion, the previous data is preserved so you can 
> > put
> >  some initial data in guest pages (whether the architecture allows this 
> > is
> >  architecture-specific and out of the scope of this patch).



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-22 Thread Paolo Bonzini

On 4/22/22 12:56, Chao Peng wrote:

 /* memfile notifier flags */
 #define MFN_F_USER_INACCESSIBLE   0x0001  /* memory allocated in the 
file is inaccessible from userspace (e.g. read/write/mmap) */
 #define MFN_F_UNMOVABLE   0x0002  /* memory allocated in the 
file is unmovable */
 #define MFN_F_UNRECLAIMABLE   0x0003  /* memory allocated in the 
file is unreclaimable (e.g. via kswapd or any other pathes) */


You probably mean BIT(0/1/2) here.

Paolo


 When memfile_notifier is being registered, memfile_register_notifier will
 need check these flags. E.g. for MFN_F_USER_INACCESSIBLE, it fails when
 previous mmap-ed mapping exists on the fd (I'm still unclear on how to do
 this). When multiple consumers are supported it also need check all
 registered consumers to see if any conflict (e.g. all consumers should have
 MFN_F_USER_INACCESSIBLE set). Only when the register succeeds, the fd is
 converted into a private fd, before that, the fd is just a normal (shared)
 one. During this conversion, the previous data is preserved so you can put
 some initial data in guest pages (whether the architecture allows this is
 architecture-specific and out of the scope of this patch).





Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-22 Thread Chao Peng
On Tue, Apr 05, 2022 at 06:03:21PM +, Sean Christopherson wrote:
> On Tue, Apr 05, 2022, Quentin Perret wrote:
> > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> > > >>  - it can be very useful for protected VMs to do shared=>private
> > > >>conversions. Think of a VM receiving some data from the host in a
> > > >>shared buffer, and then it wants to operate on that buffer without
> > > >>risking to leak confidential informations in a transient state. In
> > > >>that case the most logical thing to do is to convert the buffer back
> > > >>to private, do whatever needs to be done on that buffer (decrypting 
> > > >> a
> > > >>frame, ...), and then share it back with the host to consume it;
> > > >
> > > > If performance is a motivation, why would the guest want to do two
> > > > conversions instead of just doing internal memcpy() to/from a private
> > > > page?  I would be quite surprised if multiple exits and TLB shootdowns 
> > > > is
> > > > actually faster, especially at any kind of scale where zapping stage-2
> > > > PTEs will cause lock contention and IPIs.
> > > 
> > > I don't know the numbers or all the details, but this is arm64, which is a
> > > rather better architecture than x86 in this regard.  So maybe it's not so
> > > bad, at least in very simple cases, ignoring all implementation details.
> > > (But see below.)  Also the systems in question tend to have fewer CPUs 
> > > than
> > > some of the massive x86 systems out there.
> > 
> > Yep. I can try and do some measurements if that's really necessary, but
> > I'm really convinced the cost of the TLBI for the shared->private
> > conversion is going to be significantly smaller than the cost of memcpy
> > the buffer twice in the guest for us.
> 
> It's not just the TLB shootdown, the VM-Exits aren't free.   And barring 
> non-trivial
> improvements to KVM's MMU, e.g. sharding of mmu_lock, modifying the page 
> tables will
> block all other updates and MMU operations.  Taking mmu_lock for read, should 
> arm64
> ever convert to a rwlock, is not an option because KVM needs to block other
> conversions to avoid races.
> 
> Hmm, though batching multiple pages into a single request would mitigate most 
> of
> the overhead.
> 
> > There are variations of that idea: e.g. allow userspace to mmap the
> > entire private fd but w/o taking a reference on pages mapped with
> > PROT_NONE. And then the VMM can use mprotect() in response to
> > share/unshare requests. I think Marc liked that idea as it keeps the
> > userspace API closer to normal KVM -- there actually is a
> > straightforward gpa->hva relation. Not sure how much that would impact
> > the implementation at this point.
> > 
> > For the shared=>private conversion, this would be something like so:
> > 
> >  - the guest issues a hypercall to unshare a page;
> > 
> >  - the hypervisor forwards the request to the host;
> > 
> >  - the host kernel forwards the request to userspace;
> > 
> >  - userspace then munmap()s the shared page;
> > 
> >  - KVM then tries to take a reference to the page. If it succeeds, it
> >re-enters the guest with a flag of some sort saying that the share
> >succeeded, and the hypervisor will adjust pgtables accordingly. If
> >KVM failed to take a reference, it flags this and the hypervisor will
> >be responsible for communicating that back to the guest. This means
> >the guest must handle failures (possibly fatal).
> > 
> > (There are probably many ways in which we can optimize this, e.g. by
> > having the host proactively munmap() pages it no longer needs so that
> > the unshare hypercall from the guest doesn't need to exit all the way
> > back to host userspace.)
> 
> ...
> 
> > > Maybe there could be a special mode for the private memory fds in which
> > > specific pages are marked as "managed by this fd but actually shared".
> > > pread() and pwrite() would work on those pages, but not mmap().  (Or maybe
> > > mmap() but the resulting mappings would not permit GUP.)
> 
> Unless I misunderstand what you intend by pread()/pwrite(), I think we'd need 
> to
> allow mmap(), otherwise e.g. uaccess from the kernel wouldn't work.
> 
> > > And transitioning them would be a special operation on the fd that is
> > > specific to pKVM and wouldn't work on TDX or SEV.
> 
> To keep things feature agnostic (IMO, baking TDX vs SEV vs pKVM info into 
> private-fd
> is a really bad idea), this could be handled by adding a flag and/or callback 
> into
> the notifier/client stating whether or not it supports mapping a private-fd, 
> and then
> mapping would be allowed if and only if all consumers support/allow mapping.
> 
> > > Hmm.  Sean and Chao, are we making a bit of a mistake by making these fds
> > > technology-agnostic?  That is, would we want to distinguish between a TDX
> > > backing fd, a SEV backing fd, a software-based backing fd, etc?  API-wise
> > > this could work by requiring the fd to be bound to a KVM VM 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-12 Thread Kirill A. Shutemov
On Mon, Mar 28, 2022 at 01:16:48PM -0700, Andy Lutomirski wrote:
> On Thu, Mar 10, 2022 at 6:09 AM Chao Peng  wrote:
> >
> > This is the v5 of this series which tries to implement the fd-based KVM
> > guest private memory. The patches are based on latest kvm/queue branch
> > commit:
> >
> >   d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
> 
> Can this series be run and a VM booted without TDX?  A feature like
> that might help push it forward.

It would require enlightenment of the guest code. We have two options.

Simple one is to limit enabling to the guest kernel, but it would require
non-destructive conversion between shared->private memory. This does not
seem to be compatible with current design.

Other option is get memory private from time 0 of VM boot, but it requires
modification of virtual BIOS to setup shared ranges as needed. I'm not
sure if anybody volunteer to work on BIOS code to make it happen.

Hm.

-- 
 Kirill A. Shutemov



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-12 Thread Chao Peng
On Fri, Apr 08, 2022 at 11:35:05AM -1000, Vishal Annapurve wrote:
> On Mon, Mar 28, 2022 at 10:17 AM Andy Lutomirski  wrote:
> >
> > On Thu, Mar 10, 2022 at 6:09 AM Chao Peng  
> > wrote:
> > >
> > > This is the v5 of this series which tries to implement the fd-based KVM
> > > guest private memory. The patches are based on latest kvm/queue branch
> > > commit:
> > >
> > >   d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
> >
> > Can this series be run and a VM booted without TDX?  A feature like
> > that might help push it forward.
> >
> > --Andy
> 
> I have posted a RFC series with selftests to exercise the UPM feature
> with normal non-confidential VMs via
> https://lore.kernel.org/kvm/20220408210545.3915712-1-vannapu...@google.com/

Thanks Vishal, this sounds very helpful, it already started to find
bugs.

Chao
> 
> -- Vishal



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-08 Thread Vishal Annapurve
On Mon, Mar 28, 2022 at 10:17 AM Andy Lutomirski  wrote:
>
> On Thu, Mar 10, 2022 at 6:09 AM Chao Peng  wrote:
> >
> > This is the v5 of this series which tries to implement the fd-based KVM
> > guest private memory. The patches are based on latest kvm/queue branch
> > commit:
> >
> >   d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
>
> Can this series be run and a VM booted without TDX?  A feature like
> that might help push it forward.
>
> --Andy

I have posted a RFC series with selftests to exercise the UPM feature
with normal non-confidential VMs via
https://lore.kernel.org/kvm/20220408210545.3915712-1-vannapu...@google.com/

-- Vishal



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-06 Thread Andy Lutomirski



On Tue, Apr 5, 2022, at 11:30 AM, Sean Christopherson wrote:
> On Tue, Apr 05, 2022, Andy Lutomirski wrote:

>
>> resume guest
>> *** host -> hypervisor -> guest ***
>> Guest unshares the page.
>> *** guest -> hypervisor ***
>> Hypervisor removes PTE.  TLBI.
>> *** hypervisor -> guest ***
>> 
>> Obviously considerable cleverness is needed to make a virt IOMMU like this
>> work well, but still.
>> 
>> Anyway, my suggestion is that the fd backing proposal get slightly modified
>> to get it ready for multiple subtypes of backing object, which should be a
>> pretty minimal change.  Then, if someone actually needs any of this
>> cleverness, it can be added later.  In the mean time, the
>> pread()/pwrite()/splice() scheme is pretty good.
>
> Tangentially related to getting private-fd ready for multiple things, 
> what about
> implementing the pread()/pwrite()/splice() scheme in pKVM itself?  I.e. 
> read() on
> the VM fd, with the offset corresponding to gfn in some way.
>

Hmm, could make sense.



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-06 Thread Quentin Perret
On Tuesday 05 Apr 2022 at 10:51:36 (-0700), Andy Lutomirski wrote:
> Let's try actually counting syscalls and mode transitions, at least 
> approximately.  For non-direct IO (DMA allocation on guest side, not straight 
> to/from pagecache or similar):
> 
> Guest writes to shared DMA buffer.  Assume the guest is smart and reuses the 
> buffer.
> Guest writes descriptor to shared virtio ring.
> Guest rings virtio doorbell, which causes an exit.
> *** guest -> hypervisor -> host ***
> host reads virtio ring (mmaped shared memory)
> host does pread() to read the DMA buffer or reads mmapped buffer
> host does the IO
> resume guest
> *** host -> hypervisor -> guest ***
> 
> This is essentially optimal in terms of transitions.  The data is copied on 
> the guest side (which may well be mandatory depending on what guest userspace 
> did to initiate the IO) and on the host (which may well be mandatory 
> depending on what the host is doing with the data).
> 
> Now let's try straight-from-guest-pagecache or otherwise zero-copy on the 
> guest side.  Without nondestructive changes, the guest needs a bounce buffer 
> and it looks just like the above.  One extra copy, zero extra mode 
> transitions.  With nondestructive changes, it's a bit more like physical 
> hardware with an IOMMU:
> 
> Guest shares the page.
> *** guest -> hypervisor ***
> Hypervisor adds a PTE.  Let's assume we're being very optimal and the host is 
> not synchronously notified.
> *** hypervisor -> guest ***
> Guest writes descriptor to shared virtio ring.
> Guest rings virtio doorbell, which causes an exit.
> *** guest -> hypervisor -> host ***
> host reads virtio ring (mmaped shared memory)
> 
> mmap  *** syscall ***
> host does the IO
> munmap *** syscall, TLBI ***
> 
> resume guest
> *** host -> hypervisor -> guest ***
> Guest unshares the page.
> *** guest -> hypervisor ***
> Hypervisor removes PTE.  TLBI.
> *** hypervisor -> guest ***
> 
> This is quite expensive.  For small IO, pread() or splice() in the host may 
> be a lot faster.  Even for large IO, splice() may still win.

Right, that would work nicely for pages that are shared transiently, but
less so for long-term shares. But I guess your proposal below should do
the trick.

> I can imagine clever improvements.  First, let's get rid of mmap() + 
> munmap().  Instead use a special device mapping with special semantics, not 
> regular memory.  (mmap and munmap are expensive even ignoring any arch and 
> TLB stuff.)  The rule is that, if the page is shared, access works, and if 
> private, access doesn't, but it's still mapped.  The hypervisor and the host 
> cooperate to make it so.

As long as the page can't be GUP'd I _think_ this shouldn't be a
problem. We can have the hypervisor re-inject the fault in the host. And
the host fault handler will deal with it just fine if the fault was
taken from userspace (inject a SEGV), or from the kernel through uaccess
macros. But we do get into issues if the host kernel can be tricked into
accessing the page via e.g. kmap(). I've been able to trigger this by
strace-ing a userspace process which passes a pointer to private memory
to a syscall. strace will inspect the syscall argument using
process_vm_readv(), which will pin_user_pages_remote() and access the
page via kmap(), and then we're in trouble. But preventing GUP would
prevent this by construction I think?

FWIW memfd_secret() did look like a good solution to this, but it lacks
the bidirectional notifiers with KVM that is offered by this patch
series, which is needed to allow KVM to handle guest faults, and also
offers a good framework to support future extensions (e.g.
hypervisor-assisted page migration, swap, ...). So yes, ideally
pKVM would use a kind of hybrid between memfd_secret and the private fd
proposed here, or something else providing similar properties.

Thanks,
Quentin



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-06 Thread Quentin Perret
On Tuesday 05 Apr 2022 at 18:03:21 (+), Sean Christopherson wrote:
> On Tue, Apr 05, 2022, Quentin Perret wrote:
> > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> > > >>  - it can be very useful for protected VMs to do shared=>private
> > > >>conversions. Think of a VM receiving some data from the host in a
> > > >>shared buffer, and then it wants to operate on that buffer without
> > > >>risking to leak confidential informations in a transient state. In
> > > >>that case the most logical thing to do is to convert the buffer back
> > > >>to private, do whatever needs to be done on that buffer (decrypting 
> > > >> a
> > > >>frame, ...), and then share it back with the host to consume it;
> > > >
> > > > If performance is a motivation, why would the guest want to do two
> > > > conversions instead of just doing internal memcpy() to/from a private
> > > > page?  I would be quite surprised if multiple exits and TLB shootdowns 
> > > > is
> > > > actually faster, especially at any kind of scale where zapping stage-2
> > > > PTEs will cause lock contention and IPIs.
> > > 
> > > I don't know the numbers or all the details, but this is arm64, which is a
> > > rather better architecture than x86 in this regard.  So maybe it's not so
> > > bad, at least in very simple cases, ignoring all implementation details.
> > > (But see below.)  Also the systems in question tend to have fewer CPUs 
> > > than
> > > some of the massive x86 systems out there.
> > 
> > Yep. I can try and do some measurements if that's really necessary, but
> > I'm really convinced the cost of the TLBI for the shared->private
> > conversion is going to be significantly smaller than the cost of memcpy
> > the buffer twice in the guest for us.
> 
> It's not just the TLB shootdown, the VM-Exits aren't free.

Ack, but we can at least work on the rest (number of exits, locking, ...).
The cost of the memcpy and the TLBI are really incompressible.

> And barring non-trivial
> improvements to KVM's MMU, e.g. sharding of mmu_lock, modifying the page 
> tables will
> block all other updates and MMU operations.  Taking mmu_lock for read, should 
> arm64
> ever convert to a rwlock, is not an option because KVM needs to block other
> conversions to avoid races.

FWIW the host mmu_lock isn't all that useful for pKVM. The host doesn't
have _any_ control over guest page-tables, and the hypervisor can't
safely rely on the host for locking, so we have hypervisor-level
synchronization.

> Hmm, though batching multiple pages into a single request would mitigate most 
> of
> the overhead.

Yep, there are a few tricks we can play to make this fairly efficient in
the most common cases. And fine-grain locking at EL2 is really high up
on the todo list :-)

Thanks,
Quentin



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-05 Thread Sean Christopherson
On Tue, Apr 05, 2022, Andy Lutomirski wrote:
> On Tue, Apr 5, 2022, at 3:36 AM, Quentin Perret wrote:
> > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> >> The best I can come up with is a special type of shared page that is not
> >> GUP-able and maybe not even mmappable, having a clear option for
> >> transitions to fail, and generally preventing the nasty cases from
> >> happening in the first place.
> >
> > Right, that sounds reasonable to me.
> 
> At least as a v1, this is probably more straightforward than allowing mmap().
> Also, there's much to be said for a simpler, limited API, to be expanded if
> genuinely needed, as opposed to starting out with a very featureful API.

Regarding "genuinely needed", IMO the same applies to supporting this at all.
Without numbers from something at least approximating a real use case, we're 
just
speculating on which will be the most performant approach.

> >> Maybe there could be a special mode for the private memory fds in which
> >> specific pages are marked as "managed by this fd but actually shared".
> >> pread() and pwrite() would work on those pages, but not mmap().  (Or maybe
> >> mmap() but the resulting mappings would not permit GUP.)  And
> >> transitioning them would be a special operation on the fd that is specific
> >> to pKVM and wouldn't work on TDX or SEV.
> >
> > Aha, didn't think of pread()/pwrite(). Very interesting.
> 
> There are plenty of use cases for which pread()/pwrite()/splice() will be as
> fast or even much faster than mmap()+memcpy().

...

> resume guest
> *** host -> hypervisor -> guest ***
> Guest unshares the page.
> *** guest -> hypervisor ***
> Hypervisor removes PTE.  TLBI.
> *** hypervisor -> guest ***
> 
> Obviously considerable cleverness is needed to make a virt IOMMU like this
> work well, but still.
> 
> Anyway, my suggestion is that the fd backing proposal get slightly modified
> to get it ready for multiple subtypes of backing object, which should be a
> pretty minimal change.  Then, if someone actually needs any of this
> cleverness, it can be added later.  In the mean time, the
> pread()/pwrite()/splice() scheme is pretty good.

Tangentially related to getting private-fd ready for multiple things, what about
implementing the pread()/pwrite()/splice() scheme in pKVM itself?  I.e. read() 
on
the VM fd, with the offset corresponding to gfn in some way.

Ditto for mmap() on the VM fd, though that would require additional changes 
outside
of pKVM.

That would allow pKVM to support in-place conversions without the private-fd 
having
to differentiate between the type of protected VM, and without having to provide
new APIs from the private-fd.  TDX, SNP, etc... Just Work by not supporting the 
pKVM
APIs.

And assuming we get multiple consumers down the road, pKVM will need to be able 
to
communicate the "true" state of a page to other consumers, because in addition 
to
being a consumer, pKVM is also an owner/enforcer analogous to the TDX Module and
the SEV PSP.



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-05 Thread Sean Christopherson
On Tue, Apr 05, 2022, Quentin Perret wrote:
> On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> > >>  - it can be very useful for protected VMs to do shared=>private
> > >>conversions. Think of a VM receiving some data from the host in a
> > >>shared buffer, and then it wants to operate on that buffer without
> > >>risking to leak confidential informations in a transient state. In
> > >>that case the most logical thing to do is to convert the buffer back
> > >>to private, do whatever needs to be done on that buffer (decrypting a
> > >>frame, ...), and then share it back with the host to consume it;
> > >
> > > If performance is a motivation, why would the guest want to do two
> > > conversions instead of just doing internal memcpy() to/from a private
> > > page?  I would be quite surprised if multiple exits and TLB shootdowns is
> > > actually faster, especially at any kind of scale where zapping stage-2
> > > PTEs will cause lock contention and IPIs.
> > 
> > I don't know the numbers or all the details, but this is arm64, which is a
> > rather better architecture than x86 in this regard.  So maybe it's not so
> > bad, at least in very simple cases, ignoring all implementation details.
> > (But see below.)  Also the systems in question tend to have fewer CPUs than
> > some of the massive x86 systems out there.
> 
> Yep. I can try and do some measurements if that's really necessary, but
> I'm really convinced the cost of the TLBI for the shared->private
> conversion is going to be significantly smaller than the cost of memcpy
> the buffer twice in the guest for us.

It's not just the TLB shootdown, the VM-Exits aren't free.   And barring 
non-trivial
improvements to KVM's MMU, e.g. sharding of mmu_lock, modifying the page tables 
will
block all other updates and MMU operations.  Taking mmu_lock for read, should 
arm64
ever convert to a rwlock, is not an option because KVM needs to block other
conversions to avoid races.

Hmm, though batching multiple pages into a single request would mitigate most of
the overhead.

> There are variations of that idea: e.g. allow userspace to mmap the
> entire private fd but w/o taking a reference on pages mapped with
> PROT_NONE. And then the VMM can use mprotect() in response to
> share/unshare requests. I think Marc liked that idea as it keeps the
> userspace API closer to normal KVM -- there actually is a
> straightforward gpa->hva relation. Not sure how much that would impact
> the implementation at this point.
> 
> For the shared=>private conversion, this would be something like so:
> 
>  - the guest issues a hypercall to unshare a page;
> 
>  - the hypervisor forwards the request to the host;
> 
>  - the host kernel forwards the request to userspace;
> 
>  - userspace then munmap()s the shared page;
> 
>  - KVM then tries to take a reference to the page. If it succeeds, it
>re-enters the guest with a flag of some sort saying that the share
>succeeded, and the hypervisor will adjust pgtables accordingly. If
>KVM failed to take a reference, it flags this and the hypervisor will
>be responsible for communicating that back to the guest. This means
>the guest must handle failures (possibly fatal).
> 
> (There are probably many ways in which we can optimize this, e.g. by
> having the host proactively munmap() pages it no longer needs so that
> the unshare hypercall from the guest doesn't need to exit all the way
> back to host userspace.)

...

> > Maybe there could be a special mode for the private memory fds in which
> > specific pages are marked as "managed by this fd but actually shared".
> > pread() and pwrite() would work on those pages, but not mmap().  (Or maybe
> > mmap() but the resulting mappings would not permit GUP.)

Unless I misunderstand what you intend by pread()/pwrite(), I think we'd need to
allow mmap(), otherwise e.g. uaccess from the kernel wouldn't work.

> > And transitioning them would be a special operation on the fd that is
> > specific to pKVM and wouldn't work on TDX or SEV.

To keep things feature agnostic (IMO, baking TDX vs SEV vs pKVM info into 
private-fd
is a really bad idea), this could be handled by adding a flag and/or callback 
into
the notifier/client stating whether or not it supports mapping a private-fd, 
and then
mapping would be allowed if and only if all consumers support/allow mapping.

> > Hmm.  Sean and Chao, are we making a bit of a mistake by making these fds
> > technology-agnostic?  That is, would we want to distinguish between a TDX
> > backing fd, a SEV backing fd, a software-based backing fd, etc?  API-wise
> > this could work by requiring the fd to be bound to a KVM VM instance and
> > possibly even configured a bit before any other operations would be
> > allowed.

I really don't want to distinguish between between each exact feature, but I've
no objection to adding flags/callbacks to track specific properties of the
downstream consumers, e.g. "can this 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-05 Thread Andy Lutomirski



On Tue, Apr 5, 2022, at 3:36 AM, Quentin Perret wrote:
> On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
>> 
>> 
>> On Mon, Apr 4, 2022, at 10:06 AM, Sean Christopherson wrote:
>> > On Mon, Apr 04, 2022, Quentin Perret wrote:
>> >> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote:
>> >> FWIW, there are a couple of reasons why I'd like to have in-place
>> >> conversions:
>> >> 
>> >>  - one goal of pKVM is to migrate some things away from the Arm
>> >>Trustzone environment (e.g. DRM and the likes) and into protected VMs
>> >>instead. This will give Linux a fighting chance to defend itself
>> >>against these things -- they currently have access to _all_ memory.
>> >>And transitioning pages between Linux and Trustzone (donations and
>> >>shares) is fast and non-destructive, so we really do not want pKVM to
>> >>regress by requiring the hypervisor to memcpy things;
>> >
>> > Is there actually a _need_ for the conversion to be non-destructive?  
>> > E.g. I assume
>> > the "trusted" side of things will need to be reworked to run as a pKVM 
>> > guest, at
>> > which point reworking its logic to understand that conversions are 
>> > destructive and
>> > slow-ish doesn't seem too onerous.
>> >
>> >>  - it can be very useful for protected VMs to do shared=>private
>> >>conversions. Think of a VM receiving some data from the host in a
>> >>shared buffer, and then it wants to operate on that buffer without
>> >>risking to leak confidential informations in a transient state. In
>> >>that case the most logical thing to do is to convert the buffer back
>> >>to private, do whatever needs to be done on that buffer (decrypting a
>> >>frame, ...), and then share it back with the host to consume it;
>> >
>> > If performance is a motivation, why would the guest want to do two 
>> > conversions
>> > instead of just doing internal memcpy() to/from a private page?  I 
>> > would be quite
>> > surprised if multiple exits and TLB shootdowns is actually faster, 
>> > especially at
>> > any kind of scale where zapping stage-2 PTEs will cause lock contention 
>> > and IPIs.
>> 
>> I don't know the numbers or all the details, but this is arm64, which is a 
>> rather better architecture than x86 in this regard.  So maybe it's not so 
>> bad, at least in very simple cases, ignoring all implementation details.  
>> (But see below.)  Also the systems in question tend to have fewer CPUs than 
>> some of the massive x86 systems out there.
>
> Yep. I can try and do some measurements if that's really necessary, but
> I'm really convinced the cost of the TLBI for the shared->private
> conversion is going to be significantly smaller than the cost of memcpy
> the buffer twice in the guest for us. To be fair, although the cost for
> the CPU update is going to be low, the cost for IOMMU updates _might_ be
> higher, but that very much depends on the hardware. On systems that use
> e.g. the Arm SMMU, the IOMMUs can use the CPU page-tables directly, and
> the iotlb invalidation is done on the back of the CPU invalidation. So,
> on systems with sane hardware the overhead is *really* quite small.
>
> Also, memcpy requires double the memory, it is pretty bad for power, and
> it causes memory traffic which can't be a good thing for things running
> concurrently.
>
>> If we actually wanted to support transitioning the same page between shared 
>> and private, though, we have a bit of an awkward situation.  Private to 
>> shared is conceptually easy -- do some bookkeeping, reconstitute the direct 
>> map entry, and it's done.  The other direction is a mess: all existing uses 
>> of the page need to be torn down.  If the page has been recently used for 
>> DMA, this includes IOMMU entries.
>>
>> Quentin: let's ignore any API issues for now.  Do you have a concept of how 
>> a nondestructive shared -> private transition could work well, even in 
>> principle?
>
> I had a high level idea for the workflow, but I haven't looked into the
> implementation details.
>
> The idea would be to allow KVM *or* userspace to take a reference
> to a page in the fd in an exclusive manner. KVM could take a reference
> on a page (which would be necessary before to donating it to a guest)
> using some kind of memfile_notifier as proposed in this series, and
> userspace could do the same some other way (mmap presumably?). In both
> cases, the operation might fail.
>
> I would imagine the boot and private->shared flow as follow:
>
>  - the VMM uses fallocate on the private fd, and associates the offset, size> with a memslot;
>
>  - the guest boots, and as part of that KVM takes references to all the
>pages that are donated to the guest. If userspace happens to have a
>mapping to a page, KVM will fail to take the reference, which would
>be fatal for the guest.
>
>  - once the guest has booted, it issues a hypercall to share a page back
>with the host;
>
>  - KVM is notified, 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-05 Thread Quentin Perret
On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> 
> 
> On Mon, Apr 4, 2022, at 10:06 AM, Sean Christopherson wrote:
> > On Mon, Apr 04, 2022, Quentin Perret wrote:
> >> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote:
> >> FWIW, there are a couple of reasons why I'd like to have in-place
> >> conversions:
> >> 
> >>  - one goal of pKVM is to migrate some things away from the Arm
> >>Trustzone environment (e.g. DRM and the likes) and into protected VMs
> >>instead. This will give Linux a fighting chance to defend itself
> >>against these things -- they currently have access to _all_ memory.
> >>And transitioning pages between Linux and Trustzone (donations and
> >>shares) is fast and non-destructive, so we really do not want pKVM to
> >>regress by requiring the hypervisor to memcpy things;
> >
> > Is there actually a _need_ for the conversion to be non-destructive?  
> > E.g. I assume
> > the "trusted" side of things will need to be reworked to run as a pKVM 
> > guest, at
> > which point reworking its logic to understand that conversions are 
> > destructive and
> > slow-ish doesn't seem too onerous.
> >
> >>  - it can be very useful for protected VMs to do shared=>private
> >>conversions. Think of a VM receiving some data from the host in a
> >>shared buffer, and then it wants to operate on that buffer without
> >>risking to leak confidential informations in a transient state. In
> >>that case the most logical thing to do is to convert the buffer back
> >>to private, do whatever needs to be done on that buffer (decrypting a
> >>frame, ...), and then share it back with the host to consume it;
> >
> > If performance is a motivation, why would the guest want to do two 
> > conversions
> > instead of just doing internal memcpy() to/from a private page?  I 
> > would be quite
> > surprised if multiple exits and TLB shootdowns is actually faster, 
> > especially at
> > any kind of scale where zapping stage-2 PTEs will cause lock contention 
> > and IPIs.
> 
> I don't know the numbers or all the details, but this is arm64, which is a 
> rather better architecture than x86 in this regard.  So maybe it's not so 
> bad, at least in very simple cases, ignoring all implementation details.  
> (But see below.)  Also the systems in question tend to have fewer CPUs than 
> some of the massive x86 systems out there.

Yep. I can try and do some measurements if that's really necessary, but
I'm really convinced the cost of the TLBI for the shared->private
conversion is going to be significantly smaller than the cost of memcpy
the buffer twice in the guest for us. To be fair, although the cost for
the CPU update is going to be low, the cost for IOMMU updates _might_ be
higher, but that very much depends on the hardware. On systems that use
e.g. the Arm SMMU, the IOMMUs can use the CPU page-tables directly, and
the iotlb invalidation is done on the back of the CPU invalidation. So,
on systems with sane hardware the overhead is *really* quite small.

Also, memcpy requires double the memory, it is pretty bad for power, and
it causes memory traffic which can't be a good thing for things running
concurrently.

> If we actually wanted to support transitioning the same page between shared 
> and private, though, we have a bit of an awkward situation.  Private to 
> shared is conceptually easy -- do some bookkeeping, reconstitute the direct 
> map entry, and it's done.  The other direction is a mess: all existing uses 
> of the page need to be torn down.  If the page has been recently used for 
> DMA, this includes IOMMU entries.
>
> Quentin: let's ignore any API issues for now.  Do you have a concept of how a 
> nondestructive shared -> private transition could work well, even in 
> principle?

I had a high level idea for the workflow, but I haven't looked into the
implementation details.

The idea would be to allow KVM *or* userspace to take a reference
to a page in the fd in an exclusive manner. KVM could take a reference
on a page (which would be necessary before to donating it to a guest)
using some kind of memfile_notifier as proposed in this series, and
userspace could do the same some other way (mmap presumably?). In both
cases, the operation might fail.

I would imagine the boot and private->shared flow as follow:

 - the VMM uses fallocate on the private fd, and associates the  with a memslot;

 - the guest boots, and as part of that KVM takes references to all the
   pages that are donated to the guest. If userspace happens to have a
   mapping to a page, KVM will fail to take the reference, which would
   be fatal for the guest.

 - once the guest has booted, it issues a hypercall to share a page back
   with the host;

 - KVM is notified, and at that point it drops its reference to the
   page. It then exits to userspace to notify it of the share;

 - host userspace receives the share, and mmaps the shared page with
   MAP_FIXED to 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-04 Thread Andy Lutomirski



On Mon, Apr 4, 2022, at 10:06 AM, Sean Christopherson wrote:
> On Mon, Apr 04, 2022, Quentin Perret wrote:
>> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote:
>> FWIW, there are a couple of reasons why I'd like to have in-place
>> conversions:
>> 
>>  - one goal of pKVM is to migrate some things away from the Arm
>>Trustzone environment (e.g. DRM and the likes) and into protected VMs
>>instead. This will give Linux a fighting chance to defend itself
>>against these things -- they currently have access to _all_ memory.
>>And transitioning pages between Linux and Trustzone (donations and
>>shares) is fast and non-destructive, so we really do not want pKVM to
>>regress by requiring the hypervisor to memcpy things;
>
> Is there actually a _need_ for the conversion to be non-destructive?  
> E.g. I assume
> the "trusted" side of things will need to be reworked to run as a pKVM 
> guest, at
> which point reworking its logic to understand that conversions are 
> destructive and
> slow-ish doesn't seem too onerous.
>
>>  - it can be very useful for protected VMs to do shared=>private
>>conversions. Think of a VM receiving some data from the host in a
>>shared buffer, and then it wants to operate on that buffer without
>>risking to leak confidential informations in a transient state. In
>>that case the most logical thing to do is to convert the buffer back
>>to private, do whatever needs to be done on that buffer (decrypting a
>>frame, ...), and then share it back with the host to consume it;
>
> If performance is a motivation, why would the guest want to do two 
> conversions
> instead of just doing internal memcpy() to/from a private page?  I 
> would be quite
> surprised if multiple exits and TLB shootdowns is actually faster, 
> especially at
> any kind of scale where zapping stage-2 PTEs will cause lock contention 
> and IPIs.

I don't know the numbers or all the details, but this is arm64, which is a 
rather better architecture than x86 in this regard.  So maybe it's not so bad, 
at least in very simple cases, ignoring all implementation details.  (But see 
below.)  Also the systems in question tend to have fewer CPUs than some of the 
massive x86 systems out there.

If we actually wanted to support transitioning the same page between shared and 
private, though, we have a bit of an awkward situation.  Private to shared is 
conceptually easy -- do some bookkeeping, reconstitute the direct map entry, 
and it's done.  The other direction is a mess: all existing uses of the page 
need to be torn down.  If the page has been recently used for DMA, this 
includes IOMMU entries.

Quentin: let's ignore any API issues for now.  Do you have a concept of how a 
nondestructive shared -> private transition could work well, even in principle? 
 The best I can come up with is a special type of shared page that is not 
GUP-able and maybe not even mmappable, having a clear option for transitions to 
fail, and generally preventing the nasty cases from happening in the first 
place.

Maybe there could be a special mode for the private memory fds in which 
specific pages are marked as "managed by this fd but actually shared".  pread() 
and pwrite() would work on those pages, but not mmap().  (Or maybe mmap() but 
the resulting mappings would not permit GUP.)  And transitioning them would be 
a special operation on the fd that is specific to pKVM and wouldn't work on TDX 
or SEV.

Hmm.  Sean and Chao, are we making a bit of a mistake by making these fds 
technology-agnostic?  That is, would we want to distinguish between a TDX 
backing fd, a SEV backing fd, a software-based backing fd, etc?  API-wise this 
could work by requiring the fd to be bound to a KVM VM instance and possibly 
even configured a bit before any other operations would be allowed.

(Destructive transitions nicely avoid all the nasty cases.  If something is 
still pinning a shared page when it's "transitioned" to private (really just 
replaced with a new page), then the old page continues existing for as long as 
needed as a separate object.)



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-04 Thread Sean Christopherson
On Mon, Apr 04, 2022, Quentin Perret wrote:
> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote:
> FWIW, there are a couple of reasons why I'd like to have in-place
> conversions:
> 
>  - one goal of pKVM is to migrate some things away from the Arm
>Trustzone environment (e.g. DRM and the likes) and into protected VMs
>instead. This will give Linux a fighting chance to defend itself
>against these things -- they currently have access to _all_ memory.
>And transitioning pages between Linux and Trustzone (donations and
>shares) is fast and non-destructive, so we really do not want pKVM to
>regress by requiring the hypervisor to memcpy things;

Is there actually a _need_ for the conversion to be non-destructive?  E.g. I 
assume
the "trusted" side of things will need to be reworked to run as a pKVM guest, at
which point reworking its logic to understand that conversions are destructive 
and
slow-ish doesn't seem too onerous.

>  - it can be very useful for protected VMs to do shared=>private
>conversions. Think of a VM receiving some data from the host in a
>shared buffer, and then it wants to operate on that buffer without
>risking to leak confidential informations in a transient state. In
>that case the most logical thing to do is to convert the buffer back
>to private, do whatever needs to be done on that buffer (decrypting a
>frame, ...), and then share it back with the host to consume it;

If performance is a motivation, why would the guest want to do two conversions
instead of just doing internal memcpy() to/from a private page?  I would be 
quite
surprised if multiple exits and TLB shootdowns is actually faster, especially at
any kind of scale where zapping stage-2 PTEs will cause lock contention and 
IPIs.

>  - similar to the previous point, a protected VM might want to
>temporarily turn a buffer private to avoid ToCToU issues;

Again, bounce buffer the page in the guest.

>  - once we're able to do device assignment to protected VMs, this might
>allow DMA-ing to a private buffer, and make it shared later w/o
>bouncing.

Exposing a private buffer to a device doesn't requring in-place conversion.  The
proper way to handle this would be to teach e.g. VFIO to retrieve the PFN from
the backing store.  I don't understand the use case for sharing a DMA'd page at 
a
later time; with whom would the guest share the page?  E.g. if a NIC has access 
to
guest private data then there should never be a need to convert/bounce the page.



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-04 Thread Quentin Perret
On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote:
> On Fri, Apr 1, 2022, at 7:59 AM, Quentin Perret wrote:
> > On Thursday 31 Mar 2022 at 09:04:56 (-0700), Andy Lutomirski wrote:
> 
> 
> > To answer your original question about memory 'conversion', the key
> > thing is that the pKVM hypervisor controls the stage-2 page-tables for
> > everyone in the system, all guests as well as the host. As such, a page
> > 'conversion' is nothing more than a permission change in the relevant
> > page-tables.
> >
> 
> So I can see two different ways to approach this.
> 
> One is that you split the whole address space in half and, just like SEV and 
> TDX, allocate one bit to indicate the shared/private status of a page.  This 
> makes it work a lot like SEV and TDX.
>
> The other is to have shared and private pages be distinguished only by their 
> hypercall history and the (protected) page tables.  This saves some address 
> space and some page table allocations, but it opens some cans of worms too.  
> In particular, the guest and the hypervisor need to coordinate, in a way that 
> the guest can trust, to ensure that the guest's idea of which pages are 
> private match the host's.  This model seems a bit harder to support nicely 
> with the private memory fd model, but not necessarily impossible.

Right. Perhaps one thing I should clarify as well: pKVM (as opposed to
TDX) has only _one_ page-table per guest, and it is controllex by the
hypervisor only. So the hypervisor needs to be involved for both shared
and private mappings. As such, shared pages have relatively similar
constraints when it comes to host mm stuff --  we can't migrate shared
pages or swap them out without getting the hypervisor involved.

> Also, what are you trying to accomplish by having the host userspace mmap 
> private pages?

What I would really like to have is non-destructive in-place conversions
of pages. mmap-ing the pages that have been shared back felt like a good
fit for the private=>shared conversion, but in fact I'm not all that
opinionated about the API as long as the behaviour and the performance
are there. Happy to look into alternatives.

FWIW, there are a couple of reasons why I'd like to have in-place
conversions:

 - one goal of pKVM is to migrate some things away from the Arm
   Trustzone environment (e.g. DRM and the likes) and into protected VMs
   instead. This will give Linux a fighting chance to defend itself
   against these things -- they currently have access to _all_ memory.
   And transitioning pages between Linux and Trustzone (donations and
   shares) is fast and non-destructive, so we really do not want pKVM to
   regress by requiring the hypervisor to memcpy things;

 - it can be very useful for protected VMs to do shared=>private
   conversions. Think of a VM receiving some data from the host in a
   shared buffer, and then it wants to operate on that buffer without
   risking to leak confidential informations in a transient state. In
   that case the most logical thing to do is to convert the buffer back
   to private, do whatever needs to be done on that buffer (decrypting a
   frame, ...), and then share it back with the host to consume it;

 - similar to the previous point, a protected VM might want to
   temporarily turn a buffer private to avoid ToCToU issues;

 - once we're able to do device assignment to protected VMs, this might
   allow DMA-ing to a private buffer, and make it shared later w/o
   bouncing.

And there is probably more.

IIUC, the private fd proposal as it stands requires shared and private
pages to come from entirely distinct places. So it's not entirely clear
to me how any of the above could be supported without having the
hypervisor memcpy the data during conversions, which I really don't want
to do for performance reasons.

> Is the idea that multiple guest could share the same page until such time as 
> one of them tries to write to it?

That would certainly be possible to implement in the pKVM
environment with the right tracking, so I think it is worth considering
as a future goal.

Thanks,
Quentin



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-01 Thread Andy Lutomirski
On Fri, Apr 1, 2022, at 7:59 AM, Quentin Perret wrote:
> On Thursday 31 Mar 2022 at 09:04:56 (-0700), Andy Lutomirski wrote:


> To answer your original question about memory 'conversion', the key
> thing is that the pKVM hypervisor controls the stage-2 page-tables for
> everyone in the system, all guests as well as the host. As such, a page
> 'conversion' is nothing more than a permission change in the relevant
> page-tables.
>

So I can see two different ways to approach this.

One is that you split the whole address space in half and, just like SEV and 
TDX, allocate one bit to indicate the shared/private status of a page.  This 
makes it work a lot like SEV and TDX.

The other is to have shared and private pages be distinguished only by their 
hypercall history and the (protected) page tables.  This saves some address 
space and some page table allocations, but it opens some cans of worms too.  In 
particular, the guest and the hypervisor need to coordinate, in a way that the 
guest can trust, to ensure that the guest's idea of which pages are private 
match the host's.  This model seems a bit harder to support nicely with the 
private memory fd model, but not necessarily impossible.

Also, what are you trying to accomplish by having the host userspace mmap 
private pages?  Is the idea that multiple guest could share the same page until 
such time as one of them tries to write to it?  That would be kind of like 
having a third kind of memory that's visible to host and guests but is 
read-only for everyone.  TDX and SEV can't support this at all (a private page 
belongs to one guest and one guest only, at least in SEV and in the current TDX 
SEAM spec).  I imagine that this could be supported with private memory fds 
with some care without mmap, though -- the host could still populate the page 
with memcpy.  Or I suppose a memslot could support using MAP_PRIVATE fds and 
have approximately the right semantics.

--Andy





Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-01 Thread Sean Christopherson
On Fri, Apr 01, 2022, Quentin Perret wrote:
> On Friday 01 Apr 2022 at 17:14:21 (+), Sean Christopherson wrote:
> > On Fri, Apr 01, 2022, Quentin Perret wrote:
> > I assume there is a scenario where a page can be converted from 
> > shared=>private?
> > If so, is there a use case where that happens post-boot _and_ the contents 
> > of the
> > page are preserved?
> 
> I think most our use-cases are private=>shared, but how is that
> different?

Ah, it's not really different.  What I really was trying to understand is if 
there
are post-boot conversions that preserve data.  I asked about shared=>private 
because
there are known pre-boot conversions, e.g. populating the initial guest image, 
but
AFAIK there are no use cases for post-boot conversions, which might be more 
needy in
terms of performance.

> > > We currently don't allow the host punching holes in the guest IPA space.
> > 
> > The hole doesn't get punched in guest IPA space, it gets punched in the 
> > private
> > backing store, which is host PA space.
> 
> Hmm, in a previous message I thought that you mentioned when a whole
> gets punched in the fd KVM will go and unmap the page in the private
> SPTEs, which will cause a fatal error for any subsequent access from the
> guest to the corresponding IPA?

Oooh, that was in the context of TDX.  Mixing VMX and arm64 terminology... TDX 
has
two separate stage-2 roots, one for private IPAs and one for shared IPAs.  The
guest selects private/shared by toggling a bit stolen from the guest IPA space.
Upon conversion, KVM will remove from one stage-2 tree and insert into the 
other.

But even then, subsequent accesses to the wrong IPA won't be fatal, as KVM will
treat them as implicit conversions.  I wish they could be fatal, but that's not
"allowed" given the guest/host contract dictated by the TDX specs.

> If that's correct, I meant that we currently don't support that - the
> host can't unmap anything from the guest stage-2, it can only tear it
> down entirely. But again, I'm not too worried about that, we could
> certainly implement that part without too many issues.

I believe for the pKVM case it wouldn't be unmapping, it would be a PFN change.

> > > Once it has donated a page to a guest, it can't have it back until the
> > > guest has been entirely torn down (at which point all of memory is
> > > poisoned by the hypervisor obviously).
> > 
> > The guest doesn't have to know that it was handed back a different page.  
> > It will
> > require defining the semantics to state that the trusted hypervisor will 
> > clear
> > that page on conversion, but IMO the trusted hypervisor should be doing that
> > anyways.  IMO, forcing on the guest to correctly zero pages on conversion is
> > unnecessarily risky because converting private=>shared and preserving the 
> > contents
> > should be a very, very rare scenario, i.e. it's just one more thing for the 
> > guest
> > to get wrong.
> 
> I'm not sure I agree. The guest is going to communicate with an
> untrusted entity via that shared page, so it better be careful. Guest
> hardening in general is a major topic, and of all problems, zeroing the
> page before sharing is probably one of the simplest to solve.

Yes, for private=>shared you're correct, the guest needs to be paranoid as
there are no guarantees as to what data may be in the shared page.

I was thinking more in the context of shared=>private conversions, e.g. the 
guest
is done sharing a page and wants it back.  In that case, forcing the guest to 
zero
the private page upon re-acceptance is dicey.  Hmm, but if the guest needs to
explicitly re-accept the page, then putting the onus on the guest to zero the 
page
isn't a big deal.  The pKVM contract would just need to make it clear that the
guest cannot make any assumptions about the state of private data 

Oh, now I remember why I'm biased toward the trusted entity doing the work.
IIRC, thanks to TDX's lovely memory poisoning and cache aliasing behavior, the
guest can't be trusted to properly initialize private memory with the guest key,
i.e. the guest could induce a #MC and crash the host.

Anywho, I agree that for performance reasons, requiring the guest to zero 
private
pages is preferable so long as the guest must explicitly accept/initiate 
conversions.

> Also, note that in pKVM all the hypervisor code at EL2 runs with
> preemption disabled, which is a strict constraint. As such one of the
> main goals is the spend as little time as possible in that context.
> We're trying hard to keep the amount of zeroing/memcpy-ing to an
> absolute minimum. And that's especially true as we introduce support for
> huge pages. So, we'll take every opportunity we get to have the guest
> or the host do that work.

FWIW, TDX has the exact same constraints (they're actually worse as the trusted
entity runs with _all_ interrupts blocked).  And yeah, it needs to be careful 
when
dealing with huge pages, e.g. many flows force the guest/host to do 512 * 4kb 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-01 Thread Quentin Perret
On Friday 01 Apr 2022 at 17:14:21 (+), Sean Christopherson wrote:
> On Fri, Apr 01, 2022, Quentin Perret wrote:
> > The typical flow is as follows:
> > 
> >  - the host asks the hypervisor to run a guest;
> > 
> >  - the hypervisor does the context switch, which includes switching
> >stage-2 page-tables;
> > 
> >  - initially the guest has an empty stage-2 (we don't require
> >pre-faulting everything), which means it'll immediately fault;
> > 
> >  - the hypervisor switches back to host context to handle the guest
> >fault;
> > 
> >  - the host handler finds the corresponding memslot and does the
> >ipa->hva conversion. In our current implementation it uses a longterm
> >GUP pin on the corresponding page;
> > 
> >  - once it has a page, the host handler issues a hypercall to donate the
> >page to the guest;
> > 
> >  - the hypervisor does a bunch of checks to make sure the host owns the
> >page, and if all is fine it will unmap it from the host stage-2 and
> >map it in the guest stage-2, and do some bookkeeping as it needs to
> >track page ownership, etc;
> > 
> >  - the guest can then proceed to run, and possibly faults in many more
> >pages;
> > 
> >  - when it wants to, the guest can then issue a hypercall to share a
> >page back with the host;
> > 
> >  - the hypervisor checks the request, maps the page back in the host
> >stage-2, does more bookkeeping and returns back to the host to notify
> >it of the share;
> > 
> >  - the host kernel at that point can exit back to userspace to relay
> >that information to the VMM;
> > 
> >  - rinse and repeat.
> 
> I assume there is a scenario where a page can be converted from 
> shared=>private?
> If so, is there a use case where that happens post-boot _and_ the contents of 
> the
> page are preserved?

I think most our use-cases are private=>shared, but how is that
different?

> > We currently don't allow the host punching holes in the guest IPA space.
> 
> The hole doesn't get punched in guest IPA space, it gets punched in the 
> private
> backing store, which is host PA space.

Hmm, in a previous message I thought that you mentioned when a whole
gets punched in the fd KVM will go and unmap the page in the private
SPTEs, which will cause a fatal error for any subsequent access from the
guest to the corresponding IPA?

If that's correct, I meant that we currently don't support that - the
host can't unmap anything from the guest stage-2, it can only tear it
down entirely. But again, I'm not too worried about that, we could
certainly implement that part without too many issues.

> > Once it has donated a page to a guest, it can't have it back until the
> > guest has been entirely torn down (at which point all of memory is
> > poisoned by the hypervisor obviously).
> 
> The guest doesn't have to know that it was handed back a different page.  It 
> will
> require defining the semantics to state that the trusted hypervisor will clear
> that page on conversion, but IMO the trusted hypervisor should be doing that
> anyways.  IMO, forcing on the guest to correctly zero pages on conversion is
> unnecessarily risky because converting private=>shared and preserving the 
> contents
> should be a very, very rare scenario, i.e. it's just one more thing for the 
> guest
> to get wrong.

I'm not sure I agree. The guest is going to communicate with an
untrusted entity via that shared page, so it better be careful. Guest
hardening in general is a major topic, and of all problems, zeroing the
page before sharing is probably one of the simplest to solve.

Also, note that in pKVM all the hypervisor code at EL2 runs with
preemption disabled, which is a strict constraint. As such one of the
main goals is the spend as little time as possible in that context.
We're trying hard to keep the amount of zeroing/memcpy-ing to an
absolute minimum. And that's especially true as we introduce support for
huge pages. So, we'll take every opportunity we get to have the guest
or the host do that work.



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-01 Thread Sean Christopherson
On Fri, Apr 01, 2022, Quentin Perret wrote:
> The typical flow is as follows:
> 
>  - the host asks the hypervisor to run a guest;
> 
>  - the hypervisor does the context switch, which includes switching
>stage-2 page-tables;
> 
>  - initially the guest has an empty stage-2 (we don't require
>pre-faulting everything), which means it'll immediately fault;
> 
>  - the hypervisor switches back to host context to handle the guest
>fault;
> 
>  - the host handler finds the corresponding memslot and does the
>ipa->hva conversion. In our current implementation it uses a longterm
>GUP pin on the corresponding page;
> 
>  - once it has a page, the host handler issues a hypercall to donate the
>page to the guest;
> 
>  - the hypervisor does a bunch of checks to make sure the host owns the
>page, and if all is fine it will unmap it from the host stage-2 and
>map it in the guest stage-2, and do some bookkeeping as it needs to
>track page ownership, etc;
> 
>  - the guest can then proceed to run, and possibly faults in many more
>pages;
> 
>  - when it wants to, the guest can then issue a hypercall to share a
>page back with the host;
> 
>  - the hypervisor checks the request, maps the page back in the host
>stage-2, does more bookkeeping and returns back to the host to notify
>it of the share;
> 
>  - the host kernel at that point can exit back to userspace to relay
>that information to the VMM;
> 
>  - rinse and repeat.

I assume there is a scenario where a page can be converted from shared=>private?
If so, is there a use case where that happens post-boot _and_ the contents of 
the
page are preserved?

> We currently don't allow the host punching holes in the guest IPA space.

The hole doesn't get punched in guest IPA space, it gets punched in the private
backing store, which is host PA space.

> Once it has donated a page to a guest, it can't have it back until the
> guest has been entirely torn down (at which point all of memory is
> poisoned by the hypervisor obviously).

The guest doesn't have to know that it was handed back a different page.  It 
will
require defining the semantics to state that the trusted hypervisor will clear
that page on conversion, but IMO the trusted hypervisor should be doing that
anyways.  IMO, forcing on the guest to correctly zero pages on conversion is
unnecessarily risky because converting private=>shared and preserving the 
contents
should be a very, very rare scenario, i.e. it's just one more thing for the 
guest
to get wrong.

If there is a use case where the page contents need to be preserved, then that 
can
and should be an explicit request from the guest, and can be handled through
export/import style functions.  Export/import would be slow-ish due to memcpy(),
which is why I asked if there's a need to do this specific action frequently (or
at all).



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-01 Thread Quentin Perret
On Thursday 31 Mar 2022 at 09:04:56 (-0700), Andy Lutomirski wrote:
> On Wed, Mar 30, 2022, at 10:58 AM, Sean Christopherson wrote:
> > On Wed, Mar 30, 2022, Quentin Perret wrote:
> >> On Wednesday 30 Mar 2022 at 09:58:27 (+0100), Steven Price wrote:
> >> > On 29/03/2022 18:01, Quentin Perret wrote:
> >> > > Is implicit sharing a thing? E.g., if a guest makes a memory access in
> >> > > the shared gpa range at an address that doesn't have a backing memslot,
> >> > > will KVM check whether there is a corresponding private memslot at the
> >> > > right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or
> >> > > would that just generate an MMIO exit as usual?
> >> > 
> >> > My understanding is that the guest needs some way of tagging whether a
> >> > page is expected to be shared or private. On the architectures I'm aware
> >> > of this is done by effectively stealing a bit from the IPA space and
> >> > pretending it's a flag bit.
> >> 
> >> Right, and that is in fact the main point of divergence we have I think.
> >> While I understand this might be necessary for TDX and the likes, this
> >> makes little sense for pKVM. This would effectively embed into the IPA a
> >> purely software-defined non-architectural property/protocol although we
> >> don't actually need to: we (pKVM) can reasonably expect the guest to
> >> explicitly issue hypercalls to share pages in-place. So I'd be really
> >> keen to avoid baking in assumptions about that model too deep in the
> >> host mm bits if at all possible.
> >
> > There is no assumption about stealing PA bits baked into this API.  Even 
> > within
> > x86 KVM, I consider it a hard requirement that the common flows not assume 
> > the
> > private vs. shared information is communicated through the PA.
> 
> Quentin, I think we might need a clarification.  The API in this patchset 
> indeed has no requirement that a PA bit distinguish between private and 
> shared, but I think it makes at least a weak assumption that *something*, a 
> priori, distinguishes them.  In particular, there are private memslots and 
> shared memslots, so the logical flow of resolving a guest memory access looks 
> like:
> 
> 1. guest accesses a GVA
> 
> 2. read guest paging structures
> 
> 3. determine whether this is a shared or private access
> 
> 4. read host (KVM memslots and anything else, EPT, NPT, RMP, etc) structures 
> accordingly.  In particular, the memslot to reference is different depending 
> on the access type.
> 
> For TDX, this maps on to the fd-based model perfectly: the host-side paging 
> structures for the shared and private slots are completely separate.  For 
> SEV, the structures are shared and KVM will need to figure out what to do in 
> case a private and shared memslot overlap.  Presumably it's sufficient to 
> declare that one of them wins, although actually determining which one is 
> active for a given GPA may involve checking whether the backing store for a 
> given page actually exists.
> 
> But I don't understand pKVM well enough to understand how it fits in.  
> Quentin, how is the shared vs private mode of a memory access determined?  
> How do the paging structures work?  Can a guest switch between shared and 
> private by issuing a hypercall without changing any guest-side paging 
> structures or anything else?

My apologies, I've indeed shared very little details about how pKVM
works. We'll be posting patches upstream really soon that will hopefully
help with this, but in the meantime, here is the idea.

pKVM is designed around MMU-based protection as opposed to encryption as
is the case for many confidential computing solutions. It's probably
worth mentioning that, although it targets arm64, pKVM is distinct from
the Arm CC-A stuff and requires no fancy hardware extensions -- it is
applicable all the way back to Arm v8.0 which makes it an interesting
solution for mobile.

Another particularity of the pKVM approach is that the code of the
hypervisor itself lives in the kernel source tree (see
arch/arm64/kvm/hyp/nvhe/). The hypervisor is built with the rest of the
kernel but as a self-sufficient object, and ends up in its own dedicated
ELF section (.hyp.*) in the kernel image. The main requirement for pKVM
(and KVM on arm64 in general) is to have the bootloader enter the kernel
at the hypervisor exception level (a.k.a EL2). The boot procedure is a
bit involved, but eventually the hypervisor object is installed at EL2,
and the kernel is deprivileged to EL1 and proceeds to boot. From that
point on the hypervisor no longer trusts the kernel and will enable the
stage-2 MMU to impose access-control restrictions to all memory accesses
from the host.

All that to say: the pKVM approach offers a great deal of flexibility
when it comes to hypervisor behaviour. We have control over the
hypervisor code and can change it as we see fit. Since both the
hypervisor and the host kernel are part of the same image, the ABI
between them is very much *not* stable 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-31 Thread Andy Lutomirski
On Wed, Mar 30, 2022, at 10:58 AM, Sean Christopherson wrote:
> On Wed, Mar 30, 2022, Quentin Perret wrote:
>> On Wednesday 30 Mar 2022 at 09:58:27 (+0100), Steven Price wrote:
>> > On 29/03/2022 18:01, Quentin Perret wrote:
>> > > Is implicit sharing a thing? E.g., if a guest makes a memory access in
>> > > the shared gpa range at an address that doesn't have a backing memslot,
>> > > will KVM check whether there is a corresponding private memslot at the
>> > > right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or
>> > > would that just generate an MMIO exit as usual?
>> > 
>> > My understanding is that the guest needs some way of tagging whether a
>> > page is expected to be shared or private. On the architectures I'm aware
>> > of this is done by effectively stealing a bit from the IPA space and
>> > pretending it's a flag bit.
>> 
>> Right, and that is in fact the main point of divergence we have I think.
>> While I understand this might be necessary for TDX and the likes, this
>> makes little sense for pKVM. This would effectively embed into the IPA a
>> purely software-defined non-architectural property/protocol although we
>> don't actually need to: we (pKVM) can reasonably expect the guest to
>> explicitly issue hypercalls to share pages in-place. So I'd be really
>> keen to avoid baking in assumptions about that model too deep in the
>> host mm bits if at all possible.
>
> There is no assumption about stealing PA bits baked into this API.  Even 
> within
> x86 KVM, I consider it a hard requirement that the common flows not assume the
> private vs. shared information is communicated through the PA.

Quentin, I think we might need a clarification.  The API in this patchset 
indeed has no requirement that a PA bit distinguish between private and shared, 
but I think it makes at least a weak assumption that *something*, a priori, 
distinguishes them.  In particular, there are private memslots and shared 
memslots, so the logical flow of resolving a guest memory access looks like:

1. guest accesses a GVA

2. read guest paging structures

3. determine whether this is a shared or private access

4. read host (KVM memslots and anything else, EPT, NPT, RMP, etc) structures 
accordingly.  In particular, the memslot to reference is different depending on 
the access type.

For TDX, this maps on to the fd-based model perfectly: the host-side paging 
structures for the shared and private slots are completely separate.  For SEV, 
the structures are shared and KVM will need to figure out what to do in case a 
private and shared memslot overlap.  Presumably it's sufficient to declare that 
one of them wins, although actually determining which one is active for a given 
GPA may involve checking whether the backing store for a given page actually 
exists.

But I don't understand pKVM well enough to understand how it fits in.  Quentin, 
how is the shared vs private mode of a memory access determined?  How do the 
paging structures work?  Can a guest switch between shared and private by 
issuing a hypercall without changing any guest-side paging structures or 
anything else?

It's plausible that SEV and (maybe) pKVM would be better served if memslots 
could be sparse or if there was otherwise a direct way for host userspace to 
indicate to KVM which address ranges are actually active (not hole-punched) in 
a given memslot or to otherwise be able to make a rule that two different 
memslots (one shared and one private) can't claim the same address.



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-30 Thread Sean Christopherson
On Wed, Mar 30, 2022, Quentin Perret wrote:
> On Wednesday 30 Mar 2022 at 09:58:27 (+0100), Steven Price wrote:
> > On 29/03/2022 18:01, Quentin Perret wrote:
> > > Is implicit sharing a thing? E.g., if a guest makes a memory access in
> > > the shared gpa range at an address that doesn't have a backing memslot,
> > > will KVM check whether there is a corresponding private memslot at the
> > > right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or
> > > would that just generate an MMIO exit as usual?
> > 
> > My understanding is that the guest needs some way of tagging whether a
> > page is expected to be shared or private. On the architectures I'm aware
> > of this is done by effectively stealing a bit from the IPA space and
> > pretending it's a flag bit.
> 
> Right, and that is in fact the main point of divergence we have I think.
> While I understand this might be necessary for TDX and the likes, this
> makes little sense for pKVM. This would effectively embed into the IPA a
> purely software-defined non-architectural property/protocol although we
> don't actually need to: we (pKVM) can reasonably expect the guest to
> explicitly issue hypercalls to share pages in-place. So I'd be really
> keen to avoid baking in assumptions about that model too deep in the
> host mm bits if at all possible.

There is no assumption about stealing PA bits baked into this API.  Even within
x86 KVM, I consider it a hard requirement that the common flows not assume the
private vs. shared information is communicated through the PA.

> > > I'm overall inclined to think that while this abstraction works nicely
> > > for TDX and the likes, it might not suit pKVM all that well in the
> > > current form, but it's close.
> > > 
> > > What do you think of extending the model proposed here to also address
> > > the needs of implementations that support in-place sharing? One option
> > > would be to have KVM notify the private-fd backing store when a page is
> > > shared back by a guest, which would then allow host userspace to mmap
> > > that particular page in the private fd instead of punching a hole.
> > > 
> > > This should retain the main property you're after: private pages that
> > > are actually mapped in the guest SPTE aren't mmap-able, but all the
> > > others are fair game.
> > > 
> > > Thoughts?
> > How do you propose this works if the page shared by the guest then needs
> > to be made private again? If there's no hole punched then it's not
> > possible to just repopulate the private-fd. I'm struggling to see how
> > that could work.
> 
> Yes, some discussion might be required, but I was thinking about
> something along those lines:
> 
>  - a guest requests a shared->private page conversion;
> 
>  - the conversion request is routed all the way back to the VMM;
> 
>  - the VMM is expected to either decline the conversion (which may be
>fatal for the guest if it can't handle this), or to tear-down its
>mappings (via munmap()) of the shared page, and accept the
>conversion;
> 
>  - upon return from the VMM, KVM will be expected to check how many
>references to the shared page are still held (probably by asking the
>fd backing store) to check that userspace has indeed torn down its
>mappings. If all is fine, KVM will instruct the hypervisor to
>repopulate the private range of the guest, otherwise it'll return an
>error to the VMM;
> 
>  - if the conversion has been successful, the guest can resume its
>execution normally.
> 
> Note: this should still allow to use the hole-punching method just fine
> on systems that require it. The invariant here is just that KVM (with
> help from the backing store) is now responsible for refusing to
> instruct the hypervisor (or TDX module, or RMM, or whatever) to map a
> private page if there are existing mappings to it.
> 
> > Having said that; if we can work out a way to safely
> > mmap() pages from the private-fd there's definitely some benefits to be
> > had - e.g. it could be used to populate the initial memory before the
> > guest is started.
> 
> Right, so assuming the approach proposed above isn't entirely bogus,
> this might now become possible by having the VMM mmap the private-fd,
> load the payload, and then unmap it all, and only then instruct the
> hypervisor to use this as private memory.

Hard "no" on mapping the private-fd.  Having the invariant tha the private-fd
can never be mapped greatly simplifies the responsibilities of the backing 
store,
as well as the interface between the private-fd and the in-kernel consumers of 
the
memory (KVM in this case).

What is the use case for shared->private conversion?  x86, both TDX and SNP,
effectively do have a flavor of shared->private conversion; SNP can definitely
be in-place, and I think TDX too.  But the only use case in x86 is to populate
the initial guest image, and due to other performance bottlenecks, it's strongly
recommended to keep the initial image as small as 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-30 Thread Sean Christopherson
On Wed, Mar 30, 2022, Steven Price wrote:
> On 29/03/2022 18:01, Quentin Perret wrote:
> > Is implicit sharing a thing? E.g., if a guest makes a memory access in
> > the shared gpa range at an address that doesn't have a backing memslot,
> > will KVM check whether there is a corresponding private memslot at the
> > right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or
> > would that just generate an MMIO exit as usual?
> 
> My understanding is that the guest needs some way of tagging whether a
> page is expected to be shared or private. On the architectures I'm aware
> of this is done by effectively stealing a bit from the IPA space and
> pretending it's a flag bit.
> 
> So when a guest access causes a fault, the flag bit (really part of the
> intermediate physical address) is compared against whether the page is
> present in the private fd. If they correspond (i.e. a private access and
> the private fd has a page, or a shared access and there's a hole in the
> private fd) then the appropriate page is mapped and the guest continues.
> If there's a mismatch then a KVM_EXIT_MEMORY_ERROR exit is trigged and
> the VMM is expected to fix up the situation (either convert the page or
> kill the guest if this was unexpected).

x86 architectures do steal a bit, but it's not strictly required.  The guest can
communicate its desired private vs. shared state via hypercall.  I refer to the
hypercall method as explicit conversion, and reacting to a page fault due to
accessing the "wrong" PA variant as implicit conversion.

I have dreams of supporting a software-only implementation on x86, a la pKVM, if
only for testing and debug purposes.  In that case, only explicit conversion is
supported.

I'd actually prefer TDX and SNP only allow explicit conversion, i.e. let the 
host
treat accesses to the "wrong" PA as illegal, but sadly the guest/host ABIs for
both TDX and SNP require the host to support implicit conversions.

>  The key point is that KVM never decides to convert between shared and 
>  private, it's
>  always a userspace decision.  Like normal memslots, where userspace has 
>  full control
>  over what gfns are a valid, this gives userspace full control over 
>  whether a gfn is
>  shared or private at any given time.
> >>>
> >>> I'm understanding this as 'the VMM is allowed to punch holes in the
> >>> private fd whenever it wants'. Is this correct?
> >>
> >> From the kernel's perspective, yes, the VMM can punch holes at any time.  
> >> From a
> >> "do I want to DoS my guest" perspective, the VMM must honor its contract 
> >> with the
> >> guest and not spuriously unmap private memory.
> >>
> >>> What happens if it does so for a page that a guest hasn't shared back?
> >>
> >> When the hole is punched, KVM will unmap the corresponding private SPTEs.  
> >> If the
> >> guest is still accessing the page as private, the next access will fault 
> >> and KVM
> >> will exit to userspace with KVM_EXIT_MEMORY_ERROR.  Of course the guest is 
> >> probably
> >> hosed if the hole punch was truly spurious, as at least hardware-based 
> >> protected VMs
> >> effectively destroy data when a private page is unmapped from the guest 
> >> private SPTEs.
> >>
> >> E.g. Linux guests for TDX and SNP will panic/terminate in such a scenario 
> >> as they
> >> will get a fault (injected by trusted hardware/firmware) saying that the 
> >> guest is
> >> trying to access an unaccepted/unvalidated page (TDX and SNP require the 
> >> guest to
> >> explicit accept all private pages that aren't part of the guest's initial 
> >> pre-boot
> >> image).
> > 
> > I suppose this is necessary is to prevent the VMM from re-fallocating
> > in a hole it previously punched and re-entering the guest without
> > notifying it?
> 
> I don't know specifically about TDX/SNP, but one thing we want to
> prevent with CCA is the VMM deallocating/reallocating a private page
> without the guest being aware (i.e. corrupting the guest's state).So
> punching a hole will taint the address such that a future access by the
> guest is fatal (unless the guest first jumps through the right hoops to
> acknowledge that it was expecting such a thing).

Yep, both TDX and SNP will trigger a fault in the guest if the host removes and
reinserts a private page.  The current plan for Linux guests is to track whether
or not a given page has been accepted as private, and panic/die if a fault due
to unaccepted private page occurs.



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-30 Thread Quentin Perret
On Wednesday 30 Mar 2022 at 09:58:27 (+0100), Steven Price wrote:
> On 29/03/2022 18:01, Quentin Perret wrote:
> > On Monday 28 Mar 2022 at 18:58:35 (+), Sean Christopherson wrote:
> >> On Mon, Mar 28, 2022, Quentin Perret wrote:
> >>> Hi Sean,
> >>>
> >>> Thanks for the reply, this helps a lot.
> >>>
> >>> On Monday 28 Mar 2022 at 17:13:10 (+), Sean Christopherson wrote:
>  On Thu, Mar 24, 2022, Quentin Perret wrote:
> > For Protected KVM (and I suspect most other confidential computing
> > solutions), guests have the ability to share some of their pages back
> > with the host kernel using a dedicated hypercall. This is necessary
> > for e.g. virtio communications, so these shared pages need to be mapped
> > back into the VMM's address space. I'm a bit confused about how that
> > would work with the approach proposed here. What is going to be the
> > approach for TDX?
> >
> > It feels like the most 'natural' thing would be to have a KVM exit
> > reason describing which pages have been shared back by the guest, and to
> > then allow the VMM to mmap those specific pages in response in the
> > memfd. Is this something that has been discussed or considered?
> 
>  The proposed solution is to exit to userspace with a new exit reason, 
>  KVM_EXIT_MEMORY_ERROR,
>  when the guest makes the hypercall to request conversion[1].  The 
>  private fd itself
>  will never allow mapping memory into userspace, instead userspace will 
>  need to punch
>  a hole in the private fd backing store.  The absense of a valid mapping 
>  in the private
>  fd is how KVM detects that a pfn is "shared" (memslots without a private 
>  fd are always
>  shared)[2].
> >>>
> >>> Right. I'm still a bit confused about how the VMM is going to get the
> >>> shared page mapped in its page-table. Once it has punched a hole into
> >>> the private fd, how is it supposed to access the actual physical page
> >>> that the guest shared?
> >>
> >> The guest doesn't share a _host_ physical page, the guest shares a _guest_ 
> >> physical
> >> page.  Until host userspace converts the gfn to shared and thus maps the 
> >> gfn=>hva
> >> via mmap(), the guest is blocked and can't read/write/exec the memory.  
> >> AFAIK, no
> >> architecture allows in-place decryption of guest private memory.  s390 
> >> allows a
> >> page to be "made accessible" to the host for the purposes of swap, and 
> >> other
> >> architectures will have similar behavior for migrating a protected VM, but 
> >> those
> >> scenarios are not sharing the page (and they also make the page 
> >> inaccessible to
> >> the guest).
> > 
> > I see. FWIW, since pKVM is entirely MMU-based, we are in fact capable of
> > doing in-place sharing, which also means it can retain the content of
> > the page as part of the conversion.
> > 
> > Also, I'll ask the Arm CCA developers to correct me if this is wrong, but
> > I _believe_ it should be technically possible to do in-place sharing for
> > them too.
> 
> In general this isn't possible as the physical memory could be
> encrypted, so some temporary memory is required. We have prototyped
> having a single temporary page for the setup when populating the guest's
> initial memory - this has the nice property of not requiring any
> additional allocation during the process but with the downside of
> effectively two memcpy()s per page (one to the temporary page and
> another, with optional encryption, into the now private page).

Interesting, thanks for the explanation.

> >>> Is there an assumption somewhere that the VMM should have this page 
> >>> mapped in
> >>> via an alias that it can legally access only once it has punched a hole at
> >>> the corresponding offset in the private fd or something along those lines?
> >>
> >> Yes, the VMM must have a completely separate VMA.  The VMM doesn't haven't 
> >> to
> >> wait until the conversion to mmap() the shared variant, though obviously 
> >> it will
> >> potentially consume double the memory if the VMM actually populates both 
> >> the
> >> private and shared backing stores.
> > 
> > Gotcha. This is what confused me I think -- in this approach private and
> > shared pages are in fact entirely different.
> > 
> > In which scenario could you end up with both the private and shared
> > pages live at the same time? Would this be something like follows?
> > 
> >  - userspace creates a private fd, fallocates into it, and associates
> >the  tuple with a private memslot;
> > 
> >  - userspace then mmaps anonymous memory (for ex.), and associates it
> >with a standard memslot, which happens to be positioned at exactly
> >the right offset w.r.t to the private memslot (with this offset
> >defined by the bit that is set for the private addresses in the gpa
> >space);
> > 
> >  - the guest runs, and accesses both 'aliases' of the page without doing
> >an explicit share hypercall.
> > 
> 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-30 Thread Steven Price
On 29/03/2022 18:01, Quentin Perret wrote:
> On Monday 28 Mar 2022 at 18:58:35 (+), Sean Christopherson wrote:
>> On Mon, Mar 28, 2022, Quentin Perret wrote:
>>> Hi Sean,
>>>
>>> Thanks for the reply, this helps a lot.
>>>
>>> On Monday 28 Mar 2022 at 17:13:10 (+), Sean Christopherson wrote:
 On Thu, Mar 24, 2022, Quentin Perret wrote:
> For Protected KVM (and I suspect most other confidential computing
> solutions), guests have the ability to share some of their pages back
> with the host kernel using a dedicated hypercall. This is necessary
> for e.g. virtio communications, so these shared pages need to be mapped
> back into the VMM's address space. I'm a bit confused about how that
> would work with the approach proposed here. What is going to be the
> approach for TDX?
>
> It feels like the most 'natural' thing would be to have a KVM exit
> reason describing which pages have been shared back by the guest, and to
> then allow the VMM to mmap those specific pages in response in the
> memfd. Is this something that has been discussed or considered?

 The proposed solution is to exit to userspace with a new exit reason, 
 KVM_EXIT_MEMORY_ERROR,
 when the guest makes the hypercall to request conversion[1].  The private 
 fd itself
 will never allow mapping memory into userspace, instead userspace will 
 need to punch
 a hole in the private fd backing store.  The absense of a valid mapping in 
 the private
 fd is how KVM detects that a pfn is "shared" (memslots without a private 
 fd are always
 shared)[2].
>>>
>>> Right. I'm still a bit confused about how the VMM is going to get the
>>> shared page mapped in its page-table. Once it has punched a hole into
>>> the private fd, how is it supposed to access the actual physical page
>>> that the guest shared?
>>
>> The guest doesn't share a _host_ physical page, the guest shares a _guest_ 
>> physical
>> page.  Until host userspace converts the gfn to shared and thus maps the 
>> gfn=>hva
>> via mmap(), the guest is blocked and can't read/write/exec the memory.  
>> AFAIK, no
>> architecture allows in-place decryption of guest private memory.  s390 
>> allows a
>> page to be "made accessible" to the host for the purposes of swap, and other
>> architectures will have similar behavior for migrating a protected VM, but 
>> those
>> scenarios are not sharing the page (and they also make the page inaccessible 
>> to
>> the guest).
> 
> I see. FWIW, since pKVM is entirely MMU-based, we are in fact capable of
> doing in-place sharing, which also means it can retain the content of
> the page as part of the conversion.
> 
> Also, I'll ask the Arm CCA developers to correct me if this is wrong, but
> I _believe_ it should be technically possible to do in-place sharing for
> them too.

In general this isn't possible as the physical memory could be
encrypted, so some temporary memory is required. We have prototyped
having a single temporary page for the setup when populating the guest's
initial memory - this has the nice property of not requiring any
additional allocation during the process but with the downside of
effectively two memcpy()s per page (one to the temporary page and
another, with optional encryption, into the now private page).

>>> Is there an assumption somewhere that the VMM should have this page mapped 
>>> in
>>> via an alias that it can legally access only once it has punched a hole at
>>> the corresponding offset in the private fd or something along those lines?
>>
>> Yes, the VMM must have a completely separate VMA.  The VMM doesn't haven't to
>> wait until the conversion to mmap() the shared variant, though obviously it 
>> will
>> potentially consume double the memory if the VMM actually populates both the
>> private and shared backing stores.
> 
> Gotcha. This is what confused me I think -- in this approach private and
> shared pages are in fact entirely different.
> 
> In which scenario could you end up with both the private and shared
> pages live at the same time? Would this be something like follows?
> 
>  - userspace creates a private fd, fallocates into it, and associates
>the  tuple with a private memslot;
> 
>  - userspace then mmaps anonymous memory (for ex.), and associates it
>with a standard memslot, which happens to be positioned at exactly
>the right offset w.r.t to the private memslot (with this offset
>defined by the bit that is set for the private addresses in the gpa
>space);
> 
>  - the guest runs, and accesses both 'aliases' of the page without doing
>an explicit share hypercall.
> 
> Is there another option?

AIUI you can have both private and shared "live" at the same time. But
you can have a page allocated both in the private fd and in the same
location in the (shared) memslot in the VMM's memory map. In this
situation the private fd page effectively hides the shared page.

> Is implicit sharing a 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-29 Thread Quentin Perret
On Monday 28 Mar 2022 at 18:58:35 (+), Sean Christopherson wrote:
> On Mon, Mar 28, 2022, Quentin Perret wrote:
> > Hi Sean,
> > 
> > Thanks for the reply, this helps a lot.
> > 
> > On Monday 28 Mar 2022 at 17:13:10 (+), Sean Christopherson wrote:
> > > On Thu, Mar 24, 2022, Quentin Perret wrote:
> > > > For Protected KVM (and I suspect most other confidential computing
> > > > solutions), guests have the ability to share some of their pages back
> > > > with the host kernel using a dedicated hypercall. This is necessary
> > > > for e.g. virtio communications, so these shared pages need to be mapped
> > > > back into the VMM's address space. I'm a bit confused about how that
> > > > would work with the approach proposed here. What is going to be the
> > > > approach for TDX?
> > > > 
> > > > It feels like the most 'natural' thing would be to have a KVM exit
> > > > reason describing which pages have been shared back by the guest, and to
> > > > then allow the VMM to mmap those specific pages in response in the
> > > > memfd. Is this something that has been discussed or considered?
> > > 
> > > The proposed solution is to exit to userspace with a new exit reason, 
> > > KVM_EXIT_MEMORY_ERROR,
> > > when the guest makes the hypercall to request conversion[1].  The private 
> > > fd itself
> > > will never allow mapping memory into userspace, instead userspace will 
> > > need to punch
> > > a hole in the private fd backing store.  The absense of a valid mapping 
> > > in the private
> > > fd is how KVM detects that a pfn is "shared" (memslots without a private 
> > > fd are always
> > > shared)[2].
> > 
> > Right. I'm still a bit confused about how the VMM is going to get the
> > shared page mapped in its page-table. Once it has punched a hole into
> > the private fd, how is it supposed to access the actual physical page
> > that the guest shared?
> 
> The guest doesn't share a _host_ physical page, the guest shares a _guest_ 
> physical
> page.  Until host userspace converts the gfn to shared and thus maps the 
> gfn=>hva
> via mmap(), the guest is blocked and can't read/write/exec the memory.  
> AFAIK, no
> architecture allows in-place decryption of guest private memory.  s390 allows 
> a
> page to be "made accessible" to the host for the purposes of swap, and other
> architectures will have similar behavior for migrating a protected VM, but 
> those
> scenarios are not sharing the page (and they also make the page inaccessible 
> to
> the guest).

I see. FWIW, since pKVM is entirely MMU-based, we are in fact capable of
doing in-place sharing, which also means it can retain the content of
the page as part of the conversion.

Also, I'll ask the Arm CCA developers to correct me if this is wrong, but
I _believe_ it should be technically possible to do in-place sharing for
them too.

> > Is there an assumption somewhere that the VMM should have this page mapped 
> > in
> > via an alias that it can legally access only once it has punched a hole at
> > the corresponding offset in the private fd or something along those lines?
> 
> Yes, the VMM must have a completely separate VMA.  The VMM doesn't haven't to
> wait until the conversion to mmap() the shared variant, though obviously it 
> will
> potentially consume double the memory if the VMM actually populates both the
> private and shared backing stores.

Gotcha. This is what confused me I think -- in this approach private and
shared pages are in fact entirely different.

In which scenario could you end up with both the private and shared
pages live at the same time? Would this be something like follows?

 - userspace creates a private fd, fallocates into it, and associates
   the  tuple with a private memslot;

 - userspace then mmaps anonymous memory (for ex.), and associates it
   with a standard memslot, which happens to be positioned at exactly
   the right offset w.r.t to the private memslot (with this offset
   defined by the bit that is set for the private addresses in the gpa
   space);

 - the guest runs, and accesses both 'aliases' of the page without doing
   an explicit share hypercall.

Is there another option?

Is implicit sharing a thing? E.g., if a guest makes a memory access in
the shared gpa range at an address that doesn't have a backing memslot,
will KVM check whether there is a corresponding private memslot at the
right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or
would that just generate an MMIO exit as usual?

> > > The key point is that KVM never decides to convert between shared and 
> > > private, it's
> > > always a userspace decision.  Like normal memslots, where userspace has 
> > > full control
> > > over what gfns are a valid, this gives userspace full control over 
> > > whether a gfn is
> > > shared or private at any given time.
> > 
> > I'm understanding this as 'the VMM is allowed to punch holes in the
> > private fd whenever it wants'. Is this correct?
> 
> From the kernel's perspective, yes, 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-28 Thread Sean Christopherson
On Mon, Mar 28, 2022, Nakajima, Jun wrote:
> > On Mar 28, 2022, at 1:16 PM, Andy Lutomirski  wrote:
> > 
> > On Thu, Mar 10, 2022 at 6:09 AM Chao Peng  
> > wrote:
> >> 
> >> This is the v5 of this series which tries to implement the fd-based KVM
> >> guest private memory. The patches are based on latest kvm/queue branch
> >> commit:
> >> 
> >>  d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
> > 
> > Can this series be run and a VM booted without TDX?  A feature like
> > that might help push it forward.
> > 
> > —Andy
> 
> Since the userspace VMM (e.g. QEMU) loses direct access to private memory of
> the VM, the guest needs to avoid using the private memory for (virtual) DMA
> buffers, for example. Otherwise, it would need to use bounce buffers, i.e. we
> would need changes to the VM. I think we can try that (i.e. add only bounce
> buffer changes). What do you think?

I would love to be able to test this series and run full-blown VMs without TDX 
or
SEV hardware.

The other option for getting test coverage is KVM selftests, which don't have an
existing guest that needs to be enlightened.  Vishal is doing work on that 
front,
though I think it's still in early stages.  Long term, selftests will also be 
great
for negative testing.



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-28 Thread Nakajima, Jun
> On Mar 28, 2022, at 1:16 PM, Andy Lutomirski  wrote:
> 
> On Thu, Mar 10, 2022 at 6:09 AM Chao Peng  wrote:
>> 
>> This is the v5 of this series which tries to implement the fd-based KVM
>> guest private memory. The patches are based on latest kvm/queue branch
>> commit:
>> 
>>  d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
> 
> Can this series be run and a VM booted without TDX?  A feature like
> that might help push it forward.
> 
> —Andy

Since the userspace VMM (e.g. QEMU) loses direct access to private memory of 
the VM, the guest needs to avoid using the private memory for (virtual) DMA 
buffers, for example. Otherwise, it would need to use bounce buffers, i.e. we 
would need changes to the VM. I think we can try that (i.e. add only bounce 
buffer changes). What do you think?

Thanks,
--- 
Jun




Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-28 Thread Andy Lutomirski
On Thu, Mar 10, 2022 at 6:09 AM Chao Peng  wrote:
>
> This is the v5 of this series which tries to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
>
>   d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2

Can this series be run and a VM booted without TDX?  A feature like
that might help push it forward.

--Andy



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-28 Thread Sean Christopherson
On Mon, Mar 28, 2022, Quentin Perret wrote:
> Hi Sean,
> 
> Thanks for the reply, this helps a lot.
> 
> On Monday 28 Mar 2022 at 17:13:10 (+), Sean Christopherson wrote:
> > On Thu, Mar 24, 2022, Quentin Perret wrote:
> > > For Protected KVM (and I suspect most other confidential computing
> > > solutions), guests have the ability to share some of their pages back
> > > with the host kernel using a dedicated hypercall. This is necessary
> > > for e.g. virtio communications, so these shared pages need to be mapped
> > > back into the VMM's address space. I'm a bit confused about how that
> > > would work with the approach proposed here. What is going to be the
> > > approach for TDX?
> > > 
> > > It feels like the most 'natural' thing would be to have a KVM exit
> > > reason describing which pages have been shared back by the guest, and to
> > > then allow the VMM to mmap those specific pages in response in the
> > > memfd. Is this something that has been discussed or considered?
> > 
> > The proposed solution is to exit to userspace with a new exit reason, 
> > KVM_EXIT_MEMORY_ERROR,
> > when the guest makes the hypercall to request conversion[1].  The private 
> > fd itself
> > will never allow mapping memory into userspace, instead userspace will need 
> > to punch
> > a hole in the private fd backing store.  The absense of a valid mapping in 
> > the private
> > fd is how KVM detects that a pfn is "shared" (memslots without a private fd 
> > are always
> > shared)[2].
> 
> Right. I'm still a bit confused about how the VMM is going to get the
> shared page mapped in its page-table. Once it has punched a hole into
> the private fd, how is it supposed to access the actual physical page
> that the guest shared?

The guest doesn't share a _host_ physical page, the guest shares a _guest_ 
physical
page.  Until host userspace converts the gfn to shared and thus maps the 
gfn=>hva
via mmap(), the guest is blocked and can't read/write/exec the memory.  AFAIK, 
no
architecture allows in-place decryption of guest private memory.  s390 allows a
page to be "made accessible" to the host for the purposes of swap, and other
architectures will have similar behavior for migrating a protected VM, but those
scenarios are not sharing the page (and they also make the page inaccessible to
the guest).

> Is there an assumption somewhere that the VMM should have this page mapped in
> via an alias that it can legally access only once it has punched a hole at
> the corresponding offset in the private fd or something along those lines?

Yes, the VMM must have a completely separate VMA.  The VMM doesn't haven't to
wait until the conversion to mmap() the shared variant, though obviously it will
potentially consume double the memory if the VMM actually populates both the
private and shared backing stores.

> > The key point is that KVM never decides to convert between shared and 
> > private, it's
> > always a userspace decision.  Like normal memslots, where userspace has 
> > full control
> > over what gfns are a valid, this gives userspace full control over whether 
> > a gfn is
> > shared or private at any given time.
> 
> I'm understanding this as 'the VMM is allowed to punch holes in the
> private fd whenever it wants'. Is this correct?

>From the kernel's perspective, yes, the VMM can punch holes at any time.  From 
>a
"do I want to DoS my guest" perspective, the VMM must honor its contract with 
the
guest and not spuriously unmap private memory.

> What happens if it does so for a page that a guest hasn't shared back?

When the hole is punched, KVM will unmap the corresponding private SPTEs.  If 
the
guest is still accessing the page as private, the next access will fault and KVM
will exit to userspace with KVM_EXIT_MEMORY_ERROR.  Of course the guest is 
probably
hosed if the hole punch was truly spurious, as at least hardware-based 
protected VMs
effectively destroy data when a private page is unmapped from the guest private 
SPTEs.

E.g. Linux guests for TDX and SNP will panic/terminate in such a scenario as 
they
will get a fault (injected by trusted hardware/firmware) saying that the guest 
is
trying to access an unaccepted/unvalidated page (TDX and SNP require the guest 
to
explicit accept all private pages that aren't part of the guest's initial 
pre-boot
image).

> > Another important detail is that this approach means the kernel and KVM 
> > treat the
> > shared backing store and private backing store as independent, albeit 
> > related,
> > entities.  This is very deliberate as it makes it easier to reason about 
> > what is
> > and isn't allowed/required.  E.g. the kernel only needs to handle freeing 
> > private
> > memory, there is no special handling for conversion to shared because no 
> > such path
> > exists as far as host pfns are concerned.  And userspace doesn't need any 
> > new "rules"
> > for protecting itself against a malicious guest, e.g. userspace already 
> > needs to
> > ensure that it has a 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-28 Thread Quentin Perret
Hi Sean,

Thanks for the reply, this helps a lot.

On Monday 28 Mar 2022 at 17:13:10 (+), Sean Christopherson wrote:
> On Thu, Mar 24, 2022, Quentin Perret wrote:
> > For Protected KVM (and I suspect most other confidential computing
> > solutions), guests have the ability to share some of their pages back
> > with the host kernel using a dedicated hypercall. This is necessary
> > for e.g. virtio communications, so these shared pages need to be mapped
> > back into the VMM's address space. I'm a bit confused about how that
> > would work with the approach proposed here. What is going to be the
> > approach for TDX?
> > 
> > It feels like the most 'natural' thing would be to have a KVM exit
> > reason describing which pages have been shared back by the guest, and to
> > then allow the VMM to mmap those specific pages in response in the
> > memfd. Is this something that has been discussed or considered?
> 
> The proposed solution is to exit to userspace with a new exit reason, 
> KVM_EXIT_MEMORY_ERROR,
> when the guest makes the hypercall to request conversion[1].  The private fd 
> itself
> will never allow mapping memory into userspace, instead userspace will need 
> to punch
> a hole in the private fd backing store.  The absense of a valid mapping in 
> the private
> fd is how KVM detects that a pfn is "shared" (memslots without a private fd 
> are always
> shared)[2].

Right. I'm still a bit confused about how the VMM is going to get the
shared page mapped in its page-table. Once it has punched a hole into
the private fd, how is it supposed to access the actual physical page
that the guest shared? Is there an assumption somewhere that the VMM
should have this page mapped in via an alias that it can legally access
only once it has punched a hole at the corresponding offset in the
private fd or something along those lines?

> The key point is that KVM never decides to convert between shared and 
> private, it's
> always a userspace decision.  Like normal memslots, where userspace has full 
> control
> over what gfns are a valid, this gives userspace full control over whether a 
> gfn is
> shared or private at any given time.

I'm understanding this as 'the VMM is allowed to punch holes in the
private fd whenever it wants'. Is this correct? What happens if it does
so for a page that a guest hasn't shared back?

> Another important detail is that this approach means the kernel and KVM treat 
> the
> shared backing store and private backing store as independent, albeit related,
> entities.  This is very deliberate as it makes it easier to reason about what 
> is
> and isn't allowed/required.  E.g. the kernel only needs to handle freeing 
> private
> memory, there is no special handling for conversion to shared because no such 
> path
> exists as far as host pfns are concerned.  And userspace doesn't need any new 
> "rules"
> for protecting itself against a malicious guest, e.g. userspace already needs 
> to
> ensure that it has a valid mapping prior to accessing guest memory (or be 
> able to
> handle any resulting signals).  A malicious guest can DoS itself by 
> instructing
> userspace to communicate over memory that is currently mapped private, but 
> there
> are no new novel attack vectors from the host's perspective as coercing the 
> host
> into accessing an invalid mapping after shared=>private conversion is just a 
> variant
> of a use-after-free.

Interesting. I was (maybe incorrectly) assuming that it would be
difficult to handle illegal host accesses w/ TDX. IOW, this would
essentially crash the host. Is this remotely correct or did I get that
wrong?

> One potential conversions that's TBD (at least, I think it is, I haven't read 
> through
> this most recent version) is how to support populating guest private memory 
> with
> non-zero data, e.g. to allow in-place conversion of the initial guest 
> firmware instead
> of having to an extra memcpy().

Right. FWIW, in the pKVM case we should be pretty immune to this I
think. The initial firmware is loaded in guest memory by the hypervisor
itself (the EL2 code in arm64 speak) as the first vCPU starts running.
And that firmware can then use e.g. virtio to load the guest payload and
measure/check it. IOW, we currently don't have expectations regarding
the initial state of guest memory, but it might be handy to have support
for pre-loading the payload in the future (should save a copy as you
said).

> [1] KVM will also exit to userspace with the same info on "implicit" 
> conversions,
> i.e. if the guest accesses the "wrong" GPA.  Neither SEV-SNP nor TDX 
> mandate
> explicit conversions in their guest<->host ABIs, so KVM has to support 
> implicit
> conversions :-/
> 
> [2] Ideally (IMO), KVM would require userspace to completely remove the 
> private memslot,
> but that's too slow due to use of SRCU in both KVM and userspace (QEMU at 
> least uses
> SRCU for memslot changes).



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-28 Thread Sean Christopherson
On Thu, Mar 24, 2022, Quentin Perret wrote:
> For Protected KVM (and I suspect most other confidential computing
> solutions), guests have the ability to share some of their pages back
> with the host kernel using a dedicated hypercall. This is necessary
> for e.g. virtio communications, so these shared pages need to be mapped
> back into the VMM's address space. I'm a bit confused about how that
> would work with the approach proposed here. What is going to be the
> approach for TDX?
> 
> It feels like the most 'natural' thing would be to have a KVM exit
> reason describing which pages have been shared back by the guest, and to
> then allow the VMM to mmap those specific pages in response in the
> memfd. Is this something that has been discussed or considered?

The proposed solution is to exit to userspace with a new exit reason, 
KVM_EXIT_MEMORY_ERROR,
when the guest makes the hypercall to request conversion[1].  The private fd 
itself
will never allow mapping memory into userspace, instead userspace will need to 
punch
a hole in the private fd backing store.  The absense of a valid mapping in the 
private
fd is how KVM detects that a pfn is "shared" (memslots without a private fd are 
always
shared)[2].

The key point is that KVM never decides to convert between shared and private, 
it's
always a userspace decision.  Like normal memslots, where userspace has full 
control
over what gfns are a valid, this gives userspace full control over whether a 
gfn is
shared or private at any given time.

Another important detail is that this approach means the kernel and KVM treat 
the
shared backing store and private backing store as independent, albeit related,
entities.  This is very deliberate as it makes it easier to reason about what is
and isn't allowed/required.  E.g. the kernel only needs to handle freeing 
private
memory, there is no special handling for conversion to shared because no such 
path
exists as far as host pfns are concerned.  And userspace doesn't need any new 
"rules"
for protecting itself against a malicious guest, e.g. userspace already needs to
ensure that it has a valid mapping prior to accessing guest memory (or be able 
to
handle any resulting signals).  A malicious guest can DoS itself by instructing
userspace to communicate over memory that is currently mapped private, but there
are no new novel attack vectors from the host's perspective as coercing the host
into accessing an invalid mapping after shared=>private conversion is just a 
variant
of a use-after-free.

One potential conversions that's TBD (at least, I think it is, I haven't read 
through
this most recent version) is how to support populating guest private memory with
non-zero data, e.g. to allow in-place conversion of the initial guest firmware 
instead
of having to an extra memcpy().

[1] KVM will also exit to userspace with the same info on "implicit" 
conversions,
i.e. if the guest accesses the "wrong" GPA.  Neither SEV-SNP nor TDX mandate
explicit conversions in their guest<->host ABIs, so KVM has to support 
implicit
conversions :-/

[2] Ideally (IMO), KVM would require userspace to completely remove the private 
memslot,
but that's too slow due to use of SRCU in both KVM and userspace (QEMU at 
least uses
SRCU for memslot changes).



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-24 Thread Quentin Perret
Hi Chao,

+CC Will and Marc for visibility.

On Thursday 10 Mar 2022 at 22:08:58 (+0800), Chao Peng wrote:
> This is the v5 of this series which tries to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
> 
>   d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
>  
> Introduction
> 
> In general this patch series introduce fd-based memslot which provides
> guest memory through memory file descriptor fd[offset,size] instead of
> hva/size. The fd can be created from a supported memory filesystem
> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> and the the memory backing store exchange callbacks when such memslot
> gets created. At runtime KVM will call into callbacks provided by the
> backing store to get the pfn with the fd+offset. Memory backing store
> will also call into KVM callbacks when userspace fallocate/punch hole
> on the fd to notify KVM to map/unmap secondary MMU page tables.
> 
> Comparing to existing hva-based memslot, this new type of memslot allows
> guest memory unmapped from host userspace like QEMU and even the kernel
> itself, therefore reduce attack surface and prevent bugs.
> 
> Based on this fd-based memslot, we can build guest private memory that
> is going to be used in confidential computing environments such as Intel
> TDX and AMD SEV. When supported, the memory backing store can provide
> more enforcement on the fd and KVM can use a single memslot to hold both
> the private and shared part of the guest memory. 
> 
> mm extension
> -
> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file created
> with these flags cannot read(), write() or mmap() etc via normal
> MMU operations. The file content can only be used with the newly
> introduced memfile_notifier extension.
> 
> The memfile_notifier extension provides two sets of callbacks for KVM to
> interact with the memory backing store:
>   - memfile_notifier_ops: callbacks for memory backing store to notify
> KVM when memory gets allocated/invalidated.
>   - memfile_pfn_ops: callbacks for KVM to call into memory backing store
> to request memory pages for guest private memory.
> 
> The memfile_notifier extension also provides APIs for memory backing
> store to register/unregister itself and to trigger the notifier when the
> bookmarked memory gets fallocated/invalidated.
> 
> memslot extension
> -
> Add the private fd and the fd offset to existing 'shared' memslot so that
> both private/shared guest memory can live in one single memslot. A page in
> the memslot is either private or shared. A page is private only when it's
> already allocated in the backing store fd, all the other cases it's treated
> as shared, this includes those already mapped as shared as well as those
> having not been mapped. This means the memory backing store is the place
> which tells the truth of which page is private.
> 
> Private memory map/unmap and conversion
> ---
> Userspace's map/unmap operations are done by fallocate() ioctl on the
> backing store fd.
>   - map: default fallocate() with mode=0.
>   - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
> The map/unmap will trigger above memfile_notifier_ops to let KVM map/unmap
> secondary MMU page tables.

I recently came across this series which is interesting for the
Protected KVM work that's currently ongoing in the Android world (see
[1], [2] or [3] for more details). The idea is similar in a number of
ways to the Intel TDX stuff (from what I understand, but I'm clearly not
understanding it all so, ...) or the Arm CCA solution, but using stage-2
MMUs instead of encryption; and leverages the caveat of the nVHE
KVM/arm64 implementation to isolate the control of stage-2 MMUs from the
host.

For Protected KVM (and I suspect most other confidential computing
solutions), guests have the ability to share some of their pages back
with the host kernel using a dedicated hypercall. This is necessary
for e.g. virtio communications, so these shared pages need to be mapped
back into the VMM's address space. I'm a bit confused about how that
would work with the approach proposed here. What is going to be the
approach for TDX?

It feels like the most 'natural' thing would be to have a KVM exit
reason describing which pages have been shared back by the guest, and to
then allow the VMM to mmap those specific pages in response in the
memfd. Is this something that has been discussed or considered?

Thanks,
Quentin

[1] https://lwn.net/Articles/836693/
[2] https://www.youtube.com/watch?v=wY-u6n75iXc
[3] https://www.youtube.com/watch?v=54q6RzS9BpQ=10862s