Re: [PATCH v7 03/14] mm: Introduce memfile_notifier

2022-08-11 Thread Quentin Perret
+CC Fuad

On Wednesday 10 Aug 2022 at 14:38:43 (+), Sean Christopherson wrote:
> > I understand Sean's suggestion about abstracting, but if the new name
> > makes it harder to grasp and there isn't really an alternative to memfd
> > in sight, I'm not so sure I enjoy the tried abstraction here.
> 
> ARM's pKVM implementation is potentially (hopefully) going to switch to this 
> API
> (as a consumer) sooner than later.  If they anticipate being able to use 
> memfd,
> then there's unlikely to be a second backing type any time soon.
> 
> Quentin, Will?

Yep, Fuad is currently trying to port the pKVM mm stuff on top of this
series to see how well it fits, so stay tuned. I think there is still
some room for discussion around page conversions (private->shared etc),
and we'll need a clearer idea of what the code might look like to have a
constructive discussion, but so far it does seem like using a memfd (the
new private one or perhaps just memfd_secret, to be discussed) + memfd
notifiers is a promising option.
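
To make the discussion a little more concrete, the consumer side of the
memfile_notifier approach (as described in the cover letter quoted further
down this thread) would look roughly like the sketch below. The struct
layouts and callback names here are illustrative reconstructions from that
description, not the actual code of the series:

/*
 * Illustrative reconstruction of the consumer-facing shape of the
 * memfile_notifier API, based only on the description in the cover letter
 * quoted later in this thread. None of these struct layouts or names are
 * taken from the actual patches; they are stand-ins for discussion.
 */
#include <stddef.h>

typedef unsigned long pgoff_t;
typedef unsigned long long pfn_t;

struct memfile_notifier;

/* Backing store -> consumer (e.g. KVM): memory was allocated/invalidated. */
struct memfile_notifier_ops {
	void (*populate)(struct memfile_notifier *n, pgoff_t start, pgoff_t end);
	void (*invalidate)(struct memfile_notifier *n, pgoff_t start, pgoff_t end);
};

/* Consumer -> backing store: resolve an offset in the fd to a page. */
struct memfile_pfn_ops {
	pfn_t (*get_lock_pfn)(struct memfile_notifier *n, pgoff_t offset);
	void (*put_unlock_pfn)(pfn_t pfn);
};

struct memfile_notifier {
	const struct memfile_notifier_ops *ops;	/* filled in by the consumer */
	const struct memfile_pfn_ops *pfn_ops;	/* filled in by the backing store */
	void *priv;				/* e.g. the kvm_memory_slot */
};

/* A consumer callback would typically zap the secondary MMU mappings that
 * back the invalidated range, e.g. guest stage-2 entries. */
static void example_invalidate(struct memfile_notifier *n,
			       pgoff_t start, pgoff_t end)
{
	(void)n;
	(void)start;
	(void)end;
}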



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-05-03 Thread Quentin Perret
On Thursday 28 Apr 2022 at 20:29:52 (+0800), Chao Peng wrote:
> 
> + Michael in case he has comment from SEV side.
> 
> On Mon, Apr 25, 2022 at 07:52:38AM -0700, Andy Lutomirski wrote:
> > 
> > 
> > On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
> > > On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
> > >> 
> > 
> > >> 
> > >> 2. Bind the memfile to a VM (or at least to a VM technology).  Now it's 
> > >> in the initial state appropriate for that VM.
> > >> 
> > >> For TDX, this completely bypasses the cases where the data is 
> > >> prepopulated and TDX can't handle it cleanly.  For SEV, it bypasses a 
> > >> situation in which data might be written to the memory before we find 
> > >> out whether that data will be unreclaimable or unmovable.
> > >
> > > This sounds like a stricter rule to avoid unclear semantics.
> > >
> > > So userspace needs to know what exactly happens for a 'bind' operation.
> > > This differs when binding to different technologies. E.g. for SEV, it
> > > may imply that after this call the memfile can be accessed (through mmap or
> > > whatever) from userspace, while for current TDX this should not be
> > > allowed.
> > 
> > I think this is actually a good thing.  While SEV, TDX, pKVM, etc achieve 
> > similar goals and have broadly similar ways of achieving them, they really 
> > are different, and having userspace be aware of the differences seems okay 
> > to me.
> > 
> > (Although I don't think that allowing userspace to mmap SEV shared pages is 
> > particularly wise -- it will result in faults or cache incoherence 
> > depending on the variant of SEV in use.)
> > 
> > >
> > > And I feel we still need a third flow/operation to indicate the
> > > completion of the initialization on the memfile before the guest's 
> > > first-time launch. SEV needs to check that previously mmap-ed areas are
> > > munmap-ed and prevent future userspace access. After this point, the
> > > memfile becomes a truly private fd.
> > 
> > Even that is technology-dependent.  For TDX, this operation doesn't really 
> > exist.  For SEV, I'm not sure (I haven't read the specs in nearly enough 
> > detail).  For pKVM, I guess it does exist and isn't quite the same as a 
> > shared->private conversion.
> > 
> > Maybe this could be generalized a bit as an operation "measure and make 
> > private" that would be supported by the technologies for which it's useful.
> 
> Then I think we need a callback instead of a static flag field. The backing
> store implements this callback and consumers change the flags
> dynamically through it. This implements a kind of state-machine
> flow.
> 
> > 
> > 
> > >
> > >> 
> > >> 
> > >> --
> > >> 
> > >> Now I have a question, since I don't think anyone has really answered 
> > >> it: how does this all work with SEV- or pKVM-like technologies in which 
> > >> private and shared pages share the same address space?  It sounds like 
> > >> you're proposing to have a big memfile that contains private and shared 
> > >> pages and to use that same memfile as pages are converted back and 
> > >> forth.  IO and even real physical DMA could be done on that memfile.  Am 
> > >> I understanding correctly?
> > >
> > > For the TDX case, and probably SEV as well, this memfile contains private
> > > memory only. But this design at least makes it possible for use cases like
> > > pKVM, which wants both private and shared memory in the same memfile and
> > > relies on other means like mmap/munmap or mprotect to toggle
> > > private/shared instead of fallocate/hole punching.
> > 
> > Hmm.  Then we still need some way to get KVM to generate the correct SEV 
> > pagetables.  For TDX, there are private memslots and shared memslots, and 
> > they can overlap.  If they overlap and both contain valid pages at the same 
> > address, then the results may not be what the guest-side ABI expects, but 
> > everything will work.  So, when a single logical guest page transitions 
> > between shared and private, no change to the memslots is needed.  For SEV, 
> > this is not the case: everything is in one set of pagetables, and there 
> > isn't a natural way to resolve overlaps.
> 
> I don't see a problem for SEV. Note that in all cases, both private and shared
> memory are in the same memslot. For a given GPA, if there is no private
> page, then the shared page will be used to establish KVM page tables, so this
> guarantees there are no overlaps.
> 
> > 
> > If the memslot code becomes efficient enough, then the memslots could be 
> > fragmented.  Or the memfile could support private and shared data in the 
> > same memslot.  And if pKVM does this, I don't see why SEV couldn't also do 
> > it and hopefully reuse the same code.
> 
> For pKVM, that might be the case. For SEV, I don't think we require
> private/shared data in the same memfile. The same model that works for
> TDX should also work for SEV. Or maybe I misunderstood something here?
> 
> > 
> > >
> > >

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-06 Thread Quentin Perret
On Tuesday 05 Apr 2022 at 10:51:36 (-0700), Andy Lutomirski wrote:
> Let's try actually counting syscalls and mode transitions, at least 
> approximately.  For non-direct IO (DMA allocation on guest side, not straight 
> to/from pagecache or similar):
> 
> Guest writes to shared DMA buffer.  Assume the guest is smart and reuses the 
> buffer.
> Guest writes descriptor to shared virtio ring.
> Guest rings virtio doorbell, which causes an exit.
> *** guest -> hypervisor -> host ***
> host reads virtio ring (mmaped shared memory)
> host does pread() to read the DMA buffer or reads mmapped buffer
> host does the IO
> resume guest
> *** host -> hypervisor -> guest ***
> 
> This is essentially optimal in terms of transitions.  The data is copied on 
> the guest side (which may well be mandatory depending on what guest userspace 
> did to initiate the IO) and on the host (which may well be mandatory 
> depending on what the host is doing with the data).
> 
> Now let's try straight-from-guest-pagecache or otherwise zero-copy on the 
> guest side.  Without nondestructive changes, the guest needs a bounce buffer 
> and it looks just like the above.  One extra copy, zero extra mode 
> transitions.  With nondestructive changes, it's a bit more like physical 
> hardware with an IOMMU:
> 
> Guest shares the page.
> *** guest -> hypervisor ***
> Hypervisor adds a PTE.  Let's assume we're being very optimal and the host is 
> not synchronously notified.
> *** hypervisor -> guest ***
> Guest writes descriptor to shared virtio ring.
> Guest rings virtio doorbell, which causes an exit.
> *** guest -> hypervisor -> host ***
> host reads virtio ring (mmaped shared memory)
> 
> mmap  *** syscall ***
> host does the IO
> munmap *** syscall, TLBI ***
> 
> resume guest
> *** host -> hypervisor -> guest ***
> Guest unshares the page.
> *** guest -> hypervisor ***
> Hypervisor removes PTE.  TLBI.
> *** hypervisor -> guest ***
> 
> This is quite expensive.  For small IO, pread() or splice() in the host may 
> be a lot faster.  Even for large IO, splice() may still win.

Right, that would work nicely for pages that are shared transiently, but
less so for long-term shares. But I guess your proposal below should do
the trick.

> I can imagine clever improvements.  First, let's get rid of mmap() + 
> munmap().  Instead use a special device mapping with special semantics, not 
> regular memory.  (mmap and munmap are expensive even ignoring any arch and 
> TLB stuff.)  The rule is that, if the page is shared, access works, and if 
> private, access doesn't, but it's still mapped.  The hypervisor and the host 
> cooperate to make it so.

As long as the page can't be GUP'd I _think_ this shouldn't be a
problem. We can have the hypervisor re-inject the fault in the host. And
the host fault handler will deal with it just fine if the fault was
taken from userspace (inject a SEGV), or from the kernel through uaccess
macros. But we do get into issues if the host kernel can be tricked into
accessing the page via e.g. kmap(). I've been able to trigger this by
strace-ing a userspace process which passes a pointer to private memory
to a syscall. strace will inspect the syscall argument using
process_vm_readv(), which will pin_user_pages_remote() and access the
page via kmap(), and then we're in trouble. But preventing GUP would
prevent this by construction I think?
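
For reference, the problematic access pattern is trivial to trigger from
userspace with something along these lines (the pid and remote address are
placeholders; the point is that the read is serviced through GUP and a kernel
mapping of the target's page, never through the target's own user mappings):

/*
 * Minimal illustration of the access pattern described above: a tracer reads
 * another process's memory with process_vm_readv(), which the kernel services
 * via pin_user_pages_remote() and a kernel mapping of the page, bypassing the
 * target's own user mappings. The pid and remote address are placeholders.
 */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <pid> <remote-hex-addr>\n", argv[0]);
		return 1;
	}

	pid_t pid = (pid_t)strtol(argv[1], NULL, 0);
	void *remote_addr = (void *)(uintptr_t)strtoull(argv[2], NULL, 16);
	char buf[4096];

	struct iovec local  = { .iov_base = buf,         .iov_len = sizeof(buf) };
	struct iovec remote = { .iov_base = remote_addr, .iov_len = sizeof(buf) };

	/* The kernel pins the remote page and copies through its own mapping;
	 * if that page is private to a protected guest, this is exactly the
	 * problematic host-side access. */
	ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
	if (n < 0)
		perror("process_vm_readv");
	else
		printf("read %zd bytes\n", n);
	return 0;
}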

FWIW memfd_secret() did look like a good solution to this, but it lacks
the bidirectional notifiers with KVM that is offered by this patch
series, which is needed to allow KVM to handle guest faults, and also
offers a good framework to support future extensions (e.g.
hypervisor-assisted page migration, swap, ...). So yes, ideally
pKVM would use a kind of hybrid between memfd_secret and the private fd
proposed here, or something else providing similar properties.
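
For comparison, the memfd_secret() flow from userspace is roughly the
following (there is no libc wrapper, so it goes through syscall(), and
availability depends on the kernel configuration). What it buys is removal
from the kernel direct map, but none of the KVM-facing notifier plumbing
discussed above:

/*
 * Sketch of plain memfd_secret() usage for comparison. The backing pages are
 * removed from the kernel's direct map, but there is no KVM-facing notifier
 * mechanism, which is the gap discussed above.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447	/* generic/x86_64 syscall number */
#endif

int main(void)
{
	const size_t len = 4096;

	int fd = (int)syscall(SYS_memfd_secret, 0);
	if (fd < 0) {
		perror("memfd_secret");
		return 1;
	}
	if (ftruncate(fd, len) < 0) {
		perror("ftruncate");
		return 1;
	}

	/* Only this process can map and touch the memory; the kernel itself
	 * no longer has it in the direct map. */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	strcpy(p, "not in the direct map");
	printf("wrote: %s\n", p);

	munmap(p, len);
	close(fd);
	return 0;
}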

Thanks,
Quentin



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-06 Thread Quentin Perret
On Tuesday 05 Apr 2022 at 18:03:21 (+), Sean Christopherson wrote:
> On Tue, Apr 05, 2022, Quentin Perret wrote:
> > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> > > >>  - it can be very useful for protected VMs to do shared=>private
> > > >>conversions. Think of a VM receiving some data from the host in a
> > > >>shared buffer, and then it wants to operate on that buffer without
> > > >>risking leaking confidential information in a transient state. In
> > > >>that case the most logical thing to do is to convert the buffer back
> > > >>to private, do whatever needs to be done on that buffer (decrypting 
> > > >> a
> > > >>frame, ...), and then share it back with the host to consume it;
> > > >
> > > > If performance is a motivation, why would the guest want to do two
> > > > conversions instead of just doing internal memcpy() to/from a private
> > > > page?  I would be quite surprised if multiple exits and TLB shootdowns 
> > > > is
> > > > actually faster, especially at any kind of scale where zapping stage-2
> > > > PTEs will cause lock contention and IPIs.
> > > 
> > > I don't know the numbers or all the details, but this is arm64, which is a
> > > rather better architecture than x86 in this regard.  So maybe it's not so
> > > bad, at least in very simple cases, ignoring all implementation details.
> > > (But see below.)  Also the systems in question tend to have fewer CPUs 
> > > than
> > > some of the massive x86 systems out there.
> > 
> > Yep. I can try and do some measurements if that's really necessary, but
> > I'm really convinced the cost of the TLBI for the shared->private
> > conversion is going to be significantly smaller than the cost of memcpy-ing
> > the buffer twice in the guest for us.
> 
> It's not just the TLB shootdown, the VM-Exits aren't free.

Ack, but we can at least work on the rest (number of exits, locking, ...).
The cost of the memcpy and the TLBI are really incompressible.

> And barring non-trivial
> improvements to KVM's MMU, e.g. sharding of mmu_lock, modifying the page 
> tables will
> block all other updates and MMU operations.  Taking mmu_lock for read, should 
> arm64
> ever convert to a rwlock, is not an option because KVM needs to block other
> conversions to avoid races.

FWIW the host mmu_lock isn't all that useful for pKVM. The host doesn't
have _any_ control over guest page-tables, and the hypervisor can't
safely rely on the host for locking, so we have hypervisor-level
synchronization.

> Hmm, though batching multiple pages into a single request would mitigate most 
> of
> the overhead.

Yep, there are a few tricks we can play to make this fairly efficient in
the most common cases. And fine-grain locking at EL2 is really high up
on the todo list :-)

Thanks,
Quentin



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-05 Thread Quentin Perret
On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> 
> 
> On Mon, Apr 4, 2022, at 10:06 AM, Sean Christopherson wrote:
> > On Mon, Apr 04, 2022, Quentin Perret wrote:
> >> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote:
> >> FWIW, there are a couple of reasons why I'd like to have in-place
> >> conversions:
> >> 
> >>  - one goal of pKVM is to migrate some things away from the Arm
> >>Trustzone environment (e.g. DRM and the likes) and into protected VMs
> >>instead. This will give Linux a fighting chance to defend itself
> >>against these things -- they currently have access to _all_ memory.
> >>And transitioning pages between Linux and Trustzone (donations and
> >>shares) is fast and non-destructive, so we really do not want pKVM to
> >>regress by requiring the hypervisor to memcpy things;
> >
> > Is there actually a _need_ for the conversion to be non-destructive?  
> > E.g. I assume
> > the "trusted" side of things will need to be reworked to run as a pKVM 
> > guest, at
> > which point reworking its logic to understand that conversions are 
> > destructive and
> > slow-ish doesn't seem too onerous.
> >
> >>  - it can be very useful for protected VMs to do shared=>private
> >>conversions. Think of a VM receiving some data from the host in a
> >>shared buffer, and then it wants to operate on that buffer without
> > >>risking leaking confidential information in a transient state. In
> >>that case the most logical thing to do is to convert the buffer back
> >>to private, do whatever needs to be done on that buffer (decrypting a
> >>frame, ...), and then share it back with the host to consume it;
> >
> > If performance is a motivation, why would the guest want to do two 
> > conversions
> > instead of just doing internal memcpy() to/from a private page?  I 
> > would be quite
> > surprised if multiple exits and TLB shootdowns is actually faster, 
> > especially at
> > any kind of scale where zapping stage-2 PTEs will cause lock contention 
> > and IPIs.
> 
> I don't know the numbers or all the details, but this is arm64, which is a 
> rather better architecture than x86 in this regard.  So maybe it's not so 
> bad, at least in very simple cases, ignoring all implementation details.  
> (But see below.)  Also the systems in question tend to have fewer CPUs than 
> some of the massive x86 systems out there.

Yep. I can try and do some measurements if that's really necessary, but
I'm really convinced the cost of the TLBI for the shared->private
conversion is going to be significantly smaller than the cost of memcpy-ing
the buffer twice in the guest for us. To be fair, although the cost for
the CPU update is going to be low, the cost for IOMMU updates _might_ be
higher, but that very much depends on the hardware. On systems that use
e.g. the Arm SMMU, the IOMMUs can use the CPU page-tables directly, and
the iotlb invalidation is done on the back of the CPU invalidation. So,
on systems with sane hardware the overhead is *really* quite small.

Also, memcpy requires double the memory, it is pretty bad for power, and
it causes memory traffic which can't be a good thing for things running
concurrently.

> If we actually wanted to support transitioning the same page between shared 
> and private, though, we have a bit of an awkward situation.  Private to 
> shared is conceptually easy -- do some bookkeeping, reconstitute the direct 
> map entry, and it's done.  The other direction is a mess: all existing uses 
> of the page need to be torn down.  If the page has been recently used for 
> DMA, this includes IOMMU entries.
>
> Quentin: let's ignore any API issues for now.  Do you have a concept of how a 
> nondestructive shared -> private transition could work well, even in 
> principle?

I had a high level idea for the workflow, but I haven't looked into the
implementation details.

The idea would be to allow KVM *or* userspace to take a reference
to a page in the fd in an exclusive manner. KVM could take a reference
on a page (which would be necessary before donating it to a guest)
using some kind of memfile_notifier as proposed in this series, and
userspace could do the same some other way (mmap presumably?). In both
cases, the operation might fail.
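
To illustrate the intended semantics, here is a toy model of that
exclusive-ownership rule -- whoever grabs a page second fails until the first
owner releases it. The names and types are mine, purely for illustration, not
an API from this series:

/*
 * Toy model of the exclusive-ownership idea sketched above: a page in the fd
 * can be grabbed by exactly one consumer at a time (KVM before donating it to
 * a guest, or userspace via mmap), and the second grab fails. Illustrative
 * only.
 */
#include <stdatomic.h>
#include <stdio.h>

enum page_owner { OWNER_NONE = 0, OWNER_KVM, OWNER_USERSPACE };

struct memfile_page {
	_Atomic int owner;	/* enum page_owner */
};

/* Try to take the page for 'who'; fails if someone else already holds it. */
static int page_try_grab(struct memfile_page *p, int who)
{
	int expected = OWNER_NONE;
	return atomic_compare_exchange_strong(&p->owner, &expected, who);
}

static void page_release(struct memfile_page *p, int who)
{
	int expected = who;
	atomic_compare_exchange_strong(&p->owner, &expected, OWNER_NONE);
}

int main(void)
{
	struct memfile_page page = { .owner = OWNER_NONE };

	/* KVM grabs the page before donating it to the guest... */
	printf("kvm grab: %d\n", page_try_grab(&page, OWNER_KVM));
	/* ...so a userspace mmap of the same offset would fail... */
	printf("userspace grab: %d\n", page_try_grab(&page, OWNER_USERSPACE));
	/* ...until the guest shares it back and KVM drops its reference. */
	page_release(&page, OWNER_KVM);
	printf("userspace grab after release: %d\n",
	       page_try_grab(&page, OWNER_USERSPACE));
	return 0;
}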

I would imagine the boot and private->shared flow as follows:

 - the VMM uses fallocate on the private fd, and associates the <fd, offset, size> with a memslot;

 - the guest boots, and as part of that KVM takes references to all the
   pages that are donated to the guest. If userspace happens to have a
  

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-04 Thread Quentin Perret
On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote:
> On Fri, Apr 1, 2022, at 7:59 AM, Quentin Perret wrote:
> > On Thursday 31 Mar 2022 at 09:04:56 (-0700), Andy Lutomirski wrote:
> 
> 
> > To answer your original question about memory 'conversion', the key
> > thing is that the pKVM hypervisor controls the stage-2 page-tables for
> > everyone in the system, all guests as well as the host. As such, a page
> > 'conversion' is nothing more than a permission change in the relevant
> > page-tables.
> >
> 
> So I can see two different ways to approach this.
> 
> One is that you split the whole address space in half and, just like SEV and 
> TDX, allocate one bit to indicate the shared/private status of a page.  This 
> makes it work a lot like SEV and TDX.
>
> The other is to have shared and private pages be distinguished only by their 
> hypercall history and the (protected) page tables.  This saves some address 
> space and some page table allocations, but it opens some cans of worms too.  
> In particular, the guest and the hypervisor need to coordinate, in a way that 
> the guest can trust, to ensure that the guest's idea of which pages are 
> private match the host's.  This model seems a bit harder to support nicely 
> with the private memory fd model, but not necessarily impossible.

Right. Perhaps one thing I should clarify as well: pKVM (as opposed to
TDX) has only _one_ page-table per guest, and it is controlled by the
hypervisor only. So the hypervisor needs to be involved for both shared
and private mappings. As such, shared pages have relatively similar
constraints when it comes to host mm stuff --  we can't migrate shared
pages or swap them out without getting the hypervisor involved.

> Also, what are you trying to accomplish by having the host userspace mmap 
> private pages?

What I would really like to have is non-destructive in-place conversions
of pages. mmap-ing the pages that have been shared back felt like a good
fit for the private=>shared conversion, but in fact I'm not all that
opinionated about the API as long as the behaviour and the performance
are there. Happy to look into alternatives.

FWIW, there are a couple of reasons why I'd like to have in-place
conversions:

 - one goal of pKVM is to migrate some things away from the Arm
   Trustzone environment (e.g. DRM and the likes) and into protected VMs
   instead. This will give Linux a fighting chance to defend itself
   against these things -- they currently have access to _all_ memory.
   And transitioning pages between Linux and Trustzone (donations and
   shares) is fast and non-destructive, so we really do not want pKVM to
   regress by requiring the hypervisor to memcpy things;

 - it can be very useful for protected VMs to do shared=>private
   conversions. Think of a VM receiving some data from the host in a
   shared buffer, and then it wants to operate on that buffer without
   risking leaking confidential information in a transient state. In
   that case the most logical thing to do is to convert the buffer back
   to private, do whatever needs to be done on that buffer (decrypting a
   frame, ...), and then share it back with the host to consume it;

 - similar to the previous point, a protected VM might want to
   temporarily turn a buffer private to avoid ToCToU issues;

 - once we're able to do device assignment to protected VMs, this might
   allow DMA-ing to a private buffer, and make it shared later w/o
   bouncing.

And there is probably more.

IIUC, the private fd proposal as it stands requires shared and private
pages to come from entirely distinct places. So it's not entirely clear
to me how any of the above could be supported without having the
hypervisor memcpy the data during conversions, which I really don't want
to do for performance reasons.

> Is the idea that multiple guest could share the same page until such time as 
> one of them tries to write to it?

That would certainly be possible to implement in the pKVM
environment with the right tracking, so I think it is worth considering
as a future goal.

Thanks,
Quentin



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-01 Thread Quentin Perret
On Friday 01 Apr 2022 at 17:14:21 (+), Sean Christopherson wrote:
> On Fri, Apr 01, 2022, Quentin Perret wrote:
> > The typical flow is as follows:
> > 
> >  - the host asks the hypervisor to run a guest;
> > 
> >  - the hypervisor does the context switch, which includes switching
> >stage-2 page-tables;
> > 
> >  - initially the guest has an empty stage-2 (we don't require
> >pre-faulting everything), which means it'll immediately fault;
> > 
> >  - the hypervisor switches back to host context to handle the guest
> >fault;
> > 
> >  - the host handler finds the corresponding memslot and does the
> >ipa->hva conversion. In our current implementation it uses a longterm
> >GUP pin on the corresponding page;
> > 
> >  - once it has a page, the host handler issues a hypercall to donate the
> >page to the guest;
> > 
> >  - the hypervisor does a bunch of checks to make sure the host owns the
> >page, and if all is fine it will unmap it from the host stage-2 and
> >map it in the guest stage-2, and do some bookkeeping as it needs to
> >track page ownership, etc;
> > 
> >  - the guest can then proceed to run, and possibly faults in many more
> >pages;
> > 
> >  - when it wants to, the guest can then issue a hypercall to share a
> >page back with the host;
> > 
> >  - the hypervisor checks the request, maps the page back in the host
> >stage-2, does more bookkeeping and returns back to the host to notify
> >it of the share;
> > 
> >  - the host kernel at that point can exit back to userspace to relay
> >that information to the VMM;
> > 
> >  - rinse and repeat.
> 
> I assume there is a scenario where a page can be converted from 
> shared=>private?
> If so, is there a use case where that happens post-boot _and_ the contents of 
> the
> page are preserved?

I think most of our use-cases are private=>shared, but how is that
different?

> > We currently don't allow the host punching holes in the guest IPA space.
> 
> The hole doesn't get punched in guest IPA space, it gets punched in the 
> private
> backing store, which is host PA space.

Hmm, in a previous message I thought that you mentioned that when a hole
gets punched in the fd, KVM will go and unmap the page in the private
SPTEs, which will cause a fatal error for any subsequent access from the
guest to the corresponding IPA?

If that's correct, I meant that we currently don't support that - the
host can't unmap anything from the guest stage-2, it can only tear it
down entirely. But again, I'm not too worried about that, we could
certainly implement that part without too many issues.

> > Once it has donated a page to a guest, it can't have it back until the
> > guest has been entirely torn down (at which point all of memory is
> > poisoned by the hypervisor obviously).
> 
> The guest doesn't have to know that it was handed back a different page.  It 
> will
> require defining the semantics to state that the trusted hypervisor will clear
> that page on conversion, but IMO the trusted hypervisor should be doing that
> anyways.  IMO, forcing on the guest to correctly zero pages on conversion is
> unnecessarily risky because converting private=>shared and preserving the 
> contents
> should be a very, very rare scenario, i.e. it's just one more thing for the 
> guest
> to get wrong.

I'm not sure I agree. The guest is going to communicate with an
untrusted entity via that shared page, so it better be careful. Guest
hardening in general is a major topic, and of all problems, zeroing the
page before sharing is probably one of the simplest to solve.

Also, note that in pKVM all the hypervisor code at EL2 runs with
preemption disabled, which is a strict constraint. As such one of the
main goals is to spend as little time as possible in that context.
We're trying hard to keep the amount of zeroing/memcpy-ing to an
absolute minimum. And that's especially true as we introduce support for
huge pages. So, we'll take every opportunity we get to have the guest
or the host do that work.
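
For what it's worth, the guest-side discipline being argued for is small;
something along the lines of the sketch below, where pkvm_mem_share() is just
a placeholder for the actual share hypercall (whose ABI isn't part of this
thread) and only the ordering matters:

/*
 * Guest-side sketch of the argument above: scrub a buffer before handing it
 * to the untrusted host. pkvm_mem_share() is a placeholder for the actual
 * share hypercall; only the ordering matters here.
 */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096UL

/* Placeholder: in a real guest this would be an HVC to the hypervisor. */
static int pkvm_mem_share(uintptr_t ipa, size_t nr_pages)
{
	(void)ipa;
	(void)nr_pages;
	return 0;
}

/* Share a page-aligned buffer with the host without leaking whatever the
 * page previously contained. */
static int share_buffer_with_host(void *buf, size_t size)
{
	memset(buf, 0, size);	/* scrub stale private data first */
	/* ... fill in only the data meant for the host here ... */
	return pkvm_mem_share((uintptr_t)buf, size / PAGE_SIZE);
}

int main(void)
{
	static char buf[PAGE_SIZE] __attribute__((aligned(4096)));
	return share_buffer_with_host(buf, sizeof(buf));
}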



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-01 Thread Quentin Perret
On Thursday 31 Mar 2022 at 09:04:56 (-0700), Andy Lutomirski wrote:
> On Wed, Mar 30, 2022, at 10:58 AM, Sean Christopherson wrote:
> > On Wed, Mar 30, 2022, Quentin Perret wrote:
> >> On Wednesday 30 Mar 2022 at 09:58:27 (+0100), Steven Price wrote:
> >> > On 29/03/2022 18:01, Quentin Perret wrote:
> >> > > Is implicit sharing a thing? E.g., if a guest makes a memory access in
> >> > > the shared gpa range at an address that doesn't have a backing memslot,
> >> > > will KVM check whether there is a corresponding private memslot at the
> >> > > right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or
> >> > > would that just generate an MMIO exit as usual?
> >> > 
> >> > My understanding is that the guest needs some way of tagging whether a
> >> > page is expected to be shared or private. On the architectures I'm aware
> >> > of this is done by effectively stealing a bit from the IPA space and
> >> > pretending it's a flag bit.
> >> 
> >> Right, and that is in fact the main point of divergence we have I think.
> >> While I understand this might be necessary for TDX and the likes, this
> >> makes little sense for pKVM. This would effectively embed into the IPA a
> >> purely software-defined non-architectural property/protocol although we
> >> don't actually need to: we (pKVM) can reasonably expect the guest to
> >> explicitly issue hypercalls to share pages in-place. So I'd be really
> >> keen to avoid baking in assumptions about that model too deep in the
> >> host mm bits if at all possible.
> >
> > There is no assumption about stealing PA bits baked into this API.  Even 
> > within
> > x86 KVM, I consider it a hard requirement that the common flows not assume 
> > the
> > private vs. shared information is communicated through the PA.
> 
> Quentin, I think we might need a clarification.  The API in this patchset 
> indeed has no requirement that a PA bit distinguish between private and 
> shared, but I think it makes at least a weak assumption that *something*, a 
> priori, distinguishes them.  In particular, there are private memslots and 
> shared memslots, so the logical flow of resolving a guest memory access looks 
> like:
> 
> 1. guest accesses a GVA
> 
> 2. read guest paging structures
> 
> 3. determine whether this is a shared or private access
> 
> 4. read host (KVM memslots and anything else, EPT, NPT, RMP, etc) structures 
> accordingly.  In particular, the memslot to reference is different depending 
> on the access type.
> 
> For TDX, this maps on to the fd-based model perfectly: the host-side paging 
> structures for the shared and private slots are completely separate.  For 
> SEV, the structures are shared and KVM will need to figure out what to do in 
> case a private and shared memslot overlap.  Presumably it's sufficient to 
> declare that one of them wins, although actually determining which one is 
> active for a given GPA may involve checking whether the backing store for a 
> given page actually exists.
> 
> But I don't understand pKVM well enough to understand how it fits in.  
> Quentin, how is the shared vs private mode of a memory access determined?  
> How do the paging structures work?  Can a guest switch between shared and 
> private by issuing a hypercall without changing any guest-side paging 
> structures or anything else?

My apologies, I've indeed shared very few details about how pKVM
works. We'll be posting patches upstream really soon that will hopefully
help with this, but in the meantime, here is the idea.

pKVM is designed around MMU-based protection as opposed to encryption as
is the case for many confidential computing solutions. It's probably
worth mentioning that, although it targets arm64, pKVM is distinct from
the Arm CC-A stuff and requires no fancy hardware extensions -- it is
applicable all the way back to Arm v8.0 which makes it an interesting
solution for mobile.

Another particularity of the pKVM approach is that the code of the
hypervisor itself lives in the kernel source tree (see
arch/arm64/kvm/hyp/nvhe/). The hypervisor is built with the rest of the
kernel but as a self-sufficient object, and ends up in its own dedicated
ELF section (.hyp.*) in the kernel image. The main requirement for pKVM
(and KVM on arm64 in general) is to have the bootloader enter the kernel
at the hypervisor exception level (a.k.a EL2). The boot procedure is a
bit involved, but eventually the hypervisor object is installed at EL2,
and the kernel is deprivileged to EL1 and proceeds to boot. From that
point on the hypervisor no longer trust

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-30 Thread Quentin Perret
On Wednesday 30 Mar 2022 at 09:58:27 (+0100), Steven Price wrote:
> On 29/03/2022 18:01, Quentin Perret wrote:
> > On Monday 28 Mar 2022 at 18:58:35 (+), Sean Christopherson wrote:
> >> On Mon, Mar 28, 2022, Quentin Perret wrote:
> >>> Hi Sean,
> >>>
> >>> Thanks for the reply, this helps a lot.
> >>>
> >>> On Monday 28 Mar 2022 at 17:13:10 (+), Sean Christopherson wrote:
> >>>> On Thu, Mar 24, 2022, Quentin Perret wrote:
> >>>>> For Protected KVM (and I suspect most other confidential computing
> >>>>> solutions), guests have the ability to share some of their pages back
> >>>>> with the host kernel using a dedicated hypercall. This is necessary
> >>>>> for e.g. virtio communications, so these shared pages need to be mapped
> >>>>> back into the VMM's address space. I'm a bit confused about how that
> >>>>> would work with the approach proposed here. What is going to be the
> >>>>> approach for TDX?
> >>>>>
> >>>>> It feels like the most 'natural' thing would be to have a KVM exit
> >>>>> reason describing which pages have been shared back by the guest, and to
> >>>>> then allow the VMM to mmap those specific pages in response in the
> >>>>> memfd. Is this something that has been discussed or considered?
> >>>>
> >>>> The proposed solution is to exit to userspace with a new exit reason, 
> >>>> KVM_EXIT_MEMORY_ERROR,
> >>>> when the guest makes the hypercall to request conversion[1].  The 
> >>>> private fd itself
> >>>> will never allow mapping memory into userspace, instead userspace will 
> >>>> need to punch
> > >> a hole in the private fd backing store.  The absence of a valid mapping 
> >>>> in the private
> >>>> fd is how KVM detects that a pfn is "shared" (memslots without a private 
> >>>> fd are always
> >>>> shared)[2].
> >>>
> >>> Right. I'm still a bit confused about how the VMM is going to get the
> >>> shared page mapped in its page-table. Once it has punched a hole into
> >>> the private fd, how is it supposed to access the actual physical page
> >>> that the guest shared?
> >>
> >> The guest doesn't share a _host_ physical page, the guest shares a _guest_ 
> >> physical
> >> page.  Until host userspace converts the gfn to shared and thus maps the 
> >> gfn=>hva
> >> via mmap(), the guest is blocked and can't read/write/exec the memory.  
> >> AFAIK, no
> >> architecture allows in-place decryption of guest private memory.  s390 
> >> allows a
> >> page to be "made accessible" to the host for the purposes of swap, and 
> >> other
> >> architectures will have similar behavior for migrating a protected VM, but 
> >> those
> >> scenarios are not sharing the page (and they also make the page 
> >> inaccessible to
> >> the guest).
> > 
> > I see. FWIW, since pKVM is entirely MMU-based, we are in fact capable of
> > doing in-place sharing, which also means it can retain the content of
> > the page as part of the conversion.
> > 
> > Also, I'll ask the Arm CCA developers to correct me if this is wrong, but
> > I _believe_ it should be technically possible to do in-place sharing for
> > them too.
> 
> In general this isn't possible as the physical memory could be
> encrypted, so some temporary memory is required. We have prototyped
> having a single temporary page for the setup when populating the guest's
> initial memory - this has the nice property of not requiring any
> additional allocation during the process but with the downside of
> effectively two memcpy()s per page (one to the temporary page and
> another, with optional encryption, into the now private page).

Interesting, thanks for the explanation.

> >>> Is there an assumption somewhere that the VMM should have this page 
> >>> mapped in
> >>> via an alias that it can legally access only once it has punched a hole at
> >>> the corresponding offset in the private fd or something along those lines?
> >>
> >> Yes, the VMM must have a completely separate VMA.  The VMM doesn't have 
> >> to
> >> wait until the conversion to mmap() the shared variant, though obviously 
> >> it will
> >> pot

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-29 Thread Quentin Perret
On Monday 28 Mar 2022 at 18:58:35 (+), Sean Christopherson wrote:
> On Mon, Mar 28, 2022, Quentin Perret wrote:
> > Hi Sean,
> > 
> > Thanks for the reply, this helps a lot.
> > 
> > On Monday 28 Mar 2022 at 17:13:10 (+), Sean Christopherson wrote:
> > > On Thu, Mar 24, 2022, Quentin Perret wrote:
> > > > For Protected KVM (and I suspect most other confidential computing
> > > > solutions), guests have the ability to share some of their pages back
> > > > with the host kernel using a dedicated hypercall. This is necessary
> > > > for e.g. virtio communications, so these shared pages need to be mapped
> > > > back into the VMM's address space. I'm a bit confused about how that
> > > > would work with the approach proposed here. What is going to be the
> > > > approach for TDX?
> > > > 
> > > > It feels like the most 'natural' thing would be to have a KVM exit
> > > > reason describing which pages have been shared back by the guest, and to
> > > > then allow the VMM to mmap those specific pages in response in the
> > > > memfd. Is this something that has been discussed or considered?
> > > 
> > > The proposed solution is to exit to userspace with a new exit reason, 
> > > KVM_EXIT_MEMORY_ERROR,
> > > when the guest makes the hypercall to request conversion[1].  The private 
> > > fd itself
> > > will never allow mapping memory into userspace, instead userspace will 
> > > need to punch
> > > a hole in the private fd backing store.  The absence of a valid mapping 
> > > in the private
> > > fd is how KVM detects that a pfn is "shared" (memslots without a private 
> > > fd are always
> > > shared)[2].
> > 
> > Right. I'm still a bit confused about how the VMM is going to get the
> > shared page mapped in its page-table. Once it has punched a hole into
> > the private fd, how is it supposed to access the actual physical page
> > that the guest shared?
> 
> The guest doesn't share a _host_ physical page, the guest shares a _guest_ 
> physical
> page.  Until host userspace converts the gfn to shared and thus maps the 
> gfn=>hva
> via mmap(), the guest is blocked and can't read/write/exec the memory.  
> AFAIK, no
> architecture allows in-place decryption of guest private memory.  s390 allows 
> a
> page to be "made accessible" to the host for the purposes of swap, and other
> architectures will have similar behavior for migrating a protected VM, but 
> those
> scenarios are not sharing the page (and they also make the page inaccessible 
> to
> the guest).

I see. FWIW, since pKVM is entirely MMU-based, we are in fact capable of
doing in-place sharing, which also means it can retain the content of
the page as part of the conversion.

Also, I'll ask the Arm CCA developers to correct me if this is wrong, but
I _believe_ it should be technically possible to do in-place sharing for
them too.

> > Is there an assumption somewhere that the VMM should have this page mapped 
> > in
> > via an alias that it can legally access only once it has punched a hole at
> > the corresponding offset in the private fd or something along those lines?
> 
> Yes, the VMM must have a completely separate VMA.  The VMM doesn't have to
> wait until the conversion to mmap() the shared variant, though obviously it 
> will
> potentially consume double the memory if the VMM actually populates both the
> private and shared backing stores.

Gotcha. This is what confused me I think -- in this approach private and
shared pages are in fact entirely different.

In which scenario could you end up with both the private and shared
pages live at the same time? Would this be something like follows?

 - userspace creates a private fd, fallocates into it, and associates
   the <fd, offset, size> tuple with a private memslot;

 - userspace then mmaps anonymous memory (for ex.), and associates it
   with a standard memslot, which happens to be positioned at exactly
   the right offset w.r.t to the private memslot (with this offset
   defined by the bit that is set for the private addresses in the gpa
   space);

 - the guest runs, and accesses both 'aliases' of the page without doing
   an explicit share hypercall.

Is there another option?
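
For illustration, the double-backing scenario above would look roughly like
the sketch below from the VMM side, using today's KVM API for the shared
slot; the private-fd side only exists in this series, so it is described in a
comment rather than code:

/*
 * Rough sketch of the double-backing scenario above from the VMM side. Only
 * the shared half uses today's KVM API; the private-fd half is what this
 * series proposes. Error handling is minimal.
 */
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const __u64 mem_size = 2UL << 20;	/* 2M of guest memory */

	int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
	int vm = kvm < 0 ? -1 : ioctl(kvm, KVM_CREATE_VM, 0);
	if (vm < 0) {
		perror("kvm");
		return 1;
	}

	/* Shared side: ordinary anonymous memory in a standard memslot. */
	void *shared = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct kvm_userspace_memory_region shared_slot = {
		.slot = 0,
		.guest_phys_addr = 0x0,
		.memory_size = mem_size,
		.userspace_addr = (unsigned long)shared,
	};
	if (shared == MAP_FAILED ||
	    ioctl(vm, KVM_SET_USER_MEMORY_REGION, &shared_slot) < 0) {
		perror("shared memslot");
		return 1;
	}

	/*
	 * Private side (per this series, not upstream): create the private fd
	 * (memfd_create() with the proposed MFD_INACCESSIBLE flag), fallocate()
	 * it, and register a second, private memslot carrying <fd, offset>
	 * instead of a userspace address, covering the same GPA range on the
	 * "private" side of the guest address space.
	 */

	close(vm);
	close(kvm);
	return 0;
}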

Is implicit sharing a thing? E.g., if a guest makes a memory access in
the shared gpa range at an address that doesn't have a backing memslot,
will KVM check whether there is a corresponding private memslot at the
right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or
would that just generate an MMIO exit as usual?

>

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-28 Thread Quentin Perret
Hi Sean,

Thanks for the reply, this helps a lot.

On Monday 28 Mar 2022 at 17:13:10 (+), Sean Christopherson wrote:
> On Thu, Mar 24, 2022, Quentin Perret wrote:
> > For Protected KVM (and I suspect most other confidential computing
> > solutions), guests have the ability to share some of their pages back
> > with the host kernel using a dedicated hypercall. This is necessary
> > for e.g. virtio communications, so these shared pages need to be mapped
> > back into the VMM's address space. I'm a bit confused about how that
> > would work with the approach proposed here. What is going to be the
> > approach for TDX?
> > 
> > It feels like the most 'natural' thing would be to have a KVM exit
> > reason describing which pages have been shared back by the guest, and to
> > then allow the VMM to mmap those specific pages in response in the
> > memfd. Is this something that has been discussed or considered?
> 
> The proposed solution is to exit to userspace with a new exit reason, 
> KVM_EXIT_MEMORY_ERROR,
> when the guest makes the hypercall to request conversion[1].  The private fd 
> itself
> will never allow mapping memory into userspace, instead userspace will need 
> to punch
> a hole in the private fd backing store.  The absence of a valid mapping in 
> the private
> fd is how KVM detects that a pfn is "shared" (memslots without a private fd 
> are always
> shared)[2].

Right. I'm still a bit confused about how the VMM is going to get the
shared page mapped in its page-table. Once it has punched a hole into
the private fd, how is it supposed to access the actual physical page
that the guest shared? Is there an assumption somewhere that the VMM
should have this page mapped in via an alias that it can legally access
only once it has punched a hole at the corresponding offset in the
private fd or something along those lines?

> The key point is that KVM never decides to convert between shared and 
> private, it's
> always a userspace decision.  Like normal memslots, where userspace has full 
> control
> over what gfns are a valid, this gives userspace full control over whether a 
> gfn is
> shared or private at any given time.

I'm understanding this as 'the VMM is allowed to punch holes in the
private fd whenever it wants'. Is this correct? What happens if it does
so for a page that a guest hasn't shared back?
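
Just to spell out the mechanics as I understand them from the cover letter:
'conversion' driven by the VMM is nothing more than fallocate() on the
backing fd, along the lines of the sketch below (a plain memfd stands in for
the private fd, since the proposed MFD_INACCESSIBLE flag doesn't exist
upstream):

/*
 * Userspace view of the proposed conversion mechanics: allocating a range in
 * the backing fd makes it private, punching a hole makes it shared. A plain
 * memfd stands in for the private fd here, since the MFD_INACCESSIBLE flag
 * proposed by the series does not exist in mainline.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const off_t page = 4096;
	const off_t size = 16 * page;

	int fd = memfd_create("guest-private", MFD_CLOEXEC);
	if (fd < 0) {
		perror("memfd_create");
		return 1;
	}

	/* "map": back the whole range, i.e. treat it all as private. */
	if (fallocate(fd, 0, 0, size) < 0) {
		perror("fallocate");
		return 1;
	}

	/* "unmap"/convert to shared: punch a hole over one page. KVM would be
	 * told about this via the memfile notifier and would zap the private
	 * mapping; whether the guest actually agreed to the conversion is up
	 * to the VMM/guest protocol, not this mechanism. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      4 * page, page) < 0) {
		perror("fallocate(PUNCH_HOLE)");
		return 1;
	}

	close(fd);
	return 0;
}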

> Another important detail is that this approach means the kernel and KVM treat 
> the
> shared backing store and private backing store as independent, albeit related,
> entities.  This is very deliberate as it makes it easier to reason about what 
> is
> and isn't allowed/required.  E.g. the kernel only needs to handle freeing 
> private
> memory, there is no special handling for conversion to shared because no such 
> path
> exists as far as host pfns are concerned.  And userspace doesn't need any new 
> "rules"
> for protecting itself against a malicious guest, e.g. userspace already needs 
> to
> ensure that it has a valid mapping prior to accessing guest memory (or be 
> able to
> handle any resulting signals).  A malicious guest can DoS itself by 
> instructing
> userspace to communicate over memory that is currently mapped private, but 
> there
> are no new novel attack vectors from the host's perspective as coercing the 
> host
> into accessing an invalid mapping after shared=>private conversion is just a 
> variant
> of a use-after-free.

Interesting. I was (maybe incorrectly) assuming that it would be
difficult to handle illegal host accesses w/ TDX. IOW, this would
essentially crash the host. Is this remotely correct or did I get that
wrong?

> One potential conversions that's TBD (at least, I think it is, I haven't read 
> through
> this most recent version) is how to support populating guest private memory 
> with
> non-zero data, e.g. to allow in-place conversion of the initial guest 
> firmware instead
> of having to do an extra memcpy().

Right. FWIW, in the pKVM case we should be pretty immune to this I
think. The initial firmware is loaded in guest memory by the hypervisor
itself (the EL2 code in arm64 speak) as the first vCPU starts running.
And that firmware can then use e.g. virtio to load the guest payload and
measure/check it. IOW, we currently don't have expectations regarding
the initial state of guest memory, but it might be handy to have support
for pre-loading the payload in the future (should save a copy as you
said).

> [1] KVM will also exit to userspace with the same info on "implicit" 
> conversions,
> i.e. if the guest accesses the "wrong" GPA.  Neither SEV-SNP nor TDX 
> mandate
> explicit conversions in their guest<->host ABIs, so KVM has to support 
> implicit
> conversions :-/
> 
> [2] Ideally (IMO), KVM would require userspace to completely remove the 
> private memslot,
> but that's too slow due to use of SRCU in both KVM and userspace (QEMU at 
> least uses
> SRCU for memslot changes).



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-24 Thread Quentin Perret
Hi Chao,

+CC Will and Marc for visibility.

On Thursday 10 Mar 2022 at 22:08:58 (+0800), Chao Peng wrote:
> This is the v5 of this series which tries to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
> 
>   d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
>  
> Introduction
> 
> In general this patch series introduce fd-based memslot which provides
> guest memory through memory file descriptor fd[offset,size] instead of
> hva/size. The fd can be created from a supported memory filesystem
> like tmpfs/hugetlbfs etc., which we refer to as the memory backing store. KVM
> and the memory backing store exchange callbacks when such a memslot
> gets created. At runtime KVM will call into callbacks provided by the
> backing store to get the pfn with the fd+offset. Memory backing store
> will also call into KVM callbacks when userspace fallocate/punch hole
> on the fd to notify KVM to map/unmap secondary MMU page tables.
> 
> Comparing to existing hva-based memslot, this new type of memslot allows
> guest memory unmapped from host userspace like QEMU and even the kernel
> itself, therefore reduce attack surface and prevent bugs.
> 
> Based on this fd-based memslot, we can build guest private memory that
> is going to be used in confidential computing environments such as Intel
> TDX and AMD SEV. When supported, the memory backing store can provide
> more enforcement on the fd and KVM can use a single memslot to hold both
> the private and shared part of the guest memory. 
> 
> mm extension
> -
> Introduces a new MFD_INACCESSIBLE flag for memfd_create(); the file created
> with this flag cannot be read(), written or mmap()ed etc. via normal
> MMU operations. The file content can only be used with the newly
> introduced memfile_notifier extension.
> 
> The memfile_notifier extension provides two sets of callbacks for KVM to
> interact with the memory backing store:
>   - memfile_notifier_ops: callbacks for memory backing store to notify
> KVM when memory gets allocated/invalidated.
>   - memfile_pfn_ops: callbacks for KVM to call into memory backing store
> to request memory pages for guest private memory.
> 
> The memfile_notifier extension also provides APIs for memory backing
> store to register/unregister itself and to trigger the notifier when the
> bookmarked memory gets fallocated/invalidated.
> 
> memslot extension
> -
> Add the private fd and the fd offset to existing 'shared' memslot so that
> both private/shared guest memory can live in one single memslot. A page in
> the memslot is either private or shared. A page is private only when it's
> already allocated in the backing store fd; in all other cases it's treated
> as shared, and this includes pages already mapped as shared as well as those
> having not been mapped. This means the memory backing store is the place
> which tells the truth of which page is private.
> 
> Private memory map/unmap and conversion
> ---
> Userspace's map/unmap operations are done by fallocate() ioctl on the
> backing store fd.
>   - map: default fallocate() with mode=0.
>   - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
> The map/unmap will trigger above memfile_notifier_ops to let KVM map/unmap
> secondary MMU page tables.

I recently came across this series which is interesting for the
Protected KVM work that's currently ongoing in the Android world (see
[1], [2] or [3] for more details). The idea is similar in a number of
ways to the Intel TDX stuff (from what I understand, but I'm clearly not
understanding it all so, ...) or the Arm CCA solution, but using stage-2
MMUs instead of encryption; and leverages the caveat of the nVHE
KVM/arm64 implementation to isolate the control of stage-2 MMUs from the
host.

For Protected KVM (and I suspect most other confidential computing
solutions), guests have the ability to share some of their pages back
with the host kernel using a dedicated hypercall. This is necessary
for e.g. virtio communications, so these shared pages need to be mapped
back into the VMM's address space. I'm a bit confused about how that
would work with the approach proposed here. What is going to be the
approach for TDX?

It feels like the most 'natural' thing would be to have a KVM exit
reason describing which pages have been shared back by the guest, and to
then allow the VMM to mmap those specific pages in response in the
memfd. Is this something that has been discussed or considered?

Thanks,
Quentin

[1] https://lwn.net/Articles/836693/
[2] https://www.youtube.com/watch?v=wY-u6n75iXc
[3] https://www.youtube.com/watch?v=54q6RzS9BpQ&t=10862s



Re: [PATCH] target/arm: Fix ISR_EL1 tracking when executing at EL2

2019-11-22 Thread Quentin Perret
On Friday 22 Nov 2019 at 13:58:33 (+), Marc Zyngier wrote:
> The ARMv8 ARM states that when executing at EL2, EL3 or Secure EL1,
> ISR_EL1 shows the pending status of the physical IRQ, FIQ, or
> SError interrupts.
> 
> Unfortunately, QEMU's implementation only considers the HCR_EL2
> bits, and ignores the current exception level. This means a hypervisor
> trying to look at its own interrupt state actually sees the guest
> state, which is unexpected and breaks KVM as of Linux 5.3.
> 
> Instead, check for the running EL and return the physical bits
> if not running in a virtualized context.
> 
> Fixes: 636540e9c40b
> Reported-by: Quentin Perret 

And FWIW, Tested-by: Quentin Perret 

Thanks Marc :)
Quentin

> Signed-off-by: Marc Zyngier 
> ---
>  target/arm/helper.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/target/arm/helper.c b/target/arm/helper.c
> index a089fb5a69..027fffbff6 100644
> --- a/target/arm/helper.c
> +++ b/target/arm/helper.c
> @@ -1934,8 +1934,11 @@ static uint64_t isr_read(CPUARMState *env, const ARMCPRegInfo *ri)
>  CPUState *cs = env_cpu(env);
>  uint64_t hcr_el2 = arm_hcr_el2_eff(env);
>  uint64_t ret = 0;
> +bool allow_virt = (arm_current_el(env) == 1 &&
> +   (!arm_is_secure_below_el3(env) ||
> +(env->cp15.scr_el3 & SCR_EEL2)));
>  
> -if (hcr_el2 & HCR_IMO) {
> +if (allow_virt && (hcr_el2 & HCR_IMO)) {
>  if (cs->interrupt_request & CPU_INTERRUPT_VIRQ) {
>  ret |= CPSR_I;
>  }
> @@ -1945,7 +1948,7 @@ static uint64_t isr_read(CPUARMState *env, const ARMCPRegInfo *ri)
>  }
>  }
>  
> -if (hcr_el2 & HCR_FMO) {
> +if (allow_virt && (hcr_el2 & HCR_FMO)) {
>  if (cs->interrupt_request & CPU_INTERRUPT_VFIQ) {
>  ret |= CPSR_F;
>  }
> -- 
> 2.17.1
>