On Mon, Mar 28, 2022, Quentin Perret wrote:
> Hi Sean,
> 
> Thanks for the reply, this helps a lot.
> 
> On Monday 28 Mar 2022 at 17:13:10 (+0000), Sean Christopherson wrote:
> > On Thu, Mar 24, 2022, Quentin Perret wrote:
> > > For Protected KVM (and I suspect most other confidential computing
> > > solutions), guests have the ability to share some of their pages back
> > > with the host kernel using a dedicated hypercall. This is necessary
> > > for e.g. virtio communications, so these shared pages need to be mapped
> > > back into the VMM's address space. I'm a bit confused about how that
> > > would work with the approach proposed here. What is going to be the
> > > approach for TDX?
> > > 
> > > It feels like the most 'natural' thing would be to have a KVM exit
> > > reason describing which pages have been shared back by the guest, and to
> > > then allow the VMM to mmap those specific pages in response in the
> > > memfd. Is this something that has been discussed or considered?
> > 
> > The proposed solution is to exit to userspace with a new exit reason,
> > KVM_EXIT_MEMORY_ERROR, when the guest makes the hypercall to request
> > conversion[1].  The private fd itself will never allow mapping memory into
> > userspace, instead userspace will need to punch a hole in the private fd
> > backing store.  The absence of a valid mapping in the private fd is how KVM
> > detects that a pfn is "shared" (memslots without a private fd are always
> > shared)[2].
> 
> Right. I'm still a bit confused about how the VMM is going to get the
> shared page mapped in its page-table. Once it has punched a hole into
> the private fd, how is it supposed to access the actual physical page
> that the guest shared?

The guest doesn't share a _host_ physical page, the guest shares a _guest_
physical page.  Until host userspace converts the gfn to shared and thus maps
the gfn=>hva via mmap(), the guest is blocked and can't read/write/exec the
memory.  AFAIK, no architecture allows in-place decryption of guest private
memory.  s390 allows a page to be "made accessible" to the host for the
purposes of swap, and other architectures will have similar behavior for
migrating a protected VM, but those scenarios are not sharing the page (and
they also make the page inaccessible to the guest).
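
As a rough illustration of the rule quoted above (a gfn is private only while
the private fd has a valid mapping at the corresponding offset, and memslots
without a private fd are always shared), here is a toy model.  It is not KVM
code: PrivateFd, Memslot, allocate(), punch_hole() and is_private() are all
hypothetical names made up for this sketch, and the real lookup in the proposed
series may look quite different.

```python
# Toy model only: PrivateFd, Memslot and their methods are hypothetical names,
# not KVM or kernel APIs.
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class PrivateFd:
    """Stand-in for the private backing store, tracked per page offset."""
    populated: Set[int] = field(default_factory=set)

    def allocate(self, offset: int) -> None:
        # Userspace (or the kernel on its behalf) backs this offset privately.
        self.populated.add(offset)

    def punch_hole(self, offset: int) -> None:
        # Userspace punches a hole: the offset no longer has a valid mapping.
        self.populated.discard(offset)


@dataclass
class Memslot:
    base_gfn: int
    npages: int
    private_fd: Optional[PrivateFd] = None

    def is_private(self, gfn: int) -> bool:
        if not (self.base_gfn <= gfn < self.base_gfn + self.npages):
            raise ValueError("gfn is not covered by this memslot")
        if self.private_fd is None:
            # Memslots without a private fd are always shared.
            return False
        # Absence of a valid mapping in the private fd means "shared".
        return (gfn - self.base_gfn) in self.private_fd.populated
```

In this model, a private=>shared conversion is nothing more than userspace
calling punch_hole() for the offset; the shared mapping itself lives in a
separate, ordinary mmap()'d region that the model does not track.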

> Is there an assumption somewhere that the VMM should have this page mapped in
> via an alias that it can legally access only once it has punched a hole at
> the corresponding offset in the private fd or something along those lines?

Yes, the VMM must have a completely separate VMA.  The VMM doesn't have to
wait until the conversion to mmap() the shared variant, though obviously it will
potentially consume double the memory if the VMM actually populates both the
private and shared backing stores.

> > The key point is that KVM never decides to convert between shared and
> > private, it's always a userspace decision.  Like normal memslots, where
> > userspace has full control over which gfns are valid, this gives userspace
> > full control over whether a gfn is shared or private at any given time.
> 
> I'm understanding this as 'the VMM is allowed to punch holes in the
> private fd whenever it wants'. Is this correct?

From the kernel's perspective, yes, the VMM can punch holes at any time.  From
a "do I want to DoS my guest" perspective, the VMM must honor its contract with
the guest and not spuriously unmap private memory.

> What happens if it does so for a page that a guest hasn't shared back?

When the hole is punched, KVM will unmap the corresponding private SPTEs.  If
the guest is still accessing the page as private, the next access will fault
and KVM will exit to userspace with KVM_EXIT_MEMORY_ERROR.  Of course the guest
is probably hosed if the hole punch was truly spurious, as at least
hardware-based protected VMs effectively destroy data when a private page is
unmapped from the guest private SPTEs.
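
To spell out the sequence above, here is a toy state machine.  Again, this is
only a sketch under my own assumptions: ToyVm and its methods are hypothetical,
the string constants are placeholders rather than real exit codes, and the
model ignores everything a real fault path has to deal with (locking, page
sizes, the shared side, data loss on unmap).

```python
# Toy model only: ToyVm and its methods are hypothetical, and the constants
# below are placeholder strings, not real KVM exit codes.
KVM_EXIT_MEMORY_ERROR = "KVM_EXIT_MEMORY_ERROR"
RESUME_GUEST = "RESUME_GUEST"


class ToyVm:
    def __init__(self) -> None:
        self.private_backing = set()  # offsets populated in the private fd
        self.private_sptes = set()    # offsets mapped into the guest as private

    def populate_private(self, offset: int) -> None:
        self.private_backing.add(offset)

    def punch_hole(self, offset: int) -> None:
        # Hole punch drops the backing page and unmaps the private SPTE.
        self.private_backing.discard(offset)
        self.private_sptes.discard(offset)

    def guest_private_access(self, offset: int) -> str:
        if offset in self.private_sptes:
            return RESUME_GUEST       # already mapped, no fault
        if offset in self.private_backing:
            # Fault with valid private backing: map it and resume the guest.
            self.private_sptes.add(offset)
            return RESUME_GUEST
        # Fault with no private backing: hand the problem to userspace.
        return KVM_EXIT_MEMORY_ERROR


vm = ToyVm()
vm.populate_private(7)
first = vm.guest_private_access(7)   # faults in the private mapping
vm.punch_hole(7)                     # spurious hole punch by the VMM
second = vm.guest_private_access(7)  # guest still treats the page as private
```

Here `first` is RESUME_GUEST and `second` is KVM_EXIT_MEMORY_ERROR.  What the
model does not show is that the guest's data is already gone by the time of
the second access on hardware-based protected VMs.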

E.g. Linux guests for TDX and SNP will panic/terminate in such a scenario as
they will get a fault (injected by trusted hardware/firmware) saying that the
guest is trying to access an unaccepted/unvalidated page (TDX and SNP require
the guest to explicitly accept all private pages that aren't part of the
guest's initial pre-boot image).

> > Another important detail is that this approach means the kernel and KVM
> > treat the shared backing store and private backing store as independent,
> > albeit related, entities.  This is very deliberate as it makes it easier to
> > reason about what is and isn't allowed/required.  E.g. the kernel only needs
> > to handle freeing private memory, there is no special handling for
> > conversion to shared because no such path exists as far as host pfns are
> > concerned.  And userspace doesn't need any new "rules" for protecting itself
> > against a malicious guest, e.g. userspace already needs to ensure that it
> > has a valid mapping prior to accessing guest memory (or be able to handle
> > any resulting signals).  A malicious guest can DoS itself by instructing
> > userspace to communicate over memory that is currently mapped private, but
> > there are no novel attack vectors from the host's perspective as coercing
> > the host into accessing an invalid mapping after shared=>private conversion
> > is just a variant of a use-after-free.
> 
> Interesting. I was (maybe incorrectly) assuming that it would be
> difficult to handle illegal host accesses w/ TDX. IOW, this would
> essentially crash the host. Is this remotely correct or did I get that
> wrong?

Handling illegal host kernel accesses for both TDX and SEV-SNP is extremely
difficult, bordering on impossible.  That's one of the biggest, if not _the_
biggest, motivations for the private fd approach.  On "conversion", the page
that is used to back the shared variant is a completely different, unrelated
host physical page.  Whether or not the private/shared backing page is freed is
orthogonal to which version is mapped into the guest.  E.g. if the guest
converts a 4KiB chunk of a 2MiB hugepage, the private backing store could keep
the physical page on hole punch (example only, I don't know if this is the
actual proposed implementation).

The idea is that it'll be much, much more difficult for the host to perform an
illegal access if the actual private memory is not mapped anywhere (modulo the
kernel's direct map, which we may or may not leave intact).  The private backing
store just needs to ensure it properly sanitizes pages before freeing them.
