On Fri, Mar 04, 2022 at 11:24:30AM -0800, Andy Lutomirski wrote:
> On 2/23/22 04:05, Steven Price wrote:
> > On 23/02/2022 11:49, Chao Peng wrote:
> > > On Thu, Feb 17, 2022 at 11:09:35AM -0800, Andy Lutomirski wrote:
> > > > On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote:
> > > > > On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
> > > > > > On 1/18/22 05:21, Chao Peng wrote:
> > > > > > > From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
> > > > > > > 
> > > > > > > Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
> > > > > > > the file is inaccessible from userspace through ordinary MMU access
> > > > > > > (e.g., read/write/mmap). However, the file content can be accessed
> > > > > > > via a different mechanism (e.g. KVM MMU) indirectly.
> > > > > > > 
> > > > > > > It provides the semantics required for KVM guest private memory
> > > > > > > support: a file descriptor with this seal set is going to be used
> > > > > > > as the source of guest memory in confidential computing
> > > > > > > environments such as Intel TDX/AMD SEV, but may not be accessible
> > > > > > > from host userspace.
> > > > > > > 
> > > > > > > At this time only shmem implements this seal.
> > > > > > > 
> > > > > > 
> > > > > > I don't dislike this *that* much, but I do dislike this.
> > > > > > F_SEAL_INACCESSIBLE essentially transmutes a memfd into a different
> > > > > > type of object.  While this can apparently be done successfully and
> > > > > > without races (as in this code), it's at least awkward.  I think
> > > > > > that either creating a special inaccessible memfd should be a single
> > > > > > operation that creates the correct type of object, or there should
> > > > > > be a clear justification for why it's a two-step process.
> > > > > 
> > > > > One justification comes from Steven's comment on patch-00: for the
> > > > > ARM usage it could work by creating a normal memfd, (partially)
> > > > > populating it with the initial guest memory content (e.g. firmware),
> > > > > and then applying F_SEAL_INACCESSIBLE just before the first launch of
> > > > > the guest in KVM (the current code would definitely need changes to
> > > > > support that).
> > > > 
> > > > Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right?  
> > > > So this won't work.
> > > 
> > > Hmm, right. If we set F_SEAL_INACCESSIBLE on a non-empty file, we would
> > > need to make sure that access through any existing mmap-ed areas is
> > > prevented, but that is hard.
> > > 
> > > > 
> > > > In any case, the whole confidential VM initialization story is a bit 
> > > > muddy.  From the earlier emails, it sounds like ARM expects the host to 
> > > > fill in guest memory and measure it.  From my recollection of Intel's 
> > > > scheme (which may well be wrong, and I could easily be confusing it 
> > > > with SGX), TDX instead measures what is essentially a transcript of the 
> > > > series of operations that initializes the VM.  These are fundamentally 
> > > > not the same thing even if they accomplish the same end goal.  For TDX, 
> > > > we unavoidably need an operation (ioctl or similar) that initializes 
> > > > things according to the VM's instructions, and ARM ought to be able to 
> > > > use roughly the same mechanism.
> > > 
> > > Yes, TDX requires an ioctl. Steven may comment on the ARM part.
> > 
> > The Arm story is evolving so I can't give a definite answer yet. Our
> > current prototyping works by creating the initial VM content in a
> > memslot as with a normal VM and then calling an ioctl which throws the
> > big switch and converts all the (populated) pages to be protected. At
> > this point the RMM performs a measurement of the data that the VM is
> > being populated with.
> > 
> > The above (in our prototype) suffers from all the expected problems with
> > a malicious VMM being able to trick the host kernel into accessing those
> > pages after they have been protected (causing a fault detected by the
> > hardware).
> > 
> > The ideal (from our perspective) approach would be to follow the same
> > flow but where the VMM populates a memfd rather than normal anonymous
> > pages. The memfd could then be sealed and the pages converted to
> > protected ones (with the RMM measuring them in the process).
> > 
> > The question becomes how is that memfd populated? It would be nice if
> > that could be done using normal operations on a memfd (i.e. using
> > mmap()) and therefore this code could be (relatively) portable. This
> > would mean that any pages mapped from the memfd would either need to
> > block the sealing or be revoked at the time of sealing.
> > 
> > The other approach is we could of course implement a special ioctl which
> > effectively does a memcpy into the (created empty and sealed) memfd and
> > does the necessary dance with the RMM to measure the contents. This
> > would match the "transcript of the series of operations" described above
> > - but seems much less ideal from the viewpoint of the VMM.
> 
> A VMM that supports Other Vendors will need to understand this sort of model
> regardless.
> 
> I don't particularly mind the idea of having the kernel consume a normal
> memfd and spit out a new object, but I find the concept of changing the type
> of the object in place, even if it has other references, and trying to
> control all the resulting races to be somewhat alarming.
> 
> In pseudo-Rust, this is the difference between:
> 
> fn convert_to_private(in: &mut Memfd)
> 
> and
> 
> fn convert_to_private(in: Memfd) -> PrivateMemoryFd
> 
> This doesn't map particularly nicely to the kernel, though.

I understand the Rust semantics here and the difficulty of handling the
races. Probably we should not expose F_SEAL_INACCESSIBLE to userspace at
all; instead we can use a new in-kernel flag to indicate the same thing.
That flag should be set only when the memfd is created with
MFD_INACCESSIBLE.
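
To make the distinction in the pseudo-Rust above concrete, here is a minimal
sketch with hypothetical `Memfd`/`PrivateMemoryFd` stand-in types (not kernel
code): the consuming conversion makes the old handle unusable at compile
time, which is exactly the set of races the in-place version would have to
control at runtime.

```rust
// Hypothetical stand-ins for the two kinds of object, purely for
// illustration; nothing here models the real kernel structures.
struct Memfd {
    size: usize,
}

struct PrivateMemoryFd {
    size: usize,
}

// In-place flavor (the racy shape, shown only as a signature): the
// object changes character while other references may still exist.
//
//     fn seal_inaccessible(m: &mut Memfd) { ... }

// Consuming flavor: the caller gives up its Memfd entirely, so the
// compiler guarantees no further use of the old handle.
fn convert_to_private(m: Memfd) -> PrivateMemoryFd {
    PrivateMemoryFd { size: m.size }
}

fn main() {
    let m = Memfd { size: 4096 };
    let p = convert_to_private(m);
    // `m` is moved above; using it again would be a compile error.
    println!("private fd of {} bytes", p.size);
}
```

Creating the fd with MFD_INACCESSIBLE from the start corresponds to never
constructing the plain `Memfd` at all, which avoids the conversion question
entirely.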

Chao
> 
> --Andy