Michael Roth <[email protected]> writes:
>
> [...snip...]
>
> I made a super-long-winded reply to that thread, but to summarize:
>
> PRESERVE flag has different enumeration/behavior/enforcement for pre-launch
> vs. post-launch, and similar considerations might come into play for
> other flags, so to make it easier to enumerate what flags are available
> for pre-launch/post-launch, maybe we could have 2 capabilities instead
> of 1:
>
> KVM_CAP_MEMORY_ATTRIBUTES2_PRE_LAUNCH_FLAGS
> KVM_CAP_MEMORY_ATTRIBUTES2_FLAGS
>
> where SNP/TDX would only advertise PRESERVE for PRE_LAUNCH, and pKVM I
> guess would enumerate it for both (or maybe just POST_LAUNCH?)
>
> That lets us keep the flags definitions more straightforward but still
> allows userspace to easily enumerate what exactly should be available at
> pre vs. post launch time, and give us some flexibility to detail
> variations in behavior between the 2 phases without documenting
> edge-cases in terms of VM types.
>
Oops Michael I only read this after the meeting today.
Sean, today at guest_memfd biweekly we also discussed this topic. I
brought up this topic because IMO the interface is starting to get
a little awkward, I'm struggling to put the awkwardness into words.
Here are some awkward points:
For PRESERVE, even though it is defined (now) as that what the host
writes will be readable in the guest, it only works for both to-private
and to-shared conversions for KVM_X86_SW_PROTECTED_VMs and pKVM. That's
because guest_memfd doesn't actually invoke encryption during the
conversion. For TDX and SNP, the encryption can only be done before the
VM is finalized, through vendor-specific ioctls that go through
kvm_gmem_populate() to load memory into the guest.
For ZERO, it is defined in api.rst that ZERO is not supported for
to-private conversions, and the rationale there was that when ZEROing,
guest_memfd/KVM can zero, but it's really the contract between the guest
and the vendor trusted firmware whether the guest sees zeros later.
Another awkward point is that ZERO was meant to enable an optimization
for TDX since the firmware zeroes memory, but it actually only zeroes
memory when the page is unmapped from Secure EPTs. guest_memfd (for now)
doesn't track whether the page was unmapped from Secure EPTs as part of
the conversion, so guest_memfd can't assume it was mapped before the
conversion request. To uphold the ZERO contract with userspace,
guest_memfd applies zeroing for TDX anyway.
Summarizing from guest_memfd biweekly today:
David suggested enumerating the combinations, something like
`SHARED_ZERO` and friends (since to-private and ZERO is not supported)
and Michael then brought up the other axis of pre/post launch. IIRC
there might be another axis since pKVM would need to determine
dynamically if a to_shared conversion can be permitted for the range
being converted, based on whether the guest had requested a to_shared
conversion.
I think this might just result in too many flags, and could paint us
into a corner if more options get supported later.
I spent even more time thinking about this today. I get that we want a
consistent contract to userspace, can we scope the contract differently?
What if we scope as "what KVM guarantees the content will look like
after guest_memfd updates attributes"? This is a smaller contract, since
it doesn't promise anything about what the guest sees. Running this
through a few examples:
+ Pre-finalize, SNP, to-private, PRESERVE: guest_memfd guarantees that
after setting memory attributes, the contents of the pages will not
change. The contents are then ready for populate. What populate does
to the memory is another contract between SNP and the guest that is
out of scope of guest_memfd's contract.
+ Post-finalize, SNP, to-private, PRESERVE: guest_memfd guarantees that
after setting memory attributes, the contents of the pages will not
change. SNP's contract with the guest does not, though. After the page
gets faulted in, the guest sees scrambled data. This may be a
meaningless operation now, but it leaves the door open so perhaps we
could have an SNP-specific ioctl in future where step 1 is to set
memory attributes within guest_memfd to private and step 2 is to
encrypt in place.
+ pKVM, to-private, PRESERVE: guest_memfd guarantees that after setting
memory attributes, the contents of the pages don't change. Separately,
pKVM doesn't do encryption, so the pKVM guest reads the same contents
the host wrote. The distinction here from the current state is that
guest_memfd didn't guarantee that the pKVM guest will see the same
content the host wrote since that's a separate contract between the
pKVM guest and pKVM.
+ Post-finalize, TDX, to-shared, ZERO: guest_memfd guarantees that
contents of the pages will be zeroed in the process of updating
guest_memfd attributes. Host userspace reads zeros after faulting it
in, which is because guest_memfd did zero the pages after conversion
to shared. A future optimization is possible, where guest_memfd only
zeroes the pages that were unmapped from Secure EPTs, since (this
version of) TDX zeros memory when unmapping from Secure EPTs.
+ Post-finalize, TDX, to-shared, PRESERVE: -EOPNOTSUPP. guest_memfd is
unable to guarantee that the process of setting memory attributes will
not change memory contents. The process of setting memory attributes
requires unmapping from Secure EPTs, which will zero the memory. (In
future, if we want to relax this, we could permit this if nothing in
the requested range was mapped in Secure EPTs)
+ Post-finalize, SNP, to-shared, PRESERVE: guest_memfd guarantees that
after setting memory attributes, the contents of the pages will not
change. For SNP, unmapping doesn't change memory contents? The guest
reads garbage, and that's a separate contract between SNP and the
guest. In the guest_memfd contract, guest_memfd PRESERVEs the memory
contents in the process of setting memory attributes, and can fulfil
that.
+ Post-finalize, TDX, to-private, ZERO: guest_memfd zeroes the shared
memory before updating the attributes to be private, because it
promised to. If this memory gets faulted in to Secure EPTs, TDX
firmware zeros it again, because that's TDX's contract with the
guest. I can't see any benefit to userspace in using this combination,
but the guest_memfd contract and implementation are simple.
TLDR:
+ PRESERVE == guarantee that the process of setting memory attributes
doesn't change memory contents.
+ implementation == do nothing in most cases, except -EOPNOTSUPP for
to-shared on TDX, since unmapping is a required part of setting
memory attributes to private, and a TDX side effect of unmapping
is zeroing memory,
+ ZERO == guarantee that the process of setting memory attributes zeroes
memory contents.
+ implementation == memset(zero) in most cases. For TDX, a future
optimization exists, where memset() can be skipped for pages that
were mapped in Secure EPTs before conversion
+ UNSPECIFIED == no guarantees
+ implementation == guest_memfd does nothing explicitly about memory
contents. The implementation is pretty much the same as PRESERVE
except guest_memfd won't take into account vendor-specific side
effects of the process of conversion. Except for the test vehicle
KVM_X86_SW_PROTECTED_VMS, where memory is scrambled.
>>
>> [...snip...]
>>
>
> Looking at the example you have there:
>
> + Note: These content modes apply to the entire requested range, not
> + just the parts of the range that underwent conversion. For example, if
> + this was the initial state:
> +
> + * [0x0000, 0x1000): shared
> + * [0x1000, 0x2000): private
> + * [0x2000, 0x3000): shared
> + and range [0x0000, 0x3000) was set to shared, the content mode would
> + apply to all memory in [0x0000, 0x3000), not just the range that
> + underwent conversion [0x1000, 0x2000).
>
> Userspace would be aware of whether the range contains pages that were
> already set to private, so if it really wants to set the just the
> [0x1000, 0x2000) range to shared with appropriate content mode, it is
> fully able to do so by just issuing the ioctl for that specific range.
> If it attempts to issue it for the entire range, it only seems like it
> would defy normal expectations and cause confusion to skip ranges, and
> I'm not sure it gains us anything useful in exchange for that potential
> confusion.
>
Great that we're aligned here :) No complaints from guest_memfd biweekly
today as well :)
>>
>> [...snip...]
>>