Hi Ackerley,

Here are my thoughts, at least when it comes to pKVM.

On Tue, 24 Feb 2026 at 10:14, Ackerley Tng <[email protected]> wrote:
>
> Ackerley Tng <[email protected]> writes:
>
> > Ackerley Tng <[email protected]> writes:
> >
> >>
> >> [...snip...]
> >>
> > Before this lands, Sean wants, at the very minimum, an in-principle
> > agreement on guest_memfd behavior with respect to whether or not memory
> > should be preserved on conversion.
> >>
> >> [...snip...]
> >>
>
> Here's what I've come up with, following up from last guest_memfd
> biweekly.
>
> Every KVM_SET_MEMORY_ATTRIBUTES2 request will be accompanied by an
> enum set_memory_attributes_content_policy:
>
>     enum set_memory_attributes_content_policy {
>         SET_MEMORY_ATTRIBUTES_CONTENT_ZERO,
>         SET_MEMORY_ATTRIBUTES_CONTENT_READABLE,
>         SET_MEMORY_ATTRIBUTES_CONTENT_ENCRYPTED,
>     }
>
> Within guest_memfd's KVM_SET_MEMORY_ATTRIBUTES2 handler, guest_memfd
> will make an arch call
>
>     kvm_gmem_arch_content_policy_supported(kvm, policy, gfn, nr_pages)
>
> where every arch will get to return some error if the requested policy
> is not supported for the given range.

This hook provides the validation mechanism pKVM requires.

> ZERO is the simplest of the above, it means that after the conversion
> the memory will be zeroed for the next reader.
>
> + TDX and SNP today will support ZERO since the firmware handles
>   zeroing.
> + pKVM and SW_PROTECTED_VM will apply software zeroing.
> + Purpose: having this policy in the API allows userspace to be sure
>   that the memory is zeroed after the conversion - there is no need to
>   zero again in userspace (addresses concern that Sean pointed out)
>
> READABLE means that after the conversion, the memory is readable by
> userspace (if converting to shared) or readable by the guest (if
> converting to private).
>
> + TDX and SNP (today) can't support this, so return -EOPNOTSUPP
> + SW_PROTECTED_VM will support this and do nothing extra on
>   conversion, since there is no encryption anyway and all content
>   remains readable.
> + pKVM will make use of the arch function above.
>
> Here's where I need input: (David's questions during the call about
> the full flow beginning with the guest prompted this).
>
> Since pKVM doesn't encrypt the memory contents, there must be some way
> that pKVM can say no when userspace requests to convert and retain
> READABLE contents? I think pKVM's arch function can be used to check
> if the guest previously made a conversion request. Fuad, to check that
> the guest made a conversion request, what's other parameters are
> needed other than gfn and nr_pages?

I think the gfn and nr_pages parameters are enough.

To clarify how pKVM would use this hook: all memory sharing and
unsharing must be initiated by the guest via a hypercall. When the guest
issues this hypercall, the pKVM hypervisor (EL2) exits to the host
kernel (EL1). The host kernel records the exit reason (share or unshare)
along with the specific memory address in the kvm_run structure before
exiting to userspace.

We do not track this pending conversion state in the hypervisor. If a
compromised host kernel wants to lie and corrupt the state, it can crash
the system or the guest (which is an accepted DoS risk), but it cannot
compromise guest confidentiality because EL2 still strictly enforces
Stage-2 permissions. Our primary goal here is to prevent a malicious or
buggy userspace VMM from crashing the system.

When the VMM subsequently issues the KVM_SET_MEMORY_ATTRIBUTES2 ioctl
with the READABLE policy, we will use the
kvm_gmem_arch_content_policy_supported() hook in EL1 to validate the
ioctl. We will cross-reference the requested gfn and nr_pages against
the pending exit reason stored in kvm_run. If the VMM attempts an
unsolicited conversion (i.e., there is no matching exit request in
kvm_run, or the addresses do not match), our current plan is to reject
the request and return an error.
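
To make the check concrete, here is a rough, userspace-compilable sketch
of the validation. It is not the actual patch. struct pending_conversion
and pkvm_check_readable_conversion() are hypothetical names standing in
for the state recorded in kvm_run and for the pKVM side of the hook. The
exact-match rule and the -EINVAL return value are assumptions as well,
since neither is settled above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/*
 * Hypothetical: the share/unshare request that the host kernel recorded
 * when the guest's hypercall caused an exit. In the real flow this
 * information would come from kvm_run; the layout here is made up.
 */
struct pending_conversion {
	bool valid;		/* a guest-initiated request is outstanding */
	gfn_t gfn;		/* first gfn named by the guest */
	uint64_t nr_pages;	/* length of the guest's request, in pages */
};

/*
 * Hypothetical helper that pKVM's implementation of
 * kvm_gmem_arch_content_policy_supported() could call for READABLE.
 * Returns 0 if the range matches the guest's request, or a negative
 * errno otherwise. -EINVAL is only a placeholder.
 */
int pkvm_check_readable_conversion(const struct pending_conversion *pending,
				   gfn_t gfn, uint64_t nr_pages)
{
	assert(pending);	/* a missing record is a caller bug */

	/* Unsolicited: the guest never asked for a conversion. */
	if (!pending->valid)
		return -EINVAL;

	/* Assumption: the ioctl range must match the request exactly. */
	if (pending->gfn != gfn || pending->nr_pages != nr_pages)
		return -EINVAL;

	return 0;
}
```

Whether a VMM should be allowed to convert only part of the range the
guest asked for is left open here; the sketch takes the strictest
reading.
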
In the future, rather than outright rejecting an unsolicited conversion,
we might evolve this to treat it as a host-initiated destructive
reclaim, forcing an unshare and zeroing the memory. For the time being,
explicit rejection is the simplest and safest path.

> ENCRYPTED means that after the conversion, the memory contents are
> retained as-is, with no decryption.
>
> + TDX and SNP (today) can't support this, so return -EOPNOTSUPP
> + pKVM and SW_PROTECTED_VM can do nothing, but doing nothing retains
>   READABLE content, not ENCRYPTED content, hence SW_PROTECTED_VM
>   should return -EOPNOTSUPP.
> + Michael, you mentioned during the call that SNP is planning to
>   introduce a policy that retains the ENCRYPTED version for a special
>   GHCB call. ENCRYPTED is meant for that use case. Does it work? I'm
>   assuming that SNP should only support this policy given some
>   conditions, so would the arch call as described above work?
> + If this policy is specified on conversion from shared to private,
>   always return -EOPNOTSUPP.
> + When this first lands, ENCRYPTED will not be a valid option, but I'm
>   listing it here so we have line of sight to having this support.
>
> READABLE and ENCRYPTED defines the state after conversion clearly
> (instead of DONT_CARE or similar).
>
> DESTROY could be another policy, which means that after the
> conversion, the memory is unreadable. This is the option to address
> what David brought up during the call, for cases where userspace knows
> it is going to free the memory already and doesn't care about the
> state as long as nobody gets to read it. This will not implemented
> when feature first lands, but is presented here just to show how this
> can be extended in future.
>
> Right now, I'm thinking that one of the above policies MUST be
> specified (not specifying a policy will result in -EINVAL).
>
> How does this sound?

I don't think that returning -EINVAL is the right thing to do here.
If userspace omits the policy, the API should default to
SET_MEMORY_ATTRIBUTES_CONTENT_ZERO and proceed with the conversion. I
believe that, in Linux APIs in general, omitting an optional behavior
flag results in the safest, most standard default action.

Also, returning -EINVAL when no policy is specified makes the policy
parameter strictly mandatory. This makes it difficult for userspace to
seamlessly request clean-slate, destructive conversions. Defaulting to
ZERO gives deterministic behavior across pKVM, TDX, and SNP, whether the
zeroing is done in software or by firmware, and isolates the KVM uAPI
from micro-architectural data destruction nuances.

Cheers,
/fuad
