Ackerley Tng <[email protected]> writes:

>
> [...snip...]
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 23ec0b0c3e22..26e80745c8b4 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -117,7 +117,7 @@ description:
>        x86 includes both i386 and x86_64.
>
>    Type:
> -      system, vm, or vcpu.
> +      system, vm, vcpu or guest_memfd.
>
>    Parameters:
>        what parameters are accepted by the ioctl.
> @@ -6523,11 +6523,22 @@ the capability to be present.
>  ---------------------------------
>
>  :Capability: KVM_CAP_MEMORY_ATTRIBUTES2
> -:Architectures: x86
> -:Type: vm ioctl
> +:Architectures: all
> +:Type: vm, guest_memfd ioctl
>  :Parameters: struct kvm_memory_attributes2 (in/out)
>  :Returns: 0 on success, <0 on error
>
> +Errors:
> +
> +  ========== ===============================================================
> +  EINVAL     The specified `offset` or `size` were invalid (e.g. not
> +             page aligned, causes an overflow, or size is zero).
> +  EFAULT     The parameter address was invalid.
> +  EAGAIN     Some page within requested range had unexpected refcounts. The
> +             offset of the page will be returned in `error_offset`.
> +  ENOMEM     Ran out of memory trying to track private/shared state
> +  ========== ===============================================================
> +
>  KVM_SET_MEMORY_ATTRIBUTES2 is an extension to
>  KVM_SET_MEMORY_ATTRIBUTES that supports returning (writing) values to
>  userspace. The original (pre-extension) fields are shared with
> @@ -6538,15 +6549,42 @@ Attribute values are shared with
>  KVM_SET_MEMORY_ATTRIBUTES.
>  ::
>
>    struct kvm_memory_attributes2 {
> -        __u64 address;
> +        /* in */
> +        union {
> +                __u64 address;
> +                __u64 offset;
> +        };
>          __u64 size;
>          __u64 attributes;
>          __u64 flags;
> -        __u64 reserved[12];
> +        /* out */
> +        __u64 error_offset;
> +        __u64 reserved[11];
>    };
>
>    #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
>
> +Set attributes for a range of offsets within a guest_memfd to
> +KVM_MEMORY_ATTRIBUTE_PRIVATE to limit the specified guest_memfd backed
> +memory range for guest_use. Even if KVM_CAP_GUEST_MEMFD_MMAP is
> +supported, after a successful call to set
> +KVM_MEMORY_ATTRIBUTE_PRIVATE, the requested range will not be mappable
> +into host userspace and will only be mappable by the guest.
> +
> +To allow the range to be mappable into host userspace again, call
> +KVM_SET_MEMORY_ATTRIBUTES2 on the guest_memfd again with
> +KVM_MEMORY_ATTRIBUTE_PRIVATE unset.
> +
> +If this ioctl returns -EAGAIN, the offset of the page with unexpected
> +refcounts will be returned in `error_offset`. This can occur if there
> +are transient refcounts on the pages, taken by other parts of the
> +kernel.
> +
> +Userspace is expected to figure out how to remove all known refcounts
> +on the shared pages, such as refcounts taken by get_user_pages(), and
> +try the ioctl again. A possible source of these long term refcounts is
> +if the guest_memfd memory was pinned in IOMMU page tables.
> +
>  See also: :ref: `KVM_SET_MEMORY_ATTRIBUTES`.
>
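For illustration only, the EINVAL conditions listed in the quoted Errors table (not page aligned, causes an overflow, size is zero) could be pre-checked in userspace roughly as below. This is a sketch under assumptions: gmem_check_range() and its page_size parameter are hypothetical names that do not appear in the patch, and the checks the kernel actually performs may differ.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical userspace pre-check mirroring the documented EINVAL cases
 * for an (offset, size) range. Returns 0 if the range looks acceptable,
 * -EINVAL otherwise. Not taken from the patch.
 */
int gmem_check_range(uint64_t offset, uint64_t size, uint64_t page_size)
{
	/* The alignment mask below only works for a nonzero power of two. */
	if (!page_size || (page_size & (page_size - 1)))
		return -EINVAL;

	/* "size is zero" */
	if (!size)
		return -EINVAL;

	/* "not page aligned": both offset and size must be page multiples. */
	if ((offset | size) & (page_size - 1))
		return -EINVAL;

	/* "causes an overflow": offset + size must not wrap around. */
	if (offset + size < offset)
		return -EINVAL;

	return 0;
}
```

A caller would run this before filling in `offset` and `size` in struct kvm_memory_attributes2. It cannot predict -EAGAIN or -ENOMEM, which depend on kernel state at the time of the call.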
Transferring/re-summarizing an internal comment from Sean upstream
here! We can also follow up on this topic at the next guest_memfd
biweekly.

Before this lands, Sean wants, at the very minimum, an in-principle
agreement on guest_memfd behavior with respect to whether or not memory
should be preserved on conversion.

Sean is against deferring whether to preserve memory to the underlying
hardware because that (effectively) lets micro-architectural behavior
define KVM's ABI. KVM's uAPI cannot let behavior be undefined, or be
based on vendor, and maybe even on firmware version.

Sean says that all decisions that affect guest data must be made by
userspace. The architecture can restrict what is possible, e.g. neither
SNP nor TDX currently supports "generic" in-place conversion, but
whether or not data is to be preserved must be an explicit request from
userspace. If preserving data is impossible, then KVM needs to reject
the request.

(Vendor specific ioctls are out-of-scope; the SNP and TDX cases were
brought up purely to highlight that there's nothing that fundamentally
prevents preserving data on conversion.)

I suggested a few uAPI options for configuring content preservation on
conversion:

1. guest_memfd creation time flag like
   GUEST_MEMFD_FLAG_PRESERVE_CONTENTS. This can be valid only if the
   kernel and vendor support content preservation.

   This was rejected because we should not assume all current and
   future use cases will want the same content preservation config for
   a given guest_memfd.

2. KConfig: automatically select to preserve contents if the
   architecture supports content preservation.

   This was rejected because it's not a decision explicitly made by
   userspace.

3. KVM module param to configure content preservation.

   This was rejected because the configuration may not generalize
   across all VMs on the same host.

4. guest_memfd ioctl flag
   SET_MEMORY_ATTRIBUTES2_FLAG_PRESERVE_CONTENTS. -EINVAL if kernel and
   vendor don't support content preservation.

Specifying a flag to choose whether content should be preserved at
conversion-time is the current best suggestion. What does the rest of
the community think of a conversion ioctl flag to choose whether to
preserve memory contents on conversion?

Fuad, I think you also made a related comment on an earlier internal
version we were working on. What do you/pKVM think?

>
> [...snip...]
>
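To make option 4 above concrete, here is a minimal sketch of how the flag check might look, assuming the flag keeps the name SET_MEMORY_ATTRIBUTES2_FLAG_PRESERVE_CONTENTS. The bit value, the VALID_FLAGS mask, the helper name gmem_check_conversion_flags() and the can_preserve parameter are all placeholders for illustration, not anything agreed upon or present in the series.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag name is from the proposal; the bit value is a placeholder. */
#define SET_MEMORY_ATTRIBUTES2_FLAG_PRESERVE_CONTENTS	(1ULL << 0)

/* Hypothetical mask of every flag this sketch knows about. */
#define SET_MEMORY_ATTRIBUTES2_VALID_FLAGS \
	SET_MEMORY_ATTRIBUTES2_FLAG_PRESERVE_CONTENTS

/*
 * Hypothetical check of the `flags` field of a conversion request.
 * can_preserve stands in for "kernel and vendor support content
 * preservation". Returns 0 if the request may proceed, -EINVAL if not.
 */
int gmem_check_conversion_flags(uint64_t flags, bool can_preserve)
{
	/* Reject unknown bits so they stay available for future use. */
	if (flags & ~SET_MEMORY_ATTRIBUTES2_VALID_FLAGS)
		return -EINVAL;

	/* Preservation must be refused, not silently dropped. */
	if ((flags & SET_MEMORY_ATTRIBUTES2_FLAG_PRESERVE_CONTENTS) &&
	    !can_preserve)
		return -EINVAL;

	return 0;
}
```

The point of the sketch is that userspace states its intent per conversion, and a request KVM cannot honor fails with -EINVAL instead of falling back to whatever the hardware happens to do.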
