Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events
On 10/28/2023 2:21 AM, Sean Christopherson wrote: Add flags to "struct kvm_gfn_range" to let notifier events target only shared and only private mappings, and write up the existing mmu_notifier events to be shared-only (private memory is never associated with a userspace virtual address, i.e. can't be reached via mmu_notifiers). Add two flags so that KVM can handle the three possibilities (shared, private, and shared+private) without needing something like a tri-state enum. I see the two flags are set/cleared in __kvm_handle_hva_range() in this patch and kvm_handle_gfn_range() from the later patch 13/35, but I didn't see they are used/read in this patch series if I didn't miss anything. How are they supposed to be used in KVM? Link: https://lore.kernel.org/all/zjx0hk+kpqp0k...@google.com Signed-off-by: Sean Christopherson --- include/linux/kvm_host.h | 2 ++ virt/kvm/kvm_main.c | 7 +++ 2 files changed, 9 insertions(+) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 96aa930536b1..89c1a991a3b8 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -263,6 +263,8 @@ struct kvm_gfn_range { gfn_t start; gfn_t end; union kvm_mmu_notifier_arg arg; + bool only_private; + bool only_shared; bool may_block; }; bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index cb9376833c18..302ccb87b4c1 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, * the second or later invocation of the handler). */ gfn_range.arg = range->arg; + + /* +* HVA-based notifications aren't relevant to private +* mappings as they don't have a userspace mapping. +*/ + gfn_range.only_private = false; + gfn_range.only_shared = true; gfn_range.may_block = range->may_block; /*
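To illustrate how the two flags could be consumed (this is only a sketch, not code from the series; arch_unmap_gfn_range(), zap_shared_mappings() and zap_private_mappings() are made-up names), an arch-level handler for a VM type with separate shared/private page tables might branch on them, treating "both flags clear" as shared + private:

/* Placeholders for arch-specific zap helpers, not real KVM APIs. */
static bool zap_shared_mappings(struct kvm *kvm, gfn_t start, gfn_t end);
static bool zap_private_mappings(struct kvm *kvm, gfn_t start, gfn_t end);

static bool arch_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
        bool flush = false;

        /* Both flags clear means the event targets shared AND private. */
        if (!range->only_private)
                flush |= zap_shared_mappings(kvm, range->start, range->end);
        if (!range->only_shared)
                flush |= zap_private_mappings(kvm, range->start, range->end);

        return flush;
}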
Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
On 10/28/2023 2:21 AM, Sean Christopherson wrote: From: Chao Peng Add a new KVM exit type to allow userspace to handle memory faults that KVM cannot resolve, but that userspace *may* be able to handle (without terminating the guest). KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit conversions between private and shared memory. With guest private memory, there will be two kind of memory conversions: - explicit conversion: happens when the guest explicitly calls into KVM to map a range (as private or shared) - implicit conversion: happens when the guest attempts to access a gfn that is configured in the "wrong" state (private vs. shared) On x86 (first architecture to support guest private memory), explicit conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE, but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesriable as there is (obviously) no hypercall, and there is no guarantee that the guest actually intends to convert between private and shared, i.e. what KVM thinks is an implicit conversion "request" could actually be the result of a guest code bug. KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to be implicit conversions. Note! To allow for future possibilities where KVM reports KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's perspective), not '0'! Is "-EHWPOISON" case not considered unresolved, so it is not mentioned here? Due to historical baggage within KVM, exiting to userspace with '0' from deep callstacks, e.g. in emulation paths, is infeasible as doing so would require a near-complete overhaul of KVM, whereas KVM already propagates -errno return codes to userspace even when the -errno originated in a low level helper. Report the gpa+size instead of a single gfn even though the initial usage is expected to always report single pages. It's entirely possible, likely even, that KVM will someday support sub-page granularity faults, e.g. Intel's sub-page protection feature allows for additional protections at 128-byte granularity. Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoor...@google.com Link: https://lore.kernel.org/all/zq3amlo2syv3d...@google.com Cc: Anish Moorthy Cc: David Matlack Suggested-by: Sean Christopherson Co-developed-by: Yu Zhang Signed-off-by: Yu Zhang Signed-off-by: Chao Peng Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson --- Documentation/virt/kvm/api.rst | 41 ++ arch/x86/kvm/x86.c | 1 + include/linux/kvm_host.h | 11 + include/uapi/linux/kvm.h | 8 +++ 4 files changed, 61 insertions(+) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index ace984acc125..860216536810 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -6723,6 +6723,26 @@ array field represents return values. The userspace should update the return values of SBI call before resuming the VCPU. For more details on RISC-V SBI spec refer, https://github.com/riscv/riscv-sbi-doc. +:: + + /* KVM_EXIT_MEMORY_FAULT */ + struct { + __u64 flags; + __u64 gpa; + __u64 size; + } memory; + +KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that +could not be resolved by KVM. The 'gpa' and 'size' (in bytes) describe the +guest physical address range [gpa, gpa + size) of the fault. The 'flags' field +describes properties of the faulting access that are likely pertinent. +Currently, no flags are defined. + +Note! 
KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it +accompanies a return code of '-1', not '0'! errno will always be set to EFAULT +or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume +kvm_run.exit_reason is stale/undefined for all other error numbers. + :: /* KVM_EXIT_NOTIFY */ @@ -7757,6 +,27 @@ This capability is aimed to mitigate the threat that malicious VMs can cause CPU stuck (due to event windows don't open up) and make the CPU unavailable to host or other VMs. +7.34 KVM_CAP_MEMORY_FAULT_INFO +-- + +:Architectures: x86 +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP. + +The presence of this capability indicates that KVM_RUN will fill +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if +there is a valid memslot but no backing VMA for the corresponding host virtual +address. + +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set +to KVM_EXIT_MEMORY_FAULT. + +Note: Userspaces which attempt to resolve memory faults so that they can retry +KVM_RUN ar
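For completeness, a userspace sketch of the documented contract (-1 plus EFAULT/EHWPOISON, and only then is exit_reason meaningful). handle_memory_fault() is a hypothetical VMM policy hook, and I am assuming the union member is named memory_fault as the capability text above says:

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical VMM policy hook, e.g. convert [gpa, gpa + size) and retry. */
int handle_memory_fault(__u64 gpa, __u64 size, __u64 flags);

static int run_vcpu_once(int vcpu_fd, struct kvm_run *run)
{
        int ret = ioctl(vcpu_fd, KVM_RUN, 0);

        /*
         * exit_reason is only meaningful for -1 + EFAULT/EHWPOISON; for any
         * other errno it must be treated as stale/undefined.
         */
        if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
            run->exit_reason == KVM_EXIT_MEMORY_FAULT)
                return handle_memory_fault(run->memory_fault.gpa,
                                           run->memory_fault.size,
                                           run->memory_fault.flags);

        return ret;
}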
Re: [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes
On 9/21/2023 5:03 AM, Sean Christopherson wrote:
On Mon, Sep 18, 2023, Binbin Wu wrote:
On 9/14/2023 9:55 AM, Sean Christopherson wrote:
From: Chao Peng
[...]

+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+/*
+ * Returns true if _all_ gfns in the range [@start, @end) have attributes
+ * matching @attrs.
+ */
+bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+                                     unsigned long attrs)
+{
+        XA_STATE(xas, &kvm->mem_attr_array, start);
+        unsigned long index;
+        bool has_attrs;
+        void *entry;
+
+        rcu_read_lock();
+
+        if (!attrs) {
+                has_attrs = !xas_find(&xas, end);

IIUIC, xas_find() is inclusive for "end", so here should be "end - 1"?

Yes, that does appear to be the case. Inclusive vs. exclusive on gfn ranges is
the bane of my existence.

It seems this fix is not included in the "KVM: guest_memfd fixes" patch series?
https://lore.kernel.org/kvm/20230921203331.3746712-1-sea...@google.com/
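For reference, the fix being discussed should presumably be just the exclusive-end adjustment, i.e. something along these lines (untested):

 	if (!attrs) {
-		has_attrs = !xas_find(&xas, end);
+		/* xas_find() takes an inclusive last index, @end is exclusive. */
+		has_attrs = !xas_find(&xas, end - 1);
 		goto out;
 	}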
Re: [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
On 9/20/2023 10:24 PM, Sean Christopherson wrote: On Tue, Sep 19, 2023, Binbin Wu wrote: On 9/14/2023 9:55 AM, Sean Christopherson wrote: [...] + +static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start, + pgoff_t end) +{ + struct kvm_memory_slot *slot; + struct kvm *kvm = gmem->kvm; + unsigned long index; + bool flush = false; + + KVM_MMU_LOCK(kvm); + + kvm_mmu_invalidate_begin(kvm); + + xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) { + pgoff_t pgoff = slot->gmem.pgoff; + + struct kvm_gfn_range gfn_range = { + .start = slot->base_gfn + max(pgoff, start) - pgoff, + .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff, + .slot = slot, + .may_block = true, + }; + + flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range); + } + + if (flush) + kvm_flush_remote_tlbs(kvm); + + KVM_MMU_UNLOCK(kvm); +} + +static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start, + pgoff_t end) +{ + struct kvm *kvm = gmem->kvm; + + KVM_MMU_LOCK(kvm); + if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) + kvm_mmu_invalidate_end(kvm); kvm_mmu_invalidate_begin() is called unconditionally in kvm_gmem_invalidate_begin(), but kvm_mmu_invalidate_end() is not here. This makes the kvm_gmem_invalidate_{begin, end}() calls asymmetric. Another ouch :-( And there should be no need to acquire mmu_lock() unconditionally, the inode's mutex protects the bindings, not mmu_lock. I'll get a fix posted today. I think KVM can also add a sanity check to detect unresolved invalidations, e.g. diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 7ba1ab1832a9..2a2d18070856 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1381,8 +1381,13 @@ static void kvm_destroy_vm(struct kvm *kvm) * No threads can be waiting in kvm_swap_active_memslots() as the * last reference on KVM has been dropped, but freeing * memslots would deadlock without this manual intervention. +* +* If the count isn't unbalanced, i.e. KVM did NOT unregister between +* a start() and end(), then there shouldn't be any in-progress +* invalidations. */ WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait)); + WARN_ON(!kvm->mn_active_invalidate_count && kvm->mmu_invalidate_in_progress); kvm->mn_active_invalidate_count = 0; #else kvm_flush_shadow_all(kvm); or an alternative style if (kvm->mn_active_invalidate_count) kvm->mn_active_invalidate_count = 0; else WARN_ON(kvm->mmu_invalidate_in_progress) + KVM_MMU_UNLOCK(kvm); +} + +static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len) +{ + struct list_head *gmem_list = &inode->i_mapping->private_list; + pgoff_t start = offset >> PAGE_SHIFT; + pgoff_t end = (offset + len) >> PAGE_SHIFT; + struct kvm_gmem *gmem; + + /* +* Bindings must stable across invalidation to ensure the start+end +* are balanced. +*/ + filemap_invalidate_lock(inode->i_mapping); + + list_for_each_entry(gmem, gmem_list, entry) { + kvm_gmem_invalidate_begin(gmem, start, end); + kvm_gmem_invalidate_end(gmem, start, end); + } Why to loop for each gmem in gmem_list here? IIUIC, offset is the offset according to the inode, it is only meaningful to the inode passed in, i.e, it is only meaningful to the gmem binding with the inode, not others. The code is structured to allow for multiple gmem instances per inode. This isn't actually possible in the initial code base, but it's on the horizon[*]. 
I included the list-based infrastructure in this initial series to ensure that guest_memfd can actually support multiple files per inode, and to minimize the churn when the "link" support comes along. [*] https://lore.kernel.org/all/cover.1691446946.git.ackerley...@google.com Got it, thanks for the explanation!
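Going back to the begin()/end() asymmetry earlier in this mail: one possible shape of the rebalanced begin() Sean mentions (a sketch only, not the posted fix) is to defer mmu_lock and kvm_mmu_invalidate_begin() until a binding is actually found, mirroring the conditional end():

static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
                                      pgoff_t end)
{
        struct kvm_memory_slot *slot;
        struct kvm *kvm = gmem->kvm;
        unsigned long index;
        bool flush = false, locked = false;

        xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
                pgoff_t pgoff = slot->gmem.pgoff;

                struct kvm_gfn_range gfn_range = {
                        .start = slot->base_gfn + max(pgoff, start) - pgoff,
                        .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
                        .slot = slot,
                        .may_block = true,
                };

                /* Only start an invalidation if the range hits a binding. */
                if (!locked) {
                        locked = true;
                        KVM_MMU_LOCK(kvm);
                        kvm_mmu_invalidate_begin(kvm);
                }

                flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
        }

        if (flush)
                kvm_flush_remote_tlbs(kvm);

        if (locked)
                KVM_MMU_UNLOCK(kvm);
}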
Re: [RFC PATCH v12 18/33] KVM: x86/mmu: Handle page fault for private memory
On 9/15/2023 10:26 PM, Sean Christopherson wrote: On Fri, Sep 15, 2023, Yan Zhao wrote: On Wed, Sep 13, 2023 at 06:55:16PM -0700, Sean Christopherson wrote: +static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu, + struct kvm_page_fault *fault) +{ + kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT, + PAGE_SIZE, fault->write, fault->exec, + fault->is_private); +} + +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu, + struct kvm_page_fault *fault) +{ + int max_order, r; + + if (!kvm_slot_can_be_private(fault->slot)) { + kvm_mmu_prepare_memory_fault_exit(vcpu, fault); + return -EFAULT; + } + + r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn, +&max_order); + if (r) { + kvm_mmu_prepare_memory_fault_exit(vcpu, fault); + return r; + } + + fault->max_level = min(kvm_max_level_for_order(max_order), + fault->max_level); + fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY); + + return RET_PF_CONTINUE; +} + static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) { struct kvm_memory_slot *slot = fault->slot; @@ -4293,6 +4356,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault return RET_PF_EMULATE; } + if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) { In patch 21, fault->is_private is set as: ".is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT)", then, the inequality here means memory attribute has been updated after last check. So, why an exit to user space for converting is required instead of a mere retry? Or, is it because how .is_private is assigned in patch 21 is subjected to change in future? This. Retrying on SNP or TDX would hang the guest. I suppose we could special case VMs where .is_private is derived from the memory attributes, but the SW_PROTECTED_VM type is primary a development vehicle at this point. I'd like to have it mimic SNP/TDX as much as possible; performance is a secondary concern. So when .is_private is derived from the memory attributes, and if I didn't miss anything, there is no explicit conversion mechanism introduced yet so far, does it mean for pure sw-protected VM (withouth SNP/TDX), the page fault will be handled according to the memory attributes setup by host/user vmm, no implicit conversion will be triggered, right? E.g. userspace needs to be prepared for "spurious" exits due to races on SNP and TDX, which this can theoretically exercise. Though the window is quite small so I doubt that'll actually happen in practice; which of course also makes it less important to retry instead of exiting.
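For context, the userspace side of that "exit to convert" flow would be roughly the following (a sketch only; it assumes the KVM_MEMORY_ATTRIBUTE_PRIVATE definition from the memory-attributes patch and trims error handling). The VMM flips the faulting range's PRIVATE attribute to match the access and then re-enters the guest:

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Flip the faulting range's PRIVATE attribute, then the VMM retries KVM_RUN. */
static int convert_range(int vm_fd, struct kvm_run *run, bool to_private)
{
        struct kvm_memory_attributes attrs = {
                .address    = run->memory_fault.gpa,
                .size       = run->memory_fault.size,
                .attributes = to_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
                .flags      = 0,
        };

        return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}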
Re: [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
On 9/14/2023 9:55 AM, Sean Christopherson wrote: [...] + +static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start, + pgoff_t end) +{ + struct kvm_memory_slot *slot; + struct kvm *kvm = gmem->kvm; + unsigned long index; + bool flush = false; + + KVM_MMU_LOCK(kvm); + + kvm_mmu_invalidate_begin(kvm); + + xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) { + pgoff_t pgoff = slot->gmem.pgoff; + + struct kvm_gfn_range gfn_range = { + .start = slot->base_gfn + max(pgoff, start) - pgoff, + .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff, + .slot = slot, + .may_block = true, + }; + + flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range); + } + + if (flush) + kvm_flush_remote_tlbs(kvm); + + KVM_MMU_UNLOCK(kvm); +} + +static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start, + pgoff_t end) +{ + struct kvm *kvm = gmem->kvm; + + KVM_MMU_LOCK(kvm); + if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) + kvm_mmu_invalidate_end(kvm); kvm_mmu_invalidate_begin() is called unconditionally in kvm_gmem_invalidate_begin(), but kvm_mmu_invalidate_end() is not here. This makes the kvm_gmem_invalidate_{begin, end}() calls asymmetric. + KVM_MMU_UNLOCK(kvm); +} + +static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len) +{ + struct list_head *gmem_list = &inode->i_mapping->private_list; + pgoff_t start = offset >> PAGE_SHIFT; + pgoff_t end = (offset + len) >> PAGE_SHIFT; + struct kvm_gmem *gmem; + + /* +* Bindings must stable across invalidation to ensure the start+end +* are balanced. +*/ + filemap_invalidate_lock(inode->i_mapping); + + list_for_each_entry(gmem, gmem_list, entry) { + kvm_gmem_invalidate_begin(gmem, start, end); + kvm_gmem_invalidate_end(gmem, start, end); + } Why to loop for each gmem in gmem_list here? IIUIC, offset is the offset according to the inode, it is only meaningful to the inode passed in, i.e, it is only meaningful to the gmem binding with the inode, not others. + + filemap_invalidate_unlock(inode->i_mapping); + + return 0; +} + [...]
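As an aside, the max()/min() clamping in kvm_gmem_invalidate_begin() quoted above may be easier to read with a worked example (all numbers made up):

/*
 *   slot->base_gfn   = 0x1000
 *   slot->gmem.pgoff = 0x10     file pages [0x10, 0x30) back the slot
 *   slot->npages     = 0x20
 *   punched hole     = file pages [0x18, 0x100)
 *
 *   gfn_range.start = 0x1000 + max(0x10, 0x18)  - 0x10 = 0x1008
 *   gfn_range.end   = 0x1000 + min(0x30, 0x100) - 0x10 = 0x1020
 *
 * i.e. only the portion of the hole that overlaps this binding is zapped.
 */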
Re: [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes
On 9/14/2023 9:55 AM, Sean Christopherson wrote:
From: Chao Peng
[...]

+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+/*
+ * Returns true if _all_ gfns in the range [@start, @end) have attributes
+ * matching @attrs.
+ */
+bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+                                     unsigned long attrs)
+{
+        XA_STATE(xas, &kvm->mem_attr_array, start);
+        unsigned long index;
+        bool has_attrs;
+        void *entry;
+
+        rcu_read_lock();
+
+        if (!attrs) {
+                has_attrs = !xas_find(&xas, end);

IIUIC, xas_find() is inclusive for "end", so here should be "end - 1"?

+                goto out;
+        }
+
+        has_attrs = true;
+        for (index = start; index < end; index++) {
+                do {
+                        entry = xas_next(&xas);
+                } while (xas_retry(&xas, entry));
+
+                if (xas.xa_index != index || xa_to_value(entry) != attrs) {
+                        has_attrs = false;
+                        break;
+                }
+        }
+
+out:
+        rcu_read_unlock();
+        return has_attrs;
+}
+
[...]
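To make the intended use concrete, a hypothetical caller (not from this patch; loosely modeled on how the x86 side might gate private hugepages) would look like:

static bool can_map_2m_private(struct kvm *kvm, gfn_t gfn)
{
        gfn_t start = ALIGN_DOWN(gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M));
        gfn_t end = start + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M);

        /* True only if every gfn in [start, end) is marked PRIVATE. */
        return kvm_range_has_memory_attributes(kvm, start, end,
                                               KVM_MEMORY_ATTRIBUTE_PRIVATE);
}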
Re: [RFC PATCH v12 10/33] KVM: Set the stage for handling only shared mappings in mmu_notifier events
On 9/14/2023 9:55 AM, Sean Christopherson wrote: Add flags to "struct kvm_gfn_range" to let notifier events target only shared and only private mappings, and write up the existing mmu_notifier events to be shared-only (private memory is never associated with a userspace virtual address, i.e. can't be reached via mmu_notifiers). Add two flags so that KVM can handle the three possibilities (shared, private, and shared+private) without needing something like a tri-state enum. How to understand the word "stage" in short log? Link: https://lore.kernel.org/all/zjx0hk+kpqp0k...@google.com Signed-off-by: Sean Christopherson --- include/linux/kvm_host.h | 2 ++ virt/kvm/kvm_main.c | 7 +++ 2 files changed, 9 insertions(+) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index d8c6ce6c8211..b5373cee2b08 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -263,6 +263,8 @@ struct kvm_gfn_range { gfn_t start; gfn_t end; union kvm_mmu_notifier_arg arg; + bool only_private; + bool only_shared; bool may_block; }; bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 174de2789657..a41f8658dfe0 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, * the second or later invocation of the handler). */ gfn_range.arg = range->arg; + + /* +* HVA-based notifications aren't relevant to private +* mappings as they don't have a userspace mapping. +*/ + gfn_range.only_private = false; + gfn_range.only_shared = true; gfn_range.may_block = range->may_block; /*
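For what it's worth, the flip side of the hunk above (illustration only, function name invented, not from the series): a caller that wants to hit both shared and private mappings simply leaves both flags clear:

static bool zap_slot_shared_and_private(struct kvm *kvm,
                                        struct kvm_memory_slot *slot)
{
        struct kvm_gfn_range gfn_range = {
                .start        = slot->base_gfn,
                .end          = slot->base_gfn + slot->npages,
                .slot         = slot,
                .only_private = false,  /* both clear == shared + private */
                .only_shared  = false,
                .may_block    = true,
        };

        return kvm_unmap_gfn_range(kvm, &gfn_range);
}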
Re: [RFC PATCH v12 02/33] KVM: Use gfn instead of hva for mmu_notifier_retry
On 9/14/2023 9:55 AM, Sean Christopherson wrote: From: Chao Peng Currently in mmu_notifier invalidate path, hva range is recorded and then checked against by mmu_notifier_retry_hva() in the page fault handling path. However, for the to be introduced private memory, a page fault may not have a hva associated, checking gfn(gpa) makes more sense. For existing hva based shared memory, gfn is expected to also work. The only downside is when aliasing multiple gfns to a single hva, the current algorithm of checking multiple ranges could result in a much larger range being rejected. Such aliasing should be uncommon, so the impact is expected small. Suggested-by: Sean Christopherson Signed-off-by: Chao Peng Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba [sean: convert vmx_set_apic_access_page_addr() to gfn-based API] Signed-off-by: Sean Christopherson --- arch/x86/kvm/mmu/mmu.c | 10 ++ arch/x86/kvm/vmx/vmx.c | 11 +-- include/linux/kvm_host.h | 33 + virt/kvm/kvm_main.c | 40 +++- 4 files changed, 63 insertions(+), 31 deletions(-) [...] -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start, - unsigned long end) +void kvm_mmu_invalidate_begin(struct kvm *kvm) { + lockdep_assert_held_write(&kvm->mmu_lock); /* * The count increase must become visible at unlock time as no * spte can be established without taking the mmu_lock and * count is also read inside the mmu_lock critical section. */ kvm->mmu_invalidate_in_progress++; + + if (likely(kvm->mmu_invalidate_in_progress == 1)) + kvm->mmu_invalidate_range_start = INVALID_GPA; +} + +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end) +{ + lockdep_assert_held_write(&kvm->mmu_lock); + + WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress); + if (likely(kvm->mmu_invalidate_in_progress == 1)) { kvm->mmu_invalidate_range_start = start; kvm->mmu_invalidate_range_end = end; @@ -771,6 +781,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start, } } +static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) +{ + kvm_mmu_invalidate_range_add(kvm, range->start, range->end); + return kvm_unmap_gfn_range(kvm, range); +} + static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, const struct mmu_notifier_range *range) { @@ -778,7 +794,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, const struct kvm_mmu_notifier_range hva_range = { .start = range->start, .end= range->end, - .handler= kvm_unmap_gfn_range, + .handler= kvm_mmu_unmap_gfn_range, .on_lock= kvm_mmu_invalidate_begin, .on_unlock = kvm_arch_guest_memory_reclaimed, .flush_on_ret = true, @@ -817,8 +833,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, return 0; } -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start, - unsigned long end) +void kvm_mmu_invalidate_end(struct kvm *kvm) { /* * This sequence increase will notify the kvm page fault that @@ -833,6 +848,13 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start, * in conjunction with the smp_rmb in mmu_invalidate_retry(). */ kvm->mmu_invalidate_in_progress--; + + /* +* Assert that at least one range must be added between start() and +* end(). Not adding a range isn't fatal, but it is a KVM bug. +*/ + WARN_ON_ONCE(kvm->mmu_invalidate_in_progress && +kvm->mmu_invalidate_range_start == INVALID_GPA); Should the check happen before the decrease of kvm->mmu_invalidate_in_progress? 
Otherwise, if KVM calls kvm_mmu_invalidate_begin() and then kvm_mmu_invalidate_end()
without adding a range in between, the check will not take effect.

 }

 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
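If so, one way to make the assert effective in the common, non-nested case (just a sketch) would be to drop the in-progress check from the condition:

 	kvm->mmu_invalidate_in_progress--;
 
 	/*
 	 * Assert that at least one range must be added between start() and
 	 * end().  Not adding a range isn't fatal, but it is a KVM bug.
 	 */
-	WARN_ON_ONCE(kvm->mmu_invalidate_in_progress &&
-		     kvm->mmu_invalidate_range_start == INVALID_GPA);
+	WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA);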
Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
On 8/31/2023 12:44 AM, Ackerley Tng wrote: Binbin Wu writes: +static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len) +{ + struct address_space *mapping = inode->i_mapping; + pgoff_t start, index, end; + int r; + + /* Dedicated guest is immutable by default. */ + if (offset + len > i_size_read(inode)) + return -EINVAL; + + filemap_invalidate_lock_shared(mapping); + + start = offset >> PAGE_SHIFT; + end = (offset + len) >> PAGE_SHIFT; + + r = 0; + for (index = start; index < end; ) { + struct folio *folio; + + if (signal_pending(current)) { + r = -EINTR; + break; + } + + folio = kvm_gmem_get_folio(inode, index); + if (!folio) { + r = -ENOMEM; + break; + } + + index = folio_next_index(folio); + + folio_unlock(folio); + folio_put(folio); May be a dumb question, why we get the folio and then put it immediately? Will it make the folio be released back to the page allocator? I was wondering this too, but it is correct. In filemap_grab_folio(), the refcount is incremented in three places: + When the folio is created in filemap_alloc_folio(), it is given a refcount of 1 in filemap_alloc_folio() -> folio_alloc() -> __folio_alloc_node() -> __folio_alloc() -> __alloc_pages() -> get_page_from_freelist() -> prep_new_page() -> post_alloc_hook() -> set_page_refcounted() + Then, in filemap_add_folio(), the refcount is incremented twice: + The first is from the filemap (1 refcount per page if this is a hugepage): filemap_add_folio() -> __filemap_add_folio() -> folio_ref_add() + The second is a refcount from the lru list filemap_add_folio() -> folio_add_lru() -> folio_get() -> folio_ref_inc() In the other path, if the folio exists in the page cache (filemap), the refcount is also incremented through filemap_grab_folio() -> __filemap_get_folio() -> filemap_get_entry() -> folio_try_get_rcu() I believe all the branches in kvm_gmem_get_folio() are taking a refcount on the folio while the kernel does some work on the folio like clearing the folio in clear_highpage() or getting the next index, and then when done, the kernel does folio_put(). This pattern is also used in shmem and hugetlb. :) Thanks for your explanation. It helps a lot. I'm not sure whose refcount the folio_put() in kvm_gmem_allocate() is dropping though: + The refcount for the filemap depends on whether this is a hugepage or not, but folio_put() strictly drops a refcount of 1. + The refcount for the lru list is just 1, but doesn't the page still remain in the lru list? I guess the refcount drop here is the one get on the fresh allocation. Now the filemap has grabbed the folio, so the lifecycle of the folio now is decided by the filemap/inode? + + /* 64-bit only, wrapping the index should be impossible. */ + if (WARN_ON_ONCE(!index)) + break; + + cond_resched(); + } + + filemap_invalidate_unlock_shared(mapping); + + return r; +} +
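Restating the kvm_gmem_allocate() loop body with the refcounting spelled out (comments added by me, code unchanged):

                /* filemap_grab_folio() returns the folio locked, with a reference. */
                folio = kvm_gmem_get_folio(inode, index);
                if (!folio) {
                        r = -ENOMEM;
                        break;
                }

                index = folio_next_index(folio);

                folio_unlock(folio);
                /*
                 * Drops only the reference taken by filemap_grab_folio(); the
                 * page cache (and LRU) still hold references, so the folio is
                 * NOT handed back to the page allocator, it stays in the filemap.
                 */
                folio_put(folio);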
Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
On 7/19/2023 7:44 AM, Sean Christopherson wrote: [...] + +static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index) +{ + struct folio *folio; + + /* TODO: Support huge pages. */ + folio = filemap_grab_folio(file->f_mapping, index); + if (!folio) Should use if ((IS_ERR(folio)) instead. + return NULL; + + /* +* Use the up-to-date flag to track whether or not the memory has been +* zeroed before being handed off to the guest. There is no backing +* storage for the memory, so the folio will remain up-to-date until +* it's removed. +* +* TODO: Skip clearing pages when trusted firmware will do it when +* assigning memory to the guest. +*/ + if (!folio_test_uptodate(folio)) { + unsigned long nr_pages = folio_nr_pages(folio); + unsigned long i; + + for (i = 0; i < nr_pages; i++) + clear_highpage(folio_page(folio, i)); + + folio_mark_uptodate(folio); + } + + /* +* Ignore accessed, referenced, and dirty flags. The memory is +* unevictable and there is no storage to write back to. +*/ + return folio; +} [...] + +static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len) +{ + struct address_space *mapping = inode->i_mapping; + pgoff_t start, index, end; + int r; + + /* Dedicated guest is immutable by default. */ + if (offset + len > i_size_read(inode)) + return -EINVAL; + + filemap_invalidate_lock_shared(mapping); + + start = offset >> PAGE_SHIFT; + end = (offset + len) >> PAGE_SHIFT; + + r = 0; + for (index = start; index < end; ) { + struct folio *folio; + + if (signal_pending(current)) { + r = -EINTR; + break; + } + + folio = kvm_gmem_get_folio(inode, index); + if (!folio) { + r = -ENOMEM; + break; + } + + index = folio_next_index(folio); + + folio_unlock(folio); + folio_put(folio); May be a dumb question, why we get the folio and then put it immediately? Will it make the folio be released back to the page allocator? + + /* 64-bit only, wrapping the index should be impossible. */ + if (WARN_ON_ONCE(!index)) + break; + + cond_resched(); + } + + filemap_invalidate_unlock_shared(mapping); + + return r; +} + [...] + +int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot, + unsigned int fd, loff_t offset) +{ + loff_t size = slot->npages << PAGE_SHIFT; + unsigned long start, end, flags; + struct kvm_gmem *gmem; + struct inode *inode; + struct file *file; + + BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff)); + + file = fget(fd); + if (!file) + return -EINVAL; + + if (file->f_op != &kvm_gmem_fops) + goto err; + + gmem = file->private_data; + if (gmem->kvm != kvm) + goto err; + + inode = file_inode(file); + flags = (unsigned long)inode->i_private; + + /* +* For simplicity, require the offset into the file and the size of the +* memslot to be aligned to the largest possible page size used to back +* the file (same as the size of the file itself). +*/ + if (!kvm_gmem_is_valid_size(offset, flags) || + !kvm_gmem_is_valid_size(size, flags)) + goto err; + + if (offset + size > i_size_read(inode)) + goto err; + + filemap_invalidate_lock(inode->i_mapping); + + start = offset >> PAGE_SHIFT; + end = start + slot->npages; + + if (!xa_empty(&gmem->bindings) && + xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) { + filemap_invalidate_unlock(inode->i_mapping); + goto err; + } + + /* +* No synchronize_rcu() needed, any in-flight readers are guaranteed to +* be see either a NULL file or this new file, no need for them to go +* away. 
+	 */
+	rcu_assign_pointer(slot->gmem.file, file);
+	slot->gmem.pgoff = start;
+
+	xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
+	filemap_invalidate_unlock(inode->i_mapping);
+
+	/*
+	 * Drop the reference to the file, even on success. The file pins KVM,
+	 * not the other way 'round. Active bindings are invalidated if the

Is the apostrophe in "'round" an extra character, or should it maybe be "around"?

+	 * file is closed before memslots are destroyed.
+	 */
+	fput(file);
+	return 0;
+
+err:
+	fput(file);
+	return -EINVAL;
+}
+
[...]
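And for the filemap_grab_folio() return check flagged earlier in this mail: filemap_grab_folio() now reports failure via ERR_PTR() rather than NULL, so the fix would presumably be:

 	/* TODO: Support huge pages. */
 	folio = filemap_grab_folio(file->f_mapping, index);
-	if (!folio)
+	if (IS_ERR(folio))
 		return NULL;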
Re: [RFC PATCH v11 08/29] KVM: Introduce per-page memory attributes
On 7/19/2023 7:44 AM, Sean Christopherson wrote: From: Chao Peng In confidential computing usages, whether a page is private or shared is necessary information for KVM to perform operations like page fault handling, page zapping etc. There are other potential use cases for per-page memory attributes, e.g. to make memory read-only (or no-exec, or exec-only, etc.) without having to modify memslots. Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow userspace to operate on the per-page memory attributes. - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to a guest memory range. - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported memory attributes. Use an xarray to store the per-page attributes internally, with a naive, not fully optimized implementation, i.e. prioritize correctness over performance for the initial implementation. Because setting memory attributes is roughly analogous to mprotect() on memory that is mapped into the guest, zap existing mappings prior to updating the memory attributes. Opportunistically provide an arch hook for the post-set path (needed to complete invalidation anyways) in s/anyways/anyway anticipation of x86 needing the hook to update metadata related to determining whether or not a given gfn can be backed with various sizes of hugepages. It's possible that future usages may not require an invalidation, e.g. if KVM ends up supporting RWX protections and userspace grants _more_ protections, but again opt for simplicity and punt optimizations to if/when they are needed. Suggested-by: Sean Christopherson Link: https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com Cc: Fuad Tabba Signed-off-by: Chao Peng Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson --- Documentation/virt/kvm/api.rst | 60 include/linux/kvm_host.h | 14 +++ include/uapi/linux/kvm.h | 14 +++ virt/kvm/Kconfig | 4 + virt/kvm/kvm_main.c| 170 + 5 files changed, 262 insertions(+) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 34d4ce66e0c8..0ca8561775ac 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -6068,6 +6068,56 @@ writes to the CNTVCT_EL0 and CNTPCT_EL0 registers using the SET_ONE_REG interface. No error will be returned, but the resulting offset will not be applied. +4.139 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES +- + +:Capability: KVM_CAP_MEMORY_ATTRIBUTES +:Architectures: x86 +:Type: vm ioctl +:Parameters: u64 memory attributes bitmask(out) +:Returns: 0 on success, <0 on error + +Returns supported memory attributes bitmask. Supported memory attributes will +have the corresponding bits set in u64 memory attributes bitmask. + +The following memory attributes are defined:: + + #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3) + +4.140 KVM_SET_MEMORY_ATTRIBUTES +- + +:Capability: KVM_CAP_MEMORY_ATTRIBUTES +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_memory_attributes(in/out) +:Returns: 0 on success, <0 on error + +Sets memory attributes for pages in a guest memory range. Parameters are +specified via the following structure:: + + struct kvm_memory_attributes { + __u64 address; + __u64 size; + __u64 attributes; + __u64 flags; + }; + +The user sets the per-page memory attributes to a guest memory range indicated +by address/size, and in return KVM adjusts address and size to reflect the +actual pages of the memory range have been successfully set to the attributes. 
+If the call returns 0, "address" is updated to the last successful address + 1
+and "size" is updated to the remaining address size that has not been set
+successfully. The user should check the return value as well as the size to
+decide if the operation succeeded for the whole range or not. The user may want
+to retry the operation with the returned address/size if the previous range was
+partially successful.
+
+Both address and size should be page aligned and the supported attributes can be
+retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
+
+The "flags" field may be used for future extensions and should be set to 0s.
+
 5. The kvm_run structure
 ========================

@@ -8494,6 +8544,16 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a 64-bit bitmap
 (each bit describing a block size). The default value is 0, to disable the
 eager page splitting.

+8.41 KVM_CAP_MEMORY_ATTRIBUTES
+------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm
+
+This capability indicates KVM supports per-page memory attributes and ioctls
+KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
+
 9. Known KVM API problems
 =========================
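Putting the documented partial-success semantics together, userspace would drive the ioctl roughly like this (a sketch based on the text above; it assumes the ioctl makes forward progress whenever it returns 0):

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int set_range_private(int vm_fd, __u64 gpa, __u64 size)
{
        struct kvm_memory_attributes attrs = {
                .address    = gpa,
                .size       = size,
                .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
                .flags      = 0,
        };

        /* On success, "address"/"size" are updated in place, so just re-issue. */
        while (attrs.size) {
                if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
                        return -1;      /* errno set by the kernel */
        }

        return 0;
}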