Re: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking
On 5/5/23 12:31, Sean Christopherson wrote: > On Fri, May 05, 2023, Micka�l Sala�n wrote: >> >> On 05/05/2023 18:28, Sean Christopherson wrote: >>> I have no doubt that we'll need to solve performance and scaling issues >>> with the >>> memory attributes implementation, e.g. to utilize xarray multi-range support >>> instead of storing information on a per-4KiB-page basis, but AFAICT, the >>> core >>> idea is sound. And a very big positive from a maintenance perspective is >>> that >>> any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should >>> also >>> benefit the other use case. >>> >>> [1] https://lore.kernel.org/all/20230311002258.852397-22-sea...@google.com >>> [2] https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com >>> [3] https://lore.kernel.org/all/y1a1i9vbj%2fpvm...@google.com >> >> I agree, I used this mechanism because it was easier at first to rely on a >> previous work, but while I was working on the MBEC support, I realized that >> it's not the optimal way to do it. >> >> I was thinking about using a new special EPT bit similar to >> EPT_SPTE_HOST_WRITABLE, but it may not be portable though. What do you >> think? > > On x86, SPTEs are even more ephemeral than memslots. E.g. for historical > reasons, > KVM zaps all SPTEs if _any_ memslot is deleted, which is problematic if the > guest > is moving around BARs, using option ROMs, etc. > > ARM's pKVM tracks metadata in its stage-2 PTEs, i.e. doesn't need an xarray to > otrack attributes, but that works only because pKVM is more privileged than > the > host kernel, and the shared vs. private memory attribute that pKVM cares about > is very, very restricted in how it can be used and changed. > > I tried shoehorning private vs. shared metadata into x86's SPTEs in the past, > and > it ended up being a constant battle with the kernel, e.g. page migration, and > with > KVM itself, e.g. the above memslot mess. Sorry for the delay in responding to this. I wanted to study the KVM code and fully understand your comment before responding. Yes, I quite agree with you. I will make an attempt to address this in the next version. I am working on it right now. Thanks. Madhavan
Re: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking
On Fri, May 05, 2023, Micka�l Sala�n wrote: > > On 05/05/2023 18:28, Sean Christopherson wrote: > > I have no doubt that we'll need to solve performance and scaling issues > > with the > > memory attributes implementation, e.g. to utilize xarray multi-range support > > instead of storing information on a per-4KiB-page basis, but AFAICT, the > > core > > idea is sound. And a very big positive from a maintenance perspective is > > that > > any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should > > also > > benefit the other use case. > > > > [1] https://lore.kernel.org/all/20230311002258.852397-22-sea...@google.com > > [2] https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com > > [3] https://lore.kernel.org/all/y1a1i9vbj%2fpvm...@google.com > > I agree, I used this mechanism because it was easier at first to rely on a > previous work, but while I was working on the MBEC support, I realized that > it's not the optimal way to do it. > > I was thinking about using a new special EPT bit similar to > EPT_SPTE_HOST_WRITABLE, but it may not be portable though. What do you > think? On x86, SPTEs are even more ephemeral than memslots. E.g. for historical reasons, KVM zaps all SPTEs if _any_ memslot is deleted, which is problematic if the guest is moving around BARs, using option ROMs, etc. ARM's pKVM tracks metadata in its stage-2 PTEs, i.e. doesn't need an xarray to otrack attributes, but that works only because pKVM is more privileged than the host kernel, and the shared vs. private memory attribute that pKVM cares about is very, very restricted in how it can be used and changed. I tried shoehorning private vs. shared metadata into x86's SPTEs in the past, and it ended up being a constant battle with the kernel, e.g. page migration, and with KVM itself, e.g. the above memslot mess.
Re: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking
On 05/05/2023 18:28, Sean Christopherson wrote: On Fri, May 05, 2023, Micka�l Sala�n wrote: diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h index eb186bc57f6a..a7fb4ff888e6 100644 --- a/arch/x86/include/asm/kvm_page_track.h +++ b/arch/x86/include/asm/kvm_page_track.h @@ -3,6 +3,7 @@ #define _ASM_X86_KVM_PAGE_TRACK_H enum kvm_page_track_mode { + KVM_PAGE_TRACK_PREWRITE, Heh, just when I decide to finally kill off support for multiple modes[1] :-) My assessment from that changelog still holds true for this case: Drop "support" for multiple page-track modes, as there is no evidence that array-based and refcounted metadata is the optimal solution for other modes, nor is there any evidence that other use cases, e.g. for access-tracking, will be a good fit for the page-track machinery in general. E.g. one potential use case of access-tracking would be to prevent guest access to poisoned memory (from the guest's perspective). In that case, the number of poisoned pages is likely to be a very small percentage of the guest memory, and there is no need to reference count the number of access-tracking users, i.e. expanding gfn_track[] for a new mode would be grossly inefficient. And for poisoned memory, host userspace would also likely want to trap accesses, e.g. to inject #MC into the guest, and that isn't currently supported by the page-track framework. A better alternative for that poisoned page use case is likely a variation of the proposed per-gfn attributes overlay (linked), which would allow efficiently tracking the sparse set of poisoned pages, and by default would exit to userspace on access. Of particular relevance: - Using the page-track machinery is inefficient because the guest is likely going to write-protect a minority of its memory. And this select KVM_EXTERNAL_WRITE_TRACKING if KVM is particularly nasty because simply enabling HEKI in the Kconfig will cause KVM to allocate rmaps and gfn tracking. - There's no need to reference count the protection, i.e. 15 of the 16 bits of gfn_track are dead weight. - As proposed, adding a second "mode" would double the cost of gfn tracking. - Tying the protections to the memslots will create an impossible-to-maintain ABI because the protections will be lost if the owning memslot is deleted and recreated. - The page-track framework provides incomplete protection and will lead to an ongoing game of whack-a-mole, e.g. this patch catches the obvious cases by adding calls to kvm_page_track_prewrite(), but misses things like kvm_vcpu_map(). - The scaling and maintenance issues will only get worse if/when someone tries to support dropping read and/or execute permissions, e.g. for execute-only. - The code is x86-only, and is likely to stay that way for the foreseeable future. The proposed alternative is to piggyback the memory attributes implementation[2] that is being added (if all goes according to plan) for confidential VMs. This use case (dropping permissions) came up not too long ago[3], which is why I have a ready-made answer). I have no doubt that we'll need to solve performance and scaling issues with the memory attributes implementation, e.g. to utilize xarray multi-range support instead of storing information on a per-4KiB-page basis, but AFAICT, the core idea is sound. And a very big positive from a maintenance perspective is that any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should also benefit the other use case. [1] https://lore.kernel.org/all/20230311002258.852397-22-sea...@google.com [2] https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com [3] https://lore.kernel.org/all/y1a1i9vbj%2fpvm...@google.com I agree, I used this mechanism because it was easier at first to rely on a previous work, but while I was working on the MBEC support, I realized that it's not the optimal way to do it. I was thinking about using a new special EPT bit similar to EPT_SPTE_HOST_WRITABLE, but it may not be portable though. What do you think?
Re: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking
On Fri, May 05, 2023, Micka�l Sala�n wrote: > diff --git a/arch/x86/include/asm/kvm_page_track.h > b/arch/x86/include/asm/kvm_page_track.h > index eb186bc57f6a..a7fb4ff888e6 100644 > --- a/arch/x86/include/asm/kvm_page_track.h > +++ b/arch/x86/include/asm/kvm_page_track.h > @@ -3,6 +3,7 @@ > #define _ASM_X86_KVM_PAGE_TRACK_H > > enum kvm_page_track_mode { > + KVM_PAGE_TRACK_PREWRITE, Heh, just when I decide to finally kill off support for multiple modes[1] :-) My assessment from that changelog still holds true for this case: Drop "support" for multiple page-track modes, as there is no evidence that array-based and refcounted metadata is the optimal solution for other modes, nor is there any evidence that other use cases, e.g. for access-tracking, will be a good fit for the page-track machinery in general. E.g. one potential use case of access-tracking would be to prevent guest access to poisoned memory (from the guest's perspective). In that case, the number of poisoned pages is likely to be a very small percentage of the guest memory, and there is no need to reference count the number of access-tracking users, i.e. expanding gfn_track[] for a new mode would be grossly inefficient. And for poisoned memory, host userspace would also likely want to trap accesses, e.g. to inject #MC into the guest, and that isn't currently supported by the page-track framework. A better alternative for that poisoned page use case is likely a variation of the proposed per-gfn attributes overlay (linked), which would allow efficiently tracking the sparse set of poisoned pages, and by default would exit to userspace on access. Of particular relevance: - Using the page-track machinery is inefficient because the guest is likely going to write-protect a minority of its memory. And this select KVM_EXTERNAL_WRITE_TRACKING if KVM is particularly nasty because simply enabling HEKI in the Kconfig will cause KVM to allocate rmaps and gfn tracking. - There's no need to reference count the protection, i.e. 15 of the 16 bits of gfn_track are dead weight. - As proposed, adding a second "mode" would double the cost of gfn tracking. - Tying the protections to the memslots will create an impossible-to-maintain ABI because the protections will be lost if the owning memslot is deleted and recreated. - The page-track framework provides incomplete protection and will lead to an ongoing game of whack-a-mole, e.g. this patch catches the obvious cases by adding calls to kvm_page_track_prewrite(), but misses things like kvm_vcpu_map(). - The scaling and maintenance issues will only get worse if/when someone tries to support dropping read and/or execute permissions, e.g. for execute-only. - The code is x86-only, and is likely to stay that way for the foreseeable future. The proposed alternative is to piggyback the memory attributes implementation[2] that is being added (if all goes according to plan) for confidential VMs. This use case (dropping permissions) came up not too long ago[3], which is why I have a ready-made answer). I have no doubt that we'll need to solve performance and scaling issues with the memory attributes implementation, e.g. to utilize xarray multi-range support instead of storing information on a per-4KiB-page basis, but AFAICT, the core idea is sound. And a very big positive from a maintenance perspective is that any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should also benefit the other use case. [1] https://lore.kernel.org/all/20230311002258.852397-22-sea...@google.com [2] https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com [3] https://lore.kernel.org/all/y1a1i9vbj%2fpvm...@google.com
[PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking
Add a new page tracking mode to deny a page update and throw a page fault to the guest. This is useful for KVM to be able to make some pages non-writable (not read-only because it doesn't imply execution restrictions), see the next Heki commits. This kind of synthetic kernel page fault needs to be handled by the guest, which is not currently the case, making it try again and again. This will be part of a follow-up patch series. Update emulator_read_write_onepage() to handle X86EMUL_CONTINUE and X86EMUL_PROPAGATE_FAULT. Update page_fault_handle_page_track() to call kvm_slot_page_track_is_active() whenever this is required for KVM_PAGE_TRACK_PREWRITE and KVM_PAGE_TRACK_WRITE, even if one tracker already returned true. Invert the return code semantic for read_emulate() and write_emulate(): - from 1=Ok 0=Error - to X86EMUL_* return codes (e.g. X86EMUL_CONTINUE == 0) Imported the prewrite page tracking support part originally written by Mihai Donțu, Marian Rotariu, and Ștefan Șicleru: https://lore.kernel.org/r/20211006173113.26445-27-ala...@bitdefender.com https://lore.kernel.org/r/20211006173113.26445-28-ala...@bitdefender.com Removed the GVA changes for page tracking, removed the X86EMUL_RETRY_INSTR case, and some emulation part for now. Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Ingo Molnar Cc: Kees Cook Cc: Madhavan T. Venkataraman Cc: Marian Rotariu Cc: Mihai Donțu Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Thomas Gleixner Cc: Vitaly Kuznetsov Cc: Wanpeng Li Cc: Ștefan Șicleru Signed-off-by: Mickaël Salaün Link: https://lore.kernel.org/r/20230505152046.6575-3-...@digikod.net --- arch/x86/include/asm/kvm_page_track.h | 12 + arch/x86/kvm/mmu/mmu.c| 64 ++- arch/x86/kvm/mmu/page_track.c | 33 +- arch/x86/kvm/mmu/spte.c | 6 +++ arch/x86/kvm/x86.c| 27 +++ 5 files changed, 122 insertions(+), 20 deletions(-) diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h index eb186bc57f6a..a7fb4ff888e6 100644 --- a/arch/x86/include/asm/kvm_page_track.h +++ b/arch/x86/include/asm/kvm_page_track.h @@ -3,6 +3,7 @@ #define _ASM_X86_KVM_PAGE_TRACK_H enum kvm_page_track_mode { + KVM_PAGE_TRACK_PREWRITE, KVM_PAGE_TRACK_WRITE, KVM_PAGE_TRACK_MAX, }; @@ -22,6 +23,16 @@ struct kvm_page_track_notifier_head { struct kvm_page_track_notifier_node { struct hlist_node node; + /* +* It is called when guest is writing the write-tracked page +* and the write emulation didn't happened yet. +* +* @vcpu: the vcpu where the write access happened +* @gpa: the physical address written by guest +* @node: this nodet +*/ + bool (*track_prewrite)(struct kvm_vcpu *vcpu, gpa_t gpa, + struct kvm_page_track_notifier_node *node); /* * It is called when guest is writing the write-tracked page * and write emulation is finished at that time. @@ -73,6 +84,7 @@ kvm_page_track_register_notifier(struct kvm *kvm, void kvm_page_track_unregister_notifier(struct kvm *kvm, struct kvm_page_track_notifier_node *n); +bool kvm_page_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa); void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, int bytes); void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot); diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 835426254e76..e5d1e241ff0f 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -793,9 +793,13 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp) slot = __gfn_to_memslot(slots, gfn); /* the non-leaf shadow pages are keeping readonly. */ - if (sp->role.level > PG_LEVEL_4K) - return kvm_slot_page_track_add_page(kvm, slot, gfn, - KVM_PAGE_TRACK_WRITE); + if (sp->role.level > PG_LEVEL_4K) { + kvm_slot_page_track_add_page(kvm, slot, gfn, +KVM_PAGE_TRACK_PREWRITE); + kvm_slot_page_track_add_page(kvm, slot, gfn, +KVM_PAGE_TRACK_WRITE); + return; + } kvm_mmu_gfn_disallow_lpage(slot, gfn); @@ -840,9 +844,13 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp) gfn = sp->gfn; slots = kvm_memslots_for_spte_role(kvm, sp->role); slot = __gfn_to_memslot(slots, gfn); - if (sp->role.level > PG_LEVEL_4K) - return kvm_slot_page_track_remove_page(kvm, slot, gfn, - KVM_PAGE_TRACK_WRITE); + if (sp->role.level > PG_LEVEL_4K) { + kvm_slot_page_track_remove_page(kvm,