Re: Nested AVIC design (was:Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally)

2022-10-03 Thread Maxim Levitsky
On Thu, 2022-09-29 at 22:38 +, Sean Christopherson wrote:
> On Mon, Aug 08, 2022, Maxim Levitsky wrote:
> > Hi Sean, Paolo, and everyone else who wants to review my nested AVIC work.
> 
> Before we dive deep into design details, I think we should first decide whether
> or not nested AVIC is worth pursuing/supporting.
> 
>   - Rome has a ucode/silicon bug with no known workaround and no anticipated fix[*];
>     AMD's recommended "workaround" is to disable AVIC.
>   - AVIC is not available in Milan, which may or may not be related to the
>     aforementioned bug.
>   - AVIC is making a comeback on Zen4, but Zen4 comes with x2AVIC.
>   - x2APIC is likely going to become ubiquitous, e.g. Intel is effectively
>     requiring x2APIC to fudge around xAPIC bugs.
>   - It's actually quite realistic to effectively force the guest to use x2APIC,
>     at least if it's a Linux guest.  E.g. turn x2APIC on in BIOS, which is often
>     (always?) controlled by the host, and Linux will use x2APIC.
> 
> In other words, given that AVIC is well on its way to becoming a "legacy" feature,
> IMO there needs to be a fairly strong use case to justify taking on this much code
> and complexity.  ~1500 lines of code to support a feature that has historically
> been buggy _without_ nested support is going to require a non-trivial amount of
> effort to review, stabilize, and maintain.
> 
> [*] 1235 "Guest With AVIC (Advanced Virtual Interrupt Controller) Enabled May Fail
> to Process IPI (Inter-Processor Interrupt) Until Guest Is Re-Scheduled" in
> https://www.amd.com/system/files/TechDocs/56323-PUB_1.00.pdf
> 

I am afraid that you mixed things up:

Your mistake is treating x2AVIC as something separate: x2AVIC is just a minor addition to
AVIC, and for all practical purposes it is still the same feature.

 
1. AVIC is indeed somewhat broken on Zen2, but AFAIK for all practical purposes,
   including nested, it works fine; the erratum only shows up in a unit test and/or
   under very specific workloads (most of the time a delayed wakeup doesn't cause a hang).
   Still, I agree that for production, Zen2 should not have AVIC enabled.
 

2. Zen3 does indeed have AVIC soft disabled in CPUID. AFAIK it works just fine,
   but I understand that customers won't use it against AMD's guidance.
 
 
3. On Zen4, AVIC is fully enabled and also extended to support x2APIC mode.
   The fact that AVIC was extended to support x2APIC mode also shows that AMD
   is committed to supporting it.
 
 
My nested AVIC code technically doesn't expose x2AVIC to the guest, but it is pretty much
trivial to add (I am only waiting to get my hands on a Zen4 machine to do it), and even in
its current form it would work just fine if the host uses normal AVIC
(or even doesn't use AVIC at all - the nested AVIC code works just fine
even if the host has its AVIC inhibited for some reason).
 
Adding nested x2AVIC support is literally about not passing through that MMIO address,
enabling the x2AVIC bit in int_ctl, and opening up access to the x2APIC MSRs.
Plus I need to do some minor changes in the unaccelerated IPI handler, dealing
with the read-only logical ID and such.
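
Roughly something like this for the int_ctl part (just a sketch; the flag names are the ones
I remember from the APM/headers, and the helper itself is made up):

/*
 * Sketch only: how the shadow vmcb's int_ctl could gain the x2AVIC mode
 * bit when the nested guest enables it.  Not the actual patch.
 */
static void nested_avic_update_int_ctl(struct vcpu_svm *svm,
				       struct vmcb *vmcb12)
{
	u32 int_ctl = svm->vmcb->control.int_ctl & ~X2APIC_MODE_MASK;

	/* AVIC itself stays enabled exactly as before. */
	int_ctl |= AVIC_ENABLE_MASK;

	/* Propagate the guest's x2APIC mode choice into the shadow vmcb. */
	if (vmcb12->control.int_ctl & X2APIC_MODE_MASK)
		int_ctl |= X2APIC_MODE_MASK;

	svm->vmcb->control.int_ctl = int_ctl;
}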
 
Physid tables, APIC backing pages, doorbell emulation:
everything else is pretty much unchanged.
 
So AVIC is anything but a legacy feature, and my nested AVIC code will support
both nested AVIC and nested x2AVIC.
 
Best regards,
Maxim Levitsky
 
 



Nested AVIC design (was:Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally)

2022-08-08 Thread Maxim Levitsky
On Mon, 2022-08-01 at 17:20 +, Sean Christopherson wrote:
> On Thu, Jul 28, 2022, Maxim Levitsky wrote:
> > On Mon, 2022-07-25 at 16:08 +, Sean Christopherson wrote:
> > > On Wed, Jul 20, 2022, Maxim Levitsky wrote:
> > > And on that topic, do you have performance numbers to justify using a 
> > > single
> > > shared node?  E.g. if every table instance has its own notifier, then no 
> > > additional
> > > refcounting is needed. 
> > 
> > The thing is that KVM goes over the list of notifiers and calls them for
> > every write from the emulator in fact even just for mmio write, and when you
> > enable write tracking on a page, you just write protect the page and add a
> > mark in the page track array, which is roughly 
> > 
> > 'don't install spte, don't install mmio spte, but just emulate the page 
> > fault if it hits this page'
> > 
> > So adding more than a bare minimum to this list, seems just a bit wrong.
> 
> Hmm, I see what you're saying.  To some extent, having a minimal page tracker
> implementation is just that, an implementation detail.  But for better or 
> worse,
> the existing API effectively pushes range checking to the callers.  I agree 
> that
> breaking from that pattern would be odd.
> 
> > >  It's not obvious that a shared node will provide better performance, e.g.
> > >  if there are only a handful of AVIC tables being shadowed, then a linear
> > >  walk of all nodes is likely fast enough, and doesn't bring the risk of a
> > >  write potentially being stalled due to having to acquire a VM-scoped
> > >  mutex.
> > 
> > The thing is that if I register multiple notifiers, they all will be called 
> > anyway,
> > but yes I can use container_of, and discover which table the notifier 
> > belongs to,
> > instead of having a hash table where I lookup the GFN of the fault.
> > 
> > The above means practically that all the shadow physid tables will be in a 
> > linear
> > list of notifiers, so I could indeed avoid per vm mutex on the write 
> > tracking,
> > however for simplicity I probably will still need it because I do modify 
> > the page,
> > and having per physid table mutex complicates things.
> > 
> > Currently in my code the locking is very simple and somewhat dumb, but the 
> > performance
> > is very good because the code isn't executed often, most of the time the 
> > AVIC hardware
> > works alone without any VM exits.
> 
> Yes, but because the code isn't executed often, pretty much any solution will
> provide good performance.
> 
> > Once the code is accepted upstream, it's one of the things that can be 
> > improved.
> > 
> > Note though that I still need a hash table and a mutex because on each VM 
> > entry,
> > the guest can use a different physid table, so I need to lookup it, and 
> > create it,
> > if not found, which would require read/write of the hash table and thus a 
> > mutex.
> 
> One of the points I'm trying to make is that a hash table isn't strictly 
> required.
> E.g. if I understand the update rules correctly, I believe tables can be 
> tracked
> via an RCU-protected list, with vCPUs taking a spinlock and doing 
> synchronize_rcu()
> when adding/removing a table.  That would avoid having to take any "real" 
> locks in
> the page track notifier.
> 
> The VM-scoped mutex worries me as it will be a bottleneck if L1 is running 
> multiple
> L2 VMs.  E.g. if L1 is frequently switching vmcs12 and thus avic_physical_id, 
> then
> nested VMRUN will effectively get serialized.  That is mitigated to some 
> extent by
> an RCU-protected list, as a sane L1 will use a single table for each L2, and 
> so a
> vCPU will need to add/remove a table if and only if it's the first/last vCPU 
> to
> start/stop running an L2 VM.

Hi Sean, Paolo, and everyone else who wants to review my nested AVIC work.
 
I would like to explain the design choices for the locking and life cycle of the shadow
physid tables, and I hope that this will make it easier for you to review my code and/or
suggest ways to improve it.
 
=
Explanation of the AVIC physid page (AVIC physical ID table)
=
 
This table gives a vCPU enough knowledge of its peers to send them IPIs without 
VM exit.
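
For reference, each 64-bit entry in that table roughly looks like this (per my reading of
the APM; layout simplified, the real code uses the AVIC_PHYSICAL_ID_ENTRY_* masks):

/*
 * Rough layout of one physical ID table entry (sketch, not authoritative):
 *
 *   low bits (7:0, extended for x2AVIC)  host physical APIC ID the vCPU is loaded on
 *   bits 51:12                           host physical address >> 12 of the vCPU's
 *                                        APIC backing page
 *   bit  62                              IsRunning - the vCPU is loaded, doorbell
 *                                        messages can reach it
 *   bit  63                              Valid - the entry may be used at all
 */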
 
A vCPU doesn’t use this table to send IPIs to itself and or 

Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally

2022-08-01 Thread Maxim Levitsky
On Thu, 2022-07-28 at 10:46 +0300, Maxim Levitsky wrote:
> On Mon, 2022-07-25 at 16:08 +, Sean Christopherson wrote:
> > On Wed, Jul 20, 2022, Maxim Levitsky wrote:
> > > On Sun, 2022-05-22 at 13:22 +0300, Maxim Levitsky wrote:
> > > > On Thu, 2022-05-19 at 16:37 +, Sean Christopherson wrote:
> > > > > On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > > > > > @@ -5753,6 +5752,10 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> > > Now for nested AVIC, this is what I would like to do:
> > >  
> > > - just like mmu, I prefer to register the write tracking notifier, when 
> > > the
> > >   VM is created.
> > > 
> > > - just like mmu, write tracking should only be enabled when nested AVIC is
> > >   actually used first time, so that write tracking is not always enabled 
> > > when
> > >   you just boot a VM with nested avic supported, since the VM might not 
> > > use
> > >   nested at all.
> > >  
> > > Thus I either need to use the __kvm_page_track_register_notifier too for 
> > > AVIC
> > > (and thus need to export it) or I need to have a boolean
> > > (nested_avic_was_used_once) and register the write tracking notifier only
> > > when false and do it not on VM creation but on first attempt to use nested
> > > AVIC.
> > >  
> > > Do you think this is worth it? I mean there is some value of registering 
> > > the
> > > notifier only when needed (this way it is not called for nothing) but it 
> > > does
> > > complicate things a bit.
> > 
> > Compared to everything else that you're doing in the nested AVIC code, 
> > refcounting
> > the shared kvm_page_track_notifier_node object is a trivial amount of 
> > complexity.
> Makes sense.
> 
> > And on that topic, do you have performance numbers to justify using a single
> > shared node?  E.g. if every table instance has its own notifier, then no 
> > additional
> > refcounting is needed. 
> 
> The thing is that KVM goes over the list of notifiers and calls them for 
> every write from the emulator
> in fact even just for mmio write, and when you enable write tracking on a 
> page,
> you just write protect the page and add a mark in the page track array, which 
> is roughly 
> 
> 'don't install spte, don't install mmio spte, but just emulate the page fault 
> if it hits this page'
> 
> So adding more than a bare minimum to this list, seems just a bit wrong.
> 
> 
> >  It's not obvious that a shared node will provide better
> > performance, e.g. if there are only a handful of AVIC tables being 
> > shadowed, then
> > a linear walk of all nodes is likely fast enough, and doesn't bring the 
> > risk of
> > a write potentially being stalled due to having to acquire a VM-scoped 
> > mutex.
> 
> The thing is that if I register multiple notifiers, they all will be called 
> anyway,
> but yes I can use container_of, and discover which table the notifier belongs 
> to,
> instead of having a hash table where I lookup the GFN of the fault.
> 
> The above means practically that all the shadow physid tables will be in a 
> linear
> list of notifiers, so I could indeed avoid per vm mutex on the write tracking,
> however for simplicity I probably will still need it because I do modify the 
> page,
> and having per physid table mutex complicates things.
> 
> Currently in my code the locking is very simple and somewhat dumb, but the 
> performance
> is very good because the code isn't executed often, most of the time the AVIC 
> hardware
> works alone without any VM exits.
> 
> Once the code is accepted upstream, it's one of the things that can be 
> improved.
> 
> 
> Note though that I still need a hash table and a mutex because on each VM 
> entry,
> the guest can use a different physid table, so I need to lookup it, and 
> create it,
> if not found, which would require read/write of the hash table and thus a 
> mutex.
> 
> 
> 
> > > I can also stash this boolean (like 'bool registered;') into the 'struct
> > > kvm_page_track_notifier_node',  and thus allow the
> > > kvm_page_track_register_notifier to be called more that once -  then I can
> > > also get rid of __kvm_page_track_register_notifier. 
> > 
> > No, allowing redundant registration without proper refcounting leads to 
> > pain,
> > e.g. X registers, Y registers, X unregisters, kaboom.
> > 
> 
> True, but then what about adding a refcount to 'struct 
> kvm_page_track_notifier_node'
> instead of a boolean, and allowing redundant registration? 
> Probably not worth it, in which case I am OK to add a refcount to my avic 
> code.
> 
> Or maybe just scrap the whole thing and just leave registration and 
> activation of the
> write tracking as two separate things? Honestly now that looks like the most 
> clean
> solution.


Kind ping on this. Do you still want me to enable write tracking on notifier registration,
or should I scrap the idea?


Best regards,
Maxim Levitsky
> 
> Best regards,
>   Maxim Levitsky




Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally

2022-07-28 Thread Maxim Levitsky
On Mon, 2022-07-25 at 16:08 +, Sean Christopherson wrote:
> On Wed, Jul 20, 2022, Maxim Levitsky wrote:
> > On Sun, 2022-05-22 at 13:22 +0300, Maxim Levitsky wrote:
> > > On Thu, 2022-05-19 at 16:37 +, Sean Christopherson wrote:
> > > > On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > > > > @@ -5753,6 +5752,10 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> > Now for nested AVIC, this is what I would like to do:
> >  
> > - just like mmu, I prefer to register the write tracking notifier, when the
> >   VM is created.
> > 
> > - just like mmu, write tracking should only be enabled when nested AVIC is
> >   actually used first time, so that write tracking is not always enabled 
> > when
> >   you just boot a VM with nested avic supported, since the VM might not use
> >   nested at all.
> >  
> > Thus I either need to use the __kvm_page_track_register_notifier too for 
> > AVIC
> > (and thus need to export it) or I need to have a boolean
> > (nested_avic_was_used_once) and register the write tracking notifier only
> > when false and do it not on VM creation but on first attempt to use nested
> > AVIC.
> >  
> > Do you think this is worth it? I mean there is some value of registering the
> > notifier only when needed (this way it is not called for nothing) but it 
> > does
> > complicate things a bit.
> 
> Compared to everything else that you're doing in the nested AVIC code, 
> refcounting
> the shared kvm_page_track_notifier_node object is a trivial amount of 
> complexity.
Makes sense.

> 
> And on that topic, do you have performance numbers to justify using a single
> shared node?  E.g. if every table instance has its own notifier, then no 
> additional
> refcounting is needed. 

The thing is that KVM goes over the list of notifiers and calls them for every write from
the emulator, in fact even just for an MMIO write. And when you enable write tracking on a
page, you just write-protect the page and add a mark in the page track array, which roughly
means:

'don't install an spte, don't install an mmio spte, just emulate the page fault if it hits
this page'

So adding more than the bare minimum to this list seems just a bit wrong.
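
For reference, the bare minimum I have in mind looks roughly like this (a sketch only; the
avic_* helpers and the field in kvm_svm are placeholders, not the real code):

/*
 * Sketch of a single shared page-track notifier for all shadow physid
 * tables of a VM.
 */
static void avic_physid_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
				    const u8 *new, int bytes,
				    struct kvm_page_track_notifier_node *node)
{
	/*
	 * This runs for every emulated write in the VM, so the common case
	 * (a write that doesn't hit a shadowed physid table) must bail fast.
	 */
	if (!avic_gfn_is_shadowed_physid_table(vcpu->kvm, gpa_to_gfn(gpa)))
		return;

	avic_update_shadow_physid_table(vcpu, gpa, new, bytes);
}

static void avic_register_page_track_notifier(struct kvm *kvm)
{
	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);

	kvm_svm->avic_track_node.track_write = avic_physid_track_write;
	kvm_page_track_register_notifier(kvm, &kvm_svm->avic_track_node);
}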


>  It's not obvious that a shared node will provide better
> performance, e.g. if there are only a handful of AVIC tables being shadowed, 
> then
> a linear walk of all nodes is likely fast enough, and doesn't bring the risk 
> of
> a write potentially being stalled due to having to acquire a VM-scoped mutex.

The thing is that if I register multiple notifiers, they will all be called anyway, but yes,
I can use container_of and discover which table the notifier belongs to, instead of having a
hash table where I look up the GFN of the fault.

Practically, the above means that all the shadow physid tables will be in a linear list of
notifiers, so I could indeed avoid a per-VM mutex on the write tracking; however, for
simplicity I will probably still need it, because I do modify the page, and having a
per-physid-table mutex complicates things.

Currently in my code the locking is very simple and somewhat dumb, but the performance is
very good because the code isn't executed often; most of the time the AVIC hardware works
alone without any VM exits.

Once the code is accepted upstream, this is one of the things that can be improved.


Note though that I still need a hash table and a mutex, because on each VM entry the guest
can use a different physid table, so I need to look it up, and create it if not found, which
requires reads/writes of the hash table and thus a mutex.
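
To illustrate what I mean (rough sketch only; the hash table, lock and helpers are
placeholders, not the real code):

/*
 * Per-VM-entry lookup-or-create of the shadow physid table the guest's
 * vmcb12 points at.
 */
static struct avic_physid_table *
avic_get_physid_table(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
	struct avic_physid_table *t;

	mutex_lock(&kvm_svm->avic_tables_lock);

	hash_for_each_possible(kvm_svm->avic_physid_tables, t, hnode, gfn) {
		if (t->gfn == gfn)
			goto out_unlock;
	}

	/* First use of this table by the guest: shadow it and start tracking it. */
	t = avic_create_and_track_physid_table(vcpu, gfn);

out_unlock:
	mutex_unlock(&kvm_svm->avic_tables_lock);
	return t;
}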



> 
> > I can also stash this boolean (like 'bool registered;') into the 'struct
> > kvm_page_track_notifier_node',  and thus allow the
> > kvm_page_track_register_notifier to be called more that once -  then I can
> > also get rid of __kvm_page_track_register_notifier. 
> 
> No, allowing redundant registration without proper refcounting leads to pain,
> e.g. X registers, Y registers, X unregisters, kaboom.
> 

True, but then what about adding a refcount to 'struct kvm_page_track_notifier_node' instead
of a boolean, and allowing redundant registration?
Probably not worth it, in which case I am OK with adding a refcount to my AVIC code.

Or maybe just scrap the whole thing and leave registration and activation of the write
tracking as two separate things? Honestly, that now looks like the cleanest solution.

Best regards,
Maxim Levitsky



Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally

2022-07-20 Thread Maxim Levitsky
On Sun, 2022-05-22 at 13:22 +0300, Maxim Levitsky wrote:
> On Thu, 2022-05-19 at 16:37 +, Sean Christopherson wrote:
> > On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > > @@ -5753,6 +5752,10 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> > > node->track_write = kvm_mmu_pte_write;
> > > node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> > > kvm_page_track_register_notifier(kvm, node);
> > 
> > Can you add a patch to move this call to kvm_page_track_register_notifier() 
> > into
> > mmu_enable_write_tracking(), and simultaneously add a WARN in the register 
> > path
> > that page tracking is enabled?
> > 
> > Oh, actually, a better idea. Add an inner 
> > __kvm_page_track_register_notifier()
> > that is not exported and thus used only by KVM, invoke 
> > mmu_enable_write_tracking()
> > from the exported kvm_page_track_register_notifier(), and then do the above.
> > That will require modifying KVMGT and KVM in a single patch, but that's ok.
> > 
> > That will avoid any possibility of an external user failing to enabling 
> > tracking
> > before registering its notifier, and also avoids bikeshedding over what to 
> > do with
> > the one-line wrapper to enable tracking.
> > 
> 
> This is a good idea as well, especially looking at kvmgt and seeing that
> it registers the page track notifier, when the vGPU is opened.
> 
> I'll do this in the next series.
> 
> Thanks for the review!

After putting some thought into this, I am not 100% sure anymore that I want to do it this way.

Let me explain the current state of things:

For mmu:
- the write tracking notifier is registered on VM initialization (that is pretty much always),
  and if it is called because write tracking was enabled for some other reason (currently only
  KVMGT), it checks the number of shadow mmu pages and, if zero, bails out.

- write tracking is enabled when a shadow root is allocated.

This can be kept as is by using the __kvm_page_track_register_notifier as you suggested.

For KVMGT:
- both write tracking and the notifier are enabled when a vGPU mdev device is first opened.
  That 'works' only because KVMGT doesn't allow assigning more than one mdev to the same VM,
  thus the per-VM notifier and the write tracking for that VM are enabled at the same time.
 
 
Now for nested AVIC, this is what I would like to do:

- just like the mmu, I prefer to register the write tracking notifier when the VM is created.
- just like the mmu, write tracking should only be enabled when nested AVIC is actually used
  for the first time, so that write tracking is not always enabled when you just boot a VM
  with nested AVIC supported, since the VM might not use nesting at all.

Thus I either need to use the __kvm_page_track_register_notifier too for AVIC (and thus need
to export it), or I need to have a boolean (nested_avic_was_used_once) and register the write
tracking notifier only when it is false, and do it not on VM creation but on the first attempt
to use nested AVIC.

Do you think this is worth it? I mean there is some value in registering the notifier only
when needed (this way it is not called for nothing), but it does complicate things a bit.
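
For concreteness, the second option (lazy registration guarded by the boolean) would look
roughly like this (a sketch only; the helper name and where the flag lives are placeholders,
kvm_page_track_write_tracking_enable is the export this patch adds):

/*
 * Sketch of lazy registration on first nested AVIC use.  The caller is
 * assumed to serialize this (e.g. under the VM's nested AVIC mutex).
 */
static int avic_nested_first_use(struct kvm *kvm)
{
	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
	int ret;

	if (kvm_svm->nested_avic_was_used_once)
		return 0;

	ret = kvm_page_track_write_tracking_enable(kvm);
	if (ret)
		return ret;

	kvm_page_track_register_notifier(kvm, &kvm_svm->avic_track_node);
	kvm_svm->nested_avic_was_used_once = true;
	return 0;
}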
 
I can also stash this boolean (like 'bool registered;') into 'struct
kvm_page_track_notifier_node', and thus allow kvm_page_track_register_notifier to be called
more than once; then I can also get rid of __kvm_page_track_register_notifier.

What do you think about this?
 
Best regards,
Maxim Levitsky


> 
> Best regards,
> Maxim Levitsky




Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.

2022-06-23 Thread Maxim Levitsky
On Thu, 2022-05-19 at 16:06 +, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > Neither of these settings should be changed by the guest and it is
> > a burden to support it in the acceleration code, so just inhibit
> > it instead.
> > 
> > Also add a boolean 'apic_id_changed' to indicate if apic id ever changed.
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  arch/x86/include/asm/kvm_host.h |  3 +++
> >  arch/x86/kvm/lapic.c| 25 ++---
> >  arch/x86/kvm/lapic.h|  8 
> >  3 files changed, 33 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index 63eae00625bda..636df87542555 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1070,6 +1070,8 @@ enum kvm_apicv_inhibit {
> > APICV_INHIBIT_REASON_ABSENT,
> > /* AVIC is disabled because SEV doesn't support it */
> > APICV_INHIBIT_REASON_SEV,
> > +   /* APIC ID and/or APIC base was changed by the guest */
> 
> I don't see any reason to inhibit APICv if the APIC base is changed.  KVM has
> never supported that, and disabling APICv won't "fix" anything.
> 
> Ignoring that is a minor simplification, but also allows for a more intuitive
> name, e.g.
> 
>   APICV_INHIBIT_REASON_APIC_ID_MODIFIED,
> 
> The inhibit also needs to be added avic_check_apicv_inhibit_reasons() and
> vmx_check_apicv_inhibit_reasons().
> 
> > +   APICV_INHIBIT_REASON_RO_SETTINGS,
> >  };
> >  
> >  struct kvm_arch {
> > @@ -1258,6 +1260,7 @@ struct kvm_arch {
> > hpa_t   hv_root_tdp;
> > spinlock_t hv_root_tdp_lock;
> >  #endif
> > +   bool apic_id_changed;
> >  };
> >  
> >  struct kvm_vm_stat {
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 66b0eb0bda94e..8996675b3ef4c 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -2038,6 +2038,19 @@ static void apic_manage_nmi_watchdog(struct 
> > kvm_lapic *apic, u32 lvt0_val)
> > }
> >  }
> >  
> > +static void kvm_lapic_check_initial_apic_id(struct kvm_lapic *apic)
> 
> The "check" part is misleading/confusing.  "check" helpers usually query and 
> return
> state.  I assume you avoided "changed" because the ID may or may not actually 
> be
> changing.  Maybe kvm_apic_id_updated()?  Ah, better idea.  What about
> kvm_lapic_xapic_id_updated()?  See below for reasoning.
> 
> > +{
> > +   if (kvm_apic_has_initial_apic_id(apic))
> 
> Rather than add a single-use helper, invoke the helper from 
> kvm_apic_state_fixup()
> in the !x2APIC path, then this can KVM_BUG_ON() x2APIC to help document that 
> KVM
> should never allow the ID to change for x2APIC.
> 
> > +   return;
> > +
> > +   pr_warn_once("APIC ID change is unsupported by KVM");
> 
> It's supported (modulo x2APIC shenanigans), otherwise KVM wouldn't need to 
> disable
> APICv.
> 
> > +   kvm_set_apicv_inhibit(apic->vcpu->kvm,
> > +   APICV_INHIBIT_REASON_RO_SETTINGS);
> > +
> > +   apic->vcpu->kvm->arch.apic_id_changed = true;
> > +}
> > +
> >  static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> >  {
> > int ret = 0;
> > @@ -2046,9 +2059,11 @@ static int kvm_lapic_reg_write(struct kvm_lapic 
> > *apic, u32 reg, u32 val)
> >  
> > switch (reg) {
> > case APIC_ID:   /* Local APIC ID */
> > -   if (!apic_x2apic_mode(apic))
> > +   if (!apic_x2apic_mode(apic)) {
> > +
> 
> Spurious newline.
> 
> > kvm_apic_set_xapic_id(apic, val >> 24);
> > -   else
> > +   kvm_lapic_check_initial_apic_id(apic);
> > +   } else
> 
> Needs curly braces for both paths.
> 
> > ret = 1;
> > break;
> >  
> 
> E.g.
> 
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/lapic.c| 21 +++--
>  arch/x86/kvm/svm/avic.c |  3 ++-
>  arch/x86/kvm/vmx/vmx.c  |  3 ++-
>  4 files changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d895d25c5b2f..d888fa1bae77 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h

Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.

2022-05-22 Thread Maxim Levitsky
On Sun, 2022-05-22 at 07:47 -0700, Jim Mattson wrote:
> On Sun, May 22, 2022 at 2:03 AM Maxim Levitsky  wrote:
> > On Thu, 2022-05-19 at 16:06 +, Sean Christopherson wrote:
> > > On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > > > Neither of these settings should be changed by the guest and it is
> > > > a burden to support it in the acceleration code, so just inhibit
> > > > it instead.
> > > > 
> > > > Also add a boolean 'apic_id_changed' to indicate if apic id ever 
> > > > changed.
> > > > 
> > > > Signed-off-by: Maxim Levitsky 
> > > > ---
> > > > +   return;
> > > > +
> > > > +   pr_warn_once("APIC ID change is unsupported by KVM");
> > > 
> > > It's supported (modulo x2APIC shenanigans), otherwise KVM wouldn't need 
> > > to disable
> > > APICv.
> > 
> > Here, as I said, it would be nice to see that warning if someone complains.
> > Fact is that AVIC code was totally broken in this regard, and there are 
> > probably more,
> > so it would be nice to see if anybody complains.
> > 
> > If you insist, I'll remove this warning.
> 
> This may be fine for a hobbyist, but it's a terrible API in an
> enterprise environment. To be honest, I have no way of propagating
> this warning from /var/log/messages on a particular host to a
> potentially impacted customer. Worse, if they're not the first
> impacted customer since the last host reboot, there's no warning to
> propagate. I suppose I could just tell every later customer, "Your VM
> was scheduled to run on a host that previously reported, 'APIC ID
> change is unsupported by KVM.' If you notice any unusual behavior,
> that might be the reason for it," but that isn't going to inspire
> confidence. I could schedule a drain and reboot of the host, but that
> defeats the whole point of the "_once" suffix.

Mostly agree, and I have already read a few discussions about exactly this: those warnings
are mostly useless, but they are used in the cases where we don't have the courage to just
exit with KVM_EXIT_INTERNAL_ERROR.

I do not think though that the warning is completely useless, as we often have the kernel log
of the target machine when things go wrong, so *we* can notice it.
In other words, a kernel warning is mostly useless, but better than nothing.

About KVM_EXIT_WARNING, this is IMHO a very good idea, probably combined with some form of
taint flag, which could be read by qemu and then shown over the hmp/qmp interfaces.

Best regards,
Maxim levitsky


> 
> I know that there's a long history of doing this in KVM, but I'd like
> to ask that we:
> a) stop piling on
> b) start fixing the existing uses
> 
> If KVM cannot emulate a perfectly valid operation, an exit to
> userspace with KVM_EXIT_INTERNAL_ERROR is warranted. Perhaps for
> operations that we suspect KVM might get wrong, we should have a new
> userspace exit: KVM_EXIT_WARNING?
> 
> I'm not saying that you should remove the warning. I'm just asking
> that it be augmented with a direct signal to userspace that KVM may no
> longer be reliable.
> 




Re: [RFC PATCH v3 06/19] KVM: x86: mmu: add gfn_in_memslot helper

2022-05-22 Thread Maxim Levitsky
On Thu, 2022-05-19 at 16:43 +, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > This is a tiny refactoring, and can be useful to check
> > if a GPA/GFN is within a memslot a bit more cleanly.
> 
> This doesn't explain the actual motivation, which is to use the new helper 
> from
> arch code.
I'll add this in the next version
> 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  include/linux/kvm_host.h | 10 +-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 252ee4a61b58b..12e261559070b 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1580,6 +1580,13 @@ int kvm_request_irq_source_id(struct kvm *kvm);
> >  void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
> >  bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
> >  
> > +
> > +static inline bool gfn_in_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
> > +{
> > +   return (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages);
> > +}
> > +
> 
> Spurious newline.
> 
> > +
> >  /*
> >   * Returns a pointer to the memslot if it contains gfn.
> >   * Otherwise returns NULL.
> > @@ -1590,12 +1597,13 @@ try_get_memslot(struct kvm_memory_slot *slot, gfn_t 
> > gfn)
> > if (!slot)
> > return NULL;
> >  
> > -   if (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
> > +   if (gfn_in_memslot(slot, gfn))
> > return slot;
> > else
> > return NULL;
> 
> At this point, maybe:

No objections.

Thanks for the review.

Best regards,
Maxim Levitsky

> 
>   if (!slot || !gfn_in_memslot(slot, gfn))
>   return NULL;
> 
>   return slot;
> 
> >  }
> >  
> > +
> >  /*
> >   * Returns a pointer to the memslot that contains gfn. Otherwise returns 
> > NULL.
> >   *
> > -- 
> > 2.26.3
> > 




Re: [RFC PATCH v3 14/19] KVM: x86: rename .set_apic_access_page_addr to reload_apic_access_page

2022-05-22 Thread Maxim Levitsky
On Thu, 2022-05-19 at 16:55 +, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > This will be used on SVM to reload shadow page of the AVIC physid table
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index d2f73ce87a1e3..ad744ab99734c 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -9949,12 +9949,12 @@ void kvm_arch_mmu_notifier_invalidate_range(struct 
> > kvm *kvm,
> > kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> >  }
> >  
> > -static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> > +static void kvm_vcpu_reload_apic_pages(struct kvm_vcpu *vcpu)
> >  {
> > if (!lapic_in_kernel(vcpu))
> > return;
> >  
> > -   static_call_cond(kvm_x86_set_apic_access_page_addr)(vcpu);
> > +   static_call_cond(kvm_x86_reload_apic_pages)(vcpu);
> >  }
> >  
> >  void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
> > @@ -10071,7 +10071,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > if (kvm_check_request(KVM_REQ_LOAD_EOI_EXITMAP, vcpu))
> > vcpu_load_eoi_exitmap(vcpu);
> > if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
> > -   kvm_vcpu_reload_apic_access_page(vcpu);
> > +   kvm_vcpu_reload_apic_pages(vcpu);
> 
> My vote is to add a new request and new kvm_x86_ops hook instead of piggybacking
> KVM_REQ_APIC_PAGE_RELOAD.  The usage in kvm_arch_mmu_notifier_invalidate_range()
> very subtly relies on the memslot and vma being allocated/controlled by KVM.
> 
> The use in avic_physid_shadow_table_flush_memslot() is too similar in that it
> also deals with memslot changes, but at the same time is _very_ different in 
> that
> it's dealing with user controlled memslots.
> 

No objections, will do.

Best regards,
Maxim Levitsky



Re: [RFC PATCH v3 06/19] KVM: x86: mmu: add gfn_in_memslot helper

2022-05-22 Thread Maxim Levitsky
On Thu, 2022-05-19 at 16:43 +, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > This is a tiny refactoring, and can be useful to check
> > if a GPA/GFN is within a memslot a bit more cleanly.
> 
> This doesn't explain the actual motivation, which is to use the new helper 
> from
> arch code.
I'll add this in the next version
> 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  include/linux/kvm_host.h | 10 +-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 252ee4a61b58b..12e261559070b 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1580,6 +1580,13 @@ int kvm_request_irq_source_id(struct kvm *kvm);
> >  void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
> >  bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
> >  
> > +
> > +static inline bool gfn_in_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
> > +{
> > +   return (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages);
> > +}
> > +
> 
> Spurious newline.
> 
> > +
> >  /*
> >   * Returns a pointer to the memslot if it contains gfn.
> >   * Otherwise returns NULL.
> > @@ -1590,12 +1597,13 @@ try_get_memslot(struct kvm_memory_slot *slot, gfn_t 
> > gfn)
> > if (!slot)
> > return NULL;
> >  
> > -   if (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
> > +   if (gfn_in_memslot(slot, gfn))
> > return slot;
> > else
> > return NULL;
> 
> At this point, maybe:

No objections.

Thanks for the review.

Best regards,
Maxim Levitsky

> 
>   if (!slot || !gfn_in_memslot(slot, gfn))
>   return NULL;
> 
>   return slot;
> 
> >  }
> >  
> > +
> >  /*
> >   * Returns a pointer to the memslot that contains gfn. Otherwise returns 
> > NULL.
> >   *
> > -- 
> > 2.26.3
> > 




Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally

2022-05-22 Thread Maxim Levitsky
On Thu, 2022-05-19 at 16:37 +, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > @@ -5753,6 +5752,10 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> > node->track_write = kvm_mmu_pte_write;
> > node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> > kvm_page_track_register_notifier(kvm, node);
> 
> Can you add a patch to move this call to kvm_page_track_register_notifier() 
> into
> mmu_enable_write_tracking(), and simultaneously add a WARN in the register 
> path
> that page tracking is enabled?
> 
> Oh, actually, a better idea. Add an inner __kvm_page_track_register_notifier()
> that is not exported and thus used only by KVM, invoke 
> mmu_enable_write_tracking()
> from the exported kvm_page_track_register_notifier(), and then do the above.
> That will require modifying KVMGT and KVM in a single patch, but that's ok.
> 
> That will avoid any possibility of an external user failing to enabling 
> tracking
> before registering its notifier, and also avoids bikeshedding over what to do 
> with
> the one-line wrapper to enable tracking.
> 

This is a good idea as well, especially looking at kvmgt and seeing that
it registers the page track notifier, when the vGPU is opened.

I'll do this in the next series.

Thanks for the review!

Best regards,
Maxim Levitsky



Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally

2022-05-22 Thread Maxim Levitsky
On Thu, 2022-05-19 at 16:27 +, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > This will be used to enable write tracking from nested AVIC code
> > and can also be used to enable write tracking in GVT-g module
> > when it actually uses it as opposed to always enabling it,
> > when the module is compiled in the kernel.
> 
> Wrap at ~75.
Well, the checkpatch.pl didn't complain, so I didn't notice.

> 
> > No functional change intended.
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  arch/x86/include/asm/kvm_host.h   |  2 +-
> >  arch/x86/include/asm/kvm_page_track.h |  1 +
> >  arch/x86/kvm/mmu.h|  8 +---
> >  arch/x86/kvm/mmu/mmu.c| 17 ++---
> >  arch/x86/kvm/mmu/page_track.c | 10 --
> >  5 files changed, 25 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index 636df87542555..fc7df778a3d71 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1254,7 +1254,7 @@ struct kvm_arch {
> >  * is used as one input when determining whether certain memslot
> >  * related allocations are necessary.
> >  */
> 
> The above comment needs to be rewritten.
Good catch, thanks a lot!!

> 
> > -   bool shadow_root_allocated;
> > +   bool mmu_page_tracking_enabled;
> >  #if IS_ENABLED(CONFIG_HYPERV)
> > hpa_t   hv_root_tdp;
> > diff --git a/arch/x86/include/asm/kvm_page_track.h 
> > b/arch/x86/include/asm/kvm_page_track.h
> > index eb186bc57f6a9..955a5ae07b10e 100644
> > --- a/arch/x86/include/asm/kvm_page_track.h
> > +++ b/arch/x86/include/asm/kvm_page_track.h
> > @@ -50,6 +50,7 @@ int kvm_page_track_init(struct kvm *kvm);
> >  void kvm_page_track_cleanup(struct kvm *kvm);
> >  
> >  bool kvm_page_track_write_tracking_enabled(struct kvm *kvm);
> > +int kvm_page_track_write_tracking_enable(struct kvm *kvm);
> >  int kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot);
> >  
> >  void kvm_page_track_free_memslot(struct kvm_memory_slot *slot);
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 671cfeccf04e9..44d15551f7156 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -269,7 +269,7 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
> >  int kvm_mmu_post_init_vm(struct kvm *kvm);
> >  void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
> >  
> > -static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
> > +static inline bool mmu_page_tracking_enabled(struct kvm *kvm)
> >  {
> > /*
> >  * Read shadow_root_allocated before related pointers. Hence, threads
> > @@ -277,9 +277,11 @@ static inline bool kvm_shadow_root_allocated(struct 
> > kvm *kvm)
> >  * see the pointers. Pairs with smp_store_release in
> >  * mmu_first_shadow_root_alloc.
> >  */
> 
> This comment also needs to be rewritten.
Also thanks a lot, next time I'll check comments better.

> 
> > -   return smp_load_acquire(&kvm->arch.shadow_root_allocated);
> > +   return smp_load_acquire(&kvm->arch.mmu_page_tracking_enabled);
> >  }
> 
> ...
> 
> > diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
> > index 2e09d1b6249f3..8857d629036d7 100644
> > --- a/arch/x86/kvm/mmu/page_track.c
> > +++ b/arch/x86/kvm/mmu/page_track.c
> > @@ -21,10 +21,16 @@
> >  
> >  bool kvm_page_track_write_tracking_enabled(struct kvm *kvm)
> 
> This can be static, it's now used only by page_track.c.
I'll fix this.
> 
> >  {
> > -   return IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) ||
> > -  !tdp_enabled || kvm_shadow_root_allocated(kvm);
> > +   return mmu_page_tracking_enabled(kvm);
> >  }
> >  
> > +int kvm_page_track_write_tracking_enable(struct kvm *kvm)
> 
> This is too similar to the "enabled" version; 
> "kvm_page_track_enable_write_tracking()"
> would maintain namespacing and be less confusing.
Makes sense, thanks, will do!

> 
> Hmm, I'd probably vote to make this a "static inline" in kvm_page_track.h, and
> rename mmu_enable_write_tracking() to kvm_mmu_enable_write_tracking and 
> export.
> Not a strong preference, just feels silly to export a one-liner.

The sole reason I did it this way is that 'page_track.c' then contains all the interfaces
that an external user of write tracking needs to use.

> 
> > +{
> > +   return mmu_enable_write_tracking(kvm);
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_page_track_write_tracking_enable);
> > +
> > +
> >  void kvm_page_track_free_memslot(struct kvm_memory_slot *slot)
> >  {
> > int i;
> > -- 
> > 2.26.3
> > 

Best regards,
Maxim Levitsky





Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.

2022-05-22 Thread Maxim Levitsky
On Thu, 2022-05-19 at 16:06 +, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > Neither of these settings should be changed by the guest and it is
> > a burden to support it in the acceleration code, so just inhibit
> > it instead.
> > 
> > Also add a boolean 'apic_id_changed' to indicate if apic id ever changed.
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  arch/x86/include/asm/kvm_host.h |  3 +++
> >  arch/x86/kvm/lapic.c| 25 ++---
> >  arch/x86/kvm/lapic.h|  8 
> >  3 files changed, 33 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index 63eae00625bda..636df87542555 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1070,6 +1070,8 @@ enum kvm_apicv_inhibit {
> > APICV_INHIBIT_REASON_ABSENT,
> > /* AVIC is disabled because SEV doesn't support it */
> > APICV_INHIBIT_REASON_SEV,
> > +   /* APIC ID and/or APIC base was changed by the guest */
> 
> I don't see any reason to inhibit APICv if the APIC base is changed.  KVM has
> never supported that, and disabling APICv won't "fix" anything.

I kind of tacked the APIC base onto this just to be a good citizen.

In theory currently if the guest changes the APIC base, neither APICv
nor AVIC will even notice, so the guest will still be able to access the
default APIC base and the new APIC base, which is kind of wrong.

Inhibiting APICv/AVIC in this case makes it better and it is very cheap to do.

If you still think that it shouldn't be done, I'll remove it.


> 
> Ignoring that is a minor simplification, but also allows for a more intuitive
> name, e.g.
> 
>   APICV_INHIBIT_REASON_APIC_ID_MODIFIED,
> 
> The inhibit also needs to be added avic_check_apicv_inhibit_reasons() and
> vmx_check_apicv_inhibit_reasons().
> 
> > +   APICV_INHIBIT_REASON_RO_SETTINGS,

> >  };
> >  
> >  struct kvm_arch {
> > @@ -1258,6 +1260,7 @@ struct kvm_arch {
> > hpa_t   hv_root_tdp;
> > spinlock_t hv_root_tdp_lock;
> >  #endif
> > +   bool apic_id_changed;
> >  };
> >  
> >  struct kvm_vm_stat {
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 66b0eb0bda94e..8996675b3ef4c 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -2038,6 +2038,19 @@ static void apic_manage_nmi_watchdog(struct 
> > kvm_lapic *apic, u32 lvt0_val)
> > }
> >  }
> >  
> > +static void kvm_lapic_check_initial_apic_id(struct kvm_lapic *apic)
> 
> The "check" part is misleading/confusing.  "check" helpers usually query and 
> return
> state.  I assume you avoided "changed" because the ID may or may not actually 
> be
> changing.  Maybe kvm_apic_id_updated()?  Ah, better idea.  What about
> kvm_lapic_xapic_id_updated()?  See below for reasoning.

This is a very good idea!

> 
> > +{
> > +   if (kvm_apic_has_initial_apic_id(apic))
> 
> Rather than add a single-use helper, invoke the helper from 
> kvm_apic_state_fixup()
> in the !x2APIC path, then this can KVM_BUG_ON() x2APIC to help document that 
> KVM
> should never allow the ID to change for x2APIC.

Yes, but we do allow a non-default x2APIC ID via the userspace API - I wasn't able to
convince you to remove this :)

> 
> > +   return;
> > +
> > +   pr_warn_once("APIC ID change is unsupported by KVM");
> 
> It's supported (modulo x2APIC shenanigans), otherwise KVM wouldn't need to 
> disable
> APICv.

Here, as I said, it would be nice to see that warning if someone complains.
The fact is that the AVIC code was totally broken in this regard, and there are probably
more bugs like that, so it would be nice to see if anybody complains.

If you insist, I'll remove this warning.

> 
> > +   kvm_set_apicv_inhibit(apic->vcpu->kvm,
> > +   APICV_INHIBIT_REASON_RO_SETTINGS);
> > +
> > +   apic->vcpu->kvm->arch.apic_id_changed = true;
> > +}
> > +
> >  static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> >  {
> > int ret = 0;
> > @@ -2046,9 +2059,11 @@ static int kvm_lapic_reg_write(struct kvm_lapic 
> > *apic, u32 reg, u32 val)
> >  
> > switch (reg) {
> > case APIC_ID:   /* Local APIC ID */
> > -   if (!apic_x2apic_mode(apic))
> > +   if (!apic_x2apic_mode(apic)) {
> > +
> 
> Spurious newline.
Will fix.
> 
>

Re: [RFC PATCH v3 03/19] KVM: x86: SVM: remove avic's broken code that updated APIC ID

2022-05-22 Thread Maxim Levitsky
On Thu, 2022-05-19 at 16:10 +, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > AVIC is now inhibited if the guest changes apic id, thus remove
> > that broken code.
> 
> Can you explicitly call out what's broken?  Just something short on the code 
> not
> handling the scenario where APIC ID is changed back to vcpu_id to help future
> archaeologists.  I forget if there are other bugs...
> 


Well, avic_handle_apic_id_update() is called each time AVIC is uninhibited, because while it
is inhibited, the AVIC code doesn't track changes to the APIC ID and such.

Also there are many ways in which it is broken, for example:

1. A vCPU can't move its APIC ID to a free slot due to the (!new) check.

2. If an APIC ID is moved to a used slot, then the vCPU that used the overwritten slot can't
   correctly move its own ID anymore, since the slot is now not its own, not to mention races.

BTW, if you see value in it, I can fix this code instead: a lock plus going over all the APIC
IDs should be quite easy to implement. In case of two vCPUs using the same APIC ID, I can
write a non-present entry to the table, so that neither can be addressed, hoping that the
situation is only temporary.

Same can be done for IPIv.
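
To give an idea of the size of that fix, it would be something along these lines (purely a
sketch; all the avic_* helpers here are hypothetical and the lock choice is illustrative):

/*
 * Rebuild the whole physical ID table under a lock whenever the guest
 * changes an APIC ID, writing a non-present entry for any APIC ID that
 * ends up claimed by two vCPUs.
 */
static void avic_rebuild_physical_id_table(struct kvm *kvm)
{
	DECLARE_BITMAP(used_ids, 256);
	struct kvm_vcpu *vcpu;
	unsigned long i;

	bitmap_zero(used_ids, 256);

	mutex_lock(&kvm->arch.apic_map_lock);	/* or a dedicated AVIC lock */

	avic_clear_physical_id_table(kvm);

	kvm_for_each_vcpu(i, vcpu, kvm) {
		u32 id = kvm_xapic_id(vcpu->arch.apic);

		if (test_and_set_bit(id, used_ids)) {
			/* Two vCPUs claim this ID: make the slot unaddressable. */
			avic_write_non_present_entry(kvm, id);
		} else {
			avic_write_physical_id_entry(kvm, vcpu, id);
		}
	}

	mutex_unlock(&kvm->arch.apic_map_lock);
}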

Best regards,
Maxim Levitsky



Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.

2022-05-18 Thread Maxim Levitsky
On Wed, 2022-05-18 at 15:39 +, Sean Christopherson wrote:
> On Wed, May 18, 2022, Maxim Levitsky wrote:
> > On Wed, 2022-05-18 at 16:28 +0800, Chao Gao wrote:
> > > > struct kvm_arch {
> > > > @@ -1258,6 +1260,7 @@ struct kvm_arch {
> > > > hpa_t   hv_root_tdp;
> > > > spinlock_t hv_root_tdp_lock;
> > > > #endif
> > > > +   bool apic_id_changed;
> > > 
> > > What's the value of this boolean? No one reads it.
> > 
> > I use it in later patches to kill the guest during nested VM entry 
> > if it attempts to use nested AVIC after any vCPU changed APIC ID.
> 
> Then the flag should be introduced in the later patch, because (a) it's dead 
> code
> if that patch is never merged and (b) it's impossible to review this patch for
> correctness without seeing the usage, e.g. setting apic_id_changed isn't 
> guarded
> with a lock and so the usage may or may not be susceptible to races.

I can't disagree with you on this; this was just somewhat of a hack I wasn't sure about (and
am not yet 100% sure I will move forward with), so I cut this corner.

Thanks for the review!

Best regards,
Maxim Levitsky

> 
> > > > +   apic->vcpu->kvm->arch.apic_id_changed = true;
> > > > +}
> > > > +




Re: [RFC PATCH v3 01/19] KVM: x86: document AVIC/APICv inhibit reasons

2022-05-18 Thread Maxim Levitsky
On Wed, 2022-05-18 at 15:56 +, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > These days there are too many AVIC/APICv inhibit
> > reasons, and it doesn't hurt to have some documentation
> > for them.
> 
> Please wrap at ~75 chars.
> 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  arch/x86/include/asm/kvm_host.h | 15 +++
> >  1 file changed, 15 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index f164c6c1514a4..63eae00625bda 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1046,14 +1046,29 @@ struct kvm_x86_msr_filter {
> >  };
> >  
> >  enum kvm_apicv_inhibit {
> > +   /* APICv/AVIC is disabled by module param and/or not supported in 
> > hardware */
> 
> Rather than tag every one as APICv vs. AVIC, what about reorganizing the 
> enums so
> that the common vs. AVIC flags are bundled together?  And then the redundant 
> info
> in the comments about "XYZ is inhibited" can go away too, i.e. the individual
> comments can focus on explaining what triggers the inhibit (and for some, why 
> that
> action is incompatible with APIC virtualization).

Very good idea, will do!

Best regards,
Maxim Levitsky

> 
> E.g.
>   /***/
>   /* INHIBITs are relevant to both Intel's APICv and AMD's AVIC. */
>   /***/
> 
>   /* APIC/AVIC is unsupported and/or disabled via module param. */
>   APICV_INHIBIT_REASON_DISABLE,
> 
>   /* The local APIC is not in-kernel.  See KVM_CREATE_IRQCHIP. */
>   APICV_INHIBIT_REASON_ABSENT,
> 
>   /*
>* At least one IRQ vector is configured for HyperV's AutoEOI, which
>* requires manually injecting the IRQ to do EOI on behalf of the guest.
>*/
>   APICV_INHIBIT_REASON_HYPERV,
>   
> 
>   /**/
>   /* INHIBITs relevant only to AMD's AVIC. */
>   /**/
> 
> > APICV_INHIBIT_REASON_DISABLE,
> > +   /* APICv/AVIC is inhibited because AutoEOI feature is being used by a 
> > HyperV guest*/
> > APICV_INHIBIT_REASON_HYPERV,
> > +   /* AVIC is inhibited on a CPU because it runs a nested guest */
> > APICV_INHIBIT_REASON_NESTED,
> > +   /* AVIC is inhibited due to wait for an irq window (AVIC doesn't 
> > support this) */
> > APICV_INHIBIT_REASON_IRQWIN,
> > +   /*
> > +* AVIC is inhibited because i8254 're-inject' mode is used
> > +* which needs EOI intercept which AVIC doesn't support
> > +*/
> > APICV_INHIBIT_REASON_PIT_REINJ,
> > +   /* AVIC is inhibited because the guest has x2apic in its CPUID*/
> > APICV_INHIBIT_REASON_X2APIC,
> > +   /* AVIC/APICv is inhibited because KVM_GUESTDBG_BLOCKIRQ was enabled */
> > APICV_INHIBIT_REASON_BLOCKIRQ,
> > +   /*
> > +* AVIC/APICv is inhibited because the guest didn't yet
> 
> s/guest/userspace
> 
> > +* enable kernel/split irqchip
> > +*/
> > APICV_INHIBIT_REASON_ABSENT,
> > +   /* AVIC is disabled because SEV doesn't support it */
> > APICV_INHIBIT_REASON_SEV,
> >  };
> >  
> > -- 
> > 2.26.3
> > 




Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.

2022-05-18 Thread Maxim Levitsky
On Wed, 2022-05-18 at 19:51 +0800, Chao Gao wrote:
> On Wed, May 18, 2022 at 12:50:27PM +0300, Maxim Levitsky wrote:
> > > > struct kvm_arch {
> > > > @@ -1258,6 +1260,7 @@ struct kvm_arch {
> > > > hpa_t   hv_root_tdp;
> > > > spinlock_t hv_root_tdp_lock;
> > > > #endif
> > > > +   bool apic_id_changed;
> > > 
> > > What's the value of this boolean? No one reads it.
> > 
> > I use it in later patches to kill the guest during nested VM entry 
> > if it attempts to use nested AVIC after any vCPU changed APIC ID.
> > 
> > I mentioned this boolean in the commit description.
> > 
> > This boolean avoids the need to go over all vCPUs and checking
> > if they still have the initial apic id.
> 
> Do you want to kill the guest if APIC base got changed? If yes,
> you can check if APICV_INHIBIT_REASON_RO_SETTINGS is set and save
> the boolean.

Yep, I threw in the APIC base just because I can. It doesn't matter to my nested AVIC logic
at all, but since it is also something that guests don't change, I also don't care if this
leads to an inhibit and to killing the guest if it attempts to use nested AVIC.

That boolean should have the same value as the APICV_INHIBIT_REASON_RO_SETTINGS inhibit, so
yes, I can instead check whether the inhibit is active.

I don't know if that is cleaner than this boolean though; an individual inhibit value is
currently not something that anybody uses in logic.
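
If we went that way though, the check itself would be trivial (sketch; assuming the inhibit
bitmask stays as it is today and the new inhibit from this series is merged):

/* Derive 'the guest changed an APIC ID' from the inhibit itself. */
static bool kvm_apic_id_was_changed(struct kvm *kvm)
{
	return test_bit(APICV_INHIBIT_REASON_RO_SETTINGS,
			&kvm->arch.apicv_inhibit_reasons);
}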

Best regards,
Maxim Levitsky


> 
> > In the future maybe we can introduce a more generic 'taint'
> > bitmap with various flags like that, indicating that the guest
> > did something unexpected.
> > 
> > BTW, the other option in regard to the nested AVIC is just to ignore this 
> > issue completely.
> > The code itself always uses vcpu_id's, thus regardless of when/how often 
> > the guest changes
> > its apic ids, my code would just use the initial APIC ID values 
> > consistently.
> > 
> > In this case I won't need this boolean.
> > 
> > > > };
> > > > 
> > > > struct kvm_vm_stat {
> > > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > > > index 66b0eb0bda94e..8996675b3ef4c 100644
> > > > --- a/arch/x86/kvm/lapic.c
> > > > +++ b/arch/x86/kvm/lapic.c
> > > > @@ -2038,6 +2038,19 @@ static void apic_manage_nmi_watchdog(struct 
> > > > kvm_lapic *apic, u32 lvt0_val)
> > > > }
> > > > }
> > > > 
> > > > +static void kvm_lapic_check_initial_apic_id(struct kvm_lapic *apic)
> > > > +{
> > > > +   if (kvm_apic_has_initial_apic_id(apic))
> > > > +   return;
> > > > +
> > > > +   pr_warn_once("APIC ID change is unsupported by KVM");
> > > 
> > > It is misleading because changing xAPIC ID is supported by KVM; it just
> > > isn't compatible with APICv. Probably this pr_warn_once() should be
> > > removed.
> > 
> > Honestly since nobody uses this feature, I am not sure if to call this 
> > supported,
> > I am sure that KVM has more bugs in regard of using non standard APIC ID.
> > This warning might hopefuly make someone complain about it if this
> > feature is actually used somewhere.
> 
> Now I got you. It is fine to me.
> 




Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.

2022-05-18 Thread Maxim Levitsky
On Wed, 2022-05-18 at 16:28 +0800, Chao Gao wrote:
> On Wed, Apr 27, 2022 at 11:02:57PM +0300, Maxim Levitsky wrote:
> > Neither of these settings should be changed by the guest and it is
> > a burden to support it in the acceleration code, so just inhibit
> > it instead.
> > 
> > Also add a boolean 'apic_id_changed' to indicate if apic id ever changed.
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> > arch/x86/include/asm/kvm_host.h |  3 +++
> > arch/x86/kvm/lapic.c| 25 ++---
> > arch/x86/kvm/lapic.h|  8 
> > 3 files changed, 33 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index 63eae00625bda..636df87542555 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1070,6 +1070,8 @@ enum kvm_apicv_inhibit {
> > APICV_INHIBIT_REASON_ABSENT,
> > /* AVIC is disabled because SEV doesn't support it */
> > APICV_INHIBIT_REASON_SEV,
> > +   /* APIC ID and/or APIC base was changed by the guest */
> > +   APICV_INHIBIT_REASON_RO_SETTINGS,
> 
> You need to add it to check_apicv_inhibit_reasons as well.
True, forgot about it.

> 
> > };
> > 
> > struct kvm_arch {
> > @@ -1258,6 +1260,7 @@ struct kvm_arch {
> > hpa_t   hv_root_tdp;
> > spinlock_t hv_root_tdp_lock;
> > #endif
> > +   bool apic_id_changed;
> 
> What's the value of this boolean? No one reads it.

I use it in later patches to kill the guest during nested VM entry 
if it attempts to use nested AVIC after any vCPU changed APIC ID.

I mentioned this boolean in the commit description.

This boolean avoids the need to go over all vCPUs and check whether they still have their
initial APIC ID.

In the future maybe we can introduce a more generic 'taint' bitmap with various flags like
that, indicating that the guest did something unexpected.

BTW, the other option in regard to nested AVIC is just to ignore this issue completely.
The code itself always uses vcpu_ids, thus regardless of when/how often the guest changes
its APIC IDs, my code would just use the initial APIC ID values consistently.

In this case I won't need this boolean.

> 
> > };
> > 
> > struct kvm_vm_stat {
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 66b0eb0bda94e..8996675b3ef4c 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -2038,6 +2038,19 @@ static void apic_manage_nmi_watchdog(struct 
> > kvm_lapic *apic, u32 lvt0_val)
> > }
> > }
> > 
> > +static void kvm_lapic_check_initial_apic_id(struct kvm_lapic *apic)
> > +{
> > +   if (kvm_apic_has_initial_apic_id(apic))
> > +   return;
> > +
> > +   pr_warn_once("APIC ID change is unsupported by KVM");
> 
> It is misleading because changing xAPIC ID is supported by KVM; it just
> isn't compatible with APICv. Probably this pr_warn_once() should be
> removed.

Honestly, since nobody uses this feature, I am not sure it can be called supported; I am
sure that KVM has more bugs in regard to using non-standard APIC IDs.
This warning might hopefully make someone complain about it if this feature is actually
used somewhere.

> 
> > +
> > +   kvm_set_apicv_inhibit(apic->vcpu->kvm,
> > +   APICV_INHIBIT_REASON_RO_SETTINGS);
> 
> The indentation here looks incorrect to me.
>   kvm_set_apicv_inhibit(apic->vcpu->kvm,
> APICV_INHIBIT_REASON_RO_SETTINGS);

True, will fix.

> 
> > +
> > +   apic->vcpu->kvm->arch.apic_id_changed = true;
> > +}
> > +
> > static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> > {
> > int ret = 0;
> > @@ -2046,9 +2059,11 @@ static int kvm_lapic_reg_write(struct kvm_lapic 
> > *apic, u32 reg, u32 val)
> > 
> > switch (reg) {
> > case APIC_ID:   /* Local APIC ID */
> > -   if (!apic_x2apic_mode(apic))
> > +   if (!apic_x2apic_mode(apic)) {
> > +
> > kvm_apic_set_xapic_id(apic, val >> 24);
> > -   else
> > +   kvm_lapic_check_initial_apic_id(apic);
> > +   } else
> > ret = 1;
> > break;
> > 
> > @@ -2335,8 +2350,11 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 
> > value)
> >  MSR_IA32_APICBASE_BASE;
> > 
> > if ((value & MSR_IA32_APICBASE_ENABLE) &

[RFC PATCH v3 19/19] KVM: x86: nSVM: expose the nested AVIC to the guest

2022-04-27 Thread Maxim Levitsky
This patch enables the nested AVIC and exposes support for it
to the nested guest.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/svm.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 099329711ad13..431281ccc40ef 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4087,6 +4087,9 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu 
*vcpu)
if (guest_cpuid_has(vcpu, X86_FEATURE_X2APIC))
kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_X2APIC);
}
+
+   svm->avic_enabled = enable_apicv && guest_cpuid_has(vcpu, 
X86_FEATURE_AVIC);
+
init_vmcb_after_set_cpuid(vcpu);
 }
 
@@ -4827,6 +4830,9 @@ static __init void svm_set_cpu_caps(void)
if (vgif)
kvm_cpu_cap_set(X86_FEATURE_VGIF);
 
+   if (enable_apicv)
+   kvm_cpu_cap_set(X86_FEATURE_AVIC);
+
/* Nested VM can receive #VMEXIT instead of triggering #GP */
kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
}
-- 
2.26.3



[RFC PATCH v3 18/19] KVM: x86: SVM/nSVM: add optional non strict AVIC doorbell mode

2022-04-27 Thread Maxim Levitsky
By default, peers of a vCPU can send it doorbell messages only
when that vCPU is loaded on (assigned to) a physical CPU.

However, when doorbell messages are not allowed, all of the vCPU's
peers get VM exits, which is suboptimal when this vCPU is not halted,
and is therefore only temporarily not running in guest mode because it
was scheduled out and/or had a userspace VM exit.

In this case the peers can't make this vCPU enter guest mode any
faster, so the VM exits they take accomplish nothing.

Therefore this patch introduces a new non-strict mode (disabled by
default, enabled by setting the avic_doorbell_strict kvm_amd module
param to 0) in which, when a vCPU is scheduled out but not halted,
its peers can continue sending doorbell messages to the last physical
CPU where that vCPU was running.

Security-wise, in this mode a malicious guest with a compromised guest
kernel can in some cases slow down whatever is running on the last
physical CPU where one of its vCPUs was running, by spamming that CPU
with doorbell messages (hammering on the ICR) from another of its vCPUs.

Thus this mode is disabled by default.

However, if the admin policy is a 1:1 vCPU/pCPU mapping, this mode can
be useful to avoid VM exits when a vCPU takes a userspace VM exit and
the like.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c | 16 +---
 arch/x86/kvm/svm/svm.c  | 25 +
 2 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 149df26e17462..4bf0f00f13c12 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -1704,7 +1704,7 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, 
int cpu, bool r)
 
 void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
-   u64 entry;
+   u64 old_entry, new_entry;
int h_physical_id = kvm_cpu_get_apicid(cpu);
struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -1723,14 +1723,16 @@ void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (kvm_vcpu_is_blocking(vcpu))
return;
 
-   entry = READ_ONCE(*(svm->avic_physical_id_cache));
-   WARN_ON(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
+   old_entry = READ_ONCE(*(svm->avic_physical_id_cache));
+   new_entry = old_entry;
 
-   entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
-   entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
-   entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+   new_entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
+   new_entry |= (h_physical_id & 
AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
+   new_entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+
+   if (old_entry != new_entry)
+   WRITE_ONCE(*(svm->avic_physical_id_cache), new_entry);
 
-   WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
 }
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b31bab832360e..099329711ad13 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -191,6 +191,10 @@ module_param(avic, bool, 0444);
 static bool force_avic;
 module_param_unsafe(force_avic, bool, 0444);
 
+static bool avic_doorbell_strict = true;
+module_param(avic_doorbell_strict, bool, 0444);
+
+
 bool __read_mostly dump_invalid_vmcb;
 module_param(dump_invalid_vmcb, bool, 0644);
 
@@ -1402,10 +1406,23 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int 
cpu)
 
 static void svm_vcpu_put(struct kvm_vcpu *vcpu)
 {
-   if (kvm_vcpu_apicv_active(vcpu))
-   __avic_vcpu_put(vcpu);
-
-   __nested_avic_put(vcpu);
+   /*
+* Forbid this vCPU's peers to send doorbell messages.
+* Unless non strict doorbell mode is used.
+*
+* In this mode, doorbell messages are forbidden only when a vCPU
+* blocks, since for correctness only in this case it is needed
+* to intercept an IPI to wake up a vCPU.
+*
+* However this reduces the isolation of the guest since flood of
+* spurious doorbell messages can slow a CPU running another task
+* while this vCPU is scheduled out.
+*/
+   if (avic_doorbell_strict) {
+   if (kvm_vcpu_apicv_active(vcpu))
+   __avic_vcpu_put(vcpu);
+   __nested_avic_put(vcpu);
+   }
 
svm_prepare_host_switch(vcpu);
 
-- 
2.26.3



[RFC PATCH v3 17/19] KVM: x86: nSVM: implement nested AVIC doorbell emulation

2022-04-27 Thread Maxim Levitsky
This patch implements the doorbell MSR emulation
for the nested AVIC.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c | 49 +
 arch/x86/kvm/svm/svm.c  |  2 ++
 arch/x86/kvm/svm/svm.h  |  1 +
 3 files changed, 52 insertions(+)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index e8c53fd77f0b1..149df26e17462 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -1165,6 +1165,55 @@ unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct 
kvm_vcpu *vcpu)
return 0;
 }
 
+int avic_emulate_doorbell_write(struct kvm_vcpu *vcpu, u64 data)
+{
+   int source_l1_apicid = vcpu->vcpu_id;
+   int target_l1_apicid = data & AVIC_DOORBELL_PHYSICAL_ID_MASK;
+   bool target_running, target_nested;
+   struct kvm_vcpu *target;
+   struct vcpu_svm *svm = to_svm(vcpu);
+
+   if (!svm->avic_enabled || (data & ~AVIC_DOORBELL_PHYSICAL_ID_MASK))
+   return 1;
+
+   target = avic_vcpu_by_l1_apicid(vcpu->kvm, target_l1_apicid);
+   if (!target)
+   /* Guest bug: targeting invalid APIC ID. */
+   return 0;
+
+   target_running = READ_ONCE(target->mode) == IN_GUEST_MODE;
+   target_nested = is_guest_mode(target);
+
+   trace_kvm_avic_nested_doorbell(source_l1_apicid, target_l1_apicid,
+  target_nested, target_running);
+
+   /*
+* Target is not in the nested mode, thus the doorbell doesn't affect 
it.
+* If it just became nested after is_guest_mode was checked,
+* it means that it just processed AVIC state and KVM doesn't need
+* to send it another doorbell.
+*/
+   if (!target_nested)
+   return 0;
+
+   /*
+* If the target vCPU is in guest mode, kick the real doorbell.
+* Otherwise KVM needs to try to wake it up if it was sleeping.
+*
+* If the target is no longer in guest mode (it just exited),
+* it will either halt, noticing the pending IRR bits beforehand and
+* cancelling the halt, or it will enter guest mode again and notice
+* the IRR bits as well.
+*/
+   if (target_running)
+   wrmsr(MSR_AMD64_SVM_AVIC_DOORBELL,
+ kvm_cpu_get_apicid(READ_ONCE(target->cpu)), 0);
+   else
+   kvm_vcpu_wake_up(target);
+
+   return 0;
+}
+
 static u32 *avic_get_logical_id_entry(struct kvm_vcpu *vcpu, u32 ldr, bool 
flat)
 {
struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d96a73931d1e5..b31bab832360e 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2772,6 +2772,8 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr)
u32 ecx = msr->index;
u64 data = msr->data;
switch (ecx) {
+   case MSR_AMD64_SVM_AVIC_DOORBELL:
+   return avic_emulate_doorbell_write(vcpu, data);
case MSR_AMD64_TSC_RATIO:
 
if (!svm->tsc_scaling_enabled) {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 93fd9d6f5fd85..14e2c5c451cad 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -714,6 +714,7 @@ unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct 
kvm_vcpu *vcpu);
 void avic_reload_apic_pages(struct kvm_vcpu *vcpu);
 void avic_free_nested(struct kvm_vcpu *vcpu);
 bool avic_nested_has_interrupt(struct kvm_vcpu *vcpu);
+int avic_emulate_doorbell_write(struct kvm_vcpu *vcpu, u64 data);
 
 struct avic_physid_table *
 avic_physid_shadow_table_get(struct kvm_vcpu *vcpu, gfn_t gfn);
-- 
2.26.3



[RFC PATCH v3 16/19] KVM: x86: nSVM: implement support for nested AVIC vmexits

2022-04-27 Thread Maxim Levitsky
* SVM_EXIT_AVIC_UNACCELERATED_ACCESS is always forwarded to the L1

* SVM_EXIT_AVIC_INCOMPLETE_IPI is hidden from the guest if:

   - is_running was false in the shadow physid page because L1's vCPU
     was scheduled out - in this case, the vCPU is woken up,
     and it will process the nested AVIC on the next VM entry

  - invalid physical address of avic backing page was present
in the guest's physid page, which KVM translates to
valid physical address of a dummy page and is_running=false.

If this condition happens,
the AVIC_IPI_FAILURE_INVALID_BACKING_PAGE VM exit is injected to
the nested hypervisor.

* Note that it is possible for an SVM_EXIT_AVIC_INCOMPLETE_IPI
  VM exit to happen due to both host- and guest-related reasons
  at the same time:

  For example, if a broadcast IPI was attempted and some shadow
  physid entries had 'is_running=false' set by the guest,
  and some had it set to false due to L1 vCPUs being scheduled out.

  To support this case, all relevant entries of the guest's physical
  and logical id tables are checked, and both the host-side actions
  (e.g. wakeup) and the guest VM exit reflection are performed.
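
For readability, here is a condensed sketch of how the physical-destination
part of that dual handling fits together (the wrapper name below is my
assumption; avic_kick_target_vcpu_nested_physical() is the helper added
in the diff that follows):

static bool nested_avic_must_reflect_ipi(struct vcpu_svm *svm,
                                         int target_l2_apic_id,
                                         bool *invalid_page)
{
        int index = -1;

        /*
         * Wakes up the L1 vCPU when the entry was not running only on
         * the host side; sets 'index' when the *guest* table had
         * is_running == false (or an invalid backing page).
         */
        avic_kick_target_vcpu_nested_physical(svm, target_l2_apic_id,
                                              &index, invalid_page);

        /* index == -1: handled entirely in L0, hide the exit from L1. */
        return index != -1;
}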

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c   | 204 +-
 arch/x86/kvm/svm/nested.c |  14 +++
 2 files changed, 216 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index f13ca1e7b2845..e8c53fd77f0b1 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -917,6 +917,164 @@ static void avic_kick_target_vcpus(struct kvm *kvm, 
struct kvm_lapic *source,
}
 }
 
+static void
+avic_kick_target_vcpu_nested_physical(struct vcpu_svm *svm,
+ int target_l2_apic_id,
+ int *index,
+ bool *invalid_page)
+{
+   u64 gentry, sentry;
+   int target_l1_apicid;
+   struct avic_physid_table *t = svm->nested.l2_physical_id_table;
+
+   if (WARN_ON_ONCE(!t))
+   return;
+
+   /*
+* This shouldn't normally happen because this condition
+* should cause AVIC_IPI_FAILURE_INVALID_TARGET vmexit,
+* however the guest can change the page and trigger this.
+*/
+   if (target_l2_apic_id >= t->nentries)
+   return;
+
+   gentry = t->entries[target_l2_apic_id].gentry;
+   sentry = *t->entries[target_l2_apic_id].sentry;
+
+   /* Same reasoning as above  */
+   if (!(gentry & AVIC_PHYSICAL_ID_ENTRY_VALID_MASK))
+   return;
+
+   /*
+* This races against the guest updating is_running bit.
+*
+* Race itself happens on real hardware as well, and the guest
+* must use the correct means to avoid it.
+*
+* AVIC hardware already set IRR and should have done memory
+* barrier, and then found out that is_running is false
+* in shadow physid table.
+*
+* We are doing another is_running check (in the guest physid table),
+* completing it, thus don't need additional memory barrier.
+*/
+
+   target_l1_apicid = physid_entry_get_apicid(gentry);
+
+   if (target_l1_apicid == -1) {
+
+   /* is_running is false, need to vmexit to the guest */
+   if (*index == -1) {
+   u64 backing_page_phys = 
physid_entry_get_backing_table(sentry);
+
+   *index = target_l2_apic_id;
+   if (backing_page_phys == t->dummy_page_hpa)
+   *invalid_page = true;
+   }
+   } else {
+   /* Wake up the target vCPU and hide the VM exit from the guest 
*/
+   struct kvm_vcpu *target = avic_vcpu_by_l1_apicid(svm->vcpu.kvm, 
target_l1_apicid);
+
+   if (target && target != &svm->vcpu)
+   kvm_vcpu_wake_up(target);
+   }
+
+   trace_kvm_avic_nested_kick_vcpu(svm->vcpu.vcpu_id,
+   target_l2_apic_id,
+   target_l1_apicid);
+}
+
+static void
+avic_kick_target_vcpus_nested_logical(struct vcpu_svm *svm, unsigned long dest,
+ int *index, bool *invalid_page)
+{
+   int logical_id;
+   u8 cluster = 0;
+   u64 *logical_id_table = (u64 *)svm->nested.l2_logical_id_table.hva;
+   int physical_index = -1;
+
+   if (WARN_ON_ONCE(!logical_id_table))
+   return;
+
+   if (nested_avic_get_reg(&svm->vcpu, APIC_DFR) == APIC_DFR_CLUSTER) {
+   if (dest >= 0x40)
+   return;
+   cluster = dest & 0x3C;
+   dest &= 0x3;
+   }
+
+   for_each_set_bit(logical_id, &dest, 8) {
+   int logical_index = cluster | logical_id;
+   u64 lo

[RFC PATCH v3 15/19] KVM: x86: nSVM: add code to reload AVIC physid table when it is invalidated

2022-04-27 Thread Maxim Levitsky
An AVIC table invalidation is not supposed to happen often, and can
only happen when the guest does something suspicious such as:

  - It places a physid page in a memslot that is enabled/disabled,
    and memslot flushing happens.

  - It tries to update APIC backing page addresses - the guest has no
    reason to touch these, and doing so on real hardware will likely
    lead to unpredictable results.

  - It writes to reserved bits of a tracked page.


  - It write-floods a physid table while no vCPU is using it
    (at that point the page has likely been reused to contain something else).


All of the above causes a KVM_REQ_APIC_PAGE_RELOAD request to be raised
on all vCPUs, which kicks them out of guest mode; the first vCPU to
reach the handler then re-creates the entries of the physid page, and
the others notice this and do nothing.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c | 13 +
 arch/x86/kvm/svm/svm.c  |  1 +
 arch/x86/kvm/svm/svm.h  |  1 +
 3 files changed, 15 insertions(+)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index e6ec525a88625..f13ca1e7b2845 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -379,6 +379,7 @@ static void avic_physid_shadow_table_invalidate(struct kvm 
*kvm,
struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
 
lockdep_assert_held(&kvm_svm->avic.tables_lock);
+   kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
avic_physid_shadow_table_erase(kvm, t);
 }
 
@@ -1638,3 +1639,15 @@ bool avic_nested_has_interrupt(struct kvm_vcpu *vcpu)
return true;
return false;
 }
+
+void avic_reload_apic_pages(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_svm *vcpu_svm = to_svm(vcpu);
+   struct avic_physid_table *t = vcpu_svm->nested.l2_physical_id_table;
+
+   int nentries = vcpu_svm->nested.ctl.avic_physical_id &
+   AVIC_PHYSICAL_ID_TABLE_SIZE_MASK;
+
+   if (t && is_guest_mode(vcpu) && nested_avic_in_use(vcpu))
+   avic_physid_shadow_table_sync(vcpu, t, nentries);
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index a39bb0b27a51d..d96a73931d1e5 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4677,6 +4677,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.enable_nmi_window = svm_enable_nmi_window,
.enable_irq_window = svm_enable_irq_window,
.update_cr8_intercept = svm_update_cr8_intercept,
+   .reload_apic_pages = avic_reload_apic_pages,
.refresh_apicv_exec_ctrl = avic_refresh_apicv_exec_ctrl,
.check_apicv_inhibit_reasons = avic_check_apicv_inhibit_reasons,
.apicv_post_state_restore = avic_apicv_post_state_restore,
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 17fcc09cf4be1..93fd9d6f5fd85 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -711,6 +711,7 @@ void avic_vcpu_blocking(struct kvm_vcpu *vcpu);
 void avic_vcpu_unblocking(struct kvm_vcpu *vcpu);
 void avic_ring_doorbell(struct kvm_vcpu *vcpu);
 unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct kvm_vcpu *vcpu);
+void avic_reload_apic_pages(struct kvm_vcpu *vcpu);
 void avic_free_nested(struct kvm_vcpu *vcpu);
 bool avic_nested_has_interrupt(struct kvm_vcpu *vcpu);
 
-- 
2.26.3



[RFC PATCH v3 14/19] KVM: x86: rename .set_apic_access_page_addr to reload_apic_access_page

2022-04-27 Thread Maxim Levitsky
This will be used on SVM to reload the shadow page of the AVIC physid table.

No functional change intended

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm-x86-ops.h | 2 +-
 arch/x86/include/asm/kvm_host.h| 3 +--
 arch/x86/kvm/vmx/vmx.c | 8 
 arch/x86/kvm/x86.c | 6 +++---
 4 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h 
b/arch/x86/include/asm/kvm-x86-ops.h
index 96e4e9842dfc6..997edb7453ac2 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -82,7 +82,7 @@ KVM_X86_OP_OPTIONAL(hwapic_isr_update)
 KVM_X86_OP_OPTIONAL_RET0(guest_apic_has_interrupt)
 KVM_X86_OP_OPTIONAL(load_eoi_exitmap)
 KVM_X86_OP_OPTIONAL(set_virtual_apic_mode)
-KVM_X86_OP_OPTIONAL(set_apic_access_page_addr)
+KVM_X86_OP_OPTIONAL(reload_apic_pages)
 KVM_X86_OP(deliver_interrupt)
 KVM_X86_OP_OPTIONAL(sync_pir_to_irr)
 KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fc7df778a3d71..52fa04c3108b1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1436,7 +1436,7 @@ struct kvm_x86_ops {
bool (*guest_apic_has_interrupt)(struct kvm_vcpu *vcpu);
void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
void (*set_virtual_apic_mode)(struct kvm_vcpu *vcpu);
-   void (*set_apic_access_page_addr)(struct kvm_vcpu *vcpu);
+   void (*reload_apic_pages)(struct kvm_vcpu *vcpu);
void (*deliver_interrupt)(struct kvm_lapic *apic, int delivery_mode,
  int trig_mode, int vector);
int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
@@ -1909,7 +1909,6 @@ int kvm_cpu_has_extint(struct kvm_vcpu *v);
 int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
 int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
 void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
-
 int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
unsigned long ipi_bitmap_high, u32 min,
unsigned long icr, int op_64_bit);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index cf8581978bce3..7defd31703c61 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6339,7 +6339,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
vmx_update_msr_bitmap_x2apic(vcpu);
 }
 
-static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+static void vmx_reload_apic_access_page(struct kvm_vcpu *vcpu)
 {
struct page *page;
 
@@ -,7 +,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.enable_irq_window = vmx_enable_irq_window,
.update_cr8_intercept = vmx_update_cr8_intercept,
.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
-   .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
+   .reload_apic_pages = vmx_reload_apic_access_page,
.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
.load_eoi_exitmap = vmx_load_eoi_exitmap,
.apicv_post_state_restore = vmx_apicv_post_state_restore,
@@ -7940,12 +7940,12 @@ static __init int hardware_setup(void)
enable_vnmi = 0;
 
/*
-* set_apic_access_page_addr() is used to reload apic access
+* kvm_vcpu_reload_apic_pages() is used to reload apic access
 * page upon invalidation.  No need to do anything if not
 * using the APIC_ACCESS_ADDR VMCS field.
 */
if (!flexpriority_enabled)
-   vmx_x86_ops.set_apic_access_page_addr = NULL;
+   vmx_x86_ops.reload_apic_pages = NULL;
 
if (!cpu_has_vmx_tpr_shadow())
vmx_x86_ops.update_cr8_intercept = NULL;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d2f73ce87a1e3..ad744ab99734c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9949,12 +9949,12 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm 
*kvm,
kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
 }
 
-static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
+static void kvm_vcpu_reload_apic_pages(struct kvm_vcpu *vcpu)
 {
if (!lapic_in_kernel(vcpu))
return;
 
-   static_call_cond(kvm_x86_set_apic_access_page_addr)(vcpu);
+   static_call_cond(kvm_x86_reload_apic_pages)(vcpu);
 }
 
 void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
@@ -10071,7 +10071,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (kvm_check_request(KVM_REQ_LOAD_EOI_EXITMAP, vcpu))
vcpu_load_eoi_exitmap(vcpu);
if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
-   kvm_vcpu_reload_apic_access_page(vcpu);
+   kvm_vcpu_reload_apic_pages(vcpu);
if (kvm_check_request(KVM_REQ_HV_CRASH, vcpu)) {
vcpu->run->exit_

[RFC PATCH v3 13/19] KVM: x86: nSVM: wire nested AVIC to nested guest entry/exit

2022-04-27 Thread Maxim Levitsky
  * Passthrough guest's avic pages that can be passed through
 - logical id table
 - avic backing page

  * Passthrough AVIC's mmio range
 - nested guest is responsible for marking it RW
   in its NPT tables.

  * Write track the physical id page
 - all peers' AVIC backing pages are pinned
   as long as the shadow table is not invalidated/freed.

  * Cache guest AVIC settings.

  * Add SDM mandated changes to emulated VM enter/exit.

Note that nested AVIC still can't be enabled, thus this
code has no effect yet.
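
For orientation, a very rough sketch of what the vmentry-side table
setup described above amounts to (names, ordering and error handling
are simplified assumptions, not the actual hunk from this patch):

static int nested_avic_map_tables(struct vcpu_svm *svm)
{
        struct kvm_vcpu *vcpu = &svm->vcpu;
        gfn_t physid_gfn;

        /* Map L2's APIC backing page and logical id table (passed through). */
        if (kvm_vcpu_map(vcpu, gpa_to_gfn(svm->nested.ctl.avic_backing_page),
                         &svm->nested.l2_apic_access_page))
                return -EINVAL;

        if (kvm_vcpu_map(vcpu, gpa_to_gfn(svm->nested.ctl.avic_logical_id),
                         &svm->nested.l2_logical_id_table))
                return -EINVAL;

        /* The physical id table is shadowed and write tracked instead. */
        physid_gfn = gpa_to_gfn(svm->nested.ctl.avic_physical_id &
                                ~AVIC_PHYSICAL_ID_TABLE_SIZE_MASK);

        svm->nested.l2_physical_id_table =
                avic_physid_shadow_table_get(vcpu, physid_gfn);

        return svm->nested.l2_physical_id_table ? 0 : -EINVAL;
}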

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c   |  51 ++-
 arch/x86/kvm/svm/nested.c | 127 +-
 arch/x86/kvm/svm/svm.c|   2 +
 arch/x86/kvm/svm/svm.h|  24 +++
 4 files changed, 199 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 34da9fabd5194..e6ec525a88625 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -59,6 +59,18 @@ static inline struct kvm_vcpu *avic_vcpu_by_l1_apicid(struct 
kvm *kvm,
return kvm_get_vcpu_by_id(kvm, l1_apicid);
 }
 
+static u32 nested_avic_get_reg(struct kvm_vcpu *vcpu, int reg_off)
+{
+   struct vcpu_svm *svm = to_svm(vcpu);
+
+   void *nested_apic_regs = svm->nested.l2_apic_access_page.hva;
+
+   if (WARN_ON_ONCE(!nested_apic_regs))
+   return 0;
+
+   return *((u32 *) (nested_apic_regs + reg_off));
+}
+
 static void avic_physid_shadow_entry_set_vcpu(struct kvm *kvm,
  struct avic_physid_table *t,
  int n,
@@ -531,6 +543,20 @@ static void avic_physid_shadow_table_flush_memslot(struct 
kvm *kvm,
mutex_unlock(&kvm_svm->avic.tables_lock);
 }
 
+void avic_free_nested(struct kvm_vcpu *vcpu)
+{
+   struct avic_physid_table *t;
+   struct vcpu_svm *svm = to_svm(vcpu);
+
+   t = svm->nested.l2_physical_id_table;
+   if (t) {
+   avic_physid_shadow_table_put(vcpu->kvm, t);
+   svm->nested.l2_physical_id_table = NULL;
+   }
+
+   kvm_vcpu_unmap(vcpu, &svm->nested.l2_apic_access_page, true);
+   kvm_vcpu_unmap(vcpu, &svm->nested.l2_logical_id_table, true);
+}
 
 /*
  * This is a wrapper of struct amd_iommu_ir_data.
@@ -586,10 +612,18 @@ void avic_vm_destroy(struct kvm *kvm)
 {
unsigned long flags;
struct kvm_svm_avic *avic = &to_kvm_svm(kvm)->avic;
+   unsigned long i;
+   struct kvm_vcpu *vcpu;
 
if (!enable_apicv)
return;
 
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   vcpu_load(vcpu);
+   avic_free_nested(vcpu);
+   vcpu_put(vcpu);
+   }
+
if (avic->logical_id_table_page)
__free_page(avic->logical_id_table_page);
if (avic->physical_id_table_page)
@@ -1501,7 +1535,7 @@ void __nested_avic_load(struct kvm_vcpu *vcpu, int cpu)
if (kvm_vcpu_is_blocking(vcpu))
return;
 
-   if (svm->nested.initialized)
+   if (svm->nested.initialized && svm->avic_enabled)
avic_update_peer_physid_entries(vcpu, cpu);
 }
 
@@ -1511,7 +1545,7 @@ void __nested_avic_put(struct kvm_vcpu *vcpu)
 
lockdep_assert_preemption_disabled();
 
-   if (svm->nested.initialized)
+   if (svm->nested.initialized && svm->avic_enabled)
avic_update_peer_physid_entries(vcpu, -1);
 }
 
@@ -1591,3 +1625,16 @@ void avic_vcpu_unblocking(struct kvm_vcpu *vcpu)
 
nested_avic_load(vcpu);
 }
+
+bool avic_nested_has_interrupt(struct kvm_vcpu *vcpu)
+{
+   int off;
+
+   if (!nested_avic_in_use(vcpu))
+   return false;
+
+   for (off = 0x10; off < 0x80; off += 0x10)
+   if (nested_avic_get_reg(vcpu, APIC_IRR + off))
+   return true;
+   return false;
+}
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index bed5e1692cef0..eb5e9b600e052 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -387,6 +387,14 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu 
*vcpu,
memcpy(to->reserved_sw, from->reserved_sw,
   sizeof(struct hv_enlightenments));
}
+
+   /* copy avic related settings only when it is enabled */
+   if (from->int_ctl & AVIC_ENABLE_MASK) {
+   to->avic_vapic_bar  = from->avic_vapic_bar;
+   to->avic_backing_page   = from->avic_backing_page;
+   to->avic_logical_id = from->avic_logical_id;
+   to->avic_physical_id= from->avic_physical_id;
+   }
 }
 
 void nested_copy_vmcb_control_to_cache(struct vcpu_svm *svm,
@@ -539,6 +547,79 @@ void nested_vmcb02_compute_g_pat(struct vcpu_svm *svm)
svm->nested.vmcb02.

[RFC PATCH v3 12/19] KVM: x86: nSVM: make nested AVIC physid write tracking be aware of the host scheduling

2022-04-27 Thread Maxim Levitsky
For each vCPU:
  - store a linked list of all shadow physical id entries
    which reference it.

  - update those entries when this vCPU is scheduled
    in/out.

  - update this list when physid tables are modified by
    other means (guest write and/or table sync).

To avoid races with vCPU scheduling, use a spinlock.
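
To make this concrete, this is roughly what the scheduling hooks that
consume the list end up looking like (reconstructed from how later
patches in the series call avic_update_peer_physid_entries(); treat it
as a sketch rather than the literal hunk from this patch):

void __nested_avic_load(struct kvm_vcpu *vcpu, int cpu)
{
        struct vcpu_svm *svm = to_svm(vcpu);

        lockdep_assert_preemption_disabled();

        /* A blocking vCPU must keep is_running false so IPIs wake it up. */
        if (kvm_vcpu_is_blocking(vcpu))
                return;

        /* Point every shadow entry that references this vCPU at 'cpu'. */
        if (svm->nested.initialized)
                avic_update_peer_physid_entries(vcpu, cpu);
}

void __nested_avic_put(struct kvm_vcpu *vcpu)
{
        struct vcpu_svm *svm = to_svm(vcpu);

        lockdep_assert_preemption_disabled();

        /* Mark this vCPU as not running in every entry that references it. */
        if (svm->nested.initialized)
                avic_update_peer_physid_entries(vcpu, -1);
}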

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c | 113 +---
 arch/x86/kvm/svm/svm.c  |   7 +++
 arch/x86/kvm/svm/svm.h  |  10 
 3 files changed, 122 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index f462b7e48e3ca..34da9fabd5194 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -67,8 +67,12 @@ static void avic_physid_shadow_entry_set_vcpu(struct kvm 
*kvm,
struct avic_physid_entry_descr *e = &t->entries[n];
u64 sentry = READ_ONCE(*e->sentry);
u64 old_sentry = sentry;
+   struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
struct kvm_vcpu *new_vcpu = NULL;
int l0_apicid = -1;
+   unsigned long flags;
+
+   raw_spin_lock_irqsave(&kvm_svm->avic.table_entries_lock, flags);
 
WARN_ON(!test_bit(n, t->valid_entires));
 
@@ -79,6 +83,9 @@ static void avic_physid_shadow_entry_set_vcpu(struct kvm *kvm,
new_vcpu = avic_vcpu_by_l1_apicid(kvm, new_l1_apicid);
 
if (new_vcpu)
+   list_add_tail(&e->link, 
&to_svm(new_vcpu)->nested.physid_ref_entries);
+
+   if (new_vcpu && to_svm(new_vcpu)->nested_avic_active)
l0_apicid = kvm_cpu_get_apicid(new_vcpu->cpu);
 
physid_entry_set_apicid(&sentry, l0_apicid);
@@ -87,6 +94,8 @@ static void avic_physid_shadow_entry_set_vcpu(struct kvm *kvm,
 
if (sentry != old_sentry)
WRITE_ONCE(*e->sentry, sentry);
+
+   raw_spin_unlock_irqrestore(&kvm_svm->avic.table_entries_lock, flags);
 }
 
 static void avic_physid_shadow_entry_create(struct kvm *kvm,
@@ -131,7 +140,11 @@ static void avic_physid_shadow_entry_remove(struct kvm 
*kvm,
   int n)
 {
struct avic_physid_entry_descr *e = &t->entries[n];
+   struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
hpa_t backing_page_hpa;
+   unsigned long flags;
+
+   raw_spin_lock_irqsave(&kvm_svm->avic.table_entries_lock, flags);
 
if (!test_and_clear_bit(n, t->valid_entires))
WARN_ON(1);
@@ -147,8 +160,49 @@ static void avic_physid_shadow_entry_remove(struct kvm 
*kvm,
 
e->gentry = 0;
*e->sentry = 0;
+
+   raw_spin_unlock_irqrestore(&kvm_svm->avic.table_entries_lock, flags);
 }
 
+static void avic_update_peer_physid_entries(struct kvm_vcpu *vcpu, int cpu)
+{
+   /*
+* Update all shadow physid tables which contain entries
+* which reference this vCPU with its new physical location
+*/
+   struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
+   struct vcpu_svm *vcpu_svm = to_svm(vcpu);
+   struct avic_physid_entry_descr *e;
+   int updated_nentries = 0;
+   int l0_apicid = -1;
+   unsigned long flags;
+   bool new_active = cpu != -1;
+
+   if (cpu != -1)
+   l0_apicid = kvm_cpu_get_apicid(cpu);
+
+   raw_spin_lock_irqsave(&kvm_svm->avic.table_entries_lock, flags);
+
+   list_for_each_entry(e, &vcpu_svm->nested.physid_ref_entries, link) {
+   u64 sentry = READ_ONCE(*e->sentry);
+   u64 old_sentry = sentry;
+
+   physid_entry_set_apicid(&sentry, l0_apicid);
+
+   if (sentry != old_sentry) {
+   updated_nentries++;
+   WRITE_ONCE(*e->sentry, sentry);
+   }
+   }
+
+   if (updated_nentries)
+   trace_kvm_avic_physid_update_vcpu_host(vcpu->vcpu_id,
+  l0_apicid, 
updated_nentries);
+
+   vcpu_svm->nested_avic_active = new_active;
+
+   raw_spin_unlock_irqrestore(&kvm_svm->avic.table_entries_lock, flags);
+}
 
 static bool
 avic_physid_shadow_table_setup_write_tracking(struct kvm *kvm,
@@ -603,6 +657,7 @@ int avic_vm_init(struct kvm *kvm)
hash_add(svm_vm_data_hash, &avic->hnode, avic->vm_id);
spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);
 
+   raw_spin_lock_init(&avic->table_entries_lock);
mutex_init(&avic->tables_lock);
INIT_LIST_HEAD(&avic->physid_tables);
 
@@ -1428,9 +1483,51 @@ static void avic_vcpu_load(struct kvm_vcpu *vcpu)
 static void avic_vcpu_put(struct kvm_vcpu *vcpu)
 {
preempt_disable();
-
__avic_vcpu_put(vcpu);
+   preempt_enable();
+}
+
 
+void __nested_avic_load(struct kvm_vcpu *vcpu, int cpu)
+{
+   struct vcpu_svm *svm = to_svm(vcpu);
+
+   lockdep_assert_preemption_disabled(

[RFC PATCH v3 11/19] KVM: x86: nSVM: implement shadowing of AVIC's physical id table

2022-04-27 Thread Maxim Levitsky
Implement the shadow physical id table and its write tracking code,
which will soon be used for the nested AVIC.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c | 461 +++-
 arch/x86/kvm/svm/svm.h  |  71 +++
 2 files changed, 524 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index e5cbbb97fbab6..f462b7e48e3ca 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -51,6 +51,433 @@ static u32 next_vm_id = 0;
 static bool next_vm_id_wrapped = 0;
 static DEFINE_SPINLOCK(svm_vm_data_hash_lock);
 
+
+static inline struct kvm_vcpu *avic_vcpu_by_l1_apicid(struct kvm *kvm,
+ int l1_apicid)
+{
+   WARN_ON(l1_apicid == -1);
+   return kvm_get_vcpu_by_id(kvm, l1_apicid);
+}
+
+static void avic_physid_shadow_entry_set_vcpu(struct kvm *kvm,
+ struct avic_physid_table *t,
+ int n,
+ int new_l1_apicid)
+{
+   struct avic_physid_entry_descr *e = &t->entries[n];
+   u64 sentry = READ_ONCE(*e->sentry);
+   u64 old_sentry = sentry;
+   struct kvm_vcpu *new_vcpu = NULL;
+   int l0_apicid = -1;
+
+   WARN_ON(!test_bit(n, t->valid_entires));
+
+   if (!list_empty(&e->link))
+   list_del_init(&e->link);
+
+   if (new_l1_apicid != -1)
+   new_vcpu = avic_vcpu_by_l1_apicid(kvm, new_l1_apicid);
+
+   if (new_vcpu)
+   l0_apicid = kvm_cpu_get_apicid(new_vcpu->cpu);
+
+   physid_entry_set_apicid(&sentry, l0_apicid);
+
+   trace_kvm_avic_physid_update_vcpu_guest(new_l1_apicid, l0_apicid);
+
+   if (sentry != old_sentry)
+   WRITE_ONCE(*e->sentry, sentry);
+}
+
+static void avic_physid_shadow_entry_create(struct kvm *kvm,
+   struct avic_physid_table *t,
+   int n,
+   u64 gentry)
+{
+   struct avic_physid_entry_descr *e = &t->entries[n];
+   struct page *backing_page;
+   u64 backing_page_gpa = physid_entry_get_backing_table(gentry);
+   int l1_apic_id = physid_entry_get_apicid(gentry);
+   hpa_t backing_page_hpa;
+   u64 sentry = 0;
+
+
+   if (backing_page_gpa == INVALID_BACKING_PAGE)
+   return;
+
+   /* Pin the APIC backing page */
+   backing_page = gfn_to_page(kvm, gpa_to_gfn(backing_page_gpa));
+
+   if (is_error_page(backing_page))
+   /* Invalid GPA in the guest entry - point to a dummy entry */
+   backing_page_hpa = t->dummy_page_hpa;
+   else
+   backing_page_hpa = page_to_phys(backing_page);
+
+   physid_entry_set_backing_table(&sentry, backing_page_hpa);
+
+   e->gentry = gentry;
+   *e->sentry = sentry;
+
+   if (test_and_set_bit(n, t->valid_entires))
+   WARN_ON(1);
+
+   if (backing_page_hpa != t->dummy_page_hpa)
+   avic_physid_shadow_entry_set_vcpu(kvm, t, n, l1_apic_id);
+}
+
+static void avic_physid_shadow_entry_remove(struct kvm *kvm,
+  struct avic_physid_table *t,
+  int n)
+{
+   struct avic_physid_entry_descr *e = &t->entries[n];
+   hpa_t backing_page_hpa;
+
+   if (!test_and_clear_bit(n, t->valid_entires))
+   WARN_ON(1);
+
+   /* Release the APIC backing page */
+   backing_page_hpa = physid_entry_get_backing_table(*e->sentry);
+
+   if (backing_page_hpa != t->dummy_page_hpa)
+   kvm_release_pfn_dirty(backing_page_hpa >> PAGE_SHIFT);
+
+   if (!list_empty(&e->link))
+   list_del_init(&e->link);
+
+   e->gentry = 0;
+   *e->sentry = 0;
+}
+
+
+static bool
+avic_physid_shadow_table_setup_write_tracking(struct kvm *kvm,
+ struct avic_physid_table *t,
+ bool enable)
+{
+   struct kvm_memory_slot *slot;
+
+   write_lock(&kvm->mmu_lock);
+   slot = gfn_to_memslot(kvm, t->gfn);
+   if (!slot) {
+   write_unlock(&kvm->mmu_lock);
+   return false;
+   }
+
+   if (enable)
+   kvm_slot_page_track_add_page(kvm, slot, t->gfn, 
KVM_PAGE_TRACK_WRITE);
+   else
+   kvm_slot_page_track_remove_page(kvm, slot, t->gfn, 
KVM_PAGE_TRACK_WRITE);
+   write_unlock(&kvm->mmu_lock);
+   return true;
+}
+
+static void
+avic_physid_shadow_table_erase(struct kvm *kvm, struct avic_physid_table *t)
+{
+   int i;
+
+   if (!t->nentries)
+   return;
+
+   avic_physid_shadow_table_setup_write_tracking(kvm, t, fal

[RFC PATCH v3 10/19] KVM: x86: nSVM: implement AVIC's physid/logid table access helpers

2022-04-27 Thread Maxim Levitsky
This implements a few helpers for manipulating the AVIC's
physical and logical id table entries.
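
As a quick illustration (not part of the patch; the function and
parameter names are made up), this is how the helpers below compose
when a shadow entry is built from a guest entry:

static u64 build_shadow_physid_entry(u64 gentry, hpa_t backing_page_hpa,
                                     int host_apic_id)
{
        u64 sentry = 0;

        /* The guest entry must be valid and point at a backing page. */
        if (physid_entry_get_backing_table(gentry) == INVALID_BACKING_PAGE)
                return 0;

        /* The shadow entry points at the pinned host backing page... */
        physid_entry_set_backing_table(&sentry, backing_page_hpa);

        /* ...and at the physical CPU of the target vCPU (-1 = not running). */
        physid_entry_set_apicid(&sentry, host_apic_id);

        return sentry;
}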

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/svm.h | 45 ++
 1 file changed, 45 insertions(+)

diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 6fcb164a6ee4a..dfca4c06e2071 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -628,6 +628,51 @@ void avic_vcpu_unblocking(struct kvm_vcpu *vcpu);
 void avic_ring_doorbell(struct kvm_vcpu *vcpu);
 unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct kvm_vcpu *vcpu);
 
+#define INVALID_BACKING_PAGE   (~(u64)0)
+
+static inline u64 physid_entry_get_backing_table(u64 entry)
+{
+   if (!(entry & AVIC_PHYSICAL_ID_ENTRY_VALID_MASK))
+   return INVALID_BACKING_PAGE;
+   return entry & AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK;
+}
+
+static inline int physid_entry_get_apicid(u64 entry)
+{
+   if (!(entry & AVIC_PHYSICAL_ID_ENTRY_VALID_MASK))
+   return -1;
+   if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
+   return -1;
+
+   return entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
+}
+
+static inline int logid_get_physid(u64 entry)
+{
+   if (!(entry & AVIC_LOGICAL_ID_ENTRY_VALID_BIT))
+   return -1;
+   return entry & AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK;
+}
+
+static inline void physid_entry_set_backing_table(u64 *entry, u64 value)
+{
+   *entry &= ~AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK;
+   *entry |= (AVIC_PHYSICAL_ID_ENTRY_VALID_MASK | value);
+}
+
+static inline void physid_entry_set_apicid(u64 *entry, int value)
+{
+   WARN_ON(!(*entry & AVIC_PHYSICAL_ID_ENTRY_VALID_MASK));
+
+   *entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
+
+   if (value == -1)
+   *entry &= ~(AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
+   else
+   *entry |= (AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK | value);
+}
+
+
 /* sev.c */
 
 #define GHCB_VERSION_MAX   1ULL
-- 
2.26.3



[RFC PATCH v3 09/19] KVM: x86: nSVM: add nested AVIC tracepoints

2022-04-27 Thread Maxim Levitsky
This patch adds a few tracepoints that will be used
to debug/profile the nested AVIC.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/trace.h | 157 ++-
 arch/x86/kvm/x86.c   |  13 
 2 files changed, 169 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index de47625175692..f7ddba5ae06a5 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1385,7 +1385,7 @@ TRACE_EVENT(kvm_apicv_accept_irq,
 );
 
 /*
- * Tracepoint for AMD AVIC
+ * Tracepoints for AMD AVIC
  */
 TRACE_EVENT(kvm_avic_incomplete_ipi,
TP_PROTO(u32 vcpu, u32 icrh, u32 icrl, u32 id, u32 index),
@@ -1479,6 +1479,161 @@ TRACE_EVENT(kvm_avic_kick_vcpu_slowpath,
  __entry->icrh, __entry->icrl, __entry->index)
 );
 
+TRACE_EVENT(kvm_avic_physid_table_alloc,
+   TP_PROTO(u64 gpa),
+   TP_ARGS(gpa),
+
+   TP_STRUCT__entry(
+   __field(u64, gpa)
+   ),
+
+   TP_fast_assign(
+   __entry->gpa = gpa;
+   ),
+
+   TP_printk("table at gpa 0x%llx",
+ __entry->gpa)
+);
+
+
+TRACE_EVENT(kvm_avic_physid_table_free,
+   TP_PROTO(u64 gpa),
+   TP_ARGS(gpa),
+
+   TP_STRUCT__entry(
+   __field(u64, gpa)
+   ),
+
+   TP_fast_assign(
+   __entry->gpa = gpa;
+   ),
+
+   TP_printk("table at gpa 0x%llx",
+ __entry->gpa)
+);
+
+TRACE_EVENT(kvm_avic_physid_table_reload,
+   TP_PROTO(u64 gpa, int nentries, int new_nentires),
+   TP_ARGS(gpa, nentries, new_nentires),
+
+   TP_STRUCT__entry(
+   __field(u64, gpa)
+   __field(int, nentries)
+   __field(int, new_nentires)
+   ),
+
+   TP_fast_assign(
+   __entry->gpa = gpa;
+   __entry->nentries = nentries;
+   __entry->new_nentires = new_nentires;
+   ),
+
+   TP_printk("table at gpa 0x%llx, nentires %d -> %d",
+ __entry->gpa, __entry->nentries, __entry->new_nentires)
+);
+
+TRACE_EVENT(kvm_avic_physid_table_write,
+   TP_PROTO(u64 gpa, int bytes),
+   TP_ARGS(gpa, bytes),
+
+   TP_STRUCT__entry(
+   __field(u64, gpa)
+   __field(int, bytes)
+   ),
+
+   TP_fast_assign(
+   __entry->gpa = gpa;
+   __entry->bytes = bytes;
+   ),
+
+   TP_printk("gpa 0x%llx, write of %d bytes",
+ __entry->gpa, __entry->bytes)
+);
+
+TRACE_EVENT(kvm_avic_physid_update_vcpu_host,
+   TP_PROTO(int vcpu_id, int cpu_id, int n),
+   TP_ARGS(vcpu_id, cpu_id, n),
+
+   TP_STRUCT__entry(
+   __field(int, vcpu_id)
+   __field(int, cpu_id)
+   __field(int, n)
+   ),
+
+   TP_fast_assign(
+   __entry->vcpu_id = vcpu_id;
+   __entry->cpu_id = cpu_id;
+   __entry->n = n;
+   ),
+
+   TP_printk("l1 vcpu %d -> l0 cpu %d (%d entries)",
+ __entry->vcpu_id, __entry->cpu_id, __entry->n)
+);
+
+TRACE_EVENT(kvm_avic_physid_update_vcpu_guest,
+   TP_PROTO(int vcpu_id, int cpu_id),
+   TP_ARGS(vcpu_id, cpu_id),
+
+   TP_STRUCT__entry(
+   __field(int, vcpu_id)
+   __field(int, cpu_id)
+   ),
+
+   TP_fast_assign(
+   __entry->vcpu_id = vcpu_id;
+   __entry->cpu_id = cpu_id;
+   ),
+
+   TP_printk("l1 vcpu %d -> l0 cpu %d",
+ __entry->vcpu_id, __entry->cpu_id)
+);
+
+TRACE_EVENT(kvm_avic_nested_doorbell,
+   TP_PROTO(int source_l1_apicid, int target_l1_apicid, bool 
target_nested,
+   bool target_running),
+   TP_ARGS(source_l1_apicid, target_l1_apicid, target_nested,
+   target_running),
+
+   TP_STRUCT__entry(
+   __field(int, source_l1_apicid)
+   __field(int, target_l1_apicid)
+   __field(bool, target_nested)
+   __field(bool, target_running)
+   ),
+
+   TP_fast_assign(
+   __entry->source_l1_apicid = source_l1_apicid;
+   __entry->target_l1_apicid = target_l1_apicid;
+   __entry->target_nested = target_nested;
+   __entry->target_running = target_running;
+   ),
+
+   TP_printk("source %d target %d (nested: %d, running %d)",
+ __entry->source_l1_apicid, __entry->target_l1_apicid,
+ __entry->target_nested, __entry->target_running)
+);
+
+TRACE_EVENT(kvm_avic_nested_kick_vcpu,
+   TP_PROTO(int source_l1_apic_id, int target_l2_apic_id, int 
target_l1_apic_id),
+   TP_ARGS(source_l1_apic_id, target_l2_apic_id, target_l1_apic_id),
+
+   TP_STRUCT__en

[RFC PATCH v3 08/19] KVM: x86: SVM: move avic state to separate struct

2022-04-27 Thread Maxim Levitsky
This will make the code a bit easier to read when nested AVIC support
is added.

No functional change intended.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c | 51 +++--
 arch/x86/kvm/svm/svm.h  | 14 ++-
 2 files changed, 37 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 1102421668a11..e5cbbb97fbab6 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -69,6 +69,8 @@ int avic_ga_log_notifier(u32 ga_tag)
unsigned long flags;
struct kvm_svm *kvm_svm;
struct kvm_vcpu *vcpu = NULL;
+   struct kvm_svm_avic *avic;
+
u32 vm_id = AVIC_GATAG_TO_VMID(ga_tag);
u32 vcpu_id = AVIC_GATAG_TO_VCPUID(ga_tag);
 
@@ -76,9 +78,13 @@ int avic_ga_log_notifier(u32 ga_tag)
trace_kvm_avic_ga_log(vm_id, vcpu_id);
 
spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
-   hash_for_each_possible(svm_vm_data_hash, kvm_svm, hnode, vm_id) {
-   if (kvm_svm->avic_vm_id != vm_id)
+   hash_for_each_possible(svm_vm_data_hash, avic, hnode, vm_id) {
+
+
+   if (avic->vm_id != vm_id)
continue;
+
+   kvm_svm = container_of(avic, struct kvm_svm, avic);
vcpu = kvm_get_vcpu_by_id(&kvm_svm->kvm, vcpu_id);
break;
}
@@ -98,18 +104,18 @@ int avic_ga_log_notifier(u32 ga_tag)
 void avic_vm_destroy(struct kvm *kvm)
 {
unsigned long flags;
-   struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
+   struct kvm_svm_avic *avic = &to_kvm_svm(kvm)->avic;
 
if (!enable_apicv)
return;
 
-   if (kvm_svm->avic_logical_id_table_page)
-   __free_page(kvm_svm->avic_logical_id_table_page);
-   if (kvm_svm->avic_physical_id_table_page)
-   __free_page(kvm_svm->avic_physical_id_table_page);
+   if (avic->logical_id_table_page)
+   __free_page(avic->logical_id_table_page);
+   if (avic->physical_id_table_page)
+   __free_page(avic->physical_id_table_page);
 
spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
-   hash_del(&kvm_svm->hnode);
+   hash_del(&avic->hnode);
spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);
 }
 
@@ -117,10 +123,9 @@ int avic_vm_init(struct kvm *kvm)
 {
unsigned long flags;
int err = -ENOMEM;
-   struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
-   struct kvm_svm *k2;
struct page *p_page;
struct page *l_page;
+   struct kvm_svm_avic *avic = &to_kvm_svm(kvm)->avic;
u32 vm_id;
 
if (!enable_apicv)
@@ -131,14 +136,14 @@ int avic_vm_init(struct kvm *kvm)
if (!p_page)
goto free_avic;
 
-   kvm_svm->avic_physical_id_table_page = p_page;
+   avic->physical_id_table_page = p_page;
 
/* Allocating logical APIC ID table (4KB) */
l_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
if (!l_page)
goto free_avic;
 
-   kvm_svm->avic_logical_id_table_page = l_page;
+   avic->logical_id_table_page = l_page;
 
spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
  again:
@@ -149,13 +154,15 @@ int avic_vm_init(struct kvm *kvm)
}
/* Is it still in use? Only possible if wrapped at least once */
if (next_vm_id_wrapped) {
-   hash_for_each_possible(svm_vm_data_hash, k2, hnode, vm_id) {
-   if (k2->avic_vm_id == vm_id)
+   struct kvm_svm_avic *avic2;
+
+   hash_for_each_possible(svm_vm_data_hash, avic2, hnode, vm_id) {
+   if (avic2->vm_id == vm_id)
goto again;
}
}
-   kvm_svm->avic_vm_id = vm_id;
-   hash_add(svm_vm_data_hash, &kvm_svm->hnode, kvm_svm->avic_vm_id);
+   avic->vm_id = vm_id;
+   hash_add(svm_vm_data_hash, &avic->hnode, avic->vm_id);
spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);
 
return 0;
@@ -169,8 +176,8 @@ void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
 {
struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm);
phys_addr_t bpa = __sme_set(page_to_phys(svm->avic_backing_page));
-   phys_addr_t lpa = 
__sme_set(page_to_phys(kvm_svm->avic_logical_id_table_page));
-   phys_addr_t ppa = 
__sme_set(page_to_phys(kvm_svm->avic_physical_id_table_page));
+   phys_addr_t lpa = 
__sme_set(page_to_phys(kvm_svm->avic.logical_id_table_page));
+   phys_addr_t ppa = 
__sme_set(page_to_phys(kvm_svm->avic.physical_id_table_page));
 
vmcb->control.avic_backing_page = bpa & AVIC_HPA_MASK;
vmcb->control.avic_logical_id = lpa & AVIC_HPA_MASK;
@@ -193,7 +200,7 @@ static u64 *avic_get_physical_id_ent

[RFC PATCH v3 07/19] KVM: x86: mmu: tweak fast path for emulation of access to nested NPT pages

2022-04-27 Thread Maxim Levitsky
If a non-leaf mmu page is write-tracked externally for some reason,
which can in theory happen if it was previously used as a nested AVIC
physid page, then this code will enter an endless loop of page faults,
because unprotecting the mmu page will not remove the write tracking,
nor will the write-tracking callback be called, since there is no mmu
page at this address.

Fix this by only invoking the fast path if we succeeded in zapping the
mmu page.

Fixes: 147277540bbc5 ("kvm: svm: Add support for additional SVM NPF error 
codes")
Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 633a3138d68e1..8f77d41e7fd80 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5341,8 +5341,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t 
cr2_or_gpa, u64 error_code,
 */
if (vcpu->arch.mmu->root_role.direct &&
(error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
-   kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
-   return 1;
+   if (kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
+   return 1;
}
 
/*
-- 
2.26.3



[RFC PATCH v3 06/19] KVM: x86: mmu: add gfn_in_memslot helper

2022-04-27 Thread Maxim Levitsky
This is a tiny refactoring that makes it a bit cleaner to check
whether a GPA/GFN is within a memslot.
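
A trivial usage example (hypothetical caller, not part of this patch):
deciding whether a tracked table page is affected by a memslot
operation:

static bool table_page_in_slot(struct kvm_memory_slot *slot, gpa_t table_gpa)
{
        /* A NULL slot means the gfn is not covered by any memslot. */
        return slot && gfn_in_memslot(slot, gpa_to_gfn(table_gpa));
}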

Signed-off-by: Maxim Levitsky 
---
 include/linux/kvm_host.h | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 252ee4a61b58b..12e261559070b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1580,6 +1580,13 @@ int kvm_request_irq_source_id(struct kvm *kvm);
 void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
 bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
 
+
+static inline bool gfn_in_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+   return (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages);
+}
+
+
 /*
  * Returns a pointer to the memslot if it contains gfn.
  * Otherwise returns NULL.
@@ -1590,12 +1597,13 @@ try_get_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
if (!slot)
return NULL;
 
-   if (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
+   if (gfn_in_memslot(slot, gfn))
return slot;
else
return NULL;
 }
 
+
 /*
  * Returns a pointer to the memslot that contains gfn. Otherwise returns NULL.
  *
-- 
2.26.3



[RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally

2022-04-27 Thread Maxim Levitsky
This will be used to enable write tracking from the nested AVIC code,
and can also be used to enable write tracking in the GVT-g module only
when it actually uses it, as opposed to always enabling it when the
module is compiled into the kernel.

No functional change intended.
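
For context, a minimal sketch (assumed caller, not part of this patch)
of how an external user such as the nested AVIC code or KVMGT would
consume the new API - enable tracking once per VM, then track
individual pages:

static int example_write_track_gfn(struct kvm *kvm,
                                   struct kvm_memory_slot *slot, gfn_t gfn)
{
        int r;

        /* One-time, per-VM: allocate the rmap/gfn_track metadata if needed. */
        r = kvm_page_track_write_tracking_enable(kvm);
        if (r)
                return r;

        /* Then write-track individual pages under the MMU lock. */
        write_lock(&kvm->mmu_lock);
        kvm_slot_page_track_add_page(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE);
        write_unlock(&kvm->mmu_lock);

        return 0;
}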

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm_host.h   |  2 +-
 arch/x86/include/asm/kvm_page_track.h |  1 +
 arch/x86/kvm/mmu.h|  8 +---
 arch/x86/kvm/mmu/mmu.c| 17 ++---
 arch/x86/kvm/mmu/page_track.c | 10 --
 5 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 636df87542555..fc7df778a3d71 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1254,7 +1254,7 @@ struct kvm_arch {
 * is used as one input when determining whether certain memslot
 * related allocations are necessary.
 */
-   bool shadow_root_allocated;
+   bool mmu_page_tracking_enabled;
 
 #if IS_ENABLED(CONFIG_HYPERV)
hpa_t   hv_root_tdp;
diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
index eb186bc57f6a9..955a5ae07b10e 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -50,6 +50,7 @@ int kvm_page_track_init(struct kvm *kvm);
 void kvm_page_track_cleanup(struct kvm *kvm);
 
 bool kvm_page_track_write_tracking_enabled(struct kvm *kvm);
+int kvm_page_track_write_tracking_enable(struct kvm *kvm);
 int kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot);
 
 void kvm_page_track_free_memslot(struct kvm_memory_slot *slot);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 671cfeccf04e9..44d15551f7156 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -269,7 +269,7 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 int kvm_mmu_post_init_vm(struct kvm *kvm);
 void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
 
-static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
+static inline bool mmu_page_tracking_enabled(struct kvm *kvm)
 {
/*
 * Read shadow_root_allocated before related pointers. Hence, threads
@@ -277,9 +277,11 @@ static inline bool kvm_shadow_root_allocated(struct kvm 
*kvm)
 * see the pointers. Pairs with smp_store_release in
 * mmu_first_shadow_root_alloc.
 */
-   return smp_load_acquire(&kvm->arch.shadow_root_allocated);
+   return smp_load_acquire(&kvm->arch.mmu_page_tracking_enabled);
 }
 
+int mmu_enable_write_tracking(struct kvm *kvm);
+
 #ifdef CONFIG_X86_64
 static inline bool is_tdp_mmu_enabled(struct kvm *kvm) { return 
kvm->arch.tdp_mmu_enabled; }
 #else
@@ -288,7 +290,7 @@ static inline bool is_tdp_mmu_enabled(struct kvm *kvm) { 
return false; }
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
-   return !is_tdp_mmu_enabled(kvm) || kvm_shadow_root_allocated(kvm);
+   return !is_tdp_mmu_enabled(kvm) || mmu_page_tracking_enabled(kvm);
 }
 
 static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 904f0faff2186..fb744616bf7df 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3389,7 +3389,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
return r;
 }
 
-static int mmu_first_shadow_root_alloc(struct kvm *kvm)
+int mmu_enable_write_tracking(struct kvm *kvm)
 {
struct kvm_memslots *slots;
struct kvm_memory_slot *slot;
@@ -3399,21 +3399,20 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 * Check if this is the first shadow root being allocated before
 * taking the lock.
 */
-   if (kvm_shadow_root_allocated(kvm))
+   if (mmu_page_tracking_enabled(kvm))
return 0;
 
mutex_lock(&kvm->slots_arch_lock);
 
/* Recheck, under the lock, whether this is the first shadow root. */
-   if (kvm_shadow_root_allocated(kvm))
+   if (mmu_page_tracking_enabled(kvm))
goto out_unlock;
 
/*
 * Check if anything actually needs to be allocated, e.g. all metadata
 * will be allocated upfront if TDP is disabled.
 */
-   if (kvm_memslots_have_rmaps(kvm) &&
-   kvm_page_track_write_tracking_enabled(kvm))
+   if (kvm_memslots_have_rmaps(kvm) && mmu_page_tracking_enabled(kvm))
goto out_success;
 
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
@@ -3443,7 +3442,7 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 * all the related pointers are set.
 */
 out_success:
-   smp_store_release(&kvm->arch.shadow_root_allocated, true);
+   smp_store_release(&kvm->arch.mmu_page_tracking_enabled, true);
 
 out_unlock:
mutex_unlock(&kvm->slots_arch_lock);
@@ -3480,7 +3

[RFC PATCH v3 05/19] x86: KVMGT: use kvm_page_track_write_tracking_enable

2022-04-27 Thread Maxim Levitsky
This allows enabling write tracking only when KVMGT is
actually used, so it doesn't carry any penalty otherwise.

Tested by booting a VM with a kvmgt mdev device.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/Kconfig | 3 ---
 arch/x86/kvm/mmu/mmu.c   | 2 +-
 drivers/gpu/drm/i915/Kconfig | 1 -
 drivers/gpu/drm/i915/gvt/kvmgt.c | 5 +
 4 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index e3cbd77061364..41341905d3734 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -126,7 +126,4 @@ config KVM_XEN
 
  If in doubt, say "N".
 
-config KVM_EXTERNAL_WRITE_TRACKING
-   bool
-
 endif # VIRTUALIZATION
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fb744616bf7df..633a3138d68e1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5753,7 +5753,7 @@ int kvm_mmu_init_vm(struct kvm *kvm)
node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
kvm_page_track_register_notifier(kvm, node);
 
-   if (IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) || !tdp_enabled)
+   if (!tdp_enabled)
mmu_enable_write_tracking(kvm);
 
return 0;
diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
index 98c5450b8eacc..7d8346f4bae11 100644
--- a/drivers/gpu/drm/i915/Kconfig
+++ b/drivers/gpu/drm/i915/Kconfig
@@ -130,7 +130,6 @@ config DRM_I915_GVT_KVMGT
depends on DRM_I915_GVT
depends on KVM
depends on VFIO_MDEV
-   select KVM_EXTERNAL_WRITE_TRACKING
default n
help
  Choose this option if you want to enable KVMGT support for
diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index 057ec44901045..4c62ab3ef245d 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -1933,6 +1933,7 @@ static int kvmgt_guest_init(struct mdev_device *mdev)
struct intel_vgpu *vgpu;
struct kvmgt_vdev *vdev;
struct kvm *kvm;
+   int ret;
 
vgpu = mdev_get_drvdata(mdev);
if (handle_valid(vgpu->handle))
@@ -1948,6 +1949,10 @@ static int kvmgt_guest_init(struct mdev_device *mdev)
if (__kvmgt_vgpu_exist(vgpu, kvm))
return -EEXIST;
 
+   ret = kvm_page_track_write_tracking_enable(kvm);
+   if (ret)
+   return ret;
+
info = vzalloc(sizeof(struct kvmgt_guest_info));
if (!info)
return -ENOMEM;
-- 
2.26.3



[RFC PATCH v3 03/19] KVM: x86: SVM: remove avic's broken code that updated APIC ID

2022-04-27 Thread Maxim Levitsky
AVIC is now inhibited if the guest changes the APIC ID, so remove
that broken code.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c | 35 ---
 1 file changed, 35 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 54fe03714f8a6..1102421668a11 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -508,35 +508,6 @@ static int avic_handle_ldr_update(struct kvm_vcpu *vcpu)
return ret;
 }
 
-static int avic_handle_apic_id_update(struct kvm_vcpu *vcpu)
-{
-   u64 *old, *new;
-   struct vcpu_svm *svm = to_svm(vcpu);
-   u32 id = kvm_xapic_id(vcpu->arch.apic);
-
-   if (vcpu->vcpu_id == id)
-   return 0;
-
-   old = avic_get_physical_id_entry(vcpu, vcpu->vcpu_id);
-   new = avic_get_physical_id_entry(vcpu, id);
-   if (!new || !old)
-   return 1;
-
-   /* We need to move physical_id_entry to new offset */
-   *new = *old;
-   *old = 0ULL;
-   to_svm(vcpu)->avic_physical_id_cache = new;
-
-   /*
-* Also update the guest physical APIC ID in the logical
-* APIC ID table entry if already setup the LDR.
-*/
-   if (svm->ldr_reg)
-   avic_handle_ldr_update(vcpu);
-
-   return 0;
-}
-
 static void avic_handle_dfr_update(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
@@ -555,10 +526,6 @@ static int avic_unaccel_trap_write(struct kvm_vcpu *vcpu)
AVIC_UNACCEL_ACCESS_OFFSET_MASK;
 
switch (offset) {
-   case APIC_ID:
-   if (avic_handle_apic_id_update(vcpu))
-   return 0;
-   break;
case APIC_LDR:
if (avic_handle_ldr_update(vcpu))
return 0;
@@ -650,8 +617,6 @@ int avic_init_vcpu(struct vcpu_svm *svm)
 
 void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
-   if (avic_handle_apic_id_update(vcpu) != 0)
-   return;
avic_handle_dfr_update(vcpu);
avic_handle_ldr_update(vcpu);
 }
-- 
2.26.3



[RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.

2022-04-27 Thread Maxim Levitsky
Neither of these settings should be changed by the guest and it is
a burden to support it in the acceleration code, so just inhibit
it instead.

Also add a boolean 'apic_id_changed' to indicate if apic id ever changed.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/lapic.c| 25 ++---
 arch/x86/kvm/lapic.h|  8 
 3 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 63eae00625bda..636df87542555 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1070,6 +1070,8 @@ enum kvm_apicv_inhibit {
APICV_INHIBIT_REASON_ABSENT,
/* AVIC is disabled because SEV doesn't support it */
APICV_INHIBIT_REASON_SEV,
+   /* APIC ID and/or APIC base was changed by the guest */
+   APICV_INHIBIT_REASON_RO_SETTINGS,
 };
 
 struct kvm_arch {
@@ -1258,6 +1260,7 @@ struct kvm_arch {
hpa_t   hv_root_tdp;
spinlock_t hv_root_tdp_lock;
 #endif
+   bool apic_id_changed;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 66b0eb0bda94e..8996675b3ef4c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2038,6 +2038,19 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic 
*apic, u32 lvt0_val)
}
 }
 
+static void kvm_lapic_check_initial_apic_id(struct kvm_lapic *apic)
+{
+   if (kvm_apic_has_initial_apic_id(apic))
+   return;
+
+   pr_warn_once("APIC ID change is unsupported by KVM");
+
+   kvm_set_apicv_inhibit(apic->vcpu->kvm,
+   APICV_INHIBIT_REASON_RO_SETTINGS);
+
+   apic->vcpu->kvm->arch.apic_id_changed = true;
+}
+
 static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 {
int ret = 0;
@@ -2046,9 +2059,11 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, 
u32 reg, u32 val)
 
switch (reg) {
case APIC_ID:   /* Local APIC ID */
-   if (!apic_x2apic_mode(apic))
+   if (!apic_x2apic_mode(apic)) {
+
kvm_apic_set_xapic_id(apic, val >> 24);
-   else
+   kvm_lapic_check_initial_apic_id(apic);
+   } else
ret = 1;
break;
 
@@ -2335,8 +2350,11 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
 MSR_IA32_APICBASE_BASE;
 
if ((value & MSR_IA32_APICBASE_ENABLE) &&
-apic->base_address != APIC_DEFAULT_PHYS_BASE)
+apic->base_address != APIC_DEFAULT_PHYS_BASE) {
+   kvm_set_apicv_inhibit(apic->vcpu->kvm,
+   APICV_INHIBIT_REASON_RO_SETTINGS);
pr_warn_once("APIC base relocation is unsupported by KVM");
+   }
 }
 
 void kvm_apic_update_apicv(struct kvm_vcpu *vcpu)
@@ -2649,6 +2667,7 @@ static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
}
}
 
+   kvm_lapic_check_initial_apic_id(vcpu->arch.apic);
return 0;
 }
 
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 4e4f8a22754f9..b9c406d383080 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -252,4 +252,12 @@ static inline u8 kvm_xapic_id(struct kvm_lapic *apic)
return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
 }
 
+static inline bool kvm_apic_has_initial_apic_id(struct kvm_lapic *apic)
+{
+   if (apic_x2apic_mode(apic))
+   return true;
+
+   return kvm_xapic_id(apic) == apic->vcpu->vcpu_id;
+}
+
 #endif
-- 
2.26.3



[RFC PATCH v3 01/19] KVM: x86: document AVIC/APICv inhibit reasons

2022-04-27 Thread Maxim Levitsky
These days there are too many AVIC/APICv inhibit
reasons, and it doesn't hurt to have some documentation
for them.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm_host.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f164c6c1514a4..63eae00625bda 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1046,14 +1046,29 @@ struct kvm_x86_msr_filter {
 };
 
 enum kvm_apicv_inhibit {
+   /* APICv/AVIC is disabled by module param and/or not supported in hardware */
   APICV_INHIBIT_REASON_DISABLE,
+   /* APICv/AVIC is inhibited because AutoEOI feature is being used by a HyperV guest */
   APICV_INHIBIT_REASON_HYPERV,
+   /* AVIC is inhibited on a CPU because it runs a nested guest */
   APICV_INHIBIT_REASON_NESTED,
+   /* AVIC is inhibited due to wait for an irq window (AVIC doesn't support this) */
   APICV_INHIBIT_REASON_IRQWIN,
+   /*
+* AVIC is inhibited because i8254 're-inject' mode is used
+* which needs EOI intercept which AVIC doesn't support
+*/
APICV_INHIBIT_REASON_PIT_REINJ,
+   /* AVIC is inhibited because the guest has x2apic in its CPUID*/
APICV_INHIBIT_REASON_X2APIC,
+   /* AVIC/APICv is inhibited because KVM_GUESTDBG_BLOCKIRQ was enabled */
APICV_INHIBIT_REASON_BLOCKIRQ,
+   /*
+* AVIC/APICv is inhibited because the guest didn't yet
+* enable kernel/split irqchip
+*/
APICV_INHIBIT_REASON_ABSENT,
+   /* AVIC is disabled because SEV doesn't support it */
APICV_INHIBIT_REASON_SEV,
 };
 
-- 
2.26.3



[RFC PATCH v3 00/19] RFC: nested AVIC

2022-04-27 Thread Maxim Levitsky
This is V3 of my nested AVIC patches.

I fixed a few more bugs, and I also split the code into smaller patches.

Review is welcome!

Best regards,
Maxim Levitsky

Maxim Levitsky (19):
  KVM: x86: document AVIC/APICv inhibit reasons
  KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic
id/base from the defaults.
  KVM: x86: SVM: remove avic's broken code that updated APIC ID
  KVM: x86: mmu: allow to enable write tracking externally
  x86: KVMGT: use kvm_page_track_write_tracking_enable
  KVM: x86: mmu: add gfn_in_memslot helper
  KVM: x86: mmu: tweak fast path for emulation of access to nested NPT
pages
  KVM: x86: SVM: move avic state to separate struct
  KVM: x86: nSVM: add nested AVIC tracepoints
  KVM: x86: nSVM: implement AVIC's physid/logid table access helpers
  KVM: x86: nSVM: implement shadowing of AVIC's physical id table
  KVM: x86: nSVM: make nested AVIC physid write tracking be aware of the
host scheduling
  KVM: x86: nSVM: wire nested AVIC to nested guest entry/exit
  KVM: x86: rename .set_apic_access_page_addr to reload_apic_access_page
  KVM: x86: nSVM: add code to reload AVIC physid table when it is
invalidated
  KVM: x86: nSVM: implement support for nested AVIC vmexits
  KVM: x86: nSVM: implement nested AVIC doorbell emulation
  KVM: x86: SVM/nSVM: add optional non strict AVIC doorbell mode
  KVM: x86: nSVM: expose the nested AVIC to the guest

 arch/x86/include/asm/kvm-x86-ops.h|   2 +-
 arch/x86/include/asm/kvm_host.h   |  23 +-
 arch/x86/include/asm/kvm_page_track.h |   1 +
 arch/x86/kvm/Kconfig  |   3 -
 arch/x86/kvm/lapic.c  |  25 +-
 arch/x86/kvm/lapic.h  |   8 +
 arch/x86/kvm/mmu.h|   8 +-
 arch/x86/kvm/mmu/mmu.c|  21 +-
 arch/x86/kvm/mmu/page_track.c |  10 +-
 arch/x86/kvm/svm/avic.c   | 985 +++---
 arch/x86/kvm/svm/nested.c | 141 +++-
 arch/x86/kvm/svm/svm.c|  39 +-
 arch/x86/kvm/svm/svm.h| 166 -
 arch/x86/kvm/trace.h  | 157 +++-
 arch/x86/kvm/vmx/vmx.c|   8 +-
 arch/x86/kvm/x86.c|  19 +-
 drivers/gpu/drm/i915/Kconfig  |   1 -
 drivers/gpu/drm/i915/gvt/kvmgt.c  |   5 +
 include/linux/kvm_host.h  |  10 +-
 19 files changed, 1507 insertions(+), 125 deletions(-)

-- 
2.26.3




Re: [RFC PATCH v2 04/10] KVM: x86: mmu: tweak fast path for emulation of access to nested NPT pages

2022-04-20 Thread Maxim Levitsky
On Thu, 2022-04-21 at 08:12 +0300, Maxim Levitsky wrote:
> ---
>  arch/x86/kvm/mmu/mmu.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 23f895d439cf5..b63398dfdac3b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5315,8 +5315,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t 
> cr2_or_gpa, u64 error_code,
>*/
>   if (vcpu->arch.mmu->root_role.direct &&
>   (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
> - kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
> - return 1;
> + if (kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
> + return 1;
>   }
>  
>   /*

I forgot to add commit description here:

If a non-leaf mmu page is write-tracked externally for some reason,
which can in theory happen if it was previously used as a nested AVIC
physid page, then this code will enter an endless loop of page faults,
because unprotecting the page will not remove the write tracking, nor
will the write-tracker callback be called.

Fix this by only taking the fast path if we succeeded in zapping the
mmu page.

Fixes: 147277540bbc5 ("kvm: svm: Add support for additional SVM NPF error 
codes")
Signed-off-by: Maxim Levitsky 

--

In theory, KVMGT also does external write tracking, so this issue could
happen today, but it is highly unlikely.
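
To make the failure mode concrete, here is a rough sketch of the retry
logic after the fix; the surrounding kvm_mmu_page_fault() code is
simplified and not the literal diff:

	if (vcpu->arch.mmu->root_role.direct &&
	    (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
		/*
		 * Retry the faulting instruction only if a shadow page was
		 * actually zapped; if the gfn is write-tracked externally
		 * (e.g. by a nested AVIC physid page), fall through to
		 * emulation instead of re-faulting forever.
		 */
		if (kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
			return 1;
	}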

Best regards,
Maxim Levitsky



[RFC PATCH v2 10/10] KVM: SVM: allow to avoid not needed updates to is_running

2022-04-20 Thread Maxim Levitsky
Optionally allow KVM to skip updates to is_running unless they are
functionally needed, which is only when a vCPU halts or is in
guest mode.

Security-wise this means that if a vCPU is scheduled out, other
vCPUs could still send doorbell messages to the last physical CPU
where this vCPU was running.

If a malicious guest tries to exploit this, it can slow down the
victim CPU by about 40% in my testing, so this mode should only be
enabled if physical CPUs are not shared among guests.

The option is avic_doorbell_strict and is true by default; setting
it to false enables the relaxed, non-strict mode.
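
As a rough summary of the resulting policy, here is a sketch (not the
literal diff; the helper name is made up) of when the is_running /
doorbell state is torn down on vcpu_put:

static bool avic_must_tear_down_on_put(struct kvm_vcpu *vcpu)
{
	/* In strict mode, always clear is_running when the vCPU is put. */
	if (avic_doorbell_strict)
		return true;

	/*
	 * Otherwise only do it when the vCPU will not re-enter the guest
	 * soon anyway, i.e. when it blocks or runs a nested guest.
	 */
	return kvm_vcpu_is_blocking(vcpu) || is_guest_mode(vcpu);
}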

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c | 19 ---
 arch/x86/kvm/svm/svm.c  | 19 ++-
 arch/x86/kvm/svm/svm.h  |  1 +
 3 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 9176c35662ada..1bfe58ee961b2 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -1641,7 +1641,7 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, 
int cpu, bool r)
 
 void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
-   u64 entry;
+   u64 old_entry, new_entry;
int h_physical_id = kvm_cpu_get_apicid(cpu);
struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -1660,14 +1660,16 @@ void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (kvm_vcpu_is_blocking(vcpu))
return;
 
-   entry = READ_ONCE(*(svm->avic_physical_id_cache));
-   WARN_ON(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
+   old_entry = READ_ONCE(*(svm->avic_physical_id_cache));
+   new_entry = old_entry;
 
-   entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
-   entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
-   entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+   new_entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
+   new_entry |= (h_physical_id & 
AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
+   new_entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+
+   if (old_entry != new_entry)
+   WRITE_ONCE(*(svm->avic_physical_id_cache), new_entry);
 
-   WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
 }
 
@@ -1777,6 +1779,9 @@ void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 
 void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
 {
+   if (!avic_doorbell_strict)
+   __nested_avic_put(vcpu);
+
if (!kvm_vcpu_apicv_active(vcpu))
return;
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 3d9ab1e7b2b52..7e79fefc81650 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -190,6 +190,10 @@ module_param(avic, bool, 0444);
 static bool force_avic;
 module_param_unsafe(force_avic, bool, 0444);
 
+bool avic_doorbell_strict = true;
+module_param(avic_doorbell_strict, bool, 0444);
+
+
 bool __read_mostly dump_invalid_vmcb;
 module_param(dump_invalid_vmcb, bool, 0644);
 
@@ -1395,16 +1399,21 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int 
cpu)
 
if (kvm_vcpu_apicv_active(vcpu))
__avic_vcpu_load(vcpu, cpu);
-
__nested_avic_load(vcpu, cpu);
 }
 
 static void svm_vcpu_put(struct kvm_vcpu *vcpu)
 {
-   if (kvm_vcpu_apicv_active(vcpu))
-   __avic_vcpu_put(vcpu);
-
-   __nested_avic_put(vcpu);
+   /*
+* Forbid AVIC's peers to send interrupts
+* to this CPU unless we are in non strict mode,
+* in which case, we will do so only when this vCPU blocks
+*/
+   if (avic_doorbell_strict) {
+   if (kvm_vcpu_apicv_active(vcpu))
+   __avic_vcpu_put(vcpu);
+   __nested_avic_put(vcpu);
+   }
 
svm_prepare_host_switch(vcpu);
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 7d1a5028750e6..7139bbb534f9e 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -36,6 +36,7 @@ extern u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;
 extern bool npt_enabled;
 extern int vgif;
 extern bool intercept_smi;
+extern bool avic_doorbell_strict;
 
 /*
  * Clean bits in VMCB.
-- 
2.26.3



[RFC PATCH v2 09/10] KVM: nSVM: implement support for nested AVIC

2022-04-20 Thread Maxim Levitsky
This implements initial support for using the AVIC in a nested guest.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c   | 850 +-
 arch/x86/kvm/svm/nested.c | 131 +-
 arch/x86/kvm/svm/svm.c|  18 +
 arch/x86/kvm/svm/svm.h| 150 +++
 arch/x86/kvm/trace.h  | 140 ++-
 arch/x86/kvm/x86.c|  11 +
 6 files changed, 1282 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 87756237c646d..9176c35662ada 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -51,6 +51,526 @@ static u32 next_vm_id = 0;
 static bool next_vm_id_wrapped = 0;
 static DEFINE_SPINLOCK(svm_vm_data_hash_lock);
 
+static u32 nested_avic_get_reg(struct kvm_vcpu *vcpu, int reg_off)
+{
+   struct vcpu_svm *svm = to_svm(vcpu);
+
+   void *nested_apic_regs = svm->nested.l2_apic_access_page.hva;
+
+   if (WARN_ON_ONCE(!nested_apic_regs))
+   return 0;
+
+   return *((u32 *) (nested_apic_regs + reg_off));
+}
+
+static inline struct kvm_vcpu *avic_vcpu_by_l1_apicid(struct kvm *kvm,
+ int l1_apicid)
+{
+   WARN_ON(l1_apicid == -1);
+   return kvm_get_vcpu_by_id(kvm, l1_apicid);
+}
+
+static void avic_physid_shadow_entry_set_vcpu(struct kvm *kvm,
+ struct avic_physid_table *t,
+ int n,
+ int new_l1_apicid)
+{
+   struct avic_physid_entry_descr *e = &t->entries[n];
+   u64 sentry = READ_ONCE(*e->sentry);
+   u64 old_sentry = sentry;
+   struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
+   struct kvm_vcpu *new_vcpu = NULL;
+   int l0_apicid = -1;
+   unsigned long flags;
+
+   raw_spin_lock_irqsave(&kvm_svm->avic.table_entries_lock, flags);
+
+   WARN_ON(!test_bit(n, t->valid_entires));
+
+   if (!list_empty(&e->link))
+   list_del_init(&e->link);
+
+   if (new_l1_apicid != -1)
+   new_vcpu = avic_vcpu_by_l1_apicid(kvm, new_l1_apicid);
+
+   if (new_vcpu)
+   list_add_tail(&e->link, 
&to_svm(new_vcpu)->nested.physid_ref_entries);
+
+   if (new_vcpu && to_svm(new_vcpu)->nested_avic_active)
+   l0_apicid = kvm_cpu_get_apicid(new_vcpu->cpu);
+
+   physid_entry_set_apicid(&sentry, l0_apicid);
+
+   if (sentry != old_sentry)
+   WRITE_ONCE(*e->sentry, sentry);
+
+   raw_spin_unlock_irqrestore(&kvm_svm->avic.table_entries_lock, flags);
+}
+
+static void avic_physid_shadow_entry_create(struct kvm *kvm,
+   struct avic_physid_table *t,
+   int n,
+   u64 gentry)
+{
+   struct avic_physid_entry_descr *e = &t->entries[n];
+   struct page *backing_page;
+   u64 backing_page_gpa = physid_entry_get_backing_table(gentry);
+   int l1_apic_id = physid_entry_get_apicid(gentry);
+   hpa_t backing_page_hpa;
+   u64 sentry = 0;
+
+
+   if (backing_page_gpa == INVALID_BACKING_PAGE)
+   return;
+
+   /* Pin the APIC backing page */
+   backing_page = gfn_to_page(kvm, gpa_to_gfn(backing_page_gpa));
+
+   if (is_error_page(backing_page))
+   /* Invalid GPA in the guest entry - point to a dummy entry */
+   backing_page_hpa = t->dummy_page_hpa;
+   else
+   backing_page_hpa = page_to_phys(backing_page);
+
+   physid_entry_set_backing_table(&sentry, backing_page_hpa);
+
+   e->gentry = gentry;
+   *e->sentry = sentry;
+
+   if (test_and_set_bit(n, t->valid_entires))
+   WARN_ON(1);
+
+   if (backing_page_hpa != t->dummy_page_hpa)
+   avic_physid_shadow_entry_set_vcpu(kvm, t, n, l1_apic_id);
+}
+
+static void avic_physid_shadow_entry_remove(struct kvm *kvm,
+  struct avic_physid_table *t,
+  int n)
+{
+   struct avic_physid_entry_descr *e = &t->entries[n];
+   struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
+   hpa_t backing_page_hpa;
+   unsigned long flags;
+
+   raw_spin_lock_irqsave(&kvm_svm->avic.table_entries_lock, flags);
+
+   if (!test_and_clear_bit(n, t->valid_entires))
+   WARN_ON(1);
+
+   /* Release the APIC backing page */
+   backing_page_hpa = physid_entry_get_backing_table(*e->sentry);
+
+   if (backing_page_hpa != t->dummy_page_hpa)
+   kvm_release_pfn_dirty(backing_page_hpa >> PAGE_SHIFT);
+
+   if (!list_empty(&e->link))
+   list_del_init(&e->link);
+
+   e->gentry = 0;
+   *e->sentry = 0;
+
+   raw_spin_unlock_irqrestore(&k

[RFC PATCH v2 08/10] KVM: x86: rename .set_apic_access_page_addr to reload_apic_access_page

2022-04-20 Thread Maxim Levitsky
This will be used on SVM to reload the shadow page of the AVIC physid table.

No functional change intended

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm-x86-ops.h | 2 +-
 arch/x86/include/asm/kvm_host.h| 3 +--
 arch/x86/kvm/vmx/vmx.c | 8 
 arch/x86/kvm/x86.c | 6 +++---
 4 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h 
b/arch/x86/include/asm/kvm-x86-ops.h
index 96e4e9842dfc6..997edb7453ac2 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -82,7 +82,7 @@ KVM_X86_OP_OPTIONAL(hwapic_isr_update)
 KVM_X86_OP_OPTIONAL_RET0(guest_apic_has_interrupt)
 KVM_X86_OP_OPTIONAL(load_eoi_exitmap)
 KVM_X86_OP_OPTIONAL(set_virtual_apic_mode)
-KVM_X86_OP_OPTIONAL(set_apic_access_page_addr)
+KVM_X86_OP_OPTIONAL(reload_apic_pages)
 KVM_X86_OP(deliver_interrupt)
 KVM_X86_OP_OPTIONAL(sync_pir_to_irr)
 KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ae41d2df69fe9..f83cfcd7dd74c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1415,7 +1415,7 @@ struct kvm_x86_ops {
bool (*guest_apic_has_interrupt)(struct kvm_vcpu *vcpu);
void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
void (*set_virtual_apic_mode)(struct kvm_vcpu *vcpu);
-   void (*set_apic_access_page_addr)(struct kvm_vcpu *vcpu);
+   void (*reload_apic_pages)(struct kvm_vcpu *vcpu);
void (*deliver_interrupt)(struct kvm_lapic *apic, int delivery_mode,
  int trig_mode, int vector);
int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
@@ -1888,7 +1888,6 @@ int kvm_cpu_has_extint(struct kvm_vcpu *v);
 int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
 int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
 void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
-
 int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
unsigned long ipi_bitmap_high, u32 min,
unsigned long icr, int op_64_bit);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index cf8581978bce3..7defd31703c61 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6339,7 +6339,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
vmx_update_msr_bitmap_x2apic(vcpu);
 }
 
-static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+static void vmx_reload_apic_access_page(struct kvm_vcpu *vcpu)
 {
struct page *page;
 
@@ -,7 +,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.enable_irq_window = vmx_enable_irq_window,
.update_cr8_intercept = vmx_update_cr8_intercept,
.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
-   .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
+   .reload_apic_pages = vmx_reload_apic_access_page,
.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
.load_eoi_exitmap = vmx_load_eoi_exitmap,
.apicv_post_state_restore = vmx_apicv_post_state_restore,
@@ -7940,12 +7940,12 @@ static __init int hardware_setup(void)
enable_vnmi = 0;
 
/*
-* set_apic_access_page_addr() is used to reload apic access
+* kvm_vcpu_reload_apic_pages() is used to reload apic access
 * page upon invalidation.  No need to do anything if not
 * using the APIC_ACCESS_ADDR VMCS field.
 */
if (!flexpriority_enabled)
-   vmx_x86_ops.set_apic_access_page_addr = NULL;
+   vmx_x86_ops.reload_apic_pages = NULL;
 
if (!cpu_has_vmx_tpr_shadow())
vmx_x86_ops.update_cr8_intercept = NULL;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ab336f7c82e4b..3ac2d0134271b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9949,12 +9949,12 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm 
*kvm,
kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
 }
 
-static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
+static void kvm_vcpu_reload_apic_pages(struct kvm_vcpu *vcpu)
 {
if (!lapic_in_kernel(vcpu))
return;
 
-   static_call_cond(kvm_x86_set_apic_access_page_addr)(vcpu);
+   static_call_cond(kvm_x86_reload_apic_pages)(vcpu);
 }
 
 void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
@@ -10071,7 +10071,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (kvm_check_request(KVM_REQ_LOAD_EOI_EXITMAP, vcpu))
vcpu_load_eoi_exitmap(vcpu);
if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
-   kvm_vcpu_reload_apic_access_page(vcpu);
+   kvm_vcpu_reload_apic_pages(vcpu);
if (kvm_check_request(KVM_REQ_HV_CRASH, vcpu)) {
vcpu->run->exit_

[RFC PATCH v2 07/10] KVM: x86: SVM: move avic state to separate struct

2022-04-20 Thread Maxim Levitsky
This will make the code a bit easier to read when nested AVIC support
is added.

No functional change intended.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c | 49 +++--
 arch/x86/kvm/svm/svm.h  | 14 +++-
 2 files changed, 36 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index f375ca1d6518e..87756237c646d 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -69,6 +69,8 @@ int avic_ga_log_notifier(u32 ga_tag)
unsigned long flags;
struct kvm_svm *kvm_svm;
struct kvm_vcpu *vcpu = NULL;
+   struct kvm_svm_avic *avic;
+
u32 vm_id = AVIC_GATAG_TO_VMID(ga_tag);
u32 vcpu_id = AVIC_GATAG_TO_VCPUID(ga_tag);
 
@@ -76,9 +78,13 @@ int avic_ga_log_notifier(u32 ga_tag)
trace_kvm_avic_ga_log(vm_id, vcpu_id);
 
spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
-   hash_for_each_possible(svm_vm_data_hash, kvm_svm, hnode, vm_id) {
-   if (kvm_svm->avic_vm_id != vm_id)
+   hash_for_each_possible(svm_vm_data_hash, avic, hnode, vm_id) {
+
+
+   if (avic->vm_id != vm_id)
continue;
+
+   kvm_svm = container_of(avic, struct kvm_svm, avic);
vcpu = kvm_get_vcpu_by_id(&kvm_svm->kvm, vcpu_id);
break;
}
@@ -98,18 +104,18 @@ int avic_ga_log_notifier(u32 ga_tag)
 void avic_vm_destroy(struct kvm *kvm)
 {
unsigned long flags;
-   struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
+   struct kvm_svm_avic *avic = &to_kvm_svm(kvm)->avic;
 
if (!enable_apicv)
return;
 
-   if (kvm_svm->avic_logical_id_table_page)
-   __free_page(kvm_svm->avic_logical_id_table_page);
-   if (kvm_svm->avic_physical_id_table_page)
-   __free_page(kvm_svm->avic_physical_id_table_page);
+   if (avic->logical_id_table_page)
+   __free_page(avic->logical_id_table_page);
+   if (avic->physical_id_table_page)
+   __free_page(avic->physical_id_table_page);
 
spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
-   hash_del(&kvm_svm->hnode);
+   hash_del(&avic->hnode);
spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);
 }
 
@@ -117,10 +123,9 @@ int avic_vm_init(struct kvm *kvm)
 {
unsigned long flags;
int err = -ENOMEM;
-   struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
-   struct kvm_svm *k2;
struct page *p_page;
struct page *l_page;
+   struct kvm_svm_avic *avic = &to_kvm_svm(kvm)->avic;
u32 vm_id;
 
if (!enable_apicv)
@@ -131,14 +136,14 @@ int avic_vm_init(struct kvm *kvm)
if (!p_page)
goto free_avic;
 
-   kvm_svm->avic_physical_id_table_page = p_page;
+   avic->physical_id_table_page = p_page;
 
/* Allocating logical APIC ID table (4KB) */
l_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
if (!l_page)
goto free_avic;
 
-   kvm_svm->avic_logical_id_table_page = l_page;
+   avic->logical_id_table_page = l_page;
 
spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
  again:
@@ -149,13 +154,15 @@ int avic_vm_init(struct kvm *kvm)
}
/* Is it still in use? Only possible if wrapped at least once */
if (next_vm_id_wrapped) {
-   hash_for_each_possible(svm_vm_data_hash, k2, hnode, vm_id) {
-   if (k2->avic_vm_id == vm_id)
+   struct kvm_svm_avic *avic2;
+
+   hash_for_each_possible(svm_vm_data_hash, avic2, hnode, vm_id) {
+   if (avic2->vm_id == vm_id)
goto again;
}
}
-   kvm_svm->avic_vm_id = vm_id;
-   hash_add(svm_vm_data_hash, &kvm_svm->hnode, kvm_svm->avic_vm_id);
+   avic->vm_id = vm_id;
+   hash_add(svm_vm_data_hash, &avic->hnode, avic->vm_id);
spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);
 
return 0;
@@ -169,8 +176,8 @@ void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
 {
struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm);
phys_addr_t bpa = __sme_set(page_to_phys(svm->avic_backing_page));
-   phys_addr_t lpa = 
__sme_set(page_to_phys(kvm_svm->avic_logical_id_table_page));
-   phys_addr_t ppa = 
__sme_set(page_to_phys(kvm_svm->avic_physical_id_table_page));
+   phys_addr_t lpa = 
__sme_set(page_to_phys(kvm_svm->avic.logical_id_table_page));
+   phys_addr_t ppa = 
__sme_set(page_to_phys(kvm_svm->avic.physical_id_table_page));
 
vmcb->control.avic_backing_page = bpa & AVIC_HPA_MASK;
vmcb->control.avic_logical_id = lpa & AVIC_HPA_MASK;
@@ -193,7 +200,7 @@ static u64 *avic_get_physical_id_ent

[RFC PATCH v2 06/10] KVM: x86: SVM: remove avic's broken code that updated APIC ID

2022-04-20 Thread Maxim Levitsky
Now that KVM doesn't allow changing the APIC ID when AVIC is
enabled, remove the buggy AVIC code that tried to do so.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c | 35 ---
 1 file changed, 35 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 9b859218af59c..f375ca1d6518e 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -442,35 +442,6 @@ static int avic_handle_ldr_update(struct kvm_vcpu *vcpu)
return ret;
 }
 
-static int avic_handle_apic_id_update(struct kvm_vcpu *vcpu)
-{
-   u64 *old, *new;
-   struct vcpu_svm *svm = to_svm(vcpu);
-   u32 id = kvm_xapic_id(vcpu->arch.apic);
-
-   if (vcpu->vcpu_id == id)
-   return 0;
-
-   old = avic_get_physical_id_entry(vcpu, vcpu->vcpu_id);
-   new = avic_get_physical_id_entry(vcpu, id);
-   if (!new || !old)
-   return 1;
-
-   /* We need to move physical_id_entry to new offset */
-   *new = *old;
-   *old = 0ULL;
-   to_svm(vcpu)->avic_physical_id_cache = new;
-
-   /*
-* Also update the guest physical APIC ID in the logical
-* APIC ID table entry if already setup the LDR.
-*/
-   if (svm->ldr_reg)
-   avic_handle_ldr_update(vcpu);
-
-   return 0;
-}
-
 static void avic_handle_dfr_update(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
@@ -489,10 +460,6 @@ static int avic_unaccel_trap_write(struct kvm_vcpu *vcpu)
AVIC_UNACCEL_ACCESS_OFFSET_MASK;
 
switch (offset) {
-   case APIC_ID:
-   if (avic_handle_apic_id_update(vcpu))
-   return 0;
-   break;
case APIC_LDR:
if (avic_handle_ldr_update(vcpu))
return 0;
@@ -584,8 +551,6 @@ int avic_init_vcpu(struct vcpu_svm *svm)
 
 void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
-   if (avic_handle_apic_id_update(vcpu) != 0)
-   return;
avic_handle_dfr_update(vcpu);
avic_handle_ldr_update(vcpu);
 }
-- 
2.26.3



[RFC PATCH v2 05/10] KVM: x86: lapic: don't allow to change APIC ID when apic acceleration is enabled

2022-04-20 Thread Maxim Levitsky
No normal guest has any reason to change physical APIC IDs, and
allowing this introduces bugs into APIC acceleration code.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/lapic.c | 28 +++-
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 66b0eb0bda94e..56996aeca9881 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2046,10 +2046,20 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, 
u32 reg, u32 val)
 
switch (reg) {
case APIC_ID:   /* Local APIC ID */
-   if (!apic_x2apic_mode(apic))
-   kvm_apic_set_xapic_id(apic, val >> 24);
-   else
+   if (apic_x2apic_mode(apic)) {
ret = 1;
+   break;
+   }
+   /*
+* Don't allow setting APIC ID with any APIC acceleration
+* enabled to avoid unexpected issues
+*/
+   if (enable_apicv && ((val >> 24) != apic->vcpu->vcpu_id)) {
+   kvm_vm_bugged(apic->vcpu->kvm);
+   break;
+   }
+
+   kvm_apic_set_xapic_id(apic, val >> 24);
break;
 
case APIC_TASKPRI:
@@ -2617,8 +2627,16 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
 static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
struct kvm_lapic_state *s, bool set)
 {
-   if (apic_x2apic_mode(vcpu->arch.apic)) {
-   u32 *id = (u32 *)(s->regs + APIC_ID);
+   u32 *id = (u32 *)(s->regs + APIC_ID);
+
+   if (!apic_x2apic_mode(vcpu->arch.apic)) {
+   /* Don't allow setting APIC ID with any APIC acceleration
+* enabled to avoid unexpected issues
+*/
+   if (enable_apicv && (*id >> 24) != vcpu->vcpu_id)
+   return -EINVAL;
+   } else {
+
u32 *ldr = (u32 *)(s->regs + APIC_LDR);
u64 icr;
 
-- 
2.26.3



[RFC PATCH v2 04/10] KVM: x86: mmu: tweak fast path for emulation of access to nested NPT pages

2022-04-20 Thread Maxim Levitsky
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 23f895d439cf5..b63398dfdac3b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5315,8 +5315,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t 
cr2_or_gpa, u64 error_code,
 */
if (vcpu->arch.mmu->root_role.direct &&
(error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
-   kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
-   return 1;
+   if (kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
+   return 1;
}
 
/*
-- 
2.26.3



[RFC PATCH v2 03/10] KVM: x86: mmu: add gfn_in_memslot helper

2022-04-20 Thread Maxim Levitsky
This is a tiny refactoring that makes it a bit cleaner to check
whether a GPA/GFN is within a memslot.
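
A small usage sketch of the new helper (the caching pattern and the
function name are illustrative only, not part of this patch):

static struct kvm_memory_slot *get_slot_cached(struct kvm *kvm,
					       struct kvm_memory_slot *cached,
					       gfn_t gfn)
{
	/* Reuse the cached slot when the gfn still falls inside it. */
	if (cached && gfn_in_memslot(cached, gfn))
		return cached;

	return gfn_to_memslot(kvm, gfn);
}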

Signed-off-by: Maxim Levitsky 
---
 include/linux/kvm_host.h | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 252ee4a61b58b..12e261559070b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1580,6 +1580,13 @@ int kvm_request_irq_source_id(struct kvm *kvm);
 void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
 bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
 
+
+static inline bool gfn_in_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+   return (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages);
+}
+
+
 /*
  * Returns a pointer to the memslot if it contains gfn.
  * Otherwise returns NULL.
@@ -1590,12 +1597,13 @@ try_get_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
if (!slot)
return NULL;
 
-   if (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
+   if (gfn_in_memslot(slot, gfn))
return slot;
else
return NULL;
 }
 
+
 /*
  * Returns a pointer to the memslot that contains gfn. Otherwise returns NULL.
  *
-- 
2.26.3



[RFC PATCH v2 02/10] x86: KVMGT: use kvm_page_track_write_tracking_enable

2022-04-20 Thread Maxim Levitsky
This allows enabling write tracking only when KVMGT is actually
used, so it doesn't carry any penalty otherwise.

Tested by booting a VM with a kvmgt mdev device.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/Kconfig | 3 ---
 arch/x86/kvm/mmu/mmu.c   | 2 +-
 drivers/gpu/drm/i915/Kconfig | 1 -
 drivers/gpu/drm/i915/gvt/kvmgt.c | 5 +
 4 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index e3cbd77061364..41341905d3734 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -126,7 +126,4 @@ config KVM_XEN
 
  If in doubt, say "N".
 
-config KVM_EXTERNAL_WRITE_TRACKING
-   bool
-
 endif # VIRTUALIZATION
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2c4edae4b026d..23f895d439cf5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5727,7 +5727,7 @@ int kvm_mmu_init_vm(struct kvm *kvm)
node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
kvm_page_track_register_notifier(kvm, node);
 
-   if (IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) || !tdp_enabled)
+   if (!tdp_enabled)
mmu_enable_write_tracking(kvm);
 
return 0;
diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
index 98c5450b8eacc..7d8346f4bae11 100644
--- a/drivers/gpu/drm/i915/Kconfig
+++ b/drivers/gpu/drm/i915/Kconfig
@@ -130,7 +130,6 @@ config DRM_I915_GVT_KVMGT
depends on DRM_I915_GVT
depends on KVM
depends on VFIO_MDEV
-   select KVM_EXTERNAL_WRITE_TRACKING
default n
help
  Choose this option if you want to enable KVMGT support for
diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index 057ec44901045..4c62ab3ef245d 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -1933,6 +1933,7 @@ static int kvmgt_guest_init(struct mdev_device *mdev)
struct intel_vgpu *vgpu;
struct kvmgt_vdev *vdev;
struct kvm *kvm;
+   int ret;
 
vgpu = mdev_get_drvdata(mdev);
if (handle_valid(vgpu->handle))
@@ -1948,6 +1949,10 @@ static int kvmgt_guest_init(struct mdev_device *mdev)
if (__kvmgt_vgpu_exist(vgpu, kvm))
return -EEXIST;
 
+   ret = kvm_page_track_write_tracking_enable(kvm);
+   if (ret)
+   return ret;
+
info = vzalloc(sizeof(struct kvmgt_guest_info));
if (!info)
return -ENOMEM;
-- 
2.26.3



[RFC PATCH v2 01/10] KVM: x86: mmu: allow to enable write tracking externally

2022-04-20 Thread Maxim Levitsky
This will be used to enable write tracking from the nested AVIC code,
and can also be used by the GVT-g module to enable write tracking only
when it actually needs it, as opposed to always enabling it when the
module is compiled into the kernel.

No functional change intended.
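
A minimal sketch of an external consumer, assuming the API added by this
patch (the caller name and the exact locking context are illustrative
only):

static int track_gfn_writes(struct kvm *kvm, struct kvm_memory_slot *slot,
			    gfn_t gfn)
{
	int ret = kvm_page_track_write_tracking_enable(kvm);

	if (ret)
		return ret;

	write_lock(&kvm->mmu_lock);
	kvm_slot_page_track_add_page(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE);
	write_unlock(&kvm->mmu_lock);
	return 0;
}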

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm_host.h   |  2 +-
 arch/x86/include/asm/kvm_page_track.h |  1 +
 arch/x86/kvm/mmu.h|  8 +---
 arch/x86/kvm/mmu/mmu.c| 17 ++---
 arch/x86/kvm/mmu/page_track.c | 10 --
 5 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2c20f715f0094..ae41d2df69fe9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1234,7 +1234,7 @@ struct kvm_arch {
 * is used as one input when determining whether certain memslot
 * related allocations are necessary.
 */
-   bool shadow_root_allocated;
+   bool mmu_page_tracking_enabled;
 
 #if IS_ENABLED(CONFIG_HYPERV)
hpa_t   hv_root_tdp;
diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
index eb186bc57f6a9..955a5ae07b10e 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -50,6 +50,7 @@ int kvm_page_track_init(struct kvm *kvm);
 void kvm_page_track_cleanup(struct kvm *kvm);
 
 bool kvm_page_track_write_tracking_enabled(struct kvm *kvm);
+int kvm_page_track_write_tracking_enable(struct kvm *kvm);
 int kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot);
 
 void kvm_page_track_free_memslot(struct kvm_memory_slot *slot);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 671cfeccf04e9..44d15551f7156 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -269,7 +269,7 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 int kvm_mmu_post_init_vm(struct kvm *kvm);
 void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
 
-static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
+static inline bool mmu_page_tracking_enabled(struct kvm *kvm)
 {
/*
 * Read shadow_root_allocated before related pointers. Hence, threads
@@ -277,9 +277,11 @@ static inline bool kvm_shadow_root_allocated(struct kvm 
*kvm)
 * see the pointers. Pairs with smp_store_release in
 * mmu_first_shadow_root_alloc.
 */
-   return smp_load_acquire(&kvm->arch.shadow_root_allocated);
+   return smp_load_acquire(&kvm->arch.mmu_page_tracking_enabled);
 }
 
+int mmu_enable_write_tracking(struct kvm *kvm);
+
 #ifdef CONFIG_X86_64
 static inline bool is_tdp_mmu_enabled(struct kvm *kvm) { return 
kvm->arch.tdp_mmu_enabled; }
 #else
@@ -288,7 +290,7 @@ static inline bool is_tdp_mmu_enabled(struct kvm *kvm) { 
return false; }
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
-   return !is_tdp_mmu_enabled(kvm) || kvm_shadow_root_allocated(kvm);
+   return !is_tdp_mmu_enabled(kvm) || mmu_page_tracking_enabled(kvm);
 }
 
 static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 69a30d6d1e2b9..2c4edae4b026d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3368,7 +3368,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
return r;
 }
 
-static int mmu_first_shadow_root_alloc(struct kvm *kvm)
+int mmu_enable_write_tracking(struct kvm *kvm)
 {
struct kvm_memslots *slots;
struct kvm_memory_slot *slot;
@@ -3378,21 +3378,20 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 * Check if this is the first shadow root being allocated before
 * taking the lock.
 */
-   if (kvm_shadow_root_allocated(kvm))
+   if (mmu_page_tracking_enabled(kvm))
return 0;
 
mutex_lock(&kvm->slots_arch_lock);
 
/* Recheck, under the lock, whether this is the first shadow root. */
-   if (kvm_shadow_root_allocated(kvm))
+   if (mmu_page_tracking_enabled(kvm))
goto out_unlock;
 
/*
 * Check if anything actually needs to be allocated, e.g. all metadata
 * will be allocated upfront if TDP is disabled.
 */
-   if (kvm_memslots_have_rmaps(kvm) &&
-   kvm_page_track_write_tracking_enabled(kvm))
+   if (kvm_memslots_have_rmaps(kvm) && mmu_page_tracking_enabled(kvm))
goto out_success;
 
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
@@ -3422,7 +3421,7 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 * all the related pointers are set.
 */
 out_success:
-   smp_store_release(&kvm->arch.shadow_root_allocated, true);
+   smp_store_release(&kvm->arch.mmu_page_tracking_enabled, true);
 
 out_unlock:
mutex_unlock(&kvm->slots_arch_lock);
@@ -3459,7 +3

[RFC PATCH v2 00/10] RFCv2: nested AVIC

2022-04-20 Thread Maxim Levitsky
This patch series implements everything that is needed to use AMD's
AVIC while a nested guest is running, including the ability of the
nested guest to use it, and brings feature parity with APICv.

Compared to v1 of the series, there are a lot of fixes and
refactoring.

This version still uses an unconditionally read-only APIC ID, which
will be addressed in the next version.

For the last patch, which avoids clearing the is_running bit in
physid pages for as long as possible, I measured what happens in a
worst case:

- A malicious guest runs with 2 vCPUs pinned; its first vCPU pounds
on the ICR, sending IPIs to the 2nd vCPU, while the 2nd vCPU is
scheduled out forever and not halted (something that should not
happen with QEMU though).

- A normal guest with a single vCPU is pinned to run on the same CPU
as the 2nd vCPU of the first guest.

The normal guest continued to run, but was observed to run
about 40% slower.

Therefore the AVIC doorbell is strict by default, but if the admin
policy is to pin guests and never let them share a physical CPU, the
strict doorbell can be set to false, which further improves nested
(and non-nested) AVIC performance.

Suggestions and comments are welcome.

Best regards,
    Maxim Levitsky

Maxim Levitsky (10):
  KVM: x86: mmu: allow to enable write tracking externally
  x86: KVMGT: use kvm_page_track_write_tracking_enable
  KVM: x86: mmu: add gfn_in_memslot helper
  KVM: x86: mmu: tweak fast path for emulation of access to nested NPT
pages
  KVM: x86: lapic: don't allow to change APIC ID when apic acceleration
is enabled
  KVM: x86: SVM: remove avic's broken code that updated APIC ID
  KVM: x86: SVM: move avic state to separate struct
  KVM: x86: rename .set_apic_access_page_addr to reload_apic_access_page
  KVM: nSVM: implement support for nested AVIC
  KVM: SVM: allow to avoid not needed updates to is_running

 arch/x86/include/asm/kvm-x86-ops.h|   2 +-
 arch/x86/include/asm/kvm_host.h   |   5 +-
 arch/x86/include/asm/kvm_page_track.h |   1 +
 arch/x86/kvm/Kconfig  |   3 -
 arch/x86/kvm/lapic.c  |  28 +-
 arch/x86/kvm/mmu.h|   8 +-
 arch/x86/kvm/mmu/mmu.c|  21 +-
 arch/x86/kvm/mmu/page_track.c |  10 +-
 arch/x86/kvm/svm/avic.c   | 949 --
 arch/x86/kvm/svm/nested.c | 131 +++-
 arch/x86/kvm/svm/svm.c|  31 +-
 arch/x86/kvm/svm/svm.h| 165 -
 arch/x86/kvm/trace.h  | 140 +++-
 arch/x86/kvm/vmx/vmx.c|   8 +-
 arch/x86/kvm/x86.c|  17 +-
 drivers/gpu/drm/i915/Kconfig  |   1 -
 drivers/gpu/drm/i915/gvt/kvmgt.c  |   5 +
 include/linux/kvm_host.h  |  10 +-
 18 files changed, 1413 insertions(+), 122 deletions(-)

-- 
2.26.3




[PATCH v3 11/11] KVM: SVM: allow to avoid not needed updates to is_running

2022-03-01 Thread Maxim Levitsky
Optionally allow KVM to skip updates to is_running unless they are
functionally needed, which is only when a vCPU halts or is in
guest mode.

Security-wise this means that if a vCPU is scheduled out, other
vCPUs could still send doorbell messages to the last physical CPU
where this vCPU was running.

This can in theory be considered less secure, thus the relaxed mode
is not enabled by default.

The option is avic_doorbell_strict and is true by default; setting
it to false enables the relaxed, non-strict mode.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c | 39 +++
 arch/x86/kvm/svm/svm.c  |  7 +--
 arch/x86/kvm/svm/svm.h  |  1 +
 virt/kvm/kvm_main.c |  3 ++-
 4 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index dd13fd3588e2b..1d690a9d3952e 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -166,10 +166,13 @@ void avic_physid_shadow_table_update_vcpu_location(struct 
kvm_vcpu *vcpu, int cp
raw_spin_lock_irqsave(&kvm_svm->avic.table_entries_lock, flags);
 
list_for_each_entry(e, &vcpu_svm->nested.physid_ref_entries, link) {
-   u64 sentry = READ_ONCE(*e->sentry);
+   u64 old_sentry = READ_ONCE(*e->sentry);
+   u64 new_sentry = old_sentry;
 
-   physid_entry_set_apicid(&sentry, cpu);
-   WRITE_ONCE(*e->sentry, sentry);
+   physid_entry_set_apicid(&new_sentry, cpu);
+
+   if (new_sentry != old_sentry)
+   WRITE_ONCE(*e->sentry, new_sentry);
nentries++;
}
 
@@ -1507,7 +1510,7 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, 
int cpu, bool r)
 
 void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
-   u64 entry;
+   u64 old_entry, new_entry;
/* ID = 0xff (broadcast), ID > 0xff (reserved) */
int h_physical_id = kvm_cpu_get_apicid(cpu);
struct vcpu_svm *svm = to_svm(vcpu);
@@ -1531,14 +1534,16 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (kvm_vcpu_is_blocking(vcpu))
return;
 
-   entry = READ_ONCE(*(svm->avic_physical_id_cache));
-   WARN_ON(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
+   old_entry = READ_ONCE(*(svm->avic_physical_id_cache));
+   new_entry = old_entry;
 
-   entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
-   entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
-   entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+   new_entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
+   new_entry |= (h_physical_id & 
AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
+   new_entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+
+   if (old_entry != new_entry)
+   WRITE_ONCE(*(svm->avic_physical_id_cache), new_entry);
 
-   WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
 }
 
@@ -1549,14 +1554,24 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 
lockdep_assert_preemption_disabled();
 
+   avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
+
+   /*
+* It is only meaningful to intercept IPIs from the guest
+* when either vCPU is blocked, or in guest mode.
+* In all other cases (e.g userspace vmexit, or preemption
+* by other task, the vCPU is guaranteed to return to guest mode
+* as soon as it can
+*/
+   if (!avic_doorbell_strict && !kvm_vcpu_is_blocking(vcpu) && 
!is_guest_mode(vcpu))
+   return;
+
entry = READ_ONCE(*(svm->avic_physical_id_cache));
 
/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
return;
 
-   avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
-
entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
 }
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 0d6b715375a69..463b756f665ae 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -202,6 +202,9 @@ module_param(tsc_scaling, int, 0444);
 static bool avic;
 module_param(avic, bool, 0444);
 
+bool avic_doorbell_strict = true;
+module_param(avic_doorbell_strict, bool, 0444);
+
 bool __read_mostly dump_invalid_vmcb;
 module_param(dump_invalid_vmcb, bool, 0644);
 
@@ -1340,7 +1343,8 @@ static void svm_vcpu_put(struct kvm_vcpu *vcpu)
svm->loaded = false;
 
if (svm->nested.initialized && svm->avic_enabled)
-   avic_physid_shadow_table_update_vcpu_location(vcpu, -1);
+   if (!avic_doorbell_strict || kvm_vcpu_is_blocking(vcpu))
+   avic_physid_shadow_table_update_vcpu

[PATCH v3 10/11] KVM: nSVM: implement support for nested AVIC

2022-03-01 Thread Maxim Levitsky
This implements initial support for using the AVIC in a nested guest.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/svm.h |   8 +-
 arch/x86/kvm/svm/avic.c| 640 -
 arch/x86/kvm/svm/nested.c  | 127 +++-
 arch/x86/kvm/svm/svm.c |  25 ++
 arch/x86/kvm/svm/svm.h | 133 
 arch/x86/kvm/trace.h   | 164 +-
 arch/x86/kvm/x86.c |  10 +
 7 files changed, 1096 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index bb2fb78523cee..634c0b80a9dd2 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -222,17 +222,19 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 
 
 /* AVIC */
-#define AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK   (0xFF)
+#define AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK   (0xFFULL)
 #define AVIC_LOGICAL_ID_ENTRY_VALID_BIT31
 #define AVIC_LOGICAL_ID_ENTRY_VALID_MASK   (1 << 31)
 
+/* TODO: support > 254 L1 APIC ID */
 #define AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK   (0xFFULL)
 #define AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK   (0xFFULL << 12)
 #define AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK (1ULL << 62)
 #define AVIC_PHYSICAL_ID_ENTRY_VALID_MASK  (1ULL << 63)
-#define AVIC_PHYSICAL_ID_TABLE_SIZE_MASK   (0xFF)
+#define AVIC_PHYSICAL_ID_TABLE_SIZE_MASK   (0xFFULL)
 
-#define AVIC_DOORBELL_PHYSICAL_ID_MASK (0xFF)
+/* TODO: support > 254 L1 APIC ID */
+#define AVIC_DOORBELL_PHYSICAL_ID_MASK (0xFFULL)
 
 #define AVIC_UNACCEL_ACCESS_WRITE_MASK 1
 #define AVIC_UNACCEL_ACCESS_OFFSET_MASK0xFF0
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 406cdb63646e0..dd13fd3588e2b 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -51,6 +51,423 @@ static u32 next_vm_id = 0;
 static bool next_vm_id_wrapped = 0;
 static DEFINE_SPINLOCK(svm_vm_data_hash_lock);
 
+
+static inline struct kvm_vcpu *avic_vcpu_by_l1_apicid(struct kvm *kvm,
+ int l1_apicid)
+{
+   WARN_ON(l1_apicid == -1);
+   return kvm_get_vcpu_by_id(kvm, l1_apicid);
+}
+
+static void avic_physid_shadow_entry_update_cpu(struct kvm *kvm,
+   struct avic_physid_table *t,
+   int n,
+   int l1_apicid)
+{
+   struct avic_physid_entry_descr *e = &t->entries[n];
+   u64 sentry = READ_ONCE(*e->sentry);
+   struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
+   struct kvm_vcpu *new_vcpu = NULL;
+   int l0_apicid;
+   unsigned long flags;
+
+   raw_spin_lock_irqsave(&kvm_svm->avic.table_entries_lock, flags);
+
+   if (!list_empty(&e->link))
+   list_del_init(&e->link);
+
+   if (l1_apicid != -1)
+   new_vcpu = avic_vcpu_by_l1_apicid(kvm, l1_apicid);
+
+   if (new_vcpu)
+   list_add_tail(&e->link, 
&to_svm(new_vcpu)->nested.physid_ref_entries);
+
+   /* update the shadow entry */
+   sentry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+   if (new_vcpu && to_svm(new_vcpu)->loaded) {
+   l0_apicid = kvm_cpu_get_apicid(new_vcpu->cpu);
+   physid_entry_set_apicid(&sentry, l0_apicid);
+   }
+   WRITE_ONCE(*e->sentry, sentry);
+   raw_spin_unlock_irqrestore(&kvm_svm->avic.table_entries_lock, flags);
+}
+
+static void avic_physid_shadow_entry_erase(struct kvm *kvm,
+  struct avic_physid_table *t,
+  int n)
+{
+   struct avic_physid_entry_descr *e = &t->entries[n];
+   struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
+   unsigned long old_hpa;
+   unsigned long flags;
+
+   raw_spin_lock_irqsave(&kvm_svm->avic.table_entries_lock, flags);
+
+   if (!test_and_clear_bit(n, t->valid_entires))
+   WARN_ON(1);
+
+   /* Release the old APIC backing page */
+   old_hpa = physid_entry_get_backing_table(*e->sentry);
+   kvm_release_pfn_dirty(old_hpa >> PAGE_SHIFT);
+
+   list_del_init(&e->link);
+   WRITE_ONCE(e->gentry, 0);
+   WRITE_ONCE(*e->sentry, 0);
+
+   raw_spin_unlock_irqrestore(&kvm_svm->avic.table_entries_lock, flags);
+}
+
+static void avic_physid_shadow_entry_create(struct kvm *kvm,
+   struct avic_physid_table *t,
+   int n,
+   u64 gentry)
+{
+   struct avic_physid_entry_descr *e = &t->entries[n];
+   struct page *backing_page = NULL;
+   u64 sentry = 0;
+
+   u64 backing_pag

[PATCH v3 09/11] KVM: x86: rename .set_apic_access_page_addr to reload_apic_access_page

2022-03-01 Thread Maxim Levitsky
This will be used on SVM to reload the shadow page of the AVIC physid table.

No functional change intended

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm-x86-ops.h | 2 +-
 arch/x86/include/asm/kvm_host.h| 3 +--
 arch/x86/kvm/vmx/vmx.c | 8 
 arch/x86/kvm/x86.c | 6 +++---
 4 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h 
b/arch/x86/include/asm/kvm-x86-ops.h
index eb16e32117610..6473b61d241e2 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -82,7 +82,7 @@ KVM_X86_OP_OPTIONAL(hwapic_isr_update)
 KVM_X86_OP_OPTIONAL_RET0(guest_apic_has_interrupt)
 KVM_X86_OP_OPTIONAL(load_eoi_exitmap)
 KVM_X86_OP_OPTIONAL(set_virtual_apic_mode)
-KVM_X86_OP_OPTIONAL(set_apic_access_page_addr)
+KVM_X86_OP_OPTIONAL(reload_apic_pages)
 KVM_X86_OP(deliver_interrupt)
 KVM_X86_OP_OPTIONAL(sync_pir_to_irr)
 KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 83f734e201e24..c73f8415533a6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1403,7 +1403,7 @@ struct kvm_x86_ops {
bool (*guest_apic_has_interrupt)(struct kvm_vcpu *vcpu);
void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
void (*set_virtual_apic_mode)(struct kvm_vcpu *vcpu);
-   void (*set_apic_access_page_addr)(struct kvm_vcpu *vcpu);
+   void (*reload_apic_pages)(struct kvm_vcpu *vcpu);
void (*deliver_interrupt)(struct kvm_lapic *apic, int delivery_mode,
  int trig_mode, int vector);
int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
@@ -1877,7 +1877,6 @@ int kvm_cpu_has_extint(struct kvm_vcpu *v);
 int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
 int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
 void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
-
 int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
unsigned long ipi_bitmap_high, u32 min,
unsigned long icr, int op_64_bit);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b325f99b21774..4a9a4785b55e4 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6353,7 +6353,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
vmx_update_msr_bitmap_x2apic(vcpu);
 }
 
-static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+static void vmx_reload_apic_access_page(struct kvm_vcpu *vcpu)
 {
struct page *page;
 
@@ -7778,7 +7778,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.enable_irq_window = vmx_enable_irq_window,
.update_cr8_intercept = vmx_update_cr8_intercept,
.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
-   .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
+   .reload_apic_pages = vmx_reload_apic_access_page,
.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
.load_eoi_exitmap = vmx_load_eoi_exitmap,
.apicv_post_state_restore = vmx_apicv_post_state_restore,
@@ -7942,12 +7942,12 @@ static __init int hardware_setup(void)
enable_vnmi = 0;
 
/*
-* set_apic_access_page_addr() is used to reload apic access
+* kvm_vcpu_reload_apic_pages() is used to reload apic access
 * page upon invalidation.  No need to do anything if not
 * using the APIC_ACCESS_ADDR VMCS field.
 */
if (!flexpriority_enabled)
-   vmx_x86_ops.set_apic_access_page_addr = NULL;
+   vmx_x86_ops.reload_apic_pages = NULL;
 
if (!cpu_has_vmx_tpr_shadow())
vmx_x86_ops.update_cr8_intercept = NULL;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 14b964eb079e7..1a6cfc27c3b35 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9824,12 +9824,12 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm 
*kvm,
kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
 }
 
-static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
+static void kvm_vcpu_reload_apic_pages(struct kvm_vcpu *vcpu)
 {
if (!lapic_in_kernel(vcpu))
return;
 
-   static_call_cond(kvm_x86_set_apic_access_page_addr)(vcpu);
+   static_call_cond(kvm_x86_reload_apic_pages)(vcpu);
 }
 
 void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
@@ -9945,7 +9945,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (kvm_check_request(KVM_REQ_LOAD_EOI_EXITMAP, vcpu))
vcpu_load_eoi_exitmap(vcpu);
if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
-   kvm_vcpu_reload_apic_access_page(vcpu);
+   kvm_vcpu_reload_apic_pages(vcpu);
if (kvm_check_request(KVM_REQ_HV_CRASH, vcpu)) {
vcpu->run->exit_

[PATCH v3 08/11] KVM: x86: SVM: move avic state to separate struct

2022-03-01 Thread Maxim Levitsky
This will make the code a bit easier to read when nested AVIC support
is added.

No functional change intended.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c | 49 +++--
 arch/x86/kvm/svm/svm.h  | 14 +++-
 2 files changed, 36 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 90f106d4af45e..406cdb63646e0 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -69,6 +69,8 @@ int avic_ga_log_notifier(u32 ga_tag)
unsigned long flags;
struct kvm_svm *kvm_svm;
struct kvm_vcpu *vcpu = NULL;
+   struct kvm_svm_avic *avic;
+
u32 vm_id = AVIC_GATAG_TO_VMID(ga_tag);
u32 vcpu_id = AVIC_GATAG_TO_VCPUID(ga_tag);
 
@@ -76,9 +78,13 @@ int avic_ga_log_notifier(u32 ga_tag)
trace_kvm_avic_ga_log(vm_id, vcpu_id);
 
spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
-   hash_for_each_possible(svm_vm_data_hash, kvm_svm, hnode, vm_id) {
-   if (kvm_svm->avic_vm_id != vm_id)
+   hash_for_each_possible(svm_vm_data_hash, avic, hnode, vm_id) {
+
+
+   if (avic->vm_id != vm_id)
continue;
+
+   kvm_svm = container_of(avic, struct kvm_svm, avic);
vcpu = kvm_get_vcpu_by_id(&kvm_svm->kvm, vcpu_id);
break;
}
@@ -98,18 +104,18 @@ int avic_ga_log_notifier(u32 ga_tag)
 void avic_vm_destroy(struct kvm *kvm)
 {
unsigned long flags;
-   struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
+   struct kvm_svm_avic *avic = &to_kvm_svm(kvm)->avic;
 
if (!enable_apicv)
return;
 
-   if (kvm_svm->avic_logical_id_table_page)
-   __free_page(kvm_svm->avic_logical_id_table_page);
-   if (kvm_svm->avic_physical_id_table_page)
-   __free_page(kvm_svm->avic_physical_id_table_page);
+   if (avic->logical_id_table_page)
+   __free_page(avic->logical_id_table_page);
+   if (avic->physical_id_table_page)
+   __free_page(avic->physical_id_table_page);
 
spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
-   hash_del(&kvm_svm->hnode);
+   hash_del(&avic->hnode);
spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);
 }
 
@@ -117,10 +123,9 @@ int avic_vm_init(struct kvm *kvm)
 {
unsigned long flags;
int err = -ENOMEM;
-   struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
-   struct kvm_svm *k2;
struct page *p_page;
struct page *l_page;
+   struct kvm_svm_avic *avic = &to_kvm_svm(kvm)->avic;
u32 vm_id;
 
if (!enable_apicv)
@@ -131,14 +136,14 @@ int avic_vm_init(struct kvm *kvm)
if (!p_page)
goto free_avic;
 
-   kvm_svm->avic_physical_id_table_page = p_page;
+   avic->physical_id_table_page = p_page;
 
/* Allocating logical APIC ID table (4KB) */
l_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
if (!l_page)
goto free_avic;
 
-   kvm_svm->avic_logical_id_table_page = l_page;
+   avic->logical_id_table_page = l_page;
 
spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
  again:
@@ -149,13 +154,15 @@ int avic_vm_init(struct kvm *kvm)
}
/* Is it still in use? Only possible if wrapped at least once */
if (next_vm_id_wrapped) {
-   hash_for_each_possible(svm_vm_data_hash, k2, hnode, vm_id) {
-   if (k2->avic_vm_id == vm_id)
+   struct kvm_svm_avic *avic2;
+
+   hash_for_each_possible(svm_vm_data_hash, avic2, hnode, vm_id) {
+   if (avic2->vm_id == vm_id)
goto again;
}
}
-   kvm_svm->avic_vm_id = vm_id;
-   hash_add(svm_vm_data_hash, &kvm_svm->hnode, kvm_svm->avic_vm_id);
+   avic->vm_id = vm_id;
+   hash_add(svm_vm_data_hash, &avic->hnode, avic->vm_id);
spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);
 
return 0;
@@ -170,8 +177,8 @@ void avic_init_vmcb(struct vcpu_svm *svm)
struct vmcb *vmcb = svm->vmcb;
struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm);
phys_addr_t bpa = __sme_set(page_to_phys(svm->avic_backing_page));
-   phys_addr_t lpa = 
__sme_set(page_to_phys(kvm_svm->avic_logical_id_table_page));
-   phys_addr_t ppa = 
__sme_set(page_to_phys(kvm_svm->avic_physical_id_table_page));
+   phys_addr_t lpa = 
__sme_set(page_to_phys(kvm_svm->avic.logical_id_table_page));
+   phys_addr_t ppa = 
__sme_set(page_to_phys(kvm_svm->avic.physical_id_table_page));
 
vmcb->control.avic_backing_page = bpa & AVIC_HPA_MASK;
vmcb->control.avic_logical_id = lpa & AVIC_HPA_MASK;
@@ -194,7 +201,7 @@ static u64 *avic_get_physical_id_

[PATCH v3 07/11] KVM: x86: SVM: remove avic's broken code that updated APIC ID

2022-03-01 Thread Maxim Levitsky
Now that KVM no longer allows changing the APIC ID while AVIC is
enabled, remove the buggy AVIC code that tried to handle such changes.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c | 35 ---
 1 file changed, 35 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index d5ce0868c5a74..90f106d4af45e 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -441,35 +441,6 @@ static int avic_handle_ldr_update(struct kvm_vcpu *vcpu)
return ret;
 }
 
-static int avic_handle_apic_id_update(struct kvm_vcpu *vcpu)
-{
-   u64 *old, *new;
-   struct vcpu_svm *svm = to_svm(vcpu);
-   u32 id = kvm_xapic_id(vcpu->arch.apic);
-
-   if (vcpu->vcpu_id == id)
-   return 0;
-
-   old = avic_get_physical_id_entry(vcpu, vcpu->vcpu_id);
-   new = avic_get_physical_id_entry(vcpu, id);
-   if (!new || !old)
-   return 1;
-
-   /* We need to move physical_id_entry to new offset */
-   *new = *old;
-   *old = 0ULL;
-   to_svm(vcpu)->avic_physical_id_cache = new;
-
-   /*
-* Also update the guest physical APIC ID in the logical
-* APIC ID table entry if already setup the LDR.
-*/
-   if (svm->ldr_reg)
-   avic_handle_ldr_update(vcpu);
-
-   return 0;
-}
-
 static void avic_handle_dfr_update(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
@@ -488,10 +459,6 @@ static int avic_unaccel_trap_write(struct kvm_vcpu *vcpu)
AVIC_UNACCEL_ACCESS_OFFSET_MASK;
 
switch (offset) {
-   case APIC_ID:
-   if (avic_handle_apic_id_update(vcpu))
-   return 0;
-   break;
case APIC_LDR:
if (avic_handle_ldr_update(vcpu))
return 0;
@@ -583,8 +550,6 @@ int avic_init_vcpu(struct vcpu_svm *svm)
 
 void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
-   if (avic_handle_apic_id_update(vcpu) != 0)
-   return;
avic_handle_dfr_update(vcpu);
avic_handle_ldr_update(vcpu);
 }
-- 
2.26.3



[PATCH v3 06/11] KVM: x86: lapic: don't allow to change APIC ID when apic acceleration is enabled

2022-03-01 Thread Maxim Levitsky
No normal guest has any reason to change its physical APIC IDs, and
allowing this introduces bugs into the APIC acceleration code.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/lapic.c | 28 +++-
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 80a2020c4db40..ffb5fc6449bc5 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2042,10 +2042,20 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, 
u32 reg, u32 val)
 
switch (reg) {
case APIC_ID:   /* Local APIC ID */
-   if (!apic_x2apic_mode(apic))
-   kvm_apic_set_xapic_id(apic, val >> 24);
-   else
+   if (apic_x2apic_mode(apic)) {
ret = 1;
+   break;
+   }
+   /*
+* Don't allow setting APIC ID with any APIC acceleration
+* enabled to avoid unexpected issues
+*/
+   if (enable_apicv && ((val >> 24) != apic->vcpu->vcpu_id)) {
+   kvm_vm_bugged(apic->vcpu->kvm);
+   break;
+   }
+
+   kvm_apic_set_xapic_id(apic, val >> 24);
break;
 
case APIC_TASKPRI:
@@ -2613,8 +2623,16 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
 static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
struct kvm_lapic_state *s, bool set)
 {
-   if (apic_x2apic_mode(vcpu->arch.apic)) {
-   u32 *id = (u32 *)(s->regs + APIC_ID);
+   u32 *id = (u32 *)(s->regs + APIC_ID);
+
+   if (!apic_x2apic_mode(vcpu->arch.apic)) {
+   /* Don't allow setting APIC ID with any APIC acceleration
+* enabled to avoid unexpected issues
+*/
+   if (enable_apicv && (*id >> 24) != vcpu->vcpu_id)
+   return -EINVAL;
+   } else {
+
u32 *ldr = (u32 *)(s->regs + APIC_LDR);
u64 icr;
 
-- 
2.26.3



[PATCH v3 05/11] KVM: x86: mmu: add gfn_in_memslot helper

2022-03-01 Thread Maxim Levitsky
This is a tiny refactoring that makes it a bit cleaner to check
whether a GPA/GFN falls within a memslot.
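
For illustration, a hypothetical caller (not part of this patch) could
use the new helper like this:

static bool write_hits_slot(struct kvm_memory_slot *slot, gpa_t gpa)
{
	/* gpa_to_gfn() is existing KVM code; this wrapper is made up
	 * purely to show the intended use of gfn_in_memslot(). */
	return slot && gfn_in_memslot(slot, gpa_to_gfn(gpa));
}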

Signed-off-by: Maxim Levitsky 
---
 include/linux/kvm_host.h | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f11039944c08f..c32bfe0e22b80 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1574,6 +1574,13 @@ int kvm_request_irq_source_id(struct kvm *kvm);
 void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
 bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
 
+
+static inline bool gfn_in_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+   return (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages);
+}
+
+
 /*
  * Returns a pointer to the memslot if it contains gfn.
  * Otherwise returns NULL.
@@ -1584,12 +1591,13 @@ try_get_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
if (!slot)
return NULL;
 
-   if (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
+   if (gfn_in_memslot(slot, gfn))
return slot;
else
return NULL;
 }
 
+
 /*
  * Returns a pointer to the memslot that contains gfn. Otherwise returns NULL.
  *
-- 
2.26.3



[PATCH v3 04/11] x86: KVMGT: use kvm_page_track_write_tracking_enable

2022-03-01 Thread Maxim Levitsky
This allows write tracking to be enabled only when KVMGT is actually
used, so it doesn't carry any penalty otherwise.

Tested by booting a VM with a kvmgt mdev device.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/Kconfig | 3 ---
 arch/x86/kvm/mmu/mmu.c   | 2 +-
 drivers/gpu/drm/i915/Kconfig | 1 -
 drivers/gpu/drm/i915/gvt/kvmgt.c | 5 +
 4 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index e3cbd77061364..41341905d3734 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -126,7 +126,4 @@ config KVM_XEN
 
  If in doubt, say "N".
 
-config KVM_EXTERNAL_WRITE_TRACKING
-   bool
-
 endif # VIRTUALIZATION
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0368ef3fe582e..ba98551f0026d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5692,7 +5692,7 @@ void kvm_mmu_init_vm(struct kvm *kvm)
node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
kvm_page_track_register_notifier(kvm, node);
 
-   if (IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) || !tdp_enabled)
+   if (!tdp_enabled)
mmu_enable_write_tracking(kvm);
 }
 
diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
index a4c94dc2e2164..8bea99622dd58 100644
--- a/drivers/gpu/drm/i915/Kconfig
+++ b/drivers/gpu/drm/i915/Kconfig
@@ -126,7 +126,6 @@ config DRM_I915_GVT_KVMGT
depends on DRM_I915_GVT
depends on KVM
depends on VFIO_MDEV
-   select KVM_EXTERNAL_WRITE_TRACKING
default n
help
  Choose this option if you want to enable KVMGT support for
diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index 20b82fb036f8c..64ced3c2bc550 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -1916,6 +1916,7 @@ static int kvmgt_guest_init(struct mdev_device *mdev)
struct intel_vgpu *vgpu;
struct kvmgt_vdev *vdev;
struct kvm *kvm;
+   int ret;
 
vgpu = mdev_get_drvdata(mdev);
if (handle_valid(vgpu->handle))
@@ -1931,6 +1932,10 @@ static int kvmgt_guest_init(struct mdev_device *mdev)
if (__kvmgt_vgpu_exist(vgpu, kvm))
return -EEXIST;
 
+   ret = kvm_page_track_write_tracking_enable(kvm);
+   if (ret)
+   return ret;
+
info = vzalloc(sizeof(struct kvmgt_guest_info));
if (!info)
return -ENOMEM;
-- 
2.26.3



[PATCH v3 03/11] KVM: x86: mmu: allow to enable write tracking externally

2022-03-01 Thread Maxim Levitsky
This will be used to enable write tracking from the nested AVIC code,
and can also be used by the GVT-g module to enable write tracking only
when it actually needs it, instead of unconditionally whenever the
module is compiled into the kernel.

No functional change intended.
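
For context, a rough sketch of how an external consumer (for example
the planned nested AVIC code) might use the newly exported hook; this
is hypothetical and not part of this patch:

static void track_guest_page_example(struct kvm *kvm,
				     struct kvm_page_track_notifier_node *node,
				     struct kvm_memory_slot *slot, gfn_t gfn)
{
	/* Enable write tracking lazily; non-zero means allocation failure. */
	if (kvm_page_track_write_tracking_enable(kvm))
		return;

	/* node->track_write is assumed to be set up by the caller. */
	kvm_page_track_register_notifier(kvm, node);

	write_lock(&kvm->mmu_lock);
	kvm_slot_page_track_add_page(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE);
	write_unlock(&kvm->mmu_lock);
}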

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm_host.h   |  2 +-
 arch/x86/include/asm/kvm_page_track.h |  1 +
 arch/x86/kvm/mmu.h|  8 +---
 arch/x86/kvm/mmu/mmu.c| 16 +---
 arch/x86/kvm/mmu/page_track.c | 10 --
 5 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index efe7414361de8..83f734e201e24 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1222,7 +1222,7 @@ struct kvm_arch {
 * is used as one input when determining whether certain memslot
 * related allocations are necessary.
 */
-   bool shadow_root_allocated;
+   bool mmu_page_tracking_enabled;
 
 #if IS_ENABLED(CONFIG_HYPERV)
hpa_t   hv_root_tdp;
diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
index eb186bc57f6a9..955a5ae07b10e 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -50,6 +50,7 @@ int kvm_page_track_init(struct kvm *kvm);
 void kvm_page_track_cleanup(struct kvm *kvm);
 
 bool kvm_page_track_write_tracking_enabled(struct kvm *kvm);
+int kvm_page_track_write_tracking_enable(struct kvm *kvm);
 int kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot);
 
 void kvm_page_track_free_memslot(struct kvm_memory_slot *slot);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 1d0c1904d69a3..023b192637078 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -268,7 +268,7 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 int kvm_mmu_post_init_vm(struct kvm *kvm);
 void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
 
-static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
+static inline bool mmu_page_tracking_enabled(struct kvm *kvm)
 {
/*
 * Read shadow_root_allocated before related pointers. Hence, threads
@@ -276,9 +276,11 @@ static inline bool kvm_shadow_root_allocated(struct kvm 
*kvm)
 * see the pointers. Pairs with smp_store_release in
 * mmu_first_shadow_root_alloc.
 */
-   return smp_load_acquire(&kvm->arch.shadow_root_allocated);
+   return smp_load_acquire(&kvm->arch.mmu_page_tracking_enabled);
 }
 
+int mmu_enable_write_tracking(struct kvm *kvm);
+
 #ifdef CONFIG_X86_64
 static inline bool is_tdp_mmu_enabled(struct kvm *kvm) { return 
kvm->arch.tdp_mmu_enabled; }
 #else
@@ -287,7 +289,7 @@ static inline bool is_tdp_mmu_enabled(struct kvm *kvm) { 
return false; }
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
-   return !is_tdp_mmu_enabled(kvm) || kvm_shadow_root_allocated(kvm);
+   return !is_tdp_mmu_enabled(kvm) || mmu_page_tracking_enabled(kvm);
 }
 
 static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b2c1c4eb60070..0368ef3fe582e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3365,7 +3365,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
return r;
 }
 
-static int mmu_first_shadow_root_alloc(struct kvm *kvm)
+int mmu_enable_write_tracking(struct kvm *kvm)
 {
struct kvm_memslots *slots;
struct kvm_memory_slot *slot;
@@ -3375,21 +3375,20 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 * Check if this is the first shadow root being allocated before
 * taking the lock.
 */
-   if (kvm_shadow_root_allocated(kvm))
+   if (mmu_page_tracking_enabled(kvm))
return 0;
 
mutex_lock(&kvm->slots_arch_lock);
 
/* Recheck, under the lock, whether this is the first shadow root. */
-   if (kvm_shadow_root_allocated(kvm))
+   if (mmu_page_tracking_enabled(kvm))
goto out_unlock;
 
/*
 * Check if anything actually needs to be allocated, e.g. all metadata
 * will be allocated upfront if TDP is disabled.
 */
-   if (kvm_memslots_have_rmaps(kvm) &&
-   kvm_page_track_write_tracking_enabled(kvm))
+   if (kvm_memslots_have_rmaps(kvm) && mmu_page_tracking_enabled(kvm))
goto out_success;
 
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
@@ -3419,7 +3418,7 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 * all the related pointers are set.
 */
 out_success:
-   smp_store_release(&kvm->arch.shadow_root_allocated, true);
+   smp_store_release(&kvm->arch.mmu_page_tracking_enabled, true);
 
 out_unlock:
mutex_unlock(&kvm->slots_arch_lock);
@@ -3456,7 +3

[PATCH v3 02/11] KVM: x86: SVM: allow AVIC to co-exist with a nested guest running

2022-03-01 Thread Maxim Levitsky
Inhibit the AVIC of the vCPU that is running nested for the duration of the
nested run, so that all interrupts arriving from both its vCPU siblings
and from KVM are delivered using normal IPIs and cause that vCPU to vmexit.

Note that unlike normal AVIC inhibition, there is no need to
update the AVIC mmio memslot, because the nested guest uses its
own set of paging tables.
That also means that AVIC doesn't need to be inhibited VM wide.

Note that, in theory, when a nested guest doesn't intercept physical
interrupts, we could keep using AVIC to deliver them to it, but we
don't bother doing so for now. Besides, once nested AVIC is
implemented, the nested guest will likely use it, which rules out this
optimization anyway (the real AVIC can't support both L1 and L2 at the
same time).
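
As an illustration (a sketch only, not the actual x86.c hunk of this
patch), the common code can fold the new per-vCPU condition into its
activation check roughly like this:

static bool vcpu_wants_apicv_active(struct kvm_vcpu *vcpu)
{
	/*
	 * kvm_apicv_activated() is existing KVM code; the static call is
	 * the one generated for the new optional callback. This helper
	 * itself is hypothetical and only shows the intended use.
	 */
	return kvm_apicv_activated(vcpu->kvm) &&
	       !static_call(kvm_x86_vcpu_has_apicv_inhibit_condition)(vcpu);
}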

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h|  7 ++-
 arch/x86/kvm/svm/avic.c|  6 +-
 arch/x86/kvm/svm/nested.c  | 15 ++-
 arch/x86/kvm/svm/svm.c | 31 +++---
 arch/x86/kvm/svm/svm.h |  1 +
 arch/x86/kvm/x86.c | 15 +--
 7 files changed, 56 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h 
b/arch/x86/include/asm/kvm-x86-ops.h
index 29affccb353cd..eb16e32117610 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -126,6 +126,7 @@ KVM_X86_OP_OPTIONAL(migrate_timers)
 KVM_X86_OP(msr_filter_changed)
 KVM_X86_OP(complete_emulated_msr)
 KVM_X86_OP(vcpu_deliver_sipi_vector)
+KVM_X86_OP_OPTIONAL_RET0(vcpu_has_apicv_inhibit_condition);
 
 #undef KVM_X86_OP
 #undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ccec837e520d8..efe7414361de8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1039,7 +1039,6 @@ struct kvm_x86_msr_filter {
 
 #define APICV_INHIBIT_REASON_DISABLE0
 #define APICV_INHIBIT_REASON_HYPERV 1
-#define APICV_INHIBIT_REASON_NESTED 2
 #define APICV_INHIBIT_REASON_IRQWIN 3
 #define APICV_INHIBIT_REASON_PIT_REINJ  4
 #define APICV_INHIBIT_REASON_X2APIC5
@@ -1490,6 +1489,12 @@ struct kvm_x86_ops {
int (*complete_emulated_msr)(struct kvm_vcpu *vcpu, int err);
 
void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
+
+   /*
+* Returns true if for some reason APICv (e.g guest mode)
+* must be inhibited on this vCPU
+*/
+   bool (*vcpu_has_apicv_inhibit_condition)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index aea0b13773fd3..d5ce0868c5a74 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -357,6 +357,11 @@ int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu)
return 1;
 }
 
+bool avic_has_vcpu_inhibit_condition(struct kvm_vcpu *vcpu)
+{
+   return is_guest_mode(vcpu);
+}
+
 static u32 *avic_get_logical_id_entry(struct kvm_vcpu *vcpu, u32 ldr, bool 
flat)
 {
struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
@@ -859,7 +864,6 @@ bool avic_check_apicv_inhibit_reasons(ulong bit)
ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
  BIT(APICV_INHIBIT_REASON_ABSENT) |
  BIT(APICV_INHIBIT_REASON_HYPERV) |
- BIT(APICV_INHIBIT_REASON_NESTED) |
  BIT(APICV_INHIBIT_REASON_IRQWIN) |
  BIT(APICV_INHIBIT_REASON_PIT_REINJ) |
  BIT(APICV_INHIBIT_REASON_X2APIC) |
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 62cda8ae71bbc..6dffa6c661493 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -575,11 +575,6 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm 
*svm)
 * exit_int_info, exit_int_info_err, next_rip, insn_len, insn_bytes.
 */
 
-   /*
-* Also covers avic_vapic_bar, avic_backing_page, avic_logical_id,
-* avic_physical_id.
-*/
-   WARN_ON(kvm_apicv_activated(svm->vcpu.kvm));
 
/* Copied from vmcb01.  msrpm_base can be overwritten later.  */
svm->vmcb->control.nested_ctl = svm->vmcb01.ptr->control.nested_ctl;
@@ -683,6 +678,9 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 
vmcb12_gpa,
 
svm_set_gif(svm, true);
 
+   if (kvm_vcpu_apicv_active(vcpu))
+   kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
+
return 0;
 }
 
@@ -947,6 +945,13 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
if (unlikely(svm->vmcb->save.rflags & X86_EFLAGS_TF))
kvm_queue_exception(&(svm->vcpu), DB_VECTOR);
 
+   /*
+* Un-inhibit the AVIC right away, so that other vCPUs can start
+* to benefit 

[PATCH v3 01/11] KVM: x86: SVM: move nested_npt_enabled to svm.h

2022-03-01 Thread Maxim Levitsky
It will be used in other places.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/nested.c | 5 -
 arch/x86/kvm/svm/svm.h| 9 +
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 96bab464967f2..62cda8ae71bbc 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -454,11 +454,6 @@ static void nested_save_pending_event_to_vmcb12(struct 
vcpu_svm *svm,
vmcb12->control.exit_int_info = exit_int_info;
 }
 
-static inline bool nested_npt_enabled(struct vcpu_svm *svm)
-{
-   return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
-}
-
 static void nested_svm_transition_tlb_flush(struct kvm_vcpu *vcpu)
 {
/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 70850cbe5bcb5..c8dedc4a068d2 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -509,6 +509,11 @@ void svm_complete_interrupt_delivery(struct kvm_vcpu 
*vcpu, int delivery_mode,
 #define NESTED_EXIT_DONE   1   /* Exit caused nested vmexit  */
 #define NESTED_EXIT_CONTINUE   2   /* Further checks needed  */
 
+static inline bool nested_npt_enabled(struct vcpu_svm *svm)
+{
+   return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
+}
+
 static inline bool nested_svm_virtualize_tpr(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
@@ -626,4 +631,8 @@ void sev_es_unmap_ghcb(struct vcpu_svm *svm);
 void __svm_sev_es_vcpu_run(unsigned long vmcb_pa);
 void __svm_vcpu_run(unsigned long vmcb_pa, unsigned long *regs);
 
+/* svm.c */
+#define MSR_INVALID			0xffffffffU
+
+
 #endif
-- 
2.26.3



[PATCH v3 00/11] RFC: nested AVIC

2022-03-01 Thread Maxim Levitsky
This patch series implements everything that is needed to use AMD's
AVIC while a nested guest is running, including the ability of the
nested guest to use it, and brings feature parity with APICv.

I already posted patch 2, and patch 1 is extracted from another patch
I posted today, ‘KVM: x86: nSVM: implement nested VMLOAD/VMSAVE’, to
make this series not depend on anything else.

This is an RFC. There are still corner cases that need to be fixed
with regard to locking, especially around RCU use; the locking is IMHO
a bit ugly and inefficient.

I did test this with nested guests (even 3 levels of nesting, all with
AVIC enabled), and I also did a light test with VFIO passthrough.

Suggestions and comments are welcome.

Best regards,
    Maxim Levitsky

Maxim Levitsky (11):
  KVM: x86: SVM: move nested_npt_enabled to svm.h
  KVM: x86: SVM: allow AVIC to co-exist with a nested guest running
  KVM: x86: mmu: allow to enable write tracking externally
  x86: KVMGT: use kvm_page_track_write_tracking_enable
  KVM: x86: mmu: add gfn_in_memslot helper
  KVM: x86: lapic: don't allow to change APIC ID when apic acceleration
is enabled
  KVM: x86: SVM: remove avic's broken code that updated APIC ID
  KVM: x86: SVM: move avic state to separate struct
  KVM: x86: rename .set_apic_access_page_addr to reload_apic_access_page
  KVM: nSVM: implement support for nested AVIC
  KVM: SVM: allow to avoid not needed updates to is_running

 arch/x86/include/asm/kvm-x86-ops.h|   3 +-
 arch/x86/include/asm/kvm_host.h   |  12 +-
 arch/x86/include/asm/kvm_page_track.h |   1 +
 arch/x86/include/asm/svm.h|   8 +-
 arch/x86/kvm/Kconfig  |   3 -
 arch/x86/kvm/lapic.c  |  28 +-
 arch/x86/kvm/mmu.h|   8 +-
 arch/x86/kvm/mmu/mmu.c|  16 +-
 arch/x86/kvm/mmu/page_track.c |  10 +-
 arch/x86/kvm/svm/avic.c   | 763 +++---
 arch/x86/kvm/svm/nested.c | 147 -
 arch/x86/kvm/svm/svm.c|  61 +-
 arch/x86/kvm/svm/svm.h| 158 +-
 arch/x86/kvm/trace.h  | 164 +-
 arch/x86/kvm/vmx/vmx.c|   8 +-
 arch/x86/kvm/x86.c|  31 +-
 drivers/gpu/drm/i915/Kconfig  |   1 -
 drivers/gpu/drm/i915/gvt/kvmgt.c  |   5 +
 include/linux/kvm_host.h  |  10 +-
 virt/kvm/kvm_main.c   |   3 +-
 20 files changed, 1298 insertions(+), 142 deletions(-)

-- 
2.26.3




[PATCH 30/30] KVM: x86: get rid of KVM_REQ_GET_NESTED_STATE_PAGES

2022-02-07 Thread Maxim Levitsky
As it turned out, this request isn't really needed, and it
complicates nested migration.

In theory this patch can break userspace if userspace relies on
updating KVM's memslots after setting the nested state, but there is
little reason for it to do so.

However, this behavior is undocumented and there is a good chance that
no userspace relies on it, so just try to remove this code.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm_host.h |  5 +-
 arch/x86/kvm/hyperv.c   |  4 ++
 arch/x86/kvm/svm/nested.c   | 50 -
 arch/x86/kvm/svm/svm.c  |  2 +-
 arch/x86/kvm/svm/svm.h  |  2 +-
 arch/x86/kvm/vmx/nested.c   | 99 +
 arch/x86/kvm/x86.c  |  6 --
 7 files changed, 45 insertions(+), 123 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 446ee29e6cc99..fc2d5628ad930 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -92,7 +92,6 @@
#define KVM_REQ_HV_EXIT  KVM_ARCH_REQ(21)
 #define KVM_REQ_HV_STIMER  KVM_ARCH_REQ(22)
 #define KVM_REQ_LOAD_EOI_EXITMAP   KVM_ARCH_REQ(23)
-#define KVM_REQ_GET_NESTED_STATE_PAGES KVM_ARCH_REQ(24)
 #define KVM_REQ_APICV_UPDATE \
KVM_ARCH_REQ_FLAGS(25, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_TLB_FLUSH_CURRENT  KVM_ARCH_REQ(26)
@@ -1519,12 +1518,14 @@ struct kvm_x86_nested_ops {
int (*set_state)(struct kvm_vcpu *vcpu,
 struct kvm_nested_state __user *user_kvm_nested_state,
 struct kvm_nested_state *kvm_state);
-   bool (*get_nested_state_pages)(struct kvm_vcpu *vcpu);
int (*write_log_dirty)(struct kvm_vcpu *vcpu, gpa_t l2_gpa);
 
int (*enable_evmcs)(struct kvm_vcpu *vcpu,
uint16_t *vmcs_version);
uint16_t (*get_evmcs_version)(struct kvm_vcpu *vcpu);
+
+   bool (*get_evmcs_page)(struct kvm_vcpu *vcpu);
+
 };
 
 struct kvm_x86_init_ops {
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index dac41784f2b87..d297d102c0910 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1497,6 +1497,10 @@ static int kvm_hv_set_msr(struct kvm_vcpu *vcpu, u32 
msr, u64 data, bool host)
gfn_to_gpa(gfn) | KVM_MSR_ENABLED,
sizeof(struct hv_vp_assist_page)))
return 1;
+
+   if (host && kvm_x86_ops.nested_ops->get_evmcs_page)
+   if (!kvm_x86_ops.nested_ops->get_evmcs_page(vcpu))
+   return 1;
break;
}
case HV_X64_MSR_EOI:
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index a426d4d3dcd82..ac813ad83d784 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -670,7 +670,7 @@ static void nested_svm_copy_common_state(struct vmcb 
*from_vmcb, struct vmcb *to
 }
 
 int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
-struct vmcb *vmcb12, bool from_vmrun)
+struct vmcb *vmcb12)
 {
struct vcpu_svm *svm = to_svm(vcpu);
int ret;
@@ -700,15 +700,13 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 
vmcb12_gpa,
nested_vmcb02_prepare_save(svm, vmcb12);
 
ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3,
- nested_npt_enabled(svm), from_vmrun);
+ nested_npt_enabled(svm), true);
if (ret)
return ret;
 
if (!npt_enabled)
vcpu->arch.mmu->inject_page_fault = 
svm_inject_page_fault_nested;
 
-   if (!from_vmrun)
-   kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
 
svm_set_gif(svm, true);
 
@@ -779,7 +777,7 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
 
svm->nested.nested_run_pending = 1;
 
-   if (enter_svm_guest_mode(vcpu, vmcb12_gpa, vmcb12, true))
+   if (enter_svm_guest_mode(vcpu, vmcb12_gpa, vmcb12))
goto out_exit_err;
 
if (nested_svm_vmrun_msrpm(svm))
@@ -863,8 +861,6 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
svm->nested.vmcb12_gpa = 0;
WARN_ON_ONCE(svm->nested.nested_run_pending);
 
-   kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
-
/* in case we halted in L2 */
svm->vcpu.arch.mp_state = KVM_MP_STATE_RUNNABLE;
 
@@ -1069,8 +1065,6 @@ void svm_leave_nested(struct kvm_vcpu *vcpu)
nested_svm_uninit_mmu_context(vcpu);
vmcb_mark_all_dirty(svm->vmcb);
}
-
-   kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
 }
 
 static int nested_svm_exit_handled_msr(struct vcpu_svm *svm)
@@ -1562,53 +1556,31 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
   

[PATCH 29/30] KVM: VMX: implement force_intercept_exceptions_mask

2022-02-07 Thread Maxim Levitsky
All exceptions are supported. Some bugs might remain with regard to
KVM's own interception of #PF, but since this is strictly a debug
feature, this should be OK.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/vmx/nested.c |  8 +++
 arch/x86/kvm/vmx/vmcs.h   |  6 +
 arch/x86/kvm/vmx/vmx.c| 47 +--
 3 files changed, 54 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index c73e4d938ddc3..e89b32b1d9efb 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -5902,6 +5902,14 @@ static bool nested_vmx_l0_wants_exit(struct kvm_vcpu 
*vcpu,
switch ((u16)exit_reason.basic) {
case EXIT_REASON_EXCEPTION_NMI:
intr_info = vmx_get_intr_info(vcpu);
+
+   if (is_exception(intr_info)) {
+   int ex_no = intr_info & INTR_INFO_VECTOR_MASK;
+
+   if (kvm_is_exception_force_intercepted(vcpu->kvm, 
ex_no))
+   return true;
+   }
+
if (is_nmi(intr_info))
return true;
else if (is_page_fault(intr_info))
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index e325c290a8162..d5aac5abe5cdd 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -94,6 +94,12 @@ static inline bool is_exception_n(u32 intr_info, u8 vector)
return is_intr_type_n(intr_info, INTR_TYPE_HARD_EXCEPTION, vector);
 }
 
+static inline bool is_exception(u32 intr_info)
+{
+   return is_intr_type(intr_info, INTR_TYPE_HARD_EXCEPTION) ||
+  is_intr_type(intr_info, INTR_TYPE_SOFT_EXCEPTION);
+}
+
 static inline bool is_debug(u32 intr_info)
 {
return is_exception_n(intr_info, DB_VECTOR);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fc9c4eca90a78..aec2b962707a0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -719,6 +719,7 @@ static u32 vmx_read_guest_seg_ar(struct vcpu_vmx *vmx, 
unsigned seg)
 void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu)
 {
u32 eb;
+   int exc;
 
eb = (1u << PF_VECTOR) | (1u << UD_VECTOR) | (1u << MC_VECTOR) |
 (1u << DB_VECTOR) | (1u << AC_VECTOR);
@@ -749,7 +750,8 @@ void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu)
 else {
int mask = 0, match = 0;
 
-   if (enable_ept && (eb & (1u << PF_VECTOR))) {
+   if (enable_ept && (eb & (1u << PF_VECTOR)) &&
+   !kvm_is_exception_force_intercepted(vcpu->kvm, PF_VECTOR)) {
/*
 * If EPT is enabled, #PF is currently only intercepted
 * if MAXPHYADDR is smaller on the guest than on the
@@ -772,6 +774,10 @@ void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu)
if (vcpu->arch.xfd_no_write_intercept)
eb |= (1u << NM_VECTOR);
 
+   for (exc = 0 ; exc < 32 ; ++exc)
+   if (kvm_is_exception_force_intercepted(vcpu->kvm, exc) && exc 
!= NMI_VECTOR)
+   eb |= (1u << exc);
+
vmcs_write32(EXCEPTION_BITMAP, eb);
 }
 
@@ -4867,18 +4873,23 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
 
if (!vmx->rmode.vm86_active && is_gp_fault(intr_info)) {
-   WARN_ON_ONCE(!enable_vmware_backdoor);
-
/*
 * VMware backdoor emulation on #GP interception only handles
 * IN{S}, OUT{S}, and RDPMC, none of which generate a non-zero
 * error code on #GP.
 */
-   if (error_code) {
+
+   if (enable_vmware_backdoor && !error_code)
+   return kvm_emulate_instruction(vcpu, 
EMULTYPE_VMWARE_GP);
+
+   if (!kvm_is_exception_force_intercepted(vcpu->kvm, GP_VECTOR))
+   WARN_ON_ONCE(!enable_vmware_backdoor);
+
+   if (intr_info & INTR_INFO_DELIVER_CODE_MASK)
kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
-   return 1;
-   }
-   return kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE_GP);
+   else
+   kvm_queue_exception(vcpu, GP_VECTOR);
+   return 1;
}
 
/*
@@ -4887,6 +4898,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
 * See the comments in vmx_handle_exit.
 */
if ((vect_info & VECTORING_INFO_VALID_MASK) &&
+   !kvm_is_exception_force_intercepted(vcpu->kvm, PF_VECTOR) &&
!(is_page_fault(intr_info) && !(error_code & PFERR_RSVD_MASK))) {
vcpu->run->exit_reason = KVM_EXIT_I

[PATCH 28/30] KVM: SVM: implement force_intercept_exceptions_mask

2022-02-07 Thread Maxim Levitsky
Currently #TS interception is only done once.
Also exception interception is not enabled for SEV guests.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm_host.h |  2 +
 arch/x86/include/uapi/asm/kvm.h |  1 +
 arch/x86/kvm/svm/svm.c  | 92 -
 arch/x86/kvm/svm/svm.h  |  5 +-
 arch/x86/kvm/svm/svm_onhyperv.c |  1 +
 arch/x86/kvm/x86.c  |  5 +-
 6 files changed, 101 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fa498612839a0..446ee29e6cc99 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1750,6 +1750,8 @@ int kvm_emulate_rdpmc(struct kvm_vcpu *vcpu);
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr);
 void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
 void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long 
payload);
+void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
+u32 error_code, unsigned long payload);
 void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr);
 void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 
error_code);
 void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index bf6e96011dfed..d462b4808e893 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -32,6 +32,7 @@
 #define MC_VECTOR 18
 #define XM_VECTOR 19
 #define VE_VECTOR 20
+#define CP_VECTOR 21
 
/* Select x86 specific features in <linux/kvm.h> */
 #define __KVM_HAVE_PIT
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 1cf682d1553cc..afa4116ea938c 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -245,6 +245,8 @@ static const u32 msrpm_ranges[] = {0, 0xc0000000, 0xc0010000};
 #define MSRS_RANGE_SIZE 2048
 #define MSRS_IN_RANGE (MSRS_RANGE_SIZE * 8 / 2)
 
+static int svm_handle_invalid_exit(struct kvm_vcpu *vcpu, u64 exit_code);
+
 u32 svm_msrpm_offset(u32 msr)
 {
u32 offset;
@@ -1035,6 +1037,16 @@ static void svm_recalc_instruction_intercepts(struct 
kvm_vcpu *vcpu,
}
 }
 
+static void svm_init_force_exceptions_intercepts(struct kvm_vcpu *vcpu)
+{
+   int exc;
+   struct vcpu_svm *svm = to_svm(vcpu);
+
+   for (exc = 0 ; exc < 32 ; ++exc)
+   if (kvm_is_exception_force_intercepted(vcpu->kvm, exc))
+   set_exception_intercept(svm, exc);
+}
+
 static inline void init_vmcb_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
@@ -1235,6 +1247,8 @@ static void __svm_vcpu_reset(struct kvm_vcpu *vcpu)
 
if (sev_es_guest(vcpu->kvm))
sev_es_vcpu_reset(svm);
+   else
+   svm_init_force_exceptions_intercepts(vcpu);
 }
 
 static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
@@ -1865,6 +1879,19 @@ static int pf_interception(struct kvm_vcpu *vcpu)
u64 fault_address = svm->vmcb->control.exit_info_2;
u64 error_code = svm->vmcb->control.exit_info_1;
 
+
+   if (kvm_is_exception_force_intercepted(vcpu->kvm, PF_VECTOR)) {
+   if (npt_enabled && !vcpu->arch.apf.host_apf_flags) {
+   /* If the #PF was only intercepted for debug, inject
+* it directly to the guest, since the kvm's mmu code
+* is not ready to deal with such page faults.
+*/
+   kvm_queue_exception_e_p(vcpu, PF_VECTOR,
+   error_code, fault_address);
+   return 1;
+   }
+   }
+
return kvm_handle_page_fault(vcpu, error_code, fault_address,
static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
svm->vmcb->control.insn_bytes : NULL,
@@ -1940,6 +1967,46 @@ static int ac_interception(struct kvm_vcpu *vcpu)
return 1;
 }
 
+static int gen_exc_interception(struct kvm_vcpu *vcpu)
+{
+   /*
+* Generic exception intercept handler which forwards a guest exception
+* as-is to the guest.
+* For exceptions that don't have a special intercept handler.
+*
+* Used only for 'force_intercept_exceptions_mask' KVM debug feature.
+*/
+   struct vcpu_svm *svm = to_svm(vcpu);
+   int exc = svm->vmcb->control.exit_code - SVM_EXIT_EXCP_BASE;
+
+   if (!kvm_is_exception_force_intercepted(vcpu->kvm, exc))
+   return svm_handle_invalid_exit(vcpu, 
svm->vmcb->control.exit_code);
+
+   if (x86_exception_has_error_code(exc)) {
+
+   if (exc == TS_VECTOR) {
+   /*
+* SVM doesn't provide us with an error code to be abl

[PATCH 27/30] KVM: x86: add force_intercept_exceptions_mask

2022-02-07 Thread Maxim Levitsky
This parameter will be used by the VMX and SVM code to force
interception of a set of exceptions, given by a bitmask, for guest
debug and/or KVM debug.
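
For example (illustration only): to force interception of #UD
(vector 6) and #GP (vector 13), the mask would be
(1 << 6) | (1 << 13) = 0x2040. Since the parameter lives in the kvm
module and is snapshotted when a VM is created, it can be set on the
kernel command line, e.g.:

    kvm.force_intercept_exceptions_mask=0x2040

or written to /sys/module/kvm/parameters/force_intercept_exceptions_mask
before the VM is started.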

This is based on an idea first shown here:
https://patchwork.kernel.org/project/kvm/patch/20160301192822.gd22...@pd.tnic/

CC: Borislav Petkov 
Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm_host.h | 7 +++
 arch/x86/kvm/x86.c  | 9 +
 arch/x86/kvm/x86.h  | 5 +
 3 files changed, 21 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 428ab1cc7dd34..fa498612839a0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1168,6 +1168,13 @@ struct kvm_arch {
struct kvm_pmu_event_filter __rcu *pmu_event_filter;
struct task_struct *nx_lpage_recovery_thread;
 
+   /*
+* Bitmask of exceptions that KVM will intercept
+* and forward to the guest, even if that is not needed
+* for normal operation. Debug feature.
+*/
+   u32 force_intercept_exceptions_bitmask;
+
 #ifdef CONFIG_X86_64
/*
 * Whether the TDP MMU is enabled for this VM. This contains a
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 63d84c373e465..202c34697852f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -193,6 +193,13 @@ module_param(enable_pmu, bool, 0444);
 bool __read_mostly eager_page_split = true;
 module_param(eager_page_split, bool, 0644);
 
+/*
+ * force_intercept_exceptions_mask is a writable param and its value
+ * is snapshotted when a VM is created
+ */
+static uint force_intercept_exceptions_mask;
+module_param(force_intercept_exceptions_mask, uint, S_IRUGO | S_IWUSR);
+
 /*
  * Restoring the host value for MSRs that are only consumed when running in
  * usermode, e.g. SYSCALL MSRs and TSC_AUX, can be deferred until the CPU
@@ -11646,6 +11653,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long 
type)
raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 
kvm->arch.guest_can_read_msr_platform_info = true;
+   kvm->arch.force_intercept_exceptions_bitmask = 
force_intercept_exceptions_mask;
 
 #if IS_ENABLED(CONFIG_HYPERV)
spin_lock_init(&kvm->arch.hv_root_tdp_lock);
@@ -12886,6 +12894,7 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, 
unsigned int size,
 }
 EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
 
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index e9b303b21f173..34f96f483c7e5 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -91,6 +91,11 @@ static inline bool kvm_exception_is_soft(unsigned int nr)
return (nr == BP_VECTOR) || (nr == OF_VECTOR);
 }
 
+static inline bool kvm_is_exception_force_intercepted(struct kvm *kvm, int 
exception)
+{
+   return kvm->arch.force_intercept_exceptions_bitmask & BIT(exception);
+}
+
 static inline bool is_protmode(struct kvm_vcpu *vcpu)
 {
return kvm_read_cr0_bits(vcpu, X86_CR0_PE);
-- 
2.26.3



[PATCH 26/30] KVM: x86: nSVM: implement nested vGIF

2022-02-07 Thread Maxim Levitsky
When L1 enables vGIF for L2, L2 cannot affect L1's GIF regardless of
STGI/CLGI intercepts, and since VM entry enables GIF, this means that
L1's GIF is always 1 while L2 is running.

Thus, in this case, keep L1's vGIF in vmcb01 while letting L2 control
the vGIF in vmcb02, thus implementing nested vGIF.

Also allow KVM to toggle L1's GIF during nested entry/exit by always
using vmcb01.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/nested.c | 17 +
 arch/x86/kvm/svm/svm.c|  5 +
 arch/x86/kvm/svm/svm.h| 25 +
 3 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 601d38ae05cc6..a426d4d3dcd82 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -408,6 +408,10 @@ void nested_sync_control_from_vmcb02(struct vcpu_svm *svm)
 */
mask &= ~V_IRQ_MASK;
}
+
+   if (nested_vgif_enabled(svm))
+   mask |= V_GIF_MASK;
+
svm->nested.ctl.int_ctl&= ~mask;
svm->nested.ctl.int_ctl|= svm->vmcb->control.int_ctl & mask;
 }
@@ -573,10 +577,8 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm 
*svm, struct vmcb *vmcb12
 
 static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
 {
-   const u32 int_ctl_vmcb01_bits =
-   V_INTR_MASKING_MASK | V_GIF_MASK | V_GIF_ENABLE_MASK;
-
-   const u32 int_ctl_vmcb12_bits = V_TPR_MASK | V_IRQ_INJECTION_BITS_MASK;
+   u32 int_ctl_vmcb01_bits = V_INTR_MASKING_MASK;
+   u32 int_ctl_vmcb12_bits = V_TPR_MASK | V_IRQ_INJECTION_BITS_MASK;
 
struct kvm_vcpu *vcpu = &svm->vcpu;
 
@@ -586,6 +588,13 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm 
*svm)
 */
 
 
+
+
+   if (svm->vgif_enabled && (svm->nested.ctl.int_ctl & V_GIF_ENABLE_MASK))
+   int_ctl_vmcb12_bits |= (V_GIF_MASK | V_GIF_ENABLE_MASK);
+   else
+   int_ctl_vmcb01_bits |= (V_GIF_MASK | V_GIF_ENABLE_MASK);
+
/* Copied from vmcb01.  msrpm_base can be overwritten later.  */
svm->vmcb->control.nested_ctl = svm->vmcb01.ptr->control.nested_ctl;
svm->vmcb->control.iopm_base_pa = svm->vmcb01.ptr->control.iopm_base_pa;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index e49043807ec44..1cf682d1553cc 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4008,6 +4008,8 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu 
*vcpu)
svm->pause_threshold_enabled = false;
}
 
+   svm->vgif_enabled = vgif && guest_cpuid_has(vcpu, X86_FEATURE_VGIF);
+
svm_recalc_instruction_intercepts(vcpu, svm);
 
/* For sev guests, the memory encryption bit is not reserved in CR3.  */
@@ -4823,6 +4825,9 @@ static __init void svm_set_cpu_caps(void)
if (pause_filter_thresh)
kvm_cpu_cap_set(X86_FEATURE_PFTHRESHOLD);
 
+   if (vgif)
+   kvm_cpu_cap_set(X86_FEATURE_VGIF);
+
/* Nested VM can receive #VMEXIT instead of triggering #GP */
kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
}
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 297ec57f9941c..73cc9d3e784bd 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -224,6 +224,7 @@ struct vcpu_svm {
bool v_vmload_vmsave_enabled  : 1;
bool pause_filter_enabled : 1;
bool pause_threshold_enabled  : 1;
+   bool vgif_enabled : 1;
 
u32 ldr_reg;
u32 dfr_reg;
@@ -442,31 +443,47 @@ static inline bool svm_is_intercept(struct vcpu_svm *svm, 
int bit)
return vmcb_is_intercept(&svm->vmcb->control, bit);
 }
 
+static bool nested_vgif_enabled(struct vcpu_svm *svm)
+{
+   if (!is_guest_mode(&svm->vcpu) || !svm->vgif_enabled)
+   return false;
+   return svm->nested.ctl.int_ctl & V_GIF_ENABLE_MASK;
+}
+
 static inline bool vgif_enabled(struct vcpu_svm *svm)
 {
-   return !!(svm->vmcb->control.int_ctl & V_GIF_ENABLE_MASK);
+   struct vmcb *vmcb = nested_vgif_enabled(svm) ? svm->vmcb01.ptr : 
svm->vmcb;
+
+   return !!(vmcb->control.int_ctl & V_GIF_ENABLE_MASK);
 }
 
 static inline void enable_gif(struct vcpu_svm *svm)
 {
+   struct vmcb *vmcb = nested_vgif_enabled(svm) ? svm->vmcb01.ptr : 
svm->vmcb;
+
if (vgif_enabled(svm))
-   svm->vmcb->control.int_ctl |= V_GIF_MASK;
+   vmcb->control.int_ctl |= V_GIF_MASK;
else
svm->vcpu.arch.hflags |= HF_GIF_MASK;
 }
 
 static inline void disable_gif(struct vcpu_svm *svm)
 {
+   struct vmcb *vmcb = nested_vgif_enabled(svm) ? svm->vmcb01.ptr : 
svm->vmcb;
+
if (vgif_enabled(svm)

[PATCH 25/30] KVM: x86: nSVM: support PAUSE filter threshold and count when cpu_pm=on

2022-02-07 Thread Maxim Levitsky
Allow L1 to use these settings if L0 disables PAUSE interception
(i.e. when cpu_pm=on).

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/nested.c |  6 ++
 arch/x86/kvm/svm/svm.c| 17 +
 arch/x86/kvm/svm/svm.h|  2 ++
 3 files changed, 25 insertions(+)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index bdcb23c76e89e..601d38ae05cc6 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -630,6 +630,12 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm 
*svm)
if (!nested_vmcb_needs_vls_intercept(svm))
svm->vmcb->control.virt_ext |= 
VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
 
+   if (svm->pause_filter_enabled)
+   svm->vmcb->control.pause_filter_count = 
svm->nested.ctl.pause_filter_count;
+
+   if (svm->pause_threshold_enabled)
+   svm->vmcb->control.pause_filter_thresh = 
svm->nested.ctl.pause_filter_thresh;
+
nested_svm_transition_tlb_flush(vcpu);
 
/* Enter Guest-Mode */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 0f068da098d9f..e49043807ec44 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3997,6 +3997,17 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu 
*vcpu)
 
svm->v_vmload_vmsave_enabled = vls && guest_cpuid_has(vcpu, 
X86_FEATURE_V_VMSAVE_VMLOAD);
 
+   if (kvm_pause_in_guest(vcpu->kvm)) {
+   svm->pause_filter_enabled = pause_filter_count > 0 &&
+   guest_cpuid_has(vcpu, 
X86_FEATURE_PAUSEFILTER);
+
+   svm->pause_threshold_enabled = pause_filter_thresh > 0 &&
+   guest_cpuid_has(vcpu, 
X86_FEATURE_PFTHRESHOLD);
+   } else {
+   svm->pause_filter_enabled = false;
+   svm->pause_threshold_enabled = false;
+   }
+
svm_recalc_instruction_intercepts(vcpu, svm);
 
/* For sev guests, the memory encryption bit is not reserved in CR3.  */
@@ -4806,6 +4817,12 @@ static __init void svm_set_cpu_caps(void)
if (vls)
kvm_cpu_cap_set(X86_FEATURE_V_VMSAVE_VMLOAD);
 
+   if (pause_filter_count)
+   kvm_cpu_cap_set(X86_FEATURE_PAUSEFILTER);
+
+   if (pause_filter_thresh)
+   kvm_cpu_cap_set(X86_FEATURE_PFTHRESHOLD);
+
/* Nested VM can receive #VMEXIT instead of triggering #GP */
kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
}
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index e8ffd458a5575..297ec57f9941c 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -222,6 +222,8 @@ struct vcpu_svm {
bool tsc_scaling_enabled  : 1;
bool lbrv_enabled : 1;
bool v_vmload_vmsave_enabled  : 1;
+   bool pause_filter_enabled : 1;
+   bool pause_threshold_enabled  : 1;
 
u32 ldr_reg;
u32 dfr_reg;
-- 
2.26.3



[PATCH 24/30] KVM: x86: nSVM: implement nested VMLOAD/VMSAVE

2022-02-07 Thread Maxim Levitsky
This was tested by booting L1, L2 and L3 (all Linux) and checking
that no VMLOAD/VMSAVE vmexits happened.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/nested.c | 35 +--
 arch/x86/kvm/svm/svm.c|  7 +++
 arch/x86/kvm/svm/svm.h|  8 +++-
 3 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 4a228a76b27d7..bdcb23c76e89e 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -120,6 +120,20 @@ static void nested_svm_uninit_mmu_context(struct kvm_vcpu 
*vcpu)
vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
 }
 
+static bool nested_vmcb_needs_vls_intercept(struct vcpu_svm *svm)
+{
+   if (!svm->v_vmload_vmsave_enabled)
+   return true;
+
+   if (!nested_npt_enabled(svm))
+   return true;
+
+   if (!(svm->nested.ctl.virt_ext & VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK))
+   return true;
+
+   return false;
+}
+
 void recalc_intercepts(struct vcpu_svm *svm)
 {
struct vmcb_control_area *c, *h;
@@ -161,8 +175,17 @@ void recalc_intercepts(struct vcpu_svm *svm)
if (!intercept_smi)
vmcb_clr_intercept(c, INTERCEPT_SMI);
 
-   vmcb_set_intercept(c, INTERCEPT_VMLOAD);
-   vmcb_set_intercept(c, INTERCEPT_VMSAVE);
+   if (nested_vmcb_needs_vls_intercept(svm)) {
+   /*
+* If the virtual VMLOAD/VMSAVE is not enabled for the L2,
+* we must intercept these instructions to correctly
+* emulate them in case L1 doesn't intercept them.
+*/
+   vmcb_set_intercept(c, INTERCEPT_VMLOAD);
+   vmcb_set_intercept(c, INTERCEPT_VMSAVE);
+   } else {
+   WARN_ON(!(c->virt_ext & VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK));
+   }
 }
 
 static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm)
@@ -426,10 +449,7 @@ static void nested_save_pending_event_to_vmcb12(struct 
vcpu_svm *svm,
vmcb12->control.exit_int_info = exit_int_info;
 }
 
-static inline bool nested_npt_enabled(struct vcpu_svm *svm)
-{
-   return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
-}
+
 
 static void nested_svm_transition_tlb_flush(struct kvm_vcpu *vcpu)
 {
@@ -607,6 +627,9 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm 
*svm)
svm->vmcb->control.virt_ext  |=
(svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK);
 
+   if (!nested_vmcb_needs_vls_intercept(svm))
+   svm->vmcb->control.virt_ext |= 
VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
+
nested_svm_transition_tlb_flush(vcpu);
 
/* Enter Guest-Mode */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 76aa6054d9db2..0f068da098d9f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1051,6 +1051,8 @@ static inline void init_vmcb_after_set_cpuid(struct 
kvm_vcpu *vcpu)
 
set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SYSENTER_EIP, 
0, 0);
set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SYSENTER_ESP, 
0, 0);
+
+   svm->v_vmload_vmsave_enabled = false;
} else {
/*
 * If hardware supports Virtual VMLOAD VMSAVE then enable it
@@ -3993,6 +3995,8 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu 
*vcpu)
svm->tsc_scaling_enabled = tsc_scaling && guest_cpuid_has(vcpu, 
X86_FEATURE_TSCRATEMSR);
svm->lbrv_enabled = lbrv && guest_cpuid_has(vcpu, X86_FEATURE_LBRV);
 
+   svm->v_vmload_vmsave_enabled = vls && guest_cpuid_has(vcpu, 
X86_FEATURE_V_VMSAVE_VMLOAD);
+
svm_recalc_instruction_intercepts(vcpu, svm);
 
/* For sev guests, the memory encryption bit is not reserved in CR3.  */
@@ -4799,6 +4803,9 @@ static __init void svm_set_cpu_caps(void)
if (lbrv)
kvm_cpu_cap_set(X86_FEATURE_LBRV);
 
+   if (vls)
+   kvm_cpu_cap_set(X86_FEATURE_V_VMSAVE_VMLOAD);
+
/* Nested VM can receive #VMEXIT instead of triggering #GP */
kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
}
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 0012ba5affcba..e8ffd458a5575 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -217,10 +217,11 @@ struct vcpu_svm {
unsigned int3_injected;
unsigned long int3_rip;
 
-   /* cached guest cpuid flags for faster access */
+   /* optional nested SVM features that are enabled for this guest  */
bool nrips_enabled: 1;
bool tsc_scaling_enabled  : 1;
bool lbrv_enabled : 1;
+   bool v_vmload_vmsave_enabled  : 1;
 
u32 ldr_reg;
u32 dfr_reg;
@@ -468,6 +469,11 @@ static inline bool gif_set(struct 

[PATCH 22/30] KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is running

2022-02-07 Thread Maxim Levitsky
When L2 is running without LBR virtualization, we should ensure that
L1's LBR MSRs continue to update as usual.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/nested.c | 11 +
 arch/x86/kvm/svm/svm.c| 98 +++
 arch/x86/kvm/svm/svm.h|  2 +
 3 files changed, 92 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index ac9159b0618c7..9f7bc7db08dd3 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -535,6 +535,9 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm 
*svm, struct vmcb *vmcb12
svm->vcpu.arch.dr6  = svm->nested.save.dr6 | DR6_ACTIVE_LOW;
vmcb_mark_dirty(svm->vmcb, VMCB_DR);
}
+
+   if (unlikely(svm->vmcb01.ptr->control.virt_ext & LBR_CTL_ENABLE_MASK))
+   svm_copy_lbrs(svm->vmcb01.ptr, svm->vmcb);
 }
 
 static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
@@ -587,6 +590,9 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm 
*svm)
svm->vmcb->control.event_inj   = svm->nested.ctl.event_inj;
svm->vmcb->control.event_inj_err   = svm->nested.ctl.event_inj_err;
 
+   svm->vmcb->control.virt_ext= 
svm->vmcb01.ptr->control.virt_ext &
+LBR_CTL_ENABLE_MASK;
+
nested_svm_transition_tlb_flush(vcpu);
 
/* Enter Guest-Mode */
@@ -852,6 +858,11 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 
svm_switch_vmcb(svm, &svm->vmcb01);
 
+   if (unlikely(svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK)) {
+   svm_copy_lbrs(svm->nested.vmcb02.ptr, svm->vmcb);
+   svm_update_lbrv(vcpu);
+   }
+
/*
 * On vmexit the  GIF is set to false and
 * no event can be injected in L1.
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b88ca7f07a0fc..294e016f575a8 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -805,6 +805,17 @@ static void init_msrpm_offsets(void)
}
 }
 
+void svm_copy_lbrs(struct vmcb *from_vmcb, struct vmcb *to_vmcb)
+{
+   to_vmcb->save.dbgctl= from_vmcb->save.dbgctl;
+   to_vmcb->save.br_from   = from_vmcb->save.br_from;
+   to_vmcb->save.br_to = from_vmcb->save.br_to;
+   to_vmcb->save.last_excp_from= from_vmcb->save.last_excp_from;
+   to_vmcb->save.last_excp_to  = from_vmcb->save.last_excp_to;
+
+   vmcb_mark_dirty(to_vmcb, VMCB_LBR);
+}
+
 static void svm_enable_lbrv(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
@@ -814,6 +825,10 @@ static void svm_enable_lbrv(struct kvm_vcpu *vcpu)
set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHTOIP, 1, 1);
set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTFROMIP, 1, 1);
set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTTOIP, 1, 1);
+
+   /* Move the LBR msrs to the vmcb02 so that the guest can see them. */
+   if (is_guest_mode(vcpu))
+   svm_copy_lbrs(svm->vmcb01.ptr, svm->vmcb);
 }
 
 static void svm_disable_lbrv(struct kvm_vcpu *vcpu)
@@ -825,6 +840,63 @@ static void svm_disable_lbrv(struct kvm_vcpu *vcpu)
set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHTOIP, 0, 0);
set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTFROMIP, 0, 0);
set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTTOIP, 0, 0);
+
+   /*
+* Move the LBR msrs back to the vmcb01 to avoid copying them
+* on nested guest entries.
+*/
+   if (is_guest_mode(vcpu))
+   svm_copy_lbrs(svm->vmcb, svm->vmcb01.ptr);
+}
+
+static int svm_get_lbr_msr(struct vcpu_svm *svm, u32 index)
+{
+   /*
+* If the LBR virtualization is disabled, the LBR msrs are always
+* kept in the vmcb01 to avoid copying them on nested guest entries.
+*
+* If nested, and the LBR virtualization is enabled/disabled, the msrs
+* are moved between the vmcb01 and vmcb02 as needed.
+*/
+   struct vmcb *vmcb =
+   (svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK) ?
+   svm->vmcb : svm->vmcb01.ptr;
+
+   switch (index) {
+   case MSR_IA32_DEBUGCTLMSR:
+   return vmcb->save.dbgctl;
+   case MSR_IA32_LASTBRANCHFROMIP:
+   return vmcb->save.br_from;
+   case MSR_IA32_LASTBRANCHTOIP:
+   return vmcb->save.br_to;
+   case MSR_IA32_LASTINTFROMIP:
+   return vmcb->save.last_excp_from;
+   case MSR_IA32_LASTINTTOIP:
+   return vmcb->save.last_excp_to;
+   default:
+   KVM_BUG(false, svm->vcpu.kvm,
+   "%s: Unknown MSR

[PATCH 23/30] KVM: x86: nSVM: implement nested LBR virtualization

2022-02-07 Thread Maxim Levitsky
This was tested with a kvm-unit-test that was developed for this
purpose.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/nested.c | 21 +++--
 arch/x86/kvm/svm/svm.c|  8 
 arch/x86/kvm/svm/svm.h|  1 +
 3 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 9f7bc7db08dd3..4a228a76b27d7 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -536,8 +536,19 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm 
*svm, struct vmcb *vmcb12
vmcb_mark_dirty(svm->vmcb, VMCB_DR);
}
 
-   if (unlikely(svm->vmcb01.ptr->control.virt_ext & LBR_CTL_ENABLE_MASK))
+   if (unlikely(svm->lbrv_enabled && (svm->nested.ctl.virt_ext & 
LBR_CTL_ENABLE_MASK))) {
+
+   /* Copy LBR related registers from vmcb12,
+* but make sure that we only pick LBR enable bit from the 
guest.
+*/
+
+   svm_copy_lbrs(vmcb12, svm->vmcb);
+   svm->vmcb->save.dbgctl &= LBR_CTL_ENABLE_MASK;
+   svm_update_lbrv(&svm->vcpu);
+
+   } else if (unlikely(svm->vmcb01.ptr->control.virt_ext & 
LBR_CTL_ENABLE_MASK)) {
svm_copy_lbrs(svm->vmcb01.ptr, svm->vmcb);
+   }
 }
 
 static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
@@ -592,6 +603,9 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm 
*svm)
 
svm->vmcb->control.virt_ext= 
svm->vmcb01.ptr->control.virt_ext &
 LBR_CTL_ENABLE_MASK;
+   if (svm->lbrv_enabled)
+   svm->vmcb->control.virt_ext  |=
+   (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK);
 
nested_svm_transition_tlb_flush(vcpu);
 
@@ -858,7 +872,10 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 
svm_switch_vmcb(svm, &svm->vmcb01);
 
-   if (unlikely(svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK)) {
+   if (unlikely(svm->lbrv_enabled && (svm->nested.ctl.virt_ext & 
LBR_CTL_ENABLE_MASK))) {
+   svm_copy_lbrs(svm->nested.vmcb02.ptr, vmcb12);
+   svm_update_lbrv(vcpu);
+   } else if (unlikely(svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK)) 
{
svm_copy_lbrs(svm->nested.vmcb02.ptr, svm->vmcb);
svm_update_lbrv(vcpu);
}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 294e016f575a8..76aa6054d9db2 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -890,6 +890,10 @@ void svm_update_lbrv(struct kvm_vcpu *vcpu)
bool current_enable_lbrv = !!(svm->vmcb->control.virt_ext &
  LBR_CTL_ENABLE_MASK);
 
+   if (unlikely(is_guest_mode(vcpu) && svm->lbrv_enabled))
+   if (unlikely(svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))
+   enable_lbrv = true;
+
if (enable_lbrv == current_enable_lbrv)
return;
 
@@ -3987,6 +3991,7 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu 
*vcpu)
 guest_cpuid_has(vcpu, X86_FEATURE_NRIPS);
 
svm->tsc_scaling_enabled = tsc_scaling && guest_cpuid_has(vcpu, 
X86_FEATURE_TSCRATEMSR);
+   svm->lbrv_enabled = lbrv && guest_cpuid_has(vcpu, X86_FEATURE_LBRV);
 
svm_recalc_instruction_intercepts(vcpu, svm);
 
@@ -4791,6 +4796,9 @@ static __init void svm_set_cpu_caps(void)
if (tsc_scaling)
kvm_cpu_cap_set(X86_FEATURE_TSCRATEMSR);
 
+   if (lbrv)
+   kvm_cpu_cap_set(X86_FEATURE_LBRV);
+
/* Nested VM can receive #VMEXIT instead of triggering #GP */
kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
}
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index b83e06d5d942a..0012ba5affcba 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -220,6 +220,7 @@ struct vcpu_svm {
/* cached guest cpuid flags for faster access */
bool nrips_enabled: 1;
bool tsc_scaling_enabled  : 1;
+   bool lbrv_enabled : 1;
 
u32 ldr_reg;
u32 dfr_reg;
-- 
2.26.3



[PATCH 21/30] x86: KVMGT: use kvm_page_track_write_tracking_enable

2022-02-07 Thread Maxim Levitsky
This allows write tracking to be enabled only when KVMGT is actually
used, so it doesn't carry any penalty otherwise.

Tested by booting a VM with a kvmgt mdev device.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/Kconfig | 3 ---
 arch/x86/kvm/mmu/mmu.c   | 2 +-
 drivers/gpu/drm/i915/Kconfig | 1 -
 drivers/gpu/drm/i915/gvt/kvmgt.c | 5 +
 4 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ebc8ce9ec9173..169f8833cd0d1 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -132,7 +132,4 @@ config KVM_MMU_AUDIT
 This option adds a R/W kVM module parameter 'mmu_audit', which allows
 auditing of KVM MMU events at runtime.
 
-config KVM_EXTERNAL_WRITE_TRACKING
-   bool
-
 endif # VIRTUALIZATION
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 431e02ba73690..e4e2fc8e7d7a5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5712,7 +5712,7 @@ void kvm_mmu_init_vm(struct kvm *kvm)
node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
kvm_page_track_register_notifier(kvm, node);
 
-   if (IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) || !tdp_enabled)
+   if (!tdp_enabled)
mmu_enable_write_tracking(kvm);
 }
 
diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
index 84b6fc70cbf52..bf041b26ffec3 100644
--- a/drivers/gpu/drm/i915/Kconfig
+++ b/drivers/gpu/drm/i915/Kconfig
@@ -126,7 +126,6 @@ config DRM_I915_GVT_KVMGT
depends on DRM_I915_GVT
depends on KVM
depends on VFIO_MDEV
-   select KVM_EXTERNAL_WRITE_TRACKING
default n
help
  Choose this option if you want to enable KVMGT support for
diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index 20b82fb036f8c..64ced3c2bc550 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -1916,6 +1916,7 @@ static int kvmgt_guest_init(struct mdev_device *mdev)
struct intel_vgpu *vgpu;
struct kvmgt_vdev *vdev;
struct kvm *kvm;
+   int ret;
 
vgpu = mdev_get_drvdata(mdev);
if (handle_valid(vgpu->handle))
@@ -1931,6 +1932,10 @@ static int kvmgt_guest_init(struct mdev_device *mdev)
if (__kvmgt_vgpu_exist(vgpu, kvm))
return -EEXIST;
 
+   ret = kvm_page_track_write_tracking_enable(kvm);
+   if (ret)
+   return ret;
+
info = vzalloc(sizeof(struct kvmgt_guest_info));
if (!info)
return -ENOMEM;
-- 
2.26.3



[PATCH 20/30] KVM: x86: mmu: allow to enable write tracking externally

2022-02-07 Thread Maxim Levitsky
This will be used to enable write tracking from the nested AVIC code,
and can also be used to enable write tracking in the GVT-g module
only when it actually uses it, as opposed to always enabling it
when the module is compiled into the kernel.

No functional change intended.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm_host.h   |  2 +-
 arch/x86/include/asm/kvm_page_track.h |  1 +
 arch/x86/kvm/mmu.h|  8 +---
 arch/x86/kvm/mmu/mmu.c| 16 +---
 arch/x86/kvm/mmu/page_track.c | 10 --
 5 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 256539c0481c5..428ab1cc7dd34 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1225,7 +1225,7 @@ struct kvm_arch {
 * is used as one input when determining whether certain memslot
 * related allocations are necessary.
 */
-   bool shadow_root_allocated;
+   bool mmu_page_tracking_enabled;
 
 #if IS_ENABLED(CONFIG_HYPERV)
hpa_t   hv_root_tdp;
diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
index eb186bc57f6a9..955a5ae07b10e 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -50,6 +50,7 @@ int kvm_page_track_init(struct kvm *kvm);
 void kvm_page_track_cleanup(struct kvm *kvm);
 
 bool kvm_page_track_write_tracking_enabled(struct kvm *kvm);
+int kvm_page_track_write_tracking_enable(struct kvm *kvm);
 int kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot);
 
 void kvm_page_track_free_memslot(struct kvm_memory_slot *slot);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 51faa2c76ca5f..48cc042f17466 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -267,7 +267,7 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 int kvm_mmu_post_init_vm(struct kvm *kvm);
 void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
 
-static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
+static inline bool mmu_page_tracking_enabled(struct kvm *kvm)
 {
/*
 * Read shadow_root_allocated before related pointers. Hence, threads
@@ -275,9 +275,11 @@ static inline bool kvm_shadow_root_allocated(struct kvm 
*kvm)
 * see the pointers. Pairs with smp_store_release in
 * mmu_first_shadow_root_alloc.
 */
-   return smp_load_acquire(&kvm->arch.shadow_root_allocated);
+   return smp_load_acquire(&kvm->arch.mmu_page_tracking_enabled);
 }
 
+int mmu_enable_write_tracking(struct kvm *kvm);
+
 #ifdef CONFIG_X86_64
 static inline bool is_tdp_mmu_enabled(struct kvm *kvm) { return 
kvm->arch.tdp_mmu_enabled; }
 #else
@@ -286,7 +288,7 @@ static inline bool is_tdp_mmu_enabled(struct kvm *kvm) { 
return false; }
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
-   return !is_tdp_mmu_enabled(kvm) || kvm_shadow_root_allocated(kvm);
+   return !is_tdp_mmu_enabled(kvm) || mmu_page_tracking_enabled(kvm);
 }
 
 static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fa2da6990703f..431e02ba73690 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3384,7 +3384,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
return r;
 }
 
-static int mmu_first_shadow_root_alloc(struct kvm *kvm)
+int mmu_enable_write_tracking(struct kvm *kvm)
 {
struct kvm_memslots *slots;
struct kvm_memory_slot *slot;
@@ -3394,21 +3394,20 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 * Check if this is the first shadow root being allocated before
 * taking the lock.
 */
-   if (kvm_shadow_root_allocated(kvm))
+   if (mmu_page_tracking_enabled(kvm))
return 0;
 
mutex_lock(&kvm->slots_arch_lock);
 
/* Recheck, under the lock, whether this is the first shadow root. */
-   if (kvm_shadow_root_allocated(kvm))
+   if (mmu_page_tracking_enabled(kvm))
goto out_unlock;
 
/*
 * Check if anything actually needs to be allocated, e.g. all metadata
 * will be allocated upfront if TDP is disabled.
 */
-   if (kvm_memslots_have_rmaps(kvm) &&
-   kvm_page_track_write_tracking_enabled(kvm))
+   if (kvm_memslots_have_rmaps(kvm) && mmu_page_tracking_enabled(kvm))
goto out_success;
 
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
@@ -3438,7 +3437,7 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 * all the related pointers are set.
 */
 out_success:
-   smp_store_release(&kvm->arch.shadow_root_allocated, true);
+   smp_store_release(&kvm->arch.mmu_page_tracking_enabled, true);
 
 out_unlock:
mutex_unlock(&kvm->slots_arch_lock);
@@ -3475,7 +3
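
For context, here is a minimal usage sketch (an assumption, not part of this
patch) of how an external user such as the nested AVIC code could turn write
tracking on lazily and get notified of guest writes.  The enable helper is the
one added by this patch; kvm_page_track_register_notifier() and the notifier
structure are the existing page-track API, and the callback name and body are
made up.

/* Hypothetical notifier: react to guest writes to write-tracked pages. */
static void my_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
			   int bytes, struct kvm_page_track_notifier_node *node)
{
	/* e.g. invalidate a shadowed structure that lives at @gpa */
}

static struct kvm_page_track_notifier_node my_track_node = {
	.track_write = my_track_write,
};

static int my_feature_init(struct kvm *kvm)
{
	int ret;

	/* Allocates write-tracking metadata for all memslots, if not done yet. */
	ret = kvm_page_track_write_tracking_enable(kvm);
	if (ret)
		return ret;

	kvm_page_track_register_notifier(kvm, &my_track_node);
	return 0;
}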

[PATCH 19/30] KVM: x86: mmu: add gfn_in_memslot helper

2022-02-07 Thread Maxim Levitsky
This is a tiny refactoring; it allows checking whether
a GPA/GFN is within a memslot a bit more cleanly.

Signed-off-by: Maxim Levitsky 
---
 include/linux/kvm_host.h | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b3810976a27f8..483681c6e322e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1564,6 +1564,13 @@ int kvm_request_irq_source_id(struct kvm *kvm);
 void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
 bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
 
+
+static inline bool gfn_in_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+   return (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages);
+}
+
+
 /*
  * Returns a pointer to the memslot if it contains gfn.
  * Otherwise returns NULL.
@@ -1574,12 +1581,13 @@ try_get_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
if (!slot)
return NULL;
 
-   if (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
+   if (gfn_in_memslot(slot, gfn))
return slot;
else
return NULL;
 }
 
+
 /*
  * Returns a pointer to the memslot that contains gfn. Otherwise returns NULL.
  *
-- 
2.26.3
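
As a small usage illustration (not part of the patch), the helper lets call
sites that already hold a memslot pointer express the containment check
directly; the surrounding function below is hypothetical.

/* Hypothetical call site: does a write at @gpa hit a previously cached slot? */
static bool write_hits_cached_slot(struct kvm_memory_slot *slot, gpa_t gpa)
{
	return slot && gfn_in_memslot(slot, gpa_to_gfn(gpa));
}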



[PATCH 18/30] KVM: x86: mmu: add strict mmu mode

2022-02-07 Thread Maxim Levitsky
Add a (mostly debug) option to force KVM's shadow MMU
to never have unsync pages.

This is useful in some cases for debugging it.

It is also useful for some legacy guest OSes which don't
flush TLBs correctly, and thus don't work on modern
CPUs which have speculative MMUs.

Using this option together with legacy paging (npt/ept=0)
allows correctly simulating such an old MMU while still
getting most of the benefits of virtualization.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/mmu/mmu.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 43c7abdd6b70f..fa2da6990703f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -91,6 +91,10 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_period_ms, "uint");
 static bool __read_mostly force_flush_and_sync_on_reuse;
 module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
 
+
+bool strict_mmu;
+module_param(strict_mmu, bool, 0644);
+
 /*
  * When setting this variable to true it enables Two-Dimensional-Paging
  * where the hardware walks 2 page tables:
@@ -2703,7 +2707,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct 
kvm_memory_slot *slot,
}
 
wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, 
prefetch,
-  true, host_writable, &spte);
+  !strict_mmu, host_writable, &spte);
 
if (*sptep == spte) {
ret = RET_PF_SPURIOUS;
@@ -5139,6 +5143,11 @@ static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu 
*vcpu, gpa_t *gpa,
  */
 static bool detect_write_flooding(struct kvm_mmu_page *sp)
 {
+   /*
+* When using non speculating MMU, use a bit higher threshold
+* for write flood detection
+*/
+   int threshold = strict_mmu ? 10 :  3;
/*
 * Skip write-flooding detected for the sp whose level is 1, because
 * it can become unsync, then the guest page is not write-protected.
@@ -5147,7 +5156,7 @@ static bool detect_write_flooding(struct kvm_mmu_page *sp)
return false;
 
atomic_inc(&sp->write_flooding_count);
-   return atomic_read(&sp->write_flooding_count) >= 3;
+   return atomic_read(&sp->write_flooding_count) >= threshold;
 }
 
 /*
-- 
2.26.3



[PATCH 17/30] KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set

2022-02-07 Thread Maxim Levitsky
It makes more sense to print the new SPTE value than the
old value.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 296f8723f9ae9..43c7abdd6b70f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2708,8 +2708,8 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct 
kvm_memory_slot *slot,
if (*sptep == spte) {
ret = RET_PF_SPURIOUS;
} else {
-   trace_kvm_mmu_set_spte(level, gfn, sptep);
flush |= mmu_spte_update(sptep, spte);
+   trace_kvm_mmu_set_spte(level, gfn, sptep);
}
 
if (wrprot) {
-- 
2.26.3



[PATCH 16/30] KVM: x86: SVM: allow to force AVIC to be enabled

2022-02-07 Thread Maxim Levitsky
Apparently on some systems AVIC is disabled in CPUID but still usable.

Allow the user to override the CPUID if the user is willing to
take the risk.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/svm.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 85035324ed762..b88ca7f07a0fc 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -202,6 +202,9 @@ module_param(tsc_scaling, int, 0444);
 static bool avic;
 module_param(avic, bool, 0444);
 
+static bool force_avic;
+module_param_unsafe(force_avic, bool, 0444);
+
 bool __read_mostly dump_invalid_vmcb;
 module_param(dump_invalid_vmcb, bool, 0644);
 
@@ -4839,10 +4842,14 @@ static __init int svm_hardware_setup(void)
nrips = false;
}
 
-   enable_apicv = avic = avic && npt_enabled && 
boot_cpu_has(X86_FEATURE_AVIC);
+   enable_apicv = avic = avic && npt_enabled && 
(boot_cpu_has(X86_FEATURE_AVIC) || force_avic);
 
if (enable_apicv) {
-   pr_info("AVIC enabled\n");
+   if (!boot_cpu_has(X86_FEATURE_AVIC)) {
+   pr_warn("AVIC is not supported in CPUID but force 
enabled");
+   pr_warn("Your system might crash and burn");
+   } else
+   pr_info("AVIC enabled\n");
 
amd_iommu_register_ga_log_notifier(&avic_ga_log_notifier);
} else {
-- 
2.26.3



[PATCH 15/30] KVM: x86: SVM: remove avic's broken code that updated APIC ID

2022-02-07 Thread Maxim Levitsky
Now that KVM doesn't allow changing the APIC ID while AVIC is
enabled, remove the buggy AVIC code that tried to handle such changes.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c | 35 ---
 1 file changed, 35 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 8f23e7d239097..768252b3dfee6 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -440,35 +440,6 @@ static int avic_handle_ldr_update(struct kvm_vcpu *vcpu)
return ret;
 }
 
-static int avic_handle_apic_id_update(struct kvm_vcpu *vcpu)
-{
-   u64 *old, *new;
-   struct vcpu_svm *svm = to_svm(vcpu);
-   u32 id = kvm_xapic_id(vcpu->arch.apic);
-
-   if (vcpu->vcpu_id == id)
-   return 0;
-
-   old = avic_get_physical_id_entry(vcpu, vcpu->vcpu_id);
-   new = avic_get_physical_id_entry(vcpu, id);
-   if (!new || !old)
-   return 1;
-
-   /* We need to move physical_id_entry to new offset */
-   *new = *old;
-   *old = 0ULL;
-   to_svm(vcpu)->avic_physical_id_cache = new;
-
-   /*
-* Also update the guest physical APIC ID in the logical
-* APIC ID table entry if already setup the LDR.
-*/
-   if (svm->ldr_reg)
-   avic_handle_ldr_update(vcpu);
-
-   return 0;
-}
-
 static void avic_handle_dfr_update(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
@@ -488,10 +459,6 @@ static int avic_unaccel_trap_write(struct vcpu_svm *svm)
AVIC_UNACCEL_ACCESS_OFFSET_MASK;
 
switch (offset) {
-   case APIC_ID:
-   if (avic_handle_apic_id_update(&svm->vcpu))
-   return 0;
-   break;
case APIC_LDR:
if (avic_handle_ldr_update(&svm->vcpu))
return 0;
@@ -584,8 +551,6 @@ int avic_init_vcpu(struct vcpu_svm *svm)
 
 void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
-   if (avic_handle_apic_id_update(vcpu) != 0)
-   return;
avic_handle_dfr_update(vcpu);
avic_handle_ldr_update(vcpu);
 }
-- 
2.26.3



[PATCH 14/30] KVM: x86: lapic: don't allow to change local apic id when using older x2apic api

2022-02-07 Thread Maxim Levitsky
KVM allowed setting a non-boot APIC ID by setting the APIC state
when the older, non-x2APIC 32-bit APIC ID userspace API is used.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/lapic.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 7ff695cab27b2..aeddd68d31181 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2592,15 +2592,15 @@ static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
if (enable_apicv && (*id >> 24) != vcpu->vcpu_id)
return -EINVAL;
} else {
-   if (vcpu->kvm->arch.x2apic_format) {
-   if (*id != vcpu->vcpu_id)
-   return -EINVAL;
-   } else {
-   if (set)
-   *id >>= 24;
-   else
-   *id <<= 24;
-   }
+
+   if (!vcpu->kvm->arch.x2apic_format && set)
+   *id >>= 24;
+
+   if (*id != vcpu->vcpu_id)
+   return -EINVAL;
+
+   if (!vcpu->kvm->arch.x2apic_format && !set)
+   *id <<= 24;
 
/* In x2APIC mode, the LDR is fixed and based on the id */
if (set)
-- 
2.26.3



[PATCH 13/30] KVM: x86: lapic: don't allow to change APIC ID when apic acceleration is enabled

2022-02-07 Thread Maxim Levitsky
No normal guest has any reason to change physical APIC IDs, and
allowing this introduces bugs into APIC acceleration code.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/lapic.c | 28 ++--
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index dd4e2888c244b..7ff695cab27b2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2002,10 +2002,20 @@ int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 
reg, u32 val)
 
switch (reg) {
case APIC_ID:   /* Local APIC ID */
-   if (!apic_x2apic_mode(apic))
-   kvm_apic_set_xapic_id(apic, val >> 24);
-   else
+   if (apic_x2apic_mode(apic)) {
ret = 1;
+   break;
+   }
+   /*
+* Don't allow setting APIC ID with any APIC acceleration
+* enabled to avoid unexpected issues
+*/
+   if (enable_apicv && ((val >> 24) != apic->vcpu->vcpu_id)) {
+   kvm_vm_bugged(apic->vcpu->kvm);
+   break;
+   }
+
+   kvm_apic_set_xapic_id(apic, val >> 24);
break;
 
case APIC_TASKPRI:
@@ -2572,10 +2582,16 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
 static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
struct kvm_lapic_state *s, bool set)
 {
-   if (apic_x2apic_mode(vcpu->arch.apic)) {
-   u32 *id = (u32 *)(s->regs + APIC_ID);
-   u32 *ldr = (u32 *)(s->regs + APIC_LDR);
+   u32 *id = (u32 *)(s->regs + APIC_ID);
+   u32 *ldr = (u32 *)(s->regs + APIC_LDR);
 
+   if (!apic_x2apic_mode(vcpu->arch.apic)) {
+   /* Don't allow setting APIC ID with any APIC acceleration
+* enabled to avoid unexpected issues
+*/
+   if (enable_apicv && (*id >> 24) != vcpu->vcpu_id)
+   return -EINVAL;
+   } else {
if (vcpu->kvm->arch.x2apic_format) {
if (*id != vcpu->vcpu_id)
return -EINVAL;
-- 
2.26.3



[PATCH 11/30] KVM: x86: SVM: use vmcb01 in avic_init_vmcb

2022-02-07 Thread Maxim Levitsky
Out of precaution, use vmcb01 when enabling host AVIC.

No functional change intended.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 4c2d622b3b9f0..c6072245f7fbb 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -167,7 +167,7 @@ int avic_vm_init(struct kvm *kvm)
 
 void avic_init_vmcb(struct vcpu_svm *svm)
 {
-   struct vmcb *vmcb = svm->vmcb;
+   struct vmcb *vmcb = svm->vmcb01.ptr;
struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm);
phys_addr_t bpa = __sme_set(page_to_phys(svm->avic_backing_page));
phys_addr_t lpa = 
__sme_set(page_to_phys(kvm_svm->avic_logical_id_table_page));
-- 
2.26.3



[PATCH 12/30] KVM: x86: SVM: allow AVIC to co-exist with a nested guest running

2022-02-07 Thread Maxim Levitsky
Inhibit the AVIC of the vCPU that is running nested for the duration of the
nested run, so that all interrupts arriving from both its vCPU siblings
and from KVM are delivered using normal IPIs and cause that vCPU to vmexit.

Note that unlike normal AVIC inhibition, there is no need to
update the AVIC mmio memslot, because the nested guest uses its
own set of paging tables.
That also means that AVIC doesn't need to be inhibited VM wide.

Note that, in theory, when a nested guest doesn't intercept
physical interrupts, we could continue using AVIC to deliver them
to it, but we don't bother doing so for now. Besides, when nested AVIC
is implemented, the nested guest will likely use it, which will
not allow this optimization to be used

(can't use real AVIC to support both L1 and L2 at the same time)

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h|  8 +++-
 arch/x86/kvm/svm/avic.c|  7 ++-
 arch/x86/kvm/svm/nested.c  | 15 ++-
 arch/x86/kvm/svm/svm.c | 31 +++---
 arch/x86/kvm/svm/svm.h |  1 +
 arch/x86/kvm/x86.c | 18 +++--
 7 files changed, 61 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h 
b/arch/x86/include/asm/kvm-x86-ops.h
index 9e37dc3d88636..c0d8f351dcbc0 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -125,6 +125,7 @@ KVM_X86_OP_NULL(migrate_timers)
 KVM_X86_OP(msr_filter_changed)
 KVM_X86_OP_NULL(complete_emulated_msr)
 KVM_X86_OP(vcpu_deliver_sipi_vector)
+KVM_X86_OP_NULL(vcpu_has_apicv_inhibit_condition);
 
 #undef KVM_X86_OP
 #undef KVM_X86_OP_NULL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c371ee7e45f78..256539c0481c5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1039,7 +1039,6 @@ struct kvm_x86_msr_filter {
 
 #define APICV_INHIBIT_REASON_DISABLE0
 #define APICV_INHIBIT_REASON_HYPERV 1
-#define APICV_INHIBIT_REASON_NESTED 2
 #define APICV_INHIBIT_REASON_IRQWIN 3
 #define APICV_INHIBIT_REASON_PIT_REINJ  4
 #define APICV_INHIBIT_REASON_X2APIC5
@@ -1494,6 +1493,12 @@ struct kvm_x86_ops {
int (*complete_emulated_msr)(struct kvm_vcpu *vcpu, int err);
 
void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
+
+   /*
+* Returns true if for some reason APICv (e.g guest mode)
+* must be inhibited on this vCPU
+*/
+   bool (*vcpu_has_apicv_inhibit_condition)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_x86_nested_ops {
@@ -1784,6 +1789,7 @@ gpa_t kvm_mmu_gva_to_gpa_system(struct kvm_vcpu *vcpu, 
gva_t gva,
 
 bool kvm_apicv_activated(struct kvm *kvm);
 void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu);
+bool vcpu_has_apicv_inhibit_condition(struct kvm_vcpu *vcpu);
 void kvm_request_apicv_update(struct kvm *kvm, bool activate,
  unsigned long bit);
 
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index c6072245f7fbb..8f23e7d239097 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -677,6 +677,12 @@ bool avic_dy_apicv_has_pending_interrupt(struct kvm_vcpu 
*vcpu)
return false;
 }
 
+bool avic_has_vcpu_inhibit_condition(struct kvm_vcpu *vcpu)
+{
+   return is_guest_mode(vcpu);
+}
+
+
 static void svm_ir_list_del(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
 {
unsigned long flags;
@@ -888,7 +894,6 @@ bool avic_check_apicv_inhibit_reasons(ulong bit)
ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
  BIT(APICV_INHIBIT_REASON_ABSENT) |
  BIT(APICV_INHIBIT_REASON_HYPERV) |
- BIT(APICV_INHIBIT_REASON_NESTED) |
  BIT(APICV_INHIBIT_REASON_IRQWIN) |
  BIT(APICV_INHIBIT_REASON_PIT_REINJ) |
  BIT(APICV_INHIBIT_REASON_X2APIC) |
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 39d280e7e80ef..ac9159b0618c7 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -551,11 +551,6 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm 
*svm)
 * exit_int_info, exit_int_info_err, next_rip, insn_len, insn_bytes.
 */
 
-   /*
-* Also covers avic_vapic_bar, avic_backing_page, avic_logical_id,
-* avic_physical_id.
-*/
-   WARN_ON(kvm_apicv_activated(svm->vcpu.kvm));
 
/* Copied from vmcb01.  msrpm_base can be overwritten later.  */
svm->vmcb->control.nested_ctl = svm->vmcb01.ptr->control.nested_ctl;
@@ -659,6 +654,9 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 
vmcb12_gpa,
 
svm_set_gif(svm, true);
 
+   if (kvm_vcpu_apicv_active(vcpu))
+   kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
+
retu
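
Since the x86.c hunk of this patch is truncated above, here is a sketch
(an assumption about the intent, not the actual hunk) of how the generic code
can combine the VM-wide inhibit mask with the new per-vCPU condition; the
second helper name is made up.

bool vcpu_has_apicv_inhibit_condition(struct kvm_vcpu *vcpu)
{
	/* Ask the vendor module whether this vCPU has an extra reason
	 * (on SVM: it is running a nested guest) to keep APICv off.
	 */
	return kvm_x86_ops.vcpu_has_apicv_inhibit_condition &&
	       kvm_x86_ops.vcpu_has_apicv_inhibit_condition(vcpu);
}

/* Hypothetical helper: APICv stays active on a vCPU only if no VM-wide
 * inhibit bit is set and the per-vCPU condition is false.
 */
static bool vcpu_apicv_should_be_active(struct kvm_vcpu *vcpu)
{
	return kvm_apicv_activated(vcpu->kvm) &&
	       !vcpu_has_apicv_inhibit_condition(vcpu);
}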

[PATCH 10/30] KVM: x86: SVM: fix race between interrupt delivery and AVIC inhibition

2022-02-07 Thread Maxim Levitsky
If svm_deliver_avic_intr is called just after the target vcpu's AVIC got
inhibited, it might read a stale value of vcpu->arch.apicv_active
which can lead to the target vCPU not noticing the interrupt.

To fix this use load-acquire/store-release so that, if the target vCPU
is IN_GUEST_MODE, we're guaranteed to see a previous disabling of the
AVIC.  If AVIC has been disabled in the meanwhile, proceed with the
KVM_REQ_EVENT-based delivery.

All this complicated logic is actually exactly how we can handle an
incomplete IPI vmexit; the only difference lies in who sets IRR, whether
KVM or the processor.

The incomplete IPI vmexit path also has the same races as
svm_deliver_avic_intr.
Therefore use avic_kick_target_vcpu there as well.

Co-developed-by: Paolo Bonzini 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/avic.c | 73 ++---
 arch/x86/kvm/svm/svm.c  | 65 
 arch/x86/kvm/svm/svm.h  |  3 ++
 arch/x86/kvm/x86.c  |  4 ++-
 4 files changed, 82 insertions(+), 63 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index fabfc337e1c35..4c2d622b3b9f0 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -269,6 +269,24 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
return 0;
 }
 
+
+void avic_ring_doorbell(struct kvm_vcpu *vcpu)
+{
+   /*
+* Note, the vCPU could get migrated to a different pCPU at any
+* point, which could result in signalling the wrong/previous
+* pCPU.  But if that happens the vCPU is guaranteed to do a
+* VMRUN (after being migrated) and thus will process pending
+* interrupts, i.e. a doorbell is not needed (and the spurious
+* one is harmless).
+*/
+   int cpu = READ_ONCE(vcpu->cpu);
+
+   if (cpu != get_cpu())
+   wrmsrl(MSR_AMD64_SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
+   put_cpu();
+}
+
 static void avic_kick_target_vcpus(struct kvm *kvm, struct kvm_lapic *source,
   u32 icrl, u32 icrh)
 {
@@ -284,8 +302,13 @@ static void avic_kick_target_vcpus(struct kvm *kvm, struct 
kvm_lapic *source,
kvm_for_each_vcpu(i, vcpu, kvm) {
if (kvm_apic_match_dest(vcpu, source, icrl & APIC_SHORT_MASK,
GET_APIC_DEST_FIELD(icrh),
-   icrl & APIC_DEST_MASK))
-   kvm_vcpu_wake_up(vcpu);
+   icrl & APIC_DEST_MASK)) {
+   vcpu->arch.apic->irr_pending = true;
+   svm_complete_interrupt_delivery(vcpu,
+   icrl & APIC_MODE_MASK,
+   icrl & 
APIC_INT_LEVELTRIG,
+   icrl & 
APIC_VECTOR_MASK);
+   }
}
 }
 
@@ -649,52 +672,6 @@ void avic_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 
*eoi_exit_bitmap)
return;
 }
 
-int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
-{
-   if (!vcpu->arch.apicv_active)
-   return -1;
-
-   kvm_lapic_set_irr(vec, vcpu->arch.apic);
-
-   /*
-* Pairs with the smp_mb_*() after setting vcpu->guest_mode in
-* vcpu_enter_guest() to ensure the write to the vIRR is ordered before
-* the read of guest_mode, which guarantees that either VMRUN will see
-* and process the new vIRR entry, or that the below code will signal
-* the doorbell if the vCPU is already running in the guest.
-*/
-   smp_mb__after_atomic();
-
-   /*
-* Signal the doorbell to tell hardware to inject the IRQ if the vCPU
-* is in the guest.  If the vCPU is not in the guest, hardware will
-* automatically process AVIC interrupts at VMRUN.
-*/
-   if (vcpu->mode == IN_GUEST_MODE) {
-   int cpu = READ_ONCE(vcpu->cpu);
-
-   /*
-* Note, the vCPU could get migrated to a different pCPU at any
-* point, which could result in signalling the wrong/previous
-* pCPU.  But if that happens the vCPU is guaranteed to do a
-* VMRUN (after being migrated) and thus will process pending
-* interrupts, i.e. a doorbell is not needed (and the spurious
-* one is harmless).
-*/
-   if (cpu != get_cpu())
-   wrmsrl(MSR_AMD64_SVM_AVIC_DOORBELL, 
kvm_cpu_get_apicid(cpu));
-   put_cpu();
-   } else {
-   /*
-* Wake the vCPU if it was blocking.  KVM will then detect the
-* pending IRQ when checking if the vCPU has a wake event.
-*/
-   kvm
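
A condensed illustration of the race-free delivery described above (simplified,
not the exact code from the patch): the vIRR bit is published first, a full
barrier pairs with the one after vcpu->mode is set in vcpu_enter_guest(), and
only then are the AVIC state and the vCPU mode sampled.

static void illustrate_avic_delivery(struct kvm_vcpu *vcpu, int vec)
{
	/* Publish the interrupt in the vIRR first. */
	kvm_lapic_set_irr(vec, vcpu->arch.apic);

	/*
	 * Pairs with the barrier after vcpu->mode = IN_GUEST_MODE in
	 * vcpu_enter_guest(): either VMRUN sees the new vIRR bit, or we
	 * see IN_GUEST_MODE below and ring the doorbell.
	 */
	smp_mb__after_atomic();

	if (!READ_ONCE(vcpu->arch.apicv_active)) {
		/* AVIC was inhibited concurrently: fall back to KVM_REQ_EVENT. */
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		kvm_vcpu_kick(vcpu);
		return;
	}

	if (READ_ONCE(vcpu->mode) == IN_GUEST_MODE)
		avic_ring_doorbell(vcpu);	/* hardware will consume the vIRR */
	else
		kvm_vcpu_wake_up(vcpu);		/* the vCPU may be blocking */
}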

[PATCH 09/30] KVM: x86: SVM: move avic definitions from AMD's spec to svm.h

2022-02-07 Thread Maxim Levitsky
asm/svm.h is the correct place for all values that are defined in
the SVM spec, and that includes AVIC.

Also add some values from the spec that were not defined before
and will soon be useful.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/msr-index.h |  1 +
 arch/x86/include/asm/svm.h   | 36 
 arch/x86/kvm/svm/avic.c  | 22 +--
 arch/x86/kvm/svm/svm.h   | 11 --
 4 files changed, 38 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 01e2650b95859..552ff8a5ea023 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -476,6 +476,7 @@
 #define MSR_AMD64_ICIBSEXTDCTL 0xc001103c
 #define MSR_AMD64_IBSOPDATA4   0xc001103d
 #define MSR_AMD64_IBS_REG_COUNT_MAX8 /* includes MSR_AMD64_IBSBRTARGET */
+#define MSR_AMD64_SVM_AVIC_DOORBELL0xc001011b
 #define MSR_AMD64_VM_PAGE_FLUSH0xc001011e
 #define MSR_AMD64_SEV_ES_GHCB  0xc0010130
 #define MSR_AMD64_SEV  0xc0010131
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index b00dbc5fac2b2..bb2fb78523cee 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -220,6 +220,42 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 #define SVM_NESTED_CTL_SEV_ENABLE  BIT(1)
 #define SVM_NESTED_CTL_SEV_ES_ENABLE   BIT(2)
 
+
+/* AVIC */
+#define AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK   (0xFF)
+#define AVIC_LOGICAL_ID_ENTRY_VALID_BIT31
+#define AVIC_LOGICAL_ID_ENTRY_VALID_MASK   (1 << 31)
+
+#define AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK   (0xFFULL)
+#define AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK   (0xFFULL << 12)
+#define AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK (1ULL << 62)
+#define AVIC_PHYSICAL_ID_ENTRY_VALID_MASK  (1ULL << 63)
+#define AVIC_PHYSICAL_ID_TABLE_SIZE_MASK   (0xFF)
+
+#define AVIC_DOORBELL_PHYSICAL_ID_MASK (0xFF)
+
+#define AVIC_UNACCEL_ACCESS_WRITE_MASK 1
+#define AVIC_UNACCEL_ACCESS_OFFSET_MASK0xFF0
+#define AVIC_UNACCEL_ACCESS_VECTOR_MASK0x
+
+enum avic_ipi_failure_cause {
+   AVIC_IPI_FAILURE_INVALID_INT_TYPE,
+   AVIC_IPI_FAILURE_TARGET_NOT_RUNNING,
+   AVIC_IPI_FAILURE_INVALID_TARGET,
+   AVIC_IPI_FAILURE_INVALID_BACKING_PAGE,
+};
+
+
+/*
+ * 0xff is broadcast, so the max index allowed for physical APIC ID
+ * table is 0xfe.  APIC IDs above 0xff are reserved.
+ */
+#define AVIC_MAX_PHYSICAL_ID_COUNT 0xff
+
+#define AVIC_HPA_MASK  ~((0xFFFULL << 52) | 0xFFF)
+#define VMCB_AVIC_APIC_BAR_MASK0xFF000ULL
+
+
 struct vmcb_seg {
u16 selector;
u16 attrib;
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 99f907ec5aa8f..fabfc337e1c35 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -27,20 +27,6 @@
 #include "irq.h"
 #include "svm.h"
 
-#define SVM_AVIC_DOORBELL  0xc001011b
-
-#define AVIC_HPA_MASK  ~((0xFFFULL << 52) | 0xFFF)
-
-/*
- * 0xff is broadcast, so the max index allowed for physical APIC ID
- * table is 0xfe.  APIC IDs above 0xff are reserved.
- */
-#define AVIC_MAX_PHYSICAL_ID_COUNT 255
-
-#define AVIC_UNACCEL_ACCESS_WRITE_MASK 1
-#define AVIC_UNACCEL_ACCESS_OFFSET_MASK0xFF0
-#define AVIC_UNACCEL_ACCESS_VECTOR_MASK0x
-
 /* AVIC GATAG is encoded using VM and VCPU IDs */
 #define AVIC_VCPU_ID_BITS  8
 #define AVIC_VCPU_ID_MASK  ((1 << AVIC_VCPU_ID_BITS) - 1)
@@ -73,12 +59,6 @@ struct amd_svm_iommu_ir {
void *data; /* Storing pointer to struct amd_ir_data */
 };
 
-enum avic_ipi_failure_cause {
-   AVIC_IPI_FAILURE_INVALID_INT_TYPE,
-   AVIC_IPI_FAILURE_TARGET_NOT_RUNNING,
-   AVIC_IPI_FAILURE_INVALID_TARGET,
-   AVIC_IPI_FAILURE_INVALID_BACKING_PAGE,
-};
 
 /* Note:
  * This function is called from IOMMU driver to notify
@@ -702,7 +682,7 @@ int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
 * one is harmless).
 */
if (cpu != get_cpu())
-   wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
+   wrmsrl(MSR_AMD64_SVM_AVIC_DOORBELL, 
kvm_cpu_get_apicid(cpu));
put_cpu();
} else {
/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 852b12aee03d7..6343558982c73 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -555,17 +555,6 @@ extern struct kvm_x86_nested_ops svm_nested_ops;
 
 /* avic.c */
 
-#define AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK   (0xFF)
-#define AVIC_LOGICAL_ID_ENTRY_VALID_BIT31
-#define AVIC_

[PATCH 08/30] KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it

2022-02-07 Thread Maxim Levitsky
kvm_apic_update_apicv is called while AVIC is still active, thus IRR bits
can be set by the CPU after it is called without causing irr_pending
to be set to true.

Also, the logic in avic_kick_target_vcpu doesn't expect a race with this
function, so to keep things simple, just keep irr_pending set to true and
let the next interrupt injection into the guest clear it.


Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/lapic.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 0da7d0960fcb5..dd4e2888c244b 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2307,7 +2307,12 @@ void kvm_apic_update_apicv(struct kvm_vcpu *vcpu)
apic->irr_pending = true;
apic->isr_count = 1;
} else {
-   apic->irr_pending = (apic_search_irr(apic) != -1);
+   /*
+* Don't clear irr_pending, searching the IRR can race with
+* updates from the CPU as APICv is still active from hardware's
+* perspective.  The flag will be cleared as appropriate when
+* KVM injects the interrupt.
+*/
apic->isr_count = count_vectors(apic->regs + APIC_ISR);
}
 }
-- 
2.26.3



[PATCH 07/30] KVM: x86: nSVM: deal with L1 hypervisor that intercepts interrupts but lets L2 control them

2022-02-07 Thread Maxim Levitsky
Fix a corner case in which the L1 hypervisor intercepts
interrupts (INTERCEPT_INTR) and either doesn't set
virtual interrupt masking (V_INTR_MASKING) or enters a
nested guest with EFLAGS.IF disabled prior to the entry.

In this case, despite the fact that L1 intercepts the interrupts,
KVM still needs to set up an interrupt window to wait before
injecting the INTR vmexit.

Currently, KVM instead enters an endless loop of 'req_immediate_exit'.

Exactly the same issue also happens for SMIs and NMIs; fix those as well.

Note that on VMX this case is impossible, as there is only the
'vmexit on external interrupts' execution control, which is either set,
in which case both the host's and the guest's EFLAGS.IF are ignored,
or not set, in which case no VM exits are delivered.


Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/svm.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9a4e299ed5673..22e614008cf59 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3372,11 +3372,13 @@ static int svm_nmi_allowed(struct kvm_vcpu *vcpu, bool 
for_injection)
if (svm->nested.nested_run_pending)
return -EBUSY;
 
+   if (svm_nmi_blocked(vcpu))
+   return 0;
+
/* An NMI must not be injected into L2 if it's supposed to VM-Exit.  */
if (for_injection && is_guest_mode(vcpu) && nested_exit_on_nmi(svm))
return -EBUSY;
-
-   return !svm_nmi_blocked(vcpu);
+   return 1;
 }
 
 static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu)
@@ -3428,9 +3430,13 @@ bool svm_interrupt_blocked(struct kvm_vcpu *vcpu)
 static int svm_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
struct vcpu_svm *svm = to_svm(vcpu);
+
if (svm->nested.nested_run_pending)
return -EBUSY;
 
+   if (svm_interrupt_blocked(vcpu))
+   return 0;
+
/*
 * An IRQ must not be injected into L2 if it's supposed to VM-Exit,
 * e.g. if the IRQ arrived asynchronously after checking nested events.
@@ -3438,7 +3444,7 @@ static int svm_interrupt_allowed(struct kvm_vcpu *vcpu, 
bool for_injection)
if (for_injection && is_guest_mode(vcpu) && nested_exit_on_intr(svm))
return -EBUSY;
 
-   return !svm_interrupt_blocked(vcpu);
+   return 1;
 }
 
 static void svm_enable_irq_window(struct kvm_vcpu *vcpu)
@@ -4169,11 +4175,14 @@ static int svm_smi_allowed(struct kvm_vcpu *vcpu, bool 
for_injection)
if (svm->nested.nested_run_pending)
return -EBUSY;
 
+   if (svm_smi_blocked(vcpu))
+   return 0;
+
/* An SMI must not be injected into L2 if it's supposed to VM-Exit.  */
if (for_injection && is_guest_mode(vcpu) && nested_exit_on_smi(svm))
return -EBUSY;
 
-   return !svm_smi_blocked(vcpu);
+   return 1;
 }
 
 static int svm_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
-- 
2.26.3
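
For readers of the diff above, an illustrative restatement (not the patch
itself) of the return contract the fix relies on, using the interrupt case;
the function name is made up.

static int example_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	if (svm->nested.nested_run_pending)
		return -EBUSY;		/* finish the pending nested VMRUN first */

	if (svm_interrupt_blocked(vcpu))
		return 0;		/* blocked: open an interrupt window and wait */

	/* Only an unblocked event that must become a nested VM exit is busy. */
	if (for_injection && is_guest_mode(vcpu) && nested_exit_on_intr(svm))
		return -EBUSY;

	return 1;			/* the interrupt can be injected now */
}

Checking "blocked" before the for_injection case is what breaks the endless
'req_immediate_exit' loop described above.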



[PATCH 06/30] KVM: x86: mark synthetic SMM vmexit as SVM_EXIT_SW

2022-02-07 Thread Maxim Levitsky
Use a dummy unused vmexit reason to mark the 'VM exit'
that is happening when we exit to handle SMM,
which is not a real VM exit.

This makes it a bit easier to read the KVM trace,
and avoids other potential problems.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/svm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 8013be9edf27c..9a4e299ed5673 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4194,7 +4194,7 @@ static int svm_enter_smm(struct kvm_vcpu *vcpu, char 
*smstate)
svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
 
-   ret = nested_svm_vmexit(svm);
+   ret = nested_svm_simple_vmexit(svm, SVM_EXIT_SW);
if (ret)
return ret;
 
-- 
2.26.3



[PATCH 05/30] KVM: x86: nSVM: expose clean bit support to the guest

2022-02-07 Thread Maxim Levitsky
KVM already honours a few clean bits, thus it makes sense
to let the nested guest know about it.

Note that KVM also doesn't check if the hardware supports
clean bits, and therefore nested KVM was
already setting clean bits and L0 KVM
was already honouring them.


Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/svm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 71bfa52121622..8013be9edf27c 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4663,6 +4663,7 @@ static __init void svm_set_cpu_caps(void)
/* CPUID 0x8001 and 0x800A (SVM features) */
if (nested) {
kvm_cpu_cap_set(X86_FEATURE_SVM);
+   kvm_cpu_cap_set(X86_FEATURE_VMCBCLEAN);
 
if (nrips)
kvm_cpu_cap_set(X86_FEATURE_NRIPS);
-- 
2.26.3



[PATCH 04/30] KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM

2022-02-07 Thread Maxim Levitsky
While RSM-induced VM entries are not full VM entries,
they still need to be followed by an actual VM entry to complete them,
unlike setting the nested state.

This patch fixes the boot of a Hyper-V and SMM enabled
Windows VM running nested on KVM, which fails due
to this issue combined with the lack of dirty bit setting.

Signed-off-by: Maxim Levitsky 
Cc: sta...@vger.kernel.org
---
 arch/x86/kvm/svm/svm.c | 5 +
 arch/x86/kvm/vmx/vmx.c | 1 +
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 3f1d11e652123..71bfa52121622 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4274,6 +4274,11 @@ static int svm_leave_smm(struct kvm_vcpu *vcpu, const 
char *smstate)
nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
ret = enter_svm_guest_mode(vcpu, vmcb12_gpa, vmcb12, false);
 
+   if (ret)
+   goto unmap_save;
+
+   svm->nested.nested_run_pending = 1;
+
 unmap_save:
kvm_vcpu_unmap(vcpu, &map_save, true);
 unmap_map:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8ac5a6fa77203..fc9c4eca90a78 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7659,6 +7659,7 @@ static int vmx_leave_smm(struct kvm_vcpu *vcpu, const 
char *smstate)
if (ret)
return ret;
 
+   vmx->nested.nested_run_pending = 1;
vmx->nested.smm.guest_mode = false;
}
return 0;
-- 
2.26.3



[PATCH 03/30] KVM: x86: nSVM: mark vmcb01 as dirty when restoring SMM saved state

2022-02-07 Thread Maxim Levitsky
While usually restoring the SMM state makes KVM enter
the nested guest, and thus a different vmcb (vmcb02 vs vmcb01),
KVM should still mark vmcb01 as dirty, since hardware
can in theory cache multiple vmcbs.

Failure to do so, combined with the lack of setting
nested_run_pending (which is fixed in the next patch),
might make KVM re-enter vmcb01, which was just exited from,
with a completely different set of guest state registers
(SMM vs non-SMM) and without the proper dirty bits set,
which results in the CPU reusing a stale IDTR pointer,
leading to a guest shutdown on any interrupt.

On real hardware this usually doesn't happen,
but when running nested, L0's KVM does check and
honour a few dirty bits, causing this issue to appear.

This patch fixes the boot of a Hyper-V and SMM enabled
Windows VM running nested on KVM.

Signed-off-by: Maxim Levitsky 
Cc: sta...@vger.kernel.org
---
 arch/x86/kvm/svm/svm.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 995c203a62fd9..3f1d11e652123 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4267,6 +4267,8 @@ static int svm_leave_smm(struct kvm_vcpu *vcpu, const 
char *smstate)
 * Enter the nested guest now
 */
 
+   vmcb_mark_all_dirty(svm->vmcb01.ptr);
+
vmcb12 = map.hva;
nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
-- 
2.26.3



[PATCH 02/30] KVM: x86: nSVM: fix potential NULL dereference on nested migration

2022-02-07 Thread Maxim Levitsky
It turns out that, due to review feedback and/or rebases,
I accidentally moved the call to nested_svm_load_cr3 too early,
before NPT is enabled, which is very wrong to do.

KVM can't even access guest memory at that point as nested NPT
is needed for that, and of course it won't initialize walk_mmu,
which is the main issue the patch was addressing.

Fix this for real.

Fixes: 232f75d3b4b5 ("KVM: nSVM: call nested_svm_load_cr3 on nested state load")
Cc: sta...@vger.kernel.org

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/nested.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 1218b5a342fc8..39d280e7e80ef 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1457,18 +1457,6 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
!__nested_vmcb_check_save(vcpu, &save_cached))
goto out_free;
 
-   /*
-* While the nested guest CR3 is already checked and set by
-* KVM_SET_SREGS, it was set when nested state was yet loaded,
-* thus MMU might not be initialized correctly.
-* Set it again to fix this.
-*/
-
-   ret = nested_svm_load_cr3(&svm->vcpu, vcpu->arch.cr3,
- nested_npt_enabled(svm), false);
-   if (WARN_ON_ONCE(ret))
-   goto out_free;
-
 
/*
 * All checks done, we can enter guest mode. Userspace provides
@@ -1494,6 +1482,20 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
 
svm_switch_vmcb(svm, &svm->nested.vmcb02);
nested_vmcb02_prepare_control(svm);
+
+   /*
+* While the nested guest CR3 is already checked and set by
+* KVM_SET_SREGS, it was set when nested state was yet loaded,
+* thus MMU might not be initialized correctly.
+* Set it again to fix this.
+*/
+
+   ret = nested_svm_load_cr3(&svm->vcpu, vcpu->arch.cr3,
+ nested_npt_enabled(svm), false);
+   if (WARN_ON_ONCE(ret))
+   goto out_free;
+
+
kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
ret = 0;
 out_free:
-- 
2.26.3



[PATCH 01/30] KVM: x86: SVM: don't passthrough SMAP/SMEP/PKE bits in !NPT && !gCR0.PG case

2022-02-07 Thread Maxim Levitsky
When the guest doesn't enable paging and NPT/EPT is disabled, we
use the guest's paging CR3s as KVM's shadow paging pointer and
we are technically in direct mode, as if we were using NPT/EPT.

In direct mode we create SPTEs with user mode permissions
because usually in direct mode the NPT/EPT doesn't
need to restrict access based on the guest CPL
(there are MBE/GMET extensions for that, but KVM doesn't use them).

In this special "use guest paging as direct" mode, however,
if CR4.SMAP/CR4.SMEP are enabled, the CPU will
fault on each access and KVM will enter an endless loop of page faults.

Since page protection doesn't have any meaning in the !PG case,
just don't pass these bits through.

The fix is the same as was done for VMX in
commit 656ec4a4928a ("KVM: VMX: fix SMEP and SMAP without EPT").

This fixes the boot of Windows 10 without NPT for good.
(Without this patch, the BSP boots, but the APs were stuck in an endless
loop of page faults, causing the VM to boot with only 1 CPU.)

Signed-off-by: Maxim Levitsky 
Cc: sta...@vger.kernel.org
---
 arch/x86/kvm/svm/svm.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 975be872cd1a3..995c203a62fd9 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1596,6 +1596,7 @@ void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 {
struct vcpu_svm *svm = to_svm(vcpu);
u64 hcr0 = cr0;
+   bool old_paging = is_paging(vcpu);
 
 #ifdef CONFIG_X86_64
if (vcpu->arch.efer & EFER_LME && !vcpu->arch.guest_state_protected) {
@@ -1612,8 +1613,11 @@ void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long 
cr0)
 #endif
vcpu->arch.cr0 = cr0;
 
-   if (!npt_enabled)
+   if (!npt_enabled) {
hcr0 |= X86_CR0_PG | X86_CR0_WP;
+   if (old_paging != is_paging(vcpu))
+   svm_set_cr4(vcpu, kvm_read_cr4(vcpu));
+   }
 
/*
 * re-enable caching here because the QEMU bios
@@ -1657,8 +1661,12 @@ void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long 
cr4)
svm_flush_tlb_current(vcpu);
 
vcpu->arch.cr4 = cr4;
-   if (!npt_enabled)
+   if (!npt_enabled) {
cr4 |= X86_CR4_PAE;
+
+   if (!is_paging(vcpu))
+   cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
+   }
cr4 |= host_cr4_mce;
to_svm(vcpu)->vmcb->save.cr4 = cr4;
vmcb_mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR);
-- 
2.26.3



[PATCH 00/30] My patch queue

2022-02-07 Thread Maxim Levitsky
This is a set of various patches that are stuck in my patch queue.

The KVM_REQ_GET_NESTED_STATE_PAGES patch is mostly an RFC, but it does seem
to work for me.

The read-only APIC ID change is also somewhat of an RFC.

Some of these patches are preparation for nested AVIC support,
which I have almost finished developing and will start testing very soon.

Best regards,
Maxim Levitsky

Maxim Levitsky (30):
  KVM: x86: SVM: don't passthrough SMAP/SMEP/PKE bits in !NPT &&
!gCR0.PG case
  KVM: x86: nSVM: fix potential NULL dereference on nested migration
  KVM: x86: nSVM: mark vmcb01 as dirty when restoring SMM saved state
  KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a
result of RSM
  KVM: x86: nSVM: expose clean bit support to the guest
  KVM: x86: mark synthetic SMM vmexit as SVM_EXIT_SW
  KVM: x86: nSVM: deal with L1 hypervisor that intercepts interrupts but
lets L2 control them
  KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when
inhibiting it
  KVM: x86: SVM: move avic definitions from AMD's spec to svm.h
  KVM: x86: SVM: fix race between interrupt delivery and AVIC inhibition
  KVM: x86: SVM: use vmcb01 in avic_init_vmcb
  KVM: x86: SVM: allow AVIC to co-exist with a nested guest running
  KVM: x86: lapic: don't allow to change APIC ID when apic acceleration
is enabled
  KVM: x86: lapic: don't allow to change local apic id when using older
x2apic api
  KVM: x86: SVM: remove avic's broken code that updated APIC ID
  KVM: x86: SVM: allow to force AVIC to be enabled
  KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set
  KVM: x86: mmu: add strict mmu mode
  KVM: x86: mmu: add gfn_in_memslot helper
  KVM: x86: mmu: allow to enable write tracking externally
  x86: KVMGT: use kvm_page_track_write_tracking_enable
  KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is running
  KVM: x86: nSVM: implement nested LBR virtualization
  KVM: x86: nSVM: implement nested VMLOAD/VMSAVE
  KVM: x86: nSVM: support PAUSE filter threshold and count when
cpu_pm=on
  KVM: x86: nSVM: implement nested vGIF
  KVM: x86: add force_intercept_exceptions_mask
  KVM: SVM: implement force_intercept_exceptions_mask
  KVM: VMX: implement force_intercept_exceptions_mask
  KVM: x86: get rid of KVM_REQ_GET_NESTED_STATE_PAGES

 arch/x86/include/asm/kvm-x86-ops.h|   1 +
 arch/x86/include/asm/kvm_host.h   |  24 +-
 arch/x86/include/asm/kvm_page_track.h |   1 +
 arch/x86/include/asm/msr-index.h  |   1 +
 arch/x86/include/asm/svm.h|  36 +++
 arch/x86/include/uapi/asm/kvm.h   |   1 +
 arch/x86/kvm/Kconfig  |   3 -
 arch/x86/kvm/hyperv.c |   4 +
 arch/x86/kvm/lapic.c  |  53 ++--
 arch/x86/kvm/mmu.h|   8 +-
 arch/x86/kvm/mmu/mmu.c|  31 ++-
 arch/x86/kvm/mmu/page_track.c |  10 +-
 arch/x86/kvm/svm/avic.c   | 135 +++---
 arch/x86/kvm/svm/nested.c | 167 +++-
 arch/x86/kvm/svm/svm.c| 375 ++
 arch/x86/kvm/svm/svm.h|  60 +++--
 arch/x86/kvm/svm/svm_onhyperv.c   |   1 +
 arch/x86/kvm/vmx/nested.c | 107 +++-
 arch/x86/kvm/vmx/vmcs.h   |   6 +
 arch/x86/kvm/vmx/vmx.c|  48 +++-
 arch/x86/kvm/x86.c|  42 ++-
 arch/x86/kvm/x86.h|   5 +
 drivers/gpu/drm/i915/Kconfig  |   1 -
 drivers/gpu/drm/i915/gvt/kvmgt.c  |   5 +
 include/linux/kvm_host.h  |  10 +-
 25 files changed, 764 insertions(+), 371 deletions(-)

-- 
2.26.3




Re: Couple of issues with amdgpu on my WX4100

2021-01-06 Thread Maxim Levitsky
On Mon, 2021-01-04 at 12:34 +0100, Christian König wrote:
> Hi Maxim,
> 
> I can't help with the display related stuff. Probably best approach to get 
> this fixes would be to open up a bug tracker for this on FDO.

Done, bugs are opened
https://gitlab.freedesktop.org/drm/amd/-/issues/1429
https://gitlab.freedesktop.org/drm/amd/-/issues/1430

About the EDID issue, there do seem to be a few open bugs about it,
but what differs in my case, I think, is that the EDID failure happens
only once in a while rather than always, and it seems to bring
the whole device down.

Best regards,
    Maxim Levitsky



Re: Couple of issues with amdgpu on my WX4100

2021-01-06 Thread Maxim Levitsky
On Mon, 2021-01-04 at 09:45 -0700, Alex Williamson wrote:
> On Mon, 4 Jan 2021 12:34:34 +0100
> Christian König  wrote:
> 
> > Hi Maxim,
> > 
> > I can't help with the display related stuff. Probably best approach to 
> > get this fixes would be to open up a bug tracker for this on FDO.
> > 
> > But I'm the one who implemented the resizeable BAR support and your 
> > analysis of the problem sounds about correct to me.
> > 
> > The reason why this works on Linux is most likely because we restore the 
> > BAR size on resume (and maybe during initial boot as well).
> > 
> > See this patch for reference:
> > 
> > commit d3252ace0bc652a1a24446b6a549f969bf99
> > Author: Christian König 
> > Date:   Fri Jun 29 19:54:55 2018 -0500
> > 
> >  PCI: Restore resized BAR state on resume
> > 
> >  Resize BARs after resume to the expected size again.
> > 
> >  BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199959
> >  Fixes: d6895ad39f3b ("drm/amdgpu: resize VRAM BAR for CPU access v6")
> >  Fixes: 276b738deb5b ("PCI: Add resizable BAR infrastructure")
> >  Signed-off-by: Christian König 
> >  Signed-off-by: Bjorn Helgaas 
> >  CC: sta...@vger.kernel.org  # v4.15+
> > 
Hi!
Thanks for the feedback!
 
So I went over the QEMU code, and QEMU (as opposed to the kernel,
where I tried to hide PCI_EXT_CAP_ID_REBAR) indeed does hide this
PCI capability from the guest.

However, exactly as Alex mentioned, the kernel does indeed restore
the REBAR state, and even with that code patched out I found that
the REBAR state persists across the reset that the vendor_reset module
does (BACO, I think).

Therefore the Linux guest sees the full 4G BAR and happily uses it,
while the Windows guest's driver apparently has a bug when the BAR
is that large.

I patched amdgpu to resize the BAR to various other sizes, and
the Windows driver apparently works with up to a 2GB BAR.

So, pretty much, other than a bug in the Windows driver and the fact
that VFIO doesn't support resizable BARs, there is nothing wrong here.

Since my system does support above-4G decoding and I do have a nice
VFIO-friendly device that supports a resizable BAR, I volunteer
to add support for this to VFIO as time and resources permit.

Also, it would be nice if it were possible to either make amdgpu
(or the whole system) optionally avoid resizing BARs when a
kernel command line / module parameter is given,
or, even better, let amdgpu resize the BAR back to its original
size when it is unloaded, which IMHO is the best solution
for this problem.

I think I can prepare a patch to make amdgpu restore
the BAR size on unload, if you think that
this is the right solution.
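
A minimal sketch of the restore-on-unload idea (assuming the driver remembers
the original BAR0 size at probe time; the helpers and the static variable are
made up, and error handling is mostly omitted):

#include <linux/pci.h>

/* Hypothetical: remember the firmware-assigned BAR0 size at probe ... */
static int orig_bar0_size;	/* encoded as a REBAR size value */

static void my_remember_bar0(struct pci_dev *pdev)
{
	orig_bar0_size = pci_rebar_bytes_to_size(pci_resource_len(pdev, 0));
}

/* ... and put it back on driver unload, so whatever binds next (or a guest
 * behind VFIO) sees the size the firmware originally assigned.
 */
static void my_restore_bar0(struct pci_dev *pdev)
{
	if (pci_resize_resource(pdev, 0, orig_bar0_size))
		pci_warn(pdev, "failed to restore original BAR0 size\n");
}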

> > 
> > It should be trivial to add this to the reset module as well. Most 
> > likely even completely vendor independent since I'm not sure what a bus 
> > reset will do to this configuration and restoring it all the time should 
> > be the most defensive approach.

> 
> Hmm, this should already be used by the bus/slot reset path:
> 
> pci_bus_restore_locked()/pci_slot_restore_locked()
>  pci_dev_restore()
>   pci_restore_state()
>pci_restore_rebar_state()
> 
> VFIO support for resizeable BARs has been on my todo list, but I don't
> have access to any systems that have both a capable device and >4G
> decoding enabled in the BIOS.  If we have a consistent view of the BAR
> size after the BARs are expanded, I'm not sure why it doesn't just
> work.  FWIW, QEMU currently hides the REBAR capability to the guest
> because the kernel driver doesn't support emulation through config
> space (ie. it's read-only, which the spec doesn't support).
> 
> AIUI, resource allocation can fail when enabling REBAR support, which
> is a problem if the failure occurs on the host but not the guest since
> we have no means via the hardware protocol to expose such a condition.
> Therefore the model I was considering for vfio-pci would be to simply
> pre-enable REBAR at the max size.  It might be sufficiently safe to
> test BAR expansion on initialization and then allow user control, but
> I'm concerned that resource availability could change while already in
> use by the user.  Thanks,

As mentioned in other replies in this thread, and as was my first
thought about this, this will indeed break on devices which
don't accurately report the maximum BAR size that they actually need.
Even the spec itself says that determining the optimal BAR size
is vendor specific.

We could also allow the guest to resize the BAR and, if that fails,
expose the error via a virtual AER message on the root port
where the device is attached?

I personally don't know if this is possible/worth it.


Best regards,
Maxim Levitsky

> 
> Alex




Re: Kernel issues with Radeon Pro WX4100 and DP->HDMI dongles

2020-06-28 Thread Maxim Levitsky
On Thu, 2020-06-25 at 10:14 +0300, Maxim Levitsky wrote:
> Hi,
> 
> I recently tried to connect my TV and WX4100 via two different DP->HDMI 
> dongles.
> One of them makes my main monitor to go dark, and system to lockup (I haven't 
> yet debugged this futher), and the other one seems to work,
> most of the time, but sometimes causes a kernel panic on 5.8.0-rc1:
> 
> 
> [  +0.00] ---[ end trace 0ce8685fac3db6b5 ]---
> [  +2.142125] [drm:dc_link_detect_helper [amdgpu]] *ERROR* No EDID read.
> [  +0.065348] [drm] amdgpu_dm_irq_schedule_work FAILED src 8
> [  +0.001002] [drm] amdgpu_dm_irq_schedule_work FAILED src 8
> [  +0.006310] [drm] amdgpu_dm_irq_schedule_work FAILED src 8
> [  +0.102119] [drm] amdgpu_dm_irq_schedule_work FAILED src 8
> [  +0.000679] [drm] amdgpu_dm_irq_schedule_work FAILED src 8
> [ +22.037707] [drm] amdgpu_dm_irq_schedule_work FAILED src 8
> [ +16.202833] [drm] amdgpu_dm_irq_schedule_work FAILED src 8
> [  +0.000685] [drm] amdgpu_dm_irq_schedule_work FAILED src 8
> [  +0.053875] [drm] amdgpu_dm_irq_schedule_work FAILED src 8
> [  +0.000351] [drm] amdgpu_dm_irq_schedule_work FAILED src 8
> [  +0.031764] [ cut here ]
> [  +0.01] WARNING: CPU: 58 PID: 504 at 
> drivers/gpu/drm/amd/amdgpu/../display/dc/gpio/gpio_base.c:66 
> dal_gpio_open_ex+0x1b/0x40 [amdgpu]
> [  +0.01] Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio 
> xfs rfcomm xt_MASQUERADE xt_conntrack ipt_REJECT iptable_mangle iptable_nat 
> nf_nat ebtable_filter ebtables ip6table_filter
> ip6_tables tun bridge pmbus cmac pmbus_core ee1004 jc42 bnep sunrpc vfat fat 
> dm_mirror dm_region_hash dm_log iwlmvm wmi_bmof mac80211 kvm_amd kvm libarc4 
> uvcvideo iwlwifi btusb btrtl btbcm btintel
> videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_hda_codec_hdmi 
> videobuf2_common snd_usb_audio bluetooth videodev input_leds snd_hda_intel 
> cfg80211 snd_usbmidi_lib joydev snd_intel_dspcfg
> snd_rawmidi mc snd_hda_codec xpad ff_memless snd_hwdep thunderbolt 
> ecdh_generic snd_seq ecc snd_hda_core irqbypass rfkill i2c_nvidia_gpu 
> efi_pstore pcspkr snd_seq_device bfq snd_pcm snd_timer zenpower
> snd i2c_piix4 rtc_cmos tpm_crb tpm_tis tpm_tis_core tpm wmi button 
> binfmt_misc dm_crypt sd_mod uas usb_storage hid_generic usbhid hid ext4 
> mbcache jbd2 amdgpu gpu_sched ttm drm_kms_helper syscopyarea
> sysfillrect
> [  +0.18]  sysimgblt crc32_pclmul ahci crc32c_intel fb_sys_fops libahci 
> igb ccp cec xhci_pci libata i2c_algo_bit rng_core nvme xhci_hcd drm nvme_core 
> t10_pi nbd usbmon it87 hwmon_vid fuse i2c_dev
> i2c_core ipv6 autofs4 [last unloaded: nvidia]
> [  +0.05] CPU: 58 PID: 504 Comm: kworker/58:1 Tainted: PW  O  
> 5.8.0-rc1.stable #118
> [  +0.01] Hardware name: Gigabyte Technology Co., Ltd. TRX40 
> DESIGNARE/TRX40 DESIGNARE, BIOS F4c 03/05/2020
> [  +0.00] Workqueue: events dm_irq_work_func [amdgpu]
> [  +0.01] RIP: 0010:dal_gpio_open_ex+0x1b/0x40 [amdgpu]
> [  +0.01] Code: 08 89 47 10 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 
> 00 48 83 7f 08 00 75 0f 48 83 7f 18 00 74 15 89 77 20 e9 65 07 00 00 <0f> 0b 
> e8 ae 5b 8a e0 b8 05 00 00 00 c3 0f 0b e8 a1
> 5b 8a e0 b8 06
> [  +0.00] RSP: 0018:c90002e93b90 EFLAGS: 00010282
> [  +0.01] RAX:  RBX: 889fa4736ca0 RCX: 
> 
> [  +0.00] RDX:  RSI: 0003 RDI: 
> 889fa011ff00
> [  +0.01] RBP: 0003 R08: 0001 R09: 
> 0231
> [  +0.00] R10: 017f R11: 889fbeea4b84 R12: 
> c90002e93c74
> [  +0.00] R13:  R14: 889fa4736ca0 R15: 
> 889fb0e2c100
> [  +0.01] FS:  () GS:889fbee8() 
> knlGS:
> [  +0.00] CS:  0010 DS:  ES:  CR0: 80050033
> [  +0.01] CR2: 1ee62a52b000 CR3: 00174d175000 CR4: 
> 00340ea0
> [  +0.00] Call Trace:
> [  +0.00]  dal_ddc_open+0x2d/0xe0 [amdgpu]
> [  +0.01]  ? dm_read_reg_func+0x33/0xa0 [amdgpu]
> [  +0.00]  dce_aux_transfer_raw+0xb4/0xa30 [amdgpu]
> [  +0.00]  ? hrtimer_try_to_cancel+0x28/0x100
> [  +0.01]  dm_dp_aux_transfer+0x8f/0xf0 [amdgpu]
> [  +0.00]  drm_dp_dpcd_access+0x6b/0x110 [drm_kms_helper]
> [  +0.00]  drm_dp_dpcd_read+0xb6/0xf0 [drm_kms_helper]
> [  +0.01]  dm_helpers_dp_read_dpcd+0x28/0x50 [amdgpu]
> [  +0.00]  core_link_read_dpcd.part.0+0x1f/0x30 [amdgpu]
> [  +0.00]  read_hpd_rx_irq_data+0x39/0x90 [amdgpu]
> [  +0.01]  dc_link_handle_hpd_rx_irq+0x74/0x7c0 [amdgpu]
> [  +0.00]  handle_hpd_rx_irq+0x62/0x2e0 [amdgpu]
> [  +0.00]  ? __schedule+0x252/0x6a0
> [  +0.01]  ?

Kernel issues with Radeon Pro WX4100 and DP->HDMI dongles

2020-06-25 Thread Maxim Levitsky
d dongle and it does appear to work flawlessly 
(no messages in dmesg).

Best regards,
Maxim Levitsky



[bisected] nouveau: "Failed to idle channel x" after resume

2012-08-13 Thread Maxim Levitsky
On Mon, 2012-08-13 at 18:22 +0200, Sven Joachim wrote: 
> On 2012-08-08 08:18 +0200, Sven Joachim wrote:
> 
> > On 2012-08-08 08:08 +0200, Ben Skeggs wrote:
> >
> >> On Wed, Aug 08, 2012 at 08:00:21AM +0200, Sven Joachim wrote:
> >>> Not for me on my GeForce 8500 GT, and I still cannot suspend more than
> >>> once, subsequent attempts fail:
> >>> 
> >>> ,
> >>> | Aug 8 07:49:16 turtle kernel: [ 91.697068] nouveau W[
> >>> | PGRAPH][:01:00.0][0x0200502d][880037be1d40] parent failed
> >>> | suspend, -16
> >>> | Aug  8 07:49:16 turtle kernel: [   91.697078] nouveau  [ 
> >>> DRM][:01:00.0] resuming display...
> >>> `
> >> Interesting.  Were there any messages prior to that?
> >
> > Nothing interesting:
> >
> > ,
> > | Aug  8 07:49:16 turtle kernel: [   89.655362] nouveau  [ 
> > DRM][:01:00.0] suspending fbcon...
> > | Aug  8 07:49:16 turtle kernel: [   89.655367] nouveau  [ 
> > DRM][:01:00.0] suspending display...
> > | Aug  8 07:49:16 turtle kernel: [   89.696888] nouveau  [ 
> > DRM][:01:00.0] unpinning framebuffer(s)...
> > | Aug  8 07:49:16 turtle kernel: [   89.696909] nouveau  [ 
> > DRM][:01:00.0] evicting buffers...
> > | Aug  8 07:49:16 turtle kernel: [   89.696913] nouveau  [ 
> > DRM][:01:00.0] suspending client object trees...
> > `
> >
> >> I guess the fifo code detected a timeout when trying to save the
> >> graphics context. I have other patches in my tree (I'll push them
> >> soon, tied up with other work atm) that might help here.
> >
> > Thanks, I'll try them when they are available.
> 
> With current nouveau master ("drm/nouveau: fix find/replace bug in
> license header") suspending works again, thanks!  However, it is a bit
> slow, taking between two and five seconds:
> 
> ,
> | Aug 13 18:17:56 turtle kernel: [  678.524814] PM: Syncing filesystems ... 
> done.
> | Aug 13 18:18:09 turtle kernel: [  678.639202] Freezing user space processes 
> ... (elapsed 0.01 seconds) done.
> | Aug 13 18:18:09 turtle kernel: [  678.649954] Freezing remaining freezable 
> tasks ... (elapsed 0.01 seconds) done.
> | Aug 13 18:18:09 turtle kernel: [  678.663298] Suspending console(s) (use 
> no_console_suspend to debug)
> | Aug 13 18:18:09 turtle kernel: [  678.680884] sd 0:0:0:0: [sda] 
> Synchronizing SCSI cache
> | Aug 13 18:18:09 turtle kernel: [  678.681000] sd 0:0:0:0: [sda] Stopping 
> disk
> | Aug 13 18:18:09 turtle kernel: [  678.695141] parport_pc 00:07: disabled
> | Aug 13 18:18:09 turtle kernel: [  678.695204] serial 00:06: disabled
> | Aug 13 18:18:09 turtle kernel: [  678.695209] serial 00:06: wake-up 
> capability disabled by ACPI
> | Aug 13 18:18:09 turtle kernel: [  678.695235] nouveau  [ 
> DRM][:01:00.0] suspending fbcon...
> | Aug 13 18:18:09 turtle kernel: [  678.695239] nouveau  [ 
> DRM][:01:00.0] suspending display...
> | Aug 13 18:18:09 turtle kernel: [  678.742111] nouveau  [ 
> DRM][:01:00.0] unpinning framebuffer(s)...
> | Aug 13 18:18:09 turtle kernel: [  678.742189] nouveau  [ 
> DRM][:01:00.0] evicting buffers...
> | Aug 13 18:18:09 turtle kernel: [  682.357319] nouveau  [ 
> DRM][:01:00.0] suspending client object trees...
> | Aug 13 18:18:09 turtle kernel: [  683.526646] PM: suspend of devices 
> complete after 4863.181 msecs
> `
> 
> With the 3.4.8 kernel, suspending takes little more than one second.
> 
> Cheers,
> Sven
I confirm exactly the same thing.

Here suspend takes more than 10 seconds (a rough per-phase breakdown is sketched below the log):

[ 2165.363878] nouveau  [ DRM][:01:00.0] suspending fbcon...
[ 2165.363885] nouveau  [ DRM][:01:00.0] suspending display...
[ 2165.475791] sd 0:0:0:0: [sda] Stopping disk
[ 2166.396877] nouveau  [ DRM][:01:00.0] unpinning
framebuffer(s)...
[ 2166.396926] nouveau  [ DRM][:01:00.0] evicting buffers...
[ 2174.809084] nouveau  [ DRM][:01:00.0] suspending client
object trees...
[ 2177.950222] nouveau :01:00.0: power state changed by ACPI to D3
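
Diffing the consecutive dmesg timestamps above shows that almost all of
the time is spent in the "evicting buffers" step (~8.4 s), with another
~3.1 s in "suspending client object trees". A tiny, purely illustrative
Python sketch of that arithmetic (the numbers are hard-coded from the
log above; nothing here comes from the nouveau code itself):

    # Per-phase durations: each phase ends where the next one begins.
    phases = [
        ("suspending fbcon",               2165.363878),
        ("suspending display",             2165.363885),
        ("unpinning framebuffer(s)",       2166.396877),
        ("evicting buffers",               2166.396926),
        ("suspending client object trees", 2174.809084),
        ("ACPI D3 transition",             2177.950222),
    ]
    for (name, start), (_, end) in zip(phases, phases[1:]):
        print(f"{name:32} {end - start:7.3f} s")
    # "evicting buffers" dominates: ~8.4 s out of the ~12.6 s total.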


Best regards,
Maxim Levitsky




[bisected] nouveau: "Failed to idle channel x" after resume

2012-08-06 Thread Maxim Levitsky
On Sat, 2012-08-04 at 17:41 +0300, Maxim Levitsky wrote: 
> On Mon, 2012-07-23 at 18:25 +0300, Aioanei Rares wrote: 
> > On Thu, Jul 5, 2012 at 11:24 PM, Martin Nyhus  
> > wrote:
> > >
> > > On Mon, 11 Jun 2012 23:18:42 +0200 Martin Nyhus wrote:
> > > > after resuming from suspend nouveau starts writing Failed to idle
> > > > channel x (where x is 2 or 3) to the log and X appears to stop and
> > > > then restart only to stop again. Starting Firefox after resuming
> > > > triggers the bug every time, and bisecting leads to 03bd6efa
> > > > ("drm/nv50/fifo: use hardware channel kickoff functionality").
> > >
> > > Hi Ben,
> > > I'm still seeing this bug with the latest from Linus
> > > (v3.5-rc5-98-g9e85a6f) and linux-next (next-20120705).
> > >
> > > lspci output:
> > > 01:00.0 VGA compatible controller: nVidia Corporation G86 [GeForce
> > > 8400M GS] (rev a1)
> > >
> > > Sorry I haven't followed up on this earlier,
> > > Martin
> > 
> > I can confirm this with 3.5.0, Chromium and Arch Linux. It's an HP
> > Pavilion laptop with a G86 [GeForce 8400M GS] video card.
> > Seems related to this bug:
> > http://lists.freedesktop.org/archives/nouveau/2011-January/007358.html
> > If I can do anything else to help, I will be glad to.
> Added nouveau@lists.freedesktop.org
> 
> I confirm the same issue here.
> will try to dig into it.
Nope, can't dig into this :-(



-- 
Best regards,
Maxim Levitsky




