Re: [PATCH 1/2] KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter

2019-09-27 Thread Sean Christopherson
On Fri, Sep 27, 2019 at 05:44:53PM +0300, Liran Alon wrote:
> 
> > On 27 Sep 2019, at 17:27, Sean Christopherson 
> >  wrote:
> > 
> > On Fri, Sep 27, 2019 at 03:06:02AM +0300, Liran Alon wrote:
> >> 
> >>> On 27 Sep 2019, at 0:43, Sean Christopherson 
> >>>  wrote:
> >>> 
> >>> + /*
> >>> +  * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
> >>> +  * on nested VM-Exit, which can occur without actually running L2, e.g.
> >>> +  * if L2 is entering HLT state, and thus without hitting vmx_set_cr3().
> >>> +  */
> >> 
> >> If I understand correctly, it’s not exactly if L2 is entering HLT state in
> >> general.  (E.g. issue doesn’t occur if L2 runs HLT directly which is not
> >> configured to be intercepted by vmcs12).  It’s specifically when L1 enters 
> >> L2
> >> with a HLT guest-activity-state. I suggest rephrasing comment.
> > 
> > I deliberately worded the comment so that it remains valid if there are
> > more conditions in the future that cause KVM to skip running L2.  What if
> > I split the difference and make the changelog more explicit, but leave the
> > comment as is?
> 
> I think what is confusing in comment is that it seems to also refer to the 
> case
> where L2 directly enters HLT state without L1 intercept. Which isn’t related.
> So I would explicitly mention it’s when L1 enters L2 but don’t physically 
> enter guest
> with vmcs02 because L2 is in HLT state.

Ah, gotcha, I'll tweak the wording.


Re: [PATCH 1/2] KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter

2019-09-27 Thread Liran Alon



> On 27 Sep 2019, at 17:27, Sean Christopherson 
>  wrote:
> 
> On Fri, Sep 27, 2019 at 03:06:02AM +0300, Liran Alon wrote:
>> 
>> 
>>> On 27 Sep 2019, at 0:43, Sean Christopherson 
>>>  wrote:
>>> 
>>> Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter
>>> isntead of deferring the VMWRITE until vmx_set_cr3().  If the VMWRITE
>>> is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it
>>> refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated
>>> VM-Exit occurs without actually entering L2, e.g. if the nested run
>>> is squashed because L2 is being put into HLT.
>> 
>> I would rephrase to “If an emulated VMEntry is squashed because L1 sets
>> vmcs12->guest_activity_state to HLT”.  I think it’s a bit more explicit.
>> 
>>> 
>>> In an ideal world where EPT *requires* unrestricted guest (and vice
>>> versa), VMX could handle CR3 similar to how it handles RSP and RIP,
>>> e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run().  But
>>> the unrestricted guest silliness complicates the dirty tracking logic
>>> to the point that explicitly handling vmcs02.GUEST_CR3 during nested
>>> VM-Enter is a simpler overall implementation.
>>> 
>>> Cc: sta...@vger.kernel.org
>>> Reported-by: Reto Buerki 
>>> Signed-off-by: Sean Christopherson 
>>> ---
>>> arch/x86/kvm/vmx/nested.c | 8 
>>> arch/x86/kvm/vmx/vmx.c| 9 ++---
>>> 2 files changed, 14 insertions(+), 3 deletions(-)
>>> 
>>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>>> index 41abc62c9a8a..971a24134081 100644
>>> --- a/arch/x86/kvm/vmx/nested.c
>>> +++ b/arch/x86/kvm/vmx/nested.c
>>> @@ -2418,6 +2418,14 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
>>> struct vmcs12 *vmcs12,
>>> entry_failure_code))
>>> return -EINVAL;
>>> 
>>> +   /*
>>> +* Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
>>> +* on nested VM-Exit, which can occur without actually running L2, e.g.
>>> +* if L2 is entering HLT state, and thus without hitting vmx_set_cr3().
>>> +*/
>> 
>> If I understand correctly, it’s not exactly if L2 is entering HLT state in
>> general.  (E.g. issue doesn’t occur if L2 runs HLT directly which is not
>> configured to be intercepted by vmcs12).  It’s specifically when L1 enters L2
>> with a HLT guest-activity-state. I suggest rephrasing comment.
> 
> I deliberately worded the comment so that it remains valid if there are
> more conditions in the future that cause KVM to skip running L2.  What if
> I split the difference and make the changelog more explicit, but leave the
> comment as is?

I think what is confusing in comment is that it seems to also refer to the case
where L2 directly enters HLT state without L1 intercept. Which isn’t related.
So I would explicitly mention it’s when L1 enters L2 but don’t physically enter 
guest
with vmcs02 because L2 is in HLT state.

-Liran

> 
>>> +   if (enable_ept)
>>> +   vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);
>>> +
>>> /* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
>>> if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
>>> is_pae_paging(vcpu)) {
>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>> index d4575ffb3cec..b530950a9c2b 100644
>>> --- a/arch/x86/kvm/vmx/vmx.c
>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>> @@ -2985,6 +2985,7 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long 
>>> cr3)
>>> {
>>> struct kvm *kvm = vcpu->kvm;
>>> unsigned long guest_cr3;
>>> +   bool skip_cr3 = false;
>>> u64 eptp;
>>> 
>>> guest_cr3 = cr3;
>>> @@ -3000,15 +3001,17 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned 
>>> long cr3)
>>> spin_unlock(&to_kvm_vmx(kvm)->ept_pointer_lock);
>>> }
>>> 
>>> -   if (enable_unrestricted_guest || is_paging(vcpu) ||
>>> -   is_guest_mode(vcpu))
>>> +   if (is_guest_mode(vcpu))
>>> +   skip_cr3 = true;
>>> +   else if (enable_unrestricted_guest || is_paging(vcpu))
>>> guest_cr3 = kvm_read_cr3(vcpu);
>>> else
>>> guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
>>> ept_load_pdptrs(vcpu);
>>> }
>>> 
>>> -   vmcs_writel(GUEST_CR3, guest_cr3);
>>> +   if (!skip_cr3)
>> 
>> Nit: It’s a matter of taste, but I prefer positive conditions. i.e. “bool
>> write_guest_cr3”.
>> 
>> Anyway, code seems valid to me. Nice catch.
>> Reviewed-by: Liran Alon 
>> 
>> -Liran
>> 
>>> +   vmcs_writel(GUEST_CR3, guest_cr3);
>>> }
>>> 
>>> int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>>> -- 
>>> 2.22.0
>>> 
>> 



Re: [PATCH 1/2] KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter

2019-09-27 Thread Sean Christopherson
On Fri, Sep 27, 2019 at 03:06:02AM +0300, Liran Alon wrote:
> 
> 
> > On 27 Sep 2019, at 0:43, Sean Christopherson 
> >  wrote:
> > 
> > Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter
> > isntead of deferring the VMWRITE until vmx_set_cr3().  If the VMWRITE
> > is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it
> > refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated
> > VM-Exit occurs without actually entering L2, e.g. if the nested run
> > is squashed because L2 is being put into HLT.
> 
> I would rephrase to “If an emulated VMEntry is squashed because L1 sets
> vmcs12->guest_activity_state to HLT”.  I think it’s a bit more explicit.
> 
> > 
> > In an ideal world where EPT *requires* unrestricted guest (and vice
> > versa), VMX could handle CR3 similar to how it handles RSP and RIP,
> > e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run().  But
> > the unrestricted guest silliness complicates the dirty tracking logic
> > to the point that explicitly handling vmcs02.GUEST_CR3 during nested
> > VM-Enter is a simpler overall implementation.
> > 
> > Cc: sta...@vger.kernel.org
> > Reported-by: Reto Buerki 
> > Signed-off-by: Sean Christopherson 
> > ---
> > arch/x86/kvm/vmx/nested.c | 8 
> > arch/x86/kvm/vmx/vmx.c| 9 ++---
> > 2 files changed, 14 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 41abc62c9a8a..971a24134081 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -2418,6 +2418,14 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
> > struct vmcs12 *vmcs12,
> > entry_failure_code))
> > return -EINVAL;
> > 
> > +   /*
> > +* Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
> > +* on nested VM-Exit, which can occur without actually running L2, e.g.
> > +* if L2 is entering HLT state, and thus without hitting vmx_set_cr3().
> > +*/
> 
> If I understand correctly, it’s not exactly if L2 is entering HLT state in
> general.  (E.g. issue doesn’t occur if L2 runs HLT directly which is not
> configured to be intercepted by vmcs12).  It’s specifically when L1 enters L2
> with a HLT guest-activity-state. I suggest rephrasing comment.

I deliberately worded the comment so that it remains valid if there are
more conditions in the future that cause KVM to skip running L2.  What if
I split the difference and make the changelog more explicit, but leave the
comment as is?

> > +   if (enable_ept)
> > +   vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);
> > +
> > /* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
> > if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
> > is_pae_paging(vcpu)) {
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index d4575ffb3cec..b530950a9c2b 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -2985,6 +2985,7 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long 
> > cr3)
> > {
> > struct kvm *kvm = vcpu->kvm;
> > unsigned long guest_cr3;
> > +   bool skip_cr3 = false;
> > u64 eptp;
> > 
> > guest_cr3 = cr3;
> > @@ -3000,15 +3001,17 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned 
> > long cr3)
> > spin_unlock(&to_kvm_vmx(kvm)->ept_pointer_lock);
> > }
> > 
> > -   if (enable_unrestricted_guest || is_paging(vcpu) ||
> > -   is_guest_mode(vcpu))
> > +   if (is_guest_mode(vcpu))
> > +   skip_cr3 = true;
> > +   else if (enable_unrestricted_guest || is_paging(vcpu))
> > guest_cr3 = kvm_read_cr3(vcpu);
> > else
> > guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
> > ept_load_pdptrs(vcpu);
> > }
> > 
> > -   vmcs_writel(GUEST_CR3, guest_cr3);
> > +   if (!skip_cr3)
> 
> Nit: It’s a matter of taste, but I prefer positive conditions. i.e. “bool
> write_guest_cr3”.
> 
> Anyway, code seems valid to me. Nice catch.
> Reviewed-by: Liran Alon 
> 
> -Liran
> 
> > +   vmcs_writel(GUEST_CR3, guest_cr3);
> > }
> > 
> > int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> > -- 
> > 2.22.0
> > 
> 


Re: [PATCH 1/2] KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter

2019-09-27 Thread Sean Christopherson
On Thu, Sep 26, 2019 at 04:39:28PM -0700, Jim Mattson wrote:
> On Thu, Sep 26, 2019 at 2:43 PM Sean Christopherson
>  wrote:
> > -   vmcs_writel(GUEST_CR3, guest_cr3);
> > +   if (!skip_cr3)
> > +   vmcs_writel(GUEST_CR3, guest_cr3);
> 
> Is this part of the change necessary, or is it just an optimization to
> save a redundant VMWRITE?

Save the redundant VMWRITE.  I also wanted to convey the idea that the
nested code is responsible for setting GUEST_CR3 to the correct value.
I'd be happy to add a comment to make the latter point explicit.

> >  }
> >
> >  int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> > --
> > 2.22.0
> >


Re: [PATCH 1/2] KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter

2019-09-26 Thread Liran Alon



> On 27 Sep 2019, at 0:43, Sean Christopherson 
>  wrote:
> 
> Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter
> isntead of deferring the VMWRITE until vmx_set_cr3().  If the VMWRITE
> is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it
> refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated
> VM-Exit occurs without actually entering L2, e.g. if the nested run
> is squashed because L2 is being put into HLT.

I would rephrase to “If an emulated VMEntry is squashed because L1 sets 
vmcs12->guest_activity_state to HLT”.
I think it’s a bit more explicit.

> 
> In an ideal world where EPT *requires* unrestricted guest (and vice
> versa), VMX could handle CR3 similar to how it handles RSP and RIP,
> e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run().  But
> the unrestricted guest silliness complicates the dirty tracking logic
> to the point that explicitly handling vmcs02.GUEST_CR3 during nested
> VM-Enter is a simpler overall implementation.
> 
> Cc: sta...@vger.kernel.org
> Reported-by: Reto Buerki 
> Signed-off-by: Sean Christopherson 
> ---
> arch/x86/kvm/vmx/nested.c | 8 
> arch/x86/kvm/vmx/vmx.c| 9 ++---
> 2 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 41abc62c9a8a..971a24134081 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2418,6 +2418,14 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
> struct vmcs12 *vmcs12,
>   entry_failure_code))
>   return -EINVAL;
> 
> + /*
> +  * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
> +  * on nested VM-Exit, which can occur without actually running L2, e.g.
> +  * if L2 is entering HLT state, and thus without hitting vmx_set_cr3().
> +  */

If I understand correctly, it’s not exactly if L2 is entering HLT state in 
general.
(E.g. issue doesn’t occur if L2 runs HLT directly which is not configured to be 
intercepted by vmcs12).
It’s specifically when L1 enters L2 with a HLT guest-activity-state. I suggest 
rephrasing comment.

> + if (enable_ept)
> + vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);
> +
>   /* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
>   if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
>   is_pae_paging(vcpu)) {
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d4575ffb3cec..b530950a9c2b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2985,6 +2985,7 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long 
> cr3)
> {
>   struct kvm *kvm = vcpu->kvm;
>   unsigned long guest_cr3;
> + bool skip_cr3 = false;
>   u64 eptp;
> 
>   guest_cr3 = cr3;
> @@ -3000,15 +3001,17 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long 
> cr3)
>   spin_unlock(&to_kvm_vmx(kvm)->ept_pointer_lock);
>   }
> 
> - if (enable_unrestricted_guest || is_paging(vcpu) ||
> - is_guest_mode(vcpu))
> + if (is_guest_mode(vcpu))
> + skip_cr3 = true;
> + else if (enable_unrestricted_guest || is_paging(vcpu))
>   guest_cr3 = kvm_read_cr3(vcpu);
>   else
>   guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
>   ept_load_pdptrs(vcpu);
>   }
> 
> - vmcs_writel(GUEST_CR3, guest_cr3);
> + if (!skip_cr3)

Nit: It’s a matter of taste, but I prefer positive conditions. i.e. “bool 
write_guest_cr3”.

Anyway, code seems valid to me. Nice catch.
Reviewed-by: Liran Alon 

-Liran

> + vmcs_writel(GUEST_CR3, guest_cr3);
> }
> 
> int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> -- 
> 2.22.0
> 



Re: [PATCH 1/2] KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter

2019-09-26 Thread Jim Mattson
On Thu, Sep 26, 2019 at 2:43 PM Sean Christopherson
 wrote:
>
> Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter
> isntead of deferring the VMWRITE until vmx_set_cr3().  If the VMWRITE
> is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it
> refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated
> VM-Exit occurs without actually entering L2, e.g. if the nested run
> is squashed because L2 is being put into HLT.
>
> In an ideal world where EPT *requires* unrestricted guest (and vice
> versa), VMX could handle CR3 similar to how it handles RSP and RIP,
> e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run().  But
> the unrestricted guest silliness complicates the dirty tracking logic
> to the point that explicitly handling vmcs02.GUEST_CR3 during nested
> VM-Enter is a simpler overall implementation.
>
> Cc: sta...@vger.kernel.org
> Reported-by: Reto Buerki 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/kvm/vmx/nested.c | 8 
>  arch/x86/kvm/vmx/vmx.c| 9 ++---
>  2 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 41abc62c9a8a..971a24134081 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2418,6 +2418,14 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
> struct vmcs12 *vmcs12,
> entry_failure_code))
> return -EINVAL;
>
> +   /*
> +* Immediately write vmcs02.GUEST_CR3.  It will be propagated to 
> vmcs12
> +* on nested VM-Exit, which can occur without actually running L2, 
> e.g.
> +* if L2 is entering HLT state, and thus without hitting 
> vmx_set_cr3().
> +*/
> +   if (enable_ept)
> +   vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);
> +

This part of the change I can definitely confirm, since we have a
similar change in our fetid pile of commits that have yet to be
upstreamed. Thanks for taking care of this one!

> /* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
> if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
> is_pae_paging(vcpu)) {
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d4575ffb3cec..b530950a9c2b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2985,6 +2985,7 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long 
> cr3)
>  {
> struct kvm *kvm = vcpu->kvm;
> unsigned long guest_cr3;
> +   bool skip_cr3 = false;
> u64 eptp;
>
> guest_cr3 = cr3;
> @@ -3000,15 +3001,17 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long 
> cr3)
> spin_unlock(&to_kvm_vmx(kvm)->ept_pointer_lock);
> }
>
> -   if (enable_unrestricted_guest || is_paging(vcpu) ||
> -   is_guest_mode(vcpu))
> +   if (is_guest_mode(vcpu))
> +   skip_cr3 = true;
> +   else if (enable_unrestricted_guest || is_paging(vcpu))
> guest_cr3 = kvm_read_cr3(vcpu);
> else
> guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
> ept_load_pdptrs(vcpu);
> }
>
> -   vmcs_writel(GUEST_CR3, guest_cr3);
> +   if (!skip_cr3)
> +   vmcs_writel(GUEST_CR3, guest_cr3);

Is this part of the change necessary, or is it just an optimization to
save a redundant VMWRITE?

>  }
>
>  int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> --
> 2.22.0
>