RE: [PATCH] KVM: x86: always set accessed bit in shadow PTEs

2015-11-16 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2015-11-13:
> Commit 7a1638ce4220 ("nEPT: Redefine EPT-specific link_shadow_page()",
> 2013-08-05) says:
> 
> Since nEPT doesn't support A/D bit, we should not set those bit
> when building the shadow page table.
> but this is not necessary.  Even though nEPT doesn't support A/D
> bits, and hence the vmcs12 EPT pointer will never enable them,
> we always use them for shadow page tables if available (see
> construct_eptp in vmx.c).  So we can set the A/D bits freely
> in the shadow page table.
> 
> This patch hence basically reverts commit 7a1638ce4220.
> 
> Cc: Yang Zhang 
> Cc: Takuya Yoshikawa 
> Signed-off-by: Paolo Bonzini 

Looks good to me.

Best regards,
Yang




RE: [PATCH] kvm/vmx: EPTP switching test

2015-11-16 Thread Zhang, Yang Z
Michael S. Tsirkin wrote on 2015-11-16:
> This patch adds a new parameter: eptp_switching_test, which enables
> testing EPT switching on VMX if supported by hardware.  All EPT
> entries are initialized to the same value so this adds no useful
> functionality by itself, but can be used to test VMFUNC performance,
> and serve as a basis for future features based on EPTP switching.
> 
> Support for nested virt is not enabled.
> 
> This was tested using the following code within guest:
>   #define VMX_VMFUNC ".byte 0x0f,0x01,0xd4"
>   static void vmfunc(unsigned int nr, unsigned int ept)
>   {
>   asm volatile(VMX_VMFUNC
>:
>: "a"(nr), "c"(ept)
>: "memory");
>   }
> 
> VMFUNC instruction cost was measured at ~122 cycles.
> (Note: recent versions of the GNU toolchain support the vmfunc
> instruction, removing the need for writing the bytecode manually.)
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
> 
> I think I'd like to put this upstream so future eptp switching work
> can be implemented on top. Comments?

We have a different version in hand that uses a separate EPTP. As you know, 
the patch is more complex when a separate EPTP is used, and there is still a 
lot of work to do on our version. We will send it out for comments soon.

Best regards,
Yang



RE: [PATCH v3] KVM: x86: INIT and reset sequences are different

2015-10-09 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2015-10-01:

Hi Paolo

Sorry for the late reply. I am just back from vacation.

> 
> 
> On 13/04/2015 13:34, Nadav Amit wrote:
>> x86 architecture defines differences between the reset and INIT
>> sequences. INIT does not initialize the FPU (including MMX, XMM, YMM,
>> etc.), TSC, PMU, MSRs (in general), MTRRs machine-check, APIC ID, APIC
>> arbitration ID and BSP.
>> 
>> References (from Intel SDM):
>> 
>> "If the MP protocol has completed and a BSP is chosen, subsequent INITs
>> (either to a specific processor or system wide) do not cause the MP
>> protocol to be repeated." [8.4.2: MP Initialization Protocol
>> Requirements and Restrictions]
>> 
>> [Table 9-1. IA-32 Processor States Following Power-up, Reset, or INIT]
>> 
>> "If the processor is reset by asserting the INIT# pin, the x87 FPU state is 
>> not
>> changed." [9.2: X87 FPU INITIALIZATION]
>> 
>> "The state of the local APIC following an INIT reset is the same as it is 
>> after
>> a power-up or hardware reset, except that the APIC ID and arbitration ID
>> registers are not affected." [10.4.7.3: Local APIC State After an INIT Reset
>> (“Wait-for-SIPI” State)]
>> 
>> Signed-off-by: Nadav Amit 
>> 
>> ---
>> 
>> v3:
>> 
>> - Leave EFER unchanged on INIT. Instead, set cr0 correctly so vmx_set_cr0
> would
>>   recognize that paging was changed from on to off and clear LMA.
> 
> I wonder if this change from v2 to v3 was correct.
> 
> It means that a 32-bit firmware cannot enter paging mode without
> clearing EFER.LME first (which it should not know about).
> 
> Yang, can you check what real hardware does to EFER on an INIT?  Perhaps
> it only clears EFER.LME (in addition of course to EFER.LMA, which is
> cleared as a side effect of writing CR0).

Sure, I will check it with our hardware expert.

> 
> Thanks,
> 
> Paolo


Best regards,
Yang



RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted

2015-08-13 Thread Zhang, Yang Z
Zhang, Yang Z wrote on 2015-08-04:
 Paolo Bonzini wrote on 2015-08-04:
 
 
 On 04/08/2015 02:46, Zhang, Yang Z wrote:
 It is a problem for split irqchip, where the EOI exit bitmap can
 be inferred from the IOAPIC routes but the TMR cannot.  The
 hardware behavior on the other hand can be implemented purely within the 
 LAPIC.
 
 So updating the TMR within LAPIC is the only solution to handle it?
 
 It's the simplest and the one that makes most sense.  Considering
 that TMR is a pretty obscure feature, it's unlikely that it will be
 accelerated in the future.
 
 You may be right. It is safe as long as no future hardware plans to use it.
 Let me check with our hardware team to see whether it will be used in the
 future.

After checking with Jun, there is no guarantee that a guest running on one 
CPU will operate properly if the hypervisor modifies the vTMR from another 
CPU. So the hypervisor should not do it.

 
 
 Paolo
 
 
 Best regards,
 Yang



Best regards,
Yang




RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted

2015-08-04 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2015-08-04:
 
 
 On 04/08/2015 02:46, Zhang, Yang Z wrote:
 It is a problem for split irqchip, where the EOI exit bitmap can be
 inferred from the IOAPIC routes but the TMR cannot.  The hardware
 behavior on the other hand can be implemented purely within the LAPIC.
 
 So updating the TMR within LAPIC is the only solution to handle it?
 
 It's the simplest and the one that makes most sense.  Considering that
 TMR is a pretty obscure feature, it's unlikely that it will be
 accelerated in the future.

You may be right. It is safe as long as no future hardware plans to use it. 
Let me check with our hardware team to see whether it will be used in the future.

 
 Paolo


Best regards,
Yang




RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted

2015-08-03 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2015-08-03:
 
 
 On 03/08/2015 12:23, Zhang, Yang Z wrote:
 In any case, the TMR behavior introduced by the APICv patches is
 completely different from the hardware behavior, so it has to be fixed.
 
 But is there any real problem with it?
 
 It is a problem for split irqchip, where the EOI exit bitmap can be
 inferred from the IOAPIC routes but the TMR cannot.  The hardware
 behavior on the other hand can be implemented purely within the LAPIC.

So updating the TMR within LAPIC is the only solution to handle it?

 
  The alternative is to inject level-triggered interrupts
 synchronously, without using posted interrupts.
 
 I'll write some testcases to understand the functioning of TMR in
 the virtual-APIC page, but the manual seems clear to me.
 
 Currently no existing hardware uses the TMR, so it will not cause any
 problem (that's the reason why we leave it alone in Xen). But we don't
 know whether future hardware will use it or not (the SDM always keeps
 changing :)).
 
 But that would be covered by a different execution control (for
 backwards compatibility).  We'll get there when such a feature is introduced.

Yes, we can leave it for the future. But one concern is that it may be hard 
to handle by then if someone develops another feature that relies on it (like 
the current patch to split the irqchip).

 
 And per 24.11.4's description, the perfect solution is not to modify
 it. btw, IIRC, only the TMR doesn't follow the rule. All other VMCS
 accesses are issued in the right VMCS context.
 
 Yes, that's correct.  It's just the TMR.
 
 Paolo


Best regards,
Yang




RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted

2015-08-03 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2015-08-03:
 
 
 On 03/08/2015 04:37, Zhang, Yang Z wrote:
 Only virtualized APIC register reads use the virtual TMR
 registers (SDM
 29.4.2 or 29.5), but these just read data from the corresponding
 field in the virtual APIC page.
 
 24.11.4 Software Access to Related Structures In addition to data in
 the VMCS region itself, VMX non-root operation can be controlled by
 data structures that are referenced by pointers in a VMCS (for
 example, the I/O bitmaps). While the pointers to these data
 structures are parts of the VMCS, the data structures themselves are
 not. They are not accessible using VMREAD and VMWRITE but by
 ordinary memory writes.
 
 I don't think the TMR fields of the virtual-APIC page are among the
 data structures that _control_ VMX non-root operation, because:
 
 * they are not part of the virtualized APIC state listed in 29.1.1
 
 * read accesses from the APIC-access page simply return data from the
 corresponding page offset on the virtual-APIC page using the memory
 access type stored in IA32_VMX_BASIC_MSR.  I think this explicitly
 says that the effects of 24.11.1 (especially non-deterministic
 behavior after a write) do not apply here.
 
 In any case, the TMR behavior introduced by the APICv patches is
 completely different from the hardware behavior, so it has to be fixed.

But is there any real problem with it?

  The alternative is to inject level-triggered interrupts
 synchronously, without using posted interrupts.
 
 I'll write some testcases to understand the functioning of TMR in the
 virtual-APIC page, but the manual seems clear to me.

Currently no existing hardware uses the TMR, so it will not cause any 
problem (that's the reason why we leave it alone in Xen). But we don't know 
whether future hardware will use it or not (the SDM always keeps changing :)). 
And per 24.11.4's description, the perfect solution is not to modify it.
btw, IIRC, only the TMR doesn't follow the rule. All other VMCS accesses are 
issued in the right VMCS context.

 
 Paolo


Best regards,
Yang




RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted

2015-08-02 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2015-07-31:
 
 
 On 31/07/2015 01:26, Zhang, Yang Z wrote:
 Do not compute TMR in advance.  Instead, set the TMR just before
 the interrupt is accepted into the IRR.  This limits the coupling
 between IOAPIC and LAPIC.
 
 Uh.., this goes back to the original way, which is wrong. You cannot
 modify the apic page (here, the TMR register) directly while the
 corresponding VMCS may be in use at the same time.
 
 Where is this documented?  The TMR is not part of the set of virtualized
 APIC registers (TPR, PPR, EOI, ISR, IRR, ICR+ICR2; SDM 29.1.1).
 
 Only virtualized APIC register reads use the virtual TMR registers (SDM
 29.4.2 or 29.5), but these just read data from the corresponding field
 in the virtual APIC page.

It's not only the virtual-APIC page. All data structures that are used by the 
VMCS (either directly inside the VMCS or referenced by a pointer in the VMCS) 
need to follow this rule:

24.11.4 Software Access to Related Structures
In addition to data in the VMCS region itself, VMX non-root operation can be
controlled by data structures that are referenced by pointers in a VMCS (for
example, the I/O bitmaps). While the pointers to these data structures are
parts of the VMCS, the data structures themselves are not. They are not
accessible using VMREAD and VMWRITE but by ordinary memory writes.
Software should ensure that each such data structure is modified only when no
logical processor with a current VMCS that references it is in VMX non-root
operation. Doing otherwise may lead to unpredictable behavior (including
behaviors identified in Section 24.11.1).
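
[Note: a minimal sketch of one way to honor this rule; illustrative only, not 
the fix that was eventually merged. kvm_make_request() and kvm_vcpu_kick() are 
existing KVM helpers; the wrapper name is hypothetical:

	static void request_safe_tmr_update(struct kvm_vcpu *vcpu)
	{
		/* recompute TMR/EOI-exit state on the next guest entry... */
		kvm_make_request(KVM_REQ_SCAN_IOAPIC, vcpu);
		/* ...and force the vCPU out of VMX non-root operation now,
		 * so the virtual-APIC page is never written while a VMCS
		 * referencing it is live */
		kvm_vcpu_kick(vcpu);
	}
]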

 
 Paolo


Best regards,
Yang




RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted

2015-08-02 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2015-07-31:
 
 
 On 31/07/2015 04:49, Steve Rutherford wrote:
 Oh... Yeah. That's a damn good point, given that the interrupt can be
 injected from another thread while one is in that guest vcpu.
 
 Easiest time to update the TMR should be on guest entry through
 vcpu_scan_ioapic, as before.
 
 The best way to go is probably to ditch the new per vcpu EOI exit
 bitmap, and just update/use the TMR. There's no reason to duplicate
 that data in the representation of the apic (I suspect that the
 duplication was inspired by my mistaken notion of the TMR). The
 IOAPIC exit check can use the TMR instead.
 
 Based upon my reading of the SDM, the only reason that the eoi exit
 bitmaps are not exactly the same as the TMR is that it is possible to
 have virtual-interrupt delivery enabled /without/ an apic access page
 (Note: V-ID => EOI exit bitmap enabled).
 
 Yang, do you happen to know if that is the case?
 
 [Note: Just looked back at the old code for updating the EOI exit
 bitmaps, which for some reason was configured to trigger EOI exits
 for all IOAPIC interrupts, not just level-triggered IOAPIC
 interrupts. Which is weird, and I believe totally unnecessary.]
 
 The RTC interrupt needs to trigger an EOI exit with the in-kernel
 IOAPIC, in order to detect coalesced interrupts.  This is necessary to
 avoid severe time drift in Windows guests.

Correct! The EOI exit bitmap is not exactly the same as the TMR. Some 
interrupts (especially the timer interrupt) are edge-triggered but still need 
a notification when they are injected into the guest.

 
 Paolo


Best regards,
Yang




RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted

2015-07-30 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2015-07-29:
 Do not compute TMR in advance.  Instead, set the TMR just before the
 interrupt is accepted into the IRR.  This limits the coupling between
 IOAPIC and LAPIC.
 

Uh.., this goes back to the original way, which is wrong. You cannot modify 
the apic page (here, the TMR register) directly while the corresponding VMCS 
may be in use at the same time.


 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
  arch/x86/kvm/ioapic.c |  9 ++---
  arch/x86/kvm/ioapic.h |  3 +--
  arch/x86/kvm/lapic.c  | 19 ++-
  arch/x86/kvm/lapic.h  |  1 -
  arch/x86/kvm/x86.c|  5 +
  5 files changed, 14 insertions(+), 23 deletions(-)
 diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
 index 856f79105bb5..eaf4ec38d980 100644
 --- a/arch/x86/kvm/ioapic.c
 +++ b/arch/x86/kvm/ioapic.c
 @@ -246,8 +246,7 @@ static void update_handled_vectors(struct kvm_ioapic *ioapic)
 	smp_wmb();
 }
 -void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap,
 -			u32 *tmr)
 +void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
 {
 	struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
 	union kvm_ioapic_redirect_entry *e;
 @@ -260,13 +259,9 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap,
 		    kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index) ||
 		    index == RTC_GSI) {
 			if (kvm_apic_match_dest(vcpu, NULL, 0,
 -				e->fields.dest_id, e->fields.dest_mode)) {
 +				e->fields.dest_id, e->fields.dest_mode))
 				__set_bit(e->fields.vector,
 					  (unsigned long *)eoi_exit_bitmap);
 -				if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG)
 -					__set_bit(e->fields.vector,
 -						  (unsigned long *)tmr);
 -			}
 		}
 	}
 	spin_unlock(&ioapic->lock);
 diff --git a/arch/x86/kvm/ioapic.h b/arch/x86/kvm/ioapic.h
 index ca0b0b4e6256..3dbd0e2aac4e 100644
 --- a/arch/x86/kvm/ioapic.h
 +++ b/arch/x86/kvm/ioapic.h
 @@ -120,7 +120,6 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 		struct kvm_lapic_irq *irq, unsigned long *dest_map);
 int kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
 int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
 -void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap,
 -			u32 *tmr);
 +void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
 
 #endif
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 2a5ca97c263b..9be64c77d6db 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -551,15 +551,6 @@ static void pv_eoi_clr_pending(struct kvm_vcpu *vcpu)
 	__clear_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
 }
 -void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr)
 -{
 -	struct kvm_lapic *apic = vcpu->arch.apic;
 -	int i;
 -
 -	for (i = 0; i < 8; i++)
 -		apic_set_reg(apic, APIC_TMR + 0x10 * i, tmr[i]);
 -}
 -
 static void apic_update_ppr(struct kvm_lapic *apic)
 {
 	u32 tpr, isrv, ppr, old_ppr;
 @@ -781,6 +772,9 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
 	case APIC_DM_LOWEST:
 		vcpu->arch.apic_arb_prio++;
 	case APIC_DM_FIXED:
 +		if (unlikely(trig_mode && !level))
 +			break;
 +
 		/* FIXME add logic for vcpu on reset */
 		if (unlikely(!apic_enabled(apic)))
 			break;
 @@ -790,6 +784,13 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
 		if (dest_map)
 			__set_bit(vcpu->vcpu_id, dest_map);
 +		if (apic_test_vector(vector, apic->regs + APIC_TMR) != !!trig_mode) {
 +			if (trig_mode)
 +				apic_set_vector(vector, apic->regs + APIC_TMR);
 +			else
 +				apic_clear_vector(vector, apic->regs + APIC_TMR);
 +		}
 +
 		if (kvm_x86_ops->deliver_posted_interrupt)
 			kvm_x86_ops->deliver_posted_interrupt(vcpu, vector);
 		else {
 diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
 index 764037991d26..eb46d6bcaa75 100644
 --- a/arch/x86/kvm/lapic.h
 +++ b/arch/x86/kvm/lapic.h
 @@ -57,7 +57,6 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);
 -void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr);
 void __kvm_apic_update_irr(u32 *pir, void *regs);
 void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir);
 int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct 

RE: [v6] kvm/fpu: Enable fully eager restore kvm FPU

2015-04-24 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2015-04-24:
 
 
 On 24/04/2015 09:46, Zhang, Yang Z wrote:
 On the other hand vmexit is lighter and lighter on newer
 processors; a Sandy Bridge has less than half the vmexit cost of a
 Core 2 (IIRC
 1000 vs. 2500 clock cycles approximately).
 
 1000 cycles? I remember it takes about 4000 cycles even on an HSW server.
 
 I was going from memory, but I now measured it with the vmexit test of
 kvm-unit-tests.  With both SNB Xeon E5 and IVB Core i7, returns about
 1400 clock cycles for a vmcall exit.  This includes the overhead of
 doing the cpuid itself.
 
 Thus the vmexit cost is around 1300 cycles.  Of this the vmresume
 instruction is probably around 800 cycles, and the rest is introduced
 by KVM.  There are at least 4-5 memory barriers and locked instructions.

Yes, that makes sense. The average vmexit/vmentry handling cost is around 4000 
cycles. But I guess xsaveopt doesn't take that many cycles. Does anyone have 
data on the cost of xsaveopt?
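
[Note: one rough way to measure it, in the spirit of the vmexit test quoted 
above; illustrative only. The -1/-1 mask in EDX:EAX requests all enabled state 
components, the buffer must be a 64-byte-aligned XSAVE area, and serialization 
around rdtsc is omitted:

	static inline u64 rdtsc_cycles(void)
	{
		unsigned int lo, hi;

		asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
		return ((u64)hi << 32) | lo;
	}

	static u64 time_xsaveopt(void *xsave_area)
	{
		u64 t0 = rdtsc_cycles();

		asm volatile("xsaveopt (%0)"
			     : : "r"(xsave_area), "a"(-1), "d"(-1)
			     : "memory");
		return rdtsc_cycles() - t0;
	}
]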

 
 Paolo


Best regards,
Yang




RE: [v6] kvm/fpu: Enable fully eager restore kvm FPU

2015-04-24 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2015-04-24:
 
 
 On 24/04/2015 03:16, Zhang, Yang Z wrote:
 This is interesting since previous measurements on KVM have had the
 exact opposite results.  I think we need to understand this a lot
 more.
 
 What I can tell is that a vmexit is heavy, so it is reasonable to see
 an improvement in some cases, especially now that the kernel uses eager
 FPU, which means each reschedule may trigger a vmexit.
 
 On the other hand vmexit is lighter and lighter on newer processors; a
 Sandy Bridge has less than half the vmexit cost of a Core 2 (IIRC 1000
 vs. 2500 clock cycles approximately).
 

1000 cycles? I remember it takes about 4000 cycles even on an HSW server.

 Also, measurement were done on Westmere but Sandy Bridge is the first
 processor to have XSAVEOPT and thus use eager FPU.
 
 Paolo


Best regards,
Yang




RE: [v6] kvm/fpu: Enable fully eager restore kvm FPU

2015-04-23 Thread Zhang, Yang Z
H. Peter Anvin wrote on 2015-04-24:
 On 04/23/2015 08:28 AM, Dave Hansen wrote:
 On 04/23/2015 02:13 PM, Liang Li wrote:
 When compiling the kernel on Westmere, the performance of eager FPU is
 about 0.4% faster than lazy FPU.
 
 Do you have a theory why this is?  Where does the regression come from?
 
 
 This is interesting since previous measurements on KVM have had the
 exact opposite results.  I think we need to understand this a lot more.

What I can tell is that a vmexit is heavy, so it is reasonable to see an 
improvement in some cases, especially now that the kernel uses eager FPU, 
which means each reschedule may trigger a vmexit.

 
   -hpa



Best regards,
Yang




RE: [PATCH v4 6/6] KVM: nVMX: Enable nested posted interrupt processing

2015-02-02 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2015-02-03:
 
 
 On 02/02/2015 16:33, Wincy Van wrote:
 static void vmx_accomp_nested_posted_intr(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
 	if (is_guest_mode(vcpu) &&
 	    vmx->nested.posted_intr_nv != -1 &&
 	    pi_test_on(vmx->nested.pi_desc))
 		kvm_apic_set_irr(vcpu, vmx->nested.posted_intr_nv);
 }
 Then we will get a nested vmexit in vmx_check_nested_events, and that
 posted intr will be handled by L1 immediately.
 This mechanism will also emulate the hardware's behavior: if a
 posted intr was not accomplished by hardware, we will get an

Actually, we cannot say "not accomplished by hardware". It's more that we 
don't do the job well. See my answer below.

 interrupt with POSTED_INTR_NV.
 
 Yes.

This is not enough. From L1's point of view, L2 is in VMX non-root mode, so we 
should emulate the posted interrupt correctly in L0, namely:
1. clear the ON bit
2. ack the interrupt
3. sync the PIR to the vIRR
4. update RVI
Then let the hardware (virtual interrupt delivery) accomplish the interrupt 
injection.

Forcing a vmexit is more like a trick. It's better to follow the hardware's 
behavior unless we cannot do it.
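
[Note: in code, the sequence described above would look roughly like the 
sketch below. pi_test_and_clear_on() and kvm_apic_update_irr() are existing 
KVM helpers; the wrapper itself is hypothetical, and the RVI update (step 4) 
happens through the virtual-interrupt-delivery machinery on the next entry:

	static void emulate_posted_intr_for_l2(struct kvm_vcpu *vcpu,
					       struct pi_desc *pi)
	{
		if (!pi_test_and_clear_on(pi))	/* 1. clear the ON bit */
			return;
		/* 2. the notification vector was acked by L0, not by L1 */
		kvm_apic_update_irr(vcpu, pi->pir);	/* 3. sync PIR to vIRR */
		/* 4. RVI is refreshed from the vIRR before the next entry,
		 * so virtual interrupt delivery injects without a vmexit */
	}
]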

 
 Would this be better?
 
 I think you do not even need a new bit.  You can use KVM_REQ_EVENT and
 (to complete my suggestion, which was not enough) do the above in
 vmx_check_nested_events.
 
 Paolo


Best regards,
Yang



RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap

2015-01-28 Thread Zhang, Yang Z
Wincy Van wrote on 2015-01-28:
 On Wed, Jan 28, 2015 at 4:00 PM, Zhang, Yang Z yang.z.zh...@intel.com
 wrote:
 @@ -5812,13 +5813,18 @@ static __init int hardware_setup(void)
 	vmx_msr_bitmap_longmode_x2apic =
 			(unsigned long *)__get_free_page(GFP_KERNEL);
 	if (!vmx_msr_bitmap_longmode_x2apic)
 		goto out4;
 +
 +	vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL);
 +	if (!vmx_msr_bitmap_nested)
 +		goto out5;
 +
 
 Since nested virtualization is off by default, it's better to
 allocate the page only when nested is true. Maybe adding the following
 check is better:
 
 if (nested) {
 	vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL);
 	if (!vmx_msr_bitmap_nested)
 		goto out5;
 }
 
 Agreed. Will do.
 
 
 ...snip...
 
 +
 +/*
 + * Merge L0's and L1's MSR bitmap, return false to indicate that
 + * we do not use the hardware.
 + */
 +static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
 +					       struct vmcs12 *vmcs12)
 +{
 +	return false;
 +}
 +
 
 The following patches have nothing to do with the MSR control. Why
 leave the function empty here?
 
 
 No. In the patch "Enable nested virtualize x2apic mode", we will return
 false if L1 disabled virt_x2apic_mode, so the hardware MSR-bitmap control
 is disabled.
 This is faster than rebuilding the vmx_msr_bitmap_nested.
 This function returns false here to indicate that we do not use the hardware.
 Since it is not only virtualize x2apic mode that uses this (other features
 may use it as well), I think it isn't suitable to introduce this function in
 other patches.

Yes, rebuilding is more costly. But your current implementation cannot leverage 
the APICv feature correctly. I replied in another thread.

 
 
 Best regards,
 Yang
 



Best regards,
Yang




RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap

2015-01-28 Thread Zhang, Yang Z
Zhang, Yang Z wrote on 2015-01-28:
 Wincy Van wrote on 2015-01-28:
 On Wed, Jan 28, 2015 at 7:52 PM, Zhang, Yang Z
 yang.z.zh...@intel.com
 wrote:
 
 
 If L0 wants to intercept a msr, we should set
 vmx_msr_bitmap_legacy(_x2apic) and vmx_msr_bitmap_longmode(_x2apic),
 and those bitmaps should only be loaded on non-nested entry. Currently
 we only clear the corresponding bits if L1 enables virtualize x2apic
 mode; all the other bits are set. On nested entry, we load
 nested_msr_bitmap; on nested vmexit, we restore L0's bitmap. On
 nested entry, L0 only cares about L2's msr accesses, not L1's. I
 think that on nested entry, nested_msr_bitmap is vmcs01->msr_bitmap,
 as you mentioned above.
 
 Mmm... I think that if the bit is set in vmcs01->msr_bitmap, it means
 that whenever the VCPU (no matter whether it is in nested guest mode
 or not) accesses this msr, it should cause a vmexit, not only for L1.
 That's why we need to construct vmcs02->msr_bitmap based on
 (vmcs01->msr_bitmap ORed with vmcs12->msr_bitmap).
 
 
 You are right, but this does not fit all cases; we should customize
 the nested_msr_bitmap.
 e.g. currently L0 wants to intercept reads of some of the x2apic msrs:
  if (enable_apicv) {
 	for (msr = 0x800; msr <= 0x8ff; msr++)
 		vmx_disable_intercept_msr_read_x2apic(msr);
 	/* According SDM, in x2apic mode, the whole id reg is used.
 	 * But in KVM, it only use the highest eight bits. Need to
 	 * intercept it */
 	vmx_enable_intercept_msr_read_x2apic(0x802);
 	/* TMCCT */
 	vmx_enable_intercept_msr_read_x2apic(0x839);
 	/* TPR */
 	vmx_disable_intercept_msr_write_x2apic(0x808);
 	/* EOI */
 	vmx_disable_intercept_msr_write_x2apic(0x80b);
 	/* SELF-IPI */
 	vmx_disable_intercept_msr_write_x2apic(0x83f);
 }
 But L1 may not want this. So I think we had better deal with such
 
 Actually, L0 is totally unaware of L2. The only thing L0 is aware of
 is that the CPU should follow L0's configuration when the VCPU is
 running. So if L0 wants to trap a msr, then a read of this msr should
 cause a vmexit unconditionally, no matter who is running (L1, L2,
 L3, ...).
 
 msrs separately; there is no common way that suits all cases.
 If other features want to intercept a msr on nested entry, they can
 put the custom code in nested_vmx_merge_msr_bitmap.
 
 Yes, if other features want to do it on 'nested' entry, they can fill
 nested_vmx_merge_msr_bitmap. But in the non-nested case, it should be
 our responsibility to handle it correctly. How about adding the
 following check:
 
 if (type & MSR_TYPE_R && !test_bit(msr, vmcs01_msr_bitmap) &&
     !test_bit(msr, msr_bitmap_l1 + 0x000 / f))
 	__clear_bit(msr, msr_bitmap_nested + 0x000 / f);


Anyway, this is not necessary for your current patch. We can consider it later 
if other features really do use it.

 
 
 
 Thanks,
 Wincy
 
 
 Best regards,
 Yang



Best regards,
Yang




RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap

2015-01-28 Thread Zhang, Yang Z
Wincy Van wrote on 2015-01-24:
 Currently, if L1 enables MSR_BITMAP, we emulate this feature: all of L2's
 msr accesses are intercepted by L0. Since many features, like virtualize
 x2apic mode, have complicated logic that is difficult for us to emulate,
 we should use the hardware and merge the bitmaps.
 
 This patch introduces nested_vmx_merge_msr_bitmap for future use.
 
 Signed-off-by: Wincy Van fanwenyi0...@gmail.com
 ---
  arch/x86/kvm/vmx.c |   71 +++
  1 files changed, 60 insertions(+), 11 deletions(-)
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index c987374..36d0724 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -798,6 +798,7 @@ static unsigned long *vmx_msr_bitmap_legacy;
 static unsigned long *vmx_msr_bitmap_longmode;
 static unsigned long *vmx_msr_bitmap_legacy_x2apic;
 static unsigned long *vmx_msr_bitmap_longmode_x2apic;
 +static unsigned long *vmx_msr_bitmap_nested;
 static unsigned long *vmx_vmread_bitmap;
 static unsigned long *vmx_vmwrite_bitmap;
 
 @@ -5812,13 +5813,18 @@ static __init int hardware_setup(void)
 	vmx_msr_bitmap_longmode_x2apic =
 			(unsigned long *)__get_free_page(GFP_KERNEL);
 	if (!vmx_msr_bitmap_longmode_x2apic)
 		goto out4;
 +
 +	vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL);
 +	if (!vmx_msr_bitmap_nested)
 +		goto out5;
 +

Since nested virtualization is off by default, it's better to allocate the
page only when nested is true. Maybe adding the following check is better:

if (nested) {
vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL);
if (!vmx_msr_bitmap_nested)
goto out5;
}

...snip...

 +
 +/*
 + * Merge L0's and L1's MSR bitmap, return false to indicate that
 + * we do not use the hardware.
 + */
 +static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
 +					       struct vmcs12 *vmcs12)
 +{
 +	return false;
 +}
 +

The following patches have nothing to do with the MSR control. Why leave the 
function empty here?

Best regards,
Yang



RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap

2015-01-28 Thread Zhang, Yang Z
Zhang, Yang Z wrote on 2015-01-28:
 Wincy Van wrote on 2015-01-24:
 When L2 is using x2apic, we can use virtualize x2apic mode to gain
 higher performance, especially in apicv case.
 
 This patch also introduces nested_vmx_check_apicv_controls for the
 nested apicv patches.

Sorry, replied on the wrong thread.:(

 
 Signed-off-by: Wincy Van fanwenyi0...@gmail.com
 ---
 
 ...snip...
 
 static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
 {
 	if (!longmode_only)
 @@ -8344,7 +8394,68 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu,
 static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
 					       struct vmcs12 *vmcs12)
 {
 -	return false;
 +	struct page *page;
 +	unsigned long *msr_bitmap;
 +
 +	if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
 +		return false;
 +
 +	page = nested_get_page(vcpu, vmcs12->msr_bitmap);
 +	if (!page) {
 +		WARN_ON(1);
 +		return false;
 +	}
 +	msr_bitmap = (unsigned long *)kmap(page);
 +	if (!msr_bitmap) {
 +		nested_release_page_clean(page);
 +		WARN_ON(1);
 +		return false;
 +	}
 +
 +	memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE);
 +
 +	if (nested_cpu_has_virt_x2apic_mode(vmcs12))
 +		/* TPR is allowed */
 +		nested_vmx_disable_intercept_for_msr(msr_bitmap,
 +			vmx_msr_bitmap_nested,
 +			APIC_BASE_MSR + (APIC_TASKPRI >> 4),
 +			MSR_TYPE_R | MSR_TYPE_W);
 
 I don't understand what this function does. Per my understanding, you
 only need to set (vmx_msr_bitmap_nested = vmcs01->msr_bitmap |
 vmcs12->msr_bitmap) and inject a nested vmexit to L1 if the bit in
 vmcs12->msr_bitmap is set. Am I missing some patches?
 
 Best regards,
 Yang



Best regards,
Yang




RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap

2015-01-28 Thread Zhang, Yang Z
Wincy Van wrote on 2015-01-28:
 On Wed, Jan 28, 2015 at 4:05 PM, Zhang, Yang Z yang.z.zh...@intel.com
 wrote:
 ...snip...
 
 I don't understand what this function does. Per my understanding,
 you only need to set (vmx_msr_bitmap_nested = vmcs01->msr_bitmap |
 vmcs12->msr_bitmap) and inject a nested vmexit to L1 if the bit in
 vmcs12->msr_bitmap is set. Am I missing some patches?
 
 In the beginning, I wanted to do vmcs01->msr_bitmap |
 vmcs12->msr_bitmap, but I remember that there isn't an instruction to
 OR two whole pages together efficiently, so I do the bit-or
 operation in nested_vmx_disable_intercept_for_msr. If the hardware does
 not support this, I think it is faster if we deal with the bits on demand.
 nested_vmx_merge_msr_bitmap is used to merge L0's and L1's bitmaps;
 any feature can put its logic there.

You construct the nested_msr_bitmap based only on vmcs12->msr_bitmap; what 
happens if vmcs01->msr_bitmap wants to trap this msr?

 
 If there is a faster way, please teach me how to do it : )

You are right. Interception should be much faster.
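
[Note: for reference, the straightforward merge being discussed is just a 
word-wise OR over the two 4 KiB pages; a minimal sketch, with set bits meaning 
"intercept":

	static void merge_msr_bitmaps(unsigned long *dst,
				      const unsigned long *bitmap_l0,
				      const unsigned long *bitmap_l1)
	{
		int i;

		/* an msr access exits L2 if either L0 or L1 traps it */
		for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++)
			dst[i] = bitmap_l0[i] | bitmap_l1[i];
	}
]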

 
 Thanks,
 Wincy
 
 
 
 Best regards,
 Yang
 



Best regards,
Yang




RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap

2015-01-28 Thread Zhang, Yang Z
Wincy Van wrote on 2015-01-28:
 On Wed, Jan 28, 2015 at 7:52 PM, Zhang, Yang Z yang.z.zh...@intel.com
 wrote:
 
 
 If L0 wants to intercept a msr, we should set
 vmx_msr_bitmap_legacy(_x2apic) and vmx_msr_bitmap_longmode(_x2apic),
 and those bitmaps should only be loaded on non-nested entry. Currently
 we only clear the corresponding bits if L1 enables virtualize x2apic
 mode; all the other bits are set. On nested entry, we load
 nested_msr_bitmap; on nested vmexit, we restore L0's bitmap. On
 nested entry, L0 only cares about L2's msr accesses, not L1's. I
 think that on nested entry, nested_msr_bitmap is vmcs01->msr_bitmap,
 as you mentioned above.
 
 Mmm... I think that if the bit is set in vmcs01->msr_bitmap, it means
 that whenever the VCPU (no matter whether it is in nested guest mode
 or not) accesses this msr, it should cause a vmexit, not only for L1.
 That's why we need to construct vmcs02->msr_bitmap based on
 (vmcs01->msr_bitmap ORed with vmcs12->msr_bitmap).
 
 
 You are right, but this does not fit all cases; we should customize
 the nested_msr_bitmap.
 e.g. currently L0 wants to intercept reads of some of the x2apic msrs:
  if (enable_apicv) {
 	for (msr = 0x800; msr <= 0x8ff; msr++)
 		vmx_disable_intercept_msr_read_x2apic(msr);
 	/* According SDM, in x2apic mode, the whole id reg is used.
 	 * But in KVM, it only use the highest eight bits. Need to
 	 * intercept it */
 	vmx_enable_intercept_msr_read_x2apic(0x802);
 	/* TMCCT */
 	vmx_enable_intercept_msr_read_x2apic(0x839);
 	/* TPR */
 	vmx_disable_intercept_msr_write_x2apic(0x808);
 	/* EOI */
 	vmx_disable_intercept_msr_write_x2apic(0x80b);
 	/* SELF-IPI */
 	vmx_disable_intercept_msr_write_x2apic(0x83f);
 }
 But L1 may not want this. So I think we had better deal with such

Actually, L0 is totally unaware of L2. The only thing L0 is aware of is that 
the CPU should follow L0's configuration when the VCPU is running. So if L0 
wants to trap a msr, then a read of this msr should cause a vmexit 
unconditionally, no matter who is running (L1, L2, L3, ...).

 msrs separately; there is no common way that suits all cases. If
 other features want to intercept a msr on nested entry, they can put
 the custom code in nested_vmx_merge_msr_bitmap.

Yes, if other features want to do it on 'nested' entry, they can fill 
nested_vmx_merge_msr_bitmap. But in the non-nested case, it should be our 
responsibility to handle it correctly. How about adding the following check:

if (type & MSR_TYPE_R && !test_bit(msr, vmcs01_msr_bitmap) &&
    !test_bit(msr, msr_bitmap_l1 + 0x000 / f))
	__clear_bit(msr, msr_bitmap_nested + 0x000 / f);


 
 
 Thanks,
 Wincy


Best regards,
Yang




RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap

2015-01-28 Thread Zhang, Yang Z
Wincy Van wrote on 2015-01-28:
 On Wed, Jan 28, 2015 at 8:33 PM, Zhang, Yang Z yang.z.zh...@intel.com
 wrote:
 
 You are right, but this does not fit all cases; we should customize
 the nested_msr_bitmap.
 e.g. currently L0 wants to intercept reads of some of the x2apic msrs:
 ...snip...
 But L1 may not want this. So I think we had better deal with such
 
 Actually, L0 is totally unaware of L2. The only thing L0 is aware of
 is that the CPU should follow L0's configuration when the VCPU is
 running. So if L0 wants to trap a msr, then a read of this msr should
 cause a vmexit unconditionally, no matter who is running (L1, L2,
 L3, ...).
 
 msrs separately; there is no common way that suits all cases.
 If other features want to intercept a msr on nested entry, they
 can put the custom code in nested_vmx_merge_msr_bitmap.
 
 Yes, if other features want to do it on 'nested' entry, they can
 fill nested_vmx_merge_msr_bitmap. But in the non-nested case, it
 should be our responsibility to handle it correctly. How about adding
 the following check:
 
 if (type & MSR_TYPE_R && !test_bit(msr, vmcs01_msr_bitmap) &&
     !test_bit(msr, msr_bitmap_l1 + 0x000 / f))
 	__clear_bit(msr, msr_bitmap_nested + 0x000 / f);
 
 
 Anyway, this is not necessary for your current patch. We can consider
 it later if other features really do use it.
 
 
 Yep, I know what you mean now: for other msrs whose accesses are not
 forwarded by a mechanism like the virtual-apic page, we should
 intercept them unconditionally. I think we should ensure the msr can
 be allowed before calling nested_vmx_disable_intercept_for_msr; if L0
 wants to intercept it, just do not call nested_vmx_disable_intercept_for_msr.

Yes, this is a solution. But I prefer to handle it in the nested code path, 
since others may not be familiar with nested and may block it by mistake.

 
 !test_bit(msr, vmcs01_msr_bitmap) will introduce a problem: some of
 the msrs will be affected by vmcs01_msr_bitmap, TMCCT and TPR, for example.
 Intercepting reads of these msrs is okay, but it is not efficient.

TMCCT is always trapped by most VMMs. I don't think the TPR is trapped in KVM.

 
 Thanks,
 Wincy


Best regards,
Yang




RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap

2015-01-28 Thread Zhang, Yang Z
Wincy Van wrote on 2015-01-24:
 When L2 is using x2apic, we can use virtualize x2apic mode to gain higher
 performance, especially in apicv case.
 
 This patch also introduces nested_vmx_check_apicv_controls for the nested
 apicv patches.
 
 Signed-off-by: Wincy Van fanwenyi0...@gmail.com
 ---

...snip...

 static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
 {
 	if (!longmode_only)
 @@ -8344,7 +8394,68 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu,
 static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
 					       struct vmcs12 *vmcs12)
 {
 -	return false;
 +	struct page *page;
 +	unsigned long *msr_bitmap;
 +
 +	if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
 +		return false;
 +
 +	page = nested_get_page(vcpu, vmcs12->msr_bitmap);
 +	if (!page) {
 +		WARN_ON(1);
 +		return false;
 +	}
 +	msr_bitmap = (unsigned long *)kmap(page);
 +	if (!msr_bitmap) {
 +		nested_release_page_clean(page);
 +		WARN_ON(1);
 +		return false;
 +	}
 +
 +	memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE);
 +
 +	if (nested_cpu_has_virt_x2apic_mode(vmcs12))
 +		/* TPR is allowed */
 +		nested_vmx_disable_intercept_for_msr(msr_bitmap,
 +			vmx_msr_bitmap_nested,
 +			APIC_BASE_MSR + (APIC_TASKPRI >> 4),
 +			MSR_TYPE_R | MSR_TYPE_W);

I don't understand what this function does. Per my understanding, you only 
need to set (vmx_msr_bitmap_nested = vmcs01->msr_bitmap | vmcs12->msr_bitmap) 
and inject a nested vmexit to L1 if the bit in vmcs12->msr_bitmap is set. 
Am I missing some patches?

Best regards,
Yang




RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap

2015-01-28 Thread Zhang, Yang Z
Wincy Van wrote on 2015-01-28:
 On Wed, Jan 28, 2015 at 7:25 PM, Zhang, Yang Z yang.z.zh...@intel.com
 wrote:
 Wincy Van wrote on 2015-01-28:
 On Wed, Jan 28, 2015 at 4:05 PM, Zhang, Yang Z
 yang.z.zh...@intel.com
 wrote:
 ...snip...
 
 I don't understand what this function does. Per my understanding,
 you only need to set (vmx_msr_bitmap_nested = vmcs01->msr_bitmap |
 vmcs12->msr_bitmap) and inject a nested vmexit to L1 if the bit in
 vmcs12->msr_bitmap is set. Am I missing some patches?
 
 In the beginning, I wanted to do vmcs01->msr_bitmap |
 vmcs12->msr_bitmap, but I remember that there isn't an instruction to
 OR two whole pages together efficiently, so I do the bit-or
 operation in nested_vmx_disable_intercept_for_msr. If the hardware
 does not support this, I think it is faster if we deal with the bits
 on demand.
 nested_vmx_merge_msr_bitmap is used to merge L0's and L1's bitmaps;
 any feature can put its logic there.
 
 You construct the nested_msr_bitmap based only on vmcs12->msr_bitmap;
 what happens if vmcs01->msr_bitmap wants to trap this msr?
 
 
 If L0 wants to intercept a msr, we should set
 vmx_msr_bitmap_legacy(_x2apic) and vmx_msr_bitmap_longmode(_x2apic),
 and those bitmaps should only be loaded on non-nested entry.
 Currently we only clear the corresponding bits if L1 enables
 virtualize x2apic mode; all the other bits are set. On nested
 entry, we load nested_msr_bitmap; on nested vmexit, we restore L0's
 bitmap.
 On nested entry, L0 only cares about L2's msr accesses, not L1's. I
 think that on nested entry, nested_msr_bitmap is vmcs01->msr_bitmap,
 as you mentioned above.

Mmm... I think that if the bit is set in vmcs01->msr_bitmap, it means that 
whenever the VCPU (no matter whether it is in nested guest mode or not) 
accesses this msr, it should cause a vmexit, not only for L1. That's why we 
need to construct vmcs02->msr_bitmap based on (vmcs01->msr_bitmap ORed with 
vmcs12->msr_bitmap).

 
 Thanks,
 Wincy


Best regards,
Yang



RE: [PATCH v4 2/6] KVM: nVMX: Enable nested virtualize x2apic mode

2015-01-28 Thread Zhang, Yang Z
Wincy Van wrote on 2015-01-28:
 When L2 is using x2apic, we can use virtualize x2apic mode to gain higher
 performance, especially in apicv case.
 
 This patch also introduces nested_vmx_check_apicv_controls for the nested
 apicv patches.
 
 Signed-off-by: Wincy Van fanwenyi0...@gmail.com
 ---
  arch/x86/kvm/vmx.c |  114 +++-
  1 files changed, 112 insertions(+), 2 deletions(-)
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 787f886..9d11a93 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -1108,6 +1108,11 @@ static inline bool nested_cpu_has_xsaves(struct vmcs12 *vmcs12)
 		vmx_xsaves_supported();
 }
 
 +static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12)
 +{
 +	return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
 +}
 +
 static inline bool is_exception(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
 @@ -2395,6 +2400,7 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 	nested_vmx_secondary_ctls_low = 0;
 	nested_vmx_secondary_ctls_high =
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
 +		SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
 		SECONDARY_EXEC_WBINVD_EXITING |
 		SECONDARY_EXEC_XSAVES;
 
 @@ -4155,6 +4161,52 @@ static void __vmx_enable_intercept_for_msr(unsigned long *msr_bitmap,
 	}
 }
 
 +/*
 + * If a msr is allowed by L0, we should check whether it is allowed by L1.
 + * The corresponding bit will be cleared unless both of L0 and L1 allow it.
 + */
 +static void nested_vmx_disable_intercept_for_msr(unsigned long *msr_bitmap_l1,
 +						 unsigned long *msr_bitmap_nested,
 +						 u32 msr, int type)
 +{
 +	int f = sizeof(unsigned long);
 +
 +	if (!cpu_has_vmx_msr_bitmap()) {
 +		WARN_ON(1);
 +		return;
 +	}
 +
 +	/*
 +	 * See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals
 +	 * have the write-low and read-high bitmap offsets the wrong way round.
 +	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
 +	 */
 +	if (msr <= 0x1fff) {
 +		if (type & MSR_TYPE_R &&
 +		   !test_bit(msr, msr_bitmap_l1 + 0x000 / f))
 +			/* read-low */
 +			__clear_bit(msr, msr_bitmap_nested + 0x000 / f);
 +
 +		if (type & MSR_TYPE_W &&
 +		   !test_bit(msr, msr_bitmap_l1 + 0x800 / f))
 +			/* write-low */
 +			__clear_bit(msr, msr_bitmap_nested + 0x800 / f);
 +
 +	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
 +		msr &= 0x1fff;
 +		if (type & MSR_TYPE_R &&
 +		   !test_bit(msr, msr_bitmap_l1 + 0x400 / f))
 +			/* read-high */
 +			__clear_bit(msr, msr_bitmap_nested + 0x400 / f);
 +
 +		if (type & MSR_TYPE_W &&
 +		   !test_bit(msr, msr_bitmap_l1 + 0xc00 / f))
 +			/* write-high */
 +			__clear_bit(msr, msr_bitmap_nested + 0xc00 / f);
 +
 +	}
 +}
 +
 static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
 {
 	if (!longmode_only)
 @@ -8350,7 +8402,59 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu,
 static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
 					       struct vmcs12 *vmcs12)
 {
 -	return false;
 +	struct page *page;
 +	unsigned long *msr_bitmap;
 +
 +	if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
 +		return false;
 +
 +	page = nested_get_page(vcpu, vmcs12->msr_bitmap);
 +	if (!page) {
 +		WARN_ON(1);
 +		return false;
 +	}
 +	msr_bitmap = (unsigned long *)kmap(page);
 +	if (!msr_bitmap) {
 +		nested_release_page_clean(page);
 +		WARN_ON(1);
 +		return false;
 +	}
 +
 +	if (nested_cpu_has_virt_x2apic_mode(vmcs12)) {
 +		/* TPR is allowed */
 +		nested_vmx_disable_intercept_for_msr(msr_bitmap,
 +			vmx_msr_bitmap_nested,
 +			APIC_BASE_MSR + (APIC_TASKPRI >> 4),
 +			MSR_TYPE_R | MSR_TYPE_W);
 +	} else
 +		__vmx_enable_intercept_for_msr(
 +			vmx_msr_bitmap_nested,
 +			APIC_BASE_MSR + (APIC_TASKPRI >> 4),
 +			MSR_TYPE_R | MSR_TYPE_W);
 +	kunmap(page);
 +	nested_release_page_clean(page);
 +
 +	return true;
 +}
 +
 +static int nested_vmx_check_apicv_controls(struct kvm_vcpu *vcpu,
 +	

RE: [PATCH v4 2/6] KVM: nVMX: Enable nested virtualize x2apic mode

2015-01-28 Thread Zhang, Yang Z
Wincy Van wrote on 2015-01-29:
 On Thu, Jan 29, 2015 at 10:54 AM, Zhang, Yang Z yang.z.zh...@intel.com
 wrote:
  @@ -8646,7 +8750,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
  		else
  			vmcs_write64(APIC_ACCESS_ADDR,
  				page_to_phys(vmx->nested.apic_access_page));
  -	} else if (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) {
  +	} else if (!(nested_cpu_has_virt_x2apic_mode(vmcs12)) &&
  +		   (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))) {
  		exec_control |=
 
 You don't load L2's apic page correctly in your patch when x2apic mode
 is used. Here is the right change for prepare_vmcs02() (maybe other
 places need changes too):
 
  @@ -8585,7 +8585,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
  			   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
  			exec_control |= vmcs12->secondary_vm_exec_control;
  
  -		if (exec_control & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) {
  +		if (exec_control & (SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
  +				    SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) {
  			/*
  			 * If translation failed, no matter: This feature asks
  			 * to exit when accessing the given address, and if it
  @@ -8594,7 +8595,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
  			 */
  			if (!vmx->nested.apic_access_page)
  				exec_control &=
  -					~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
  +					~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
  +					  SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
  			else
  				vmcs_write64(APIC_ACCESS_ADDR,
  					page_to_phys(vmx->nested.apic_access_page));
 
 
 I think we don't need to do this: if L1 enables x2apic mode, we have
 already checked that vmcs12's SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES
 is 0. "exec_control |= vmcs12->secondary_vm_exec_control;" merges L1's
 settings, including x2apic mode. The special case is
 vm_need_virtualize_apic_accesses: if vm_need_virtualize_apic_accesses
 returns true, nested_cpu_has_virt_x2apic_mode will prevent us from
 setting the apic access bit.

I just realized that we are talking about different things. I was thinking 
about VIRTUAL_APIC_PAGE_ADDR (my mistake :)), and that has already been 
handled correctly. For APIC_ACCESS_ADDR, you are right: the current 
implementation handles it well.

 
 
 Thanks,
 Wincy


Best regards,
Yang




RE: [PATCH v4 0/6] KVM: nVMX: Enable nested apicv support

2015-01-28 Thread Zhang, Yang Z
Wincy Van wrote on 2015-01-28:
 v1 -> v2:
   Use a spin lock to ensure vmcs12 is safe when doing nested
   posted interrupt delivery.
 v2 -> v3:
   1. Add a new field in nested_vmx to avoid the spin lock in v2.
   2. Drop sending an eoi to L1 when doing nested interrupt delivery.
   3. Use the hardware MSR bitmap to enable nested virtualize x2apic
      mode.
 v3 -> v4:
   1. Optimize nested msr bitmap merging.
   2. Allocate the nested msr bitmap only when nested == 1.
   3. Inline the nested vmx control checking functions.

This version looks good to me. Only one minor comment: the 
EXIT_REASON_APIC_WRITE vmexit is introduced by APIC register virtualization, 
not by virtual interrupt delivery, so it's better to add it in the 4th patch, 
not the 5th. (If there are no other comments, I guess Paolo can fix that up 
when applying.)

Reviewed-by: Yang Zhang yang.z.zh...@intel.com


 Wincy Van (6):
   KVM: nVMX: Use hardware MSR bitmap
   KVM: nVMX: Enable nested virtualize x2apic mode
   KVM: nVMX: Make nested control MSRs per-cpu
   KVM: nVMX: Enable nested apic register virtualization
   KVM: nVMX: Enable nested virtual interrupt delivery
   KVM: nVMX: Enable nested posted interrupt processing
   arch/x86/kvm/vmx.c |  580 +++-
   1 files changed, 480 insertions(+), 100 deletions(-)


Best regards,
Yang




RE: [PATCH 1/5] KVM: nVMX: Make nested control MSRs per-cpu.

2015-01-21 Thread Zhang, Yang Z
Wincy Van wrote on 2015-01-16:
 To enable nested apicv support, we need per-cpu vmx control MSRs:
   1. If the in-kernel irqchip is enabled, we can enable nested
      posted interrupts; we should set the posted intr bit in
      nested_vmx_pinbased_ctls_high.
   2. If the in-kernel irqchip is disabled, we can not enable nested
      posted interrupts, and the posted intr bit in
      nested_vmx_pinbased_ctls_high will be cleared.
 Since there could be different in-kernel-irqchip settings between VMs,
 different nested control MSRs are needed.

I'd suggest you check irqchip_in_kernel() instead of moving the whole set of 
control MSRs to per-vCPU.
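
[Note: a hedged sketch of what that check might look like, reusing names from 
the quoted patches; not the code that was merged. Instead of per-vCPU control 
MSRs, the posted-interrupt control could simply be refused when the in-kernel 
irqchip is absent:

	/* on nested entry, when validating vmcs12's pin-based controls */
	if (nested_cpu_has_posted_intr(vmcs12) &&
	    !irqchip_in_kernel(vcpu->kvm))
		return -EINVAL;
]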

Best regards,
Yang




RE: [PATCH v2 5/5] KVM: nVMX: Enable nested posted interrupt processing.

2015-01-21 Thread Zhang, Yang Z
Wincy Van wrote on 2015-01-21:
 On Wed, Jan 21, 2015 at 4:07 PM, Zhang, Yang Z yang.z.zh...@intel.com
 wrote:
 +	if (vector == vmcs12->posted_intr_nv &&
 +	    nested_cpu_has_posted_intr(vmcs12)) {
 +		if (vcpu->mode == IN_GUEST_MODE)
 +			apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
 +					    POSTED_INTR_VECTOR);
 +		else {
 +			r = -1;
 +			goto out;
 +		}
 +
 +		/*
 +		 * if posted intr is done by hardware, the
 +		 * corresponding eoi was sent to L0. Thus
 +		 * we should send eoi to L1 manually.
 +		 */
 +		kvm_apic_set_eoi_accelerated(vcpu,
 +			vmcs12->posted_intr_nv);
 
 Why is this necessary? As your comment mentions, it is done by
 hardware, not by L1, so why should L1 be aware of it?
 
 
 According to SDM 29.6, if the processor recognizes a posted interrupt,
 it will send an EOI to the LAPIC.
 If the posted intr is done by hardware, the processor sends the eoi to
 the hardware LAPIC, not L1's, just like the non-nested case (the
 physical interrupt is dismissed). So we should take care of L1's LAPIC
 and send an eoi to it.

No. You are not emulating the PI feature; you are just reusing the hardware's 
capability. So you don't need to let L1 know about it.

 
 
 Thanks,
 
 Wincy


Best regards,
Yang



RE: [PATCH v2 5/5] KVM: nVMX: Enable nested posted interrupt processing.

2015-01-21 Thread Zhang, Yang Z
Wincy Van wrote on 2015-01-20:
 If a vcpu has an interrupt pending in vmx non-root mode, we kick that
 vcpu to inject the interrupt in a timely manner. With posted-interrupt
 processing, the kick is not needed, and interrupts are fully taken care
 of by hardware.
 
 In nested vmx, this feature avoids many more vmexits than in non-nested vmx.
 
 This patch uses L0's POSTED_INTR_NV to avoid an unexpected interrupt if
 L1's vector is different from L0's. If the vcpu is in hardware non-root
 mode, we use a physical ipi to deliver posted interrupts; otherwise we
 deliver the interrupt to L1 and kick that vcpu out of nested non-root
 mode.
 
 Signed-off-by: Wincy Van fanwenyi0...@gmail.com
 ---
  arch/x86/kvm/vmx.c |  136 ++--
  1 files changed, 132 insertions(+), 4 deletions(-)
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index ea56e9f..cda9133 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -215,6 +215,7 @@ struct __packed vmcs12 {
 	u64 tsc_offset;
 	u64 virtual_apic_page_addr;
 	u64 apic_access_addr;
 +	u64 posted_intr_desc_addr;
 	u64 ept_pointer;
 	u64 eoi_exit_bitmap0;
 	u64 eoi_exit_bitmap1;
 @@ -334,6 +335,7 @@ struct __packed vmcs12 {
 	u32 vmx_preemption_timer_value;
 	u32 padding32[7]; /* room for future expansion */
 	u16 virtual_processor_id;
 +	u16 posted_intr_nv;
 	u16 guest_es_selector;
 	u16 guest_cs_selector;
 	u16 guest_ss_selector;
 @@ -387,6 +389,7 @@ struct nested_vmx {
 	/* The host-usable pointer to the above */
 	struct page *current_vmcs12_page;
 	struct vmcs12 *current_vmcs12;
 +	spinlock_t vmcs12_lock;
 	struct vmcs *current_shadow_vmcs;
 	/*
 	 * Indicates if the shadow vmcs must be updated with the
 @@ -406,6 +409,8 @@ struct nested_vmx {
 	 */
 	struct page *apic_access_page;
 	struct page *virtual_apic_page;
 +	struct page *pi_desc_page;
 +	struct pi_desc *pi_desc;
 	u64 msr_ia32_feature_control;
 
 	struct hrtimer preemption_timer;
 @@ -621,6 +626,7 @@ static int max_shadow_read_write_fields =
 
 static const unsigned short vmcs_field_to_offset_table[] = {
 	FIELD(VIRTUAL_PROCESSOR_ID, virtual_processor_id),
 +	FIELD(POSTED_INTR_NV, posted_intr_nv),
 	FIELD(GUEST_ES_SELECTOR, guest_es_selector),
 	FIELD(GUEST_CS_SELECTOR, guest_cs_selector),
 	FIELD(GUEST_SS_SELECTOR, guest_ss_selector),
 @@ -646,6 +652,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
 	FIELD64(TSC_OFFSET, tsc_offset),
 	FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr),
 	FIELD64(APIC_ACCESS_ADDR, apic_access_addr),
 +	FIELD64(POSTED_INTR_DESC_ADDR, posted_intr_desc_addr),
 	FIELD64(EPT_POINTER, ept_pointer),
 	FIELD64(EOI_EXIT_BITMAP0, eoi_exit_bitmap0),
 	FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1),
 @@ -798,6 +805,7 @@ static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
 static bool vmx_mpx_supported(void);
 static bool vmx_xsaves_supported(void);
 +static int vmx_vm_has_apicv(struct kvm *kvm);
 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
 static void vmx_set_segment(struct kvm_vcpu *vcpu,
 			    struct kvm_segment *var, int seg);
 @@ -1159,6 +1167,11 @@ static inline bool nested_cpu_has_vid(struct vmcs12 *vmcs12)
 	return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
 }
 
 +static inline bool nested_cpu_has_posted_intr(struct vmcs12 *vmcs12)
 +{
 +	return vmcs12->pin_based_vm_exec_control & PIN_BASED_POSTED_INTR;
 +}
 +
 static inline bool is_exception(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
 @@ -2362,6 +2375,9 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
 	vmx->nested.nested_vmx_pinbased_ctls_high |=
 		PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR |
 		PIN_BASED_VMX_PREEMPTION_TIMER;
 +	if (vmx_vm_has_apicv(vmx->vcpu.kvm))
 +		vmx->nested.nested_vmx_pinbased_ctls_high |=
 +			PIN_BASED_POSTED_INTR;
 
 	/* exit controls */
 	rdmsr(MSR_IA32_VMX_EXIT_CTLS,
 @@ -4267,6 +4283,46 @@ static int vmx_vm_has_apicv(struct kvm *kvm)
 	return enable_apicv && irqchip_in_kernel(kvm);
 }
 +static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
 +					       int vector)
 +{
 +	int r = 0;
 +	struct vmcs12 *vmcs12;
 +
 +	/*
 +	 * Since posted intr delivery is async,
 +	 * we must aquire a spin-lock to avoid
 +	 * the race of vmcs12.
 +	 */
 +	spin_lock(&to_vmx(vcpu)->nested.vmcs12_lock);
 +	vmcs12 = get_vmcs12(vcpu);
 +	if (!is_guest_mode(vcpu) || !vmcs12) {
 +		r = -1;
 +		goto out;
 +	}
 +	if (vector == vmcs12->posted_intr_nv &&
 +	   

RE: [PATCH 2/5] KVM: nVMX: Enable nested virtualize x2apic mode.

2015-01-21 Thread Zhang, Yang Z
Wincy Van wrote on 2015-01-16:
 When L2 is using x2apic, we can use virtualize x2apic mode to gain higher
 performance.
 
 This patch also introduces nested_vmx_check_apicv_controls for the nested
 apicv patches.
 
 Signed-off-by: Wincy Van fanwenyi0...@gmail.com

To enable x2apic, you should consider the behavior changes to rdmsr and 
wrmsr. I didn't see your patch handle that. Is that correct?
BTW, this patch has nothing to do with APICv; it's better not to use x2apic 
here, and to change to apicv in the following patch.
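
To illustrate the rdmsr/wrmsr point with a sketch (the helper below is
hypothetical, not something in this patch or in KVM): with virtualize x2apic
mode enabled, accesses to the x2APIC MSR range no longer exit unconditionally,
so L0 has to decide per-MSR whether L1 still wants the exit.

	static bool nested_vmx_x2apic_msr_intercepted(struct vmcs12 *vmcs12,
						      u32 msr)
	{
		if (msr < 0x800 || msr > 0x8ff)
			return true;	/* not an x2APIC MSR */
		if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
			return true;	/* mode off: keep intercepting */
		/* otherwise honor L1's MSR bitmap for this MSR
		 * (nested_vmx_msr_bitmap_test is a hypothetical helper) */
		return nested_vmx_msr_bitmap_test(vmcs12, msr);
	}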

 ---
  arch/x86/kvm/vmx.c |   49 ++++++++++++++++++++++++++++++++++++++++++++-
  1 files changed, 48 insertions(+), 1 deletions(-)
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 954dd54..10183ee
 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -1134,6 +1134,11 @@ static inline bool nested_cpu_has_xsaves(struct
 vmcs12 *vmcs12)
 vmx_xsaves_supported();
  }
 
 +static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12)
 +{
 +	return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
 +}
 +
  static inline bool is_exception(u32 intr_info)
  {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
 @@ -2426,6 +2431,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
 	vmx->nested.nested_vmx_secondary_ctls_low = 0;
 	vmx->nested.nested_vmx_secondary_ctls_high =
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
 +		SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
 		SECONDARY_EXEC_WBINVD_EXITING |
 		SECONDARY_EXEC_XSAVES;
 
 @@ -7333,6 +7339,9 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
 	case EXIT_REASON_APIC_ACCESS:
 		return nested_cpu_has2(vmcs12,
 			SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
 +	case EXIT_REASON_APIC_WRITE:
 +		/* apic_write should exit unconditionally. */
 +		return 1;

The APIC_WRITE vmexit is introduced by APIC register virtualization, not by 
virtualize x2apic mode. Move it to the next patch.

 	case EXIT_REASON_EPT_VIOLATION:
 		/*
 		 * L0 always deals with the EPT violation. If nested EPT is
 @@ -8356,6 +8365,38 @@ static void vmx_start_preemption_timer(struct kvm_vcpu *vcpu)
 		      ns_to_ktime(preemption_timeout),
 		      HRTIMER_MODE_REL);
  }
 
 +static inline int nested_vmx_check_virt_x2apic(struct kvm_vcpu *vcpu,
 +					       struct vmcs12 *vmcs12)
 +{
 +	if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
 +		return -EINVAL;
 +	return 0;
 +}
 +
 +static int nested_vmx_check_apicv_controls(struct kvm_vcpu *vcpu,
 +					   struct vmcs12 *vmcs12)
 +{
 +	int r;
 +
 +	if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
 +		return 0;
 +
 +	r = nested_vmx_check_virt_x2apic(vcpu, vmcs12);
 +	if (r)
 +		goto fail;
 +
 +	/* tpr shadow is needed by all apicv features. */
 +	if (!nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) {
 +		r = -EINVAL;
 +		goto fail;
 +	}
 +
 +	return 0;
 +
 +fail:
 +	return r;
 +}
 +
 +
  static int nested_vmx_check_msr_switch(struct kvm_vcpu *vcpu,
 				       unsigned long count_field,
 				       unsigned long addr_field,
 @@ -8649,7 +8690,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 		else
 			vmcs_write64(APIC_ACCESS_ADDR,
 				page_to_phys(vmx->nested.apic_access_page));
 -	} else if (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) {
 +	} else if (!(nested_cpu_has_virt_x2apic_mode(vmcs12)) &&
 +		   (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))) {
 		exec_control |=
 			SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
 		kvm_vcpu_reload_apic_access_page(vcpu);
 @@ -8856,6 +8898,11 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 		return 1;
 	}
 
 +	if (nested_vmx_check_apicv_controls(vcpu, vmcs12)) {
 +		nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
 +		return 1;
 +	}
 +
 	if (nested_vmx_check_msr_switch_controls(vcpu, vmcs12)) {
 		nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
 		return 1;
 --
 1.7.1

Best regards,
Yang




RE: [PATCH 0/5] KVM: nVMX: Enable nested apicv support.

2015-01-19 Thread Zhang, Yang Z
Wincy Van wrote on 2015-01-20:
 Hi, Yang,
 
 Could you please have a look at this patch set?
 Your comment is very appreciated!

Sure. I will take a look. 

 
 
 Thanks,
 
 Wincy


Best regards,
Yang




RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts

2014-12-23 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2014-12-23:
 
 
 On 23/12/2014 10:07, Wu, Feng wrote:
 On 23/12/2014 01:37, Zhang, Yang Z wrote:
 I don't quite understand it. If user set an interrupt's affinity
 to a CPU, but he still see the interrupt delivers to other CPUs in host.
 Do you think it is a right behavior?
 
 No, the interrupt is not delivered at all in the host.  Normally you'd have:
 
 - interrupt delivered to CPU from host affinity
 
 - VFIO interrupt handler writes to irqfd
 
 - interrupt delivered to vCPU from guest affinity
 
 Here, you just skip the first two steps.  The interrupt is
 delivered to the thread that is running the vCPU directly, so the
 host affinity is bypassed entirely.
 
 ... unless you are considering the case where the vCPU is blocked
 and the host is processing the posted interrupt wakeup vector.  In
 that case yes, it would be better to set NDST to a CPU matching the host 
 affinity.
 
 In my understanding, wakeup vector should have no relationship with
 the host affinity of the irq. Wakeup notification event should be
 delivered to the pCPU which the vCPU was blocked on. And in kernel's
 point of view, the irq is not associated with the wakeup vector, right?
 
 That is correct indeed.  It is not associated to the wakeup vector,
 hence this patch is right, I think.
 
 However, the wakeup vector has the same function as the VFIO interrupt
 handler, so you could argue that it is tied to the host affinity
 rather than the guest.  Let's wait for Yang to answer.

Actually, that's my original question too. I am wondering what happens if the 
user changes the assigned device's affinity in the host's /proc/irq/? If 
ignoring it is acceptable, then this patch is ok. But the discussion seems out 
of my scope; we need some experts to give their opinion, since it will impact 
the user experience.
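
(For context, the non-posted path described above boils down to this host-side
handler, simplified from VFIO's MSI handler; the irqfd then injects the
interrupt according to guest affinity:)

	static irqreturn_t vfio_msihandler(int irq, void *arg)
	{
		struct eventfd_ctx *trigger = arg;

		/* the host takes the interrupt and just signals the irqfd */
		eventfd_signal(trigger, 1);
		return IRQ_HANDLED;
	}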

 
 Paolo


Best regards,
Yang


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts

2014-12-23 Thread Zhang, Yang Z
Jiang Liu wrote on 2014-12-24:
 On 2014/12/24 9:38, Zhang, Yang Z wrote:
 Paolo Bonzini wrote on 2014-12-23:
 
 
 On 23/12/2014 10:07, Wu, Feng wrote:
 On 23/12/2014 01:37, Zhang, Yang Z wrote:
 I don't quite understand it. If user set an interrupt's affinity
 to a CPU, but he still see the interrupt delivers to other CPUs in host.
 Do you think it is a right behavior?
 
 No, the interrupt is not delivered at all in the host.  Normally you'd 
 have:
 
 - interrupt delivered to CPU from host affinity
 
 - VFIO interrupt handler writes to irqfd
 
 - interrupt delivered to vCPU from guest affinity
 
 Here, you just skip the first two steps.  The interrupt is
 delivered to the thread that is running the vCPU directly, so the
 host affinity is bypassed entirely.
 
 ... unless you are considering the case where the vCPU is blocked
 and the host is processing the posted interrupt wakeup vector.
 In that case yes, it would be better to set NDST to a CPU
 matching the host
 affinity.
 
 In my understanding, wakeup vector should have no relationship
 with the host affinity of the irq. Wakeup notification event
 should be delivered to the pCPU which the vCPU was blocked on. And
 in kernel's point of view, the irq is not associated with the wakeup 
 vector, right?
 
 That is correct indeed.  It is not associated to the wakeup vector,
 hence this patch is right, I think.
 
 However, the wakeup vector has the same function as the VFIO
 interrupt handler, so you could argue that it is tied to the host
 affinity rather than the guest.  Let's wait for Yang to answer.
 
 Actually, that's my original question too. I am wondering what
 happens if the
 user changes the assigned device's affinity in host's /proc/irq/? If
 ignore it is acceptable, then this patch is ok. But it seems the
 discussion out of my scope, need some experts to tell us their idea since it 
 will impact the user experience.
 Hi Yang,

Hi Jiang,

   Originally we have a proposal to return failure when user sets IRQ
 affinity through native OS interfaces if an IRQ is in PI mode. But
 that proposal will break CPU hot-removal because OS needs to migrate
 away all IRQs binding to the CPU to be offlined. Then we propose
 saving user IRQ affinity setting without changing hardware
 configuration (keeping PI configuration). Later when PI mode is
 disabled, the cached affinity setting will be used to setup IRQ
 destination for native OS. On the other hand, for IRQ in PI mode, it
 won't be delivered to native OS, so user may not sense that the IRQ is 
 delivered to CPUs other than those in the affinity set.

The IRQ is still there but will be delivered to the host in the form of a PI 
event (if the vCPU is running in root mode). I am not sure whether those 
interrupts should be reflected in /proc/interrupts. If the answer is yes, then 
which entry should be used: a new PI entry, or the original IRQ entry?
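
For example, if we added a dedicated line, the arch_show_interrupts() hunk
would look roughly like this (just a sketch; the "PIN" name is made up):

	seq_printf(p, "%*s: ", prec, "PIN");
	for_each_online_cpu(j)
		seq_printf(p, "%10u ", irq_stats(j)->kvm_posted_intr_ipis);
	seq_puts(p, "  Posted-interrupt notification event\n");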

 In that aspect, I think it's acceptable:) Regards!

Yes, if all of you (especially the IRQ maintainer) think it is acceptable, 
then we can follow the current implementation and document it.

 Gerry
 
 
 Paolo
 
 
 Best regards,
 Yang
 



Best regards,
Yang


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts

2014-12-23 Thread Zhang, Yang Z
Wu, Feng wrote on 2014-12-24:
 
 
 Zhang, Yang Z wrote on 2014-12-24:
 Cc: io...@lists.linux-foundation.org; linux-ker...@vger.kernel.org;
 KVM list; Eric Auger
 Subject: RE: [v3 06/26] iommu, x86: No need to migrating irq for
 VT-d Posted-Interrupts
 
 Jiang Liu wrote on 2014-12-24:
 On 2014/12/24 9:38, Zhang, Yang Z wrote:
 Paolo Bonzini wrote on 2014-12-23:
 
 
 On 23/12/2014 10:07, Wu, Feng wrote:
 On 23/12/2014 01:37, Zhang, Yang Z wrote:
 I don't quite understand it. If user set an interrupt's affinity
 to a CPU, but he still see the interrupt delivers to other CPUs
 in host. Do you think it is a right behavior?
 
 No, the interrupt is not delivered at all in the host. Normally
 you'd have:
 
 - interrupt delivered to CPU from host affinity
 
 - VFIO interrupt handler writes to irqfd
 
 - interrupt delivered to vCPU from guest affinity
 
 Here, you just skip the first two steps.  The interrupt is
 delivered to the thread that is running the vCPU directly, so
 the host affinity is bypassed entirely.
 
 ... unless you are considering the case where the vCPU is
 blocked and the host is processing the posted interrupt wakeup vector.
 In that case yes, it would be better to set NDST to a CPU
 matching the host
 affinity.
 
 In my understanding, wakeup vector should have no relationship
 with the host affinity of the irq. Wakeup notification event
 should be delivered to the pCPU which the vCPU was blocked on.
 And in kernel's point of view, the irq is not associated with
 the wakeup vector,
 right?
 
 That is correct indeed.  It is not associated to the wakeup
 vector, hence this patch is right, I think.
 
 However, the wakeup vector has the same function as the VFIO
 interrupt handler, so you could argue that it is tied to the
 host affinity rather than the guest.  Let's wait for Yang to answer.
 
 Actually, that's my original question too. I am wondering what
 happens if the
 user changes the assigned device's affinity in host's /proc/irq/? If
 ignore it is acceptable, then this patch is ok. But it seems the
 discussion out of my scope, need some experts to tell us their idea
 since it will impact the user experience. Hi Yang,
 
 Hi Jiang,
 
 Originally we have a proposal to return failure when user sets
 IRQ affinity through native OS interfaces if an IRQ is in PI mode.
 But that proposal will break CPU hot-removal because OS needs to
 migrate away all IRQs binding to the CPU to be offlined. Then we
 propose saving user IRQ affinity setting without changing hardware
 configuration (keeping PI configuration). Later when PI mode is
 disabled, the cached affinity setting will be used to setup IRQ
 destination for native OS. On the other hand, for IRQ in PI mode,
 it won't be delivered to native OS, so user may not sense that the
 IRQ is
 delivered to CPUs other than those in the affinity set.
 
 The IRQ is still there but will be delivered to host in the form of
 PI event(if the VCPU is running in root-mode). I am not sure whether
 those interrupts should be reflected in /proc/interrupts? If the
 answer is yes, then which entries should be used, a new PI entry or
 use the
 original IRQ entry?
 
 Even though, setting the affinity of the IRQ in host should not affect
 the destination of the PI event (normal notification event of wakeup

This is your implementation. To me, disabling PI if the vCPU is going to run 
on a CPU outside the IRQ affinity bitmap would also be acceptable. And it 
would keep the user interface looking the same as before.

Hi Thomas, Ingo, Peter

Can you guys help to review this patch? I would really appreciate it if you 
could give some feedback.

 notification event), because the destination of the PI event is
 determined in NDST field of Posted-interrupts descriptor and PI
 notification vector is global. Just had a discussion with Jiang
 offline, maybe we can add the statistics information for the notification 
 vector in /proc/interrupts just like any other global interrupts.
 
 Thanks,
 Feng
 
 
 In that aspect, I think it's acceptable:) Regards!
 
 Yes, if all of you guys(especially the IRQ maintainer) are think it
 is acceptable then we can follow current implementation and document it.
 
 Gerry
 
 
 Paolo
 
 
 Best regards,
 Yang
 
 
 
 
 Best regards,
 Yang



Best regards,
Yang


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts

2014-12-22 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2014-12-19:
 
 
 On 19/12/2014 02:46, Zhang, Yang Z wrote:
 If the IRQ is posted, its affinity is controlled by guest (irq
 --- vCPU  pCPU), it has no effect when host changes its affinity.
 
 That's the problem: User is able to changes it in host but it never
 takes effect since it is actually controlled by guest. I guess it
 will break the IRQ balance too.
 
 I don't think that's a problem.
 
 Controlling the affinity in the host affects which CPU in the host
 takes care of signaling the guest.
 
 If this signaling is done directly by the chipset, there is no need to
 do anything in the host and thus the host affinity can be bypassed.

I don't quite understand it. If the user sets an interrupt's affinity to a 
CPU, but still sees the interrupt delivered to other CPUs in the host, do you 
think that is the right behavior?

 
 Paolo


Best regards,
Yang

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI

2014-12-22 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2014-12-23:
 The problem is we still need to support PI with lowest priority
 delivery mode
 even if guest does not configure irq affinity via /proc/irq/. Don't we?
 
 Yes, but we can get the basic support working first.
 
 I and Feng talked on irc and agreed to start with a simple
 implementation. Then he can add support for vector hashing for all
 interrupts, both vt-d pi and normal LAPIC interrupts.

Agree. Obviously, current PI has some limitations in achieving the highest 
performance. We can start with a simple implementation, but we must ensure we 
don't change the hardware's behavior (from the guest's point of view). Feng's 
current implementation and the way I suggested are both ok, since neither of 
them can solve the problem well.

 
 Paolo



Best regards,
Yang



RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts

2014-12-18 Thread Zhang, Yang Z
Feng Wu wrote on 2014-12-12:
 We don't need to migrate the irqs for VT-d Posted-Interrupts here.
 When 'pst' is set in IRTE, the associated irq will be posted to guests
 instead of interrupt remapping. The destination of the interrupt is
 set in Posted-Interrupts Descriptor, and the migration happens during
 vCPU scheduling.
 
 However, we still update the cached irte here, which can be used when
 changing back to remapping mode.
 
 Signed-off-by: Feng Wu feng...@intel.com
 Reviewed-by: Jiang Liu jiang@linux.intel.com
 ---
  drivers/iommu/intel_irq_remapping.c | 6 +-
  1 file changed, 5 insertions(+), 1 deletion(-)
 
 diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
 index 48c2051..ab9057a 100644
 --- a/drivers/iommu/intel_irq_remapping.c
 +++ b/drivers/iommu/intel_irq_remapping.c
 @@ -977,6 +977,7 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask,
  {
 	struct intel_ir_data *ir_data = data->chip_data;
 	struct irte *irte = &ir_data->irte_entry;
 +	struct irte_pi *irte_pi = (struct irte_pi *)irte;
 	struct irq_cfg *cfg = irqd_cfg(data);
 	struct irq_data *parent = data->parent_data;
 	int ret;
 @@ -991,7 +992,10 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	 */
 	irte->vector = cfg->vector;
 	irte->dest_id = IRTE_DEST(cfg->dest_apicid);
 -	modify_irte(&ir_data->irq_2_iommu, irte);
 +
 +	/* We don't need to modify irte if the interrupt is for posting. */
 +	if (irte_pi->pst != 1)
 +		modify_irte(&ir_data->irq_2_iommu, irte);

What happens if the user changes the IRQ affinity manually?

 
   /*
* After this point, all the interrupts will start arriving


Best regards,
Yang


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI

2014-12-18 Thread Zhang, Yang Z
Feng Wu wrote on 2014-12-12:
 This patch defines a new interface kvm_find_dest_vcpu for
 VT-d PI, which returns the destination vCPU of the
 interrupt for guests.
 
 Since VT-d PI cannot handle broadcast/multicast interrupts,
 here we only handle Fixed and Lowest priority interrupts.
 
 The current method of handling guest lowest priority interrupts
 is to use a counter 'apic_arb_prio' for each vCPU: we choose the
 vCPU with the smallest 'apic_arb_prio' and then increase it by 1.
 However, for VT-d PI, we cannot re-use this, since we no longer
 have control over 'apic_arb_prio' with posted interrupt direct
 delivery by hardware.
 
 Here, we introduce a similar way with 'apic_arb_prio' to handle guest
 lowest priority interrupts when VT-d PI is used. Here is the idea:
 - Each vCPU has a counter 'round_robin_counter'.
 - When a guest sets an interrupt to lowest priority, we choose the vCPU
   with the smallest 'round_robin_counter' as the destination, then
   increase it.
 
How can this work well? All subsequent interrupts are delivered to one vCPU? 
That doesn't look like the best solution; it needs more consideration. Also, I 
think you should take apic_arb_prio into consideration, since the priority is 
for the whole vCPU, not for one interrupt.
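
For reference, the selection logic described in the commit message amounts to
something like this sketch (assuming the proposed round_robin_counter field in
struct kvm_vcpu; irq is the lowest-priority interrupt being programmed):

	struct kvm_vcpu *vcpu, *dest = NULL;
	int i;

	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (!kvm_apic_match_dest(vcpu, NULL, irq->shorthand,
					 irq->dest_id, irq->dest_mode))
			continue;
		/* pick the matching vCPU with the smallest counter */
		if (!dest ||
		    vcpu->round_robin_counter < dest->round_robin_counter)
			dest = vcpu;
	}
	if (dest)
		dest->round_robin_counter++;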

Best regards,
Yang


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [v3 21/26] x86, irq: Define a global vector for VT-d Posted-Interrupts

2014-12-18 Thread Zhang, Yang Z
Feng Wu wrote on 2014-12-12:
 Currently, we use a global vector as the Posted-Interrupts
 Notification Event for all the vCPUs in the system. We need to
 introduce another global vector for VT-d Posted-Interrupts, which will
 be used to wake up a sleeping vCPU when an external interrupt from a
 direct-assigned device arrives for that vCPU.
 

Hi Feng, 

Since the idea of the two-global-vectors mechanism is from me, please add me 
to the comments.
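
For readers of the archive: the two-global-vectors idea is that
POSTED_INTR_VECTOR remains the notification vector consumed directly by the
CPU while the vCPU runs, and the new wakeup vector's host handler walks a
per-pCPU list of blocked vCPUs and kicks any with a pending posted interrupt.
A minimal sketch (the per-CPU list names are hypothetical; pi_test_on() tests
the ON bit):

	static void wakeup_handler(void)
	{
		struct kvm_vcpu *vcpu;
		int cpu = smp_processor_id();

		spin_lock(&per_cpu(blocked_vcpu_list_lock, cpu));
		list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_list, cpu),
				    blocked_vcpu_list) {
			struct pi_desc *pi_desc = &to_vmx(vcpu)->pi_desc;

			if (pi_test_on(pi_desc))
				kvm_vcpu_kick(vcpu);	/* wake it up */
		}
		spin_unlock(&per_cpu(blocked_vcpu_list_lock, cpu));
	}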

 Signed-off-by: Feng Wu feng...@intel.com
 ---
  arch/x86/include/asm/entry_arch.h  |  2 ++
  arch/x86/include/asm/hardirq.h     |  1 +
  arch/x86/include/asm/hw_irq.h      |  2 ++
  arch/x86/include/asm/irq_vectors.h |  1 +
  arch/x86/kernel/entry_64.S         |  2 ++
  arch/x86/kernel/irq.c              | 27 +++++++++++++++++++++++++++
  arch/x86/kernel/irqinit.c          |  2 ++
  7 files changed, 37 insertions(+)
 
 diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
 index dc5fa66..27ca0af 100644
 --- a/arch/x86/include/asm/entry_arch.h
 +++ b/arch/x86/include/asm/entry_arch.h
 @@ -23,6 +23,8 @@ BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR)
  #ifdef CONFIG_HAVE_KVM
  BUILD_INTERRUPT3(kvm_posted_intr_ipi, POSTED_INTR_VECTOR,
 		 smp_kvm_posted_intr_ipi)
 +BUILD_INTERRUPT3(kvm_posted_intr_wakeup_ipi, POSTED_INTR_WAKEUP_VECTOR,
 +		 smp_kvm_posted_intr_wakeup_ipi)
  #endif
 
  /*
 diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
 index 0f5fb6b..9866065 100644
 --- a/arch/x86/include/asm/hardirq.h
 +++ b/arch/x86/include/asm/hardirq.h
 @@ -14,6 +14,7 @@ typedef struct {
  #endif
  #ifdef CONFIG_HAVE_KVM
 	unsigned int kvm_posted_intr_ipis;
 +	unsigned int kvm_posted_intr_wakeup_ipis;
  #endif
 	unsigned int x86_platform_ipis;	/* arch dependent */
 	unsigned int apic_perf_irqs;
 diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
 index e7ae6eb..38fac9b 100644
 --- a/arch/x86/include/asm/hw_irq.h
 +++ b/arch/x86/include/asm/hw_irq.h
 @@ -29,6 +29,7 @@
  extern asmlinkage void apic_timer_interrupt(void);
  extern asmlinkage void x86_platform_ipi(void);
  extern asmlinkage void kvm_posted_intr_ipi(void);
 +extern asmlinkage void kvm_posted_intr_wakeup_ipi(void);
  extern asmlinkage void error_interrupt(void);
  extern asmlinkage void irq_work_interrupt(void);
 
 @@ -92,6 +93,7 @@ extern void trace_call_function_single_interrupt(void);
  #define trace_irq_move_cleanup_interrupt  irq_move_cleanup_interrupt
  #define trace_reboot_interrupt  reboot_interrupt
  #define trace_kvm_posted_intr_ipi kvm_posted_intr_ipi
 +#define trace_kvm_posted_intr_wakeup_ipi kvm_posted_intr_wakeup_ipi
  #endif /* CONFIG_TRACING */
 
  struct irq_domain;
 diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
 index b26cb12..dca94f2 100644
 --- a/arch/x86/include/asm/irq_vectors.h
 +++ b/arch/x86/include/asm/irq_vectors.h
 @@ -105,6 +105,7 @@
  /* Vector for KVM to deliver posted interrupt IPI */
  #ifdef CONFIG_HAVE_KVM
  #define POSTED_INTR_VECTOR		0xf2
 +#define POSTED_INTR_WAKEUP_VECTOR	0xf1
  #endif
 
  /*
 diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
 index e61c14a..a598447 100644
 --- a/arch/x86/kernel/entry_64.S
 +++ b/arch/x86/kernel/entry_64.S
 @@ -960,6 +960,8 @@ apicinterrupt X86_PLATFORM_IPI_VECTOR \
  #ifdef CONFIG_HAVE_KVM
  apicinterrupt3 POSTED_INTR_VECTOR \
 	kvm_posted_intr_ipi smp_kvm_posted_intr_ipi
 +apicinterrupt3 POSTED_INTR_WAKEUP_VECTOR \
 +	kvm_posted_intr_wakeup_ipi smp_kvm_posted_intr_wakeup_ipi
  #endif
 
  #ifdef CONFIG_X86_MCE_THRESHOLD
 diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
 index 922d285..47408c3 100644
 --- a/arch/x86/kernel/irq.c
 +++ b/arch/x86/kernel/irq.c
 @@ -237,6 +237,9 @@ __visible void smp_x86_platform_ipi(struct pt_regs *regs)
  }
 
  #ifdef CONFIG_HAVE_KVM
 +void (*wakeup_handler_callback)(void) = NULL;
 +EXPORT_SYMBOL_GPL(wakeup_handler_callback);
 +
  /*
   * Handler for POSTED_INTERRUPT_VECTOR.
   */
 @@ -256,6 +259,30 @@ __visible void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
 
 	set_irq_regs(old_regs);
  }
 +
 +/*
 + * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
 + */
 +__visible void smp_kvm_posted_intr_wakeup_ipi(struct pt_regs *regs)
 +{
 +	struct pt_regs *old_regs = set_irq_regs(regs);
 +
 +	ack_APIC_irq();
 +
 +	irq_enter();
 +
 +	exit_idle();
 +
 +	inc_irq_stat(kvm_posted_intr_wakeup_ipis);
 +
 +	if (wakeup_handler_callback)
 +		wakeup_handler_callback();
 +
 +	irq_exit();
 +
 +	set_irq_regs(old_regs);
 +}
 +
  #endif
 
  __visible void smp_trace_x86_platform_ipi(struct pt_regs *regs)
 diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
 index 70e181e..844673c 100644
 --- a/arch/x86/kernel/irqinit.c
 +++ b/arch/x86/kernel/irqinit.c
 @@ -144,6 +144,8 @@ static void __init apic_intr_init(void)
  #ifdef CONFIG_HAVE_KVM
 	/* IPI for KVM to deliver 

RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set

2014-12-18 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2014-12-18:
 
 
 On 18/12/2014 04:14, Wu, Feng wrote:
 
 
 linux-kernel-ow...@vger.kernel.org wrote on 
 mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Paolo:
 x...@kernel.org; Gleb Natapov; Paolo Bonzini; dw...@infradead.org;
 joro-zlv9swrftaidnm+yrof...@public.gmane.org; Alex Williamson;
 joro-zLv9SwRftAIdnm+Jiang
 Liu
 Cc: io...@lists.linux-foundation.org;
 linux-kernel-u79uwxl29ty76z2rm5m...@public.gmane.org; KVM list;
 Eric Auger
 Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is
 set
 
 
 
 On 12/12/2014 16:14, Feng Wu wrote:
 Currently, we don't support urgent interrupt, all interrupts are
 recognized as non-urgent interrupt, so we cannot send
 posted-interrupt when 'SN' is set.
 
 Can this happen?  If the vcpu is in guest mode, it cannot have been
 scheduled out, and that's the only case when SN is set.
 
 Paolo
 
 Currently, the only place where SN is set is vCPU is preempted and

If the vCPU is preempted, shouldn't the subsequent interrupts be ignored? What 
happens if a PI occurs while the vCPU is preempted?

 waiting for the next scheduling in the runqueue. But I am not sure
 whether we need to set SN for other purpose in future. Adding SN
 checking here is just to follow the Spec. non-urgent interrupts are
 suppressed
 when SN is set.
 
 I would change that to a WARN_ON_ONCE then.


Best regards,
Yang


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [v3 12/26] KVM: Initialize VT-d Posted-Interrupts Descriptor

2014-12-18 Thread Zhang, Yang Z
Feng Wu wrote on 2014-12-12:
 This patch initializes the VT-d Posted-Interrupts Descriptor.
 
 Signed-off-by: Feng Wu feng...@intel.com
 ---
  arch/x86/kvm/vmx.c | 27 +++++++++++++++++++++++++++
  1 file changed, 27 insertions(+)
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 0b1383e..66ca275 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -45,6 +45,7 @@
  #include <asm/perf_event.h>
  #include <asm/debugreg.h>
  #include <asm/kexec.h>
 +#include <asm/irq_remapping.h>
 
  #include "trace.h"
 
 @@ -4433,6 +4434,30 @@ static void ept_set_mmio_spte_mask(void)
 	kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
  }
 
 +static void pi_desc_init(struct vcpu_vmx *vmx)
 +{
 +	unsigned int dest;
 +
 +	if (!irq_remapping_cap(IRQ_POSTING_CAP))
 +		return;
 +
 +	/*
 +	 * Initialize Posted-Interrupt Descriptor
 +	 */
 +
 +	pi_clear_sn(&vmx->pi_desc);
 +	vmx->pi_desc.nv = POSTED_INTR_VECTOR;

Here.

 +
 +	/* Physical mode for Notification Event */
 +	vmx->pi_desc.ndm = 0;

And from here..

 +	dest = cpu_physical_id(vmx->vcpu.cpu);
 +
 +	if (x2apic_enabled())
 +		vmx->pi_desc.ndst = dest;
 +	else
 +		vmx->pi_desc.ndst = (dest << 8) & 0xFF00;
 +}
 +

..to here are useless. The right place to update the PI descriptor is where 
the vcpu gets loaded, not in initialization.
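
A sketch of what that could look like (hypothetical helper name, reusing the
NDST encoding from the quoted code):

	static void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
	{
		struct pi_desc *pi_desc = &to_vmx(vcpu)->pi_desc;
		unsigned int dest;

		if (!irq_remapping_cap(IRQ_POSTING_CAP))
			return;

		/* retarget notifications at the pCPU we are loaded on */
		dest = cpu_physical_id(cpu);
		if (x2apic_enabled())
			pi_desc->ndst = dest;
		else
			pi_desc->ndst = (dest << 8) & 0xFF00;
	}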

  /*
   * Sets up the vmcs for emulated real mode.
   */
 @@ -4476,6 +4501,8 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 
   vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR);
 	vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
 +
 + pi_desc_init(vmx);
   }
  
   if (ple_gap) {


Best regards,
Yang


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI

2014-12-18 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2014-12-19:
 
 
 On 18/12/2014 15:49, Zhang, Yang Z wrote:
 Here, we introduce a similar way with 'apic_arb_prio' to handle
 guest lowest priority interrtups when VT-d PI is used. Here is the
 ideas: - Each vCPU has a counter 'round_robin_counter'. - When
 guests sets an interrupts to lowest priority, we choose the vCPU
 with smallest 'round_robin_counter' as the destination, then
 increase it.
 
 How this can work well? All subsequent interrupts are delivered to
 one vCPU? It shouldn't be the best solution, need more consideration.
 
 Well, it's a hardware limitation.  The alternative (which is easy to

Agree, it is limited by hardware. But lowest priority distributes the 
interrupts more efficiently than fixed mode, and the current implementation 
essentially turns lowest priority mode into fixed mode. In an 
interrupt-intensive environment this may be a bottleneck, and the VM may not 
benefit greatly from VT-d PI. But agree again, it really is a hardware 
limitation.

 implement) is to only do PI for single-CPU interrupts.  This should
 work well for multiqueue NICs (and of course for UP guests :)), so
 perhaps it's a good idea to only support that as a first attempt.

The easier way is to deliver the interrupt to the first matched vCPU we find. 
The round_robin_counter really helps nothing here, since the interrupt is 
delivered by hardware directly.
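
That is, a first-match selection along these lines (hypothetical helper name):

	static struct kvm_vcpu *kvm_find_first_dest_vcpu(struct kvm *kvm,
							 struct kvm_lapic_irq *irq)
	{
		struct kvm_vcpu *vcpu;
		int i;

		kvm_for_each_vcpu(i, vcpu, kvm) {
			if (kvm_apic_match_dest(vcpu, NULL, irq->shorthand,
						irq->dest_id, irq->dest_mode))
				return vcpu;	/* point the posted IRTE here */
		}
		return NULL;
	}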

 
 Paolo
 
 Also, I think you should take the apic_arb_prio into consider since
 the priority is for the whole vCPU not for one interrupt.


Best regards,
Yang


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts

2014-12-18 Thread Zhang, Yang Z
Wu, Feng wrote on 2014-12-19:
 
 
 Zhang, Yang Z wrote on 2014-12-18:
 jiang@linux.intel.com
 Cc: eric.au...@linaro.org; linux-ker...@vger.kernel.org;
 io...@lists.linux-foundation.org; kvm@vger.kernel.org; Wu, Feng
 Subject: RE: [v3 06/26] iommu, x86: No need to migrating irq for
 VT-d Posted-Interrupts
 
 Feng Wu wrote on 2014-12-12:
 We don't need to migrate the irqs for VT-d Posted-Interrupts here.
 When 'pst' is set in IRTE, the associated irq will be posted to
 guests instead of interrupt remapping. The destination of the
 interrupt is set in Posted-Interrupts Descriptor, and the
 migration happens during vCPU scheduling.
 
 However, we still update the cached irte here, which can be used
 when changing back to remapping mode.
 
 Signed-off-by: Feng Wu feng...@intel.com
 Reviewed-by: Jiang Liu jiang@linux.intel.com
 ---
  drivers/iommu/intel_irq_remapping.c | 6 +-
  1 file changed, 5 insertions(+), 1 deletion(-) diff --git
 a/drivers/iommu/intel_irq_remapping.c
 b/drivers/iommu/intel_irq_remapping.c index 48c2051..ab9057a
 100644
 --- a/drivers/iommu/intel_irq_remapping.c +++
 b/drivers/iommu/intel_irq_remapping.c @@ -977,6 +977,7 @@
 intel_ir_set_affinity(struct irq_data *data, const struct cpumask
 *mask,  {
 struct intel_ir_data *ir_data = data-chip_data;struct irte 
 *irte =
  ir_data-irte_entry; +struct irte_pi *irte_pi = (struct irte_pi
  *)irte;struct irq_cfg *cfg = irqd_cfg(data);   struct irq_data *parent
  = data-parent_data;   int ret;
 @@ -991,7 +992,10 @@ intel_ir_set_affinity(struct irq_data *data,
 const struct cpumask *mask,
  */
 irte-vector = cfg-vector;
 irte-dest_id = IRTE_DEST(cfg-dest_apicid);
 -   modify_irte(ir_data-irq_2_iommu, irte);
 +
 +   /* We don't need to modify irte if the interrupt is for posting. */
 +   if (irte_pi-pst != 1)
 +   modify_irte(ir_data-irq_2_iommu, irte);
 
 What happens if user changes the IRQ affinity manually?
 
 If the IRQ is posted, its affinity is controlled by guest (irq ---
 vCPU  pCPU), it has no effect when host changes its affinity.

That's the problem: the user is able to change it in the host, but it never 
takes effect since it is actually controlled by the guest. I guess it will 
break IRQ balancing too.

 
 Thanks,
 Feng
 
 
 
 /*
  * After this point, all the interrupts will start arriving
 
 
 Best regards,
 Yang



Best regards,
Yang


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI

2014-12-18 Thread Zhang, Yang Z
Wu, Feng wrote on 2014-12-19:
 
 
 Paolo Bonzini wrote on 2014-12-19:
 jiang@linux.intel.com
 Cc: eric.au...@linaro.org; linux-ker...@vger.kernel.org;
 io...@lists.linux-foundation.org; kvm@vger.kernel.org
 Subject: Re: [v3 13/26] KVM: Define a new interface
 kvm_find_dest_vcpu() for VT-d PI
 
 
 
 On 18/12/2014 15:49, Zhang, Yang Z wrote:
 Here, we introduce a similar way with 'apic_arb_prio' to handle
 guest lowest priority interrtups when VT-d PI is used. Here is
 the
 ideas: - Each vCPU has a counter 'round_robin_counter'. - When
 guests sets an interrupts to lowest priority, we choose the vCPU
 with smallest 'round_robin_counter' as the destination, then
 increase it.
 
 How this can work well? All subsequent interrupts are delivered to
 one vCPU? It shouldn't be the best solution, need more consideration.
 
 Well, it's a hardware limitation.  The alternative (which is easy to
 implement) is to only do PI for single-CPU interrupts.  This should
 work well for multiqueue NICs (and of course for UP guests :)), so
 perhaps it's a good idea to only support that as a first attempt.
 
 Paolo
 
 Paolo, what do you mean by single-CPU interrupts? Do you mean we

It should be the same idea as I mentioned on another thread: deliver the 
interrupt to a single vCPU (maybe the first matched vCPU?).

 don't support lowest priority interrupts for PI? But Linux OS uses
 lowest priority for most of the case? If so, we can hardly get benefit
 from this feature for Linux guest OS.
 
 Thanks,
 Feng
 
 
 Also, I think you should take the apic_arb_prio into consider
 since the priority is for the whole vCPU not for one interrupt.


Best regards,
Yang


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set

2014-12-18 Thread Zhang, Yang Z
Wu, Feng wrote on 2014-12-19:
 
 
 iommu-boun...@lists.linux-foundation.org wrote on 
 mailto:iommu-boun...@lists.linux-foundation.org] On Behalf Of:
 Cc: io...@lists.linux-foundation.org; linux-ker...@vger.kernel.org;
 kvm@vger.kernel.org
 Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is
 set
 
 Paolo Bonzini wrote on 2014-12-18:
 
 
 On 18/12/2014 04:14, Wu, Feng wrote:
 
 
 linux-kernel-ow...@vger.kernel.org wrote on
 mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Paolo:
 x...@kernel.org; Gleb Natapov; Paolo Bonzini;
 dw...@infradead.org;
 joro-zlv9swrftaidnm+yrof...@public.gmane.org; Alex Williamson;
 joro-zLv9SwRftAIdnm+Jiang
 Liu
 Cc: io...@lists.linux-foundation.org;
 linux-kernel-u79uwxl29ty76z2rm5m...@public.gmane.org; KVM list;
 Eric Auger
 Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt when 'SN'
 is set
 
 
 
 On 12/12/2014 16:14, Feng Wu wrote:
 Currently, we don't support urgent interrupt, all interrupts
 are recognized as non-urgent interrupt, so we cannot send
 posted-interrupt when 'SN' is set.
 
 Can this happen?  If the vcpu is in guest mode, it cannot have
 been scheduled out, and that's the only case when SN is set.
 
 Paolo
 
 Currently, the only place where SN is set is vCPU is preempted
 and
 
 If the vCPU is preempted, shouldn't the subsequent be ignored? What
 happens if a PI is occurs when vCPU is preempted?
 
 If a vCPU is preempted, the 'SN' bit is set, the subsequent interrupts
 are suppressed for posting.

I mean, what happens if we don't set the SN bit? From my point of view, if the 
preempter has already disabled interrupts, it is ok to leave the SN bit as 
zero. But if the preempter has enabled interrupts, doesn't that mean it allows 
interrupts to happen? BTW, since there is already the ON bit, at most one 
notification arrives and it doesn't hurt performance. Do we really need to set 
the SN bit?
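
For reference, the descriptor bits under discussion (layout per the VT-d spec;
shown as a sketch using the field names commonly used for it):

	struct pi_desc {
		u32 pir[8];	/* posted interrupt requests, 1 bit/vector */
		u16 on : 1,	/* bit 256: outstanding notification */
		    sn : 1,	/* bit 257: suppress non-urgent notifications */
		    rsvd_1 : 14;
		u8  nv;		/* notification vector */
		u8  rsvd_2;
		u32 ndst;	/* notification destination (APIC ID) */
		u32 rsvd[6];
	} __aligned(64);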

 
 Thanks,
 Feng
 
 
 waiting for the next scheduling in the runqueue. But I am not
 sure whether we need to set SN for other purpose in future.
 Adding SN checking here is just to follow the Spec. non-urgent
 interrupts are suppressed
 when SN is set.
 
 I would change that to a WARN_ON_ONCE then.
 
 
 Best regards,
 Yang
 
 
 ___
 iommu mailing list
 io...@lists.linux-foundation.org
 https://lists.linuxfoundation.org/mailman/listinfo/iommu


Best regards,
Yang


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set

2014-12-18 Thread Zhang, Yang Z
Wu, Feng wrote on 2014-12-19:
 
 
 Zhang, Yang Z wrote on 2014-12-19:
 Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is
 set
 
 Wu, Feng wrote on 2014-12-19:
 
 
 iommu-boun...@lists.linux-foundation.org wrote on
 mailto:iommu-boun...@lists.linux-foundation.org] On Behalf Of:
 Cc: io...@lists.linux-foundation.org;
 linux-ker...@vger.kernel.org; kvm@vger.kernel.org
 Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN'
 is set
 
 Paolo Bonzini wrote on 2014-12-18:
 
 
 On 18/12/2014 04:14, Wu, Feng wrote:
 
 
 linux-kernel-ow...@vger.kernel.org wrote on
 mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Paolo:
 x...@kernel.org; Gleb Natapov; Paolo Bonzini; dw...@infradead.org;
 joro-zlv9swrftaidnm+yrof...@public.gmane.org; Alex Williamson;
 joro-zLv9SwRftAIdnm+Jiang Liu Cc:
 io...@lists.linux-foundation.org;
 linux-kernel-u79uwxl29ty76z2rm5m...@public.gmane.org; KVM list;
 Eric Auger Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt
 when 'SN' is set
 
 
 
 On 12/12/2014 16:14, Feng Wu wrote:
 Currently, we don't support urgent interrupt, all interrupts
 are recognized as non-urgent interrupt, so we cannot send
 posted-interrupt when 'SN' is set.
 
 Can this happen?  If the vcpu is in guest mode, it cannot have
 been scheduled out, and that's the only case when SN is set.
 
 Paolo
 
 Currently, the only place where SN is set is vCPU is preempted
 and
 
 If the vCPU is preempted, shouldn't the subsequent be ignored?
 What happens if a PI is occurs when vCPU is preempted?
 
 If a vCPU is preempted, the 'SN' bit is set, the subsequent
 interrupts are suppressed for posting.
 
 I mean what happens if we don't set SN bit. From my point, if
 preempter already disabled the interrupt, it is ok to leave SN bit
 as zero. But if preempter enabled the interrupt, doesn't this mean
 he allow interrupt to happen? BTW, since there already has ON bit,
 so this means there only have one interrupt arrived at most and it
 doesn't hurt performance. Do we really need to set SN bit?
 
 
 See this scenario:
 vCPU0 is running on pCPU0
 -- vCPU0 is preempted by vCPU1
 -- Then vCPU1 is running on pCPU0 and vCPU0 is waiting for schedule
 -- in runqueue
 
 If the we don't set SN for vCPU0, then all subsequent interrupts for
 vCPU0 is posted to vCPU1, this will consume hardware and software

The PI vector for vCPU1 is the notification vector, but the PI vector for 
vCPU0 should be the wakeup vector. Why would vCPU1 consume this PI event?

 efforts and in fact it is not needed at all. If SN is set for vCPU0,
 VT-d hardware will not issue Notification Event for vCPU0 when an
 interrupt is for it, but just setting the related PIR bit.
 
 Thanks,
 Feng
 
 
 
 Thanks,
 Feng
 
 
 waiting for the next scheduling in the runqueue. But I am not
 sure whether we need to set SN for other purpose in future.
 Adding SN checking here is just to follow the Spec. non-urgent
 interrupts are suppressed
 when SN is set.
 
 I would change that to a WARN_ON_ONCE then.
 
 
 Best regards,
 Yang
 
 
 ___
 iommu mailing list
 io...@lists.linux-foundation.org
 https://lists.linuxfoundation.org/mailman/listinfo/iommu
 
 
 Best regards,
 Yang



Best regards,
Yang


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set

2014-12-18 Thread Zhang, Yang Z
Wu, Feng wrote on 2014-12-19:
 
 
 Zhang, Yang Z wrote on 2014-12-19:
 Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is
 set
 
 Wu, Feng wrote on 2014-12-19:
 
 
 Zhang, Yang Z wrote on 2014-12-19:
 Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN'
 is set
 
 Wu, Feng wrote on 2014-12-19:
 
 
 iommu-boun...@lists.linux-foundation.org wrote on
 mailto:iommu-boun...@lists.linux-foundation.org] On Behalf Of:
 Cc: io...@lists.linux-foundation.org;
 linux-ker...@vger.kernel.org; kvm@vger.kernel.org
 Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN'
 is set
 
 Paolo Bonzini wrote on 2014-12-18:
 
 
 On 18/12/2014 04:14, Wu, Feng wrote:
 
 
 linux-kernel-ow...@vger.kernel.org wrote on
 mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Paolo:
 x...@kernel.org; Gleb Natapov; Paolo Bonzini;
 dw...@infradead.org;
 joro-zlv9swrftaidnm+yrof...@public.gmane.org; Alex Williamson;
 joro-zLv9SwRftAIdnm+Jiang Liu Cc:
 io...@lists.linux-foundation.org;
 linux-kernel-u79uwxl29ty76z2rm5m...@public.gmane.org; KVM list;
 Eric Auger Subject: Re: [v3 25/26] KVM: Suppress
 posted-interrupt when 'SN' is set
 
 
 
 On 12/12/2014 16:14, Feng Wu wrote:
 Currently, we don't support urgent interrupt, all
 interrupts are recognized as non-urgent interrupt, so we
 cannot send posted-interrupt when 'SN' is set.
 
 Can this happen?  If the vcpu is in guest mode, it cannot
 have been scheduled out, and that's the only case when SN is set.
 
 Paolo
 
 Currently, the only place where SN is set is vCPU is
 preempted and
 
 If the vCPU is preempted, shouldn't the subsequent be ignored?
 What happens if a PI is occurs when vCPU is preempted?
 
 If a vCPU is preempted, the 'SN' bit is set, the subsequent
 interrupts are suppressed for posting.
 
 I mean what happens if we don't set SN bit. From my point, if
 preempter already disabled the interrupt, it is ok to leave SN
 bit as zero. But if preempter enabled the interrupt, doesn't this
 mean he allow interrupt to happen? BTW, since there already has
 ON bit, so this means there only have one interrupt arrived at
 most and it doesn't hurt performance. Do we really need to set SN bit?
 
 
 See this scenario:
 vCPU0 is running on pCPU0
 -- vCPU0 is preempted by vCPU1
 -- Then vCPU1 is running on pCPU0 and vCPU0 is waiting for
 -- schedule in runqueue
 
 If the we don't set SN for vCPU0, then all subsequent interrupts
 for
 vCPU0 is posted to vCPU1, this will consume hardware and software
 
 The PI vector for vCPU1 is notification vector, but the PI vector
 for
 vCPU0 should be wakeup vector. Why vCPU1 will consume this PI event?
 
 Wakeup vector is only used for blocking case, when vCPU is preempted
 and waiting in the runqueue, the NV is the notification vector.

I see your point. But from a performance point of view, if we could schedule 
the vCPU to another pCPU to handle the interrupt, it would be helpful. However, 
I remember that current KVM will not move a vCPU sitting in the runqueue (even 
though it got preempted) to another pCPU to run (am I right?). So it may be 
hard to do.

 
 Thanks,
 Feng
 
 
 efforts and in fact it is not needed at all. If SN is set for
 vCPU0, VT-d hardware will not issue Notification Event for vCPU0
 when an interrupt is for it, but just setting the related PIR bit.
 
 Thanks,
 Feng
 
 
 
 Thanks,
 Feng
 
 
 waiting for the next scheduling in the runqueue. But I am not
 sure whether we need to set SN for other purpose in future.
 Adding SN checking here is just to follow the Spec.
 non-urgent interrupts are suppressed
 when SN is set.
 
 I would change that to a WARN_ON_ONCE then.
 
 
 Best regards,
 Yang
 
 
 ___
 iommu mailing list
 io...@lists.linux-foundation.org
 https://lists.linuxfoundation.org/mailman/listinfo/iommu
 
 
 Best regards,
 Yang
 
 
 
 Best regards,
 Yang



Best regards,
Yang


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set

2014-12-18 Thread Zhang, Yang Z
Wu, Feng wrote on 2014-12-19:
 
 
 Zhang, Yang Z wrote on 2014-12-19:
 Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is
 set
 
 Wu, Feng wrote on 2014-12-19:
 
 
 Zhang, Yang Z wrote on 2014-12-19:
 Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN'
 is set
 
 Wu, Feng wrote on 2014-12-19:
 
 
 Zhang, Yang Z wrote on 2014-12-19:
 Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN'
 is set
 
 Wu, Feng wrote on 2014-12-19:
 
 
 iommu-boun...@lists.linux-foundation.org wrote on
 mailto:iommu-boun...@lists.linux-foundation.org] On Behalf Of:
 Cc: io...@lists.linux-foundation.org;
 linux-ker...@vger.kernel.org; kvm@vger.kernel.org
 Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN'
 is set
 
 Paolo Bonzini wrote on 2014-12-18:
 
 
 On 18/12/2014 04:14, Wu, Feng wrote:
 
 
 linux-kernel-ow...@vger.kernel.org wrote on
 mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Paolo:
 x...@kernel.org; Gleb Natapov; Paolo Bonzini;
 dw...@infradead.org;
 joro-zlv9swrftaidnm+yrof...@public.gmane.org; Alex Williamson;
 joro-zLv9SwRftAIdnm+Jiang Liu Cc:
 io...@lists.linux-foundation.org;
 linux-kernel-u79uwxl29ty76z2rm5m...@public.gmane.org;
 KVM
 list;
 Eric Auger Subject: Re: [v3 25/26] KVM: Suppress
 posted-interrupt when 'SN' is set
 
 
 
 On 12/12/2014 16:14, Feng Wu wrote:
 Currently, we don't support urgent interrupt, all
 interrupts are recognized as non-urgent interrupt, so we
 cannot send posted-interrupt when 'SN' is set.
 
 Can this happen?  If the vcpu is in guest mode, it cannot
 have been scheduled out, and that's the only case when SN is set.
 
 Paolo
 
 Currently, the only place where SN is set is vCPU is
 preempted and
 
 If the vCPU is preempted, shouldn't the subsequent be ignored?
 What happens if a PI is occurs when vCPU is preempted?
 
 If a vCPU is preempted, the 'SN' bit is set, the subsequent
 interrupts are suppressed for posting.
 
 I mean what happens if we don't set SN bit. From my point, if
 preempter already disabled the interrupt, it is ok to leave SN
 bit as zero. But if preempter enabled the interrupt, doesn't
 this mean he allow interrupt to happen? BTW, since there
 already has ON bit, so this means there only have one interrupt
 arrived at most and it doesn't hurt performance. Do we really need to 
 set SN bit?
 
 
 See this scenario:
 vCPU0 is running on pCPU0
 -- vCPU0 is preempted by vCPU1
 -- Then vCPU1 is running on pCPU0 and vCPU0 is waiting for
 -- schedule in runqueue
 
 If the we don't set SN for vCPU0, then all subsequent interrupts
 for
 vCPU0 is posted to vCPU1, this will consume hardware and
 software
 
 The PI vector for vCPU1 is notification vector, but the PI vector
 for
 vCPU0 should be wakeup vector. Why vCPU1 will consume this PI event?
 
 Wakeup vector is only used for blocking case, when vCPU is
 preempted and waiting in the runqueue, the NV is the notification vector.
 
 I see your point. But from performance point, if we can schedule the
 vCPU to another PCPU to handle the interrupt, it would helpful. But I
 remember current KVM will not schedule the vCPU in run queue (even
 though it got preempted) to another pCPU to run(Am I right?). So it may
 hard to do it.
 
 
 KVM is using the Linux scheduler, when the preempted vCPU (in
 runqueue) is scheduled again depends on the scheduling algorithm
 itself, I think it is a little hard for us to get involved.
 
 I think what you mentioned is a little like the urgent interrupt in VT-d PI 
 Spec.
 For this kind of interrupts, if an interrupt is coming for an
 preempted vCPU (waiting in the run queue), we need to schedule the
 vCPU immediately. This is some real time things. And we don't support urgent 
 interrupt so far.

Yes. IIRC, if we use the two-global-vectors mechanism properly, there should 
be no need to use the hardware urgent interrupt mechanism. :)

 
 Thanks,
 Feng
 
 
 Thanks,
 Feng
 
 
 efforts and in fact it is not needed at all. If SN is set for
 vCPU0, VT-d hardware will not issue Notification Event for vCPU0
 when an interrupt is for it, but just setting the related PIR bit.
 
 Thanks,
 Feng
 
 
 
 Thanks,
 Feng
 
 
 waiting for the next scheduling in the runqueue. But I am
 not sure whether we need to set SN for other purpose in future.
 Adding SN checking here is just to follow the Spec.
 non-urgent interrupts are suppressed
 when SN is set.
 
 I would change that to a WARN_ON_ONCE then.
 
 
 Best regards,
 Yang
 
 
 ___
 iommu mailing list
 io...@lists.linux-foundation.org
 https://lists.linuxfoundation.org/mailman/listinfo/iommu
 
 
 Best regards,
 Yang
 
 
 
 Best regards,
 Yang
 
 
 
 Best regards,
 Yang



Best regards,
Yang


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] KVM: x86: reset RVI upon system reset

2014-12-11 Thread Zhang, Yang Z
Zhang Haoyu wrote on 2014-12-11:
 Then?

It's already in upstream KVM

commit 4114c27d450bef228be9c7b0c40a888e18a3a636
Author: Wei Wang wei.w.w...@intel.com
Date:   Wed Nov 5 10:53:43 2014 +0800

KVM: x86: reset RVI upon system reset

A bug was reported as follows: when running Windows 7 32-bit guests on
qemu-kvm, sometimes the guests run into blue screen during reboot. The
problem was that a guest's RVI was not cleared when it rebooted. This
patch has fixed the problem.

Signed-off-by: Wei Wang wei.w.w...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
Tested-by: Rongrong Liu rongrongx@intel.com, Da Chun ng...@qq.com
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
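
The idea of the fix, as a rough sketch rather than the literal diff: once the
LAPIC registers (including IRR) are cleared on reset, the now-empty IRR is
pushed into the VMCS so a stale RVI cannot survive the reboot.

	if (kvm_x86_ops->hwapic_irr_update)
		kvm_x86_ops->hwapic_irr_update(vcpu,
					       apic_find_highest_irr(apic));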


 
 On 05/11/2014 10:02, Chen, Tiejun wrote:
 I think both are ok.
 If we zero max_irr in vmx_set_rvi(), we still need this check:
 if ((is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) || max_irr == -1)
 
 No, I don't think we need to add this.
 
 You don't, because the code will look like:
 
 	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
 		return;
 	if (!is_guest_mode(vcpu)) {
 		vmx_set_rvi(max_irr);
 		return;
 	}
 
 	if (max_irr == -1)
 		return;
 and thus vmx_set_rvi() is never reached if is_guest_mode(vcpu) &&
 !nested_exit_on_intr(vcpu).
 
 I don't think the above code is perfect. Since hwapic_irr_update() is a hot
 path, it's better to move the first check after the second check. In this
 case, Wei's patch looks more reasonable.
 
 
 I applied the lapic.c part of Wei's patch, and the vmx.c part of Tiejun's 
 patch.
 
 Paolo
 
 
 Best regards,


Best regards,
Yang

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes

2014-11-12 Thread Zhang, Yang Z
Wu, Feng wrote on 2014-11-13:
 
 
 kvm-ow...@vger.kernel.org wrote on 2014-11-12:
 kvm@vger.kernel.org; io...@lists.linux-foundation.org; 
 linux-ker...@vger.kernel.org
 Subject: Re: [PATCH 05/13] KVM: Update IRTE according to guest 
 interrupt configuration changes
 
 
 
 On 12/11/2014 10:19, Wu, Feng wrote:
 You can certainly backport these patches to distros that do not 
 have VFIO.  But upstream we should work on VFIO first.  VFIO has 
 feature parity with legacy device assignment, and adding a new 
 feature that is not in VFIO would be a bad idea.
 
 By the way, do you have benchmark results for it?  We have not been 
 able to see any performance improvement for APICv on e.g. netperf.
 
 Do you mean benchmark results for APICv itself or for VT-d Posted-Interrupts?
 
 Especially for VT-d posted interrupts---but it'd be great to know 
 which workloads see the biggest speedup from APICv.
 
 We have some draft performance data internally, please see the 
 attached. For VT-d PI, I think we can get the biggest performance gain 
 if the VCPU is running in non-root mode for most of the time (not in 
 HLT state), since external interrupts from assigned devices will be delivered 
 to the guest directly in this case.
 That means we can run some cpu intensive workload in the guests.

Have you checked that the CPU-side posted interrupt is taking effect in the 
w/o VT-d PI case? Per my understanding, the performance gap should not be so 
large if you use CPU-side posted interrupts. This data looks more like VT-d PI 
vs. no PI at all (neither VT-d nor CPU side).

 
 Thanks,
 Feng
 
 
 Paolo
 --
 To unsubscribe from this list: send the line unsubscribe kvm in the 
 body of a message to majord...@vger.kernel.org More majordomo info at 
 http://vger.kernel.org/majordomo-info.html


Best regards,
Yang




RE: [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes

2014-11-12 Thread Zhang, Yang Z
Wu, Feng wrote on 2014-11-13:
 
 
 Zhang, Yang Z wrote on 2014-11-13:
 kvm@vger.kernel.org; io...@lists.linux-foundation.org;
 linux-ker...@vger.kernel.org
 Subject: RE: [PATCH 05/13] KVM: Update IRTE according to guest
 interrupt configuration changes
 
 Wu, Feng wrote on 2014-11-13:
 
 
 kvm-ow...@vger.kernel.org wrote on 2014-11-12:
 kvm@vger.kernel.org; io...@lists.linux-foundation.org;
 linux-ker...@vger.kernel.org
 Subject: Re: [PATCH 05/13] KVM: Update IRTE according to guest
 interrupt configuration changes
 
 
 
 On 12/11/2014 10:19, Wu, Feng wrote:
 You can certainly backport these patches to distros that do not
 have VFIO.  But upstream we should work on VFIO first.  VFIO
 has feature parity with legacy device assignment, and adding a
 new feature that is not in VFIO would be a bad idea.
 
 By the way, do you have benchmark results for it?  We have not
 been able to see any performance improvement for APICv on e.g.
 netperf.
 
 Do you mean benchmark results for APICv itself or VT-d
 Posted-Interrupts?
 
 Especially for VT-d posted interrupts---but it'd be great to know
 which workloads see the biggest speedup from APICv.
 
 We have some draft performance data internally, please see the
 attached. For VT-d PI, I think we can get the biggest performance gain
 if the VCPU is running in non-root mode for most of the time (not in
 HLT state), since external interrupt from assigned devices will be
 delivered by guest directly in this case. That means we can run some
 cpu intensive workload in the guests.
 
 Have you checked that the CPU-side posted interrupt is taking effect
 in the w/o VT-d PI case? Per my understanding, the performance gap
 should not be so large if you use the CPU-side posted interrupt. This data
 looks more like VT-d PI vs. no PI (neither VT-d nor CPU side).
 
 Yes, this data is VT-d PI vs non-VT-d PI. The CPU-side APICv mechanism
 (including CPU-side Posted-Interrupts) is enabled.

From the CPU utilization data, it seems the APICv test environment is not set 
up correctly. With current APICv, the interrupt should not be delivered to the 
pCPU where the vCPU is running. Otherwise, it will force the vCPU to vmexit and 
the CPU-side posted interrupt cannot take effect. Did you set the interrupt 
affinity manually?
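
For example, pinning the assigned device's interrupt away from the vCPU's pCPU
could be done like this (a minimal sketch; the IRQ number and CPU mask are
made-up example values):

    #include <stdio.h>

    int main(void)
    {
    	/* hypothetical IRQ 45: pick the assigned device's IRQ
    	 * from /proc/interrupts on the host */
    	FILE *f = fopen("/proc/irq/45/smp_affinity", "w");

    	if (!f) {
    		perror("smp_affinity");
    		return 1;
    	}
    	fputs("02\n", f);	/* hex CPU mask: CPU1 only */
    	fclose(f);
    	return 0;
    }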

 
 Thanks,
 Feng
 
 
 
 Thanks,
 Feng
 
 
 Paolo
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org More majordomo
 info at http://vger.kernel.org/majordomo-info.html
 
 
 Best regards,
 Yang



Best regards,
Yang




RE: [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes

2014-11-11 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2014-11-11:
 
 
 On 11/11/2014 10:20, Wu, Feng wrote:
 Since legacy KVM device assignment is effectively deprecated, have
 you considered how we might do this with VFIO?  Thanks,
 
 I haven't thought about how to enable this in VFIO so far. I think I
 can continue to implement that if needed after this patch set is finished.
 What do you think of this?
 
 Hi Feng,
 
 we are not applying new features to legacy KVM device assignment,
 since it is unsafe (it does not honor ACS).

Personally, I think this feature will be helpful to legacy device 
assignment. I agree that VFIO is the right place for future feature enabling, but 
old KVM without good VFIO support is still widely used today. Users really are 
looking for this feature, but they will not upgrade their kernel. 
It's easy for us to backport this feature to old KVM with legacy device 
assignment, but it is impossible to backport the whole of VFIO. So I think you 
could consider adding this feature to both VFIO and legacy device 
assignment.

 
 I and Alex can help you with designing a way to interface VFIO with
 KVM posted interrupts.  Give us a few days to study these patches
 more, or feel free to request comments if you have ideas about it yourself.
 
 Paolo


Best regards,
Yang



RE: [PATCH] KVM: x86: reset RVI upon system reset

2014-11-05 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2014-11-05:
 
 
 On 05/11/2014 10:02, Chen, Tiejun wrote:
 I think both are ok.
 If we zero max_irr in vmx_set_rvi(), we still need this check:
 if ((is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) || max_irr == -1)
 
 No, I don't think we need to add this.
 
 You don't, because the code will look like:
 
 if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
 return;
 if (!is_guest_mode(vcpu)) {
 vmx_set_rvi(max_irr);
 return;
 }
 
 if (max_irr == -1)
 return;
 and thus vmx_set_rvi() is never reached if is_guest_mode(vcpu) &&
 !nested_exit_on_intr(vcpu).

I don't think the above code is perfect. Since hwapic_irr_update() is a hot 
path, it's better to move the first check after the second check. In that 
case, Wei's patch looks more reasonable.
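
Roughly like this (a sketch of the reordering only, not the applied patch):
handle the common non-nested case first, so the hot path does a single
is_guest_mode() test:

    static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
    {
    	if (!is_guest_mode(vcpu)) {
    		vmx_set_rvi(max_irr);
    		return;
    	}
    	if (nested_exit_on_intr(vcpu) || max_irr == -1)
    		return;
    	/* otherwise fall back to the old injection path for L2 */
    }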

 
 I applied the lapic.c part of Wei's patch, and the vmx.c part of Tiejun's 
 patch.
 
 Paolo


Best regards,
Yang



RE: nVMX: APICv seems to cause breakages

2014-10-20 Thread Zhang, Yang Z
Jan Kiszka wrote on 2014-10-20:
 Hi all,
 
 we just started a test with Jailhouse in a VM on a decently recent Intel
 box. It has APICv / posted interrupts support. And it breaks Jailhouse
 activation (L1 breakage, host seems to be fine). Loading kvm-intel with
 enable_apicv=0 resolves the issue. Henning started to debug from
 Jailhouse POV and found out it may be NMI IPI related. We will try to
 dig deeper, but any support is welcome regarding the APICv specialties.

It is interesting. I will spend some time looking into this issue too.

 
 Could this also explain some of the incompatibilities you found, Yang?

Maybe, I will do a double check.

 Maybe you want to check whether more hypervisors work when APICv is off.
 
 Jan



Best regards,
Yang



RE: [PATCH v2] KVM: x86: keep eoi exit bitmap accurate before loading it.

2014-08-28 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2014-08-27:
 Il 27/08/2014 16:05, Wei Wang ha scritto:
  Guest may mask the IOAPIC entry before issue EOI. In such case, EOI
  will not be intercepted by the hypervisor, since the corresponding bit
  in eoi_exit_bitmap is not set after the masking of IOAPIC entry.
 
  The solution here is to OR eoi_exit_bitmap with tmr to make sure that
  all level-triggered interrupts have their bits in eoi_exit_bitmap set.
 
 This commit message does not explain why this change is necessary, and the
 relationship between this patch and the previous one.
 
 For example:
 
 --
 Commit 0f6c0a740b7d (KVM: x86: always exit on EOIs for interrupts listed in
 the IOAPIC redir table, 2014-07-30) fixed an APICv bug where an incorrect EOI
 exit bitmap triggered an interrupt storm inside the guest.
 
 There is a corner case for which that patch would have disabled accelerated
 EOI unnecessarily.  Suppose you have:
 
 - a device that was the sole user of an INTx interrupt and is hot-unplugged
 
 - an OS that masks the INTx interrupt entry in the IOAPIC after the unplug
 
 - another device that uses MSI and is subsequently hot-plugged
 
 If the OS chooses to reuse the same LAPIC interrupt vector for the two
 interrupts, the patch would have left the vector in the EOI exit bitmap, 
 because
 KVM takes into account the stale entry in the IOAPIC redirection table.
 
 We do know exactly which masked interrupts are still in-service and thus
 require broadcasting an EOI to the IOAPIC: this information is in the TMR.  
 So,
 this patch ORs the EOI exit bitmap provided by the ioapic with the TMR 
 register.
 Thanks to the previous patch, an active level-triggered interrupt will always 
 be
 included in the EOI exit bitmap.
 --
 
 However, see below.
 
  Tested-by: Rongrong Liu rongrongx@intel.com
  Signed-off-by: Yang Zhang yang.z.zh...@intel.com
  Signed-off-by: Wei Wang wei.w.w...@intel.com
  ---
   arch/x86/kvm/lapic.c |   12 
   arch/x86/kvm/lapic.h |1 +
   arch/x86/kvm/x86.c   |1 +
   virt/kvm/ioapic.c|7 ---
   4 files changed, 18 insertions(+), 3 deletions(-)
 
  diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index
  8c1162d..0fcac3c 100644
  --- a/arch/x86/kvm/lapic.c
  +++ b/arch/x86/kvm/lapic.c
  @@ -539,6 +539,18 @@ void kvm_apic_update_tmr(struct kvm_vcpu *vcpu,
 u32 *tmr)
  }
   }
 
   +void kvm_apic_update_eoi_exitmap(struct kvm_vcpu *vcpu,
   +			u64 *eoi_exit_bitmap)
   +{
   +	struct kvm_lapic *apic = vcpu->arch.apic;
   +	u32 i;
   +	u32 tmr;
   +
   +	for (i = 0; i < 8; i++) {
   +		tmr = kvm_apic_get_reg(apic, APIC_TMR + 0x10 * i);
   +		*((u32 *)eoi_exit_bitmap + i) |= tmr;
   +	}
   +}
  +
   static void apic_update_ppr(struct kvm_lapic *apic)  {
  u32 tpr, isrv, ppr, old_ppr;
  diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index
  6a11845..d2b96f2 100644
  --- a/arch/x86/kvm/lapic.h
  +++ b/arch/x86/kvm/lapic.h
  @@ -55,6 +55,7 @@ void kvm_apic_set_version(struct kvm_vcpu *vcpu);
 
   void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr);  void
  kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir);
  +void kvm_apic_update_eoi_exitmap(struct kvm_vcpu *vcpu, u64
  +*eoi_exit_bitmap);
   int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest);
  int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda);  int
  kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq,
  diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index
  d401684..d23b558 100644
  --- a/arch/x86/kvm/x86.c
  +++ b/arch/x86/kvm/x86.c
  @@ -5992,6 +5992,7 @@ static void vcpu_scan_ioapic(struct kvm_vcpu
  *vcpu)
 
  kvm_ioapic_scan_entry(vcpu, eoi_exit_bitmap, tmr);
  kvm_apic_update_tmr(vcpu, tmr);
  +   kvm_apic_update_eoi_exitmap(vcpu, eoi_exit_bitmap);
   	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);  }
 
  diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index
  e8ce34c..ed15936 100644
  --- a/virt/kvm/ioapic.c
  +++ b/virt/kvm/ioapic.c
  @@ -254,9 +254,10 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu,
 u64 *eoi_exit_bitmap,
   	spin_lock(&ioapic->lock);
   	for (index = 0; index < IOAPIC_NUM_PINS; index++) {
   		e = &ioapic->redirtbl[index];
   -		if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
   -		    kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index) ||
   -		    index == RTC_GSI) {
   +		if ((!e->fields.mask &&
   +		     e->fields.trig_mode == IOAPIC_LEVEL_TRIG)
   +		    || kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC,
   +		    index) || index == RTC_GSI) {
   			if (kvm_apic_match_dest(vcpu, NULL, 0,
   				e->fields.dest_id, e->fields.dest_mode)) {
   				__set_bit(e->fields.vector,
 
 
 There's still something missing here.
 
 Suppose you have the following:
 
 Program edge-triggered MSI for vector 123
 Interrupt comes 

RE: [PATCH v2] KVM: x86: check ISR and TMR to construct eoi exit bitmap

2014-08-17 Thread Zhang, Yang Z
Chen, Tiejun wrote on 2014-08-15:
 On 2014/8/14 3:16, Wei Wang wrote:
 From: Yang Zhang yang.z.zh...@intel.com
 
 Guest may mask the IOAPIC entry before issue EOI. In such case,
 EOI will not be intercepted by hypervisor due to the corrensponding
 
 s/corrensponding/corresponding
 
 bit in eoi exit bitmap is not setting.
 
 The solution is to check ISR + TMR to construct the EOI exit bitmap.
 
 This patch is a better fix for the issue that commit 0f6c0a740b
 tries to solve.
 

Hi Paolo and Alex,

Any other comments with this version? 

 Tested-by: Alex Williamson alex.william...@redhat.com
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 Signed-off-by: Wei Wang wei.w.w...@intel.com
 ---
   arch/x86/kvm/lapic.c |   17 +
   arch/x86/kvm/lapic.h |2 ++
   arch/x86/kvm/x86.c   |9 +
   virt/kvm/ioapic.c|7 ---
   4 files changed, 32 insertions(+), 3 deletions(-)
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 08e8a89..0ed4bcb 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -515,6 +515,23 @@ static void pv_eoi_clr_pending(struct kvm_vcpu
 *vcpu)
  	__clear_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
  }
 +void kvm_apic_zap_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap,
 +	u32 *tmr)
 +{
 +	u32 i, reg_off, intr_in_service;
 +	struct kvm_lapic *apic = vcpu->arch.apic;
 +
 +	for (i = 0; i < 8; i++) {
 +		reg_off = 0x10 * i;
 +		intr_in_service = apic_read_reg(apic, APIC_ISR + reg_off) &
 +			kvm_apic_get_reg(apic, APIC_TMR + reg_off);
 +		if (intr_in_service) {
 +			*((u32 *)eoi_exit_bitmap + i) |= intr_in_service;
 +			tmr[i] |= intr_in_service;
 +		}
 +	}
 +}
 +
   void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr)
   {
 	struct kvm_lapic *apic = vcpu->arch.apic;
 diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
 index 6a11845..4ee3d70 100644
 --- a/arch/x86/kvm/lapic.h
 +++ b/arch/x86/kvm/lapic.h
 @@ -53,6 +53,8 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64
 value);
   u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
   void kvm_apic_set_version(struct kvm_vcpu *vcpu);
 +void kvm_apic_zap_eoi_exitmap(struct kvm_vcpu *vcpu, u64
 *eoi_exit_bitmap, +  u32 *tmr);
   void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr);
   void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir);
   int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest);
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 204422d..755b556 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -6005,6 +6005,15 @@ static void vcpu_scan_ioapic(struct kvm_vcpu
 *vcpu)
  memset(tmr, 0, 32);
   
  kvm_ioapic_scan_entry(vcpu, eoi_exit_bitmap, tmr);
 +/*
 + * Guest may mask the IOAPIC entry before issue EOI. In such case,
 + * EOI will not be intercepted by hypervisor due to the corrensponding
 
 s/corrensponding/corresponding
 
 + * bit in eoi exit bitmap is not setting.
 + *
 + * The solution is to check ISR + TMR to construct the EOI exit bitmap.
 + */
 +kvm_apic_zap_eoi_exitmap(vcpu, eoi_exit_bitmap, tmr);
 +
 	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
  kvm_apic_update_tmr(vcpu, tmr);
   }
 diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
 index e8ce34c..2458a1d 100644
 --- a/virt/kvm/ioapic.c
 +++ b/virt/kvm/ioapic.c
 @@ -254,9 +254,10 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu,
 u64 *eoi_exit_bitmap,
 	spin_lock(&ioapic->lock);
 	for (index = 0; index < IOAPIC_NUM_PINS; index++) {
 		e = &ioapic->redirtbl[index];
 -		if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
 -		    kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index) ||
 -		    index == RTC_GSI) {
 +		if (!e->fields.mask &&
 +		    (e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
 +		     kvm_irq_has_notifier(ioapic->kvm,
 +		     KVM_IRQCHIP_IOAPIC, index) || index == RTC_GSI)) {
 			if (kvm_apic_match_dest(vcpu, NULL, 0,
 				e->fields.dest_id, e->fields.dest_mode)) {
 				__set_bit(e->fields.vector,


Best regards,
Yang




RE: [PATCH] KVM: x86: always exit on EOIs for interrupts listed in the IOAPIC redir table

2014-08-07 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2014-08-07:
 Il 07/08/2014 03:31, Zhang, Yang Z ha scritto:
 Let me give an example to see whether my concern is a real problem:
 The guest allocates a vector and sets it in an IOAPIC entry to deliver
 an interrupt. Later it masks the IOAPIC entry (meaning the
 corresponding device is stopped) and assigns this vector to an MSI
 device. With this patch, even though the vector is no longer used by
 the IOAPIC, it is still set in the EOI exit bitmap unconditionally. The
 subsequent EOIs for the MSI device will force a vmexit. Could this happen?
 
 Yes, I guess it could.  I'm not sure whether it could on Linux or Windows.
 
 I think the right fixing is to check the ISR plus TMR to construct
 the eoi exit bitmap.
 
 Do you care enough to propose a patch? :)
 

Sure. I will do it.

 Paolo


Best regards,
Yang



RE: [PATCH v3] KVM: nVMX: nested TPR shadow/threshold emulation

2014-08-06 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2014-08-05:
 Il 05/08/2014 09:56, Zhang, Yang Z ha scritto:
 Wanpeng Li wrote on 2014-08-04:
 This patch fix bug
 https://bugzilla.kernel.org/show_bug.cgi?id=61411
 
 TPR shadow/threshold feature is important to speed up the Windows guest.
 Besides, it is a must feature for certain VMM.
 
 We map virtual APIC page address and TPR threshold from L1 VMCS. If
 TPR_BELOW_THRESHOLD VM exit is triggered by L2 guest and L1
 interested in, we inject it into L1 VMM for handling.
 
 Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com
 ---
 v2 - v3:
  * nested vm entry failure if both tpr shadow and cr8 exiting bits
 are not set
 v1 - v2:
  * don't take L0's virtualize APIC accesses setting into account *
  virtual_apic_page do exactly the same thing that is done for
  apic_access_page * add the tpr threshold field to the read-write
  fields for shadow
 VMCS
 
  arch/x86/kvm/vmx.c | 38 --
  1 file changed, 36 insertions(+), 2 deletions(-)
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index
 c604f3c..7a56e2c 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
  @@ -379,6 +379,7 @@ struct nested_vmx {
  	 * we must keep them pinned while L2 runs.
  	 */
  	struct page *apic_access_page;
  +	struct page *virtual_apic_page;
  	u64 msr_ia32_feature_control;
  
  	struct hrtimer preemption_timer;
  @@ -533,6 +534,7 @@ static int max_shadow_read_only_fields =
  	ARRAY_SIZE(shadow_read_only_fields);
  
   static unsigned long shadow_read_write_fields[] = {
  +	TPR_THRESHOLD,
  	GUEST_RIP,
  	GUEST_RSP,
  	GUEST_CR0,
  @@ -2330,7 +2332,7 @@ static __init void nested_vmx_setup_ctls_msrs(void)
  	CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
  	CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_EXITING |
  	CPU_BASED_RDPMC_EXITING | CPU_BASED_RDTSC_EXITING |
  -	CPU_BASED_PAUSE_EXITING |
  +	CPU_BASED_PAUSE_EXITING | CPU_BASED_TPR_SHADOW |
  	CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
  	/*
  	 * We can allow some features even when not supported by the
  @@ -6148,6 +6150,10 @@ static void free_nested(struct vcpu_vmx *vmx)
  	nested_release_page(vmx->nested.apic_access_page);
  	vmx->nested.apic_access_page = 0;
  	}
  +	if (vmx->nested.virtual_apic_page) {
  +		nested_release_page(vmx->nested.virtual_apic_page);
  +		vmx->nested.virtual_apic_page = 0;
  +	}
  
  	nested_free_all_saved_vmcss(vmx);
   }
  @@ -6936,7 +6942,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
  	case EXIT_REASON_MCE_DURING_VMENTRY:
  		return 0;
  	case EXIT_REASON_TPR_BELOW_THRESHOLD:
  -		return 1;
  +		return nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW);
  	case EXIT_REASON_APIC_ACCESS:
  		return nested_cpu_has2(vmcs12,
  			SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
  @@ -7057,6 +7063,9 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
  
   static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
   {
  +	if (is_guest_mode(vcpu))
  +		return;
  +
  	if (irr == -1 || tpr < irr) {
  		vmcs_write32(TPR_THRESHOLD, 0);
  		return;
  @@ -8024,6 +8033,27 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
  	exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
  	exec_control &= ~CPU_BASED_TPR_SHADOW;
  	exec_control |= vmcs12->cpu_based_vm_exec_control;
  +
  +	if (exec_control & CPU_BASED_TPR_SHADOW) {
  +		if (vmx->nested.virtual_apic_page)
  +			nested_release_page(vmx->nested.virtual_apic_page);
  +		vmx->nested.virtual_apic_page =
  +			nested_get_page(vcpu, vmcs12->virtual_apic_page_addr);
  +		if (!vmx->nested.virtual_apic_page)
  +			exec_control &= ~CPU_BASED_TPR_SHADOW;
  +		else
  +			vmcs_write64(VIRTUAL_APIC_PAGE_ADDR,
  +				page_to_phys(vmx->nested.virtual_apic_page));
  +
  +		if (!(exec_control & CPU_BASED_TPR_SHADOW) &&
  +		    !((exec_control & CPU_BASED_CR8_LOAD_EXITING) &&
  +		      (exec_control & CPU_BASED_CR8_STORE_EXITING)))
  +			nested_vmx_failValid(vcpu,
  +				VMXERR_ENTRY_INVALID_CONTROL_FIELD);
 
 I think this is not correct. The vmx->nested.virtual_apic_page may not
 be valid for two reasons: 1. The virtual_apic_page_addr is not a valid
 gfn. In this case, the VMX failure
 must be injected into L1 unconditionally, regardless of the setting of
 CR8 load/store exiting.
 
 You're right that accesses to the APIC-access page may also end up
 writing to the virtual-APIC page.  Hence, if the virtualize APIC
 accesses setting is 1 in the secondary exec controls, you also have to fail 
 the vmentry.
 
 Doing it unconditionally is not correct, but it is the simplest thing

Why not correct?

 to do and it would be okay

RE: [PATCH] KVM: x86: always exit on EOIs for interrupts listed in the IOAPIC redir table

2014-08-06 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2014-07-31:
 Currently, the EOI exit bitmap (used for APICv) does not include
 interrupts that are masked.  However, this can cause a bug that manifests
 as an interrupt storm inside the guest.  Alex Williamson reported the
 bug and is the one who really debugged this; I only wrote the patch. :)
 
 The scenario involves a multi-function PCI device with OHCI and EHCI
 USB functions and an audio function, all assigned to the guest, where
 both USB functions use legacy INTx interrupts.
 
 As soon as the guest boots, interrupts for these devices turn into an
 interrupt storm in the guest; the host does not see the interrupt storm.
 Basically the EOI path does not work, and the guest continues to see the
 interrupt over and over, even after it attempts to mask it at the APIC.
 The bug is only visible with older kernels (RHEL6.5, based on 2.6.32
 with not many changes in the area of APIC/IOAPIC handling).
 
 Alex then tried forcing bit 59 (corresponding to the USB functions' IRQ)
 on in the eoi_exit_bitmap and TMR, and things then work.  What happens
 is that VFIO asserts IRQ11, then KVM recomputes the EOI exit bitmap.
 It does not have set bit 59 because the RTE was masked, so the IOAPIC
 never sees the EOI and the interrupt continues to fire in the guest.
 
 Probably, the guest is masking the interrupt in the redirection table in
 the interrupt routine, i.e. while the interrupt is set in a LAPIC's ISR.
 The simplest fix is to ignore the masking state, we would rather have
 an unnecessary exit rather than a missed IRQ ACK and anyway IOAPIC
 interrupts are not as performance-sensitive as for example MSIs.

I feel this fix may hurt performance in some cases. If the mask bit is set, 
the vector in this entry may be reused by other devices (like an 
assigned device). But here you set it in the EOI exit bitmap anyway, and this will 
cause a vmexit on each EOI, which should not happen.

 
 [Thanks to Alex for his precise description of the problem
  and initial debugging effort.  A lot of the text above is
  based on emails exchanged with him.]
 
 Reported-by: Alex Williamson alex.william...@redhat.com
 Cc: sta...@vger.kernel.org
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
  virt/kvm/ioapic.c | 7 +++
  1 file changed, 3 insertions(+), 4 deletions(-)
 
 diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
 index 2458a1dc2ba9..e8ce34c9db32 100644
 --- a/virt/kvm/ioapic.c
 +++ b/virt/kvm/ioapic.c
 @@ -254,10 +254,9 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu,
 u64 *eoi_exit_bitmap,
  	spin_lock(&ioapic->lock);
  	for (index = 0; index < IOAPIC_NUM_PINS; index++) {
  		e = &ioapic->redirtbl[index];
  -	if (!e->fields.mask &&
  -	    (e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
  -	     kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC,
  -	     index) || index == RTC_GSI)) {
  +	if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
  +	    kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index) ||
  +	    index == RTC_GSI) {
  		if (kvm_apic_match_dest(vcpu, NULL, 0,
  			e->fields.dest_id, e->fields.dest_mode)) {
  			__set_bit(e->fields.vector,

Best regards,
Yang




RE: [PATCH] KVM: x86: always exit on EOIs for interrupts listed in the IOAPIC redir table

2014-08-06 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2014-08-06:
 Il 06/08/2014 16:03, Zhang, Yang Z ha scritto:
 Paolo Bonzini wrote on 2014-07-31:
 Probably, the guest is masking the interrupt in the redirection
 table in the interrupt routine, i.e. while the interrupt is set in a 
 LAPIC's ISR.
 The simplest fix is to ignore the masking state, we would rather
 have an unnecessary exit rather than a missed IRQ ACK and anyway
 IOAPIC interrupts are not as performance-sensitive as for example MSIs.
 
 I feel this fixing may hurt performance in some cases. If the mask
 bit is set, this means the vector in this entry may be used by other
 devices(like a assigned device). But here you set it in eoi exit
 bitmap and this will cause vmexit on each EOI which should not happen.
 
 Note that this *was* reported on an assigned device.
 
 IOAPIC should not be a performance-sensitive path.  High-performance
 assigned devices should be using MSIs.

Let me give an example to see whether my concern is a real problem:
The guest allocates a vector and sets it in an IOAPIC entry to deliver an interrupt. Later 
it masks the IOAPIC entry (meaning the corresponding device is stopped) and assigns this 
vector to an MSI device. With this patch, even though the vector is no longer used by the IOAPIC, 
it is still set in the EOI exit bitmap unconditionally. The subsequent EOIs for the MSI 
device will force a vmexit. Could this happen?

I think the right fix is to check the ISR plus the TMR to construct the EOI exit 
bitmap.
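
A rough sketch of that idea (illustrative only, not a final patch): any vector
that is both in service (ISR) and level-triggered (TMR) still needs an EOI exit,
even if its IOAPIC entry was masked meanwhile:

    	struct kvm_lapic *apic = vcpu->arch.apic;
    	u32 i, in_service;

    	for (i = 0; i < 8; i++) {
    		in_service = kvm_apic_get_reg(apic, APIC_ISR + 0x10 * i) &
    			     kvm_apic_get_reg(apic, APIC_TMR + 0x10 * i);
    		((u32 *)eoi_exit_bitmap)[i] |= in_service;
    	}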

 
 Paolo


Best regards,
Yang



RE: [PATCH v3] KVM: nVMX: nested TPR shadow/threshold emulation

2014-08-05 Thread Zhang, Yang Z
Wanpeng Li wrote on 2014-08-04:
 This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=61411
 
 TPR shadow/threshold feature is important to speed up the Windows guest.
 Besides, it is a must feature for certain VMM.
 
 We map virtual APIC page address and TPR threshold from L1 VMCS. If
 TPR_BELOW_THRESHOLD VM exit is triggered by L2 guest and L1 interested
 in, we inject it into L1 VMM for handling.
 
 Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com
 ---
 v2 - v3:
  * nested vm entry failure if both tpr shadow and cr8 exiting bits are not set
 v1 - v2:
  * don't take L0's virtualize APIC accesses setting into account
  * virtual_apic_page do exactly the same thing that is done for
 apic_access_page
  * add the tpr threshold field to the read-write fields for shadow VMCS
 
  arch/x86/kvm/vmx.c | 38 --
  1 file changed, 36 insertions(+), 2 deletions(-)
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index c604f3c..7a56e2c 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -379,6 +379,7 @@ struct nested_vmx {
* we must keep them pinned while L2 runs.
*/
   struct page *apic_access_page;
 + struct page *virtual_apic_page;
   u64 msr_ia32_feature_control;
 
   struct hrtimer preemption_timer;
 @@ -533,6 +534,7 @@ static int max_shadow_read_only_fields =
   ARRAY_SIZE(shadow_read_only_fields);
 
  static unsigned long shadow_read_write_fields[] = {
 + TPR_THRESHOLD,
   GUEST_RIP,
   GUEST_RSP,
   GUEST_CR0,
 @@ -2330,7 +2332,7 @@ static __init void
 nested_vmx_setup_ctls_msrs(void)
   CPU_BASED_MOV_DR_EXITING |
 CPU_BASED_UNCOND_IO_EXITING |
   CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_EXITING |
   CPU_BASED_RDPMC_EXITING | CPU_BASED_RDTSC_EXITING |
 - CPU_BASED_PAUSE_EXITING |
 + CPU_BASED_PAUSE_EXITING | CPU_BASED_TPR_SHADOW |
   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
   /*
* We can allow some features even when not supported by the
 @@ -6148,6 +6150,10 @@ static void free_nested(struct vcpu_vmx *vmx)
  	nested_release_page(vmx->nested.apic_access_page);
  	vmx->nested.apic_access_page = 0;
  	}
 +	if (vmx->nested.virtual_apic_page) {
 +		nested_release_page(vmx->nested.virtual_apic_page);
 +		vmx->nested.virtual_apic_page = 0;
 +	}
 
   nested_free_all_saved_vmcss(vmx);
  }
 @@ -6936,7 +6942,7 @@ static bool nested_vmx_exit_handled(struct
 kvm_vcpu *vcpu)
   case EXIT_REASON_MCE_DURING_VMENTRY:
   return 0;
   case EXIT_REASON_TPR_BELOW_THRESHOLD:
 - return 1;
 + return nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW);
   case EXIT_REASON_APIC_ACCESS:
   return nested_cpu_has2(vmcs12,
   SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
 @@ -7057,6 +7063,9 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
 
  static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
  {
 + if (is_guest_mode(vcpu))
 + return;
 +
  	if (irr == -1 || tpr < irr) {
  		vmcs_write32(TPR_THRESHOLD, 0);
  		return;
  @@ -8024,6 +8033,27 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu,
  struct vmcs12 *vmcs12)
  	exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
  	exec_control &= ~CPU_BASED_TPR_SHADOW;
  	exec_control |= vmcs12->cpu_based_vm_exec_control;
 +
 +	if (exec_control & CPU_BASED_TPR_SHADOW) {
 +		if (vmx->nested.virtual_apic_page)
 +			nested_release_page(vmx->nested.virtual_apic_page);
 +		vmx->nested.virtual_apic_page =
 +			nested_get_page(vcpu, vmcs12->virtual_apic_page_addr);
 +		if (!vmx->nested.virtual_apic_page)
 +			exec_control &=
 +				~CPU_BASED_TPR_SHADOW;
 +		else
 +			vmcs_write64(VIRTUAL_APIC_PAGE_ADDR,
 +				page_to_phys(vmx->nested.virtual_apic_page));
 +
 +		if (!(exec_control & CPU_BASED_TPR_SHADOW) &&
 +		    !((exec_control & CPU_BASED_CR8_LOAD_EXITING) &&
 +		      (exec_control & CPU_BASED_CR8_STORE_EXITING)))
 +			nested_vmx_failValid(vcpu,
 				VMXERR_ENTRY_INVALID_CONTROL_FIELD);

I think this is not correct. The vmx->nested.virtual_apic_page may not be valid 
for two reasons:
1. The virtual_apic_page_addr is not a valid gfn. In this case, the VMX failure 
must be injected into L1 unconditionally, regardless of the setting of CR8 
load/store exiting.
2. The virtual_apic_page has been swapped out by L0. In this case, we should not 
inject the failure into L1.

 +
 + vmcs_write32(TPR_THRESHOLD, vmcs12-tpr_threshold);
 + }

An else branch is missing here:
If L2 owns the APIC and doesn't use TPR_SHADOW, we need to set up vmcs02 
based on vmcs01. For example, if vmcs01 is using TPR_SHADOW, then vmcs02 must 
set 
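Something along these lines (purely a hypothetical sketch of the missing branch;
the exact vmcs01 bits to inherit depend on L0's configuration):

    	} else {
    		/*
    		 * L2 does not use TPR shadow: fall back to vmcs01's
    		 * policy, e.g. keep CR8 load/store exiting so L0 can
    		 * still intercept TPR accesses on behalf of L1.
    		 */
    		exec_control |= CPU_BASED_CR8_LOAD_EXITING |
    				CPU_BASED_CR8_STORE_EXITING;
    	}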

RE: [PATCH] KVM: nVMX: nested TPR shadow/threshold emulation

2014-08-01 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2014-08-01:
 Il 01/08/2014 02:57, Zhang, Yang Z ha scritto:
 TPR_THRESHOLD will be likely written as zero, but the processor
 will never use it anyway.  It's just a small optimization because
 nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW) will almost always
 be true.
 
 Theoretically, you are right. But we should not expect all VMMs to
 follow it. It is not worth violating the SDM just to save two or
 three instructions.
 
 Yes, you do need an if (cpu_has_vmx_tpr_shadow()) around the
 vmcs_write32.  But still, checking nested_cpu_has is not strictly necessary.
 Right now they both are a single AND, but I have plans to change all
 of the
 cpu_has_*() checks to static keys.
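
So the write would look something like this (a sketch of the suggestion, not the
merged code):

    	if (cpu_has_vmx_tpr_shadow())
    		vmcs_write32(TPR_THRESHOLD, vmcs12->tpr_threshold);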

See v2 patch. It isn't a problem anymore.

 
 Paolo


Best regards,
Yang




RE: [PATCH] KVM: nVMX: nested TPR shadow/threshold emulation

2014-07-31 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2014-07-31:
 Il 31/07/2014 10:03, Wanpeng Li ha scritto:
 One thing:
 
 +	if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW))
 +		vmcs_write32(TPR_THRESHOLD, vmcs12->tpr_threshold);
 
 I think you can just do this write unconditionally, since most
 hypervisors will enable this.  Also, you probably can add the tpr
 
 What will happen if a hypervisor doesn't enable it? I made it
 cleaner in version two.
 
 TPR_THRESHOLD will be likely written as zero, but the processor will
 never use it anyway.  It's just a small optimization because
 nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW) will almost always be true.

Theoretically, you are right. But we should not expect all VMMs to follow it. It 
is not worth violating the SDM just to save two or three instructions.

 
 Paolo
 
 threshold field to the read-write fields for shadow VMCS.
 
 Agreed.
 
 Regards,
 Wanpeng Li


Best regards,
Yang




deadline of CFP for kvm forum

2014-07-27 Thread Zhang, Yang Z
Hi all,

I see the deadline of the CFP for KVM Forum is July 27, 2014. But I found there is 
no KVM Forum selection list when I tried to submit a presentation yesterday. Was 
the CFP closed earlier than expected?

BTW, it is still July 27 in the US as I send this mail.

best regards
yang



RE: [PATCH 1/3] KVM: nVMX: Fix virtual interrupt delivery injection

2014-07-17 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2014-07-17:
 Il 17/07/2014 06:56, Wanpeng Li ha scritto:
 This patch fix bug reported in
 https://bugzilla.kernel.org/show_bug.cgi?id=73331, after the patch
 http://www.spinics.net/lists/kvm/msg105230.html applied, there is some
 progress and L2 can boot up, though slowly. The original idea of
 this vid injection fix is from Zhang, Yang Z
 yang.z.zh...@intel.com.
 
 An interrupt delivered by vid should be injected into L1 by L0 if we are
 currently in L1, or should be injected into L2 by L0 through the old
 injection way if L1 doesn't have VM_EXIT_ACK_INTR_ON_EXIT set. The
 current logic doesn't consider these cases. This patch fixes it by
 injecting the vid intr into L1 if we are in L1, or into L2 through the
 old injection way if L1 doesn't have VM_EXIT_ACK_INTR_ON_EXIT set.
 
 Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com
 Signed-off-by: Zhang, Yang Z yang.z.zh...@intel.com
 ---
  arch/x86/kvm/vmx.c | 18 --
  1 file changed, 16 insertions(+), 2 deletions(-)
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index
 021d84a..ad36646 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -7112,8 +7112,22 @@ static void vmx_hwapic_irr_update(struct
 kvm_vcpu *vcpu, int max_irr)  {
  if (max_irr == -1)
  return;
 -
 -vmx_set_rvi(max_irr);
 +if (!is_guest_mode(vcpu)) {
 +vmx_set_rvi(max_irr);
 +} else if (is_guest_mode(vcpu) && !nested_exit_on_intr(vcpu)) {
 +/*
 + * Fall back to old way to inject the interrupt since there
 + * is no vAPIC-v for L2.
 + */
 +if (vcpu->arch.exception.pending ||
 +    vcpu->arch.nmi_injected ||
 +    vcpu->arch.interrupt.pending)
 +return;
 +else if (vmx_interrupt_allowed(vcpu)) {
 +kvm_queue_interrupt(vcpu, max_irr, false);
 +vmx_inject_irq(vcpu);
 +}
 +}
  }
  
  static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64
 *eoi_exit_bitmap)
 
 
 What hypervisor did you test with? nested_exit_on_intr(vcpu) will

Jailhouse will clear the External-interrupt exiting bit. Am I right, Jan?

 return true for both Xen and KVM (nested_exit_on_intr is not the same
 thing as ACK_INTR_ON_EXIT).

I guess he means the External-interrupt exiting bit, not ACK_INTR_ON_EXIT.

 
 Paolo


Best regards,
Yang




RE: [PATCH 2/3] KVM: nVMX: Fix fail to get nested ack intr's vector during nested vmexit

2014-07-17 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2014-07-17:
 Il 17/07/2014 06:56, Wanpeng Li ha scritto:
   nested_exit_intr_ack_set(vcpu)) {
  int irq = kvm_cpu_get_interrupt(vcpu);
 +
 +if (irq  0  kvm_apic_vid_enabled(vcpu-kvm))
 +irq = kvm_get_apic_interrupt(vcpu);
 
 There's something weird in this patch.  If you inline
 kvm_cpu_get_interrupt, what you get is this:
 
 	int irq;

 	/* Beginning of kvm_cpu_get_interrupt... */
 	if (!irqchip_in_kernel(v->kvm))
 		irq = v->arch.interrupt.nr;
 	else {
 		irq = kvm_cpu_get_extint(v); /* PIC */
 		if (!kvm_apic_vid_enabled(v->kvm) && irq == -1)
 			irq = kvm_get_apic_interrupt(v); /* APIC */
 	}

 	/* kvm_cpu_get_interrupt done. */
 	if (irq < 0 && kvm_apic_vid_enabled(vcpu->kvm))
 		irq = kvm_get_apic_interrupt(vcpu);
 
 There are just two callers of kvm_cpu_get_interrupt, and the other is
 protected by kvm_cpu_has_injectable_intr so it won't be executed if
 virtual interrupt delivery is enabled.  So your patch is effectively the same
 as this:
 
 diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
 index bd0da43..a1ec6a5 100644
 --- a/arch/x86/kvm/irq.c
 +++ b/arch/x86/kvm/irq.c
 @@ -108,7 +108,7 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
 
 	vector = kvm_cpu_get_extint(v);
 -	if (kvm_apic_vid_enabled(v->kvm) || vector != -1)
 +	if (vector != -1)
 		return vector;	/* PIC */
 
 	return kvm_get_apic_interrupt(v);	/* APIC */
   return kvm_get_apic_interrupt(v);   /* APIC */
 But in kvm_get_apic_interrupt I have just added this comment:
 
  /* Note that we never get here with APIC virtualization
* enabled.  */
 
 because kvm_get_apic_interrupt calls apic_set_isr, and apic_set_isr
 must never be called with APIC virtualization enabled either.  With
 APIC virtualization enabled, isr_count is always 1, and
 highest_isr_cache is always -1, and apic_set_isr breaks both of these 
 invariants.
 

You are right. kvm_lapic_find_highest_irr should be the right one.
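
That is, something like the following sketch (hypothetical until the v2 patch):
read the highest pending vector from the IRR without touching ISR state, which
APICv manages in hardware:

    	if (irq < 0 && kvm_apic_vid_enabled(vcpu->kvm))
    		irq = kvm_lapic_find_highest_irr(vcpu);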

 Paolo


Best regards,
Yang




RE: [PATCH 2/3] KVM: nVMX: Fix fail to get nested ack intr's vector during nested vmexit

2014-07-16 Thread Zhang, Yang Z
Wanpeng Li wrote on 2014-07-17:
 WARNING: CPU: 9 PID: 7251 at arch/x86/kvm/vmx.c:8719
 nested_vmx_vmexit+0xa4/0x233 [kvm_intel]() Modules linked in: tun
 nfsv3 nfs_acl auth_rpcgss oid_registry nfsv4 dns_resolver nfs fscache
 lockd sunrpc pci_stub netconsole kvm_intel kvm bridge stp llc autofs4
 8021q ipv6 uinput joydev microcode pcspkr igb i2c_algo_bit ehci_pci
 ehci_hcd e1000e ixgbe ptp pps_core hwmon mdio i2c_i801 i2c_core
 tpm_tis tpm ipmi_si ipmi_msghandler isci libsas scsi_transport_sas
 button dm_mirror dm_region_hash dm_log dm_mod
 CPU: 9 PID: 7251 Comm: qemu-system-x86 Tainted: GW
 3.16.0-rc1 #2
 Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
 RMLSDP.86I.00.29.D696.131329 11/11/2013  220f
 880ffd107bf8 81493563 220f
   880ffd107c38 8103f0eb 880ffd107c48
  a059709a 881ffc9e0040 8800b74b8000 
  Call Trace: [81493563] dump_stack+0x49/0x5e 
  [8103f0eb]
 warn_slowpath_common+0x7c/0x96  [a059709a] ?
 nested_vmx_vmexit+0xa4/0x233 [kvm_intel]  [8103f11a]
 warn_slowpath_null+0x15/0x17  [a059709a]
 nested_vmx_vmexit+0xa4/0x233 [kvm_intel]  [a0594295] ?
 nested_vmx_exit_handled+0x6a/0x39e [kvm_intel]  [a0537931] ?
 kvm_apic_has_interrupt+0x80/0xd5 [kvm]  [a05972ec]
 vmx_check_nested_events+0xc3/0xd3 [kvm_intel]  [a051ebe9]
 inject_pending_event+0xd0/0x16e [kvm]  [a051efa0]
 vcpu_enter_guest+0x319/0x704 [kvm]
 
 After commit 77b0f5d (KVM: nVMX: Ack and write vector info to
 intr_info if L1 asks us to), Acknowledge interrupt on exit behavior
 can be emulated. The current logic asks for the intr vector if it is a
 nested vmexit and VM_EXIT_ACK_INTR_ON_EXIT is set by L1. However, the intr
 vector for a posted intr can't be obtained by the generic read pending
 interrupt vector and intack routine; there is a requirement to sync from
 PIR to IRR. This patch fixes it by asking for the intr vector after
 syncing PIR to IRR.
 
 Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com

Reviewed-by: Yang Zhang yang.z.zh...@intel.com

 ---
  arch/x86/kvm/lapic.c | 1 +
  arch/x86/kvm/vmx.c   | 3 +++
  2 files changed, 4 insertions(+)
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index
 0069118..b7d45dc 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -1637,6 +1637,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
   apic_clear_irr(vector, apic);
   return vector;
  }
 +EXPORT_SYMBOL_GPL(kvm_get_apic_interrupt);
 
  void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
   struct kvm_lapic_state *s)
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index
 4ae5ad8..31f1479 100644 --- a/arch/x86/kvm/vmx.c +++
 b/arch/x86/kvm/vmx.c @@ -8697,6 +8697,9 @@ static void
 nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
  	if ((exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT) &&
  	    nested_exit_intr_ack_set(vcpu)) {
  		int irq = kvm_cpu_get_interrupt(vcpu);
 +
 +		if (irq < 0 && kvm_apic_vid_enabled(vcpu->kvm))
 +			irq = kvm_get_apic_interrupt(vcpu);
  		WARN_ON(irq < 0);
  		vmcs12->vm_exit_intr_info = irq |
  			INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR;


Best regards,
Yang




RE: [PATCH 3/4] KVM: Add SMAP support when setting CR4

2014-03-27 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2014-03-27:
 Il 27/03/2014 13:25, Feng Wu ha scritto:
  void update_permission_bitmask(struct kvm_vcpu *vcpu,
  		struct kvm_mmu *mmu, bool ept)
  {
  	unsigned bit, byte, pfec;
  	u8 map;
 -	bool fault, x, w, u, wf, uf, ff, smep;
 +	bool fault, x, w, u, wf, uf, ff, smep, smap;
 
  	smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
 +	smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
  	for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
  		pfec = byte << 1;
  		map = 0;
 @@ -3617,11 +3618,26 @@ static void update_permission_bitmask(struct
 kvm_vcpu *vcpu,
  		w |= !is_write_protection(vcpu) && !uf;
  		/* Disallow supervisor fetches of user code if cr4.smep */
  		x = !(smep && u && !uf);
 +
 +		/*
 +		 * SMAP: kernel-mode data accesses from user-mode
 +		 * mappings should fault. A fault is considered
 +		 * as a SMAP violation if all of the following
 +		 * conditions are true:
 +		 *   - X86_CR4_SMAP is set in CR4
 +		 *   - A user page is accessed
 +		 *   - !(CPL < 3 && X86_EFLAGS_AC is set)
 +		 *   - Page fault in kernel mode
 +		 */
 +		smap = smap && u && !uf &&
 +			!((kvm_x86_ops->get_cpl(vcpu) < 3) &&
 +			((kvm_x86_ops->get_rflags(vcpu) &
 +			X86_EFLAGS_AC) == 1));
 
 Unfortunately this doesn't work.
 
 The reason is that changing X86_EFLAGS_AC doesn't trigger
 update_permission_bitmask.  So the value of CPL < 3 && AC = 1 must not
 be checked in update_permission_bitmask; instead, it must be included
 in the index into the permissions array.  You can reuse the
 PFERR_RSVD_MASK bit, like
 
   	smapf = pfec & PFERR_RSVD_MASK;
   	...
   	smap = smap && smapf && u && !uf;
 
 The VCPU can then be passed to permission_fault in order to get the
 value of the CPL and the AC bit.

Is CPL check needed? Shouldn't it already have been covered by U bit?

 
 Please test nested virtualization too.  I think PFERR_RSVD_MASK should
 be removed in translate_nested_gpa.
 
 Paolo


Best regards,
Yang




RE: hyper-v support in KVM

2014-02-24 Thread Zhang, Yang Z
Vadim Rozenfeld wrote on 2014-02-24:
 On Mon, 2014-02-24 at 03:01 +, Zhang, Yang Z wrote:
 Vadim Rozenfeld wrote on 2014-02-14:
 On Fri, 2014-02-14 at 02:35 +, Liu, RongrongX wrote:
 Vadim Rozenfeld wrote on 2014-02-12:
 On Wed, 2014-02-12 at 01:33 +, Zhang, Yang Z wrote:
 Vadim Rozenfeld wrote on 2014-02-10:
 On Mon, 2014-02-10 at 08:21 +, Zhang, Yang Z wrote:
 Hi Vadim,
 
 Do you know the latest status of Hyper-v Enlightenments
 supporting in
 KVM?
 Like how many Hyper-v interfaces are supported in KVM?
 
 Hi Yang,
 
 There is no many at the moment. KVM currently supports the
 following Hyper-V features:
 
 Guest Spinlocks
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff539
 08
 1%
 28v=vs.85%29.aspx
 
 Local APIC MSR Accesses
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff542
 39
 6%
 28v=vs.85%29.aspx
 
 Reference Time Counter
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff542
 63
 7%
 28v=vs.85%29.aspx
 
 We are going to add: Reference TSC Page
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff542
 64
 3%
 28v=vs.85%29.aspx
 
 Lazy EOI support, maybe more.
 
 
 Thanks for your update. More questions: I want to measure the
 performance improvement with hyper-v features enabled. So I
 want to
 know: Are those features enabled by default in KVM?
 In KVM - yes, but you also need to specify them in QEMU command line.
 They can be enabled by -cpu features hv_vapic, hv_spinlocks,
 hv_time, and hv_relaxed
 Hi Vadim,
 in QEMU command line, how to enable these feature?
 I try it with
 1. [root@vt-snb9 ]#qemu-system-x86_64 -enable-kvm -m 2048 -smp 2
 -net none win8.1.img -cpu feature
 hv_vapic,+hv_spinlocks,+hv_time,+hv_relaxed
  qemu-system-x86_64: -cpu feature: drive with bus=0, unit=0
 (index=0) exists 2. [root@vt-snb9]# qemu-system-x86_64
 -enable-kvm -m 2048 -smp
 2 -net none win8.1.img -cpu
 qemu64,+hv_vapic,+hv_spinlocks,+hv_time,+hv_relaxed
 
 something like this: -cpu qemu64,
 +x2apic,family=0xf,hv_vapic,hv_spinlocks=0xfff,hv_relaxed,hv_time
 
 (for hv_vapic we also need x2apic to be enabled)
 
 I saw the win8.1 guest fail to boot with error code 0x005c after
 enabling
 hv_vapic on my Ivy Bridge-EP box. But it works well on my Sandy
 Bridge-EP. Any thoughts?
 
 What are the bug check parameters coming with
 HAL_INITIALIZATION_FAILED bug check? Is your guest 32 or 64-bit? How does it 
 work with Win8?
 
Parameters:
0x110, 0xffd11000, 0x19, 0xc001

Also saw this issue with Win8.


 Thanks,
 Vadim.
 
 
 
 Best regards,
 Vadim.
 
 CPU feature hv_vapic not found
 CPU feature hv_spinlocks not found CPU feature hv_time not found
 CPU feature hv_relaxed not found CPU feature hv_vapic not found
 CPU feature hv_spinlocks not found CPU feature hv_time not found
 CPU feature hv_relaxed not found VNC server running on `::1:5900'
 
 
 How to turn off/on it manually?
 Yes. From the QEMU command line.
 
 And how can I know whether guest is using it really?
 There are two options - printk from the KVM side or WinDbg from
 the guest side. But in case of hv_time you can check the value
 returned by QueryPerformanceFrequency
 http://msdn.microsoft.com/en-us/library/windows/desktop/ms644905
 %
 28v=vs.85%29.aspx
 it should be 10MHz
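
 For example, a quick user-mode check (a hypothetical test program, not from
 this thread):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
    	LARGE_INTEGER freq;

    	QueryPerformanceFrequency(&freq);
    	/* with hv_time, QPC is backed by the Hyper-V reference
    	 * counter, so this should print 10000000 (10 MHz) */
    	printf("%lld\n", (long long)freq.QuadPart);
    	return 0;
    }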
 
  Also, Do you have any performance data?
 http://www.linux-kvm.org/wiki/images/0/0a/2012-forum-kvm_hyperv.
 pd
 f
 pp 16, 18
 I compared DPC and ISR times with xperf for two cases - with and
 without enlightenment.
 I also have seen reports mentioned around 5-10 percent CPU usage
 drop on the host side, when loading guest with some disk-stress tests.
 
 
 Kind regards.
 Vadim.
 
 
 best regards
 yang
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm
 in the body of a message to majord...@vger.kernel.org More
 majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
 
 
 Best regards,
 Yang
 
 
 
 
 
 
 
 Best regards,
 Yang
 
 



Best regards,
Yang




RE: hyper-v support in KVM

2014-02-23 Thread Zhang, Yang Z
Vadim Rozenfeld wrote on 2014-02-14:
 On Fri, 2014-02-14 at 02:35 +, Liu, RongrongX wrote:
 Vadim Rozenfeld wrote on 2014-02-12:
 On Wed, 2014-02-12 at 01:33 +, Zhang, Yang Z wrote:
 Vadim Rozenfeld wrote on 2014-02-10:
 On Mon, 2014-02-10 at 08:21 +, Zhang, Yang Z wrote:
 Hi Vadim,
 
 Do you know the latest status of Hyper-v Enlightenments
 supporting in
 KVM?
 Like how many Hyper-v interfaces are supported in KVM?
 
 Hi Yang,
 
 There is no many at the moment. KVM currently supports the
 following Hyper-V features:
 
 Guest Spinlocks
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff539
 08
 1%
 28v=vs.85%29.aspx
 
 Local APIC MSR Accesses
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff542
 39
 6%
 28v=vs.85%29.aspx
 
 Reference Time Counter
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff542
 63
 7%
 28v=vs.85%29.aspx
 
 We are going to add: Reference TSC Page
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff542
 64
 3%
 28v=vs.85%29.aspx
 
 Lazy EOI support, maybe more.
 
 
 Thanks for your update. More questions: I want to measure the
 performance improvement with hyper-v features enabled. So I want to
 know: Are those features enabled by default in KVM?
 In KVM - yes, but you also need to specify them in QEMU command line.
 They can be enabled by -cpu features hv_vapic, hv_spinlocks,
 hv_time, and hv_relaxed
 Hi Vadim,
 in QEMU command line, how to enable these feature?
 I try it with
 1. [root@vt-snb9 ]#qemu-system-x86_64 -enable-kvm -m 2048 -smp 2
 -net none win8.1.img -cpu feature
 hv_vapic,+hv_spinlocks,+hv_time,+hv_relaxed
  qemu-system-x86_64: -cpu feature: drive with bus=0, unit=0
 (index=0) exists 2. [root@vt-snb9]# qemu-system-x86_64 -enable-kvm
 -m 2048 -smp
 2 -net none win8.1.img -cpu
 qemu64,+hv_vapic,+hv_spinlocks,+hv_time,+hv_relaxed
 
 something like this: -cpu qemu64,
 +x2apic,family=0xf,hv_vapic,hv_spinlocks=0xfff,hv_relaxed,hv_time
 
 (for hv_vapic we also need x2apic to be enabled)

I saw the win8.1 guest fail to boot with error code 0x005c after enabling 
hv_vapic on my Ivy Bridge-EP box. But it works well on my Sandy Bridge-EP. Any 
thoughts?


 
 Best regards,
 Vadim.
 
 CPU feature hv_vapic not found
 CPU feature hv_spinlocks not found
 CPU feature hv_time not found
 CPU feature hv_relaxed not found
 CPU feature hv_vapic not found
 CPU feature hv_spinlocks not found
 CPU feature hv_time not found
 CPU feature hv_relaxed not found
 VNC server running on `::1:5900'
 
 
 How to turn off/on it manually?
 Yes. From the QEMU command line.
 
 And how can I know whether guest is using it really?
 There are two options - printk from the KVM side or WinDbg from
 the guest side. But in case of hv_time you can check the value
 returned by QueryPerformanceFrequency
 http://msdn.microsoft.com/en-us/library/windows/desktop/ms644905%
 28v=vs.85%29.aspx
 it should be 10MHz
 
  Also, Do you have any performance data?
 http://www.linux-kvm.org/wiki/images/0/0a/2012-forum-kvm_hyperv.pd
 f
 pp 16, 18
 I compared DPC and ISR times with xperf for two cases - with and
 without enlightenment.
 I also have seen reports mentioned around 5-10 percent CPU usage
 drop on the host side, when loading guest with some disk-stress tests.
 
 
 Kind regards.
 Vadim.
 
 
 best regards
 yang
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm
 in the body of a message to majord...@vger.kernel.org More
 majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
 
 
 Best regards,
 Yang
 
 
 
 



Best regards,
Yang




RE: hyper-v support in KVM

2014-02-11 Thread Zhang, Yang Z
Vadim Rozenfeld wrote on 2014-02-10:
 On Mon, 2014-02-10 at 08:21 +, Zhang, Yang Z wrote:
 Hi Vadim,
 
 Do you know the latest status of Hyper-v Enlightenments supporting in KVM?
 Like how many Hyper-v interfaces are supported in KVM?
 
 Hi Yang,
 
 There is no many at the moment. KVM currently supports the following
 Hyper-V features:
 
 Guest Spinlocks
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff539081%
 28v=vs.85%29.aspx
 
 Local APIC MSR Accesses
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff542396%
 28v=vs.85%29.aspx
 
 Reference Time Counter
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff542637%
 28v=vs.85%29.aspx
 
 We are going to add: Reference TSC Page
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff542643%
 28v=vs.85%29.aspx
 
 Lazy EOI support, maybe more.
 

Thanks for your update. More questions:
I want to measure the performance improvement with the Hyper-V features enabled, so 
I want to know:
Are those features enabled by default in KVM? How do I turn them off/on manually? 
And how can I know whether the guest is really using them? Also, do you have any 
performance data?

 Kind regards.
 Vadim.
 
 
 best regards
 yang
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 



Best regards,
Yang




hyper-v support in KVM

2014-02-10 Thread Zhang, Yang Z
Hi Vadim, 

Do you know the latest status of Hyper-v Enlightenments supporting in KVM? Like 
how many Hyper-v interfaces are supported in KVM?

best regards
yang



RE: [PATCH] iommu/intel: SNP bit is not dependent on iommu domain coherency

2014-01-06 Thread Zhang, Yang Z
Alex Williamson wrote on 2013-12-24:
 David,
 
 Any comments on this patch?  Thanks,
 

Hi Alex,

There do exist some IOMMUs that treat the SNP bit in the PTE as reserved (0) and 
will raise a reserved-field violation fault if it is set while the hardware does not 
support snoop control (bit 7 in ECAP_REG is 0). So your patch seems wrong to me.
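
For reference, the capability bit in question and the guard the current code
derives from it (ecap_sc_support is the existing macro in intel-iommu.h):

    /* ECAP_REG bit 7: Snoop Control (SC) */
    #define ecap_sc_support(e)	((e >> 7) & 0x1)

    /* SNP is only safe when every unit in the domain reports SC == 1 */
    if ((iommu_prot & IOMMU_CACHE) && dmar_domain->iommu_snooping)
    	prot |= DMA_PTE_SNP;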

 Alex
 
 On Tue, 2013-10-29 at 10:21 -0600, Alex Williamson wrote:
 The setting of the SNP bit in the intel-iommu page tables should not
 be dependent on the current capability of the iommu domain.  The
 current VT-d spec (2.2) indicates the SNP bit is treated as
 reserved[0] by hardware implementations not supporting Snoop Control.
 Furthermore, section 3.7.3 indicates:
 
   If the Snoop Control (SC) field in extended capability Register is
   reported as 0, snoop behavior for access to the page mapped through
   second-level translation is determined by the no-snoop attribute in
   the request.
 This all seems to indicate that hardware incapable of Snoop Control
 will handle the SNP bit as zero regardless of the value stored in
 the PTE.
 
 The trouble with the current implementation is that mapping flags
 depend on the state of the iommu domain at the time of the mapping,
 yet no attempt is made to update existing mappings when the iommu
 domain composition changes.  This leaves the iommu domain in a state
 where some mappings may enforce coherency, others do not, and the
 user of the IOMMU API has no ability to later enable the desired
 flags atomically with respect to DMA.
 
 If we always honor the IOMMU_CACHE flag then an IOMMU API user who
 specifies IOMMU_CACHE for all mappings can assume that the coherency
 of the mappings within a domain follow the coherency capability of
 the domain itself.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 ---
  drivers/iommu/intel-iommu.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 diff --git a/drivers/iommu/intel-iommu.c
 b/drivers/iommu/intel-iommu.c index 15e9b57..c46c6a6 100644
 --- a/drivers/iommu/intel-iommu.c
 +++ b/drivers/iommu/intel-iommu.c
 @@ -4084,7 +4084,7 @@ static int intel_iommu_map(struct iommu_domain
 *domain,
  prot |= DMA_PTE_READ;
 	if (iommu_prot & IOMMU_WRITE)
 		prot |= DMA_PTE_WRITE;
 -	if ((iommu_prot & IOMMU_CACHE) && dmar_domain->iommu_snooping)
 +	if (iommu_prot & IOMMU_CACHE)
  prot |= DMA_PTE_SNP;
  
  max_addr = iova + size;
 



Best regards,
Yang



RE: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer

2013-09-05 Thread Zhang, Yang Z
Arthur Chunqi Li wrote on 2013-09-04:
 This patch contains the following two changes:
 1. Fix the bug in nested preemption timer support. If vmexit L2-L0 with some
 reasons not emulated by L1, preemption timer value should be save in such
 exits.
 2. Add support of Save VMX-preemption timer value VM-Exit controls to
 nVMX.
 
 With this patch, nested VMX preemption timer features are fully supported.
 
 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
 ---
 This series depends on queue.
 
  arch/x86/include/uapi/asm/msr-index.h |1 +
  arch/x86/kvm/vmx.c|   51
 ++---
  2 files changed, 48 insertions(+), 4 deletions(-)
 
 diff --git a/arch/x86/include/uapi/asm/msr-index.h
 b/arch/x86/include/uapi/asm/msr-index.h
 index bb04650..b93e09a 100644
 --- a/arch/x86/include/uapi/asm/msr-index.h
 +++ b/arch/x86/include/uapi/asm/msr-index.h
 @@ -536,6 +536,7 @@
 
  /* MSR_IA32_VMX_MISC bits */
  #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << 29)
 +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE   0x1F
  /* AMD-V MSRs */
 
  #define MSR_VM_CR   0xc0010114
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1f1da43..870caa8
 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -2204,7 +2204,14 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 #ifdef CONFIG_X86_64
 	VM_EXIT_HOST_ADDR_SPACE_SIZE |
 #endif
 -	VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT;
 +	VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
 +	VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
 +	if (!(nested_vmx_pinbased_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER))
 +		nested_vmx_exit_ctls_high &=
 +			(~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
 +	if (!(nested_vmx_exit_ctls_high & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER))
 +		nested_vmx_pinbased_ctls_high &=
 +			(~PIN_BASED_VMX_PREEMPTION_TIMER);
The following logic is clearer:
if (nested_vmx_pinbased_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER)
	nested_vmx_exit_ctls_high |= VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;

BTW: I don't see that nested_vmx_setup_ctls_msrs() takes the hardware's 
capability into account when exposing those vmx features (not just the preemption timer) to L1. 
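
For example, something like this sketch (hypothetical placement inside
nested_vmx_setup_ctls_msrs(); rdmsr is the kernel's two-u32 variant):

    	u32 pin_low, pin_high;

    	rdmsr(MSR_IA32_VMX_PINBASED_CTLS, pin_low, pin_high);
    	if (!(pin_high & PIN_BASED_VMX_PREEMPTION_TIMER))
    		nested_vmx_pinbased_ctls_high &=
    			~PIN_BASED_VMX_PREEMPTION_TIMER;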

   nested_vmx_exit_ctls_high |=
 (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 VM_EXIT_LOAD_IA32_EFER);
 
 @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct kvm_vcpu
 *vcpu, u64 *info1, u64 *info2)
   *info2 = vmcs_read32(VM_EXIT_INTR_INFO);  }
 
 +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu)
 +{
 +	u64 delta_tsc_l1;
 +	u32 preempt_val_l1, preempt_val_l2, preempt_scale;
 +
 +	preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) &
 +			MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE;
 +	preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);
 +	delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu,
 +			native_read_tsc()) - vcpu->arch.last_guest_tsc;
 +	preempt_val_l1 = delta_tsc_l1 >> preempt_scale;
 +	if (preempt_val_l2 - preempt_val_l1 < 0)
 +		preempt_val_l2 = 0;
 +	else
 +		preempt_val_l2 -= preempt_val_l1;
 +	vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2);
 +}
  /*
   * The guest has exited.  See if we can fix it or if we need userspace
   * assistance.
 @@ -6716,6 +6740,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 exit_reason = vmx->exit_reason;
 	u32 vectoring_info = vmx->idt_vectoring_info;
 +	int ret;
 
 	/* If guest state is invalid, start emulating */
 	if (vmx->emulation_required)
 @@ -6795,12 +6820,15 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
 
 	if (exit_reason < kvm_vmx_max_exit_handlers &&
 	    kvm_vmx_exit_handlers[exit_reason])
 -		return kvm_vmx_exit_handlers[exit_reason](vcpu);
 +		ret = kvm_vmx_exit_handlers[exit_reason](vcpu);
 	else {
 		vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
 		vcpu->run->hw.hardware_exit_reason = exit_reason;
 +		ret = 0;
 	}
 -	return 0;
 +	if (is_guest_mode(vcpu))
 +		nested_adjust_preemption_timer(vcpu);
Move this earlier, before the exit handlers return, to avoid the changes for ret.
 + return ret;
  }
 
  static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 @@ -7518,6 +7546,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
  {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 exec_control;
 +	u32 exit_control;
 
 	vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
 	vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
 @@ -7691,7 +7720,10 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	 * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
 	 * bits are further modified by

RE: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer

2013-09-05 Thread Zhang, Yang Z
Arthur Chunqi Li wrote on 2013-09-05:
 On Thu, Sep 5, 2013 at 3:45 PM, Zhang, Yang Z yang.z.zh...@intel.com
 wrote:
  Arthur Chunqi Li wrote on 2013-09-04:
  This patch contains the following two changes:
  1. Fix the bug in nested preemption timer support. If vmexit L2-L0
  with some reasons not emulated by L1, preemption timer value should
  be save in such exits.
  2. Add support of Save VMX-preemption timer value VM-Exit controls
  to nVMX.
 
  With this patch, nested VMX preemption timer features are fully supported.
 
  Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
  ---
  This series depends on queue.
 
   arch/x86/include/uapi/asm/msr-index.h |1 +
   arch/x86/kvm/vmx.c|   51
  ++---
   2 files changed, 48 insertions(+), 4 deletions(-)
 
  diff --git a/arch/x86/include/uapi/asm/msr-index.h
  b/arch/x86/include/uapi/asm/msr-index.h
  index bb04650..b93e09a 100644
  --- a/arch/x86/include/uapi/asm/msr-index.h
  +++ b/arch/x86/include/uapi/asm/msr-index.h
  @@ -536,6 +536,7 @@
 
   /* MSR_IA32_VMX_MISC bits */
   #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << 29)
  +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE   0x1F
   /* AMD-V MSRs */
 
   #define MSR_VM_CR   0xc0010114
  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index
  1f1da43..870caa8
  100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -2204,7 +2204,14 @@ static __init void nested_vmx_setup_ctls_msrs(void)
   #ifdef CONFIG_X86_64
  		VM_EXIT_HOST_ADDR_SPACE_SIZE |
   #endif
  -		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT;
  +		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
  +		VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
  +	if (!(nested_vmx_pinbased_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER))
  +		nested_vmx_exit_ctls_high &= (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
  +	if (!(nested_vmx_exit_ctls_high & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER))
  +		nested_vmx_pinbased_ctls_high &= (~PIN_BASED_VMX_PREEMPTION_TIMER);
  The following logic is clearer:
  if (nested_vmx_pinbased_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER)
  	nested_vmx_exit_ctls_high |= VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
 Here is my consideration: this logic is wrong if the CPU supports
 PIN_BASED_VMX_PREEMPTION_TIMER but doesn't support
 VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, though I don't know if that case ever
 occurs. So the code above reads the MSR and keeps only the features the
 hardware supports, and here I just check whether these two features are
 supported simultaneously.
 
No. Only VM_EXIT_SAVE_VMX_PREEMPTION_TIMER depends on 
PIN_BASED_VMX_PREEMPTION_TIMER. PIN_BASED_VMX_PREEMPTION_TIMER is an 
independent feature

 You remind me that this piece of code can be written like this:
 if (!(nested_vmx_pin_based_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER) ||
     !(nested_vmx_exit_ctls_high & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) {
 	nested_vmx_exit_ctls_high &= (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
 	nested_vmx_pinbased_ctls_high &= (~PIN_BASED_VMX_PREEMPTION_TIMER);
 }
 
 This reflects the logic I describe above, that these two flags should be
 supported simultaneously, and causes less confusion.
 
  BTW: I don't see nested_vmx_setup_ctls_msrs() considers the hardware's
 capability when expose those vmx features(not just preemption timer) to L1.
 The code just above here, when setting the pin-based controls for nested vmx,
 first rdmsrs MSR_IA32_VMX_PINBASED_CTLS and then uses that to mask off the
 features the hardware doesn't support. So do the other control fields.
 
Yes, I saw it.

nested_vmx_exit_ctls_high |=
  (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
  VM_EXIT_LOAD_IA32_EFER);
 
  @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct kvm_vcpu
  *vcpu, u64 *info1, u64 *info2)
*info2 = vmcs_read32(VM_EXIT_INTR_INFO);  }
 
  +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu)
  +{
  +	u64 delta_tsc_l1;
  +	u32 preempt_val_l1, preempt_val_l2, preempt_scale;
  +
  +	preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) &
  +			MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE;
  +	preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);
  +	delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu,
  +			native_read_tsc()) - vcpu->arch.last_guest_tsc;
  +	preempt_val_l1 = delta_tsc_l1 >> preempt_scale;
  +	if (preempt_val_l2 - preempt_val_l1 < 0)
  +		preempt_val_l2 = 0;
  +	else
  +		preempt_val_l2 -= preempt_val_l1;
  +	vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2);
  +}
   /*
* The guest has exited.  See if we can fix it or if we need userspace
* assistance.
  @@ -6716,6 +6740,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 exit_reason = vmx->exit_reason;
 	u32

RE: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer

2013-09-05 Thread Zhang, Yang Z
Arthur Chunqi Li wrote on 2013-09-05:
  Arthur Chunqi Li wrote on 2013-09-05:
  On Thu, Sep 5, 2013 at 3:45 PM, Zhang, Yang Z
  yang.z.zh...@intel.com
  wrote:
   Arthur Chunqi Li wrote on 2013-09-04:
   This patch contains the following two changes:
   1. Fix the bug in nested preemption timer support. If vmexit
   L2-L0 with some reasons not emulated by L1, preemption timer
   value should be save in such exits.
   2. Add support of Save VMX-preemption timer value VM-Exit
   controls to nVMX.
  
   With this patch, nested VMX preemption timer features are fully
 supported.
  
   Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
   ---
   This series depends on queue.
  
arch/x86/include/uapi/asm/msr-index.h |1 +
arch/x86/kvm/vmx.c|   51
   ++---
2 files changed, 48 insertions(+), 4 deletions(-)
  
   diff --git a/arch/x86/include/uapi/asm/msr-index.h
   b/arch/x86/include/uapi/asm/msr-index.h
   index bb04650..b93e09a 100644
   --- a/arch/x86/include/uapi/asm/msr-index.h
   +++ b/arch/x86/include/uapi/asm/msr-index.h
   @@ -536,6 +536,7 @@
  
/* MSR_IA32_VMX_MISC bits */
 #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << 29)
   +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE   0x1F
/* AMD-V MSRs */
  
#define MSR_VM_CR   0xc0010114
   diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index
   1f1da43..870caa8
   100644
   --- a/arch/x86/kvm/vmx.c
   +++ b/arch/x86/kvm/vmx.c
   @@ -2204,7 +2204,14 @@ static __init void nested_vmx_setup_ctls_msrs(void)
    #ifdef CONFIG_X86_64
   		VM_EXIT_HOST_ADDR_SPACE_SIZE |
    #endif
   -		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT;
   +		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
   +		VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
   +	if (!(nested_vmx_pinbased_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER))
   +		nested_vmx_exit_ctls_high &= (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
   +	if (!(nested_vmx_exit_ctls_high & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER))
   +		nested_vmx_pinbased_ctls_high &= (~PIN_BASED_VMX_PREEMPTION_TIMER);
   The following logic is clearer:
   if (nested_vmx_pinbased_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER)
   	nested_vmx_exit_ctls_high |= VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
  Here I have such consideration: this logic is wrong if CPU support
  PIN_BASED_VMX_PREEMPTION_TIMER but doesn't support
  VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, though I don't know if this
 does
  occurs. So the codes above reads the MSR and reserves the features it
  supports, and here I just check if these two features are supported
  simultaneously.
 
  No. Only VM_EXIT_SAVE_VMX_PREEMPTION_TIMER depends on
  PIN_BASED_VMX_PREEMPTION_TIMER.
 PIN_BASED_VMX_PREEMPTION_TIMER is an
  independent feature
 
  You remind me that this piece of code can be written like this:
  if (!(nested_vmx_pin_based_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER) ||
      !(nested_vmx_exit_ctls_high & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) {
  	nested_vmx_exit_ctls_high &= (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
  	nested_vmx_pinbased_ctls_high &= (~PIN_BASED_VMX_PREEMPTION_TIMER);
  }
 
  This may reflect the logic I describe above that these two flags
  should support simultaneously, and brings less confusion.
  
   BTW: I don't see nested_vmx_setup_ctls_msrs() considers the
   hardware's
  capability when expose those vmx features(not just preemption timer) to L1.
  The codes just above here, when setting pinbased control for nested
  vmx, it firstly rdmsr MSR_IA32_VMX_PINBASED_CTLS, then use this to
  mask the features hardware not support. So does other control fields.
  
  Yes, I saw it.
 
 nested_vmx_exit_ctls_high |=
   (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
  
 VM_EXIT_LOAD_IA32_EFER);
  
   @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct
   kvm_vcpu *vcpu, u64 *info1, u64 *info2)
 *info2 = vmcs_read32(VM_EXIT_INTR_INFO);  }
  
   +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu)
   +{
   +	u64 delta_tsc_l1;
   +	u32 preempt_val_l1, preempt_val_l2, preempt_scale;
   +
   +	preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) &
   +			MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE;
   +	preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);
   +	delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu,
   +			native_read_tsc()) - vcpu->arch.last_guest_tsc;
   +	preempt_val_l1 = delta_tsc_l1 >> preempt_scale;
   +	if (preempt_val_l2 - preempt_val_l1 < 0)
   +		preempt_val_l2 = 0;
   +	else
   +		preempt_val_l2 -= preempt_val_l1;
   +	vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2);
   +}
/*
 * The guest has exited.  See if we can fix it or if we need userspace
 * assistance.
   @@ -6716,6 +6740,7 @@ static int vmx_handle_exit(struct kvm_vcpu

RE: [PATCH v2 5/8] KVM: nVMX: Fix guest CR3 read-back on VM-exit

2013-08-06 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-08-06:
 On Tue, Aug 06, 2013 at 10:39:59AM +0200, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 If nested EPT is enabled, the L2 guest may change CR3 without any exits.
 We therefore have to read the current value from the VMCS when
 switching to L1. However, if paging wasn't enabled, L0 tracks L2's
 CR3, and
 GUEST_CR3 rather contains the real-mode identity map. So we need to
 retrieve CR3 from the architectural state after conditionally
 updating it - and this is what kvm_read_cr3 does.
 
 I have a headache from trying to think about it already, but shouldn't
 L1 be the one who setups identity map for L2? I traced what
 vmcs_read64(GUEST_CR3)/kvm_read_cr3(vcpu) return here and do not see
Here is my understanding:
In vmx_set_cr3(), if EPT is enabled, it checks whether the target vcpu has
paging enabled. While L2 runs in real mode the target vcpu is not paging, so
L0's identity map is used for L2. If you read GUEST_CR3 from the VMCS, you may
therefore get the identity map L0 set up for L2, not L1's.
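(For reference, the relevant vmx_set_cr3() logic is roughly the following; this
is a paraphrase from memory of that era's vmx.c, not a verbatim quote:)

	static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
	{
		unsigned long guest_cr3 = cr3;

		if (enable_ept) {
			vmcs_write64(EPT_POINTER, construct_eptp(cr3));
			/* vcpu not paging: point GUEST_CR3 at L0's
			 * real-mode identity map */
			guest_cr3 = is_paging(vcpu) ? kvm_read_cr3(vcpu) :
				vcpu->kvm->arch.ept_identity_map_addr;
			ept_load_pdptrs(vcpu);
		}
		vmcs_writel(GUEST_CR3, guest_cr3);
	}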

 different values in real mode.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
  arch/x86/kvm/vmx.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index
 b482d47..09666aa 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -8106,7 +8106,7 @@ static void prepare_vmcs12(struct kvm_vcpu
 *vcpu, struct vmcs12 *vmcs12)
   * Additionally, restore L2's PDPTR to vmcs12.
   */
  if (enable_ept) {
 -	vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
 +	vmcs12->guest_cr3 = kvm_read_cr3(vcpu);
 	vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
 	vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
 	vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
 --
 1.7.3.4



Best regards,
Yang



RE: [PATCH v2 5/8] KVM: nVMX: Fix guest CR3 read-back on VM-exit

2013-08-06 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-08-06:
 On Tue, Aug 06, 2013 at 11:44:41AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-08-06:
 On Tue, Aug 06, 2013 at 10:39:59AM +0200, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 If nested EPT is enabled, the L2 guest may change CR3 without any exits.
 We therefore have to read the current value from the VMCS when
 switching to L1. However, if paging wasn't enabled, L0 tracks
 L2's CR3, and
 GUEST_CR3 rather contains the real-mode identity map. So we need
 to retrieve CR3 from the architectural state after conditionally
 updating it - and this is what kvm_read_cr3 does.
 
 I have a headache from trying to think about it already, but
 shouldn't
 L1 be the one who setups identity map for L2? I traced what
 vmcs_read64(GUEST_CR3)/kvm_read_cr3(vcpu) return here and do not
 see
 Here is my understanding:
 In vmx_set_cr3(), if enabled ept, it will check whether target vcpu
 is enabling
 paging. When L2 running in real mode, then target vcpu is not enabling
 paging and it will use L0's identity map for L2. If you read GUEST_CR3
 from VMCS, then you may get the L2's identity map not L1's.
 
 Yes, but why it makes sense to use L0 identity map for L2? I didn't see
 different vmcs_read64(GUEST_CR3)/kvm_read_cr3(vcpu) values because L0
 and L1 use the same identity map address. When I changed identity
 address L1 configures vmcs_read64(GUEST_CR3)/kvm_read_cr3(vcpu) are
 indeed different, but the real CR3 L2 uses points to L0 identity map. If
 I zero L1 identity map page L2 still works.

If L2 is in real mode, then L2PA == L1PA, so L0's identity map also works while
L2 is in real mode.

Best regards,
Yang




RE: [PATCH v2 5/8] KVM: nVMX: Fix guest CR3 read-back on VM-exit

2013-08-06 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-08-06:
 On Tue, Aug 06, 2013 at 02:12:51PM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-08-06:
 On Tue, Aug 06, 2013 at 11:44:41AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-08-06:
 On Tue, Aug 06, 2013 at 10:39:59AM +0200, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 If nested EPT is enabled, the L2 guest may change CR3 without any
 exits. We therefore have to read the current value from the VMCS
 when switching to L1. However, if paging wasn't enabled, L0 tracks
 L2's CR3, and GUEST_CR3 rather contains the real-mode identity map.
 So we need to retrieve CR3 from the architectural state after
 conditionally updating it - and this is what kvm_read_cr3 does.
 
 I have a headache from trying to think about it already, but
 shouldn't
 L1 be the one who setups identity map for L2? I traced what
 vmcs_read64(GUEST_CR3)/kvm_read_cr3(vcpu) return here and do not
 see
 Here is my understanding:
 In vmx_set_cr3(), if enabled ept, it will check whether target
 vcpu is enabling
 paging. When L2 running in real mode, then target vcpu is not
 enabling paging and it will use L0's identity map for L2. If you
 read GUEST_CR3 from VMCS, then you may get the L2's identity map
 not
 L1's.
 
 Yes, but why it makes sense to use L0 identity map for L2? I didn't
 see different vmcs_read64(GUEST_CR3)/kvm_read_cr3(vcpu) values because
 L0 and L1 use the same identity map address. When I changed identity
 address L1 configures vmcs_read64(GUEST_CR3)/kvm_read_cr3(vcpu) are
 indeed different, but the real CR3 L2 uses points to L0 identity map.
 If I zero L1 identity map page L2 still works.
 
 If L2 in real mode, then L2PA == L1PA. So L0's identity map also works
 if L2 is in real mode.
 
 That's not the point. It may work accidentally for kvm on kvm, but what
 if another hypervisor plays different tricks and builds a different ident map
 for its guest?
Yes, if another hypervisor doesn't build the 1:1 mapping for its guest, it will
fail to work. But I cannot imagine what kind of hypervisor would do this, or
what the purpose would be.
Anyway, the current logic is definitely wrong. It should use L1's identity map
instead of L0's.

Best regards,
Yang




RE: [PATCH v6 01/15] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1

2013-08-02 Thread Zhang, Yang Z
Jan Kiszka wrote on 2013-08-02:
 On 2013-08-02 05:04, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-08-01:
 From: Nadav Har'El n...@il.ibm.com
 
 Recent KVM, since
 http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
 switch the EFER MSR when EPT is used and the host and guest have 
 different NX bits. So if we add support for nested EPT (L1 guest 
 using EPT to run L2) and want to be able to run recent KVM as L1, we 
 need to allow L1 to use this EFER switching feature.
 
 To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if 
 available, and if it isn't, it uses the generic 
 VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former (the 
 latter is still unsupported).
 
 Nested entry and exit emulation (prepare_vmcs_02 and 
 load_vmcs12_host_state,
 respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly.
 So all that's left to do in this patch is to properly advertise this 
 feature to L1.
 
 Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, 
 by using vmx_set_efer (which itself sets one of several vmcs02 
 fields), so we always support this feature, regardless of whether 
 the host supports it.
 
 Reviewed-by: Orit Wasserman owass...@redhat.com
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
 Signed-off-by: Xinhao Xu xinhao...@intel.com
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 Signed-off-by: Gleb Natapov g...@redhat.com
 ---
  arch/x86/kvm/vmx.c |   23 ---
  1 file changed, 16 insertions(+), 7 deletions(-) diff --git 
 a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e999dc7..27efa6a 
 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -2198,7 +2198,8 @@ static __init void
 nested_vmx_setup_ctls_msrs(void)
  #else
 nested_vmx_exit_ctls_high = 0;
  #endif
 -   nested_vmx_exit_ctls_high |=
 VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
 +   nested_vmx_exit_ctls_high |=
 (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 + VM_EXIT_LOAD_IA32_EFER);
 
 /* entry controls */
 rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
 @@ -2207,8 +2208,8 @@ static __init void
 nested_vmx_setup_ctls_msrs(void)
 nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
 nested_vmx_entry_ctls_high =   VM_ENTRY_LOAD_IA32_PAT |
  VM_ENTRY_IA32E_MODE;
 -   nested_vmx_entry_ctls_high |=
 VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
 -
 +   nested_vmx_entry_ctls_high |=
 (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
 +  VM_ENTRY_LOAD_IA32_EFER);
 Just saw it, we didn't expose bit22 save VMX-preemption timer in 
 vm-exit
 control but we already allowed guest to set active VMX-preemption 
 timer in pin based vm-execution conrols. This is wrong.
 
 Does the presence of preemption timer support imply that saving its 
 value is also supported? Then we could demand this combination (ie. do 
 not expose preemption timer support at all to L1 if value saving is
 missing) and build our preemption timer emulation on top.
 
I don't see that we save the preemption timer value to vmcs12 in
prepare_vmcs12(). Will it be saved automatically?
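(If not, I'd expect the eventual fix to look something like this in
prepare_vmcs12(); a sketch only - the vmcs12 field name
vmx_preemption_timer_value is my assumption, not an existing field:)

	if ((vmcs12->pin_based_vm_exec_control &
	     PIN_BASED_VMX_PREEMPTION_TIMER) &&
	    (vmcs12->vm_exit_controls & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER))
		vmcs12->vmx_preemption_timer_value =
			vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);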

 There is more broken /wrt VMX preemption timer, patches are welcome.
 Arthur will also try to develop test cases for it. But that topic is 
 unrelated to this series.
 
 Jan
 
 
 /* cpu-based controls */
 rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
 nested_vmx_procbased_ctls_low,
 nested_vmx_procbased_ctls_high);
 @@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu 
 *vcpu, struct vmcs12 *vmcs12)
 	vcpu->arch.cr0_guest_owned_bits = ~vmcs12->cr0_guest_host_mask;
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 
 -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
 -	vmcs_write32(VM_EXIT_CONTROLS,
 -		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
 -	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
 +	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
 +	 * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
 +	 * bits are further modified by vmx_set_efer() below.
 +	 */
 +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
Should we mention that the save VMX-preemption timer bit must use host|guest,
not just host?

 +
 +	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
 +	 * emulated by vmx_set_efer(), below.
 +	 */
 +	vmcs_write32(VM_ENTRY_CONTROLS,
 +		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
 +		~VM_ENTRY_IA32E_MODE) |
 +		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
 
 	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
 --
 1.7.10.4
 
 Best regards,
 Yang



Best regards,
Yang




RE: [PATCH v6 01/15] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1

2013-08-01 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-08-01:
 From: Nadav Har'El n...@il.ibm.com
 
 Recent KVM, since
 http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
 switch the EFER MSR when EPT is used and the host and guest have different
 NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
 and want to be able to run recent KVM as L1, we need to allow L1 to use this
 EFER switching feature.
 
 To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if
 available,
 and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
 support for the former (the latter is still unsupported).
 
 Nested entry and exit emulation (prepare_vmcs_02 and
 load_vmcs12_host_state,
 respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So
 all
 that's left to do in this patch is to properly advertise this feature to L1.
 
 Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by
 using
 vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
 support this feature, regardless of whether the host supports it.
 
 Reviewed-by: Orit Wasserman owass...@redhat.com
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
 Signed-off-by: Xinhao Xu xinhao...@intel.com
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 Signed-off-by: Gleb Natapov g...@redhat.com
 ---
  arch/x86/kvm/vmx.c |   23 ---
  1 file changed, 16 insertions(+), 7 deletions(-)
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index e999dc7..27efa6a 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -2198,7 +2198,8 @@ static __init void
 nested_vmx_setup_ctls_msrs(void)
  #else
   nested_vmx_exit_ctls_high = 0;
  #endif
 - nested_vmx_exit_ctls_high |=
 VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
 + nested_vmx_exit_ctls_high |=
 (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 +   VM_EXIT_LOAD_IA32_EFER);
 
   /* entry controls */
   rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
 @@ -2207,8 +2208,8 @@ static __init void
 nested_vmx_setup_ctls_msrs(void)
   nested_vmx_entry_ctls_low =
 VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
   nested_vmx_entry_ctls_high =
   VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
 - nested_vmx_entry_ctls_high |=
 VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
 -
 + nested_vmx_entry_ctls_high |=
 (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
 +VM_ENTRY_LOAD_IA32_EFER);
Just saw it: we didn't expose bit 22 (save VMX-preemption timer) in the vm-exit
controls, but we already allow the guest to set activate VMX-preemption timer
in the pin-based vm-execution controls. This is wrong.

   /* cpu-based controls */
   rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
   nested_vmx_procbased_ctls_low,
 nested_vmx_procbased_ctls_high);
 @@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu
 *vcpu, struct vmcs12 *vmcs12)
 	vcpu->arch.cr0_guest_owned_bits = ~vmcs12->cr0_guest_host_mask;
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 
 -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
 -	vmcs_write32(VM_EXIT_CONTROLS,
 -		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
 -	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
 +	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
 +	 * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
 +	 * bits are further modified by vmx_set_efer() below.
 +	 */
 +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
Should we mention that the save VMX-preemption timer bit must use host|guest,
not just host?

 +
 +	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
 +	 * emulated by vmx_set_efer(), below.
 +	 */
 +	vmcs_write32(VM_ENTRY_CONTROLS,
 +		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
 +		~VM_ENTRY_IA32E_MODE) |
 +		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
 
 	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
 --
 1.7.10.4

Best regards,
Yang




RE: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1

2013-07-08 Thread Zhang, Yang Z
 -Original Message-
 From: Gleb Natapov [mailto:g...@redhat.com]
 Sent: Monday, July 08, 2013 8:38 PM
 To: Zhang, Yang Z
 Cc: Jan Kiszka; Paolo Bonzini; Nakajima, Jun; kvm@vger.kernel.org
 Subject: Re: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit
 controls for L1
 
 On Thu, Jul 04, 2013 at 08:42:53AM +, Zhang, Yang Z wrote:
  Gleb Natapov wrote on 2013-07-02:
   On Tue, Jul 02, 2013 at 05:34:56PM +0200, Jan Kiszka wrote:
   On 2013-07-02 17:15, Gleb Natapov wrote:
   On Tue, Jul 02, 2013 at 04:28:56PM +0200, Jan Kiszka wrote:
   On 2013-07-02 15:59, Gleb Natapov wrote:
   On Tue, Jul 02, 2013 at 03:01:24AM +, Zhang, Yang Z wrote:
   Since this series is pending in mail list for long time. And
   it's really a big feature for Nested. Also, I doubt the
   original authors(Jun and Nahav)should not have enough time to
 continue it.
   So I will pick it up. :)
  
   See comments below:
  
   Paolo Bonzini wrote on 2013-05-20:
   Il 19/05/2013 06:52, Jun Nakajima ha scritto:
   From: Nadav Har'El n...@il.ibm.com
  
   Recent KVM, since
   http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
   switch the EFER MSR when EPT is used and the host and guest
 have
   different NX bits. So if we add support for nested EPT (L1 guest
   using EPT to run L2) and want to be able to run recent KVM as L1,
   we need to allow L1 to use this EFER switching feature.
  
   To do this EFER switching, KVM uses
 VM_ENTRY/EXIT_LOAD_IA32_EFER
   if available, and if it isn't, it uses the generic
   VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the
 former
   (the latter is still unsupported).
  
   Nested entry and exit emulation (prepare_vmcs_02 and
   load_vmcs12_host_state, respectively) already handled
   VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all that's left to do
   in this patch is to properly advertise this feature to L1.
  
   Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are
 emulated by
   L0, by using vmx_set_efer (which itself sets one of several
   vmcs02 fields), so we always support this feature, regardless of
   whether the host supports it.
  
   Signed-off-by: Nadav Har'El n...@il.ibm.com
   Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
   Signed-off-by: Xinhao Xu xinhao...@intel.com
   ---
arch/x86/kvm/vmx.c | 23 ---
1 file changed, 16 insertions(+), 7 deletions(-)
   diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index
   260a919..fb9cae5 100644
   --- a/arch/x86/kvm/vmx.c
   +++ b/arch/x86/kvm/vmx.c
   @@ -2192,7 +2192,8 @@ static __init void
   nested_vmx_setup_ctls_msrs(void)  #else
  nested_vmx_exit_ctls_high = 0;  #endif
   -  nested_vmx_exit_ctls_high |=
 VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
   +  nested_vmx_exit_ctls_high |=
 (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR
   | +  VM_EXIT_LOAD_IA32_EFER);
  
  /* entry controls */
  rdmsr(MSR_IA32_VMX_ENTRY_CTLS, @@ -2201,8 +2202,8
   @@ static
   __init void nested_vmx_setup_ctls_msrs(void)
  nested_vmx_entry_ctls_low =
 VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
  nested_vmx_entry_ctls_high =
   VM_ENTRY_LOAD_IA32_PAT |
VM_ENTRY_IA32E_MODE;
   -  nested_vmx_entry_ctls_high |=
   VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; -
   +  nested_vmx_entry_ctls_high |=
   (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | +
   VM_ENTRY_LOAD_IA32_EFER);
  /* cpu-based controls */
  rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
  nested_vmx_procbased_ctls_low,
   nested_vmx_procbased_ctls_high); @@ -7492,10 +7493,18 @@
 static
   void prepare_vmcs02(struct kvm_vcpu *vcpu,
   struct vmcs12 *vmcs12)
  vcpu-arch.cr0_guest_owned_bits =
~vmcs12-cr0_guest_host_mask;
   vmcs_writel(CR0_GUEST_HOST_MASK,
   ~vcpu-arch.cr0_guest_owned_bits);
  
   -  /* Note: IA32_MODE, LOAD_IA32_EFER are modified by
   vmx_set_efer
   below */
   -  vmcs_write32(VM_EXIT_CONTROLS, -
   vmcs12-vm_exit_controls |
   vmcs_config.vmexit_ctrl); -vmcs_write32(VM_ENTRY_CONTROLS,
   vmcs12-vm_entry_controls | +  /* L2-L1 exit controls are
   emulated - the hardware exit is +to L0 so + * we should use
 its
   exit controls. Note that IA32_MODE, LOAD_IA32_EFER +*
 bits are
   further modified by vmx_set_efer() below. + */
   +  vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
   This is wrong. We cannot use L0 exit control directly.
   LOAD_PERF_GLOBAL_CTRL, LOAD_HOST_EFE, LOAD_HOST_PAT,
   ACK_INTR_ON_EXIT should use host's exit control. But others, still
   need use (vmcs12|host).
  
   I do not see why. We always intercept DR7/PAT/EFER, so save is
   emulated too. Host address space size always come from L0 and
   preemption timer is not supported for nested IIRC and when it
   will be host will have to save it on exit anyway for correct 
   emulation.
  
   Preemption timer is already supported and works fine as far as I 
   tested.
   KVM doesn't use it for L1, so we do not need to save/restore it - IIRC.
  
   So what

RE: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1

2013-07-04 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-07-02:
 On Tue, Jul 02, 2013 at 05:34:56PM +0200, Jan Kiszka wrote:
 On 2013-07-02 17:15, Gleb Natapov wrote:
 On Tue, Jul 02, 2013 at 04:28:56PM +0200, Jan Kiszka wrote:
 On 2013-07-02 15:59, Gleb Natapov wrote:
 On Tue, Jul 02, 2013 at 03:01:24AM +, Zhang, Yang Z wrote:
 Since this series is pending in mail list for long time. And
 it's really a big feature for Nested. Also, I doubt the
 original authors(Jun and Nahav)should not have enough time to continue 
 it.
 So I will pick it up. :)
 
 See comments below:
 
 Paolo Bonzini wrote on 2013-05-20:
 Il 19/05/2013 06:52, Jun Nakajima ha scritto:
 From: Nadav Har'El n...@il.ibm.com
 
 Recent KVM, since
 http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
 switch the EFER MSR when EPT is used and the host and guest have
 different NX bits. So if we add support for nested EPT (L1 guest
 using EPT to run L2) and want to be able to run recent KVM as L1,
 we need to allow L1 to use this EFER switching feature.
 
 To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER
 if available, and if it isn't, it uses the generic
 VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former
 (the latter is still unsupported).
 
 Nested entry and exit emulation (prepare_vmcs_02 and
 load_vmcs12_host_state, respectively) already handled
 VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all that's left to do
 in this patch is to properly advertise this feature to L1.
 
 Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by
 L0, by using vmx_set_efer (which itself sets one of several
 vmcs02 fields), so we always support this feature, regardless of
 whether the host supports it.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
 Signed-off-by: Xinhao Xu xinhao...@intel.com
 ---
  arch/x86/kvm/vmx.c | 23 ---
  1 file changed, 16 insertions(+), 7 deletions(-)
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index
 260a919..fb9cae5 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -2192,7 +2192,8 @@ static __init void
 nested_vmx_setup_ctls_msrs(void)  #else
nested_vmx_exit_ctls_high = 0;  #endif
 -  nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
 +  nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR
 | +  VM_EXIT_LOAD_IA32_EFER);
 
/* entry controls */
rdmsr(MSR_IA32_VMX_ENTRY_CTLS, @@ -2201,8 +2202,8
 @@ static
 __init void nested_vmx_setup_ctls_msrs(void)
nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
nested_vmx_entry_ctls_high =   VM_ENTRY_LOAD_IA32_PAT |
  VM_ENTRY_IA32E_MODE;
 -  nested_vmx_entry_ctls_high |=
 VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; -
 +  nested_vmx_entry_ctls_high |=
 (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | +
   
 VM_ENTRY_LOAD_IA32_EFER);
/* cpu-based controls */
rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
nested_vmx_procbased_ctls_low,
 nested_vmx_procbased_ctls_high); @@ -7492,10 +7493,18 @@ static
 void prepare_vmcs02(struct kvm_vcpu *vcpu,
 struct vmcs12 *vmcs12)
vcpu-arch.cr0_guest_owned_bits =
  ~vmcs12-cr0_guest_host_mask; vmcs_writel(CR0_GUEST_HOST_MASK,
 ~vcpu-arch.cr0_guest_owned_bits);
 
 -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
 -	vmcs_write32(VM_EXIT_CONTROLS,
 -		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
 -	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
 +	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
 +	 * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
 +	 * bits are further modified by vmx_set_efer() below.
 +	 */
 +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
 This is wrong. We cannot use L0 exit control directly.
 LOAD_PERF_GLOBAL_CTRL, LOAD_HOST_EFE, LOAD_HOST_PAT,
 ACK_INTR_ON_EXIT should use host's exit control. But others, still
 need use (vmcs12|host).
 
 I do not see why. We always intercept DR7/PAT/EFER, so save is
 emulated too. Host address space size always come from L0 and
 preemption timer is not supported for nested IIRC and when it
 will be host will have to save it on exit anyway for correct emulation.
 
 Preemption timer is already supported and works fine as far as I tested.
 KVM doesn't use it for L1, so we do not need to save/restore it - IIRC.
 
 So what happens if L1 configures it to value X, an L0 exit happens after
 X/2 ticks, and L0 gets back to L2 directly? The counter will be X again
 instead of X/2.
 
 Likely. Yes, we need to improve our emulation by setting Save
 VMX-preemption timer value or emulate this in software if the
 hardware lacks support for it (was this flag introduced after the
 preemption timer itself?).
 
 Not sure, but my point was that for correct emulation host needs to
 set save preempt timer

RE: IA32_FEATURE_CONTROL MSR in nested virt

2013-07-03 Thread Zhang, Yang Z

Il 03/07/2013 10:24, Arthur Chunqi Li ha scritto:
 Hi Gleb and Paolo,
 When I was writing test cases for nested virt, I found that reads/writes of
 IA32_FEATURE_CONTROL are simply ignored or return 0 (in
 arch/x86/kvm/vmx.c) in a VM. Checking this MSR is done by some
 hypervisors (e.g. NOVA) and may then cause errors, so it is necessary
 to behave correctly when it is read/written in a VM.
 
 Are there any difficulties to handle this MSR? I have two solutions.
 The first one is return the value of physical CPU's and always return 
 true when write. This is simple but may behave as if it is a VM 
 because write to it after VMXON will not return GP exception. This 
 solution can solve most basic problems since this MSR is not commonly 
 used. Another solution is adding a field in VCPU to handle this MSR.
 This is a complex but better method.
 
 I think I can complete this if needed.

Would it be enough to return 5 (binary 101) on reads if nested VMX is enabled,
and #GP on writes?
Agreed! It should be enough. You cannot return the physical value directly
since there is no SMX in nested.
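(A minimal sketch of that handling in vmx.c's MSR emulation; the
FEATURE_CONTROL_* names below are assumptions based on msr-index.h, and
value 5 is the lock bit plus "VMXON enabled outside SMX":)

	/* in the rdmsr path: */
	case MSR_IA32_FEATURE_CONTROL:
		if (!nested_vmx_allowed(vcpu))
			return 1;			/* #GP */
		*pdata = FEATURE_CONTROL_LOCKED |	/* bit 0 */
			FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX; /* bit 2 */
		break;

	/* in the wrmsr path: the MSR reads back as locked, so writes #GP */
	case MSR_IA32_FEATURE_CONTROL:
		return 1;				/* #GP */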

Best regards,
Yang


RE: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1

2013-07-01 Thread Zhang, Yang Z
Since this series has been pending on the mailing list for a long time, and
it's really a big feature for nested. Also, I suspect the original authors
(Jun and Nadav) may not have enough time to continue it. So I will pick it up. :)

See comments below:

Paolo Bonzini wrote on 2013-05-20:
 Il 19/05/2013 06:52, Jun Nakajima ha scritto:
  From: Nadav Har'El n...@il.ibm.com
 
  Recent KVM, since
 http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
  switch the EFER MSR when EPT is used and the host and guest have different
  NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
  and want to be able to run recent KVM as L1, we need to allow L1 to use this
  EFER switching feature.
 
  To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if
 available,
  and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
  support for the former (the latter is still unsupported).
 
  Nested entry and exit emulation (prepare_vmcs_02 and
 load_vmcs12_host_state,
  respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So
 all
  that's left to do in this patch is to properly advertise this feature to L1.
 
  Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by
 using
  vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
  support this feature, regardless of whether the host supports it.
 
  Signed-off-by: Nadav Har'El n...@il.ibm.com
  Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
  Signed-off-by: Xinhao Xu xinhao...@intel.com
  ---
   arch/x86/kvm/vmx.c | 23 ---
   1 file changed, 16 insertions(+), 7 deletions(-)
 
  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
  index 260a919..fb9cae5 100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -2192,7 +2192,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
   #else
  nested_vmx_exit_ctls_high = 0;
   #endif
  -   nested_vmx_exit_ctls_high |=
 VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
  +   nested_vmx_exit_ctls_high |=
 (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
  + VM_EXIT_LOAD_IA32_EFER);
 
  /* entry controls */
  rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
  @@ -2201,8 +2202,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
  nested_vmx_entry_ctls_low =
 VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
  nested_vmx_entry_ctls_high =
  VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
  -   nested_vmx_entry_ctls_high |=
 VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
  -
  +   nested_vmx_entry_ctls_high |=
 (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
  +  VM_ENTRY_LOAD_IA32_EFER);
  /* cpu-based controls */
  rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
  nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
  @@ -7492,10 +7493,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu,
 struct vmcs12 *vmcs12)
 	vcpu->arch.cr0_guest_owned_bits = ~vmcs12->cr0_guest_host_mask;
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 
 -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
 -	vmcs_write32(VM_EXIT_CONTROLS,
 -		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
 -	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
 +	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
 +	 * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
 +	 * bits are further modified by vmx_set_efer() below.
 +	 */
 +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
This is wrong. We cannot use the L0 exit controls directly.
LOAD_PERF_GLOBAL_CTRL, LOAD_HOST_EFER, LOAD_HOST_PAT and ACK_INTR_ON_EXIT
should use the host's exit controls. But the others still need to use
(vmcs12|host).

  +
 +	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
 +	 * emulated by vmx_set_efer(), below.
 
 VM_ENTRY_LOAD_IA32_EFER is not emulated by vmx_set_efer, so:
VM_ENTRY_LOAD_IA32_EFER is handled in setup_msrs(), and vmx_set_efer() already
calls it.

 
 /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE
  * are emulated below.  VM_ENTRY_IA32E_MODE is handled in
  * vmx_set_efer().  */
 
 Paolo
 
 +	 */
 +	vmcs_write32(VM_ENTRY_CONTROLS,
 +		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
 +		~VM_ENTRY_IA32E_MODE) |
 +		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
 
 	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
 
 

Best regards,
Yang




RE: [PATCH v10 7/7] KVM: VMX: Use posted interrupt to deliver virtual interrupt

2013-05-06 Thread Zhang, Yang Z
Yangminqiang wrote on 2013-05-03:
 Nakajima, Jun wrote on 2013-04-26:
 Subject: Re: [PATCH v10 7/7] KVM: VMX: Use posted interrupt to deliver 
 virtual
 interrupt
 
 On Fri, Apr 26, 2013 at 2:29 AM, Yangminqiang yangminqi...@huawei.com
 wrote:
 
 Ivytown or newer platform supported it.
 
 Ivytown? Do you mean Ivy Bridge?
 
 
 Ivy Town is the codename of Ivy Bridge-based servers.
 
 One more question: what is the relationship between x2APIC and APIC
 virtualization? Does APIC-v require x2APIC, or does APIC-v include x2APIC?
If you are using the x2APIC mode (MSR-based access) inside the guest and want
to benefit from the APIC virtualization technology, then you should set the
virtualize x2APIC mode bit in the secondary processor-based VM-execution
controls.
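(In VMCS terms this is roughly the following; a sketch using the control names
from arch/x86/include/asm/vmx.h:)

	/* MSR-based (x2APIC) virtualization; the MMIO-based xAPIC mode uses
	 * SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES instead */
	u32 sec = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
	sec &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
	sec |= SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, sec);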

Best regards,
Yang



RE: [PATCH v10 7/7] KVM: VMX: Use posted interrupt to deliver virtual interrupt

2013-04-25 Thread Zhang, Yang Z
Yangminqiang wrote on 2013-04-26:
 Hi Yang Zhang,
 
 Could you please let me know your CPU model, or the CPU models which
 support APIC-v, which your patches require? Then I could try your
 patches.
 
   Intel Software Developer's Manualm, Volume 3C,
   System Programming Guide, Part 3. Ch29,
   APIC VIRTUALIZATION AND VIRTUAL INTERRUPTS
 Or how can I know whether my hardware supports those features listed in the
 manual above?
Ivytown or newer platforms support it.
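(You can also probe the VMX capability MSRs directly: the high 32 bits of
IA32_VMX_PROCBASED_CTLS2 (MSR 0x48b) report which secondary controls may be set
to 1. A sketch - rdmsr() here stands for whatever MSR-read helper you have,
e.g. the rdmsr tool from msr-tools:)

	u64 val = rdmsr(MSR_IA32_VMX_PROCBASED_CTLS2);	/* MSR 0x48b */
	u32 allowed1 = val >> 32;	/* "allowed-1" settings */
	bool has_apic_reg_virt =
		allowed1 & SECONDARY_EXEC_APIC_REGISTER_VIRT;
	bool has_virt_intr_delivery =
		allowed1 & SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;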

 Thanks,
 Steven
 
 kvm-ow...@vger.kernel.org wrote on 2013-04-11:
 Subject: [PATCH v10 7/7] KVM: VMX: Use posted interrupt to deliver virtual
 interrupt
 
 From: Yang Zhang yang.z.zh...@intel.com
 
 If posted interrupt is avaliable, then uses it to inject virtual
 interrupt to guest.
 
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 ---
  arch/x86/kvm/lapic.c |   30 +++---
  arch/x86/kvm/vmx.c   |2 +-
  arch/x86/kvm/x86.c   |1 +
  3 files changed, 21 insertions(+), 12 deletions(-)
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index dbf74c9..e29883c 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -353,6 +353,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic)
 	if (!apic->irr_pending)
 		return -1;
 +	kvm_x86_ops->sync_pir_to_irr(apic->vcpu);
 	result = apic_search_irr(apic);
 	ASSERT(result == -1 || result >= 16);
 @@ -683,18 +684,25 @@ static int __apic_accept_irq(struct kvm_lapic *apic,
 int delivery_mode,
 	if (dest_map)
 		__set_bit(vcpu->vcpu_id, dest_map);
 -	result = !apic_test_and_set_irr(vector, apic);
 -	trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode,
 -			  trig_mode, vector, !result);
 -	if (!result) {
 -		if (trig_mode)
 -			apic_debug("level trig mode repeatedly for "
 -					"vector %d", vector);
 -		break;
 -	}
 +	if (kvm_x86_ops->deliver_posted_interrupt) {
 +		result = 1;
 +		kvm_x86_ops->deliver_posted_interrupt(vcpu, vector);
 +	} else {
 +		result = !apic_test_and_set_irr(vector, apic);
 
 -	kvm_make_request(KVM_REQ_EVENT, vcpu);
 -	kvm_vcpu_kick(vcpu);
 +		if (!result) {
 +			if (trig_mode)
 +				apic_debug("level trig mode repeatedly "
 +					"for vector %d", vector);
 +			goto out;
 +		}
 +
 +		kvm_make_request(KVM_REQ_EVENT, vcpu);
 +		kvm_vcpu_kick(vcpu);
 +	}
 +out:
 +	trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode,
 +			trig_mode, vector, !result);
 	break;
  
  case APIC_DM_REMRD:
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 314b2ed..52b21da 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -84,7 +84,7 @@ module_param(vmm_exclusive, bool, S_IRUGO);
  static bool __read_mostly fasteoi = 1;
  module_param(fasteoi, bool, S_IRUGO);
 -static bool __read_mostly enable_apicv;
 +static bool __read_mostly enable_apicv = 1;
  module_param(enable_apicv, bool, S_IRUGO);
  
  /*
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 6147d24..628582f 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -2685,6 +2685,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
  static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu,
 				    struct kvm_lapic_state *s)
  {
 +	kvm_x86_ops->sync_pir_to_irr(vcpu);
 	memcpy(s->regs, vcpu->arch.apic->regs, sizeof *s);
  
  return 0;
 --
 1.7.1
 


Best regards,
Yang




RE: [patch] x86, kvm: fix build failure with CONFIG_SMP disabled

2013-04-17 Thread Zhang, Yang Z
David Rientjes wrote on 2013-04-18:
 On Wed, 17 Apr 2013, Randy Dunlap wrote:
 
 On 04/17/13 16:12, David Rientjes wrote:
 The build fails when CONFIG_SMP is disabled:
 
 arch/x86/kvm/vmx.c: In function 'vmx_deliver_posted_interrupt':
 arch/x86/kvm/vmx.c:3950:3: error: 'apic' undeclared (first use in
 this function)
 
 Fix it by including the necessary header.
 
 Sorry, i386 build still fails with the same error message plus this one:
 
 ERROR: apic [arch/x86/kvm/kvm-intel.ko] undefined!
 
 
 Ahh, that's because you don't have CONFIG_X86_LOCAL_APIC as you already
 mentioned.  So it looks like this error can manifest in two different ways
 and we got different reports.
 
 This failure came from KVM: VMX: Add the deliver posted interrupt
 algorithm, so adding Yang to the cc to specify the dependency this has on
 apic and how it can be protected without CONFIG_X86_LOCAL_APIC on i386.
How about the following patch?

commit a49dd819f502c1029c5a857e87201ef25ec06ce6
Author: Yang Zhang yang.z.zh...@intel.com
Date:   Wed Apr 17 05:34:07 2013 -0400

KVM: x86: Don't send posted interrupts if CONFIG_SMP is not set

In UP, the posted interrupt logic will not work. So we should not send
posted interrupts, and instead let the vcpu pick up the pending interrupt
before vmentry.

Signed-off-by: Yang Zhang yang.z.zh...@intel.com

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 52b21da..d5c6b95 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3946,10 +3946,12 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 
 	r = pi_test_and_set_on(&vmx->pi_desc);
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
+#ifdef CONFIG_SMP
 	if (!r && (vcpu->mode == IN_GUEST_MODE))
 		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
 				POSTED_INTR_VECTOR);
 	else
+#endif
 		kvm_vcpu_kick(vcpu);
 }

Best regards,
Yang




RE: [patch] x86, kvm: fix build failure with CONFIG_SMP disabled

2013-04-17 Thread Zhang, Yang Z
Randy Dunlap wrote on 2013-04-18:
 On 04/17/13 17:35, Zhang, Yang Z wrote:
 David Rientjes wrote on 2013-04-18:
 On Wed, 17 Apr 2013, Randy Dunlap wrote:
 
 On 04/17/13 16:12, David Rientjes wrote:
 The build fails when CONFIG_SMP is disabled:
 
   arch/x86/kvm/vmx.c: In function 'vmx_deliver_posted_interrupt':
   arch/x86/kvm/vmx.c:3950:3: error: 'apic' undeclared (first use in
 this function)
 
 Fix it by including the necessary header.
 
 Sorry, i386 build still fails with the same error message plus this one:
 
 ERROR: apic [arch/x86/kvm/kvm-intel.ko] undefined!
 
 
 Ahh, that's because you don't have CONFIG_X86_LOCAL_APIC as you already
 mentioned.  So it looks like this error can manifest in two different ways
 and we got different reports.
 
 This failure came from KVM: VMX: Add the deliver posted interrupt
 algorithm, so adding Yang to the cc to specify the dependency this has on
 apic and how it can be protected without CONFIG_X86_LOCAL_APIC on i386.
 How about the follow patch?
 
 commit a49dd819f502c1029c5a857e87201ef25ec06ce6
 Author: Yang Zhang yang.z.zh...@intel.com
 Date:   Wed Apr 17 05:34:07 2013 -0400
 
 KVM: x86: Don't sending posted interrupt if not config CONFIG_SMP
 
 In UP, posted interrupt logic will not work. So we should not send
 posted interrupt and let vcpu to pick the pending interrupt before
 vmentry.
 
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 
 Missing Reported-by: and the patch does not apply cleanly (looks like
 lots of spaces instead of tabs in it)... but it does build now after
 massaging the patch.
Thanks.
I just copied it to you for quick testing. I will resend a formal patch.

 Thanks.
 
 Acked-by: Randy Dunlap rdun...@infradead.org
 
 
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 52b21da..d5c6b95 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -3946,10 +3946,12 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 
 	r = pi_test_and_set_on(&vmx->pi_desc);
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 +#ifdef CONFIG_SMP
 	if (!r && (vcpu->mode == IN_GUEST_MODE))
 		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
 				POSTED_INTR_VECTOR);
 	else
 +#endif
 		kvm_vcpu_kick(vcpu);
  }
 Best regards,
 Yang
 
 
 
 
 --
 ~Randy


Best regards,
Yang




RE: [PATCH v9 0/7] KVM: VMX: Add Posted Interrupt supporting

2013-04-11 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-04-11:
 On Thu, Apr 11, 2013 at 01:03:30AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-04-10:
 On Wed, Apr 10, 2013 at 09:22:50PM +0800, Yang Zhang wrote:
 From: Yang Zhang yang.z.zh...@intel.com
 
 The following patches add Posted Interrupt support to KVM:
 The first patch enables the feature 'acknowledge interrupt on vmexit'. Since
 it is required by posted interrupts, we need to enable it first.
 
 And the subsequent patches are adding the posted interrupt supporting:
 Posted Interrupt allows APIC interrupts to inject into guest directly
 without any vmexit.
 
 - When delivering a interrupt to guest, if target vcpu is running,
   update Posted-interrupt requests bitmap and send a notification
   event to the vcpu. Then the vcpu will handle this interrupt
   automatically, without any software involvemnt. - If target vcpu is
   not running or there already a notification event pending in the
   vcpu, do nothing. The interrupt will be handled by next vm entry
 Changes from v8 to v9:
 * Add tracing in PI case when deliver interrupt.
 * Scan ioapic when updating SPIV register.
 Do not see it at the patch series. Have I missed it?
 The change is in forth patch:
 
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 6796218..4ccdc94 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -134,11 +134,7 @@ static inline void apic_set_spiv(struct kvm_lapic *apic,
 u32 val)
  static_key_slow_inc(apic_sw_disabled.key);
  }
  apic_set_reg(apic, APIC_SPIV, val);
 -}
 -
 -static inline int apic_enabled(struct kvm_lapic *apic)
 -{
 -	return kvm_apic_sw_enabled(apic) && kvm_apic_hw_enabled(apic);
 +	kvm_make_request(KVM_REQ_SCAN_IOAPIC, apic->vcpu);
  }
 OK, see it now. Thanks.
 
 As you mentioned, vcpu_scan_ioapic() calls apic_enabled() to check whether
 the APIC is enabled, so we must make sure the ioapic is rescanned when the
 APIC state changes.
 And I found that recalculate_apic_map() doesn't track enabling/disabling the
 APIC via the software approach, so make_scan_ioapic_request() in
 recalculate_apic_map() is not enough.
 We should also force an ioapic rescan when the APIC state is changed via the
 software approach (an SPIV register update).
 
 10.4.7.2 Local APIC State After It Has Been Software Disabled says:
 
   Pending interrupts in the IRR and ISR registers are held and require
   masking or handling by the CPU.
 My understanding is that we should treat software disabled APIC as a
 valid target for an interrupt. vcpu_scan_ioapic() should check
 kvm_apic_hw_enabled() only.
Indeed. kvm_apic_hw_enabled() is the right one.
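(I.e., in the scan path, roughly - a sketch only:)

	/* a software-disabled APIC still latches interrupts into IRR/ISR,
	 * so only a hardware-disabled APIC is an invalid target here */
	if (!kvm_apic_hw_enabled(vcpu->arch.apic))
		return;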

Best regards,
Yang




RE: [PATCH v9 7/7] KVM: Use eoi to track RTC interrupt delivery status

2013-04-11 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-04-11:
 On Wed, Apr 10, 2013 at 09:22:20PM +0800, Yang Zhang wrote:
 From: Yang Zhang yang.z.zh...@intel.com
 
 The current interrupt coalescing logic, which is only used by the RTC,
 conflicts with posted interrupts.
 This patch introduces a new mechanism that uses EOI to track interrupts:
 when delivering an interrupt to a vcpu, pending_eoi is set to the number of
 vcpus that received the interrupt, and it is decremented as each vcpu
 writes EOI. No subsequent RTC interrupt can be delivered to a vcpu until
 all vcpus have written EOI.
 
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 ---
  virt/kvm/ioapic.c |   39 ++-
  1 files changed, 38 insertions(+), 1 deletions(-)
 diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
 index a49fcd5..aeac154 100644
 --- a/virt/kvm/ioapic.c
 +++ b/virt/kvm/ioapic.c
 @@ -147,6 +147,26 @@ static void kvm_rtc_eoi_tracking_restore_all(struct
 kvm_ioapic *ioapic)
  __rtc_irq_eoi_tracking_restore_one(vcpu);
  }
 +static void rtc_irq_eoi(struct kvm_ioapic *ioapic, struct kvm_vcpu *vcpu)
 +{
 +	if (test_and_clear_bit(vcpu->vcpu_id, ioapic->rtc_status.dest_map))
 +		--ioapic->rtc_status.pending_eoi;
 +
 +	WARN_ON(ioapic->rtc_status.pending_eoi < 0);
 +}
 +
 +static bool rtc_irq_check_coalesced(struct kvm_ioapic *ioapic, int irq,
 +bool line_status)
 +{
 +if (irq != RTC_GSI || !line_status)
 +return false;
 Please move the check from rtc_irq_check_coalesced() to
 kvm_ioapic_set_irq() like this: if (irq == RTC_GSI  line_status 
 rtc_irq_check_coalesced(ioapic, irq, line_status)) 
 
 I was going to fix it myself while applying, but since there will be
 new posted interrupt series anyway you can as well fix this one too.
You mean fix it and send it out with the posted interrupt series? Or just rebase
the posted interrupt series on top of this fix, without sending it out again?

 
 +
 +	if (ioapic->rtc_status.pending_eoi > 0)
 +return true; /* coalesced */
 +
 +return false;
 +}
 +
  static int ioapic_service(struct kvm_ioapic *ioapic, unsigned int idx,
  bool line_status)
  {
 @@ -260,6 +280,7 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int 
 irq,
 bool line_status)
  {
 union kvm_ioapic_redirect_entry *entry = &ioapic->redirtbl[irq];
  struct kvm_lapic_irq irqe;
 +int ret;
 
 ioapic_debug("dest=%x dest_mode=%x delivery_mode=%x "
  "vector=%x trig_mode=%x\n",
 @@ -275,7 +296,15 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int
 irq, bool line_status)
  irqe.level = 1;
  irqe.shorthand = 0;
 -return kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe, NULL);
 +if (irq == RTC_GSI && line_status) {
 +BUG_ON(ioapic->rtc_status.pending_eoi != 0);
 +ret = kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe,
 +ioapic->rtc_status.dest_map);
 +ioapic->rtc_status.pending_eoi = ret;
 +} else
 +ret = kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe, NULL);
 +
 +return ret;
  }
  
  int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int 
 irq_source_id,
 @@ -299,6 +328,11 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int
 irq, int irq_source_id,
  ret = 1;
  } else {
  int edge = (entry.fields.trig_mode == IOAPIC_EDGE_TRIG);
 +
 +if (rtc_irq_check_coalesced(ioapic, irq, line_status)) {
 +ret = 0; /* coalesced */
 +goto out;
 +}
 ioapic->irr |= mask;
 if ((edge && old_irr != ioapic->irr) ||
 (!edge && !entry.fields.remote_irr))
 @@ -306,6 +340,7 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int 
 irq,
 int irq_source_id,
 else
 ret = 0; /* report coalesced interrupt */
 }
 +out:
  trace_kvm_ioapic_set_irq(entry.bits, irq, ret == 0);
 spin_unlock(&ioapic->lock);
 @@ -333,6 +368,8 @@ static void __kvm_ioapic_update_eoi(struct kvm_vcpu
 *vcpu,
 if (ent->fields.vector != vector)
  continue;
 +if (i == RTC_GSI)
 +rtc_irq_eoi(ioapic, vcpu);
  /*
   * We are dropping lock while calling ack notifiers because ack
   * notifier callbacks for assigned devices call into IOAPIC
 --
 1.7.1
 
 --
   Gleb.


Best regards,
Yang




RE: [PATCH v9 7/7] KVM: Use eoi to track RTC interrupt delivery status

2013-04-11 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-04-11:
 On Thu, Apr 11, 2013 at 07:54:01AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-04-11:
 On Wed, Apr 10, 2013 at 09:22:20PM +0800, Yang Zhang wrote:
 From: Yang Zhang yang.z.zh...@intel.com
 
 The current interrupt coalescing logic, which is only used by the RTC,
 conflicts with Posted Interrupt.
 This patch introduces a new mechanism that uses eoi to track interrupts:
 when delivering an interrupt to a vcpu, pending_eoi is set to the number
 of vcpus that received the interrupt, and it is decremented as each vcpu
 writes eoi. No subsequent RTC interrupt can be delivered to a vcpu until
 all vcpus have written eoi.
 
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 ---
 virt/kvm/ioapic.c |   39 ++-
 1 files changed, 38 insertions(+), 1 deletions(-)
 diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
 index a49fcd5..aeac154 100644
 --- a/virt/kvm/ioapic.c
 +++ b/virt/kvm/ioapic.c
 @@ -147,6 +147,26 @@ static void kvm_rtc_eoi_tracking_restore_all(struct
 kvm_ioapic *ioapic)
__rtc_irq_eoi_tracking_restore_one(vcpu);
  }
 +static void rtc_irq_eoi(struct kvm_ioapic *ioapic, struct kvm_vcpu *vcpu)
 +{
 +  if (test_and_clear_bit(vcpu->vcpu_id, ioapic->rtc_status.dest_map))
 +  --ioapic->rtc_status.pending_eoi;
 +
 +  WARN_ON(ioapic->rtc_status.pending_eoi < 0);
 +}
 +
 +static bool rtc_irq_check_coalesced(struct kvm_ioapic *ioapic, int irq,
 +  bool line_status)
 +{
 +  if (irq != RTC_GSI || !line_status)
 +  return false;
 Please move the check from rtc_irq_check_coalesced() to
 kvm_ioapic_set_irq() like this: if (irq == RTC_GSI && line_status &&
 rtc_irq_check_coalesced(ioapic, irq, line_status)) 
 
 I was going to fix it myself while applying, but since there will be
 new posted interrupt series anyway you can as well fix this one too.
 You mean fix it and send it out with the posted interrupt series? Or just
 rebase the posted interrupt series on top of this fix without sending it
 out again?
 
 Send both series. RTC one with this change.
Sure.

Best regards,
Yang




RE: [PATCH v9 7/7] KVM: VMX: Use posted interrupt to deliver virtual interrupt

2013-04-10 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-04-10:
 On Wed, Apr 10, 2013 at 09:22:57PM +0800, Yang Zhang wrote:
 From: Yang Zhang yang.z.zh...@intel.com
 
 If posted interrupts are available, then use them to inject virtual
 interrupts into the guest.
 
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 ---
  arch/x86/kvm/lapic.c |   32 +---
  arch/x86/kvm/vmx.c   |2 +-
  arch/x86/kvm/x86.c   |1 +
  3 files changed, 23 insertions(+), 12 deletions(-)
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 42a87ac..4fdb984 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -349,6 +349,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic
 *apic)
  if (!apic-irr_pending)
  return -1;
 +kvm_x86_ops->sync_pir_to_irr(apic->vcpu);
  result = apic_search_irr(apic);
 ASSERT(result == -1 || result >= 16);
 @@ -679,18 +680,27 @@ static int __apic_accept_irq(struct kvm_lapic *apic, 
 int
 delivery_mode,
  if (dest_map)
 __set_bit(vcpu->vcpu_id, dest_map);
 -result = !apic_test_and_set_irr(vector, apic);
 -trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode,
 -  trig_mode, vector, !result);
 -if (!result) {
 -if (trig_mode)
 -apic_debug("level trig mode repeatedly for "
 -"vector %d", vector);
 -break;
 -}
 +if (kvm_x86_ops->deliver_posted_interrupt) {
 +result = 1;
 +kvm_x86_ops->deliver_posted_interrupt(vcpu, vector);
 +} else {
 +result = !apic_test_and_set_irr(vector, apic);
 +
 +trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode,
 +trig_mode, vector, !result);
 +if (!result) {
 +if (trig_mode)
 +apic_debug("level trig mode repeatedly "
 +"for vector %d", vector);
 +goto out;
 +}
 
 -kvm_make_request(KVM_REQ_EVENT, vcpu);
 -kvm_vcpu_kick(vcpu);
 +kvm_make_request(KVM_REQ_EVENT, vcpu);
 +kvm_vcpu_kick(vcpu);
 +}
 +out:
 +trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode,
 +trig_mode, vector, !result);
 Sigh, now you trace it twice.
Yes. Forgot to remove the old one. :(

 
  break;
  
  case APIC_DM_REMRD:
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 314b2ed..52b21da 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -84,7 +84,7 @@ module_param(vmm_exclusive, bool, S_IRUGO);
  static bool __read_mostly fasteoi = 1;
  module_param(fasteoi, bool, S_IRUGO);
 -static bool __read_mostly enable_apicv;
 +static bool __read_mostly enable_apicv = 1;
  module_param(enable_apicv, bool, S_IRUGO);
  
  /*
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 72be079..486f627 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -2685,6 +2685,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
  static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu,
  				    struct kvm_lapic_state *s)
  {
 +	kvm_x86_ops->sync_pir_to_irr(vcpu);
  	memcpy(s->regs, vcpu->arch.apic->regs, sizeof *s);
  
  return 0;
 --
 1.7.1
 
 --
   Gleb.


Best regards,
Yang



RE: [PATCH v9 0/7] KVM: VMX: Add Posted Interrupt supporting

2013-04-10 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-04-10:
 On Wed, Apr 10, 2013 at 09:22:50PM +0800, Yang Zhang wrote:
 From: Yang Zhang yang.z.zh...@intel.com
 
 The follwoing patches are adding the Posted Interrupt supporting to KVM:
 The first patch enables the feature 'acknowledge interrupt on vmexit'.Since
 it is required by Posted interrupt, we need to enable it firstly.
 
 And the subsequent patches are adding the posted interrupt supporting:
 Posted Interrupt allows APIC interrupts to inject into guest directly
 without any vmexit.
 
 - When delivering a interrupt to guest, if target vcpu is running,
   update Posted-interrupt requests bitmap and send a notification event
   to the vcpu. Then the vcpu will handle this interrupt automatically,
   without any software involvemnt.
 - If target vcpu is not running or there already a notification event
   pending in the vcpu, do nothing. The interrupt will be handled by
   next vm entry
 Changes from v8 to v9:
 * Add tracing in the PI case when delivering an interrupt.
 * Scan ioapic when updating SPIV register.
 I do not see it in the patch series. Have I missed it?
The change is in forth patch:

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 6796218..4ccdc94 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -134,11 +134,7 @@ static inline void apic_set_spiv(struct kvm_lapic *apic, 
u32 val)
 static_key_slow_inc(&apic_sw_disabled.key);
}
apic_set_reg(apic, APIC_SPIV, val);
-}
-
-static inline int apic_enabled(struct kvm_lapic *apic)
-{
-   return kvm_apic_sw_enabled(apic) && kvm_apic_hw_enabled(apic);
+   kvm_make_request(KVM_REQ_SCAN_IOAPIC, apic->vcpu);
 }

As you mentioned, since vcpu_scan_ioapic() calls apic_enabled() to check whether
the apic is enabled, we must make sure the ioapic is rescanned whenever the apic
state changes.
And I found that recalculate_apic_map() doesn't track enabling/disabling of the
apic by the software approach, so making the scan-ioapic request in
recalculate_apic_map() is not enough.
We should also force an ioapic rescan when the apic state is changed via the
software approach (updating the SPIV register).

 
 * Rebase on top of KVM upstream + RTC eoi tracking patch.
 
 Changes from v7 to v8:
 * Remove unused memeber 'on' from struct pi_desc.
 * Register a dummy function to sync_pir_to_irr if apicv is disabled.
 * Minor fixup.
 * Rebase on top of KVM upstream + RTC eoi tracking patch.
 
 Yang Zhang (7):
   KVM: VMX: Enable acknowledge interupt on vmexit
   KVM: VMX: Register a new IPI for posted interrupt
   KVM: VMX: Check the posted interrupt capability
   KVM: Call common update function when ioapic entry changed.
   KVM: Set TMR when programming ioapic entry
   KVM: VMX: Add the algorithm of deliver posted interrupt
   KVM: VMX: Use posted interrupt to deliver virtual interrupt
  arch/ia64/kvm/lapic.h              |    6 -
  arch/x86/include/asm/entry_arch.h  |    4 +
  arch/x86/include/asm/hardirq.h     |    3 +
  arch/x86/include/asm/hw_irq.h      |    1 +
  arch/x86/include/asm/irq_vectors.h |    5 +
  arch/x86/include/asm/kvm_host.h    |    3 +
  arch/x86/include/asm/vmx.h         |    4 +
  arch/x86/kernel/entry_64.S         |    5 +
  arch/x86/kernel/irq.c              |   22 
  arch/x86/kernel/irqinit.c          |    4 +
  arch/x86/kvm/lapic.c               |   66 
  arch/x86/kvm/lapic.h               |    7 ++
  arch/x86/kvm/svm.c                 |   12 ++
  arch/x86/kvm/vmx.c                 |  207 +++-
  arch/x86/kvm/x86.c                 |   19 +++-
  include/linux/kvm_host.h           |    4 +-
  virt/kvm/ioapic.c                  |   32 --
  virt/kvm/ioapic.h                  |    7 +-
  virt/kvm/irq_comm.c                |    4 +-
  virt/kvm/kvm_main.c                |    5 +-
  20 files changed, 341 insertions(+), 79 deletions(-)
 
 --
   Gleb.


Best regards,
Yang




RE: [PATCH v8 4/7] KVM: Add reset/restore rtc_status support

2013-04-09 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-04-09:
 On Mon, Apr 08, 2013 at 10:17:46PM +0800, Yang Zhang wrote:
 From: Yang Zhang yang.z.zh...@intel.com
 
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 ---
  arch/x86/kvm/lapic.c |    9 +++
  arch/x86/kvm/lapic.h |    2 +
  virt/kvm/ioapic.c    |   60 ++
  virt/kvm/ioapic.h    |    1 +
  4 files changed, 72 insertions(+), 0 deletions(-)
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 0b73402..6796218 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -94,6 +94,14 @@ static inline int apic_test_vector(int vec, void *bitmap)
  return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
  }
 +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector)
 +{
 +struct kvm_lapic *apic = vcpu->arch.apic;
 +
 +return apic_test_vector(vector, apic->regs + APIC_ISR) ||
 +apic_test_vector(vector, apic->regs + APIC_IRR);
 +}
 +
  static inline void apic_set_vector(int vec, void *bitmap)
  {
  set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
 @@ -1618,6 +1626,7 @@ void kvm_apic_post_state_restore(struct kvm_vcpu
 *vcpu,
  apic->highest_isr_cache = -1;
  kvm_x86_ops->hwapic_isr_update(vcpu->kvm, apic_find_highest_isr(apic));
  kvm_make_request(KVM_REQ_EVENT, vcpu);
 +kvm_rtc_eoi_tracking_restore_one(vcpu);
  }
  
  void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
 diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
 index 3e5a431..16304b1 100644
 --- a/arch/x86/kvm/lapic.h
 +++ b/arch/x86/kvm/lapic.h
 @@ -166,4 +166,6 @@ static inline bool kvm_apic_has_events(struct
 kvm_vcpu *vcpu)
 return vcpu->arch.apic->pending_events;
  }
 +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);
 +
  #endif
 diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
 index 27ae8dd..4699180 100644
 --- a/virt/kvm/ioapic.c
 +++ b/virt/kvm/ioapic.c
 @@ -90,6 +90,64 @@ static unsigned long ioapic_read_indirect(struct
 kvm_ioapic *ioapic,
  return result;
  }
 +static void rtc_irq_reset(struct kvm_ioapic *ioapic)
 rtc_irq_eoi_tracking_reset()
Sure. 

 
 +{
 +ioapic->rtc_status.pending_eoi = 0;
 +bitmap_zero(ioapic->rtc_status.dest_map, KVM_MAX_VCPUS);
 +}
 +
 +static void __rtc_irq_eoi_tracking_restore_one(struct kvm_vcpu *vcpu,
 +int vector)
 +{
 +bool new_val, old_val;
 +struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
 +union kvm_ioapic_redirect_entry *e;
 +
 +e = &ioapic->redirtbl[RTC_GSI];
 +if (!kvm_apic_match_dest(vcpu, NULL, 0, e->fields.dest_id,
 +e->fields.dest_mode))
 +return;
 +
 +new_val = kvm_apic_pending_eoi(vcpu, vector);
 +old_val = test_bit(vcpu->vcpu_id, ioapic->rtc_status.dest_map);
 +
 +if (new_val == old_val)
 +return;
 +
 +if (new_val) {
 +__set_bit(vcpu->vcpu_id, ioapic->rtc_status.dest_map);
 +ioapic->rtc_status.pending_eoi++;
 +} else {
 +__clear_bit(vcpu->vcpu_id, ioapic->rtc_status.dest_map);
 +ioapic->rtc_status.pending_eoi--;
 +}
 WARN_ON(ioapic->rtc_status.pending_eoi < 0);
Sure.

 
 +}
 +
 +void kvm_rtc_eoi_tracking_restore_one(struct kvm_vcpu *vcpu)
 +{
 +struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
 +int vector;
 +
 +vector = ioapic->redirtbl[RTC_GSI].fields.vector;
 Do not access ioapic outside of the lock. Also, since you access
 ioapic->redirtbl[RTC_GSI] in __rtc_irq_eoi_tracking_restore_one()
 anyway, what's the point of passing vector to it?
Right. 
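
For reference, the shape of the fix (matching the v9 hunk quoted earlier in
this archive, where __rtc_irq_eoi_tracking_restore_one() takes only the vcpu);
a sketch, not the literal patch:

	void kvm_rtc_eoi_tracking_restore_one(struct kvm_vcpu *vcpu)
	{
		struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;

		spin_lock(&ioapic->lock);
		/* the helper reads redirtbl[RTC_GSI].fields.vector itself,
		 * so the ioapic is never accessed outside the lock */
		__rtc_irq_eoi_tracking_restore_one(vcpu);
		spin_unlock(&ioapic->lock);
	}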

 
 +spin_lock(&ioapic->lock);
 +__rtc_irq_eoi_tracking_restore_one(vcpu, vector);
 +spin_unlock(&ioapic->lock);
 +}
 +
 +static void kvm_rtc_eoi_tracking_restore_all(struct kvm_ioapic *ioapic)
 +{
 +struct kvm_vcpu *vcpu;
 +int i, vector;
 +
 +if (RTC_GSI >= IOAPIC_NUM_PINS)
 +return;
 +
 +rtc_irq_reset(ioapic);
 +vector = ioapic->redirtbl[RTC_GSI].fields.vector;
 +kvm_for_each_vcpu(i, vcpu, ioapic->kvm)
 +__rtc_irq_eoi_tracking_restore_one(vcpu, vector);
 +}
 +
  static int ioapic_service(struct kvm_ioapic *ioapic, unsigned int idx)
  {
  	union kvm_ioapic_redirect_entry *pent;
 @@ -428,6 +486,7 @@ void kvm_ioapic_reset(struct kvm_ioapic *ioapic)
  	ioapic->ioregsel = 0;
  	ioapic->irr = 0;
  	ioapic->id = 0;
 +	rtc_irq_reset(ioapic);
  	update_handled_vectors(ioapic);
  }
 @@ -494,6 +553,7 @@ int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state)
  	memcpy(ioapic, state, sizeof(struct kvm_ioapic_state));
  	update_handled_vectors(ioapic);
  	kvm_ioapic_make_eoibitmap_request(kvm);
 +	kvm_rtc_eoi_tracking_restore_all(ioapic);
  	spin_unlock(&ioapic->lock);
  	return 0;
  }
 diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
 index 761e5b5..313fc4e 100644
 --- a/virt/kvm/ioapic.h
 +++ b/virt/kvm/ioapic.h
 @@ -79,6 +79,7 @@ static inline struct kvm_ioapic *ioapic_irqchip(struct kvm
 *kvm)
  return 

RE: [PATCH v8 7/7] KVM: VMX: Use posted interrupt to deliver virtual interrupt

2013-04-09 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-04-09:
 On Mon, Apr 08, 2013 at 10:23:22PM +0800, Yang Zhang wrote:
 From: Yang Zhang yang.z.zh...@intel.com
 
 If posted interrupts are available, then use them to inject virtual
 interrupts into the guest.
 
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 ---
  arch/x86/kvm/lapic.c |   29 ++---
  arch/x86/kvm/vmx.c   |2 +-
  arch/x86/kvm/x86.c   |1 +
  3 files changed, 20 insertions(+), 12 deletions(-)
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 8948979..46a4cca 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -353,6 +353,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic
 *apic)
 if (!apic->irr_pending)
  return -1;
 +kvm_x86_ops->sync_pir_to_irr(apic->vcpu);
  result = apic_search_irr(apic);
 ASSERT(result == -1 || result >= 16);
 @@ -683,18 +684,24 @@ static int __apic_accept_irq(struct kvm_lapic *apic, 
 int
 delivery_mode,
  if (dest_map)
 __set_bit(vcpu->vcpu_id, dest_map);
 -result = !apic_test_and_set_irr(vector, apic);
 -trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode,
 -  trig_mode, vector, !result);
 -if (!result) {
 -if (trig_mode)
 -apic_debug("level trig mode repeatedly for "
 -"vector %d", vector);
 -break;
 -}
 +if (kvm_x86_ops->deliver_posted_interrupt) {
 +result = 1;
 +kvm_x86_ops->deliver_posted_interrupt(vcpu, vector);
 +} else {
 +result = !apic_test_and_set_irr(vector, apic);
 +
 +trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode,
 +trig_mode, vector, !result);
 Missed that in previous review. Do not drop tracing for the PI case.
Hmm. I remember I added the tracing for the PI case. I don't know why it is not
in this patch. Anyway, I will add it again.

 
 +if (!result) {
 +if (trig_mode)
 +apic_debug("level trig mode repeatedly "
 +"for vector %d", vector);
 +break;
 +}
 
 -kvm_make_request(KVM_REQ_EVENT, vcpu);
 -kvm_vcpu_kick(vcpu);
 +kvm_make_request(KVM_REQ_EVENT, vcpu);
 +kvm_vcpu_kick(vcpu);
 +}
  break;
  
  case APIC_DM_REMRD:
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 3de2d7f..cd1c6ff 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -84,7 +84,7 @@ module_param(vmm_exclusive, bool, S_IRUGO);
  static bool __read_mostly fasteoi = 1;
  module_param(fasteoi, bool, S_IRUGO);
 -static bool __read_mostly enable_apicv;
 +static bool __read_mostly enable_apicv = 1;
  module_param(enable_apicv, bool, S_IRUGO);
  
  /*
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 72be079..486f627 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -2685,6 +2685,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
  static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu,
  				    struct kvm_lapic_state *s)
  {
 +	kvm_x86_ops->sync_pir_to_irr(vcpu);
  	memcpy(s->regs, vcpu->arch.apic->regs, sizeof *s);
  
  return 0;
 --
 1.7.1
 
 --
   Gleb.


Best regards,
Yang




RE: [PATCH v8 6/7] KVM: VMX: Add the algorithm of deliver posted interrupt

2013-04-09 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2013-04-10:
 On 08/04/2013 16:23, Yang Zhang wrote:
 + * interrupt from PIR in next vmentry.
 + */
 +static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 +{
 +struct vcpu_vmx *vmx = to_vmx(vcpu);
 +int r;
 +
 +if (pi_test_and_set_pir(vector, &vmx->pi_desc))
 +return;
 +
 +r = pi_test_and_set_on(&vmx->pi_desc);
 +kvm_make_request(KVM_REQ_EVENT, vcpu);
 +if (!r && (vcpu->mode == IN_GUEST_MODE))
 +apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
 +POSTED_INTR_VECTOR);
 +else
 +kvm_vcpu_kick(vcpu);
 +
 +return;
 +}
 
 No need for this return.
Right. Will remove it in next version.

Best regards,
Yang




RE: [PATCH v7 4/7] KVM: Add reset/restore rtc_status support

2013-04-08 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-04-07:
 On Sun, Apr 07, 2013 at 01:16:51PM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-04-07:
 On Sun, Apr 07, 2013 at 01:05:02PM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-04-07:
 On Sun, Apr 07, 2013 at 12:39:32PM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-04-07:
 On Sun, Apr 07, 2013 at 02:30:15AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-04-04:
 On Mon, Apr 01, 2013 at 08:40:13AM +0800, Yang Zhang wrote:
 From: Yang Zhang yang.z.zh...@intel.com
 
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 ---
  arch/x86/kvm/lapic.c |    9 +
  arch/x86/kvm/lapic.h |    2 ++
  virt/kvm/ioapic.c    |   43 +++
  virt/kvm/ioapic.h    |    1 +
  4 files changed, 55 insertions(+), 0 deletions(-)
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 96ab160..9c041fa 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -94,6 +94,14 @@ static inline int apic_test_vector(int vec, void
 *bitmap)
  return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
  }
 +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector)
 +{
 +struct kvm_lapic *apic = vcpu->arch.apic;
 +
 +return apic_test_vector(vector, apic->regs + APIC_ISR) ||
 +apic_test_vector(vector, apic->regs + APIC_IRR);
 +}
 +
  static inline void apic_set_vector(int vec, void *bitmap)
  {
  set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
 @@ -1665,6 +1673,7 @@ void kvm_apic_post_state_restore(struct
 kvm_vcpu
 *vcpu,
  apic->highest_isr_cache = -1;
  kvm_x86_ops->hwapic_isr_update(vcpu->kvm, apic_find_highest_isr(apic));
  kvm_make_request(KVM_REQ_EVENT, vcpu);
 +kvm_rtc_irq_restore(vcpu);
  }
  
  void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
 diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
 index 967519c..004d2ad 100644
 --- a/arch/x86/kvm/lapic.h
 +++ b/arch/x86/kvm/lapic.h
 @@ -170,4 +170,6 @@ static inline bool
 kvm_apic_has_events(struct
 kvm_vcpu *vcpu)
 return vcpu->arch.apic->pending_events;
  }
 +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);
 +
  #endif
 diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
 index 8664812..0b12b17 100644
 --- a/virt/kvm/ioapic.c
 +++ b/virt/kvm/ioapic.c
 @@ -90,6 +90,47 @@ static unsigned long
 ioapic_read_indirect(struct
 kvm_ioapic *ioapic,
  return result;
  }
 +static void rtc_irq_reset(struct kvm_ioapic *ioapic)
 +{
 +ioapic->rtc_status.pending_eoi = 0;
 +bitmap_zero(ioapic->rtc_status.dest_map, KVM_MAX_VCPUS);
 +}
 +
 +static void rtc_irq_restore(struct kvm_ioapic *ioapic)
 +{
 +struct kvm_vcpu *vcpu;
 +int vector, i, pending_eoi = 0;
 +
 +if (RTC_GSI >= IOAPIC_NUM_PINS)
 +return;
 +
 +vector = ioapic->redirtbl[RTC_GSI].fields.vector;
 +kvm_for_each_vcpu(i, vcpu, ioapic->kvm) {
 +if (kvm_apic_pending_eoi(vcpu, vector)) {
 +pending_eoi++;
 +__set_bit(vcpu->vcpu_id, ioapic->rtc_status.dest_map);
 You should clear dest_map at the beginning to get rid of stale bits.
 I thought kvm_set_ioapic is called only after save/restore or 
 migration.
 And
 the
 ioapic should be reset successfully before call it. So the
 dest_map is empty before call rtc_irq_restore().
 But it is possible kvm_set_ioapic is called beside save/restore or
 migration. Right?
 
 First of all userspace should not care when it calls kvm_set_ioapic()
 the kernel need to do the right thing. Second, believe it or not,
 kvm_ioapic_reset() is not called during system reset. Instead userspace
 reset it by calling kvm_set_ioapic() with ioapic state after reset.
 Ok. I see. As the logic you suggested, it will clear dest_map if no
 pending eoi in vcpu, so we don't need to do it again.
 
 You again rely on userspace doing things in a certain manner. What if
 set_lapic() is never called? Kernel internal state has to be correct
 after each ioctl call.
 Sorry. I cannot figure out what the problem is if we don't clear dest_map.
 Can you elaborate?
 
 What is not obvious about it? If there is a bit in dest_map that should
 be cleared after rtc_irq_restore(), it will not be.
 Why will it not? If new_val is false and old_val is true,
 __clear_bit() will clear the bit in dest_map. Am I wrong?
 
 Ah, yes, with the new logic, since we go over all vcpus and calculate a
 new value for each one, in theory it should be fine, but if we add cpu
 destruction this will no longer be true.
 
 new_val = kvm_apic_pending_eoi(vcpu, vector);
 Which reminds me there are more bugs in the current code. It is not
 enough to call kvm_apic_pending_eoi() to check the new value. You need to
 see if the entry is masked and vcpu is the destination of the interrupt too.
No. kvm_apic_pending_eoi() is the right way. The IOAPIC entry may change after
the vcpu received the interrupt but before it issues EOI, and we should not
rely on the entry.
For example:
vcpu A received the interrupt, pending in IRR
mask entry
migration happened

The only

RE: [PATCH v7 4/7] KVM: Add reset/restore rtc_status support

2013-04-08 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-04-08:
 On Mon, Apr 08, 2013 at 11:21:34AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-04-07:
 On Sun, Apr 07, 2013 at 01:16:51PM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-04-07:
 On Sun, Apr 07, 2013 at 01:05:02PM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-04-07:
 On Sun, Apr 07, 2013 at 12:39:32PM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-04-07:
 On Sun, Apr 07, 2013 at 02:30:15AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-04-04:
 On Mon, Apr 01, 2013 at 08:40:13AM +0800, Yang Zhang wrote:
 From: Yang Zhang yang.z.zh...@intel.com
 
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 ---
  arch/x86/kvm/lapic.c |    9 +
  arch/x86/kvm/lapic.h |    2 ++
  virt/kvm/ioapic.c    |   43 +++
  virt/kvm/ioapic.h    |    1 +
  4 files changed, 55 insertions(+), 0 deletions(-)
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 96ab160..9c041fa 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -94,6 +94,14 @@ static inline int apic_test_vector(int vec,
 void
 *bitmap)
return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
  }
 +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector)
 +{
 +  struct kvm_lapic *apic = vcpu->arch.apic;
 +
 +  return apic_test_vector(vector, apic->regs + APIC_ISR) ||
 +  apic_test_vector(vector, apic->regs + APIC_IRR);
 +}
 +
  static inline void apic_set_vector(int vec, void *bitmap)
  {
set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
 @@ -1665,6 +1673,7 @@ void
 kvm_apic_post_state_restore(struct
 kvm_vcpu
 *vcpu,
 apic->highest_isr_cache = -1;
 kvm_x86_ops->hwapic_isr_update(vcpu->kvm, apic_find_highest_isr(apic));
 kvm_make_request(KVM_REQ_EVENT, vcpu);
 + kvm_rtc_irq_restore(vcpu);
 }
  
  void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
 diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
 index 967519c..004d2ad 100644
 --- a/arch/x86/kvm/lapic.h
 +++ b/arch/x86/kvm/lapic.h
 @@ -170,4 +170,6 @@ static inline bool
 kvm_apic_has_events(struct
 kvm_vcpu *vcpu)
 return vcpu->arch.apic->pending_events;
  }
 +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);
 +
  #endif
 diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
 index 8664812..0b12b17 100644
 --- a/virt/kvm/ioapic.c
 +++ b/virt/kvm/ioapic.c
 @@ -90,6 +90,47 @@ static unsigned long
 ioapic_read_indirect(struct
 kvm_ioapic *ioapic,
return result;
  }
 +static void rtc_irq_reset(struct kvm_ioapic *ioapic)
 +{
 +	ioapic->rtc_status.pending_eoi = 0;
 +	bitmap_zero(ioapic->rtc_status.dest_map, KVM_MAX_VCPUS);
 +}
 +
 +static void rtc_irq_restore(struct kvm_ioapic *ioapic)
 +{
 +	struct kvm_vcpu *vcpu;
 +	int vector, i, pending_eoi = 0;
 +
 +	if (RTC_GSI >= IOAPIC_NUM_PINS)
 +		return;
 +
 +	vector = ioapic->redirtbl[RTC_GSI].fields.vector;
 +	kvm_for_each_vcpu(i, vcpu, ioapic->kvm) {
 +		if (kvm_apic_pending_eoi(vcpu, vector)) {
 +			pending_eoi++;
 +			__set_bit(vcpu->vcpu_id, ioapic->rtc_status.dest_map);
 You should clear dest_map at the beginning to get rid of stale bits.
 I thought kvm_set_ioapic is called only after save/restore or
 migration.
 And
 the
 ioapic should be reset successfully before call it. So the
 dest_map is empty before call rtc_irq_restore().
 But it is possible kvm_set_ioapic is called beside save/restore or
 migration. Right?
 
 First of all userspace should not care when it calls
 kvm_set_ioapic() the kernel need to do the right thing. Second,
 believe it or not, kvm_ioapic_reset() is not called during
 system reset. Instead userspace reset it by calling
 kvm_set_ioapic() with ioapic state after reset.
 Ok. I see. As the logic you suggested, it will clear dest_map if no
 pending eoi in vcpu, so we don't need to do it again.
 
 You again rely on userspace doing things in a certain manner. What if
 set_lapic() is never called? Kernel internal state has to be correct
 after each ioctl call.
 Sorry. I cannot figure out what the problem is if we don't clear dest_map.
 Can you elaborate?
 
 What is not obvious about it? If there is a bit in dest_map that should
 be cleared after rtc_irq_restore(), it will not be.
 Why will it not? If new_val is false and old_val is true,
 __clear_bit() will clear the bit in dest_map. Am I wrong?
 
 Ah, yes, with the new logic, since we go over all vcpus and calculate a
 new value for each one, in theory it should be fine, but if we add cpu
 destruction this will no longer be true.
 
 new_val = kvm_apic_pending_eoi(vcpu, vector);
 Which reminds me there are more bugs in the current code. It is not
 enough to call kvm_apic_pending_eoi() to check the new value. You need to
 see if the entry is masked and vcpu is the destination of the interrupt too.
 No. kvm_apic_pending_eoi() is the right way. The IOAPIC entry may change
 after the vcpu received the interrupt but before it issues EOI, and we
 should not rely on the entry. For example: vcpu A received

RE: [PATCH v7 4/7] KVM: Add reset/restore rtc_status support

2013-04-07 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-04-07:
 On Sun, Apr 07, 2013 at 02:30:15AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-04-04:
 On Mon, Apr 01, 2013 at 08:40:13AM +0800, Yang Zhang wrote:
 From: Yang Zhang yang.z.zh...@intel.com
 
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 ---
  arch/x86/kvm/lapic.c |    9 +
  arch/x86/kvm/lapic.h |    2 ++
  virt/kvm/ioapic.c    |   43 +++
  virt/kvm/ioapic.h    |    1 +
  4 files changed, 55 insertions(+), 0 deletions(-)
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 96ab160..9c041fa 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -94,6 +94,14 @@ static inline int apic_test_vector(int vec, void 
 *bitmap)
return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
  }
 +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector)
 +{
 +  struct kvm_lapic *apic = vcpu->arch.apic;
 +
 +  return apic_test_vector(vector, apic->regs + APIC_ISR) ||
 +  apic_test_vector(vector, apic->regs + APIC_IRR);
 +}
 +
  static inline void apic_set_vector(int vec, void *bitmap)
  {
set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
 @@ -1665,6 +1673,7 @@ void kvm_apic_post_state_restore(struct
 kvm_vcpu
 *vcpu,
 apic->highest_isr_cache = -1;
 kvm_x86_ops->hwapic_isr_update(vcpu->kvm, apic_find_highest_isr(apic));
 kvm_make_request(KVM_REQ_EVENT, vcpu);
 +  kvm_rtc_irq_restore(vcpu);
 }
  
  void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
 diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
 index 967519c..004d2ad 100644
 --- a/arch/x86/kvm/lapic.h
 +++ b/arch/x86/kvm/lapic.h
 @@ -170,4 +170,6 @@ static inline bool kvm_apic_has_events(struct
 kvm_vcpu *vcpu)
 return vcpu->arch.apic->pending_events;
  }
 +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);
 +
  #endif
 diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
 index 8664812..0b12b17 100644
 --- a/virt/kvm/ioapic.c
 +++ b/virt/kvm/ioapic.c
 @@ -90,6 +90,47 @@ static unsigned long ioapic_read_indirect(struct
 kvm_ioapic *ioapic,
return result;
  }
 +static void rtc_irq_reset(struct kvm_ioapic *ioapic)
 +{
 +  ioapic->rtc_status.pending_eoi = 0;
 +  bitmap_zero(ioapic->rtc_status.dest_map, KVM_MAX_VCPUS);
 +}
 +
 +static void rtc_irq_restore(struct kvm_ioapic *ioapic)
 +{
 +  struct kvm_vcpu *vcpu;
 +  int vector, i, pending_eoi = 0;
 +
 +  if (RTC_GSI >= IOAPIC_NUM_PINS)
 +  return;
 +
 +  vector = ioapic->redirtbl[RTC_GSI].fields.vector;
 +  kvm_for_each_vcpu(i, vcpu, ioapic->kvm) {
 +  if (kvm_apic_pending_eoi(vcpu, vector)) {
 +  pending_eoi++;
 +  __set_bit(vcpu->vcpu_id, ioapic->rtc_status.dest_map);
 You should clear dest_map at the beginning to get rid of stale bits.
 I thought kvm_set_ioapic is called only after save/restore or migration,
 and the ioapic should have been reset successfully before calling it. So
 the dest_map is empty before calling rtc_irq_restore().
 But it is possible that kvm_set_ioapic is called besides save/restore or
 migration. Right?
 
 First of all, userspace should not care when it calls kvm_set_ioapic();
 the kernel needs to do the right thing. Second, believe it or not,
 kvm_ioapic_reset() is not called during system reset. Instead userspace
 resets it by calling kvm_set_ioapic() with the ioapic state after reset.
Ok. I see. With the logic you suggested, it will clear dest_map if there is no
pending eoi in the vcpu, so we don't need to do it again.

 
 
 +  }
 +  }
 +  ioapic-rtc_status.pending_eoi = pending_eoi;
 +}
 +
 +void kvm_rtc_irq_restore(struct kvm_vcpu *vcpu)
 +{
 +  struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
 +  int vector;
 +
 +  if (!ioapic)
 +  return;
 +
 Can this be called if ioapic == NULL?
 Yes. IIRC, the unit test exercises the lapic function without an ioapic.
 Unit tests are guest code. This has nothing to do with guest code.
 The unit test is not the one who creates the lapic.
Ok. Then !ioapic is unnecessary.

 
 Should check for if (RTC_GSI = IOAPIC_NUM_PINS) here too.
 Not necessary. kvm_rtc_irq_restore is called from arch/x86/ and we
 have the definition: #ifdef CONFIG_X86 #define RTC_GSI 8
 
 The check will always be false. With the logic you suggested below, this
 check is necessary for _all(), not _one().
 OK.
 
 
 
 +  spin_lock(&ioapic->lock);
 +  vector = ioapic->redirtbl[RTC_GSI].fields.vector;
 +  if (kvm_apic_pending_eoi(vcpu, vector)) {
 +  __set_bit(vcpu->vcpu_id, ioapic->rtc_status.dest_map);
 +  ioapic->rtc_status.pending_eoi++;
 The bit may have been set already. The logic should be:
 Right.
 
 
 
 new_val = kvm_apic_pending_eoi(vcpu, vector)
 old_val = test_bit(vcpu_id, dest_map)
 
 if (new_val == old_val)
 return;
 
 if (new_val) {
 __set_bit(vcpu_id, dest_map);
 pending_eoi++;
 } else {
 __clear_bit(vcpu_id, dest_map);
 pending_eoi--;
 }
 
 The naming of the above two functions is not good either. Call
 them something like kvm_rtc_eoi_tracking_restore_all
