RE: [PATCH] kvm/vmx: EPTP switching test
Michael S. Tsirkin wrote on 2015-11-16:
> This patch adds a new parameter: eptp_switching_test, which enables
> testing EPT switching on VMX if supported by hardware. All EPT
> entries are initialized to the same value so this adds no useful
> functionality by itself, but can be used to test VMFUNC performance,
> and serve as a basis for future features based on EPTP switching.
>
> Support for nested virt is not enabled.
>
> This was tested using the following code within guest:
>
> #define VMX_VMFUNC ".byte 0x0f,0x01,0xd4"
> static void vmfunc(unsigned int nr, unsigned int ept)
> {
>         asm volatile(VMX_VMFUNC
>                      :
>                      : "a"(nr), "c"(ept)
>                      : "memory");
> }
>
> VMFUNC instruction cost was measured at ~122 cycles.
> (Note: recent versions of the GNU toolchain support the vmfunc
> instruction, removing the need for writing the bytecode manually.)
>
> Signed-off-by: Michael S. Tsirkin
> ---
>
> I think I'd like to put this upstream so future eptp switching work
> can be implemented on top. Comments?

We have a different version in hand which uses a separate EPTP. As you
know, the patch becomes more complex when using a separate EPTP, and
there are still many things left to do in our version. We will send it
out for comments soon.

Best regards,
Yang
RE: [PATCH] KVM: x86: always set accessed bit in shadow PTEs
Paolo Bonzini wrote on 2015-11-13:
> Commit 7a1638ce4220 ("nEPT: Redefine EPT-specific link_shadow_page()",
> 2013-08-05) says:
>
>     Since nEPT doesn't support A/D bit, we should not set those bit
>     when building the shadow page table.
>
> but this is not necessary. Even though nEPT doesn't support A/D
> bits, and hence the vmcs12 EPT pointer will never enable them,
> we always use them for shadow page tables if available (see
> construct_eptp in vmx.c). So we can set the A/D bits freely
> in the shadow page table.
>
> This patch hence basically reverts commit 7a1638ce4220.
>
> Cc: Yang Zhang
> Cc: Takuya Yoshikawa
> Signed-off-by: Paolo Bonzini

Looks good to me.

Best regards,
Yang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Zhang, Yang Z wrote on 2015-08-04:
> Paolo Bonzini wrote on 2015-08-04:
>>
>>
>> On 04/08/2015 02:46, Zhang, Yang Z wrote:
>>>> It is a problem for split irqchip, where the EOI exit bitmap can
>>>> be inferred from the IOAPIC routes but the TMR cannot. The
>>>> hardware behavior on the other hand can be implemented purely
>>>> within the LAPIC.
>>>
>>> So updating the TMR within LAPIC is the only solution to handle it?
>>
>> It's the simplest and the one that makes most sense. Considering
>> that TMR is a pretty obscure feature, it's unlikely that it will be
>> accelerated in the future.
>
> You may be right. It is safe if no future hardware plans to use it.
> Let me check with our hardware team to see whether it will be used or
> not in future.

After checking with Jun, there is no guarantee that a guest running on
another CPU will operate properly if the hypervisor modifies the vTMR
from a different CPU. So the hypervisor should not do it.

>>
>> Paolo
>
> Best regards,
> Yang

Best regards,
Yang
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-08-04:
>
>
> On 04/08/2015 02:46, Zhang, Yang Z wrote:
>>> It is a problem for split irqchip, where the EOI exit bitmap can be
>>> inferred from the IOAPIC routes but the TMR cannot. The hardware
>>> behavior on the other hand can be implemented purely within the LAPIC.
>>
>> So updating the TMR within LAPIC is the only solution to handle it?
>
> It's the simplest and the one that makes most sense. Considering that
> TMR is a pretty obscure feature, it's unlikely that it will be
> accelerated in the future.

You may be right. It is safe if no future hardware plans to use it. Let
me check with our hardware team to see whether it will be used or not in
the future.

>
> Paolo

Best regards,
Yang
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-08-03:
>
>
> On 03/08/2015 12:23, Zhang, Yang Z wrote:
>>> In any case, the TMR behavior introduced by the APICv patches is
>>> completely different from the hardware behavior, so it has to be fixed.
>>
>> But any real problem with it?
>
> It is a problem for split irqchip, where the EOI exit bitmap can be
> inferred from the IOAPIC routes but the TMR cannot. The hardware
> behavior on the other hand can be implemented purely within the LAPIC.

So updating the TMR within the LAPIC is the only solution to handle it?

>>> The alternative is to inject level-triggered interrupts
>>> synchronously, without using posted interrupts.
>>>
>>> I'll write some testcases to understand the functioning of TMR in
>>> the virtual-APIC page, but the manual seems clear to me.
>>
>> Currently, no existing hardware will use TMR and it will not cause
>> any problem. (That's the reason why we leave it in Xen.) But we don't
>> know whether future hardware will use it or not (the SDM always keeps
>> changing :)).
>
> But that would be covered by a different execution control (for
> backwards compatibility). We'll get there when such a feature is
> introduced.

Yes, we can leave that for the future. But one concern is that it may be
hard to handle at that time if someone also develops a feature that
relies on it (like the current patch to split the irqchip).

>> And per 24.11.4's description, the perfect solution is not to modify
>> it. Btw, IIRC, only the TMR doesn't follow the rule. All other VMCS
>> accesses are issued in the right VMCS context.
>
> Yes, that's correct. It's just the TMR.
>
> Paolo

Best regards,
Yang
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-08-03:
>
>
> On 03/08/2015 04:37, Zhang, Yang Z wrote:
>>>> Only virtualized APIC register reads use the virtual TMR
>>>> registers (SDM 29.4.2 or 29.5), but these just read data from the
>>>> corresponding field in the virtual APIC page.
>>
>> 24.11.4 Software Access to Related Structures
>> In addition to data in the VMCS region itself, VMX non-root operation
>> can be controlled by data structures that are referenced by pointers
>> in a VMCS (for example, the I/O bitmaps). While the pointers to these
>> data structures are parts of the VMCS, the data structures themselves
>> are not. They are not accessible using VMREAD and VMWRITE but by
>> ordinary memory writes.
>
> I don't think the TMR fields of the virtual-APIC page are among the
> data structures that _control_ VMX non-root operation, because:
>
> * it is not part of the virtualized APIC state listed in 29.1.1
>
> * read accesses from the APIC-access page simply return data from the
> corresponding page offset on the virtual-APIC page using the memory
> access type stored in IA32_VMX_BASIC_MSR. I think this explicitly
> says that the effects of 24.11.1 (especially non-deterministic
> behavior after a write) do not apply here.
>
> In any case, the TMR behavior introduced by the APICv patches is
> completely different from the hardware behavior, so it has to be fixed.

But any real problem with it?

> The alternative is to inject level-triggered interrupts
> synchronously, without using posted interrupts.
>
> I'll write some testcases to understand the functioning of TMR in the
> virtual-APIC page, but the manual seems clear to me.

Currently, no existing hardware uses the TMR, so it will not cause any
problem. (That's the reason why we leave it in Xen.) But we don't know
whether future hardware will use it or not (the SDM always keeps
changing :)). And per 24.11.4's description, the perfect solution is not
to modify it. Btw, IIRC, only the TMR doesn't follow the rule; all other
VMCS accesses are issued in the right VMCS context.

>
> Paolo

Best regards,
Yang
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-07-31:
>
>
> On 31/07/2015 04:49, Steve Rutherford wrote:
>> Oh... Yeah. That's a damn good point, given that the interrupt can be
>> injected from another thread while one is in that guest vcpu.
>>
>> Easiest time to update the TMR should be on guest entry through
>> vcpu_scan_ioapic, as before.
>>
>> The best way to go is probably to ditch the new per vcpu EOI exit
>> bitmap, and just update/use the TMR. There's no reason to duplicate
>> that data in the representation of the apic (I suspect that the
>> duplication was inspired by my mistaken notion of the TMR). The
>> IOAPIC exit check can use the TMR instead.
>>
>> Based upon my reading of the SDM, the only reason that the eoi exit
>> bitmaps are not the exact same as the TMR is that it is possible to
>> have virtual-interrupt delivery enabled /without/ an apic access page
>> (Note: V-ID => EOI exit bitmap enabled).
>>
>> Yang, do you happen to know if that is the case?
>>
>> [Note: Just looked back at the old code for updating the EOI exit
>> bitmaps, which for some reason was configured to trigger EOI exits
>> for all IOAPIC interrupts, not just level-triggered IOAPIC
>> interrupts. Which is weird, and I believe totally unnecessary.]
>
> The RTC interrupt needs to trigger an EOI exit with the in-kernel
> IOAPIC, in order to detect coalesced interrupts. This is necessary to
> avoid severe time drift in Windows guests.
>
> Paolo

Correct! The EOI exit bitmap is not exactly the same as the TMR: some
interrupts (especially the timer interrupt) are edge-triggered but still
need a notification when they get injected into the guest.

Best regards,
Yang
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-07-31:
>
>
> On 31/07/2015 01:26, Zhang, Yang Z wrote:
>>>> Do not compute TMR in advance. Instead, set the TMR just before
>>>> the interrupt is accepted into the IRR. This limits the coupling
>>>> between IOAPIC and LAPIC.
>>
>> Uh.., it back to original way which is wrong. You cannot modify the
>> apic page(here is the TMR reg) directly when the corresponding VMCS
>> may be used at same time.
>
> Where is this documented? The TMR is not part of the set of virtualized
> APIC registers (TPR, PPR, EOI, ISR, IRR, ICR+ICR2; SDM 29.1.1).
>
> Only virtualized APIC register reads use the virtual TMR registers (SDM
> 29.4.2 or 29.5), but these just read data from the corresponding field
> in the virtual APIC page.

It's not only the virtual-APIC page. All data structures that are used
by the VMCS (either contained directly in the VMCS or referenced by a
pointer in a VMCS) need to follow this rule:

24.11.4 Software Access to Related Structures
In addition to data in the VMCS region itself, VMX non-root operation
can be controlled by data structures that are referenced by pointers in
a VMCS (for example, the I/O bitmaps). While the pointers to these data
structures are parts of the VMCS, the data structures themselves are
not. They are not accessible using VMREAD and VMWRITE but by ordinary
memory writes. Software should ensure that each such data structure is
modified only when no logical processor with a current VMCS that
references it is in VMX non-root operation. Doing otherwise may lead to
unpredictable behavior (including behaviors identified in Section
24.11.1).

>
> Paolo

Best regards,
Yang
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-07-29:
> Do not compute TMR in advance. Instead, set the TMR just before the
> interrupt is accepted into the IRR. This limits the coupling between
> IOAPIC and LAPIC.

Uh.., this goes back to the original way, which is wrong. You cannot
modify the APIC page (here, the TMR register) directly while the
corresponding VMCS may be in use at the same time.

> Signed-off-by: Paolo Bonzini
> ---
>  arch/x86/kvm/ioapic.c | 9 ++---
>  arch/x86/kvm/ioapic.h | 3 +--
>  arch/x86/kvm/lapic.c  | 19 ++-
>  arch/x86/kvm/lapic.h  | 1 -
>  arch/x86/kvm/x86.c    | 5 +
>  5 files changed, 14 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
> index 856f79105bb5..eaf4ec38d980 100644
> --- a/arch/x86/kvm/ioapic.c
> +++ b/arch/x86/kvm/ioapic.c
> @@ -246,8 +246,7 @@ static void update_handled_vectors(struct kvm_ioapic *ioapic)
>  	smp_wmb();
>  }
>
> -void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap,
> -			   u32 *tmr)
> +void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
>  {
>  	struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
>  	union kvm_ioapic_redirect_entry *e;
> @@ -260,13 +259,9 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap,
>  		    kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index) ||
>  		    index == RTC_GSI) {
>  			if (kvm_apic_match_dest(vcpu, NULL, 0,
> -				e->fields.dest_id, e->fields.dest_mode)) {
> +				e->fields.dest_id, e->fields.dest_mode))
>  				__set_bit(e->fields.vector,
>  					  (unsigned long *)eoi_exit_bitmap);
> -				if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG)
> -					__set_bit(e->fields.vector,
> -						  (unsigned long *)tmr);
> -			}
>  		}
>  	}
>  	spin_unlock(&ioapic->lock);
>
> diff --git a/arch/x86/kvm/ioapic.h b/arch/x86/kvm/ioapic.h
> index ca0b0b4e6256..3dbd0e2aac4e 100644
> --- a/arch/x86/kvm/ioapic.h
> +++ b/arch/x86/kvm/ioapic.h
> @@ -120,7 +120,6 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>  		struct kvm_lapic_irq *irq, unsigned long *dest_map);
>  int kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
>  int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
> -void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap,
> -			   u32 *tmr);
> +void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
>
>  #endif
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 2a5ca97c263b..9be64c77d6db 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -551,15 +551,6 @@ static void pv_eoi_clr_pending(struct kvm_vcpu *vcpu)
>  	__clear_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
>  }
>
> -void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr)
> -{
> -	struct kvm_lapic *apic = vcpu->arch.apic;
> -	int i;
> -
> -	for (i = 0; i < 8; i++)
> -		apic_set_reg(apic, APIC_TMR + 0x10 * i, tmr[i]);
> -}
> -
>  static void apic_update_ppr(struct kvm_lapic *apic)
>  {
>  	u32 tpr, isrv, ppr, old_ppr;
> @@ -781,6 +772,9 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
>  	case APIC_DM_LOWEST:
>  		vcpu->arch.apic_arb_prio++;
>  	case APIC_DM_FIXED:
> +		if (unlikely(trig_mode && !level))
> +			break;
> +
>  		/* FIXME add logic for vcpu on reset */
>  		if (unlikely(!apic_enabled(apic)))
>  			break;
> @@ -790,6 +784,13 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
>  		if (dest_map)
>  			__set_bit(vcpu->vcpu_id, dest_map);
>
> +		if (apic_test_vector(vector, apic->regs + APIC_TMR) != !!trig_mode) {
> +			if (trig_mode)
> +				apic_set_vector(vector, apic->regs + APIC_TMR);
> +			else
> +				apic_clear_vector(vector, apic->regs + APIC_TMR);
> +		}
> +
>  		if (kvm_x86_ops->deliver_posted_interrupt)
>  			kvm_x86_ops->deliver_posted_interrupt(vcpu, vector);
>  		else {
>
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 764037991d26..eb46d6bcaa75 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -57,7 +57,6 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
>  u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
>  void kvm_apic_set_version(struct kvm_vcpu *vcpu);
>
> -void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr);
>  void __kvm_apic_update_irr(u32 *pir, void *regs);
>  void
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-07-29: Do not compute TMR in advance. Instead, set the TMR just before the interrupt is accepted into the IRR. This limits the coupling between IOAPIC and LAPIC. Uh.., it back to original way which is wrong. You cannot modify the apic page(here is the TMR reg) directly when the corresponding VMCS may be used at same time. Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- arch/x86/kvm/ioapic.c | 9 ++--- arch/x86/kvm/ioapic.h | 3 +-- arch/x86/kvm/lapic.c | 19 ++- arch/x86/kvm/lapic.h | 1 - arch/x86/kvm/x86.c| 5 + 5 files changed, 14 insertions(+), 23 deletions(-) diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c index 856f79105bb5..eaf4ec38d980 100644 --- a/arch/x86/kvm/ioapic.c +++ b/arch/x86/kvm/ioapic.c @@ -246,8 +246,7 @@ static void update_handled_vectors(struct kvm_ioapic *ioapic) smp_wmb(); } -void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap, - u32 *tmr) +void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap) { struct kvm_ioapic *ioapic = vcpu-kvm-arch.vioapic; union kvm_ioapic_redirect_entry *e; @@ -260,13 +259,9 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap, kvm_irq_has_notifier(ioapic-kvm, KVM_IRQCHIP_IOAPIC, index) || index == RTC_GSI) { if (kvm_apic_match_dest(vcpu, NULL, 0, - e-fields.dest_id, e-fields.dest_mode)) { + e-fields.dest_id, e-fields.dest_mode)) __set_bit(e-fields.vector, (unsigned long *)eoi_exit_bitmap); - if (e-fields.trig_mode == IOAPIC_LEVEL_TRIG) - __set_bit(e-fields.vector, - (unsigned long *)tmr); - } } } spin_unlock(ioapic-lock); diff --git a/arch/x86/kvm/ioapic.h b/arch/x86/kvm/ioapic.h index ca0b0b4e6256..3dbd0e2aac4e 100644 --- a/arch/x86/kvm/ioapic.h +++ b/arch/x86/kvm/ioapic.h @@ -120,7 +120,6 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src, struct kvm_lapic_irq *irq, unsigned long *dest_map); int kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state); int kvm_set_ioapic(struct kvm *kvm, 
struct kvm_ioapic_state *state); -void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap, - u32 *tmr); +void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap); #endif diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 2a5ca97c263b..9be64c77d6db 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -551,15 +551,6 @@ static void pv_eoi_clr_pending(struct kvm_vcpu *vcpu) __clear_bit(KVM_APIC_PV_EOI_PENDING, vcpu-arch.apic_attention); } -void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr) -{ - struct kvm_lapic *apic = vcpu-arch.apic; - int i; - - for (i = 0; i 8; i++) - apic_set_reg(apic, APIC_TMR + 0x10 * i, tmr[i]); -} - static void apic_update_ppr(struct kvm_lapic *apic) { u32 tpr, isrv, ppr, old_ppr; @@ -781,6 +772,9 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode, case APIC_DM_LOWEST: vcpu-arch.apic_arb_prio++; case APIC_DM_FIXED: + if (unlikely(trig_mode !level)) + break; + /* FIXME add logic for vcpu on reset */ if (unlikely(!apic_enabled(apic))) break; @@ -790,6 +784,13 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode, if (dest_map) __set_bit(vcpu-vcpu_id, dest_map); + if (apic_test_vector(vector, apic-regs + APIC_TMR) != !!trig_mode) { + if (trig_mode) + apic_set_vector(vector, apic-regs + APIC_TMR); + else + apic_clear_vector(vector, apic-regs + APIC_TMR); + } + if (kvm_x86_ops-deliver_posted_interrupt) kvm_x86_ops-deliver_posted_interrupt(vcpu, vector); else { diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 764037991d26..eb46d6bcaa75 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -57,7 +57,6 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value); u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu); void kvm_apic_set_version(struct kvm_vcpu *vcpu); -void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr); void __kvm_apic_update_irr(u32 *pir, void *regs); void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir); int 
kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct
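For reference, the per-vector TMR update in the diff above addresses 256 vector bits spread over eight 32-bit APIC registers spaced 0x10 apart. A minimal user-space sketch of that addressing scheme (the helper names mirror the kernel's, but this is a simplified stand-in, not the actual KVM implementation):

```c
#include <assert.h>
#include <stdint.h>

/* Vector v maps to bit (v & 31) of the 32-bit register at byte offset
 * 0x10 * (v >> 5) from the bank base: eight registers, 0x10 apart. */
static uint32_t *vec_reg(void *bitmap, int vec)
{
	return (uint32_t *)((uint8_t *)bitmap + 0x10 * (vec >> 5));
}

static int apic_test_vector(int vec, void *bitmap)
{
	return (*vec_reg(bitmap, vec) >> (vec & 31)) & 1;
}

static void apic_set_vector(int vec, void *bitmap)
{
	*vec_reg(bitmap, vec) |= 1u << (vec & 31);
}

static void apic_clear_vector(int vec, void *bitmap)
{
	*vec_reg(bitmap, vec) &= ~(1u << (vec & 31));
}
```

With this layout the patch can flip a single vector's TMR bit at interrupt-acceptance time, instead of rewriting all eight TMR registers as the removed kvm_apic_update_tmr() did.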
RE: [PATCH] kvm/fpu: Enable eager restore kvm FPU for MPX
Paolo Bonzini wrote on 2015-05-20: > > > On 20/05/2015 07:20, Zhang, Yang Z wrote: >> Li, Liang Z wrote on 2015-05-20: >>> The MPX feature requires eager KVM FPU restore support. We have >>> verified that MPX cannot work correctly with the current lazy KVM >>> FPU restore mechanism. Eager KVM FPU restore should be enabled if >>> the MPX feature is exposed to VM. >>> >>> Signed-off-by: Liang Li >>> --- >>> arch/x86/kvm/vmx.c | 2 ++ >>> arch/x86/kvm/x86.c | 3 ++- >>> 2 files changed, 4 insertions(+), 1 deletion(-) >>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index >>> f7b6168..e2cccbe 100644 >>> --- a/arch/x86/kvm/vmx.c >>> +++ b/arch/x86/kvm/vmx.c >>> @@ -8445,6 +8445,8 @@ static struct kvm_vcpu >>> *vmx_create_vcpu(struct kvm *kvm, unsigned int id) >>> goto free_vmcs; >>> } >>> + if (vmx_mpx_supported()) >>> + vmx_fpu_activate(&vmx->vcpu); >>> return &vmx->vcpu; >>> >>> free_vmcs: >> >> Is it better to use guest_cpuid_has_mpx() instead of vmx_mpx_supported()? > > CPUID hasn't been set yet, so I think it is okay to key it on > vmx_mpx_supported(). It will be deactivated soon afterwards. > > Or even do it unconditionally; just make sure to add a comment about it. Correct! Unconditionally would be acceptable. > >>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index >>> 5f38188..5993f5f >>> 100644 >>> --- a/arch/x86/kvm/x86.c >>> +++ b/arch/x86/kvm/x86.c >>> @@ -7060,7 +7060,8 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu) >>> fpu_save_init(&vcpu->arch.guest_fpu); >>> __kernel_fpu_end(); >>> ++vcpu->stat.fpu_reload; >>> - kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); >>> + if (!kvm_x86_ops->mpx_supported()) >>> + kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); >>> trace_kvm_fpu(0); >>> } > > This is a hotter path. Here it's definitely better to avoid the call > to kvm_x86_ops->mpx_supported(). Especially because, with MPX > enabled, you would call this on every userspace exit. > > Yang's suggestion of using CPUID is actually more valuable here.
You > could add a new field eager_fpu in kvm->arch and update it in > kvm_update_cpuid. Good suggestion! > > Thanks, > > Paolo Best regards, Yang
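Paolo's suggestion, caching the CPUID-derived decision once instead of calling mpx_supported() on every FPU put, can be sketched as follows. The struct layout and helper bodies here are illustrative stand-ins for the KVM structures, not the actual kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the KVM structures involved. */
struct kvm_arch { bool eager_fpu; };
struct kvm_vcpu {
	struct kvm_arch arch;
	bool fpu_deactivate_requested;
	bool guest_has_mpx;		/* would come from guest CPUID */
};

/* Cold path, called whenever guest CPUID changes: cache the
 * eager-FPU decision once. */
static void kvm_update_cpuid(struct kvm_vcpu *vcpu)
{
	vcpu->arch.eager_fpu = vcpu->guest_has_mpx;
}

/* Hot path (runs on every userspace exit with MPX enabled): just a
 * cached flag check, no indirect kvm_x86_ops->mpx_supported() call. */
static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
{
	if (!vcpu->arch.eager_fpu)
		vcpu->fpu_deactivate_requested = true;
}
```

The design point is simply moving the cost from the per-exit path to the rare CPUID-update path.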
RE: [PATCH] kvm/fpu: Enable eager restore kvm FPU for MPX
Li, Liang Z wrote on 2015-05-20: > The MPX feature requires eager KVM FPU restore support. We have > verified that MPX cannot work correctly with the current lazy KVM FPU > restore mechanism. Eager KVM FPU restore should be enabled if the MPX > feature is exposed to VM. > > Signed-off-by: Liang Li > --- > arch/x86/kvm/vmx.c | 2 ++ > arch/x86/kvm/x86.c | 3 ++- > 2 files changed, 4 insertions(+), 1 deletion(-) > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index > f7b6168..e2cccbe 100644 --- a/arch/x86/kvm/vmx.c +++ > b/arch/x86/kvm/vmx.c @@ -8445,6 +8445,8 @@ static struct kvm_vcpu > *vmx_create_vcpu(struct kvm *kvm, unsigned int id) > goto free_vmcs; > } > + if (vmx_mpx_supported()) Is it better to use guest_cpuid_has_mpx() instead of vmx_mpx_supported()? > + vmx_fpu_activate(&vmx->vcpu); > return &vmx->vcpu; > > free_vmcs: > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index > 5f38188..5993f5f 100644 --- a/arch/x86/kvm/x86.c +++ > b/arch/x86/kvm/x86.c @@ -7060,7 +7060,8 @@ void kvm_put_guest_fpu(struct > kvm_vcpu *vcpu) > fpu_save_init(&vcpu->arch.guest_fpu); > __kernel_fpu_end(); > ++vcpu->stat.fpu_reload; > - kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); > + if (!kvm_x86_ops->mpx_supported()) > + kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); > trace_kvm_fpu(0); > } Best regards, Yang
RE: [RFC PATCH 00/13] KVM: x86: SMM support
Paolo Bonzini wrote on 2015-05-19: > > > On 19/05/2015 16:25, Zhang, Yang Z wrote: >> Paolo Bonzini wrote on 2015-04-30: >>> This patch series introduces system management mode support. >> >> Just curious, what's the motivation to add vSMM support? Is there any >> use case inside the guest that requires SMM? Thanks. > > SMM provides the trusted base for UEFI Secure Boot. > > Paolo OK, I see it. Thanks for your explanation. Best regards, Yang
RE: [RFC PATCH 00/13] KVM: x86: SMM support
Paolo Bonzini wrote on 2015-04-30: > This patch series introduces system management mode support. Just curious, what's the motivation to add vSMM support? Is there any use case inside the guest that requires SMM? Thanks. > There is still some work to do, namely: test without unrestricted > guest support, test on AMD, disable the capability if !unrestricted > guest and !emulate invalid guest state(*), test with a QEMU that > understands KVM_MEM_X86_SMRAM, actually post QEMU patches that let you use > this. > > (*) newer chipsets moved away from legacy SMRAM at 0xa0000, > thus support for real mode CS base above 1M is necessary > > Because legacy SMRAM is a mess, I have tried these patches with Q35's > high SMRAM (at 0xfeda0000). This means that right now this isn't the > easiest thing to test; you need QEMU patches that add support for high > SMRAM, and SeaBIOS patches to use high SMRAM. Until QEMU support for > KVM_MEM_X86_SMRAM is in place, also, I'm keeping SMRAM open in SeaBIOS. > > That said, even this clumsy and incomplete userspace configuration is > enough to test all patches except 11 and 12. > > The series is structured as follows. > > Patch 1 is an unrelated bugfix (I think). Patches 2 to 6 extend some > infrastructure functions. Patches 1 to 4 could be committed right now. > > Patches 7 to 9 implement basic support for SMM in the KVM API and > teach KVM about doing the world switch on SMI and RSM. > > Patch 10 touches all places in KVM that read/write guest memory to go > through an x86-specific function. The x86-specific function takes a > VCPU rather than a struct kvm. This is used in patches 11 and 12 to > limit access to specially marked SMRAM slots unless the VCPU is in > system management mode. > > Finally, patch 13 exposes the new capability for userspace to probe. > Best regards, Yang
RE: [v6] kvm/fpu: Enable fully eager restore kvm FPU
Paolo Bonzini wrote on 2015-04-24: > > > On 24/04/2015 09:46, Zhang, Yang Z wrote: >>> On the other hand vmexit is lighter and lighter on newer >>> processors; a Sandy Bridge has less than half the vmexit cost of a >>> Core 2 (IIRC >>> 1000 vs. 2500 clock cycles approximately). >> >> 1000 cycles? I remember it takes about 4000 cycles even on an HSW server. > > I was going from memory, but I now measured it with the vmexit test of > kvm-unit-tests. With both SNB Xeon E5 and IVB Core i7, it returns about > 1400 clock cycles for a vmcall exit. This includes the overhead of > doing the cpuid itself. > > Thus the vmexit cost is around 1300 cycles. Of this the vmresume > instruction is probably around 800 cycles, and the rest is introduced > by KVM. There are at least 4-5 memory barriers and locked instructions. Yes, that makes sense. The average vmexit/vmentry handling cost is around 4000 cycles. But I guess xsaveopt doesn't take so many cycles. Does anyone have the xsaveopt cost data? > > Paolo Best regards, Yang
RE: [v6] kvm/fpu: Enable fully eager restore kvm FPU
Paolo Bonzini wrote on 2015-04-24: > > > On 24/04/2015 03:16, Zhang, Yang Z wrote: >>> This is interesting since previous measurements on KVM have had the >>> exact opposite results. I think we need to understand this a lot >>> more. >> >> What I can tell is that vmexit is heavy. So it is reasonable to see >> the improvement in some cases, especially since the kernel is using eager >> FPU now, which means each schedule may trigger a vmexit. > > On the other hand vmexit is lighter and lighter on newer processors; a > Sandy Bridge has less than half the vmexit cost of a Core 2 (IIRC 1000 > vs. 2500 clock cycles approximately). > 1000 cycles? I remember it takes about 4000 cycles even on an HSW server. > Also, measurements were done on Westmere but Sandy Bridge is the first > processor to have XSAVEOPT and thus use eager FPU. > > Paolo Best regards, Yang
RE: [v6] kvm/fpu: Enable fully eager restore kvm FPU
H. Peter Anvin wrote on 2015-04-24: > On 04/23/2015 08:28 AM, Dave Hansen wrote: >> On 04/23/2015 02:13 PM, Liang Li wrote: >>> When compiling kernel on westmere, the performance of eager FPU is >>> about 0.4% faster than lazy FPU. >> >> Do you have a theory why this is? What does the regression come from? >> > > This is interesting since previous measurements on KVM have had the > exact opposite results. I think we need to understand this a lot more. What I can tell is that vmexit is heavy. So it is reasonable to see the improvement in some cases, especially since the kernel is using eager FPU now, which means each schedule may trigger a vmexit. > > -hpa > Best regards, Yang
RE: [PATCH v4 6/6] KVM: nVMX: Enable nested posted interrupt processing
Paolo Bonzini wrote on 2015-02-03: > > > On 02/02/2015 16:33, Wincy Van wrote: >> static void vmx_accomp_nested_posted_intr(struct kvm_vcpu *vcpu) { >> struct vcpu_vmx *vmx = to_vmx(vcpu); >> >> if (is_guest_mode(vcpu) && >> vmx->nested.posted_intr_nv != -1 && >> pi_test_on(vmx->nested.pi_desc)) >> kvm_apic_set_irr(vcpu, >> vmx->nested.posted_intr_nv); } >> Then we will get a nested vmexit in vmx_check_nested_events, and that >> posted intr will be handled by L1 immediately. >> This mechanism will also emulate the hardware's behavior: If a >> posted intr was not accomplished by hardware, we will get an Actually, we cannot say "not accomplished by hardware". It is more like we didn't do the job well. See my answer below. >> interrupt with POSTED_INTR_NV. > > Yes. This is not enough. From L1's point of view, L2 is in VMX non-root mode. So we should emulate the posted interrupt in L0 correctly, say: 1. clear the ON bit; 2. ack the interrupt; 3. sync PIR to vIRR; 4. update RVI. Then let the hardware (virtual interrupt delivery) accomplish the interrupt injection. Forcing a vmexit is more like a trick. It's better to follow the hardware's behavior unless we cannot do it. > >> Would this be better? > > I think you do not even need a new bit. You can use KVM_REQ_EVENT and > (to complete my suggestion, which was not enough) do the above in > vmx_check_nested_events. > > Paolo Best regards, Yang
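The four steps Yang lists can be sketched in isolation. This is a simplified user-space model of the posted-interrupt descriptor, not KVM's actual pi_desc handling, and the interrupt ack (step 2) is outside the sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified posted-interrupt descriptor: 256 request bits plus the
 * outstanding-notification (ON) flag. */
struct pi_desc {
	uint32_t pir[8];
	uint32_t on;
};

/* Emulate delivery in L0 as described: clear ON, merge PIR into the
 * virtual IRR, and recompute RVI (highest pending vector, or -1). */
static int sync_pir_to_virr(struct pi_desc *pi, uint32_t virr[8])
{
	pi->on = 0;				/* 1. clear ON bit    */
	for (int i = 0; i < 8; i++) {		/* 3. sync PIR->vIRR  */
		virr[i] |= pi->pir[i];
		pi->pir[i] = 0;
	}
	for (int i = 7; i >= 0; i--)		/* 4. update RVI      */
		if (virr[i])
			return i * 32 + 31 - __builtin_clz(virr[i]);
	return -1;
}
```

With RVI updated, virtual interrupt delivery can inject the vector on the next entry, which is the "follow the hardware's behavior" path Yang prefers over forcing a vmexit.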
RE: [PATCH v4 2/6] KVM: nVMX: Enable nested virtualize x2apic mode
Wincy Van wrote on 2015-01-29: > On Thu, Jan 29, 2015 at 10:54 AM, Zhang, Yang Z > wrote: >>> -8646,7 +8750,8 @@ static void prepare_vmcs02(struct kvm_vcpu >>> *vcpu, struct vmcs12 *vmcs12) >>> else >>> vmcs_write64(APIC_ACCESS_ADDR, >>> >>> page_to_phys(vmx->nested.apic_access_page)); >>> - } else if >>> (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) { >>> + } else if >>> + (!(nested_cpu_has_virt_x2apic_mode(vmcs12)) >>> + && >>> + >>> + (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))) { >>> exec_control |= >> >> You don't load L2's apic_page in your patch correctly when x2apic >> mode is > used. Here is the right change for prepare_vmcs02()(maybe other place > also need change too): >> >> @@ -8585,7 +8585,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, >> struct vmcs12 *vmcs12) >> > CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)) >> exec_control |= >> vmcs12->secondary_vm_exec_control; >> >> - if (exec_control & >> SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) { + if >> (exec_control & (SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | + + >> SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) { >> /* >> * If translation failed, no matter: This >> feature > asks >> * to exit when accessing the given address, >> and if it @@ -8594,7 +8595,8 @@ static void prepare_vmcs02(struct > kvm_vcpu *vcpu, struct vmcs12 *vmcs12) >> */ >> if (!vmx->nested.apic_access_page) >> exec_control &= >> - ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; + >> ~ (SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | + + >> SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); >> else >> vmcs_write64(APIC_ACCESS_ADDR, >> page_to_phys(vmx->nested.apic_access_page)); >> > > I think we don't need to do this, if L1 enables x2apic mode, we have > already checked that the vmcs12->SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES > is 0. "exec_control |= vmcs12->secondary_vm_exec_control;" merged L1's > settings, including x2apic mode. 
the special case is > vm_need_virtualize_apic_accesses: if vm_need_virtualize_apic_accesses > returns true, nested_cpu_has_virt_x2apic_mode will prevent us from setting > the apic access bit. I just realized that we are talking about different things. I was thinking about VIRTUAL_APIC_PAGE_ADDR (my mistake :)), and it is already handled correctly. For APIC_ACCESS_ADDR, you are right. The current implementation can handle it well. > > > Thanks, > Wincy Best regards, Yang
RE: [PATCH v4 2/6] KVM: nVMX: Enable nested virtualize x2apic mode
Wincy Van wrote on 2015-01-28: > When L2 is using x2apic, we can use virtualize x2apic mode to gain higher > performance, especially in apicv case. > > This patch also introduces nested_vmx_check_apicv_controls for the nested > apicv patches. > > Signed-off-by: Wincy Van > --- > arch/x86/kvm/vmx.c | 114 > +++- > 1 files changed, 112 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 787f886..9d11a93 > 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -1108,6 +1108,11 @@ static inline bool nested_cpu_has_xsaves(struct > vmcs12 *vmcs12) > vmx_xsaves_supported(); > } > > +static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 > +*vmcs12) { > + return nested_cpu_has2(vmcs12, > +SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); > +} > + > static inline bool is_exception(u32 intr_info) { > return (intr_info & (INTR_INFO_INTR_TYPE_MASK | > INTR_INFO_VALID_MASK)) @@ -2395,6 +2400,7 @@ static __init void > nested_vmx_setup_ctls_msrs(void) > nested_vmx_secondary_ctls_low = 0; > nested_vmx_secondary_ctls_high &= > SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | > + SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | > SECONDARY_EXEC_WBINVD_EXITING | > SECONDARY_EXEC_XSAVES; > > @@ -4155,6 +4161,52 @@ static void > __vmx_enable_intercept_for_msr(unsigned long *msr_bitmap, > } > } > > +/* > + * If a msr is allowed by L0, we should check whether it is allowed by L1. > + * The corresponding bit will be cleared unless both of L0 and L1 allow it. > + */ > +static void nested_vmx_disable_intercept_for_msr(unsigned long > *msr_bitmap_l1, > + unsigned long > *msr_bitmap_nested, > + u32 msr, int type) { > + int f = sizeof(unsigned long); > + > + if (!cpu_has_vmx_msr_bitmap()) { > + WARN_ON(1); > + return; > + } > + > + /* > +* See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals > +* have the write-low and read-high bitmap offsets the wrong way > round. > +* We can control MSRs 0x-0x1fff and > 0xc000-0xc0001fff. 
> +*/ > + if (msr <= 0x1fff) { > + if (type & MSR_TYPE_R && > + !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) > + /* read-low */ > + __clear_bit(msr, msr_bitmap_nested + 0x000 / f); > + > + if (type & MSR_TYPE_W && > + !test_bit(msr, msr_bitmap_l1 + 0x800 / f)) > + /* write-low */ > + __clear_bit(msr, msr_bitmap_nested + 0x800 / f); > + > + } else if ((msr >= 0xc000) && (msr <= 0xc0001fff)) { > + msr &= 0x1fff; > + if (type & MSR_TYPE_R && > + !test_bit(msr, msr_bitmap_l1 + 0x400 / f)) > + /* read-high */ > + __clear_bit(msr, msr_bitmap_nested + 0x400 / f); > + > + if (type & MSR_TYPE_W && > + !test_bit(msr, msr_bitmap_l1 + 0xc00 / f)) > + /* write-high */ > + __clear_bit(msr, msr_bitmap_nested + 0xc00 / f); > + > + } > +} > + > static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only) { > if (!longmode_only) > @@ -8350,7 +8402,59 @@ static int > nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu, static > inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, >struct vmcs12 > *vmcs12) { > - return false; > + struct page *page; > + unsigned long *msr_bitmap; > + > + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) > + return false; > + > + page = nested_get_page(vcpu, vmcs12->msr_bitmap); > + if (!page) { > + WARN_ON(1); > + return false; > + } > + msr_bitmap = (unsigned long *)kmap(page); > + if (!msr_bitmap) { > + nested_release_page_clean(page); > + WARN_ON(1); > + return false; > + } > + > + if (nested_cpu_has_virt_x2apic_mode(vmcs12)) { > + /* TPR is allowed */ > + nested_vmx_disable_intercept_for_msr(msr_bitmap, > + vmx_msr_bitmap_nested, > + APIC_BASE_MSR + (APIC_TASKPRI >> > 4), > + MSR_TYPE_R | MSR_TYPE_W); > + } else > + __vmx_enable_intercept_for_msr( > + vmx_msr_bitmap_nested, > + APIC_BASE_MSR + (APIC_TASKPRI >> > 4), > + MSR_TYPE_R | MSR_TYPE_W); > + kunmap(page); > +
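The offsets used in nested_vmx_disable_intercept_for_msr above follow the VMX MSR-bitmap page layout: read-low at 0x000, read-high at 0x400, write-low at 0x800, write-high at 0xc00, one bit per MSR within each 1 KiB quarter. A sketch of the bit-index computation (an illustrative helper, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Bit index within the 4 KiB MSR-bitmap page for a given MSR and access
 * type, or -1 if the MSR is outside the two controllable ranges. */
static long msr_bitmap_bit(uint32_t msr, int is_write)
{
	long base;

	if (msr <= 0x1fff)
		base = is_write ? 0x800 : 0x000;	/* low MSR range  */
	else if (msr >= 0xc0000000u && msr <= 0xc0001fffu)
		base = is_write ? 0xc00 : 0x400;	/* high MSR range */
	else
		return -1;

	return base * 8 + (msr & 0x1fff);	/* byte base -> bit index */
}
```

This is why the patch masks with `msr & 0x1fff` for the high range: only the low 13 bits select the bit within each quarter of the page.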
RE: [PATCH v4 0/6] KVM: nVMX: Enable nested apicv support
Wincy Van wrote on 2015-01-28: > v1 ---> v2: > Use spin lock to ensure vmcs12 is safe when doing nested > posted interrupt delivery. > v2 ---> v3: > 1. Add a new field in nested_vmx to avoid the spin lock in v2. > 2. Drop send eoi to L1 when doing nested interrupt delivery. > 3. Use hardware MSR bitmap to enable nested virtualize x2apic > mode. > v3 ---> v4: > 1. Optimize nested msr bitmap merging. > 2. Allocate nested msr bitmap only when nested == 1. > 3. Inline the nested vmx control checking functions. This version looks good to me. One minor comment: the EXIT_REASON_APIC_WRITE vmexit is introduced by apic register virtualization, not virtual interrupt delivery, so it's better to add it in the 4th patch, not the 5th. (If there are no other comments, I guess Paolo can help do it when applying.) Reviewed-by: Yang Zhang > Wincy Van (6): > KVM: nVMX: Use hardware MSR bitmap > KVM: nVMX: Enable nested virtualize x2apic mode > KVM: nVMX: Make nested control MSRs per-cpu > KVM: nVMX: Enable nested apic register virtualization > KVM: nVMX: Enable nested virtual interrupt delivery > KVM: nVMX: Enable nested posted interrupt processing > arch/x86/kvm/vmx.c | 580 > +++- 1 files changed, > 480 insertions(+), 100 deletions(-) Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: > On Wed, Jan 28, 2015 at 7:25 PM, Zhang, Yang Z > wrote: >> Wincy Van wrote on 2015-01-28: >>> On Wed, Jan 28, 2015 at 4:05 PM, Zhang, Yang Z >>> >>> wrote: >>>>> @@ -8344,7 +8394,68 @@ static int >>>>> nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu, static >>>>> inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, >>>>>struct vmcs12 >>>>> *vmcs12) { - return false; + struct page *page; + >>>>> unsigned long *msr_bitmap; + + if >>>>> (!nested_cpu_has_virt_x2apic_mode(vmcs12)) + return >>>>> false; + + page = nested_get_page(vcpu, vmcs12->msr_bitmap); > + >>>>> if (!page) { + WARN_ON(1); + > return >>>>> false; + } + msr_bitmap = (unsigned long *)kmap(page); + >>>>> if (!msr_bitmap) { + >>>>> nested_release_page_clean(page); + WARN_ON(1); + >>>>> return false; + } + + > memset(vmx_msr_bitmap_nested, >>>>> 0xff, PAGE_SIZE); + + if >>>>> (nested_cpu_has_virt_x2apic_mode(vmcs12)) + /* TPR is >>>>> allowed */ + nested_vmx_disable_intercept_for_msr(msr_bitmap, + >>>>> vmx_msr_bitmap_nested, + APIC_BASE_MSR + (APIC_TASKPRI >>>>>>> 4), + MSR_TYPE_R | MSR_TYPE_W); >>>> >>>> I didn't understand what this function does? Per my understanding, >>>> you only >>> need to set the (vmx_msr_bitmap_nested = vmcs01->msr_bitmap | >>> vmcs12->msr_bitmap) and inject the nested vmexit to L1 if the bit >>> vmcs12->in msr_bitmap is setting. Am I missing some patches? >>> >>> In the beginning, I want to do "vmcs01->msr_bitmap | >>> vmcs12->msr_bitmap", but I remember that there isn't a instruction >>> vmcs12->to >>> do a bit or operation in two pages effectively, so I do the bit or >>> operation in nested_vmx_disable_intercept_for_msr. If the hardware >>> do not support this, I think it is faster if we deal with the bits on >>> demand. >>> nested_vmx_merge_msr_bitmap is used to merge L0's and L1's bitmaps, >>> any features can put their logic here. 
>> You construct the nested_msr_bitmap based on vmcs12->msr_bitmap, what >> happens if vmcs01->msr_bitmap wants to trap this msr? >> > > If L0 wants to intercept a msr, we should set > vmx_msr_bitmap_legacy(_x2apic) and vmx_msr_bitmap_longmode(_x2apic), > and those bitmaps should only be loaded in non-nested entry. > Currently we only clear the corresponding bits if L1 enables > virtualize x2apic mode, all the other bits are all set. On nested > entry, we load nested_msr_bitmap, on nested vmexit, we will restore L0's > bitmap. > In the nested entry, L0 only cares about L2's msr access, not L1's. I > think that in nested entry, nested_msr_bitmap is "vmcs01->msr_bitmap" > as you mentioned above. Mmm... I think a bit set in vmcs01->msr_bitmap means that whenever the VCPU (no matter whether it is in nested guest mode or not) accesses this msr, it should cause a vmexit, not only when L1 does. That's why we need to construct the vmcs02->msr_bitmap based on (vmcs01->msr_bitmap ORed with vmcs12->msr_bitmap). > > Thanks, > Wincy Best regards, Yang
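Yang's point, that the bitmap used while L2 runs must trap an MSR if either L0's (vmcs01) or L1's (vmcs12) bitmap traps it, amounts to a page-wide OR. A minimal sketch, with a 4 KiB bitmap page modeled as a plain array:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_WORDS (4096 / sizeof(unsigned long))

/* A set bit means "intercept": keep the intercept if either L0 or L1
 * wants it, i.e. vmcs02 = vmcs01 | vmcs12, word by word. */
static void merge_msr_bitmaps(const unsigned long *vmcs01,
			      const unsigned long *vmcs12,
			      unsigned long *vmcs02)
{
	for (size_t i = 0; i < PAGE_WORDS; i++)
		vmcs02[i] = vmcs01[i] | vmcs12[i];
}
```

Since there is no single instruction to OR two whole pages, the loop (or an on-demand per-MSR merge, as Wincy's nested_vmx_disable_intercept_for_msr does) is the practical choice; the invariant either way is that no intercept requested by L0 may be lost.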
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-24: > When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance, especially in the apicv case. > > This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. > > Signed-off-by: Wincy Van > --- ...snip... > static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only) { > if (!longmode_only) > @@ -8344,7 +8394,68 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu, static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, > struct vmcs12 *vmcs12) { > - return false; > + struct page *page; > + unsigned long *msr_bitmap; > + > + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) > + return false; > + > + page = nested_get_page(vcpu, vmcs12->msr_bitmap); > + if (!page) { > + WARN_ON(1); > + return false; > + } > + msr_bitmap = (unsigned long *)kmap(page); > + if (!msr_bitmap) { > + nested_release_page_clean(page); > + WARN_ON(1); > + return false; > + } > + > + memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE); > + > + if (nested_cpu_has_virt_x2apic_mode(vmcs12)) > + /* TPR is allowed */ > + nested_vmx_disable_intercept_for_msr(msr_bitmap, > + vmx_msr_bitmap_nested, > + APIC_BASE_MSR + (APIC_TASKPRI >> 4), > + MSR_TYPE_R | MSR_TYPE_W); I didn't understand what this function does? Per my understanding, you only need to set (vmx_msr_bitmap_nested = vmcs01->msr_bitmap | vmcs12->msr_bitmap) and inject the nested vmexit to L1 if the bit in vmcs12->msr_bitmap is set. Am I missing some patches? Best regards, Yang
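For context on the APIC_BASE_MSR + (APIC_TASKPRI >> 4) expression in the quoted hunk: in x2apic mode each 16-byte-aligned APIC register is exposed as one MSR starting at 0x800, so the TPR (xAPIC offset 0x80) becomes MSR 0x808. A small sketch of that mapping (constants per the Intel SDM; the helper name is hypothetical):

```c
/* The x2APIC MSR range starts at 0x800; each 16-byte APIC register
 * maps to one MSR, so the MSR index is 0x800 + (offset >> 4). */
#define APIC_BASE_MSR 0x800
#define APIC_TASKPRI  0x80   /* TPR offset in the xAPIC MMIO page */

static unsigned int x2apic_msr(unsigned int apic_reg_offset)
{
    return APIC_BASE_MSR + (apic_reg_offset >> 4);
}
```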
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: > On Wed, Jan 28, 2015 at 8:33 PM, Zhang, Yang Z wrote: >>>> You are right, but this does not fit all the cases; we should customize the nested_msr_bitmap. e.g. currently L0 wants to intercept some of the x2apic msrs' reads: >>>> if (enable_apicv) { >>>> for (msr = 0x800; msr <= 0x8ff; msr++) >>>> vmx_disable_intercept_msr_read_x2apic(msr); >>>> /* According SDM, in x2apic mode, the whole id reg is used. >>>> * But in KVM, it only use the highest eight bits. Need to >>>> * intercept it */ >>>> vmx_enable_intercept_msr_read_x2apic(0x802); /* TMCCT */ >>>> vmx_enable_intercept_msr_read_x2apic(0x839); /* TPR */ >>>> vmx_disable_intercept_msr_write_x2apic(0x808); /* EOI */ >>>> vmx_disable_intercept_msr_write_x2apic(0x80b); /* SELF-IPI */ >>>> vmx_disable_intercept_msr_write_x2apic(0x83f); >>>> } >>>> But L1 may not want this. So I think we had better deal with the >>> Actually, from L0's point of view, it is totally unaware of L2. The only thing L0 is aware of is that the CPU should follow L0's configuration when the VCPU is running. So if L0 wants to trap a msr, then a read of this msr should cause a vmexit unconditionally, no matter who is running (L1, L2, L3, ...). >>>> msrs separately; there is not a common way that suits all the cases. If other features want to intercept a msr on nested entry, they can put the custom code in nested_vmx_merge_msr_bitmap. >>> Yes, if other features want to do it on 'nested' entry, they can fill in nested_vmx_merge_msr_bitmap. But in the non-nested case it should be our responsibility to handle it correctly; how about adding the following check: >>> if (type & MSR_TYPE_R && !test_bit(msr, vmcs01_msr_bitmap) && >>> !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) >>> __clear_bit(msr, msr_bitmap_nested + 0x000 / f);
>> Anyway, this is not necessary for your current patch. We can consider it later if other features really need it. > Yep, I know what you mean now: for other msrs whose accesses are not forwarded by a mechanism like the virtual-apic page, we should intercept them unconditionally. I think we should ensure the msr can be allowed before calling nested_vmx_disable_intercept_for_msr; if L0 wants to intercept it, just do not call nested_vmx_disable_intercept_for_msr. Yes, this is a solution. But I prefer to handle it in the nested code path, since others may not be familiar with nested and may block it by mistake. > !test_bit(msr, vmcs01_msr_bitmap) will introduce a problem: some of the msrs will be affected by vmcs01_msr_bitmap (TMCCT and TPR, for example). Intercepting reads of these msrs is okay, but it is not efficient. TMCCT is always trapped by most VMMs. I don't think TPR is trapped in KVM. > Thanks, > Wincy Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: > On Wed, Jan 28, 2015 at 7:52 PM, Zhang, Yang Z wrote: >>> If L0 wants to intercept a msr, we should set vmx_msr_bitmap_legacy(_x2apic) and vmx_msr_bitmap_longmode(_x2apic), and those bitmaps should only be loaded on a non-nested entry. Currently we only clear the corresponding bits if L1 enables virtualize x2apic mode; all the other bits are set. On nested entry we load nested_msr_bitmap, and on nested vmexit we restore L0's bitmap. On nested entry, L0 only cares about L2's msr accesses, not L1's. I think that on nested entry, nested_msr_bitmap is the "vmcs01->msr_bitmap" you mentioned above. >> Mmm... I think a bit set in vmcs01->msr_bitmap means that whenever the VCPU (no matter whether it is in nested guest mode or not) accesses this msr, it should cause a vmexit, not only when L1 does. That's why we need to construct vmcs02->msr_bitmap based on (vmcs01->msr_bitmap ORed with vmcs12->msr_bitmap). > You are right, but this does not fit all the cases; we should customize the nested_msr_bitmap. e.g. currently L0 wants to intercept some of the x2apic msrs' reads: > if (enable_apicv) { > for (msr = 0x800; msr <= 0x8ff; msr++) > vmx_disable_intercept_msr_read_x2apic(msr); > /* According SDM, in x2apic mode, the whole id reg is used. > * But in KVM, it only use the highest eight bits. Need to > * intercept it */ > vmx_enable_intercept_msr_read_x2apic(0x802); /* TMCCT */ > vmx_enable_intercept_msr_read_x2apic(0x839); /* TPR */ > vmx_disable_intercept_msr_write_x2apic(0x808); /* EOI */ > vmx_disable_intercept_msr_write_x2apic(0x80b); /* SELF-IPI */ > vmx_disable_intercept_msr_write_x2apic(0x83f); > } > But L1 may not want this. So I think we had better deal with the Actually, from L0's point of view, it is totally unaware of L2. The only thing L0 is aware of is that the CPU should follow L0's configuration when the VCPU is running.
So if L0 wants to trap a msr, then a read of this msr should cause a vmexit unconditionally, no matter who is running (L1, L2, L3, ...). > msrs separately; there is not a common way that suits all the cases. If other features want to intercept a msr on nested entry, they can put the custom code in nested_vmx_merge_msr_bitmap. Yes, if other features want to do it on 'nested' entry, they can fill in nested_vmx_merge_msr_bitmap. But in the non-nested case it should be our responsibility to handle it correctly; how about adding the following check: if (type & MSR_TYPE_R && !test_bit(msr, vmcs01_msr_bitmap) && !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) __clear_bit(msr, msr_bitmap_nested + 0x000 / f); > Thanks, > Wincy Best regards, Yang
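Yang's proposed guard can be sketched as follows. The test_bit/__clear_bit stand-ins are included so the snippet is self-contained, and every other name is hypothetical; the real MSR bitmap also splits read/write and low/high MSR ranges into quadrants of the page (the `+ 0x000 / f` offsets in the quoted check), which is omitted here:

```c
#include <stddef.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Stand-ins for the kernel's test_bit()/__clear_bit(). */
static int test_bit(unsigned int nr, const unsigned long *addr)
{
    return (addr[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1;
}

static void __clear_bit(unsigned int nr, unsigned long *addr)
{
    addr[nr / BITS_PER_LONG] &= ~(1UL << (nr % BITS_PER_LONG));
}

/* Let L2 access `msr` directly only when neither L0 (vmcs01 bitmap)
 * nor L1 (vmcs12 bitmap) asks to intercept it; otherwise the
 * intercept bit in the merged (vmcs02) bitmap stays set. */
static void nested_allow_msr_if_unintercepted(unsigned int msr,
                                              const unsigned long *vmcs01_msr_bitmap,
                                              const unsigned long *msr_bitmap_l1,
                                              unsigned long *msr_bitmap_nested)
{
    if (!test_bit(msr, vmcs01_msr_bitmap) && !test_bit(msr, msr_bitmap_l1))
        __clear_bit(msr, msr_bitmap_nested);
}
```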
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: > On Wed, Jan 28, 2015 at 4:05 PM, Zhang, Yang Z wrote: >> ...snip (quoted nested_vmx_merge_msr_bitmap hunk)... >> I didn't understand what this function does? Per my understanding, you only need to set (vmx_msr_bitmap_nested = vmcs01->msr_bitmap | vmcs12->msr_bitmap) and inject the nested vmexit to L1 if the bit in vmcs12->msr_bitmap is set. Am I missing some patches? > In the beginning, I wanted to do "vmcs01->msr_bitmap | vmcs12->msr_bitmap", but I remember that there isn't an instruction to OR two pages together efficiently, so I do the OR operation in nested_vmx_disable_intercept_for_msr. If the hardware does not support this, I think it is faster if we deal with the bits on demand. nested_vmx_merge_msr_bitmap is used to merge L0's and L1's bitmaps; any feature can put its logic here. You construct the nested_msr_bitmap based on vmcs12->msr_bitmap; what happens if vmcs01->msr_bitmap wants to trap this msr? > If there is a faster way, please teach me how to do it : ) You are right. Interception should be much faster.
> Thanks, > Wincy Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Zhang, Yang Z wrote on 2015-01-28: > Wincy Van wrote on 2015-01-28: >> On Wed, Jan 28, 2015 at 7:52 PM, Zhang, Yang Z wrote: >>>> If L0 wants to intercept a msr, we should set vmx_msr_bitmap_legacy(_x2apic) and vmx_msr_bitmap_longmode(_x2apic), and those bitmaps should only be loaded on a non-nested entry. Currently we only clear the corresponding bits if L1 enables virtualize x2apic mode; all the other bits are set. On nested entry we load nested_msr_bitmap, and on nested vmexit we restore L0's bitmap. On nested entry, L0 only cares about L2's msr accesses, not L1's. I think that on nested entry, nested_msr_bitmap is the "vmcs01->msr_bitmap" you mentioned above. >>> Mmm... I think a bit set in vmcs01->msr_bitmap means that whenever the VCPU (no matter whether it is in nested guest mode or not) accesses this msr, it should cause a vmexit, not only when L1 does. That's why we need to construct vmcs02->msr_bitmap based on (vmcs01->msr_bitmap ORed with vmcs12->msr_bitmap). >> You are right, but this does not fit all the cases; we should customize the nested_msr_bitmap. e.g. currently L0 wants to intercept some of the x2apic msrs' reads: >> if (enable_apicv) { >> for (msr = 0x800; msr <= 0x8ff; msr++) >> vmx_disable_intercept_msr_read_x2apic(msr); >> /* According SDM, in x2apic mode, the whole id reg is used. >> * But in KVM, it only use the highest eight bits. Need to >> * intercept it */ >> vmx_enable_intercept_msr_read_x2apic(0x802); /* TMCCT */ >> vmx_enable_intercept_msr_read_x2apic(0x839); /* TPR */ >> vmx_disable_intercept_msr_write_x2apic(0x808); /* EOI */ >> vmx_disable_intercept_msr_write_x2apic(0x80b); /* SELF-IPI */ >> vmx_disable_intercept_msr_write_x2apic(0x83f); >> } >> But L1 may not want this. So I think we had better deal with the > Actually, from L0's point of view, it is totally unaware of L2.
The only thing L0 is aware of is that the CPU should follow L0's configuration when the VCPU is running. So if L0 wants to trap a msr, then a read of this msr should cause a vmexit unconditionally, no matter who is running (L1, L2, L3, ...). >> msrs separately; there is not a common way that suits all the cases. If other features want to intercept a msr on nested entry, they can put the custom code in nested_vmx_merge_msr_bitmap. > Yes, if other features want to do it on 'nested' entry, they can fill in nested_vmx_merge_msr_bitmap. But in the non-nested case it should be our responsibility to handle it correctly; how about adding the following check: > if (type & MSR_TYPE_R && !test_bit(msr, vmcs01_msr_bitmap) && > !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) > __clear_bit(msr, msr_bitmap_nested + 0x000 / f); Anyway, this is not necessary for your current patch. We can consider it later if other features really need it. >> Thanks, >> Wincy > Best regards, > Yang Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: > On Wed, Jan 28, 2015 at 4:00 PM, Zhang, Yang Z wrote: >>> @@ -5812,13 +5813,18 @@ static __init int hardware_setup(void) >>> (unsigned long *)__get_free_page(GFP_KERNEL); >>> if (!vmx_msr_bitmap_longmode_x2apic) >>> goto out4; >>> + >>> + vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); >>> + if (!vmx_msr_bitmap_nested) >>> + goto out5; >>> + >> Since nested virtualization is off by default, it's better to allocate the page only when nested is true. Maybe adding the following check is better: >> if (nested) { vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); if (!vmx_msr_bitmap_nested) goto out5; } > Agreed. Will do. >> ...snip... >>> + >>> +/* >>> + * Merge L0's and L1's MSR bitmap, return false to indicate that >>> + * we do not use the hardware. >>> + */ >>> +static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, >>> + struct vmcs12 *vmcs12) { >>> + return false; >>> +} >>> + >> The following patches have nothing to do with the MSR control. Why leave the function empty here? > No. In the patch "Enable nested virtualize x2apic mode", we will return false if L1 disabled virt_x2apic_mode, and then the hardware MSR-bitmap control is disabled. This is faster than rebuilding the vmx_msr_bitmap_nested. This function returns false here to indicate that we do not use the hardware. Since it is not only virtualize x2apic mode that uses this, and other features may use it too, I think it isn't suitable to introduce this function in the other patches. Yes, rebuilding is more costly. But your current implementation cannot leverage the APICv feature correctly. I replied in another thread.
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-24: > Currently, if L1 enables MSR_BITMAP, we emulate this feature: all of L2's msr accesses are intercepted by L0. Since many features like virtualize x2apic mode have complicated logic and are difficult for us to emulate, we should use the hardware and merge the bitmaps. > > This patch introduces nested_vmx_merge_msr_bitmap for future use. > > Signed-off-by: Wincy Van > --- > arch/x86/kvm/vmx.c | 71 > +++ > 1 files changed, 60 insertions(+), 11 deletions(-) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index c987374..36d0724 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -798,6 +798,7 @@ static unsigned long *vmx_msr_bitmap_legacy; > static unsigned long *vmx_msr_bitmap_longmode; > static unsigned long *vmx_msr_bitmap_legacy_x2apic; > static unsigned long *vmx_msr_bitmap_longmode_x2apic; > +static unsigned long *vmx_msr_bitmap_nested; > static unsigned long *vmx_vmread_bitmap; > static unsigned long *vmx_vmwrite_bitmap; > > @@ -5812,13 +5813,18 @@ static __init int hardware_setup(void) > (unsigned long *)__get_free_page(GFP_KERNEL); > if (!vmx_msr_bitmap_longmode_x2apic) > goto out4; > + > + vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); > + if (!vmx_msr_bitmap_nested) > + goto out5; > + Since nested virtualization is off by default, it's better to allocate the page only when nested is true. Maybe adding the following check is better: if (nested) { vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); if (!vmx_msr_bitmap_nested) goto out5; } ...snip... > + > +/* > + * Merge L0's and L1's MSR bitmap, return false to indicate that > + * we do not use the hardware. > + */ > +static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, > + struct vmcs12 *vmcs12) { > + return false; > +} > + The following patches have nothing to do with the MSR control. Why leave the function empty here?
Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Zhang, Yang Z wrote on 2015-01-28: > Wincy Van wrote on 2015-01-24: >> When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance, especially in the apicv case. >> >> This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Sorry, replied on the wrong thread. :( >> >> Signed-off-by: Wincy Van >> --- > ...snip (quoted nested_vmx_merge_msr_bitmap hunk)... > I didn't understand what this function does? Per my understanding, you only need to set (vmx_msr_bitmap_nested = vmcs01->msr_bitmap | vmcs12->msr_bitmap) and inject the nested vmexit to L1 if the bit in vmcs12->msr_bitmap is set. Am I missing some patches? > Best regards, > Yang Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-24: Currently, if L1 enables MSR_BITMAP, we will emulate this feature, all of L2's msr access is intercepted by L0. Since many features like virtualize x2apic mode has a complicated logic and it is difficult for us to emulate, we should use hardware and merge the bitmap. This patch introduces nested_vmx_merge_msr_bitmap for future use. Signed-off-by: Wincy Van fanwenyi0...@gmail.com --- arch/x86/kvm/vmx.c | 71 +++ 1 files changed, 60 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index c987374..36d0724 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -798,6 +798,7 @@ static unsigned long *vmx_msr_bitmap_legacy; static unsigned long *vmx_msr_bitmap_longmode; static unsigned long *vmx_msr_bitmap_legacy_x2apic; static unsigned long *vmx_msr_bitmap_longmode_x2apic; +static unsigned long *vmx_msr_bitmap_nested; static unsigned long *vmx_vmread_bitmap; static unsigned long *vmx_vmwrite_bitmap; @@ -5812,13 +5813,18 @@ static __init int hardware_setup(void) (unsigned long *)__get_free_page(GFP_KERNEL); if (!vmx_msr_bitmap_longmode_x2apic) goto out4; + + vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); + if (!vmx_msr_bitmap_nested) + goto out5; + Since the nested virtualization is off by default. It's better to allocate the page only when nested is true. Maybe adding the following check is better: if (nested) { vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); if (!vmx_msr_bitmap_nested) goto out5; } ...snip... + +/* + * Merge L0's and L1's MSR bitmap, return false to indicate that + * we do not use the hardware. + */ +static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, + struct vmcs12 *vmcs12) { + return false; +} + The following patches have nothing to do with the MSR control. Why leave the function empty here? Best regards, Yang N�r��yb�X��ǧv�^�){.n�+{zX����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf��^jǫy�m��@A�a��� 0��h���i
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Zhang, Yang Z wrote on 2015-01-28: Wincy Van wrote on 2015-01-24: When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance, especially in apicv case. This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Sorry, replied on the wrong thread.:( Signed-off-by: Wincy Van fanwenyi0...@gmail.com --- ...snip... static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only) { if (!longmode_only) @@ -8344,7 +8394,68 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu, static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { - return false; + struct page *page; + unsigned long *msr_bitmap; + + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) + return false; + + page = nested_get_page(vcpu, vmcs12-msr_bitmap); + if (!page) { + WARN_ON(1); + return false; + } + msr_bitmap = (unsigned long *)kmap(page); + if (!msr_bitmap) { + nested_release_page_clean(page); + WARN_ON(1); + return false; + } + + memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE); + + if (nested_cpu_has_virt_x2apic_mode(vmcs12)) + /* TPR is allowed */ + nested_vmx_disable_intercept_for_msr(msr_bitmap, + vmx_msr_bitmap_nested, + APIC_BASE_MSR + (APIC_TASKPRI 4), + MSR_TYPE_R | MSR_TYPE_W); I didn't understand what this function does? Per my understanding, you only need to set the (vmx_msr_bitmap_nested = vmcs01-msr_bitmap | vmcs12-msr_bitmap) and inject the nested vmexit to L1 if the bit in vmcs12-msr_bitmap is setting. Am I missing some patches? Best regards, Yang Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 4:00 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: @@ -5812,13 +5813,18 @@ static __init int hardware_setup(void) (unsigned long *)__get_free_page(GFP_KERNEL); if (!vmx_msr_bitmap_longmode_x2apic) goto out4; + + vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); + if (!vmx_msr_bitmap_nested) + goto out5; + Since the nested virtualization is off by default. It's better to allocate the page only when nested is true. Maybe adding the following check is better: if (nested) { vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); if (!vmx_msr_bitmap_nested) goto out5; } Agreed. Will do. ...snip... + +/* + * Merge L0's and L1's MSR bitmap, return false to indicate that + * we do not use the hardware. + */ +static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, + struct vmcs12 *vmcs12) { + return false; +} + The following patches have nothing to do with the MSR control. Why leave the function empty here? No. In patch Enable nested virtualize x2apic mode, we will return false if L1 disabled virt_x2apic_mode, then the hardware MSR-bitmap control is disabled. This is faster than rebuilding the vmx_msr_bitmap_nested. This function returns false here to indicate that we do not use the hardware. Since It is not only virtualize x2apic mode using this, other features may use this either. I think it isn't suitable to introduce this function in other patches. Yes, rebuilding is more costly. But your current implementation cannot leverage the APICv feature correctly. I replied in another thread. Best regards, Yang Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Zhang, Yang Z wrote on 2015-01-28: Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 7:52 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: If L0 wants to intercept a msr, we should set vmx_msr_bitmap_legacy(_x2apic) and vmx_msr_bitmap_longmode(_x2apic), and that bitmaps should only be loaded in non-nested entry. Currently we only clear the corresponding bits if L1 enables virtualize x2apic mode, all the other bits are all set. On nested entry, we load nested_msr_bitmap, on nested vmexit, we will restore L0's bitmap. In the nested entry, L0 only care about L2's msr access, not L1's. I think that in nested entry, nested_msr_bitmap is vmcs01-msr_bitmap as your mentioned above. Mmm... I think if the bit is setting in vmcs01-msr_bitmap means whenever the VCPU(no matter in nested guest mode or not) accesses this msr should cause vmexit, not only L1. That's why need to construct the vmcs02-msr_bitmap based on (vmcs01-msr_bitmap ORed vmcs12-msr_bitmap). You are right, but this is not fit for all the cases, we should custom the nested_msr_bitmap. e.g. Currently L0 wants to intercept some of the x2apic msrs' reading: if (enable_apicv) { for (msr = 0x800; msr = 0x8ff; msr++) vmx_disable_intercept_msr_read_x2apic(msr); /* According SDM, in x2apic mode, the whole id reg is used. * But in KVM, it only use the highest eight bits. Need to * intercept it */ vmx_enable_intercept_msr_read_x2apic(0x802); /* TMCCT */ vmx_enable_intercept_msr_read_x2apic(0x839); /* TPR */ vmx_disable_intercept_msr_write_x2apic(0x808); /* EOI */ vmx_disable_intercept_msr_write_x2apic(0x80b); /* SELF-IPI */ vmx_disable_intercept_msr_write_x2apic(0x83f); } But L1 may not want this. So I think we are better to deal with the Actually, from L0's point, it is totally unaware of the L2. The only thing L0 aware is that the CPU should follow L0's configuration when VCPU is running. 
So if L0 wants to trap a msr, then the read operation to this msr should cause vmexit unconditionally no matter who is running(who means L1, L2, L3.). msrs seperately, there is not a common way suit for all the cases. If other features want to intercept a msr in nested entry, they can put the custom code in nested_vmx_merge_msr_bitmap. Yes, if other features want to do it in 'nested' entry, they can fill nested_vmx_merge_msr_bitmap. But if in non-nested case, it should be our responsibly to handle it correctly, how about add following check: if (type MSR_TYPE_R !test_bit(msr, vmcs01_msr_bitmap) !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) __clear_bit(msr, msr_bitmap_nested + 0x000 / f); Anyway, this is not necessary for your current patch. We can consider it later if there really have other features will use it. Thanks, Wincy Best regards, Yang Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 4:05 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: @@ -8344,7 +8394,68 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu, static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { - return false; + struct page *page; + unsigned long *msr_bitmap; + + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) + return false; + + page = nested_get_page(vcpu, vmcs12-msr_bitmap); + if (!page) { + WARN_ON(1); + return false; + } + msr_bitmap = (unsigned long *)kmap(page); + if (!msr_bitmap) { + nested_release_page_clean(page); + WARN_ON(1); + return false; + } + + memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE); + + if (nested_cpu_has_virt_x2apic_mode(vmcs12)) + /* TPR is allowed */ + nested_vmx_disable_intercept_for_msr(msr_bitmap, + vmx_msr_bitmap_nested, + APIC_BASE_MSR + (APIC_TASKPRI 4), + MSR_TYPE_R | MSR_TYPE_W); I didn't understand what this function does? Per my understanding, you only need to set the (vmx_msr_bitmap_nested = vmcs01-msr_bitmap | vmcs12-msr_bitmap) and inject the nested vmexit to L1 if the bit in vmcs12-msr_bitmap is setting. Am I missing some patches? In the beginning, I want to do vmcs01-msr_bitmap | vmcs12-msr_bitmap, but I remember that there isn't a instruction to do a bit or operation in two pages effectively, so I do the bit or operation in nested_vmx_disable_intercept_for_msr. If the hardware do not support this, I think it is faster if we deal with the bits on demand. nested_vmx_merge_msr_bitmap is used to merge L0's and L1's bitmaps, any features can put their logic here. You construct the nested_msr_bitmap based on vmcs12-msr_bitmap, what happens if vmcs01-msr_bitmap want to trap this msr? If there is a faster way, please teach me how to do it : ) You are right. Interception should be much faster. Thanks, Wincy Best regards, Yang Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 7:52 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: If L0 wants to intercept a msr, we should set vmx_msr_bitmap_legacy(_x2apic) and vmx_msr_bitmap_longmode(_x2apic), and that bitmaps should only be loaded in non-nested entry. Currently we only clear the corresponding bits if L1 enables virtualize x2apic mode, all the other bits are all set. On nested entry, we load nested_msr_bitmap, on nested vmexit, we will restore L0's bitmap. In the nested entry, L0 only care about L2's msr access, not L1's. I think that in nested entry, nested_msr_bitmap is vmcs01-msr_bitmap as your mentioned above. Mmm... I think if the bit is setting in vmcs01-msr_bitmap means whenever the VCPU(no matter in nested guest mode or not) accesses this msr should cause vmexit, not only L1. That's why need to construct the vmcs02-msr_bitmap based on (vmcs01-msr_bitmap ORed vmcs12-msr_bitmap). You are right, but this is not fit for all the cases, we should custom the nested_msr_bitmap. e.g. Currently L0 wants to intercept some of the x2apic msrs' reading: if (enable_apicv) { for (msr = 0x800; msr = 0x8ff; msr++) vmx_disable_intercept_msr_read_x2apic(msr); /* According SDM, in x2apic mode, the whole id reg is used. * But in KVM, it only use the highest eight bits. Need to * intercept it */ vmx_enable_intercept_msr_read_x2apic(0x802); /* TMCCT */ vmx_enable_intercept_msr_read_x2apic(0x839); /* TPR */ vmx_disable_intercept_msr_write_x2apic(0x808); /* EOI */ vmx_disable_intercept_msr_write_x2apic(0x80b); /* SELF-IPI */ vmx_disable_intercept_msr_write_x2apic(0x83f); } But L1 may not want this. So I think we are better to deal with the Actually, from L0's point, it is totally unaware of the L2. The only thing L0 aware is that the CPU should follow L0's configuration when VCPU is running. So if L0 wants to trap a msr, then the read operation to this msr should cause vmexit unconditionally no matter who is running(who means L1, L2, L3.). 
msrs seperately, there is not a common way suit for all the cases. If other features want to intercept a msr in nested entry, they can put the custom code in nested_vmx_merge_msr_bitmap. Yes, if other features want to do it in 'nested' entry, they can fill nested_vmx_merge_msr_bitmap. But if in non-nested case, it should be our responsibly to handle it correctly, how about add following check: if (type MSR_TYPE_R !test_bit(msr, vmcs01_msr_bitmap) !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) __clear_bit(msr, msr_bitmap_nested + 0x000 / f); Thanks, Wincy Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 8:33 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: You are right, but this is not fit for all the cases, we should custom the nested_msr_bitmap. e.g. Currently L0 wants to intercept some of the x2apic msrs' reading: if (enable_apicv) { for (msr = 0x800; msr = 0x8ff; msr++) vmx_disable_intercept_msr_read_x2apic(msr); /* According SDM, in x2apic mode, the whole id reg is used. * But in KVM, it only use the highest eight bits. Need to * intercept it */ vmx_enable_intercept_msr_read_x2apic(0x802); /* TMCCT */ vmx_enable_intercept_msr_read_x2apic(0x839); /* TPR */ vmx_disable_intercept_msr_write_x2apic(0x808); /* EOI */ vmx_disable_intercept_msr_write_x2apic(0x80b); /* SELF-IPI */ vmx_disable_intercept_msr_write_x2apic(0x83f); } But L1 may not want this. So I think we are better to deal with the Actually, from L0's point, it is totally unaware of the L2. The only thing L0 aware is that the CPU should follow L0's configuration when VCPU is running. So if L0 wants to trap a msr, then the read operation to this msr should cause vmexit unconditionally no matter who is running(who means L1, L2, L3.). msrs seperately, there is not a common way suit for all the cases. If other features want to intercept a msr in nested entry, they can put the custom code in nested_vmx_merge_msr_bitmap. Yes, if other features want to do it in 'nested' entry, they can fill nested_vmx_merge_msr_bitmap. But if in non-nested case, it should be our responsibly to handle it correctly, how about add following check: if (type MSR_TYPE_R !test_bit(msr, vmcs01_msr_bitmap) !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) __clear_bit(msr, msr_bitmap_nested + 0x000 / f); Anyway, this is not necessary for your current patch. We can consider it later if there really have other features will use it. Yep, I know what you mean now, for other msrs which are not forwarded access by a mechanism like virtual-apic page, we should intercept it unconditionally. 
I think we should ensure the msr is allowed before calling nested_vmx_disable_intercept_for_msr; if L0 wants to intercept it, just do not call nested_vmx_disable_intercept_for_msr. Yes, this is a solution. But I prefer to handle it in the nested code path, since others may not be familiar with nested and may block it by mistake. !test_bit(msr, vmcs01_msr_bitmap) will introduce a problem: some of the msrs will be affected by vmcs01_msr_bitmap (TMCCT and TPR, for example). Intercepting reads of these msrs is okay, but it is not efficient. TMCCT is always trapped by most VMMs. I don't think TPR is trapped in KVM. Thanks, Wincy Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-24: When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance, especially in apicv case. This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Signed-off-by: Wincy Van fanwenyi0...@gmail.com
---
...snip...

 static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
 {
 	if (!longmode_only)
@@ -8344,7 +8394,68 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu,
 static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
 					       struct vmcs12 *vmcs12)
 {
-	return false;
+	struct page *page;
+	unsigned long *msr_bitmap;
+
+	if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
+		return false;
+
+	page = nested_get_page(vcpu, vmcs12->msr_bitmap);
+	if (!page) {
+		WARN_ON(1);
+		return false;
+	}
+	msr_bitmap = (unsigned long *)kmap(page);
+	if (!msr_bitmap) {
+		nested_release_page_clean(page);
+		WARN_ON(1);
+		return false;
+	}
+
+	memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE);
+
+	if (nested_cpu_has_virt_x2apic_mode(vmcs12))
+		/* TPR is allowed */
+		nested_vmx_disable_intercept_for_msr(msr_bitmap,
+				vmx_msr_bitmap_nested,
+				APIC_BASE_MSR + (APIC_TASKPRI >> 4),
+				MSR_TYPE_R | MSR_TYPE_W);

I didn't understand what this function does. Per my understanding, you only need to set (vmx_msr_bitmap_nested = vmcs01->msr_bitmap | vmcs12->msr_bitmap) and inject the nested vmexit to L1 if the bit in vmcs12->msr_bitmap is set. Am I missing some patches? Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 7:25 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 4:05 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote:

@@ -8344,7 +8394,68 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu,
 static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
 					       struct vmcs12 *vmcs12)
 {
-	return false;
+	struct page *page;
+	unsigned long *msr_bitmap;
+
+	if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
+		return false;
+
+	page = nested_get_page(vcpu, vmcs12->msr_bitmap);
+	if (!page) {
+		WARN_ON(1);
+		return false;
+	}
+	msr_bitmap = (unsigned long *)kmap(page);
+	if (!msr_bitmap) {
+		nested_release_page_clean(page);
+		WARN_ON(1);
+		return false;
+	}
+
+	memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE);
+
+	if (nested_cpu_has_virt_x2apic_mode(vmcs12))
+		/* TPR is allowed */
+		nested_vmx_disable_intercept_for_msr(msr_bitmap,
+				vmx_msr_bitmap_nested,
+				APIC_BASE_MSR + (APIC_TASKPRI >> 4),
+				MSR_TYPE_R | MSR_TYPE_W);

I didn't understand what this function does. Per my understanding, you only need to set (vmx_msr_bitmap_nested = vmcs01->msr_bitmap | vmcs12->msr_bitmap) and inject the nested vmexit to L1 if the bit in vmcs12->msr_bitmap is set. Am I missing some patches?

In the beginning, I wanted to do vmcs01->msr_bitmap | vmcs12->msr_bitmap, but I remember that there isn't an instruction to do a bit-or operation on two pages effectively, so I do the bit-or operation in nested_vmx_disable_intercept_for_msr. If the hardware does not support this, I think it is faster if we deal with the bits on demand. nested_vmx_merge_msr_bitmap is used to merge L0's and L1's bitmaps; any feature can put its logic there.

You construct the nested_msr_bitmap based on vmcs12->msr_bitmap; what happens if vmcs01->msr_bitmap wants to trap this msr?
If L0 wants to intercept a msr, we should set vmx_msr_bitmap_legacy(_x2apic) and vmx_msr_bitmap_longmode(_x2apic), and those bitmaps should only be loaded on non-nested entry. Currently we only clear the corresponding bits if L1 enables virtualize x2apic mode; all the other bits are set. On nested entry, we load nested_msr_bitmap; on nested vmexit, we restore L0's bitmap. On nested entry, L0 only cares about L2's msr accesses, not L1's. I think that on nested entry, nested_msr_bitmap is vmcs01->msr_bitmap, as you mentioned above. Mmm... I think if the bit is set in vmcs01->msr_bitmap, it means that whenever the VCPU (no matter whether in nested guest mode or not) accesses this msr, it should cause a vmexit, not only in L1. That's why we need to construct vmcs02->msr_bitmap based on (vmcs01->msr_bitmap ORed with vmcs12->msr_bitmap). Thanks, Wincy Best regards, Yang
RE: [PATCH v4 0/6] KVM: nVMX: Enable nested apicv support
Wincy Van wrote on 2015-01-28:

v1 -> v2: Use spin lock to ensure vmcs12 is safe when doing nested posted interrupt delivery.

v2 -> v3:
1. Add a new field in nested_vmx to avoid the spin lock in v2.
2. Drop send eoi to L1 when doing nested interrupt delivery.
3. Use hardware MSR bitmap to enable nested virtualize x2apic mode.

v3 -> v4:
1. Optimize nested msr bitmap merging.
2. Allocate nested msr bitmap only when nested == 1.
3. Inline the nested vmx control checking functions.

This version looks good to me. Only a minor comment: the EXIT_REASON_APIC_WRITE vmexit is introduced by apic register virtualization, not virtual interrupt delivery, so it's better to add it in the 4th patch, not the 5th. (If there are no other comments, I guess Paolo can help do it when applying.)

Reviewed-by: Yang Zhang yang.z.zh...@intel.com

Wincy Van (6):
  KVM: nVMX: Use hardware MSR bitmap
  KVM: nVMX: Enable nested virtualize x2apic mode
  KVM: nVMX: Make nested control MSRs per-cpu
  KVM: nVMX: Enable nested apic register virtualization
  KVM: nVMX: Enable nested virtual interrupt delivery
  KVM: nVMX: Enable nested posted interrupt processing

 arch/x86/kvm/vmx.c | 580 +++-
 1 files changed, 480 insertions(+), 100 deletions(-)

Best regards, Yang
RE: [PATCH v4 2/6] KVM: nVMX: Enable nested virtualize x2apic mode
Wincy Van wrote on 2015-01-28: When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance, especially in apicv case. This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Signed-off-by: Wincy Van fanwenyi0...@gmail.com
---
 arch/x86/kvm/vmx.c | 114 +++-
 1 files changed, 112 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 787f886..9d11a93 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1108,6 +1108,11 @@ static inline bool nested_cpu_has_xsaves(struct vmcs12 *vmcs12)
 		vmx_xsaves_supported();
 }
 
+static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12)
+{
+	return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
+}
+
 static inline bool is_exception(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2395,6 +2400,7 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 	nested_vmx_secondary_ctls_low = 0;
 	nested_vmx_secondary_ctls_high &=
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
+		SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
 		SECONDARY_EXEC_WBINVD_EXITING |
 		SECONDARY_EXEC_XSAVES;
 
@@ -4155,6 +4161,52 @@ static void __vmx_enable_intercept_for_msr(unsigned long *msr_bitmap,
 	}
 }
 
+/*
+ * If a msr is allowed by L0, we should check whether it is allowed by L1.
+ * The corresponding bit will be cleared unless both of L0 and L1 allow it.
+ */
+static void nested_vmx_disable_intercept_for_msr(unsigned long *msr_bitmap_l1,
+					       unsigned long *msr_bitmap_nested,
+					       u32 msr, int type)
+{
+	int f = sizeof(unsigned long);
+
+	if (!cpu_has_vmx_msr_bitmap()) {
+		WARN_ON(1);
+		return;
+	}
+
+	/*
+	 * See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals
+	 * have the write-low and read-high bitmap offsets the wrong way round.
+	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
+	 */
+	if (msr <= 0x1fff) {
+		if (type & MSR_TYPE_R &&
+		   !test_bit(msr, msr_bitmap_l1 + 0x000 / f))
+			/* read-low */
+			__clear_bit(msr, msr_bitmap_nested + 0x000 / f);
+
+		if (type & MSR_TYPE_W &&
+		   !test_bit(msr, msr_bitmap_l1 + 0x800 / f))
+			/* write-low */
+			__clear_bit(msr, msr_bitmap_nested + 0x800 / f);
+
+	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
+		msr &= 0x1fff;
+		if (type & MSR_TYPE_R &&
+		   !test_bit(msr, msr_bitmap_l1 + 0x400 / f))
+			/* read-high */
+			__clear_bit(msr, msr_bitmap_nested + 0x400 / f);
+
+		if (type & MSR_TYPE_W &&
+		   !test_bit(msr, msr_bitmap_l1 + 0xc00 / f))
+			/* write-high */
+			__clear_bit(msr, msr_bitmap_nested + 0xc00 / f);
+
+	}
+}
+
 static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
 {
 	if (!longmode_only)
@@ -8350,7 +8402,59 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu,
 static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
 					       struct vmcs12 *vmcs12)
 {
-	return false;
+	struct page *page;
+	unsigned long *msr_bitmap;
+
+	if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
+		return false;
+
+	page = nested_get_page(vcpu, vmcs12->msr_bitmap);
+	if (!page) {
+		WARN_ON(1);
+		return false;
+	}
+	msr_bitmap = (unsigned long *)kmap(page);
+	if (!msr_bitmap) {
+		nested_release_page_clean(page);
+		WARN_ON(1);
+		return false;
+	}
+
+	if (nested_cpu_has_virt_x2apic_mode(vmcs12)) {
+		/* TPR is allowed */
+		nested_vmx_disable_intercept_for_msr(msr_bitmap,
+				vmx_msr_bitmap_nested,
+				APIC_BASE_MSR + (APIC_TASKPRI >> 4),
+				MSR_TYPE_R | MSR_TYPE_W);
+	} else
+		__vmx_enable_intercept_for_msr(
+				vmx_msr_bitmap_nested,
+				APIC_BASE_MSR + (APIC_TASKPRI >> 4),
+				MSR_TYPE_R | MSR_TYPE_W);
+	kunmap(page);
+	nested_release_page_clean(page);
+
+	return true;
+}
+
+static int nested_vmx_check_apicv_controls(struct kvm_vcpu *vcpu,
+
RE: [PATCH v4 2/6] KVM: nVMX: Enable nested virtualize x2apic mode
Wincy Van wrote on 2015-01-29: On Thu, Jan 29, 2015 at 10:54 AM, Zhang, Yang Z yang.z.zh...@intel.com wrote:

@@ -8646,7 +8750,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 		else
 			vmcs_write64(APIC_ACCESS_ADDR,
 				page_to_phys(vmx->nested.apic_access_page));
-	} else if (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) {
+	} else if (!(nested_cpu_has_virt_x2apic_mode(vmcs12)) &&
+		   (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))) {
 		exec_control |=

You don't load L2's apic_page correctly in your patch when x2apic mode is used. Here is the right change for prepare_vmcs02() (maybe other places also need to change):

@@ -8585,7 +8585,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 			CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
 			exec_control |= vmcs12->secondary_vm_exec_control;
 
-		if (exec_control & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) {
+		if (exec_control & (SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
+				    SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) {
 			/*
 			 * If translation failed, no matter: This feature asks
 			 * to exit when accessing the given address, and if it
@@ -8594,7 +8595,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 			 */
 			if (!vmx->nested.apic_access_page)
 				exec_control &=
-					~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+					~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
+					  SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
 			else
 				vmcs_write64(APIC_ACCESS_ADDR,
 					page_to_phys(vmx->nested.apic_access_page));

I think we don't need to do this: if L1 enables x2apic mode, we have already checked that vmcs12's SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES is 0. exec_control |= vmcs12->secondary_vm_exec_control; merged L1's settings, including x2apic mode. The special case is vm_need_virtualize_apic_accesses: if vm_need_virtualize_apic_accesses returns true, nested_cpu_has_virt_x2apic_mode will prevent us from setting the apic access bit. I just realized that we are talking about different things.
I am thinking about VIRTUAL_APIC_PAGE_ADDR(my mistake :)). And it has been handled correctly already. For APIC_ACCESS_ADDR, you are right. The current implementation can handle it well. Thanks, Wincy Best regards, Yang
RE: [PATCH v2 5/5] KVM: nVMX: Enable nested posted interrupt processing.
Wincy Van wrote on 2015-01-21:
> On Wed, Jan 21, 2015 at 4:07 PM, Zhang, Yang Z
> wrote:
>>> +	if (vector == vmcs12->posted_intr_nv &&
>>> +	    nested_cpu_has_posted_intr(vmcs12)) {
>>> +		if (vcpu->mode == IN_GUEST_MODE)
>>> +			apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
>>> +				POSTED_INTR_VECTOR);
>>> +		else {
>>> +			r = -1;
>>> +			goto out;
>>> +		}
>>> +
>>> +		/*
>>> +		 * if posted intr is done by hardware, the
>>> +		 * corresponding eoi was sent to L0. Thus
>>> +		 * we should send eoi to L1 manually.
>>> +		 */
>>> +		kvm_apic_set_eoi_accelerated(vcpu,
>>> +			vmcs12->posted_intr_nv);
>>
>> Why is this necessary? As your comments mentioned, it is done by
>> hardware, not L1, so why should L1 be aware of it?
>>
>
> According to SDM 29.6, if the processor recognizes a posted interrupt,
> it will send an EOI to the LAPIC.
> If the posted intr is done by hardware, the processor will send the eoi to
> the hardware LAPIC, not L1's, just like the non-nested case (the physical
> interrupt is dismissed). So we should take care of L1's LAPIC and send an
> eoi to it.

No. You are not emulating the PI feature. You just reuse the hardware's capability. So you don't need to let L1 know about it.

> Thanks,
> Wincy

Best regards, Yang
RE: [PATCH 2/5] KVM: nVMX: Enable nested virtualize x2apic mode.
Wincy Van wrote on 2015-01-16: > When L2 is using x2apic, we can use virtualize x2apic mode to gain higher > performance. > > This patch also introduces nested_vmx_check_apicv_controls for the nested > apicv patches. > > Signed-off-by: Wincy Van To enable x2apic, should you to consider the behavior changes to rdmsr and wrmsr. I didn't see your patch do it. Is it correct? BTW, this patch has nothing to do with APICv, it's better to not use x2apic here and change to apicv in following patch. > --- > arch/x86/kvm/vmx.c | 49 > - > 1 files changed, 48 insertions(+), 1 deletions(-) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 954dd54..10183ee > 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -1134,6 +1134,11 @@ static inline bool nested_cpu_has_xsaves(struct > vmcs12 *vmcs12) > vmx_xsaves_supported(); > } > > +static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 > +*vmcs12) { > + return nested_cpu_has2(vmcs12, > +SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); > +} > + > static inline bool is_exception(u32 intr_info) { > return (intr_info & (INTR_INFO_INTR_TYPE_MASK | > INTR_INFO_VALID_MASK)) @@ -2426,6 +2431,7 @@ static void > nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx) > vmx->nested.nested_vmx_secondary_ctls_low = 0; > vmx->nested.nested_vmx_secondary_ctls_high &= > SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | > + SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | > SECONDARY_EXEC_WBINVD_EXITING | > SECONDARY_EXEC_XSAVES; > > @@ -7333,6 +7339,9 @@ static bool nested_vmx_exit_handled(struct > kvm_vcpu *vcpu) > case EXIT_REASON_APIC_ACCESS: > return nested_cpu_has2(vmcs12, > > SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES); > + case EXIT_REASON_APIC_WRITE: > + /* apic_write should exit unconditionally. */ > + return 1; APIC_WRITE vmexit is introduced by APIC register virtualization not virtualize x2apic. Move it to next patch. > case EXIT_REASON_EPT_VIOLATION: > /* > * L0 always deals with the EPT violation. 
If nested EPT is > @@ -8356,6 +8365,38 @@ static void vmx_start_preemption_timer(struct > kvm_vcpu *vcpu) > ns_to_ktime(preemption_timeout), > HRTIMER_MODE_REL); } > > +static inline int nested_vmx_check_virt_x2apic(struct kvm_vcpu *vcpu, > + struct vmcs12 > *vmcs12) { > + if (nested_cpu_has2(vmcs12, > SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) > + return -EINVAL; > + return 0; > +} > + > +static int nested_vmx_check_apicv_controls(struct kvm_vcpu *vcpu, > + struct vmcs12 *vmcs12) { > + int r; > + > + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) > + return 0; > + > + r = nested_vmx_check_virt_x2apic(vcpu, vmcs12); > + if (r) > + goto fail; > + > + /* tpr shadow is needed by all apicv features. */ > + if (!nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) { > + r = -EINVAL; > + goto fail; > + } > + > + return 0; > + > +fail: > + return r; > +} > + > static int nested_vmx_check_msr_switch(struct kvm_vcpu *vcpu, >unsigned long count_field, >unsigned long addr_field, @@ > -8649,7 +8690,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, > struct vmcs12 *vmcs12) > else > vmcs_write64(APIC_ACCESS_ADDR, > > page_to_phys(vmx->nested.apic_access_page)); > - } else if > (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) { > + } else if (!(nested_cpu_has_virt_x2apic_mode(vmcs12)) && > + > + (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))) { > exec_control |= > > SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; > kvm_vcpu_reload_apic_access_page(vcpu); > @@ -8856,6 +8898,11 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, > bool launch) > return 1; > } > > + if (nested_vmx_check_apicv_controls(vcpu, vmcs12)) { > + nested_vmx_failValid(vcpu, > VMXERR_ENTRY_INVALID_CONTROL_FIELD); > + return 1; > + } > + > if (nested_vmx_check_msr_switch_controls(vcpu, vmcs12)) { > nested_vmx_failValid(vcpu, > VMXERR_ENTRY_INVALID_CONTROL_FIELD); > return 1; > -- > 1.7.1 Best regards, Yang
RE: [PATCH 1/5] KVM: nVMX: Make nested control MSRs per-cpu.
Wincy Van wrote on 2015-01-16: > To enable nested apicv support, we need per-cpu vmx control MSRs: > 1. If in-kernel irqchip is enabled, we can enable nested > posted interrupt, we should set posted intr bit in the > nested_vmx_pinbased_ctls_high. 2. If in-kernel irqchip is disabled, > we can not enable nested posted interrupt, the posted intr bit in > the nested_vmx_pinbased_ctls_high will be cleared. > Since there would be different settings about in-kernel irqchip > between VMs, different nested control MSRs are needed. I'd suggest you check irqchip_in_kernel() instead of moving the whole ctrl msr to per-vcpu. Best regards, Yang
RE: [PATCH v2 5/5] KVM: nVMX: Enable nested posted interrupt processing.
Wincy Van wrote on 2015-01-20: > If vcpu has a interrupt in vmx non-root mode, we will kick that vcpu > to inject interrupt timely. With posted interrupt processing, the kick > intr is not needed, and interrupts are fully taken care of by hardware. > > In nested vmx, this feature avoids much more vmexits than non-nested vmx. > > This patch use L0's POSTED_INTR_NV to avoid unexpected interrupt if > L1's vector is different with L0's. If vcpu is in hardware's non-root > mode, we use a physical ipi to deliver posted interrupts, otherwise we > will deliver that interrupt to L1 and kick that vcpu out of nested non-root > mode. > > Signed-off-by: Wincy Van > --- > arch/x86/kvm/vmx.c | 136 > ++-- 1 files changed, > 132 insertions(+), 4 deletions(-) > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index > ea56e9f..cda9133 100644 --- a/arch/x86/kvm/vmx.c +++ > b/arch/x86/kvm/vmx.c @@ -215,6 +215,7 @@ struct __packed vmcs12 { > u64 tsc_offset; u64 virtual_apic_page_addr; u64 > apic_access_addr; + u64 posted_intr_desc_addr; u64 > ept_pointer; u64 eoi_exit_bitmap0; u64 eoi_exit_bitmap1; @@ > -334,6 +335,7 @@ struct __packed vmcs12 { u32 > vmx_preemption_timer_value; u32 padding32[7]; /* room for future > expansion */ u16 virtual_processor_id; + u16 > posted_intr_nv; u16 guest_es_selector; u16 guest_cs_selector; > u16 guest_ss_selector; @@ -387,6 +389,7 @@ struct nested_vmx { > /* The host-usable pointer to the above */ struct page > *current_vmcs12_page; struct vmcs12 *current_vmcs12; + > spinlock_t vmcs12_lock; struct vmcs *current_shadow_vmcs; /* > * Indicates if the shadow vmcs must be updated with the @@ > -406,6 +409,8 @@ struct nested_vmx { */ > struct page *apic_access_page; > struct page *virtual_apic_page; > + struct page *pi_desc_page; > + struct pi_desc *pi_desc; > u64 msr_ia32_feature_control; > > struct hrtimer preemption_timer; @@ -621,6 +626,7 @@ static > int max_shadow_read_write_fields = > > static const unsigned short vmcs_field_to_offset_table[] = { > 
FIELD(VIRTUAL_PROCESSOR_ID, virtual_processor_id), + > FIELD(POSTED_INTR_NV, posted_intr_nv), FIELD(GUEST_ES_SELECTOR, > guest_es_selector), FIELD(GUEST_CS_SELECTOR, guest_cs_selector), > FIELD(GUEST_SS_SELECTOR, guest_ss_selector), @@ -646,6 +652,7 @@ > static const unsigned short vmcs_field_to_offset_table[] = { > FIELD64(TSC_OFFSET, tsc_offset), FIELD64(VIRTUAL_APIC_PAGE_ADDR, > virtual_apic_page_addr), FIELD64(APIC_ACCESS_ADDR, > apic_access_addr), + FIELD64(POSTED_INTR_DESC_ADDR, > posted_intr_desc_addr), FIELD64(EPT_POINTER, ept_pointer), > FIELD64(EOI_EXIT_BITMAP0, eoi_exit_bitmap0), > FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1), @@ -798,6 +805,7 > @@ static void kvm_cpu_vmxon(u64 addr); static void > kvm_cpu_vmxoff(void); static bool vmx_mpx_supported(void); static > bool vmx_xsaves_supported(void); > +static int vmx_vm_has_apicv(struct kvm *kvm); > static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr); > static void vmx_set_segment(struct kvm_vcpu *vcpu, > struct kvm_segment *var, int seg); @@ > -1159,6 +1167,11 @@ static inline bool nested_cpu_has_vid(struct > vmcs12 *vmcs12) > return nested_cpu_has2(vmcs12, > SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY); > } > +static inline bool nested_cpu_has_posted_intr(struct vmcs12 *vmcs12) { > + return vmcs12->pin_based_vm_exec_control & > +PIN_BASED_POSTED_INTR; } > + > static inline bool is_exception(u32 intr_info) { > return (intr_info & (INTR_INFO_INTR_TYPE_MASK | > INTR_INFO_VALID_MASK)) @@ -2362,6 +2375,9 @@ static void > nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx) > vmx->nested.nested_vmx_pinbased_ctls_high |= > PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR | > PIN_BASED_VMX_PREEMPTION_TIMER; > + if (vmx_vm_has_apicv(vmx->vcpu.kvm)) > + vmx->nested.nested_vmx_pinbased_ctls_high |= > + PIN_BASED_POSTED_INTR; > > /* exit controls */ rdmsr(MSR_IA32_VMX_EXIT_CTLS, @@ -4267,6 > +4283,46 @@ static int vmx_vm_has_apicv(struct kvm *kvm) return > enable_apicv && irqchip_in_kernel(kvm); } > +static int 
vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
> +						int vector)
> +{
> +	int r = 0;
> +	struct vmcs12 *vmcs12;
> +
> +	/*
> +	 * Since posted intr delivery is async,
> +	 * we must aquire a spin-lock to avoid
> +	 * the race of vmcs12.
> +	 */
> +	spin_lock(&to_vmx(vcpu)->nested.vmcs12_lock);
> +	vmcs12 = get_vmcs12(vcpu);
> +	if (!is_guest_mode(vcpu) || !vmcs12) {
> +		r = -1;
> +
RE: [PATCH 0/5] KVM: nVMX: Enable nested apicv support.
Wincy Van wrote on 2015-01-20:
> Hi, Yang,
>
> Could you please have a look at this patch set?
> Your comment is very appreciated!

Sure. I will take a look.

> Thanks,
> Wincy

Best regards,
Yang
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Wu, Feng wrote on 2014-12-24:
>
> Zhang, Yang Z wrote on 2014-12-24:
>> Cc: io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org;
>> KVM list; Eric Auger
>> Subject: RE: [v3 06/26] iommu, x86: No need to migrating irq for
>> VT-d Posted-Interrupts
>>
>> Jiang Liu wrote on 2014-12-24:
>>> On 2014/12/24 9:38, Zhang, Yang Z wrote:
>>>> Paolo Bonzini wrote on 2014-12-23:
>>>>>
>>>>> On 23/12/2014 10:07, Wu, Feng wrote:
>>>>>>> On 23/12/2014 01:37, Zhang, Yang Z wrote:
>>>>>>>> I don't quite understand it. If user set an interrupt's affinity
>>>>>>>> to a CPU, but he still see the interrupt delivers to other CPUs
>>>>>>>> in host. Do you think it is a right behavior?
>>>>>>>
>>>>>>> No, the interrupt is not delivered at all in the host. Normally
>>>>>>> you'd have:
>>>>>>>
>>>>>>> - interrupt delivered to CPU from host affinity
>>>>>>> - VFIO interrupt handler writes to irqfd
>>>>>>> - interrupt delivered to vCPU from guest affinity
>>>>>>>
>>>>>>> Here, you just skip the first two steps. The interrupt is
>>>>>>> delivered to the thread that is running the vCPU directly, so
>>>>>>> the host affinity is bypassed entirely.
>>>>>>>
>>>>>>> ... unless you are considering the case where the vCPU is
>>>>>>> blocked and the host is processing the posted interrupt wakeup
>>>>>>> vector. In that case yes, it would be better to set NDST to a
>>>>>>> CPU matching the host affinity.
>>>>>>
>>>>>> In my understanding, wakeup vector should have no relationship
>>>>>> with the host affinity of the irq. Wakeup notification event
>>>>>> should be delivered to the pCPU which the vCPU was blocked on.
>>>>>> And in kernel's point of view, the irq is not associated with
>>>>>> the wakeup vector, right?
>>>>>
>>>>> That is correct indeed. It is not associated to the wakeup
>>>>> vector, hence this patch is right, I think.
>>>>>
>>>>> However, the wakeup vector has the same function as the VFIO
>>>>> interrupt handler, so you could argue that it is tied to the
>>>>> host affinity rather than the guest. Let's wait for Yang to answer.
>>>>
>>>> Actually, that's my original question too. I am wondering what
>>>> happens if the user changes the assigned device's affinity in
>>>> host's /proc/irq/? If ignore it is acceptable, then this patch is
>>>> ok. But it seems the discussion out of my scope, need some experts
>>>> to tell us their idea since it will impact the user experience.
>>>
>>> Hi Yang,
>>
>> Hi Jiang,
>>
>>> Originally we have a proposal to return failure when user sets
>>> IRQ affinity through native OS interfaces if an IRQ is in PI mode.
>>> But that proposal will break CPU hot-removal because OS needs to
>>> migrate away all IRQs binding to the CPU to be offlined. Then we
>>> propose saving user IRQ affinity setting without changing hardware
>>> configuration (keeping PI configuration). Later when PI mode is
>>> disabled, the cached affinity setting will be used to setup IRQ
>>> destination for native OS. On the other hand, for IRQ in PI mode,
>>> it won't be delivered to native OS, so user may not sense that the
>>> IRQ is delivered to CPUs other than those in the affinity set.
>>
>> The IRQ is still there but will be delivered to host in the form of
>> PI event (if the VCPU is running in root-mode). I am not sure whether
>> those interrupts should be reflected in /proc/interrupts? If the
>> answer is yes, then which entries should be used, a new PI entry or
>> use the original IRQ entry?
>
> Even though, setting the affinity of the IRQ in host should not affect
> the destination of the PI event (normal notification event or wakeup
> notification event), because the destination of the PI event is
> determined by the NDST field of the posted-interrupts descriptor and
> the PI notification vector is global. Just had a discussion with Jiang
> offline; maybe we can add the statistics information for the
> notification vector in /proc/interrupts just like any other global
> interrupts.

This is your implementation. To me, disabling PI if the VCPU is going to run on a CPU outside the IRQ affinity bitmap is also acceptable, and it will keep the user interface looking the same as before.

Hi Thomas, Ingo, Peter,

Can you guys help to review this patch? I would really appreciate it if you could give some feedback.

> Thanks,
> Feng
>
>>> In that aspect, I think it's acceptable:) Regards!
>>
>> Yes, if all of you guys (especially the IRQ maintainer) think it is
>> acceptable then we can follow current implementation and document it.
>>
>>> Gerry
>>>>
>>>>> Paolo

Best regards,
Yang

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Jiang Liu wrote on 2014-12-24:
> On 2014/12/24 9:38, Zhang, Yang Z wrote:
>> Paolo Bonzini wrote on 2014-12-23:
>>>
>>> On 23/12/2014 10:07, Wu, Feng wrote:
>>>>> On 23/12/2014 01:37, Zhang, Yang Z wrote:
>>>>>> I don't quite understand it. If user set an interrupt's affinity
>>>>>> to a CPU, but he still see the interrupt delivers to other CPUs in host.
>>>>>> Do you think it is a right behavior?
>>>>>
>>>>> No, the interrupt is not delivered at all in the host. Normally you'd
>>>>> have:
>>>>>
>>>>> - interrupt delivered to CPU from host affinity
>>>>> - VFIO interrupt handler writes to irqfd
>>>>> - interrupt delivered to vCPU from guest affinity
>>>>>
>>>>> Here, you just skip the first two steps. The interrupt is
>>>>> delivered to the thread that is running the vCPU directly, so the
>>>>> host affinity is bypassed entirely.
>>>>>
>>>>> ... unless you are considering the case where the vCPU is blocked
>>>>> and the host is processing the posted interrupt wakeup vector.
>>>>> In that case yes, it would be better to set NDST to a CPU
>>>>> matching the host affinity.
>>>>
>>>> In my understanding, wakeup vector should have no relationship
>>>> with the host affinity of the irq. Wakeup notification event
>>>> should be delivered to the pCPU which the vCPU was blocked on. And
>>>> in kernel's point of view, the irq is not associated with the wakeup
>>>> vector, right?
>>>
>>> That is correct indeed. It is not associated to the wakeup vector,
>>> hence this patch is right, I think.
>>>
>>> However, the wakeup vector has the same function as the VFIO
>>> interrupt handler, so you could argue that it is tied to the host
>>> affinity rather than the guest. Let's wait for Yang to answer.
>>
>> Actually, that's my original question too. I am wondering what
>> happens if the user changes the assigned device's affinity in host's
>> /proc/irq/? If ignore it is acceptable, then this patch is ok. But
>> it seems the discussion out of my scope, need some experts to tell
>> us their idea since it will impact the user experience.
>
> Hi Yang,

Hi Jiang,

> Originally we have a proposal to return failure when user sets IRQ
> affinity through native OS interfaces if an IRQ is in PI mode. But
> that proposal will break CPU hot-removal because OS needs to migrate
> away all IRQs binding to the CPU to be offlined. Then we propose
> saving user IRQ affinity setting without changing hardware
> configuration (keeping PI configuration). Later when PI mode is
> disabled, the cached affinity setting will be used to setup IRQ
> destination for native OS. On the other hand, for IRQ in PI mode, it
> won't be delivered to native OS, so user may not sense that the IRQ is
> delivered to CPUs other than those in the affinity set.

The IRQ is still there but will be delivered to the host in the form of a PI event (if the VCPU is running in root mode). I am not sure whether those interrupts should be reflected in /proc/interrupts. If the answer is yes, then which entry should be used: a new PI entry, or the original IRQ entry?

> In that aspect, I think it's acceptable:) Regards!

Yes, if all of you guys (especially the IRQ maintainer) think it is acceptable then we can follow the current implementation and document it.

> Gerry
>>
>>> Paolo

Best regards,
Yang
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Paolo Bonzini wrote on 2014-12-23:
>
> On 23/12/2014 10:07, Wu, Feng wrote:
>>> On 23/12/2014 01:37, Zhang, Yang Z wrote:
>>>> I don't quite understand it. If user set an interrupt's affinity
>>>> to a CPU, but he still see the interrupt delivers to other CPUs in host.
>>>> Do you think it is a right behavior?
>>>
>>> No, the interrupt is not delivered at all in the host. Normally you'd have:
>>>
>>> - interrupt delivered to CPU from host affinity
>>> - VFIO interrupt handler writes to irqfd
>>> - interrupt delivered to vCPU from guest affinity
>>>
>>> Here, you just skip the first two steps. The interrupt is
>>> delivered to the thread that is running the vCPU directly, so the
>>> host affinity is bypassed entirely.
>>>
>>> ... unless you are considering the case where the vCPU is blocked
>>> and the host is processing the posted interrupt wakeup vector. In
>>> that case yes, it would be better to set NDST to a CPU matching the
>>> host affinity.
>>
>> In my understanding, wakeup vector should have no relationship with
>> the host affinity of the irq. Wakeup notification event should be
>> delivered to the pCPU which the vCPU was blocked on. And in kernel's
>> point of view, the irq is not associated with the wakeup vector, right?
>
> That is correct indeed. It is not associated to the wakeup vector,
> hence this patch is right, I think.
>
> However, the wakeup vector has the same function as the VFIO interrupt
> handler, so you could argue that it is tied to the host affinity
> rather than the guest. Let's wait for Yang to answer.

Actually, that's my original question too. I am wondering what happens if the user changes the assigned device's affinity in the host's /proc/irq/. If ignoring it is acceptable, then this patch is ok. But the discussion seems out of my scope; we need some experts to tell us their opinion, since it will impact the user experience.

> Paolo

Best regards,
Yang
RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Paolo Bonzini wrote on 2014-12-23:
>> The problem is we still need to support PI with lowest priority
>> delivery mode even if guest does not configure irq affinity via
>> /proc/irq/. Don't we?
>
> Yes, but we can get the basic support working first.
>
> I and Feng talked on irc and agreed to start with a simple
> implementation. Then he can add support for vector hashing for all
> interrupts, both vt-d pi and normal LAPIC interrupts.

Agree. Obviously, current PI has some limitations in achieving the highest performance. We can start with a simple implementation, but we must ensure we don't change the hardware's behavior (from the guest's point of view). Feng's current implementation and the way I suggested are both ok, since neither of them solves the problem completely.

> Paolo

Best regards,
Yang
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Paolo Bonzini wrote on 2014-12-19:
>
> On 19/12/2014 02:46, Zhang, Yang Z wrote:
>>> If the IRQ is posted, its affinity is controlled by guest (irq
>>> <---> vCPU <---> pCPU), it has no effect when host changes its affinity.
>>
>> That's the problem: User is able to changes it in host but it never
>> takes effect since it is actually controlled by guest. I guess it
>> will break the IRQ balance too.
>
> I don't think that's a problem.
>
> Controlling the affinity in the host affects which CPU in the host
> takes care of signaling the guest.
>
> If this signaling is done directly by the chipset, there is no need to
> do anything in the host and thus the host affinity can be bypassed.

I don't quite understand it. If the user sets an interrupt's affinity to a CPU, but still sees the interrupt delivered to other CPUs in the host, do you think that is the right behavior?

> Paolo

Best regards,
Yang
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Wu, Feng wrote on 2014-12-19: > > > Zhang, Yang Z wrote on 2014-12-19: >> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is >> set >> >> Wu, Feng wrote on 2014-12-19: >>> >>> >>> Zhang, Yang Z wrote on 2014-12-19: >>>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' >>>> is set >>>> >>>> Wu, Feng wrote on 2014-12-19: >>>>> >>>>> >>>>> iommu-boun...@lists.linux-foundation.org wrote on >>>> mailto:iommu-boun...@lists.linux-foundation.org] On Behalf Of: >>>>>> Cc: io...@lists.linux-foundation.org; >>>>>> linux-kernel@vger.kernel.org; k...@vger.kernel.org >>>>>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' >>>>>> is set >>>>>> >>>>>> Paolo Bonzini wrote on 2014-12-18: >>>>>>> >>>>>>> >>>>>>> On 18/12/2014 04:14, Wu, Feng wrote: >>>>>>>> >>>>>>>> >>>>>>>> linux-kernel-ow...@vger.kernel.org wrote on >>>>>> mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Paolo: >>>>>>>>> x...@kernel.org; Gleb Natapov; Paolo Bonzini; >>>>>>>>> dw...@infradead.org; >>>>>>>>> joro-zlv9swrftaidnm+yrof...@public.gmane.org; Alex Williamson; >>>>>>>>> joro-zLv9SwRftAIdnm+Jiang Liu Cc: >>>>>>>>> io...@lists.linux-foundation.org; >>>>>>>>> linux-kernel-u79uwxl29ty76z2rm5m...@public.gmane.org; KVM list; >>>>>>>>> Eric Auger Subject: Re: [v3 25/26] KVM: Suppress >>>>>>>>> posted-interrupt when 'SN' is set >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On 12/12/2014 16:14, Feng Wu wrote: >>>>>>>>>> Currently, we don't support urgent interrupt, all >>>>>>>>>> interrupts are recognized as non-urgent interrupt, so we >>>>>>>>>> cannot send posted-interrupt when 'SN' is set. >>>>>>>>> >>>>>>>>> Can this happen? If the vcpu is in guest mode, it cannot >>>>>>>>> have been scheduled out, and that's the only case when SN is set. >>>>>>>>> >>>>>>>>> Paolo >>>>>>>> >>>>>>>> Currently, the only place where SN is set is vCPU is >>>>>>>> preempted and >>>>>> >>>>>> If the vCPU is preempted, shouldn't the subsequent be ignored? 
>>>>>> What happens if a PI is occurs when vCPU is preempted? >>>>> >>>>> If a vCPU is preempted, the 'SN' bit is set, the subsequent >>>>> interrupts are suppressed for posting. >>>> >>>> I mean what happens if we don't set SN bit. From my point, if >>>> preempter already disabled the interrupt, it is ok to leave SN >>>> bit as zero. But if preempter enabled the interrupt, doesn't this >>>> mean he allow interrupt to happen? BTW, since there already has >>>> ON bit, so this means there only have one interrupt arrived at >>>> most and it doesn't hurt performance. Do we really need to set SN bit? >>> >>> >>> See this scenario: >>> vCPU0 is running on pCPU0 >>> --> vCPU0 is preempted by vCPU1 >>> --> Then vCPU1 is running on pCPU0 and vCPU0 is waiting for >>> --> schedule in runqueue >>> >>> If the we don't set SN for vCPU0, then all subsequent interrupts >>> for >>> vCPU0 is posted to vCPU1, this will consume hardware and software >> >> The PI vector for vCPU1 is notification vector, but the PI vector >> for >> vCPU0 should be wakeup vector. Why vCPU1 will consume this PI event? > > Wakeup vector is only used for blocking case, when vCPU is preempted > and waiting in the runqueue, the NV is the notification vector. I see your point. But from performance point, if we can schedule the vCPU to another PCPU to handle the interrupt, it would helpful. But I remember current KVM will not schedule the vCPU in run queue (even though it got preempted) to another pCPU to run(Am I right?). So it may hard to do it. > > Thanks, > Feng > >> >>> efforts and in fact it is not needed at all. If SN is set for >>> vCPU0, VT-d hardware will not issue Notification Event for vCPU0 >>> when an interrupt is for it, but just setting the related PIR bit. >>> >>> Thanks, >>> Feng >>> >>>> >>>>> >>>>> Thanks, >>>>> Feng >>>>> >>>>>> >>>>>>>> waiting for the next scheduling in the runqueue. But I am not >>>>>>>> sure whether we need to set SN for other purpose in future. 
>>>>>>>> Adding SN checking here is just to follow the spec: >>>>>>>> non-urgent interrupts are suppressed >>>>>>> when SN is set. >>>>>>> >>>>>>> I would change that to a WARN_ON_ONCE then. Best regards, Yang
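The ON/SN semantics debated in this thread can be summarized with a small toy model. This is an illustrative sketch only, not kernel code: the struct and helper names are invented, and it only models the non-urgent-interrupt behaviour Feng describes (SN set while preempted means the vector is recorded in the PIR without a notification event).

```c
#include <assert.h>

/* Toy model of the VT-d posted-interrupt descriptor bits discussed in
 * this thread: 'ON' is the outstanding-notification bit, 'SN' the
 * suppress-notification bit.  Names are illustrative only. */
struct pi_desc_model {
    unsigned int on : 1;   /* a notification event is already outstanding */
    unsigned int sn : 1;   /* set while the vCPU is preempted */
};

/* Decide whether a notification event would be sent for a non-urgent
 * interrupt: with SN (or ON) set, the interrupt is only recorded in the
 * PIR, exactly the suppression behaviour described for a preempted vCPU. */
static int notify_for_nonurgent(struct pi_desc_model *pi)
{
    if (pi->sn || pi->on)
        return 0;       /* just set the PIR bit, no notification */
    pi->on = 1;         /* first interrupt raises ON and notifies */
    return 1;
}
```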
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Wu, Feng wrote on 2014-12-19: > > > Zhang, Yang Z wrote on 2014-12-19: >> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is >> set >> >> Wu, Feng wrote on 2014-12-19: >>> >>> >>> iommu-boun...@lists.linux-foundation.org wrote on >> mailto:iommu-boun...@lists.linux-foundation.org] On Behalf Of: >>>> Cc: io...@lists.linux-foundation.org; >>>> linux-kernel@vger.kernel.org; k...@vger.kernel.org >>>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' >>>> is set >>>> >>>> Paolo Bonzini wrote on 2014-12-18: >>>>> >>>>> >>>>> On 18/12/2014 04:14, Wu, Feng wrote: >>>>>> >>>>>> >>>>>> linux-kernel-ow...@vger.kernel.org wrote on >>>> mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Paolo: >>>>>>> x...@kernel.org; Gleb Natapov; Paolo Bonzini; dw...@infradead.org; >>>>>>> joro-zlv9swrftaidnm+yrof...@public.gmane.org; Alex Williamson; >>>>>>> joro-zLv9SwRftAIdnm+Jiang Liu Cc: >>>>>>> io...@lists.linux-foundation.org; >>>>>>> linux-kernel-u79uwxl29ty76z2rm5m...@public.gmane.org; KVM list; >>>>>>> Eric Auger Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt >>>>>>> when 'SN' is set >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 12/12/2014 16:14, Feng Wu wrote: >>>>>>>> Currently, we don't support urgent interrupt, all interrupts >>>>>>>> are recognized as non-urgent interrupt, so we cannot send >>>>>>>> posted-interrupt when 'SN' is set. >>>>>>> >>>>>>> Can this happen? If the vcpu is in guest mode, it cannot have >>>>>>> been scheduled out, and that's the only case when SN is set. >>>>>>> >>>>>>> Paolo >>>>>> >>>>>> Currently, the only place where SN is set is vCPU is preempted >>>>>> and >>>> >>>> If the vCPU is preempted, shouldn't the subsequent be ignored? >>>> What happens if a PI is occurs when vCPU is preempted? >>> >>> If a vCPU is preempted, the 'SN' bit is set, the subsequent >>> interrupts are suppressed for posting. >> >> I mean what happens if we don't set SN bit. 
From my point of view, if >> the preempter already disabled interrupts, it is OK to leave the SN bit >> as zero. But if the preempter enabled interrupts, doesn't this mean >> it allows interrupts to happen? BTW, since there is already an ON bit, >> at most one interrupt can arrive and it >> doesn't hurt performance. Do we really need to set the SN bit? > > > See this scenario: > vCPU0 is running on pCPU0 > --> vCPU0 is preempted by vCPU1 > --> Then vCPU1 is running on pCPU0 and vCPU0 is waiting for scheduling > --> in the runqueue > > If we don't set SN for vCPU0, then all subsequent interrupts for > vCPU0 are posted to vCPU1; this will consume hardware and software The PI vector for vCPU1 is the notification vector, but the PI vector for vCPU0 should be the wakeup vector. Why would vCPU1 consume this PI event? > efforts and in fact is not needed at all. If SN is set for vCPU0, > VT-d hardware will not issue a Notification Event for vCPU0 when an > interrupt arrives for it, but will just set the related PIR bit. > > Thanks, > Feng > >> >>> >>> Thanks, >>> Feng >>> >>>> >>>>>> waiting for the next scheduling in the runqueue. But I am not >>>>>> sure whether we need to set SN for other purposes in future. >>>>>> Adding SN checking here is just to follow the spec: non-urgent >>>>>> interrupts are suppressed >>>>> when SN is set. >>>>> >>>>> I would change that to a WARN_ON_ONCE then. Best regards, Yang
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Wu, Feng wrote on 2014-12-19: > > > iommu-boun...@lists.linux-foundation.org wrote on > [mailto:iommu-boun...@lists.linux-foundation.org] On Behalf Of: >> Cc: io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org; >> k...@vger.kernel.org >> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is >> set >> >> Paolo Bonzini wrote on 2014-12-18: >>> >>> >>> On 18/12/2014 04:14, Wu, Feng wrote: linux-kernel-ow...@vger.kernel.org wrote on >> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Paolo: > x...@kernel.org; Gleb Natapov; Paolo Bonzini; > dw...@infradead.org; > joro-zlv9swrftaidnm+yrof...@public.gmane.org; Alex Williamson; > joro-zLv9SwRftAIdnm+Jiang > Liu > Cc: io...@lists.linux-foundation.org; > linux-kernel-u79uwxl29ty76z2rm5m...@public.gmane.org; KVM list; > Eric Auger > Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' > is set > > > > On 12/12/2014 16:14, Feng Wu wrote: >> Currently, we don't support urgent interrupts; all interrupts >> are recognized as non-urgent interrupts, so we cannot send a >> posted-interrupt when 'SN' is set. > > Can this happen? If the vcpu is in guest mode, it cannot have > been scheduled out, and that's the only case when SN is set. > > Paolo Currently, the only place where SN is set is when the vCPU is preempted and >> >> If the vCPU is preempted, shouldn't the subsequent interrupts be ignored? What >> happens if a PI occurs while the vCPU is preempted? > > If a vCPU is preempted, the 'SN' bit is set and the subsequent interrupts > are suppressed for posting. I mean, what happens if we don't set the SN bit? From my point of view, if the preempter already disabled interrupts, it is OK to leave the SN bit as zero. But if the preempter enabled interrupts, doesn't this mean it allows interrupts to happen? BTW, since there is already an ON bit, at most one interrupt can arrive and it doesn't hurt performance. Do we really need to set the SN bit? > > Thanks, > Feng > >> waiting for the next scheduling in the runqueue. But I am not sure whether we need to set SN for other purposes in future. Adding SN checking here is just to follow the spec: non-urgent interrupts are suppressed >>> when SN is set. >>> >>> I would change that to a WARN_ON_ONCE then. Best regards, Yang
RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Wu, Feng wrote on 2014-12-19: > > > Paolo Bonzini wrote on 2014-12-19: >> jiang@linux.intel.com >> Cc: eric.au...@linaro.org; linux-kernel@vger.kernel.org; >> io...@lists.linux-foundation.org; k...@vger.kernel.org >> Subject: Re: [v3 13/26] KVM: Define a new interface >> kvm_find_dest_vcpu() for VT-d PI >> >> >> >> On 18/12/2014 15:49, Zhang, Yang Z wrote: >>>>> Here, we introduce a similar way with 'apic_arb_prio' to handle >>>>> guest lowest-priority interrupts when VT-d PI is used. Here is >>>>> the >>>>> idea: - Each vCPU has a counter 'round_robin_counter'. - When a >>>>> guest sets an interrupt to lowest priority, we choose the vCPU >>>>> with the smallest 'round_robin_counter' as the destination, then >>>>> increase it. >>> >>> How can this work well? All subsequent interrupts are delivered to >>> one vCPU? It isn't the best solution; it needs more consideration. >> >> Well, it's a hardware limitation. The alternative (which is easy to >> implement) is to only do PI for single-CPU interrupts. This should >> work well for multiqueue NICs (and of course for UP guests :)), so >> perhaps it's a good idea to only support that as a first attempt. >> >> Paolo > > Paolo, what do you mean by "single-CPU interrupts"? Do you mean we It should be the same idea I mentioned in another thread: deliver the interrupt to a single vCPU (maybe the first matched vCPU?). > don't support lowest-priority interrupts for PI? But the Linux OS uses > lowest priority in most cases. If so, a Linux guest OS can hardly benefit > from this feature. > > Thanks, > Feng > >> >>> Also, I think you should take apic_arb_prio into consideration, >>> since the priority is for the whole vCPU, not for one interrupt. Best regards, Yang
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Wu, Feng wrote on 2014-12-19: > > > Zhang, Yang Z wrote on 2014-12-18: >> jiang@linux.intel.com >> Cc: eric.au...@linaro.org; linux-kernel@vger.kernel.org; >> io...@lists.linux-foundation.org; k...@vger.kernel.org; Wu, Feng >> Subject: RE: [v3 06/26] iommu, x86: No need to migrating irq for >> VT-d Posted-Interrupts >> >> Feng Wu wrote on 2014-12-12: >>> We don't need to migrate the irqs for VT-d Posted-Interrupts here. >>> When 'pst' is set in IRTE, the associated irq will be posted to >>> guests instead of interrupt remapping. The destination of the >>> interrupt is set in Posted-Interrupts Descriptor, and the >>> migration happens during vCPU scheduling. >>> >>> However, we still update the cached irte here, which can be used >>> when changing back to remapping mode. >>> >>> Signed-off-by: Feng Wu >>> Reviewed-by: Jiang Liu >>> --- >>> drivers/iommu/intel_irq_remapping.c | 6 +- >>> 1 file changed, 5 insertions(+), 1 deletion(-) diff --git >>> a/drivers/iommu/intel_irq_remapping.c >>> b/drivers/iommu/intel_irq_remapping.c index 48c2051..ab9057a >>> 100644 >>> --- a/drivers/iommu/intel_irq_remapping.c +++ >>> b/drivers/iommu/intel_irq_remapping.c @@ -977,6 +977,7 @@ >>> intel_ir_set_affinity(struct irq_data *data, const struct cpumask >>> *mask, { >>> struct intel_ir_data *ir_data = data->chip_data;struct irte >>> *irte = >>> _data->irte_entry; +struct irte_pi *irte_pi = (struct irte_pi >>> *)irte;struct irq_cfg *cfg = irqd_cfg(data); struct irq_data *parent >>> = data->parent_data; int ret; >>> @@ -991,7 +992,10 @@ intel_ir_set_affinity(struct irq_data *data, >>> const struct cpumask *mask, >>> */ >>> irte->vector = cfg->vector; >>> irte->dest_id = IRTE_DEST(cfg->dest_apicid); >>> - modify_irte(_data->irq_2_iommu, irte); >>> + >>> + /* We don't need to modify irte if the interrupt is for posting. */ >>> + if (irte_pi->pst != 1) >>> + modify_irte(_data->irq_2_iommu, irte); >> >> What happens if user changes the IRQ affinity manually? 
> > If the IRQ is posted, its affinity is controlled by the guest (irq <---> > vCPU <---> pCPU); changing its affinity in the host has no effect. That's the problem: the user is able to change it in the host, but the change never takes effect since the affinity is actually controlled by the guest. I guess it will break IRQ balancing too. > > Thanks, > Feng > >> >>> >>> /* >>> * After this point, all the interrupts will start arriving Best regards, Yang
RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Paolo Bonzini wrote on 2014-12-19: > > > On 18/12/2014 15:49, Zhang, Yang Z wrote: >>>> Here, we introduce a similar way with 'apic_arb_prio' to handle >>>> guest lowest-priority interrupts when VT-d PI is used. Here is the >>>> idea: - Each vCPU has a counter 'round_robin_counter'. - When a >>>> guest sets an interrupt to lowest priority, we choose the vCPU >>>> with the smallest 'round_robin_counter' as the destination, then >>>> increase it. >> >> How can this work well? All subsequent interrupts are delivered to >> one vCPU? It isn't the best solution; it needs more consideration. > > Well, it's a hardware limitation. The alternative (which is easy to Agreed, it is limited by the hardware. But lowest priority distributes interrupts more efficiently than fixed mode, and the current implementation effectively turns lowest-priority mode into fixed mode. In an interrupt-intensive environment this may become a bottleneck, and the VM may not benefit greatly from VT-d PI. But again, it really is a hardware limitation. > implement) is to only do PI for single-CPU interrupts. This should > work well for multiqueue NICs (and of course for UP guests :)), so > perhaps it's a good idea to only support that as a first attempt. The easier way is to deliver the interrupt to the first matched vCPU we find. The round_robin_counter really doesn't help here, since the interrupt is delivered by the hardware directly. > > Paolo > >> Also, I think you should take apic_arb_prio into consideration, since >> the priority is for the whole vCPU, not for one interrupt. Best regards, Yang
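The "first matched vCPU" fallback Yang suggests can be sketched as follows. This is a hypothetical stand-alone illustration, not KVM code: `vcpu_model`, the flat array, and the crude logical-mode mask match are all invented stand-ins (real matching must honour the destination mode and APIC configuration).

```c
#include <stddef.h>

/* Simplified stand-in for a vCPU: just its APIC ID. */
struct vcpu_model {
    unsigned int apic_id;
};

/* Deliver a lowest-priority posted interrupt to the first vCPU whose
 * APIC ID matches the destination mask, as suggested above.  Sketch
 * only: a crude logical-mode bit test replaces real destination
 * matching. */
static struct vcpu_model *find_first_match(struct vcpu_model *vcpus,
                                           size_t n, unsigned int dest_mask)
{
    for (size_t i = 0; i < n; i++) {
        if (vcpus[i].apic_id & dest_mask)
            return &vcpus[i];
    }
    return NULL;   /* no match: fall back to non-posted delivery */
}
```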
RE: [v3 12/26] KVM: Initialize VT-d Posted-Interrupts Descriptor
Feng Wu wrote on 2014-12-12: > This patch initializes the VT-d Posted-Interrupts Descriptor. > > Signed-off-by: Feng Wu > --- > arch/x86/kvm/vmx.c | 27 +++ > 1 file changed, 27 insertions(+) > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index > 0b1383e..66ca275 100644 --- a/arch/x86/kvm/vmx.c +++ > b/arch/x86/kvm/vmx.c @@ -45,6 +45,7 @@ > #include > #include > #include > +#include > > #include "trace.h" > @@ -4433,6 +4434,30 @@ static void ept_set_mmio_spte_mask(void) > kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull); } > +static void pi_desc_init(struct vcpu_vmx *vmx) { > + unsigned int dest; > + > + if (!irq_remapping_cap(IRQ_POSTING_CAP)) > + return; > + > + /* > + * Initialize Posted-Interrupt Descriptor > + */ > + > + pi_clear_sn(&vmx->pi_desc); > + vmx->pi_desc.nv = POSTED_INTR_VECTOR; Here. > + > + /* Physical mode for Notification Event */ > + vmx->pi_desc.ndm = 0; And from here... > + dest = cpu_physical_id(vmx->vcpu.cpu); > + > + if (x2apic_enabled()) > + vmx->pi_desc.ndst = dest; > + else > + vmx->pi_desc.ndst = (dest << 8) & 0xFF00; } > + ...to here, the code is useless. The right place to update the PI descriptor is where the vCPU gets loaded, not in initialization. > /* > * Sets up the vmcs for emulated real mode. > */ > @@ -4476,6 +4501,8 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) > > vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR); > vmcs_write64(POSTED_INTR_DESC_ADDR, __pa(&vmx->pi_desc)); > + > + pi_desc_init(vmx); > } > > if (ple_gap) { Best regards, Yang
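Yang's point, that the notification destination should be recomputed where the vCPU gets loaded rather than once at init, can be illustrated with the destination encoding from the quoted patch. The stub type and helper name below are invented; the xAPIC/x2APIC split mirrors the patch hunk (full 32-bit APIC ID in x2APIC mode, 8-bit ID in bits 15:8 otherwise).

```c
#include <stdint.h>

/* Minimal stand-in for the NDST field of the PI descriptor. */
struct pi_desc_model {
    uint32_t ndst;
};

/* Recompute the notification destination each time the vCPU is loaded
 * on a (possibly different) pCPU.  In x2APIC mode NDST holds the full
 * 32-bit APIC ID; in xAPIC mode the 8-bit ID sits in bits 15:8. */
static void pi_update_ndst(struct pi_desc_model *pi,
                           uint32_t dest_apic_id, int x2apic)
{
    pi->ndst = x2apic ? dest_apic_id : ((dest_apic_id << 8) & 0xFF00u);
}
```

Calling such a helper from the vcpu-load path, instead of a one-time `pi_desc_init()`, keeps NDST correct across migrations between pCPUs.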
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Paolo Bonzini wrote on 2014-12-18: > > > On 18/12/2014 04:14, Wu, Feng wrote: >> >> >> linux-kernel-ow...@vger.kernel.org wrote on >> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Paolo: >>> x...@kernel.org; Gleb Natapov; Paolo Bonzini; dw...@infradead.org; >>> joro-zlv9swrftaidnm+yrof...@public.gmane.org; Alex Williamson; >>> joro-zLv9SwRftAIdnm+Jiang >>> Liu >>> Cc: io...@lists.linux-foundation.org; >>> linux-kernel-u79uwxl29ty76z2rm5m...@public.gmane.org; KVM list; >>> Eric Auger >>> Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is >>> set >>> >>> >>> >>> On 12/12/2014 16:14, Feng Wu wrote: Currently, we don't support urgent interrupts; all interrupts are recognized as non-urgent interrupts, so we cannot send a posted-interrupt when 'SN' is set. >>> >>> Can this happen? If the vcpu is in guest mode, it cannot have been >>> scheduled out, and that's the only case when SN is set. >>> >>> Paolo >> >> Currently, the only place where SN is set is when the vCPU is preempted and If the vCPU is preempted, shouldn't the subsequent interrupts be ignored? What happens if a PI occurs while the vCPU is preempted? >> waiting for the next scheduling in the runqueue. But I am not sure >> whether we need to set SN for other purposes in future. Adding SN >> checking here is just to follow the spec: non-urgent interrupts are >> suppressed > when SN is set. > > I would change that to a WARN_ON_ONCE then. Best regards, Yang
RE: [v3 21/26] x86, irq: Define a global vector for VT-d Posted-Interrupts
Feng Wu wrote on 2014-12-12: > Currently, we use a global vector as the Posted-Interrupts > Notification Event for all the vCPUs in the system. We need to > introduce another global vector for VT-d Posted-Interrtups, which will > be used to wakeup the sleep vCPU when an external interrupt from a > direct-assigned device happens for that vCPU. > Hi Feng, Since the idea of two global vectors mechanism is from me, please add me to the comments. > Signed-off-by: Feng Wu > --- > arch/x86/include/asm/entry_arch.h | 2 ++ > arch/x86/include/asm/hardirq.h | 1 + > arch/x86/include/asm/hw_irq.h | 2 ++ > arch/x86/include/asm/irq_vectors.h | 1 + > arch/x86/kernel/entry_64.S | 2 ++ > arch/x86/kernel/irq.c | 27 +++ > arch/x86/kernel/irqinit.c | 2 ++ > 7 files changed, 37 insertions(+) > diff --git a/arch/x86/include/asm/entry_arch.h > b/arch/x86/include/asm/entry_arch.h index dc5fa66..27ca0af 100644 --- > a/arch/x86/include/asm/entry_arch.h +++ > b/arch/x86/include/asm/entry_arch.h @@ -23,6 +23,8 @@ > BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR) #ifdef > CONFIG_HAVE_KVM BUILD_INTERRUPT3(kvm_posted_intr_ipi, POSTED_INTR_VECTOR, >smp_kvm_posted_intr_ipi) > +BUILD_INTERRUPT3(kvm_posted_intr_wakeup_ipi, POSTED_INTR_WAKEUP_VECTOR, > + smp_kvm_posted_intr_wakeup_ipi) > #endif > > /* > diff --git a/arch/x86/include/asm/hardirq.h > b/arch/x86/include/asm/hardirq.h index 0f5fb6b..9866065 100644 > --- a/arch/x86/include/asm/hardirq.h > +++ b/arch/x86/include/asm/hardirq.h > @@ -14,6 +14,7 @@ typedef struct { > #endif #ifdef CONFIG_HAVE_KVMunsigned int kvm_posted_intr_ipis; > +unsigned int kvm_posted_intr_wakeup_ipis; #endifunsigned int > x86_platform_ipis; /* arch dependent */unsigned int apic_perf_irqs; > diff --git a/arch/x86/include/asm/hw_irq.h > b/arch/x86/include/asm/hw_irq.h index e7ae6eb..38fac9b 100644 > --- a/arch/x86/include/asm/hw_irq.h > +++ b/arch/x86/include/asm/hw_irq.h > @@ -29,6 +29,7 @@ > extern asmlinkage void apic_timer_interrupt(void); extern asmlinkage 
> void x86_platform_ipi(void); extern asmlinkage void > kvm_posted_intr_ipi(void); +extern asmlinkage void > kvm_posted_intr_wakeup_ipi(void); > extern asmlinkage void error_interrupt(void); extern asmlinkage void > irq_work_interrupt(void); > > @@ -92,6 +93,7 @@ extern void > trace_call_function_single_interrupt(void); > #define trace_irq_move_cleanup_interrupt irq_move_cleanup_interrupt > #define trace_reboot_interrupt reboot_interrupt #define > trace_kvm_posted_intr_ipi kvm_posted_intr_ipi > +#define trace_kvm_posted_intr_wakeup_ipi kvm_posted_intr_wakeup_ipi > #endif /* CONFIG_TRACING */ > > struct irq_domain; > diff --git a/arch/x86/include/asm/irq_vectors.h > b/arch/x86/include/asm/irq_vectors.h index b26cb12..dca94f2 100644 --- > a/arch/x86/include/asm/irq_vectors.h +++ > b/arch/x86/include/asm/irq_vectors.h @@ -105,6 +105,7 @@ > /* Vector for KVM to deliver posted interrupt IPI */ #ifdef > CONFIG_HAVE_KVM #define POSTED_INTR_VECTOR 0xf2 +#define > POSTED_INTR_WAKEUP_VECTOR0xf1 #endif > > /* > diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S > index e61c14a..a598447 100644 --- a/arch/x86/kernel/entry_64.S +++ > b/arch/x86/kernel/entry_64.S @@ -960,6 +960,8 @@ apicinterrupt > X86_PLATFORM_IPI_VECTOR \ #ifdef CONFIG_HAVE_KVM > apicinterrupt3 POSTED_INTR_VECTOR \ > kvm_posted_intr_ipi smp_kvm_posted_intr_ipi > +apicinterrupt3 POSTED_INTR_WAKEUP_VECTOR \ > + kvm_posted_intr_wakeup_ipi smp_kvm_posted_intr_wakeup_ipi > #endif > > #ifdef CONFIG_X86_MCE_THRESHOLD > diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c index > 922d285..47408c3 100644 > --- a/arch/x86/kernel/irq.c > +++ b/arch/x86/kernel/irq.c > @@ -237,6 +237,9 @@ __visible void smp_x86_platform_ipi(struct pt_regs > *regs) } > > #ifdef CONFIG_HAVE_KVM > +void (*wakeup_handler_callback)(void) = NULL; > +EXPORT_SYMBOL_GPL(wakeup_handler_callback); + > /* > * Handler for POSTED_INTERRUPT_VECTOR. 
> */ > @@ -256,6 +259,30 @@ __visible void smp_kvm_posted_intr_ipi(struct > pt_regs *regs) > > set_irq_regs(old_regs); > } > + > +/* > + * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR. > + */ > +__visible void smp_kvm_posted_intr_wakeup_ipi(struct pt_regs *regs) { > + struct pt_regs *old_regs = set_irq_regs(regs); > + > + ack_APIC_irq(); > + > + irq_enter(); > + > + exit_idle(); > + > + inc_irq_stat(kvm_posted_intr_wakeup_ipis); > + > + if (wakeup_handler_callback) > + wakeup_handler_callback(); > + > + irq_exit(); > + > + set_irq_regs(old_regs); > +} > + > #endif > > __visible void smp_trace_x86_platform_ipi(struct pt_regs *regs) diff > --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c index > 70e181e..844673c 100644 --- a/arch/x86/kernel/irqinit.c +++ > b/arch/x86/kernel/irqinit.c @@ -144,6 +144,8
RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Feng Wu wrote on 2014-12-12: > This patch defines a new interface, kvm_find_dest_vcpu, for > VT-d PI, which returns the destination vCPU of the > interrupt for guests. > > Since VT-d PI cannot handle broadcast/multicast interrupts, > here we only handle Fixed and Lowest priority interrupts. > > The current method of handling guest lowest-priority interrupts > is to use a counter 'apic_arb_prio' for each vCPU: we choose the > vCPU with the smallest 'apic_arb_prio' and then increase it by 1. > However, for VT-d PI, we cannot re-use this, since we no longer > have control of 'apic_arb_prio' with posted-interrupt direct > delivery by hardware. > > Here, we introduce a similar way with 'apic_arb_prio' to handle guest > lowest-priority interrupts when VT-d PI is used. Here is the idea: - > Each vCPU has a counter 'round_robin_counter'. - When a guest sets an > interrupt to lowest priority, we choose the vCPU with the smallest > 'round_robin_counter' as the destination, then increase it. How can this work well? All subsequent interrupts are delivered to one vCPU? It isn't the best solution; it needs more consideration. Also, I think you should take apic_arb_prio into consideration, since the priority is for the whole vCPU, not for one interrupt. Best regards, Yang
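The round-robin scheme described in the patch can be sketched as follows. This is an illustrative stand-alone model, not the KVM implementation: `vcpu_stub`, the `matches_dest` flag, and the flat array are invented for the sketch.

```c
#include <limits.h>

/* Stand-in for a vCPU with the per-vCPU counter the patch proposes. */
struct vcpu_stub {
    int matches_dest;              /* vCPU is in the interrupt's dest set */
    unsigned int round_robin_counter;
};

/* Pick the matching vCPU with the smallest round_robin_counter as the
 * destination of a lowest-priority interrupt, then bump its counter so
 * the next interrupt rotates to a different vCPU. */
static int pick_lowest_priority(struct vcpu_stub *vcpus, int n)
{
    int best = -1;
    unsigned int best_cnt = UINT_MAX;

    for (int i = 0; i < n; i++) {
        if (!vcpus[i].matches_dest)
            continue;
        if (vcpus[i].round_robin_counter < best_cnt) {
            best_cnt = vcpus[i].round_robin_counter;
            best = i;
        }
    }
    if (best >= 0)
        vcpus[best].round_robin_counter++;
    return best;   /* index of chosen vCPU, or -1 if none matches */
}
```

Note that this only rotates the destination chosen at interrupt-setup time; once the IRTE is programmed, subsequent interrupts keep landing on that vCPU, which is exactly the weakness Yang raises.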
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Feng Wu wrote on 2014-12-12: > We don't need to migrate the irqs for VT-d Posted-Interrupts here. > When 'pst' is set in the IRTE, the associated irq will be posted to guests > instead of interrupt remapping. The destination of the interrupt is > set in the Posted-Interrupts Descriptor, and the migration happens during > vCPU scheduling. > > However, we still update the cached irte here, which can be used when > changing back to remapping mode. > > Signed-off-by: Feng Wu > Reviewed-by: Jiang Liu > --- > drivers/iommu/intel_irq_remapping.c | 6 +- > 1 file changed, 5 insertions(+), 1 deletion(-) > diff --git a/drivers/iommu/intel_irq_remapping.c > b/drivers/iommu/intel_irq_remapping.c index 48c2051..ab9057a 100644 --- > a/drivers/iommu/intel_irq_remapping.c +++ > b/drivers/iommu/intel_irq_remapping.c @@ -977,6 +977,7 @@ > intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask, > { > struct intel_ir_data *ir_data = data->chip_data; struct irte > *irte = > &ir_data->irte_entry; + struct irte_pi *irte_pi = (struct irte_pi > *)irte; struct irq_cfg *cfg = irqd_cfg(data); struct irq_data *parent > = data->parent_data; int ret; > @@ -991,7 +992,10 @@ intel_ir_set_affinity(struct irq_data *data, > const struct cpumask *mask, >*/ > irte->vector = cfg->vector; > irte->dest_id = IRTE_DEST(cfg->dest_apicid); > - modify_irte(&ir_data->irq_2_iommu, irte); > + > + /* We don't need to modify the irte if the interrupt is for posting. */ > + if (irte_pi->pst != 1) > + modify_irte(&ir_data->irq_2_iommu, irte); What happens if the user changes the IRQ affinity manually? > > /* >* After this point, all the interrupts will start arriving Best regards, Yang
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Feng Wu wrote on 2014-12-12: We don't need to migrate the irqs for VT-d Posted-Interrupts here. When 'pst' is set in IRTE, the associated irq will be posted to guests instead of interrupt remapping. The destination of the interrupt is set in Posted-Interrupts Descriptor, and the migration happens during vCPU scheduling. However, we still update the cached irte here, which can be used when changing back to remapping mode. Signed-off-by: Feng Wu feng...@intel.com Reviewed-by: Jiang Liu jiang@linux.intel.com --- drivers/iommu/intel_irq_remapping.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c index 48c2051..ab9057a 100644 --- a/drivers/iommu/intel_irq_remapping.c +++ b/drivers/iommu/intel_irq_remapping.c @@ -977,6 +977,7 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask, { struct intel_ir_data *ir_data = data-chip_data;struct irte *irte = ir_data-irte_entry; + struct irte_pi *irte_pi = (struct irte_pi *)irte; struct irq_cfg *cfg = irqd_cfg(data); struct irq_data *parent = data-parent_data; int ret; @@ -991,7 +992,10 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask, */ irte-vector = cfg-vector; irte-dest_id = IRTE_DEST(cfg-dest_apicid); - modify_irte(ir_data-irq_2_iommu, irte); + + /* We don't need to modify irte if the interrupt is for posting. */ + if (irte_pi-pst != 1) + modify_irte(ir_data-irq_2_iommu, irte); What happens if user changes the IRQ affinity manually? /* * After this point, all the interrupts will start arriving Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Feng Wu wrote on 2014-12-12: This patch defines a new interface kvm_find_dest_vcpu for VT-d PI, which can returns the destination vCPU of the interrupt for guests. Since VT-d PI cannot handle broadcast/multicast interrupt, Here we only handle Fixed and Lowest priority interrupts. The current method of handling guest lowest priority interrtups is to use a counter 'apic_arb_prio' for each vCPU, we choose the vCPU with smallest 'apic_arb_prio' and then increase it by 1. However, for VT-d PI, we cannot re-use this, since we no longer have control to 'apic_arb_prio' with posted interrupt direct delivery by Hardware. Here, we introduce a similar way with 'apic_arb_prio' to handle guest lowest priority interrtups when VT-d PI is used. Here is the ideas: - Each vCPU has a counter 'round_robin_counter'. - When guests sets an interrupts to lowest priority, we choose the vCPU with smallest 'round_robin_counter' as the destination, then increase it. How this can work well? All subsequent interrupts are delivered to one vCPU? It shouldn't be the best solution, need more consideration. Also, I think you should take the apic_arb_prio into consider since the priority is for the whole vCPU not for one interrupt. Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [v3 21/26] x86, irq: Define a global vector for VT-d Posted-Interrupts
Feng Wu wrote on 2014-12-12:

Currently, we use a global vector as the Posted-Interrupts Notification Event for all the vCPUs in the system. We need to introduce another global vector for VT-d Posted-Interrupts, which will be used to wake up a sleeping vCPU when an external interrupt from a direct-assigned device arrives for that vCPU.

Hi Feng, since the idea of the two global vectors mechanism is from me, please add me to the comments.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/include/asm/entry_arch.h  |  2 ++
 arch/x86/include/asm/hardirq.h     |  1 +
 arch/x86/include/asm/hw_irq.h      |  2 ++
 arch/x86/include/asm/irq_vectors.h |  1 +
 arch/x86/kernel/entry_64.S         |  2 ++
 arch/x86/kernel/irq.c              | 27 +++
 arch/x86/kernel/irqinit.c          |  2 ++
 7 files changed, 37 insertions(+)

diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index dc5fa66..27ca0af 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -23,6 +23,8 @@ BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR)
 #ifdef CONFIG_HAVE_KVM
 BUILD_INTERRUPT3(kvm_posted_intr_ipi, POSTED_INTR_VECTOR,
 		 smp_kvm_posted_intr_ipi)
+BUILD_INTERRUPT3(kvm_posted_intr_wakeup_ipi, POSTED_INTR_WAKEUP_VECTOR,
+		 smp_kvm_posted_intr_wakeup_ipi)
 #endif

 /*
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 0f5fb6b..9866065 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -14,6 +14,7 @@ typedef struct {
 #endif
 #ifdef CONFIG_HAVE_KVM
 	unsigned int kvm_posted_intr_ipis;
+	unsigned int kvm_posted_intr_wakeup_ipis;
 #endif
 	unsigned int x86_platform_ipis;	/* arch dependent */
 	unsigned int apic_perf_irqs;
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index e7ae6eb..38fac9b 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -29,6 +29,7 @@ extern asmlinkage void apic_timer_interrupt(void);
 extern asmlinkage void x86_platform_ipi(void);
 extern asmlinkage void kvm_posted_intr_ipi(void);
+extern asmlinkage void kvm_posted_intr_wakeup_ipi(void);
 extern asmlinkage void error_interrupt(void);
 extern asmlinkage void irq_work_interrupt(void);
@@ -92,6 +93,7 @@ extern void trace_call_function_single_interrupt(void);
 #define trace_irq_move_cleanup_interrupt irq_move_cleanup_interrupt
 #define trace_reboot_interrupt reboot_interrupt
 #define trace_kvm_posted_intr_ipi kvm_posted_intr_ipi
+#define trace_kvm_posted_intr_wakeup_ipi kvm_posted_intr_wakeup_ipi
 #endif /* CONFIG_TRACING */

 struct irq_domain;
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index b26cb12..dca94f2 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -105,6 +105,7 @@
 /* Vector for KVM to deliver posted interrupt IPI */
 #ifdef CONFIG_HAVE_KVM
 #define POSTED_INTR_VECTOR		0xf2
+#define POSTED_INTR_WAKEUP_VECTOR	0xf1
 #endif

 /*
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index e61c14a..a598447 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -960,6 +960,8 @@ apicinterrupt X86_PLATFORM_IPI_VECTOR \
 #ifdef CONFIG_HAVE_KVM
 apicinterrupt3 POSTED_INTR_VECTOR \
 	kvm_posted_intr_ipi smp_kvm_posted_intr_ipi
+apicinterrupt3 POSTED_INTR_WAKEUP_VECTOR \
+	kvm_posted_intr_wakeup_ipi smp_kvm_posted_intr_wakeup_ipi
 #endif

 #ifdef CONFIG_X86_MCE_THRESHOLD
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 922d285..47408c3 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -237,6 +237,9 @@ __visible void smp_x86_platform_ipi(struct pt_regs *regs)
 }

 #ifdef CONFIG_HAVE_KVM
+void (*wakeup_handler_callback)(void) = NULL;
+EXPORT_SYMBOL_GPL(wakeup_handler_callback);
+
 /*
  * Handler for POSTED_INTERRUPT_VECTOR.
  */
@@ -256,6 +259,30 @@ __visible void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
 	set_irq_regs(old_regs);
 }

+/*
+ * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
+ */
+__visible void smp_kvm_posted_intr_wakeup_ipi(struct pt_regs *regs)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+
+	ack_APIC_irq();
+
+	irq_enter();
+
+	exit_idle();
+
+	inc_irq_stat(kvm_posted_intr_wakeup_ipis);
+
+	if (wakeup_handler_callback)
+		wakeup_handler_callback();
+
+	irq_exit();
+
+	set_irq_regs(old_regs);
+}
+
 #endif

 __visible void smp_trace_x86_platform_ipi(struct pt_regs *regs)
diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
index 70e181e..844673c 100644
--- a/arch/x86/kernel/irqinit.c
+++ b/arch/x86/kernel/irqinit.c
@@ -144,6 +144,8 @@ static void __init apic_intr_init(void)
 #ifdef CONFIG_HAVE_KVM
 	/* IPI for KVM to deliver
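The patch's dispatch scheme - an exported function pointer that the kvm-intel module fills in, which the new wakeup-vector handler invokes if present - can be sketched as the following user-space model. This is an illustrative stand-in only (the counters, `kvm_wakeup_handler`, and the `_model` suffixes are invented names, and the real handler also does the APIC ack and irq_enter/irq_exit bookkeeping shown in the diff):

```c
#include <assert.h>
#include <stddef.h>

/* Model of the hook exported by irq.c: kvm-intel would install its
 * wakeup routine here at module load, and the handler for
 * POSTED_INTR_WAKEUP_VECTOR calls it only when one is registered. */
static void (*wakeup_handler_callback)(void) = NULL;
static unsigned int kvm_posted_intr_wakeup_ipis;
static int woken;

/* Stand-in for KVM's routine, e.g. moving a blocked vCPU back to the
 * run queue. */
static void kvm_wakeup_handler(void)
{
    woken = 1;
}

/* Mirrors the control flow of smp_kvm_posted_intr_wakeup_ipi():
 * account the IPI, then dispatch to KVM if a handler is present. */
static void smp_kvm_posted_intr_wakeup_ipi_model(void)
{
    kvm_posted_intr_wakeup_ipis++;
    if (wakeup_handler_callback)
        wakeup_handler_callback();
}
```

The function-pointer indirection is what keeps irq.c free of any KVM dependency: with no handler registered, the IPI is simply counted and dropped.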
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Paolo Bonzini wrote on 2014-12-18:

On 18/12/2014 04:14, Wu, Feng wrote:

Paolo Bonzini wrote on 12/12/2014 16:14, quoting Feng Wu:

Currently, we don't support urgent interrupts; all interrupts are recognized as non-urgent interrupts, so we cannot send a posted interrupt when 'SN' is set.

Can this happen? If the vcpu is in guest mode, it cannot have been scheduled out, and that's the only case when SN is set.

Paolo

Currently, the only place where SN is set is when a vCPU is preempted and waiting for the next scheduling in the runqueue.

If the vCPU is preempted, shouldn't subsequent interrupts be ignored anyway? What happens if a posted interrupt occurs while the vCPU is preempted?

But I am not sure whether we need to set SN for other purposes in the future. Adding the SN check here is just to follow the spec: non-urgent interrupts are suppressed when SN is set.

I would change that to a WARN_ON_ONCE then.

Best regards,
Yang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
RE: [v3 12/26] KVM: Initialize VT-d Posted-Interrupts Descriptor
Feng Wu wrote on 2014-12-12:

This patch initializes the VT-d Posted-Interrupts Descriptor.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/kvm/vmx.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0b1383e..66ca275 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -45,6 +45,7 @@
 #include <asm/perf_event.h>
 #include <asm/debugreg.h>
 #include <asm/kexec.h>
+#include <asm/irq_remapping.h>

 #include "trace.h"
@@ -4433,6 +4434,30 @@ static void ept_set_mmio_spte_mask(void)
 	kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
 }

+static void pi_desc_init(struct vcpu_vmx *vmx)
+{
+	unsigned int dest;
+
+	if (!irq_remapping_cap(IRQ_POSTING_CAP))
+		return;
+
+	/*
+	 * Initialize Posted-Interrupt Descriptor
+	 */
+
+	pi_clear_sn(&vmx->pi_desc);
+	vmx->pi_desc.nv = POSTED_INTR_VECTOR;

Here.

+
+	/* Physical mode for Notification Event */
+	vmx->pi_desc.ndm = 0;

And from here..

+	dest = cpu_physical_id(vmx->vcpu.cpu);
+
+	if (x2apic_enabled())
+		vmx->pi_desc.ndst = dest;
+	else
+		vmx->pi_desc.ndst = (dest << 8) & 0xFF00;
+}
+

..to here is useless. The right place to update the PI descriptor is where the vCPU gets loaded, not at initialization.

 /*
  * Sets up the vmcs for emulated real mode.
  */
@@ -4476,6 +4501,8 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 		vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR);
 		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
+
+		pi_desc_init(vmx);
 	}

 	if (ple_gap) {

Best regards,
Yang
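The NDST encoding the patch computes follows from the two APIC destination formats: in x2APIC mode the field holds the full 32-bit APIC ID, while in xAPIC mode the 8-bit APIC ID occupies bits 15:8. A minimal sketch of that computation (the function name is illustrative, not a kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Compute the Posted-Interrupt Descriptor's NDST field for a physical
 * destination. x2APIC: full 32-bit APIC ID. xAPIC: 8-bit APIC ID
 * placed in bits 15:8, as in the (dest << 8) & 0xFF00 expression in
 * the patch above. */
static uint32_t pi_desc_ndst(bool x2apic, uint32_t apic_id)
{
    return x2apic ? apic_id : ((apic_id << 8) & 0xFF00);
}
```

Yang's review point is about *when* this value is written, not how: the pCPU a vCPU runs on changes at every vcpu_load, so computing NDST once at vCPU setup would leave a stale destination after the first migration.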
RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Paolo Bonzini wrote on 2014-12-19:

On 18/12/2014 15:49, Zhang, Yang Z wrote:

Here, we introduce a way similar to 'apic_arb_prio' to handle guest lowest-priority interrupts when VT-d PI is used. Here is the idea:
- Each vCPU has a counter 'round_robin_counter'.
- When a guest sets an interrupt to lowest priority, we choose the vCPU with the smallest 'round_robin_counter' as the destination, then increase it.

How can this work well? All subsequent interrupts are delivered to one vCPU? It doesn't look like the best solution; it needs more consideration.

Well, it's a hardware limitation. The alternative (which is easy to implement) is to only do PI for single-CPU interrupts. This should work well for multiqueue NICs (and of course for UP guests :)), so perhaps it's a good idea to only support that as a first attempt.

Agreed, it is limited by hardware. But lowest-priority mode distributes interrupts more efficiently than fixed mode, and the current implementation effectively turns lowest-priority mode into fixed mode. In an interrupt-intensive environment this may become a bottleneck, and the VM may not benefit greatly from VT-d PI. But agreed again, it really is a hardware limitation.

The easier way is to deliver the interrupt to the first matched vCPU we find. The round_robin_counter really helps nothing here, since the interrupt is delivered by hardware directly.

Paolo

Also, I think you should take apic_arb_prio into consideration, since that priority is for the whole vCPU, not for one interrupt.

Best regards,
Yang
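The round_robin_counter proposal being debated can be sketched as a toy model (all names are illustrative, not kernel code): among the vCPUs matching the interrupt's destination, pick the one with the smallest counter and bump it, so successive lowest-priority interrupts rotate across the set.

```c
#include <assert.h>

/* Toy model of the proposed scheme for guest lowest-priority
 * interrupts under VT-d PI. */
struct vcpu {
    int id;
    unsigned int round_robin_counter;
};

/* Pick the matched vCPU with the smallest round_robin_counter and
 * increment it, rotating delivery over repeated calls. */
static struct vcpu *pick_lowest_prio(struct vcpu *v, int n)
{
    struct vcpu *best = &v[0];
    for (int i = 1; i < n; i++)
        if (v[i].round_robin_counter < best->round_robin_counter)
            best = &v[i];
    best->round_robin_counter++;
    return best;
}
```

Yang's objection is visible in the model: the rotation happens only at the moment the IRTE destination is programmed. Once hardware posts the interrupt directly, every subsequent occurrence lands on the vCPU chosen at programming time, so the counter does not actually balance the interrupt load.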
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Wu, Feng wrote on 2014-12-19:

Zhang, Yang Z wrote on 2014-12-18, quoting Feng Wu (2014-12-12):

We don't need to migrate the irqs for VT-d Posted-Interrupts here. When 'pst' is set in the IRTE, the associated irq will be posted to guests instead of interrupt-remapped. The destination of the interrupt is set in the Posted-Interrupts Descriptor, and the migration happens during vCPU scheduling. However, we still update the cached irte here, which can be used when changing back to remapping mode.

Signed-off-by: Feng Wu feng...@intel.com
Reviewed-by: Jiang Liu jiang@linux.intel.com
---
 drivers/iommu/intel_irq_remapping.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 48c2051..ab9057a 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -977,6 +977,7 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask,
 {
 	struct intel_ir_data *ir_data = data->chip_data;
 	struct irte *irte = &ir_data->irte_entry;
+	struct irte_pi *irte_pi = (struct irte_pi *)irte;
 	struct irq_cfg *cfg = irqd_cfg(data);
 	struct irq_data *parent = data->parent_data;
 	int ret;
@@ -991,7 +992,10 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	 */
 	irte->vector = cfg->vector;
 	irte->dest_id = IRTE_DEST(cfg->dest_apicid);
-	modify_irte(&ir_data->irq_2_iommu, irte);
+
+	/* We don't need to modify the irte if the interrupt is for posting. */
+	if (irte_pi->pst != 1)
+		modify_irte(&ir_data->irq_2_iommu, irte);

What happens if the user changes the IRQ affinity manually?

If the IRQ is posted, its affinity is controlled by the guest (irq -> vCPU -> pCPU); a host-side affinity change has no effect.
That's the problem: the user is able to change it in the host, but the change never takes effect since the affinity is actually controlled by the guest. I guess it will break IRQ balancing too.

Thanks,
Feng

	/*
	 * After this point, all the interrupts will start arriving

Best regards,
Yang
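The behavior under discussion - always update the cached IRTE, but skip the hardware write when the entry is in posted (pst=1) mode - can be modeled minimally as follows. All names here are illustrative stand-ins, not the kernel's types:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified IRTE with just the fields relevant to the discussion. */
struct irte_model {
    uint8_t  pst;      /* 1 = posted mode, 0 = remapped mode */
    uint8_t  vector;
    uint32_t dest_id;
};

static int hw_irte_writes;  /* counts modify_irte() calls in this model */

/* Mirror of the affinity path: the cached copy always tracks the
 * requested affinity (needed when switching back to remapping mode),
 * but the hardware IRTE is only rewritten for remapped entries, since
 * a posted entry's effective destination comes from the PI descriptor. */
static void set_affinity_model(struct irte_model *irte,
                               uint8_t vector, uint32_t dest)
{
    irte->vector = vector;
    irte->dest_id = dest;
    if (irte->pst != 1)
        hw_irte_writes++;   /* stand-in for modify_irte() */
}
```

The model also makes Yang's complaint concrete: for a pst=1 entry, a host-initiated affinity change updates only the software cache, so from the host's perspective (and irqbalance's) the request silently does nothing.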
RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Wu, Feng wrote on 2014-12-19:

Paolo Bonzini wrote on 2014-12-19:

On 18/12/2014 15:49, Zhang, Yang Z wrote:

Here, we introduce a way similar to 'apic_arb_prio' to handle guest lowest-priority interrupts when VT-d PI is used. Here is the idea:
- Each vCPU has a counter 'round_robin_counter'.
- When a guest sets an interrupt to lowest priority, we choose the vCPU with the smallest 'round_robin_counter' as the destination, then increase it.

How can this work well? All subsequent interrupts are delivered to one vCPU? It doesn't look like the best solution; it needs more consideration.

Well, it's a hardware limitation. The alternative (which is easy to implement) is to only do PI for single-CPU interrupts. This should work well for multiqueue NICs (and of course for UP guests :)), so perhaps it's a good idea to only support that as a first attempt.

Paolo

Paolo, what do you mean by single-CPU interrupts? Do you mean we don't support lowest-priority interrupts for PI? But a Linux guest OS uses lowest priority in most cases, so if so, we can hardly get any benefit from this feature for Linux guests.

It should be the same idea I mentioned on the other thread: deliver the interrupt to a single vCPU (maybe the first matched vCPU?).

Thanks,
Feng

Also, I think you should take apic_arb_prio into consideration, since that priority is for the whole vCPU, not for one interrupt.

Best regards,
Yang
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Wu, Feng wrote on 2014-12-19:

Zhang, Yang Z wrote, quoting the earlier exchange:

Paolo Bonzini wrote on 2014-12-18:

On 18/12/2014 04:14, Wu, Feng wrote, quoting Feng Wu (12/12/2014 16:14):

Currently, we don't support urgent interrupts; all interrupts are recognized as non-urgent interrupts, so we cannot send a posted interrupt when 'SN' is set.

Can this happen? If the vcpu is in guest mode, it cannot have been scheduled out, and that's the only case when SN is set.

Paolo

Currently, the only place where SN is set is when a vCPU is preempted and waiting for the next scheduling in the runqueue.

If the vCPU is preempted, shouldn't subsequent interrupts be ignored anyway? What happens if a posted interrupt occurs while the vCPU is preempted?

If a vCPU is preempted, the 'SN' bit is set, and subsequent interrupts are suppressed for posting.

I mean, what happens if we don't set the SN bit? From my point of view, if the preempting task has already disabled interrupts, it is OK to leave the SN bit as zero. But if the preempting task has interrupts enabled, doesn't that mean it allows interrupts to happen? BTW, since there is already an ON bit, at most one notification can be outstanding, so it doesn't hurt performance. Do we really need to set the SN bit?

Thanks,
Feng

But I am not sure whether we need to set SN for other purposes in the future. Adding the SN check here is just to follow the spec: non-urgent interrupts are suppressed when SN is set.

I would change that to a WARN_ON_ONCE then.

Best regards,
Yang
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Wu, Feng wrote on 2014-12-19:

Zhang, Yang Z wrote on 2014-12-19, quoting the earlier exchange:

On 12/12/2014 16:14, Feng Wu wrote:

Currently, we don't support urgent interrupts; all interrupts are recognized as non-urgent interrupts, so we cannot send a posted interrupt when 'SN' is set.

Can this happen? If the vcpu is in guest mode, it cannot have been scheduled out, and that's the only case when SN is set.

Paolo

Currently, the only place where SN is set is when a vCPU is preempted and waiting for the next scheduling in the runqueue.

If the vCPU is preempted, shouldn't subsequent interrupts be ignored anyway? What happens if a posted interrupt occurs while the vCPU is preempted?

If a vCPU is preempted, the 'SN' bit is set, and subsequent interrupts are suppressed for posting.

I mean, what happens if we don't set the SN bit? From my point of view, if the preempting task has already disabled interrupts, it is OK to leave the SN bit as zero. But if the preempting task has interrupts enabled, doesn't that mean it allows interrupts to happen? BTW, since there is already an ON bit, at most one notification can be outstanding, so it doesn't hurt performance. Do we really need to set the SN bit?

See this scenario: vCPU0 is running on pCPU0 -> vCPU0 is preempted by vCPU1 -> vCPU1 is now running on pCPU0 while vCPU0 waits in the runqueue for its next scheduling.

If we don't set SN for vCPU0, then all subsequent interrupts for vCPU0 are posted to vCPU1; this consumes hardware and software effort and in fact is not needed at all. If SN is set for vCPU0, VT-d hardware will not issue a Notification Event for vCPU0 when an interrupt arrives for it, but will just set the related PIR bit.

The PI vector for vCPU1 is the notification vector, but the PI vector for vCPU0 should be the wakeup vector. Why would vCPU1 consume this PI event?

Thanks,
Feng

But I am not sure whether we need to set SN for other purposes in the future. Adding the SN check here is just to follow the spec: non-urgent interrupts are suppressed when SN is set.

I would change that to a WARN_ON_ONCE then.

Best regards,
Yang
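The delivery rule being argued over - the PIR bit is always recorded, but a notification is sent only when ON transitions from 0 to 1 and SN is clear - can be summarized in a toy software model. This is illustrative only (real posting is done by VT-d hardware, and the struct layout here is simplified, not the architectural descriptor):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified Posted-Interrupt Descriptor state: a 256-bit PIR plus
 * the ON (outstanding notification) and SN (suppress notification)
 * control bits. */
struct pi_desc_model {
    uint64_t pir[4];
    int on;
    int sn;
};

static int notifications;   /* notification events "sent" by the model */

/* Model the hardware posting rule: always latch the vector in PIR;
 * send a notification (and set ON) only if none is outstanding and
 * suppression is off. With SN set for a preempted vCPU, interrupts
 * accumulate silently in PIR. */
static void post_interrupt(struct pi_desc_model *pi, uint8_t vec)
{
    pi->pir[vec / 64] |= 1ull << (vec % 64);
    if (!pi->on && !pi->sn) {
        pi->on = 1;
        notifications++;    /* would be an IPI to the notification vector */
    }
}
```

This captures both halves of the thread: Feng's point that ON alone already limits the cost to one outstanding notification, and his scenario that with SN set the preempted vCPU's interrupts are merely latched in PIR, to be found later via the wakeup path rather than by a spurious notification to whoever now runs on that pCPU.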