RE: [PATCH] kvm/vmx: EPTP switching test
Michael S. Tsirkin wrote on 2015-11-16:
> This patch adds a new parameter: eptp_switching_test, which enables
> testing EPT switching on VMX if supported by hardware. All EPT
> entries are initialized to the same value so this adds no useful
> functionality by itself, but can be used to test VMFUNC performance,
> and serve as a basis for future features based on EPTP switching.
>
> Support for nested virt is not enabled.
>
> This was tested using the following code within guest:
>
> #define VMX_VMFUNC ".byte 0x0f,0x01,0xd4"
> static void vmfunc(unsigned int nr, unsigned int ept)
> {
>         asm volatile(VMX_VMFUNC
>                      :
>                      : "a"(nr), "c"(ept)
>                      : "memory");
> }
>
> VMFUNC instruction cost was measured at ~122 cycles.
> (Note: recent versions of the GNU toolchain support the vmfunc
> instruction, removing the need for writing the bytecode manually.)
>
> Signed-off-by: Michael S. Tsirkin
> ---
>
> I think I'd like to put this upstream so future eptp switching work
> can be implemented on top. Comments?

We have a different version in hand which uses a separate EPTP. As you
know, the patch becomes more complex when using a separate EPTP, and
there are still many things left to do in our version. We will send it
out for comments soon.

Best regards,
Yang
RE: [PATCH] KVM: x86: always set accessed bit in shadow PTEs
Paolo Bonzini wrote on 2015-11-13:
> Commit 7a1638ce4220 ("nEPT: Redefine EPT-specific link_shadow_page()",
> 2013-08-05) says:
>
>     Since nEPT doesn't support A/D bit, we should not set those bit
>     when building the shadow page table.
>
> but this is not necessary. Even though nEPT doesn't support A/D
> bits, and hence the vmcs12 EPT pointer will never enable them,
> we always use them for shadow page tables if available (see
> construct_eptp in vmx.c). So we can set the A/D bits freely
> in the shadow page table.
>
> This patch hence basically reverts commit 7a1638ce4220.
>
> Cc: Yang Zhang
> Cc: Takuya Yoshikawa
> Signed-off-by: Paolo Bonzini

Looks good to me.

Best regards,
Yang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Zhang, Yang Z wrote on 2015-08-04:
> Paolo Bonzini wrote on 2015-08-04:
>>
>>
>> On 04/08/2015 02:46, Zhang, Yang Z wrote:
>>>> It is a problem for split irqchip, where the EOI exit bitmap can
>>>> be inferred from the IOAPIC routes but the TMR cannot. The
>>>> hardware behavior on the other hand can be implemented purely
>>>> within the LAPIC.
>>>
>>> So updating the TMR within LAPIC is the only solution to handle it?
>>
>> It's the simplest and the one that makes most sense. Considering
>> that TMR is a pretty obscure feature, it's unlikely that it will be
>> accelerated in the future.
>
> You may be right. It is safe if no future hardware plans to use it.
> Let me check with our hardware team to see whether it will be used or
> not in future.

After checking with Jun, there is no guarantee that a guest running on
another CPU will operate properly if the hypervisor modifies the vTMR
from a different CPU. So the hypervisor should not do it.

>>
>> Paolo
>
> Best regards,
> Yang

Best regards,
Yang
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-08-04:
>
>
> On 04/08/2015 02:46, Zhang, Yang Z wrote:
>>> It is a problem for split irqchip, where the EOI exit bitmap can be
>>> inferred from the IOAPIC routes but the TMR cannot. The hardware
>>> behavior on the other hand can be implemented purely within the LAPIC.
>>
>> So updating the TMR within LAPIC is the only solution to handle it?
>
> It's the simplest and the one that makes most sense. Considering that
> TMR is a pretty obscure feature, it's unlikely that it will be
> accelerated in the future.

You may be right. It is safe if no future hardware plans to use it. Let
me check with our hardware team to see whether it will be used or not in
the future.

>
> Paolo

Best regards,
Yang
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-08-03:
>
>
> On 03/08/2015 12:23, Zhang, Yang Z wrote:
>>> In any case, the TMR behavior introduced by the APICv patches is
>>> completely different from the hardware behavior, so it has to be fixed.
>>
>> But any real problem with it?
>
> It is a problem for split irqchip, where the EOI exit bitmap can be
> inferred from the IOAPIC routes but the TMR cannot. The hardware
> behavior on the other hand can be implemented purely within the LAPIC.

So updating the TMR within the LAPIC is the only solution to handle it?

>>> The alternative is to inject level-triggered interrupts
>>> synchronously, without using posted interrupts.
>>>
>>> I'll write some testcases to understand the functioning of TMR in
>>> the virtual-APIC page, but the manual seems clear to me.
>>
>> Currently, no existing hardware will use TMR and it will not cause
>> any problem. (That's the reason why we leave it in Xen.) But we don't
>> know whether future hardware will use it or not (the SDM always keeps
>> changing :)).
>
> But that would be covered by a different execution control (for
> backwards compatibility). We'll get there when such a feature is
> introduced.

Yes, we can leave that for the future. But one concern is that it may be
hard to handle at that time if someone also develops a feature that
relies on it (like the current patch to split the irqchip).

>> And per 24.11.4's description, the perfect solution is not to modify
>> it. Btw, IIRC, only the TMR doesn't follow the rule. All other VMCS
>> accesses are issued in the right VMCS context.
>
> Yes, that's correct. It's just the TMR.
>
> Paolo

Best regards,
Yang
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-08-03:
>
>
> On 03/08/2015 04:37, Zhang, Yang Z wrote:
>>>> Only virtualized APIC register reads use the virtual TMR
>>>> registers (SDM 29.4.2 or 29.5), but these just read data from the
>>>> corresponding field in the virtual APIC page.
>>
>> 24.11.4 Software Access to Related Structures
>> In addition to data in the VMCS region itself, VMX non-root operation
>> can be controlled by data structures that are referenced by pointers
>> in a VMCS (for example, the I/O bitmaps). While the pointers to these
>> data structures are parts of the VMCS, the data structures themselves
>> are not. They are not accessible using VMREAD and VMWRITE but by
>> ordinary memory writes.
>
> I don't think the TMR fields of the virtual-APIC page are among the
> data structures that _control_ VMX non-root operation, because:
>
> * it is not part of the virtualized APIC state listed in 29.1.1
>
> * read accesses from the APIC-access page simply return data from the
> corresponding page offset on the virtual-APIC page using the memory
> access type stored in IA32_VMX_BASIC_MSR. I think this explicitly
> says that the effects of 24.11.1 (especially non-deterministic
> behavior after a write) do not apply here.
>
> In any case, the TMR behavior introduced by the APICv patches is
> completely different from the hardware behavior, so it has to be fixed.

But any real problem with it?

> The alternative is to inject level-triggered interrupts
> synchronously, without using posted interrupts.
>
> I'll write some testcases to understand the functioning of TMR in the
> virtual-APIC page, but the manual seems clear to me.

Currently, no existing hardware uses the TMR, so it will not cause any
problem. (That's the reason why we leave it in Xen.) But we don't know
whether future hardware will use it or not (the SDM always keeps
changing :)). And per 24.11.4's description, the perfect solution is not
to modify it. Btw, IIRC, only the TMR doesn't follow the rule; all other
VMCS accesses are issued in the right VMCS context.

>
> Paolo

Best regards,
Yang
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-07-31:
>
>
> On 31/07/2015 04:49, Steve Rutherford wrote:
>> Oh... Yeah. That's a damn good point, given that the interrupt can be
>> injected from another thread while one is in that guest vcpu.
>>
>> Easiest time to update the TMR should be on guest entry through
>> vcpu_scan_ioapic, as before.
>>
>> The best way to go is probably to ditch the new per vcpu EOI exit
>> bitmap, and just update/use the TMR. There's no reason to duplicate
>> that data in the representation of the apic (I suspect that the
>> duplication was inspired by my mistaken notion of the TMR). The
>> IOAPIC exit check can use the TMR instead.
>>
>> Based upon my reading of the SDM, the only reason that the eoi exit
>> bitmaps are not the exact same as the TMR is that it is possible to
>> have virtual-interrupt delivery enabled /without/ an apic access page
>> (Note: V-ID => EOI exit bitmap enabled).
>>
>> Yang, do you happen to know if that is the case?
>>
>> [Note: Just looked back at the old code for updating the EOI exit
>> bitmaps, which for some reason was configured to trigger EOI exits
>> for all IOAPIC interrupts, not just level-triggered IOAPIC
>> interrupts. Which is weird, and I believe totally unnecessary.]
>
> The RTC interrupt needs to trigger an EOI exit with the in-kernel
> IOAPIC, in order to detect coalesced interrupts. This is necessary to
> avoid severe time drift in Windows guests.
>
> Paolo

Correct! The EOI exit bitmap is not exactly the same as the TMR: some
interrupts (especially the timer interrupt) are edge-triggered but still
need a notification when they get injected into the guest.

Best regards,
Yang
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-07-31:
>
>
> On 31/07/2015 01:26, Zhang, Yang Z wrote:
>>>> Do not compute TMR in advance. Instead, set the TMR just before
>>>> the interrupt is accepted into the IRR. This limits the coupling
>>>> between IOAPIC and LAPIC.
>>
>> Uh.., it back to original way which is wrong. You cannot modify the
>> apic page(here is the TMR reg) directly when the corresponding VMCS
>> may be used at same time.
>
> Where is this documented? The TMR is not part of the set of virtualized
> APIC registers (TPR, PPR, EOI, ISR, IRR, ICR+ICR2; SDM 29.1.1).
>
> Only virtualized APIC register reads use the virtual TMR registers (SDM
> 29.4.2 or 29.5), but these just read data from the corresponding field
> in the virtual APIC page.

It's not only the virtual-APIC page. All data structures that are used
by the VMCS (either contained directly in the VMCS or referenced by a
pointer in a VMCS) need to follow this rule:

24.11.4 Software Access to Related Structures
In addition to data in the VMCS region itself, VMX non-root operation
can be controlled by data structures that are referenced by pointers in
a VMCS (for example, the I/O bitmaps). While the pointers to these data
structures are parts of the VMCS, the data structures themselves are
not. They are not accessible using VMREAD and VMWRITE but by ordinary
memory writes. Software should ensure that each such data structure is
modified only when no logical processor with a current VMCS that
references it is in VMX non-root operation. Doing otherwise may lead to
unpredictable behavior (including behaviors identified in Section
24.11.1).

>
> Paolo

Best regards,
Yang
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-07-29:
> Do not compute TMR in advance. Instead, set the TMR just before the
> interrupt is accepted into the IRR. This limits the coupling between
> IOAPIC and LAPIC.

Uh.., this goes back to the original way, which is wrong. You cannot
modify the APIC page (here, the TMR register) directly while the
corresponding VMCS may be in use at the same time.

> Signed-off-by: Paolo Bonzini
> ---
>  arch/x86/kvm/ioapic.c | 9 ++---
>  arch/x86/kvm/ioapic.h | 3 +--
>  arch/x86/kvm/lapic.c  | 19 ++-
>  arch/x86/kvm/lapic.h  | 1 -
>  arch/x86/kvm/x86.c    | 5 +
>  5 files changed, 14 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
> index 856f79105bb5..eaf4ec38d980 100644
> --- a/arch/x86/kvm/ioapic.c
> +++ b/arch/x86/kvm/ioapic.c
> @@ -246,8 +246,7 @@ static void update_handled_vectors(struct kvm_ioapic *ioapic)
>  	smp_wmb();
>  }
>
> -void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap,
> -			   u32 *tmr)
> +void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
>  {
>  	struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
>  	union kvm_ioapic_redirect_entry *e;
> @@ -260,13 +259,9 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap,
>  		    kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index) ||
>  		    index == RTC_GSI) {
>  			if (kvm_apic_match_dest(vcpu, NULL, 0,
> -				e->fields.dest_id, e->fields.dest_mode)) {
> +				e->fields.dest_id, e->fields.dest_mode))
>  				__set_bit(e->fields.vector,
>  					  (unsigned long *)eoi_exit_bitmap);
> -				if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG)
> -					__set_bit(e->fields.vector,
> -						  (unsigned long *)tmr);
> -			}
>  		}
>  	}
>  	spin_unlock(&ioapic->lock);
>
> diff --git a/arch/x86/kvm/ioapic.h b/arch/x86/kvm/ioapic.h
> index ca0b0b4e6256..3dbd0e2aac4e 100644
> --- a/arch/x86/kvm/ioapic.h
> +++ b/arch/x86/kvm/ioapic.h
> @@ -120,7 +120,6 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>  		struct kvm_lapic_irq *irq, unsigned long *dest_map);
>  int kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
>  int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
> -void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap,
> -			   u32 *tmr);
> +void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
>
>  #endif
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 2a5ca97c263b..9be64c77d6db 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -551,15 +551,6 @@ static void pv_eoi_clr_pending(struct kvm_vcpu *vcpu)
>  	__clear_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
>  }
>
> -void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr)
> -{
> -	struct kvm_lapic *apic = vcpu->arch.apic;
> -	int i;
> -
> -	for (i = 0; i < 8; i++)
> -		apic_set_reg(apic, APIC_TMR + 0x10 * i, tmr[i]);
> -}
> -
>  static void apic_update_ppr(struct kvm_lapic *apic)
>  {
>  	u32 tpr, isrv, ppr, old_ppr;
> @@ -781,6 +772,9 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
>  	case APIC_DM_LOWEST:
>  		vcpu->arch.apic_arb_prio++;
>  	case APIC_DM_FIXED:
> +		if (unlikely(trig_mode && !level))
> +			break;
> +
>  		/* FIXME add logic for vcpu on reset */
>  		if (unlikely(!apic_enabled(apic)))
>  			break;
> @@ -790,6 +784,13 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
>  		if (dest_map)
>  			__set_bit(vcpu->vcpu_id, dest_map);
>
> +		if (apic_test_vector(vector, apic->regs + APIC_TMR) != !!trig_mode) {
> +			if (trig_mode)
> +				apic_set_vector(vector, apic->regs + APIC_TMR);
> +			else
> +				apic_clear_vector(vector, apic->regs + APIC_TMR);
> +		}
> +
>  		if (kvm_x86_ops->deliver_posted_interrupt)
>  			kvm_x86_ops->deliver_posted_interrupt(vcpu, vector);
>  		else {
>
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 764037991d26..eb46d6bcaa75 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -57,7 +57,6 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
>  u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
>  void kvm_apic_set_version(struct kvm_vcpu *vcpu);
>
> -void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr);
>  void __kvm_apic_update_irr(u32 *pir, void *regs);
>  void
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-07-29: Do not compute TMR in advance. Instead, set the TMR just before the interrupt is accepted into the IRR. This limits the coupling between IOAPIC and LAPIC. Uh.., it back to original way which is wrong. You cannot modify the apic page(here is the TMR reg) directly when the corresponding VMCS may be used at same time. Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- arch/x86/kvm/ioapic.c | 9 ++--- arch/x86/kvm/ioapic.h | 3 +-- arch/x86/kvm/lapic.c | 19 ++- arch/x86/kvm/lapic.h | 1 - arch/x86/kvm/x86.c| 5 + 5 files changed, 14 insertions(+), 23 deletions(-) diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c index 856f79105bb5..eaf4ec38d980 100644 --- a/arch/x86/kvm/ioapic.c +++ b/arch/x86/kvm/ioapic.c @@ -246,8 +246,7 @@ static void update_handled_vectors(struct kvm_ioapic *ioapic) smp_wmb(); } -void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap, - u32 *tmr) +void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap) { struct kvm_ioapic *ioapic = vcpu-kvm-arch.vioapic; union kvm_ioapic_redirect_entry *e; @@ -260,13 +259,9 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap, kvm_irq_has_notifier(ioapic-kvm, KVM_IRQCHIP_IOAPIC, index) || index == RTC_GSI) { if (kvm_apic_match_dest(vcpu, NULL, 0, - e-fields.dest_id, e-fields.dest_mode)) { + e-fields.dest_id, e-fields.dest_mode)) __set_bit(e-fields.vector, (unsigned long *)eoi_exit_bitmap); - if (e-fields.trig_mode == IOAPIC_LEVEL_TRIG) - __set_bit(e-fields.vector, - (unsigned long *)tmr); - } } } spin_unlock(ioapic-lock); diff --git a/arch/x86/kvm/ioapic.h b/arch/x86/kvm/ioapic.h index ca0b0b4e6256..3dbd0e2aac4e 100644 --- a/arch/x86/kvm/ioapic.h +++ b/arch/x86/kvm/ioapic.h @@ -120,7 +120,6 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src, struct kvm_lapic_irq *irq, unsigned long *dest_map); int kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state); int kvm_set_ioapic(struct kvm *kvm, 
struct kvm_ioapic_state *state); -void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap, - u32 *tmr); +void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap); #endif diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 2a5ca97c263b..9be64c77d6db 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -551,15 +551,6 @@ static void pv_eoi_clr_pending(struct kvm_vcpu *vcpu) __clear_bit(KVM_APIC_PV_EOI_PENDING, vcpu-arch.apic_attention); } -void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr) -{ - struct kvm_lapic *apic = vcpu-arch.apic; - int i; - - for (i = 0; i 8; i++) - apic_set_reg(apic, APIC_TMR + 0x10 * i, tmr[i]); -} - static void apic_update_ppr(struct kvm_lapic *apic) { u32 tpr, isrv, ppr, old_ppr; @@ -781,6 +772,9 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode, case APIC_DM_LOWEST: vcpu-arch.apic_arb_prio++; case APIC_DM_FIXED: + if (unlikely(trig_mode !level)) + break; + /* FIXME add logic for vcpu on reset */ if (unlikely(!apic_enabled(apic))) break; @@ -790,6 +784,13 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode, if (dest_map) __set_bit(vcpu-vcpu_id, dest_map); + if (apic_test_vector(vector, apic-regs + APIC_TMR) != !!trig_mode) { + if (trig_mode) + apic_set_vector(vector, apic-regs + APIC_TMR); + else + apic_clear_vector(vector, apic-regs + APIC_TMR); + } + if (kvm_x86_ops-deliver_posted_interrupt) kvm_x86_ops-deliver_posted_interrupt(vcpu, vector); else { diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 764037991d26..eb46d6bcaa75 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -57,7 +57,6 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value); u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu); void kvm_apic_set_version(struct kvm_vcpu *vcpu); -void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr); void __kvm_apic_update_irr(u32 *pir, void *regs); void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir); int 
kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct
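For reference, the per-vector TMR update in the diff above addresses 256 vector bits spread over eight 32-bit APIC registers spaced 0x10 apart. A minimal user-space sketch of that addressing scheme (the helper names mirror the kernel's, but this is a simplified stand-in, not the actual KVM implementation):

```c
#include <assert.h>
#include <stdint.h>

/* Vector v maps to bit (v & 31) of the 32-bit register at byte offset
 * 0x10 * (v >> 5) from the bank base: eight registers, 0x10 apart. */
static uint32_t *vec_reg(void *bitmap, int vec)
{
	return (uint32_t *)((uint8_t *)bitmap + 0x10 * (vec >> 5));
}

static int apic_test_vector(int vec, void *bitmap)
{
	return (*vec_reg(bitmap, vec) >> (vec & 31)) & 1;
}

static void apic_set_vector(int vec, void *bitmap)
{
	*vec_reg(bitmap, vec) |= 1u << (vec & 31);
}

static void apic_clear_vector(int vec, void *bitmap)
{
	*vec_reg(bitmap, vec) &= ~(1u << (vec & 31));
}
```

With this layout the patch can flip a single vector's TMR bit at interrupt-acceptance time, instead of rewriting all eight TMR registers as the removed kvm_apic_update_tmr() did.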
RE: [PATCH] kvm/fpu: Enable eager restore kvm FPU for MPX
Paolo Bonzini wrote on 2015-05-20: > > > On 20/05/2015 07:20, Zhang, Yang Z wrote: >> Li, Liang Z wrote on 2015-05-20: >>> The MPX feature requires eager KVM FPU restore support. We have >>> verified that MPX cannot work correctly with the current lazy KVM >>> FPU restore mechanism. Eager KVM FPU restore should be enabled if >>> the MPX feature is exposed to VM. >>> >>> Signed-off-by: Liang Li >>> --- >>> arch/x86/kvm/vmx.c | 2 ++ >>> arch/x86/kvm/x86.c | 3 ++- >>> 2 files changed, 4 insertions(+), 1 deletion(-) >>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index >>> f7b6168..e2cccbe 100644 >>> --- a/arch/x86/kvm/vmx.c >>> +++ b/arch/x86/kvm/vmx.c >>> @@ -8445,6 +8445,8 @@ static struct kvm_vcpu >>> *vmx_create_vcpu(struct kvm *kvm, unsigned int id) >>> goto free_vmcs; >>> } >>> + if (vmx_mpx_supported()) >>> + vmx_fpu_activate(&vmx->vcpu); >>> return &vmx->vcpu; >>> >>> free_vmcs: >> >> Is it better to use guest_cpuid_has_mpx() instead of vmx_mpx_supported()? > > CPUID hasn't been set yet, so I think it is okay to key it on > vmx_mpx_supported(). It will be deactivated soon afterwards. > > Or even do it unconditionally; just make sure to add a comment about it. Correct! Unconditionally would be acceptable. > >>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index >>> 5f38188..5993f5f >>> 100644 >>> --- a/arch/x86/kvm/x86.c >>> +++ b/arch/x86/kvm/x86.c >>> @@ -7060,7 +7060,8 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu) >>> fpu_save_init(&vcpu->arch.guest_fpu); >>> __kernel_fpu_end(); >>> ++vcpu->stat.fpu_reload; >>> - kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); >>> + if (!kvm_x86_ops->mpx_supported()) >>> + kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); >>> trace_kvm_fpu(0); >>> } > > This is a hotter path. Here it's definitely better to avoid the call > to kvm_x86_ops->mpx_supported(). Especially because, with MPX > enabled, you would call this on every userspace exit. > > Yang's suggestion of using CPUID is actually more valuable here.
You > could add a new field eager_fpu in kvm->arch and update it in > kvm_update_cpuid. Good suggestion! > > Thanks, > > Paolo Best regards, Yang
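Paolo's suggestion, caching the CPUID-derived decision once instead of calling mpx_supported() on every FPU put, can be sketched as follows. The struct layout and helper bodies here are illustrative stand-ins for the KVM structures, not the actual kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the KVM structures involved. */
struct kvm_arch { bool eager_fpu; };
struct kvm_vcpu {
	struct kvm_arch arch;
	bool fpu_deactivate_requested;
	bool guest_has_mpx;		/* would come from guest CPUID */
};

/* Cold path, called whenever guest CPUID changes: cache the
 * eager-FPU decision once. */
static void kvm_update_cpuid(struct kvm_vcpu *vcpu)
{
	vcpu->arch.eager_fpu = vcpu->guest_has_mpx;
}

/* Hot path (runs on every userspace exit with MPX enabled): just a
 * cached flag check, no indirect kvm_x86_ops->mpx_supported() call. */
static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
{
	if (!vcpu->arch.eager_fpu)
		vcpu->fpu_deactivate_requested = true;
}
```

The design point is simply moving the cost from the per-exit path to the rare CPUID-update path.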
RE: [PATCH] kvm/fpu: Enable eager restore kvm FPU for MPX
Li, Liang Z wrote on 2015-05-20: > The MPX feature requires eager KVM FPU restore support. We have > verified that MPX cannot work correctly with the current lazy KVM FPU > restore mechanism. Eager KVM FPU restore should be enabled if the MPX > feature is exposed to VM. > > Signed-off-by: Liang Li > --- > arch/x86/kvm/vmx.c | 2 ++ > arch/x86/kvm/x86.c | 3 ++- > 2 files changed, 4 insertions(+), 1 deletion(-) > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index > f7b6168..e2cccbe 100644 --- a/arch/x86/kvm/vmx.c +++ > b/arch/x86/kvm/vmx.c @@ -8445,6 +8445,8 @@ static struct kvm_vcpu > *vmx_create_vcpu(struct kvm *kvm, unsigned int id) > goto free_vmcs; > } > + if (vmx_mpx_supported()) Is it better to use guest_cpuid_has_mpx() instead of vmx_mpx_supported()? > + vmx_fpu_activate(&vmx->vcpu); > return &vmx->vcpu; > > free_vmcs: > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index > 5f38188..5993f5f 100644 --- a/arch/x86/kvm/x86.c +++ > b/arch/x86/kvm/x86.c @@ -7060,7 +7060,8 @@ void kvm_put_guest_fpu(struct > kvm_vcpu *vcpu) > fpu_save_init(&vcpu->arch.guest_fpu); > __kernel_fpu_end(); > ++vcpu->stat.fpu_reload; > - kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); > + if (!kvm_x86_ops->mpx_supported()) > + kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu); > trace_kvm_fpu(0); > } Best regards, Yang
RE: [RFC PATCH 00/13] KVM: x86: SMM support
Paolo Bonzini wrote on 2015-05-19: > > > On 19/05/2015 16:25, Zhang, Yang Z wrote: >> Paolo Bonzini wrote on 2015-04-30: >>> This patch series introduces system management mode support. >> >> Just curious, what's the motivation to add vSMM support? Is there any >> use case inside the guest that requires SMM? Thanks. > > SMM provides the trusted base for UEFI Secure Boot. > > Paolo OK, I see it. Thanks for your explanation. Best regards, Yang
RE: [RFC PATCH 00/13] KVM: x86: SMM support
Paolo Bonzini wrote on 2015-04-30: > This patch series introduces system management mode support. Just curious, what's the motivation to add vSMM support? Is there any use case inside the guest that requires SMM? Thanks. > There is still some work to do, namely: test without unrestricted > guest support, test on AMD, disable the capability if !unrestricted > guest and !emulate invalid guest state(*), test with a QEMU that > understands KVM_MEM_X86_SMRAM, actually post QEMU patches that let you use > this. > > (*) newer chipsets moved away from legacy SMRAM at 0xa0000, > thus support for real mode CS base above 1M is necessary > > Because legacy SMRAM is a mess, I have tried these patches with Q35's > high SMRAM (at 0xfeda0000). This means that right now this isn't the > easiest thing to test; you need QEMU patches that add support for high > SMRAM, and SeaBIOS patches to use high SMRAM. Until QEMU support for > KVM_MEM_X86_SMRAM is in place, also, I'm keeping SMRAM open in SeaBIOS. > > That said, even this clumsy and incomplete userspace configuration is > enough to test all patches except 11 and 12. > > The series is structured as follows. > > Patch 1 is an unrelated bugfix (I think). Patches 2 to 6 extend some > infrastructure functions. Patches 1 to 4 could be committed right now. > > Patches 7 to 9 implement basic support for SMM in the KVM API and > teach KVM about doing the world switch on SMI and RSM. > > Patch 10 touches all places in KVM that read/write guest memory to go > through an x86-specific function. The x86-specific function takes a > VCPU rather than a struct kvm. This is used in patches 11 and 12 to > limit access to specially marked SMRAM slots unless the VCPU is in > system management mode. > > Finally, patch 13 exposes the new capability for userspace to probe. > Best regards, Yang
RE: [v6] kvm/fpu: Enable fully eager restore kvm FPU
Paolo Bonzini wrote on 2015-04-24: > > > On 24/04/2015 09:46, Zhang, Yang Z wrote: >>> On the other hand vmexit is lighter and lighter on newer >>> processors; a Sandy Bridge has less than half the vmexit cost of a >>> Core 2 (IIRC >>> 1000 vs. 2500 clock cycles approximately). >> >> 1000 cycles? I remember it takes about 4000 cycles even on an HSW server. > > I was going from memory, but I now measured it with the vmexit test of > kvm-unit-tests. With both SNB Xeon E5 and IVB Core i7, it returns about > 1400 clock cycles for a vmcall exit. This includes the overhead of > doing the cpuid itself. > > Thus the vmexit cost is around 1300 cycles. Of this the vmresume > instruction is probably around 800 cycles, and the rest is introduced > by KVM. There are at least 4-5 memory barriers and locked instructions. Yes, that makes sense. The average vmexit/vmentry handling cost is around 4000 cycles. But I guess xsaveopt doesn't take so many cycles. Does anyone have the xsaveopt cost data? > > Paolo Best regards, Yang
RE: [v6] kvm/fpu: Enable fully eager restore kvm FPU
Paolo Bonzini wrote on 2015-04-24: > > > On 24/04/2015 03:16, Zhang, Yang Z wrote: >>> This is interesting since previous measurements on KVM have had the >>> exact opposite results. I think we need to understand this a lot >>> more. >> >> What I can tell is that vmexit is heavy. So it is reasonable to see >> the improvement in some cases, especially since the kernel is using eager >> FPU now, which means each schedule may trigger a vmexit. > > On the other hand vmexit is lighter and lighter on newer processors; a > Sandy Bridge has less than half the vmexit cost of a Core 2 (IIRC 1000 > vs. 2500 clock cycles approximately). > 1000 cycles? I remember it takes about 4000 cycles even on an HSW server. > Also, measurements were done on Westmere but Sandy Bridge is the first > processor to have XSAVEOPT and thus use eager FPU. > > Paolo Best regards, Yang
RE: [v6] kvm/fpu: Enable fully eager restore kvm FPU
H. Peter Anvin wrote on 2015-04-24: > On 04/23/2015 08:28 AM, Dave Hansen wrote: >> On 04/23/2015 02:13 PM, Liang Li wrote: >>> When compiling kernel on westmere, the performance of eager FPU is >>> about 0.4% faster than lazy FPU. >> >> Do you have a theory why this is? What does the regression come from? >> > > This is interesting since previous measurements on KVM have had the > exact opposite results. I think we need to understand this a lot more. What I can tell is that vmexit is heavy. So it is reasonable to see the improvement in some cases, especially since the kernel is using eager FPU now, which means each schedule may trigger a vmexit. > > -hpa > Best regards, Yang
RE: [PATCH v4 6/6] KVM: nVMX: Enable nested posted interrupt processing
Paolo Bonzini wrote on 2015-02-03: > > > On 02/02/2015 16:33, Wincy Van wrote: >> static void vmx_accomp_nested_posted_intr(struct kvm_vcpu *vcpu) { >> struct vcpu_vmx *vmx = to_vmx(vcpu); >> >> if (is_guest_mode(vcpu) && >> vmx->nested.posted_intr_nv != -1 && >> pi_test_on(vmx->nested.pi_desc)) >> kvm_apic_set_irr(vcpu, >> vmx->nested.posted_intr_nv); } >> Then we will get a nested vmexit in vmx_check_nested_events, and that >> posted intr will be handled by L1 immediately. >> This mechanism will also emulate the hardware's behavior: If a >> posted intr was not accomplished by hardware, we will get an Actually, we cannot say "not accomplished by hardware". It is more like we didn't do the job well. See my answer below. >> interrupt with POSTED_INTR_NV. > > Yes. This is not enough. From L1's point of view, L2 is in VMX non-root mode. So we should emulate the posted interrupt in L0 correctly, say: 1. clear the ON bit; 2. ack the interrupt; 3. sync PIR to vIRR; 4. update RVI. Then let the hardware (virtual interrupt delivery) accomplish the interrupt injection. Forcing a vmexit is more like a trick. It's better to follow the hardware's behavior unless we cannot do it. > >> Would this be better? > > I think you do not even need a new bit. You can use KVM_REQ_EVENT and > (to complete my suggestion, which was not enough) do the above in > vmx_check_nested_events. > > Paolo Best regards, Yang
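The four steps Yang lists can be sketched in isolation. This is a simplified user-space model of the posted-interrupt descriptor, not KVM's actual pi_desc handling, and the interrupt ack (step 2) is outside the sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified posted-interrupt descriptor: 256 request bits plus the
 * outstanding-notification (ON) flag. */
struct pi_desc {
	uint32_t pir[8];
	uint32_t on;
};

/* Emulate delivery in L0 as described: clear ON, merge PIR into the
 * virtual IRR, and recompute RVI (highest pending vector, or -1). */
static int sync_pir_to_virr(struct pi_desc *pi, uint32_t virr[8])
{
	pi->on = 0;				/* 1. clear ON bit    */
	for (int i = 0; i < 8; i++) {		/* 3. sync PIR->vIRR  */
		virr[i] |= pi->pir[i];
		pi->pir[i] = 0;
	}
	for (int i = 7; i >= 0; i--)		/* 4. update RVI      */
		if (virr[i])
			return i * 32 + 31 - __builtin_clz(virr[i]);
	return -1;
}
```

With RVI updated, virtual interrupt delivery can inject the vector on the next entry, which is the "follow the hardware's behavior" path Yang prefers over forcing a vmexit.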
RE: [PATCH v4 2/6] KVM: nVMX: Enable nested virtualize x2apic mode
Wincy Van wrote on 2015-01-29: > On Thu, Jan 29, 2015 at 10:54 AM, Zhang, Yang Z > wrote: >>> -8646,7 +8750,8 @@ static void prepare_vmcs02(struct kvm_vcpu >>> *vcpu, struct vmcs12 *vmcs12) >>> else >>> vmcs_write64(APIC_ACCESS_ADDR, >>> >>> page_to_phys(vmx->nested.apic_access_page)); >>> - } else if >>> (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) { >>> + } else if >>> + (!(nested_cpu_has_virt_x2apic_mode(vmcs12)) >>> + && >>> + >>> + (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))) { >>> exec_control |= >> >> You don't load L2's apic_page in your patch correctly when x2apic >> mode is > used. Here is the right change for prepare_vmcs02()(maybe other place > also need change too): >> >> @@ -8585,7 +8585,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, >> struct vmcs12 *vmcs12) >> > CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)) >> exec_control |= >> vmcs12->secondary_vm_exec_control; >> >> - if (exec_control & >> SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) { + if >> (exec_control & (SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | + + >> SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) { >> /* >> * If translation failed, no matter: This >> feature > asks >> * to exit when accessing the given address, >> and if it @@ -8594,7 +8595,8 @@ static void prepare_vmcs02(struct > kvm_vcpu *vcpu, struct vmcs12 *vmcs12) >> */ >> if (!vmx->nested.apic_access_page) >> exec_control &= >> - ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; + >> ~ (SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | + + >> SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); >> else >> vmcs_write64(APIC_ACCESS_ADDR, >> page_to_phys(vmx->nested.apic_access_page)); >> > > I think we don't need to do this, if L1 enables x2apic mode, we have > already checked that the vmcs12->SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES > is 0. "exec_control |= vmcs12->secondary_vm_exec_control;" merged L1's > settings, including x2apic mode. 
the special case is > vm_need_virtualize_apic_accesses: if vm_need_virtualize_apic_accesses > returns true, nested_cpu_has_virt_x2apic_mode will prevent us from setting > the apic access bit. I just realized that we are talking about different things. I was thinking about VIRTUAL_APIC_PAGE_ADDR (my mistake :)), and it is already handled correctly. For APIC_ACCESS_ADDR, you are right. The current implementation can handle it well. > > > Thanks, > Wincy Best regards, Yang
RE: [PATCH v4 2/6] KVM: nVMX: Enable nested virtualize x2apic mode
Wincy Van wrote on 2015-01-28: > When L2 is using x2apic, we can use virtualize x2apic mode to gain higher > performance, especially in apicv case. > > This patch also introduces nested_vmx_check_apicv_controls for the nested > apicv patches. > > Signed-off-by: Wincy Van > --- > arch/x86/kvm/vmx.c | 114 > +++- > 1 files changed, 112 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 787f886..9d11a93 > 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -1108,6 +1108,11 @@ static inline bool nested_cpu_has_xsaves(struct > vmcs12 *vmcs12) > vmx_xsaves_supported(); > } > > +static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 > +*vmcs12) { > + return nested_cpu_has2(vmcs12, > +SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); > +} > + > static inline bool is_exception(u32 intr_info) { > return (intr_info & (INTR_INFO_INTR_TYPE_MASK | > INTR_INFO_VALID_MASK)) @@ -2395,6 +2400,7 @@ static __init void > nested_vmx_setup_ctls_msrs(void) > nested_vmx_secondary_ctls_low = 0; > nested_vmx_secondary_ctls_high &= > SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | > + SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | > SECONDARY_EXEC_WBINVD_EXITING | > SECONDARY_EXEC_XSAVES; > > @@ -4155,6 +4161,52 @@ static void > __vmx_enable_intercept_for_msr(unsigned long *msr_bitmap, > } > } > > +/* > + * If a msr is allowed by L0, we should check whether it is allowed by L1. > + * The corresponding bit will be cleared unless both of L0 and L1 allow it. > + */ > +static void nested_vmx_disable_intercept_for_msr(unsigned long > *msr_bitmap_l1, > + unsigned long > *msr_bitmap_nested, > + u32 msr, int type) { > + int f = sizeof(unsigned long); > + > + if (!cpu_has_vmx_msr_bitmap()) { > + WARN_ON(1); > + return; > + } > + > + /* > +* See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals > +* have the write-low and read-high bitmap offsets the wrong way > round. > +* We can control MSRs 0x-0x1fff and > 0xc000-0xc0001fff. 
> +*/ > + if (msr <= 0x1fff) { > + if (type & MSR_TYPE_R && > + !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) > + /* read-low */ > + __clear_bit(msr, msr_bitmap_nested + 0x000 / f); > + > + if (type & MSR_TYPE_W && > + !test_bit(msr, msr_bitmap_l1 + 0x800 / f)) > + /* write-low */ > + __clear_bit(msr, msr_bitmap_nested + 0x800 / f); > + > + } else if ((msr >= 0xc000) && (msr <= 0xc0001fff)) { > + msr &= 0x1fff; > + if (type & MSR_TYPE_R && > + !test_bit(msr, msr_bitmap_l1 + 0x400 / f)) > + /* read-high */ > + __clear_bit(msr, msr_bitmap_nested + 0x400 / f); > + > + if (type & MSR_TYPE_W && > + !test_bit(msr, msr_bitmap_l1 + 0xc00 / f)) > + /* write-high */ > + __clear_bit(msr, msr_bitmap_nested + 0xc00 / f); > + > + } > +} > + > static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only) { > if (!longmode_only) > @@ -8350,7 +8402,59 @@ static int > nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu, static > inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, >struct vmcs12 > *vmcs12) { > - return false; > + struct page *page; > + unsigned long *msr_bitmap; > + > + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) > + return false; > + > + page = nested_get_page(vcpu, vmcs12->msr_bitmap); > + if (!page) { > + WARN_ON(1); > + return false; > + } > + msr_bitmap = (unsigned long *)kmap(page); > + if (!msr_bitmap) { > + nested_release_page_clean(page); > + WARN_ON(1); > + return false; > + } > + > + if (nested_cpu_has_virt_x2apic_mode(vmcs12)) { > + /* TPR is allowed */ > + nested_vmx_disable_intercept_for_msr(msr_bitmap, > + vmx_msr_bitmap_nested, > + APIC_BASE_MSR + (APIC_TASKPRI >> > 4), > + MSR_TYPE_R | MSR_TYPE_W); > + } else > + __vmx_enable_intercept_for_msr( > + vmx_msr_bitmap_nested, > + APIC_BASE_MSR + (APIC_TASKPRI >> > 4), > + MSR_TYPE_R | MSR_TYPE_W); > + kunmap(page); > +
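The offsets used in nested_vmx_disable_intercept_for_msr above follow the VMX MSR-bitmap page layout: read-low at 0x000, read-high at 0x400, write-low at 0x800, write-high at 0xc00, one bit per MSR within each 1 KiB quarter. A sketch of the bit-index computation (an illustrative helper, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Bit index within the 4 KiB MSR-bitmap page for a given MSR and access
 * type, or -1 if the MSR is outside the two controllable ranges. */
static long msr_bitmap_bit(uint32_t msr, int is_write)
{
	long base;

	if (msr <= 0x1fff)
		base = is_write ? 0x800 : 0x000;	/* low MSR range  */
	else if (msr >= 0xc0000000u && msr <= 0xc0001fffu)
		base = is_write ? 0xc00 : 0x400;	/* high MSR range */
	else
		return -1;

	return base * 8 + (msr & 0x1fff);	/* byte base -> bit index */
}
```

This is why the patch masks with `msr & 0x1fff` for the high range: only the low 13 bits select the bit within each quarter of the page.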
RE: [PATCH v4 0/6] KVM: nVMX: Enable nested apicv support
Wincy Van wrote on 2015-01-28: > v1 ---> v2: > Use spin lock to ensure vmcs12 is safe when doing nested > posted interrupt delivery. > v2 ---> v3: > 1. Add a new field in nested_vmx to avoid the spin lock in v2. > 2. Drop send eoi to L1 when doing nested interrupt delivery. > 3. Use hardware MSR bitmap to enable nested virtualize x2apic > mode. > v3 ---> v4: > 1. Optimize nested msr bitmap merging. > 2. Allocate nested msr bitmap only when nested == 1. > 3. Inline the nested vmx control checking functions. This version looks good to me. One minor comment: the EXIT_REASON_APIC_WRITE vmexit is introduced by apic register virtualization, not virtual interrupt delivery, so it's better to add it in the 4th patch, not the 5th. (If there are no other comments, I guess Paolo can help do it when applying.) Reviewed-by: Yang Zhang > Wincy Van (6): > KVM: nVMX: Use hardware MSR bitmap > KVM: nVMX: Enable nested virtualize x2apic mode > KVM: nVMX: Make nested control MSRs per-cpu > KVM: nVMX: Enable nested apic register virtualization > KVM: nVMX: Enable nested virtual interrupt delivery > KVM: nVMX: Enable nested posted interrupt processing > arch/x86/kvm/vmx.c | 580 > +++- 1 files changed, > 480 insertions(+), 100 deletions(-) Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: > On Wed, Jan 28, 2015 at 7:25 PM, Zhang, Yang Z > wrote: >> Wincy Van wrote on 2015-01-28: >>> On Wed, Jan 28, 2015 at 4:05 PM, Zhang, Yang Z >>> >>> wrote: >>>>> @@ -8344,7 +8394,68 @@ static int >>>>> nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu, static >>>>> inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, >>>>>struct vmcs12 >>>>> *vmcs12) { - return false; + struct page *page; + >>>>> unsigned long *msr_bitmap; + + if >>>>> (!nested_cpu_has_virt_x2apic_mode(vmcs12)) + return >>>>> false; + + page = nested_get_page(vcpu, vmcs12->msr_bitmap); > + >>>>> if (!page) { + WARN_ON(1); + > return >>>>> false; + } + msr_bitmap = (unsigned long *)kmap(page); + >>>>> if (!msr_bitmap) { + >>>>> nested_release_page_clean(page); + WARN_ON(1); + >>>>> return false; + } + + > memset(vmx_msr_bitmap_nested, >>>>> 0xff, PAGE_SIZE); + + if >>>>> (nested_cpu_has_virt_x2apic_mode(vmcs12)) + /* TPR is >>>>> allowed */ + nested_vmx_disable_intercept_for_msr(msr_bitmap, + >>>>> vmx_msr_bitmap_nested, + APIC_BASE_MSR + (APIC_TASKPRI >>>>>>> 4), + MSR_TYPE_R | MSR_TYPE_W); >>>> >>>> I didn't understand what this function does? Per my understanding, >>>> you only >>> need to set the (vmx_msr_bitmap_nested = vmcs01->msr_bitmap | >>> vmcs12->msr_bitmap) and inject the nested vmexit to L1 if the bit >>> vmcs12->in msr_bitmap is setting. Am I missing some patches? >>> >>> In the beginning, I want to do "vmcs01->msr_bitmap | >>> vmcs12->msr_bitmap", but I remember that there isn't a instruction >>> vmcs12->to >>> do a bit or operation in two pages effectively, so I do the bit or >>> operation in nested_vmx_disable_intercept_for_msr. If the hardware >>> do not support this, I think it is faster if we deal with the bits on >>> demand. >>> nested_vmx_merge_msr_bitmap is used to merge L0's and L1's bitmaps, >>> any features can put their logic here. 
>> You construct the nested_msr_bitmap based on vmcs12->msr_bitmap, what >> happens if vmcs01->msr_bitmap wants to trap this msr? >> > > If L0 wants to intercept a msr, we should set > vmx_msr_bitmap_legacy(_x2apic) and vmx_msr_bitmap_longmode(_x2apic), > and those bitmaps should only be loaded in non-nested entry. > Currently we only clear the corresponding bits if L1 enables > virtualize x2apic mode, all the other bits are all set. On nested > entry, we load nested_msr_bitmap, on nested vmexit, we will restore L0's > bitmap. > In the nested entry, L0 only cares about L2's msr access, not L1's. I > think that in nested entry, nested_msr_bitmap is "vmcs01->msr_bitmap" > as you mentioned above. Mmm... I think a bit set in vmcs01->msr_bitmap means that whenever the VCPU (no matter whether it is in nested guest mode or not) accesses this msr, it should cause a vmexit, not only when L1 does. That's why we need to construct the vmcs02->msr_bitmap based on (vmcs01->msr_bitmap ORed with vmcs12->msr_bitmap). > > Thanks, > Wincy Best regards, Yang
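Yang's point, that the bitmap used while L2 runs must trap an MSR if either L0's (vmcs01) or L1's (vmcs12) bitmap traps it, amounts to a page-wide OR. A minimal sketch, with a 4 KiB bitmap page modeled as a plain array:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_WORDS (4096 / sizeof(unsigned long))

/* A set bit means "intercept": keep the intercept if either L0 or L1
 * wants it, i.e. vmcs02 = vmcs01 | vmcs12, word by word. */
static void merge_msr_bitmaps(const unsigned long *vmcs01,
			      const unsigned long *vmcs12,
			      unsigned long *vmcs02)
{
	for (size_t i = 0; i < PAGE_WORDS; i++)
		vmcs02[i] = vmcs01[i] | vmcs12[i];
}
```

Since there is no single instruction to OR two whole pages, the loop (or an on-demand per-MSR merge, as Wincy's nested_vmx_disable_intercept_for_msr does) is the practical choice; the invariant either way is that no intercept requested by L0 may be lost.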
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-24: > When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance, especially in the apicv case. > > This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. > > Signed-off-by: Wincy Van > --- ...snip... > static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only) { > if (!longmode_only) > @@ -8344,7 +8394,68 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu, static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, > struct vmcs12 *vmcs12) { > - return false; > + struct page *page; > + unsigned long *msr_bitmap; > + > + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) > + return false; > + > + page = nested_get_page(vcpu, vmcs12->msr_bitmap); > + if (!page) { > + WARN_ON(1); > + return false; > + } > + msr_bitmap = (unsigned long *)kmap(page); > + if (!msr_bitmap) { > + nested_release_page_clean(page); > + WARN_ON(1); > + return false; > + } > + > + memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE); > + > + if (nested_cpu_has_virt_x2apic_mode(vmcs12)) > + /* TPR is allowed */ > + nested_vmx_disable_intercept_for_msr(msr_bitmap, > + vmx_msr_bitmap_nested, > + APIC_BASE_MSR + (APIC_TASKPRI >> 4), > + MSR_TYPE_R | MSR_TYPE_W); I didn't understand what this function does? Per my understanding, you only need to set (vmx_msr_bitmap_nested = vmcs01->msr_bitmap | vmcs12->msr_bitmap) and inject the nested vmexit to L1 if the bit in vmcs12->msr_bitmap is set. Am I missing some patches? Best regards, Yang
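For context on the APIC_BASE_MSR + (APIC_TASKPRI >> 4) expression in the quoted hunk: in x2apic mode each 16-byte-aligned APIC register is exposed as one MSR starting at 0x800, so the TPR (xAPIC offset 0x80) becomes MSR 0x808. A small sketch of that mapping (constants per the Intel SDM; the helper name is hypothetical):

```c
/* The x2APIC MSR range starts at 0x800; each 16-byte APIC register
 * maps to one MSR, so the MSR index is 0x800 + (offset >> 4). */
#define APIC_BASE_MSR 0x800
#define APIC_TASKPRI  0x80   /* TPR offset in the xAPIC MMIO page */

static unsigned int x2apic_msr(unsigned int apic_reg_offset)
{
    return APIC_BASE_MSR + (apic_reg_offset >> 4);
}
```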
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: > On Wed, Jan 28, 2015 at 8:33 PM, Zhang, Yang Z wrote: >>>> You are right, but this does not fit all the cases; we should customize the nested_msr_bitmap. e.g. currently L0 wants to intercept some of the x2apic msrs' reads: >>>> if (enable_apicv) { >>>> for (msr = 0x800; msr <= 0x8ff; msr++) >>>> vmx_disable_intercept_msr_read_x2apic(msr); >>>> /* According SDM, in x2apic mode, the whole id reg is used. >>>> * But in KVM, it only use the highest eight bits. Need to >>>> * intercept it */ >>>> vmx_enable_intercept_msr_read_x2apic(0x802); /* TMCCT */ >>>> vmx_enable_intercept_msr_read_x2apic(0x839); /* TPR */ >>>> vmx_disable_intercept_msr_write_x2apic(0x808); /* EOI */ >>>> vmx_disable_intercept_msr_write_x2apic(0x80b); /* SELF-IPI */ >>>> vmx_disable_intercept_msr_write_x2apic(0x83f); >>>> } >>>> But L1 may not want this. So I think we had better deal with the >>> Actually, from L0's point of view, it is totally unaware of L2. The only thing L0 is aware of is that the CPU should follow L0's configuration when the VCPU is running. So if L0 wants to trap a msr, then a read of this msr should cause a vmexit unconditionally, no matter who is running (L1, L2, L3, ...). >>>> msrs separately; there is not a common way that suits all the cases. If other features want to intercept a msr on nested entry, they can put the custom code in nested_vmx_merge_msr_bitmap. >>> Yes, if other features want to do it on 'nested' entry, they can fill in nested_vmx_merge_msr_bitmap. But in the non-nested case it should be our responsibility to handle it correctly; how about adding the following check: >>> if (type & MSR_TYPE_R && !test_bit(msr, vmcs01_msr_bitmap) && >>> !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) >>> __clear_bit(msr, msr_bitmap_nested + 0x000 / f);
>> Anyway, this is not necessary for your current patch. We can consider it later if other features really need it. > Yep, I know what you mean now: for other msrs whose accesses are not forwarded by a mechanism like the virtual-apic page, we should intercept them unconditionally. I think we should ensure the msr can be allowed before calling nested_vmx_disable_intercept_for_msr; if L0 wants to intercept it, just do not call nested_vmx_disable_intercept_for_msr. Yes, this is a solution. But I prefer to handle it in the nested code path, since others may not be familiar with nested and may block it by mistake. > !test_bit(msr, vmcs01_msr_bitmap) will introduce a problem: some of the msrs will be affected by vmcs01_msr_bitmap (TMCCT and TPR, for example). Intercepting reads of these msrs is okay, but it is not efficient. TMCCT is always trapped by most VMMs. I don't think TPR is trapped in KVM. > Thanks, > Wincy Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: > On Wed, Jan 28, 2015 at 7:52 PM, Zhang, Yang Z wrote: >>> If L0 wants to intercept a msr, we should set vmx_msr_bitmap_legacy(_x2apic) and vmx_msr_bitmap_longmode(_x2apic), and those bitmaps should only be loaded on a non-nested entry. Currently we only clear the corresponding bits if L1 enables virtualize x2apic mode; all the other bits are set. On nested entry we load nested_msr_bitmap, and on nested vmexit we restore L0's bitmap. On nested entry, L0 only cares about L2's msr accesses, not L1's. I think that on nested entry, nested_msr_bitmap is the "vmcs01->msr_bitmap" you mentioned above. >> Mmm... I think a bit set in vmcs01->msr_bitmap means that whenever the VCPU (no matter whether it is in nested guest mode or not) accesses this msr, it should cause a vmexit, not only when L1 does. That's why we need to construct vmcs02->msr_bitmap based on (vmcs01->msr_bitmap ORed with vmcs12->msr_bitmap). > You are right, but this does not fit all the cases; we should customize the nested_msr_bitmap. e.g. currently L0 wants to intercept some of the x2apic msrs' reads: > if (enable_apicv) { > for (msr = 0x800; msr <= 0x8ff; msr++) > vmx_disable_intercept_msr_read_x2apic(msr); > /* According SDM, in x2apic mode, the whole id reg is used. > * But in KVM, it only use the highest eight bits. Need to > * intercept it */ > vmx_enable_intercept_msr_read_x2apic(0x802); /* TMCCT */ > vmx_enable_intercept_msr_read_x2apic(0x839); /* TPR */ > vmx_disable_intercept_msr_write_x2apic(0x808); /* EOI */ > vmx_disable_intercept_msr_write_x2apic(0x80b); /* SELF-IPI */ > vmx_disable_intercept_msr_write_x2apic(0x83f); > } > But L1 may not want this. So I think we had better deal with the Actually, from L0's point of view, it is totally unaware of L2. The only thing L0 is aware of is that the CPU should follow L0's configuration when the VCPU is running.
So if L0 wants to trap a msr, then a read of this msr should cause a vmexit unconditionally, no matter who is running (L1, L2, L3, ...). > msrs separately; there is not a common way that suits all the cases. If other features want to intercept a msr on nested entry, they can put the custom code in nested_vmx_merge_msr_bitmap. Yes, if other features want to do it on 'nested' entry, they can fill in nested_vmx_merge_msr_bitmap. But in the non-nested case it should be our responsibility to handle it correctly; how about adding the following check: if (type & MSR_TYPE_R && !test_bit(msr, vmcs01_msr_bitmap) && !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) __clear_bit(msr, msr_bitmap_nested + 0x000 / f); > Thanks, > Wincy Best regards, Yang
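Yang's proposed guard can be sketched as follows. The test_bit/__clear_bit stand-ins are included so the snippet is self-contained, and every other name is hypothetical; the real MSR bitmap also splits read/write and low/high MSR ranges into quadrants of the page (the `+ 0x000 / f` offsets in the quoted check), which is omitted here:

```c
#include <stddef.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Stand-ins for the kernel's test_bit()/__clear_bit(). */
static int test_bit(unsigned int nr, const unsigned long *addr)
{
    return (addr[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1;
}

static void __clear_bit(unsigned int nr, unsigned long *addr)
{
    addr[nr / BITS_PER_LONG] &= ~(1UL << (nr % BITS_PER_LONG));
}

/* Let L2 access `msr` directly only when neither L0 (vmcs01 bitmap)
 * nor L1 (vmcs12 bitmap) asks to intercept it; otherwise the
 * intercept bit in the merged (vmcs02) bitmap stays set. */
static void nested_allow_msr_if_unintercepted(unsigned int msr,
                                              const unsigned long *vmcs01_msr_bitmap,
                                              const unsigned long *msr_bitmap_l1,
                                              unsigned long *msr_bitmap_nested)
{
    if (!test_bit(msr, vmcs01_msr_bitmap) && !test_bit(msr, msr_bitmap_l1))
        __clear_bit(msr, msr_bitmap_nested);
}
```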
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: > On Wed, Jan 28, 2015 at 4:05 PM, Zhang, Yang Z wrote: >> ...snip (quoted nested_vmx_merge_msr_bitmap hunk)... >> I didn't understand what this function does? Per my understanding, you only need to set (vmx_msr_bitmap_nested = vmcs01->msr_bitmap | vmcs12->msr_bitmap) and inject the nested vmexit to L1 if the bit in vmcs12->msr_bitmap is set. Am I missing some patches? > In the beginning, I wanted to do "vmcs01->msr_bitmap | vmcs12->msr_bitmap", but I remember that there isn't an instruction to OR two pages together efficiently, so I do the OR operation in nested_vmx_disable_intercept_for_msr. If the hardware does not support this, I think it is faster if we deal with the bits on demand. nested_vmx_merge_msr_bitmap is used to merge L0's and L1's bitmaps; any feature can put its logic here. You construct the nested_msr_bitmap based on vmcs12->msr_bitmap; what happens if vmcs01->msr_bitmap wants to trap this msr? > If there is a faster way, please teach me how to do it : ) You are right. Interception should be much faster.
> Thanks, > Wincy Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Zhang, Yang Z wrote on 2015-01-28: > Wincy Van wrote on 2015-01-28: >> On Wed, Jan 28, 2015 at 7:52 PM, Zhang, Yang Z wrote: >>>> If L0 wants to intercept a msr, we should set vmx_msr_bitmap_legacy(_x2apic) and vmx_msr_bitmap_longmode(_x2apic), and those bitmaps should only be loaded on a non-nested entry. Currently we only clear the corresponding bits if L1 enables virtualize x2apic mode; all the other bits are set. On nested entry we load nested_msr_bitmap, and on nested vmexit we restore L0's bitmap. On nested entry, L0 only cares about L2's msr accesses, not L1's. I think that on nested entry, nested_msr_bitmap is the "vmcs01->msr_bitmap" you mentioned above. >>> Mmm... I think a bit set in vmcs01->msr_bitmap means that whenever the VCPU (no matter whether it is in nested guest mode or not) accesses this msr, it should cause a vmexit, not only when L1 does. That's why we need to construct vmcs02->msr_bitmap based on (vmcs01->msr_bitmap ORed with vmcs12->msr_bitmap). >> You are right, but this does not fit all the cases; we should customize the nested_msr_bitmap. e.g. currently L0 wants to intercept some of the x2apic msrs' reads: >> if (enable_apicv) { >> for (msr = 0x800; msr <= 0x8ff; msr++) >> vmx_disable_intercept_msr_read_x2apic(msr); >> /* According SDM, in x2apic mode, the whole id reg is used. >> * But in KVM, it only use the highest eight bits. Need to >> * intercept it */ >> vmx_enable_intercept_msr_read_x2apic(0x802); /* TMCCT */ >> vmx_enable_intercept_msr_read_x2apic(0x839); /* TPR */ >> vmx_disable_intercept_msr_write_x2apic(0x808); /* EOI */ >> vmx_disable_intercept_msr_write_x2apic(0x80b); /* SELF-IPI */ >> vmx_disable_intercept_msr_write_x2apic(0x83f); >> } >> But L1 may not want this. So I think we had better deal with the > Actually, from L0's point of view, it is totally unaware of L2.
The only thing L0 is aware of is that the CPU should follow L0's configuration when the VCPU is running. So if L0 wants to trap a msr, then a read of this msr should cause a vmexit unconditionally, no matter who is running (L1, L2, L3, ...). >> msrs separately; there is not a common way that suits all the cases. If other features want to intercept a msr on nested entry, they can put the custom code in nested_vmx_merge_msr_bitmap. > Yes, if other features want to do it on 'nested' entry, they can fill in nested_vmx_merge_msr_bitmap. But in the non-nested case it should be our responsibility to handle it correctly; how about adding the following check: > if (type & MSR_TYPE_R && !test_bit(msr, vmcs01_msr_bitmap) && > !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) > __clear_bit(msr, msr_bitmap_nested + 0x000 / f); Anyway, this is not necessary for your current patch. We can consider it later if other features really need it. >> Thanks, >> Wincy > Best regards, > Yang Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: > On Wed, Jan 28, 2015 at 4:00 PM, Zhang, Yang Z wrote: >>> @@ -5812,13 +5813,18 @@ static __init int hardware_setup(void) >>> (unsigned long *)__get_free_page(GFP_KERNEL); >>> if (!vmx_msr_bitmap_longmode_x2apic) >>> goto out4; >>> + >>> + vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); >>> + if (!vmx_msr_bitmap_nested) >>> + goto out5; >>> + >> Since nested virtualization is off by default, it's better to allocate the page only when nested is true. Maybe adding the following check is better: >> if (nested) { vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); if (!vmx_msr_bitmap_nested) goto out5; } > Agreed. Will do. >> ...snip... >>> + >>> +/* >>> + * Merge L0's and L1's MSR bitmap, return false to indicate that >>> + * we do not use the hardware. >>> + */ >>> +static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, >>> + struct vmcs12 *vmcs12) { >>> + return false; >>> +} >>> + >> The following patches have nothing to do with the MSR control. Why leave the function empty here? > No. In the patch "Enable nested virtualize x2apic mode", we will return false if L1 disabled virt_x2apic_mode, and then the hardware MSR-bitmap control is disabled. This is faster than rebuilding the vmx_msr_bitmap_nested. This function returns false here to indicate that we do not use the hardware. Since it is not only virtualize x2apic mode that uses this, and other features may use it too, I think it isn't suitable to introduce this function in the other patches. Yes, rebuilding is more costly. But your current implementation cannot leverage the APICv feature correctly. I replied in another thread.
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-24: > Currently, if L1 enables MSR_BITMAP, we emulate this feature: all of L2's msr accesses are intercepted by L0. Since many features like virtualize x2apic mode have complicated logic and are difficult for us to emulate, we should use the hardware and merge the bitmaps. > > This patch introduces nested_vmx_merge_msr_bitmap for future use. > > Signed-off-by: Wincy Van > --- > arch/x86/kvm/vmx.c | 71 > +++ > 1 files changed, 60 insertions(+), 11 deletions(-) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index c987374..36d0724 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -798,6 +798,7 @@ static unsigned long *vmx_msr_bitmap_legacy; > static unsigned long *vmx_msr_bitmap_longmode; > static unsigned long *vmx_msr_bitmap_legacy_x2apic; > static unsigned long *vmx_msr_bitmap_longmode_x2apic; > +static unsigned long *vmx_msr_bitmap_nested; > static unsigned long *vmx_vmread_bitmap; > static unsigned long *vmx_vmwrite_bitmap; > > @@ -5812,13 +5813,18 @@ static __init int hardware_setup(void) > (unsigned long *)__get_free_page(GFP_KERNEL); > if (!vmx_msr_bitmap_longmode_x2apic) > goto out4; > + > + vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); > + if (!vmx_msr_bitmap_nested) > + goto out5; > + Since nested virtualization is off by default, it's better to allocate the page only when nested is true. Maybe adding the following check is better: if (nested) { vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); if (!vmx_msr_bitmap_nested) goto out5; } ...snip... > + > +/* > + * Merge L0's and L1's MSR bitmap, return false to indicate that > + * we do not use the hardware. > + */ > +static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, > + struct vmcs12 *vmcs12) { > + return false; > +} > + The following patches have nothing to do with the MSR control. Why leave the function empty here?
Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Zhang, Yang Z wrote on 2015-01-28: > Wincy Van wrote on 2015-01-24: >> When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance, especially in the apicv case. >> >> This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Sorry, replied on the wrong thread. :( >> >> Signed-off-by: Wincy Van >> --- > ...snip (quoted nested_vmx_merge_msr_bitmap hunk)... > I didn't understand what this function does? Per my understanding, you only need to set (vmx_msr_bitmap_nested = vmcs01->msr_bitmap | vmcs12->msr_bitmap) and inject the nested vmexit to L1 if the bit in vmcs12->msr_bitmap is set. Am I missing some patches? > Best regards, > Yang Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-24: Currently, if L1 enables MSR_BITMAP, we will emulate this feature, all of L2's msr access is intercepted by L0. Since many features like virtualize x2apic mode has a complicated logic and it is difficult for us to emulate, we should use hardware and merge the bitmap. This patch introduces nested_vmx_merge_msr_bitmap for future use. Signed-off-by: Wincy Van fanwenyi0...@gmail.com --- arch/x86/kvm/vmx.c | 71 +++ 1 files changed, 60 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index c987374..36d0724 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -798,6 +798,7 @@ static unsigned long *vmx_msr_bitmap_legacy; static unsigned long *vmx_msr_bitmap_longmode; static unsigned long *vmx_msr_bitmap_legacy_x2apic; static unsigned long *vmx_msr_bitmap_longmode_x2apic; +static unsigned long *vmx_msr_bitmap_nested; static unsigned long *vmx_vmread_bitmap; static unsigned long *vmx_vmwrite_bitmap; @@ -5812,13 +5813,18 @@ static __init int hardware_setup(void) (unsigned long *)__get_free_page(GFP_KERNEL); if (!vmx_msr_bitmap_longmode_x2apic) goto out4; + + vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); + if (!vmx_msr_bitmap_nested) + goto out5; + Since the nested virtualization is off by default. It's better to allocate the page only when nested is true. Maybe adding the following check is better: if (nested) { vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); if (!vmx_msr_bitmap_nested) goto out5; } ...snip... + +/* + * Merge L0's and L1's MSR bitmap, return false to indicate that + * we do not use the hardware. + */ +static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, + struct vmcs12 *vmcs12) { + return false; +} + The following patches have nothing to do with the MSR control. Why leave the function empty here? Best regards, Yang N�r��yb�X��ǧv�^�){.n�+{zX����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf��^jǫy�m��@A�a��� 0��h���i
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Zhang, Yang Z wrote on 2015-01-28: Wincy Van wrote on 2015-01-24: When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance, especially in apicv case. This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Sorry, replied on the wrong thread.:( Signed-off-by: Wincy Van fanwenyi0...@gmail.com --- ...snip... static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only) { if (!longmode_only) @@ -8344,7 +8394,68 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu, static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { - return false; + struct page *page; + unsigned long *msr_bitmap; + + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) + return false; + + page = nested_get_page(vcpu, vmcs12-msr_bitmap); + if (!page) { + WARN_ON(1); + return false; + } + msr_bitmap = (unsigned long *)kmap(page); + if (!msr_bitmap) { + nested_release_page_clean(page); + WARN_ON(1); + return false; + } + + memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE); + + if (nested_cpu_has_virt_x2apic_mode(vmcs12)) + /* TPR is allowed */ + nested_vmx_disable_intercept_for_msr(msr_bitmap, + vmx_msr_bitmap_nested, + APIC_BASE_MSR + (APIC_TASKPRI 4), + MSR_TYPE_R | MSR_TYPE_W); I didn't understand what this function does? Per my understanding, you only need to set the (vmx_msr_bitmap_nested = vmcs01-msr_bitmap | vmcs12-msr_bitmap) and inject the nested vmexit to L1 if the bit in vmcs12-msr_bitmap is setting. Am I missing some patches? Best regards, Yang Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 4:00 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: @@ -5812,13 +5813,18 @@ static __init int hardware_setup(void) (unsigned long *)__get_free_page(GFP_KERNEL); if (!vmx_msr_bitmap_longmode_x2apic) goto out4; + + vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); + if (!vmx_msr_bitmap_nested) + goto out5; + Since the nested virtualization is off by default. It's better to allocate the page only when nested is true. Maybe adding the following check is better: if (nested) { vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); if (!vmx_msr_bitmap_nested) goto out5; } Agreed. Will do. ...snip... + +/* + * Merge L0's and L1's MSR bitmap, return false to indicate that + * we do not use the hardware. + */ +static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, + struct vmcs12 *vmcs12) { + return false; +} + The following patches have nothing to do with the MSR control. Why leave the function empty here? No. In patch Enable nested virtualize x2apic mode, we will return false if L1 disabled virt_x2apic_mode, then the hardware MSR-bitmap control is disabled. This is faster than rebuilding the vmx_msr_bitmap_nested. This function returns false here to indicate that we do not use the hardware. Since It is not only virtualize x2apic mode using this, other features may use this either. I think it isn't suitable to introduce this function in other patches. Yes, rebuilding is more costly. But your current implementation cannot leverage the APICv feature correctly. I replied in another thread. Best regards, Yang Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Zhang, Yang Z wrote on 2015-01-28: Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 7:52 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: If L0 wants to intercept a msr, we should set vmx_msr_bitmap_legacy(_x2apic) and vmx_msr_bitmap_longmode(_x2apic), and that bitmaps should only be loaded in non-nested entry. Currently we only clear the corresponding bits if L1 enables virtualize x2apic mode, all the other bits are all set. On nested entry, we load nested_msr_bitmap, on nested vmexit, we will restore L0's bitmap. In the nested entry, L0 only care about L2's msr access, not L1's. I think that in nested entry, nested_msr_bitmap is vmcs01-msr_bitmap as your mentioned above. Mmm... I think if the bit is setting in vmcs01-msr_bitmap means whenever the VCPU(no matter in nested guest mode or not) accesses this msr should cause vmexit, not only L1. That's why need to construct the vmcs02-msr_bitmap based on (vmcs01-msr_bitmap ORed vmcs12-msr_bitmap). You are right, but this is not fit for all the cases, we should custom the nested_msr_bitmap. e.g. Currently L0 wants to intercept some of the x2apic msrs' reading: if (enable_apicv) { for (msr = 0x800; msr = 0x8ff; msr++) vmx_disable_intercept_msr_read_x2apic(msr); /* According SDM, in x2apic mode, the whole id reg is used. * But in KVM, it only use the highest eight bits. Need to * intercept it */ vmx_enable_intercept_msr_read_x2apic(0x802); /* TMCCT */ vmx_enable_intercept_msr_read_x2apic(0x839); /* TPR */ vmx_disable_intercept_msr_write_x2apic(0x808); /* EOI */ vmx_disable_intercept_msr_write_x2apic(0x80b); /* SELF-IPI */ vmx_disable_intercept_msr_write_x2apic(0x83f); } But L1 may not want this. So I think we are better to deal with the Actually, from L0's point, it is totally unaware of the L2. The only thing L0 aware is that the CPU should follow L0's configuration when VCPU is running. 
So if L0 wants to trap a msr, then the read operation to this msr should cause vmexit unconditionally no matter who is running(who means L1, L2, L3.). msrs seperately, there is not a common way suit for all the cases. If other features want to intercept a msr in nested entry, they can put the custom code in nested_vmx_merge_msr_bitmap. Yes, if other features want to do it in 'nested' entry, they can fill nested_vmx_merge_msr_bitmap. But if in non-nested case, it should be our responsibly to handle it correctly, how about add following check: if (type MSR_TYPE_R !test_bit(msr, vmcs01_msr_bitmap) !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) __clear_bit(msr, msr_bitmap_nested + 0x000 / f); Anyway, this is not necessary for your current patch. We can consider it later if there really have other features will use it. Thanks, Wincy Best regards, Yang Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 4:05 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: @@ -8344,7 +8394,68 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu, static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { - return false; + struct page *page; + unsigned long *msr_bitmap; + + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) + return false; + + page = nested_get_page(vcpu, vmcs12-msr_bitmap); + if (!page) { + WARN_ON(1); + return false; + } + msr_bitmap = (unsigned long *)kmap(page); + if (!msr_bitmap) { + nested_release_page_clean(page); + WARN_ON(1); + return false; + } + + memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE); + + if (nested_cpu_has_virt_x2apic_mode(vmcs12)) + /* TPR is allowed */ + nested_vmx_disable_intercept_for_msr(msr_bitmap, + vmx_msr_bitmap_nested, + APIC_BASE_MSR + (APIC_TASKPRI 4), + MSR_TYPE_R | MSR_TYPE_W); I didn't understand what this function does? Per my understanding, you only need to set the (vmx_msr_bitmap_nested = vmcs01-msr_bitmap | vmcs12-msr_bitmap) and inject the nested vmexit to L1 if the bit in vmcs12-msr_bitmap is setting. Am I missing some patches? In the beginning, I want to do vmcs01-msr_bitmap | vmcs12-msr_bitmap, but I remember that there isn't a instruction to do a bit or operation in two pages effectively, so I do the bit or operation in nested_vmx_disable_intercept_for_msr. If the hardware do not support this, I think it is faster if we deal with the bits on demand. nested_vmx_merge_msr_bitmap is used to merge L0's and L1's bitmaps, any features can put their logic here. You construct the nested_msr_bitmap based on vmcs12-msr_bitmap, what happens if vmcs01-msr_bitmap want to trap this msr? If there is a faster way, please teach me how to do it : ) You are right. Interception should be much faster. Thanks, Wincy Best regards, Yang Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 7:52 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: If L0 wants to intercept a msr, we should set vmx_msr_bitmap_legacy(_x2apic) and vmx_msr_bitmap_longmode(_x2apic), and that bitmaps should only be loaded in non-nested entry. Currently we only clear the corresponding bits if L1 enables virtualize x2apic mode, all the other bits are all set. On nested entry, we load nested_msr_bitmap, on nested vmexit, we will restore L0's bitmap. In the nested entry, L0 only care about L2's msr access, not L1's. I think that in nested entry, nested_msr_bitmap is vmcs01-msr_bitmap as your mentioned above. Mmm... I think if the bit is setting in vmcs01-msr_bitmap means whenever the VCPU(no matter in nested guest mode or not) accesses this msr should cause vmexit, not only L1. That's why need to construct the vmcs02-msr_bitmap based on (vmcs01-msr_bitmap ORed vmcs12-msr_bitmap). You are right, but this is not fit for all the cases, we should custom the nested_msr_bitmap. e.g. Currently L0 wants to intercept some of the x2apic msrs' reading: if (enable_apicv) { for (msr = 0x800; msr = 0x8ff; msr++) vmx_disable_intercept_msr_read_x2apic(msr); /* According SDM, in x2apic mode, the whole id reg is used. * But in KVM, it only use the highest eight bits. Need to * intercept it */ vmx_enable_intercept_msr_read_x2apic(0x802); /* TMCCT */ vmx_enable_intercept_msr_read_x2apic(0x839); /* TPR */ vmx_disable_intercept_msr_write_x2apic(0x808); /* EOI */ vmx_disable_intercept_msr_write_x2apic(0x80b); /* SELF-IPI */ vmx_disable_intercept_msr_write_x2apic(0x83f); } But L1 may not want this. So I think we are better to deal with the Actually, from L0's point, it is totally unaware of the L2. The only thing L0 aware is that the CPU should follow L0's configuration when VCPU is running. So if L0 wants to trap a msr, then the read operation to this msr should cause vmexit unconditionally no matter who is running(who means L1, L2, L3.). 
msrs seperately, there is not a common way suit for all the cases. If other features want to intercept a msr in nested entry, they can put the custom code in nested_vmx_merge_msr_bitmap. Yes, if other features want to do it in 'nested' entry, they can fill nested_vmx_merge_msr_bitmap. But if in non-nested case, it should be our responsibly to handle it correctly, how about add following check: if (type MSR_TYPE_R !test_bit(msr, vmcs01_msr_bitmap) !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) __clear_bit(msr, msr_bitmap_nested + 0x000 / f); Thanks, Wincy Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 8:33 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: You are right, but this is not fit for all the cases, we should custom the nested_msr_bitmap. e.g. Currently L0 wants to intercept some of the x2apic msrs' reading: if (enable_apicv) { for (msr = 0x800; msr = 0x8ff; msr++) vmx_disable_intercept_msr_read_x2apic(msr); /* According SDM, in x2apic mode, the whole id reg is used. * But in KVM, it only use the highest eight bits. Need to * intercept it */ vmx_enable_intercept_msr_read_x2apic(0x802); /* TMCCT */ vmx_enable_intercept_msr_read_x2apic(0x839); /* TPR */ vmx_disable_intercept_msr_write_x2apic(0x808); /* EOI */ vmx_disable_intercept_msr_write_x2apic(0x80b); /* SELF-IPI */ vmx_disable_intercept_msr_write_x2apic(0x83f); } But L1 may not want this. So I think we are better to deal with the Actually, from L0's point, it is totally unaware of the L2. The only thing L0 aware is that the CPU should follow L0's configuration when VCPU is running. So if L0 wants to trap a msr, then the read operation to this msr should cause vmexit unconditionally no matter who is running(who means L1, L2, L3.). msrs seperately, there is not a common way suit for all the cases. If other features want to intercept a msr in nested entry, they can put the custom code in nested_vmx_merge_msr_bitmap. Yes, if other features want to do it in 'nested' entry, they can fill nested_vmx_merge_msr_bitmap. But if in non-nested case, it should be our responsibly to handle it correctly, how about add following check: if (type MSR_TYPE_R !test_bit(msr, vmcs01_msr_bitmap) !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) __clear_bit(msr, msr_bitmap_nested + 0x000 / f); Anyway, this is not necessary for your current patch. We can consider it later if there really have other features will use it. Yep, I know what you mean now, for other msrs which are not forwarded access by a mechanism like virtual-apic page, we should intercept it unconditionally. 
I think we should ensure the msr is allowed before calling nested_vmx_disable_intercept_for_msr; if L0 wants to intercept it, just do not call nested_vmx_disable_intercept_for_msr. Yes, this is a solution. But I prefer to handle it in the nested code path, since others may not be familiar with nested and may block it by mistake. !test_bit(msr, vmcs01_msr_bitmap) will introduce a problem: some of the msrs will be affected by vmcs01_msr_bitmap (TMCCT and TPR, for example). Intercepting reads of these msrs is okay, but it is not efficient. TMCCT is always trapped by most VMMs. I don't think TPR is trapped in KVM. Thanks, Wincy Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-24: When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance, especially in apicv case. This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Signed-off-by: Wincy Van fanwenyi0...@gmail.com
---
...snip...

 static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
 {
 	if (!longmode_only)
@@ -8344,7 +8394,68 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu,
 static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
 					       struct vmcs12 *vmcs12)
 {
-	return false;
+	struct page *page;
+	unsigned long *msr_bitmap;
+
+	if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
+		return false;
+
+	page = nested_get_page(vcpu, vmcs12->msr_bitmap);
+	if (!page) {
+		WARN_ON(1);
+		return false;
+	}
+	msr_bitmap = (unsigned long *)kmap(page);
+	if (!msr_bitmap) {
+		nested_release_page_clean(page);
+		WARN_ON(1);
+		return false;
+	}
+
+	memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE);
+
+	if (nested_cpu_has_virt_x2apic_mode(vmcs12))
+		/* TPR is allowed */
+		nested_vmx_disable_intercept_for_msr(msr_bitmap,
+				vmx_msr_bitmap_nested,
+				APIC_BASE_MSR + (APIC_TASKPRI >> 4),
+				MSR_TYPE_R | MSR_TYPE_W);

I didn't understand what this function does. Per my understanding, you only need to set (vmx_msr_bitmap_nested = vmcs01->msr_bitmap | vmcs12->msr_bitmap) and inject the nested vmexit to L1 if the bit in vmcs12->msr_bitmap is set. Am I missing some patches? Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 7:25 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 4:05 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote:

@@ -8344,7 +8394,68 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu,
 static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
 					       struct vmcs12 *vmcs12)
 {
-	return false;
+	struct page *page;
+	unsigned long *msr_bitmap;
+
+	if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
+		return false;
+
+	page = nested_get_page(vcpu, vmcs12->msr_bitmap);
+	if (!page) {
+		WARN_ON(1);
+		return false;
+	}
+	msr_bitmap = (unsigned long *)kmap(page);
+	if (!msr_bitmap) {
+		nested_release_page_clean(page);
+		WARN_ON(1);
+		return false;
+	}
+
+	memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE);
+
+	if (nested_cpu_has_virt_x2apic_mode(vmcs12))
+		/* TPR is allowed */
+		nested_vmx_disable_intercept_for_msr(msr_bitmap,
+				vmx_msr_bitmap_nested,
+				APIC_BASE_MSR + (APIC_TASKPRI >> 4),
+				MSR_TYPE_R | MSR_TYPE_W);

I didn't understand what this function does. Per my understanding, you only need to set (vmx_msr_bitmap_nested = vmcs01->msr_bitmap | vmcs12->msr_bitmap) and inject the nested vmexit to L1 if the bit in vmcs12->msr_bitmap is set. Am I missing some patches?

In the beginning, I wanted to do vmcs01->msr_bitmap | vmcs12->msr_bitmap, but I remember that there isn't an instruction to do a bit-or operation on two pages effectively, so I do the bit-or operation in nested_vmx_disable_intercept_for_msr. If the hardware does not support this, I think it is faster if we deal with the bits on demand. nested_vmx_merge_msr_bitmap is used to merge L0's and L1's bitmaps; any feature can put its logic there.

You construct the nested_msr_bitmap based on vmcs12->msr_bitmap; what happens if vmcs01->msr_bitmap wants to trap this msr?
If L0 wants to intercept a msr, we should set vmx_msr_bitmap_legacy(_x2apic) and vmx_msr_bitmap_longmode(_x2apic), and those bitmaps should only be loaded on non-nested entry. Currently we only clear the corresponding bits if L1 enables virtualize x2apic mode; all the other bits are set. On nested entry, we load nested_msr_bitmap; on nested vmexit, we restore L0's bitmap. On nested entry, L0 only cares about L2's msr accesses, not L1's. I think that on nested entry, nested_msr_bitmap is vmcs01->msr_bitmap, as you mentioned above. Mmm... I think if the bit is set in vmcs01->msr_bitmap, it means that whenever the VCPU (no matter whether in nested guest mode or not) accesses this msr, it should cause a vmexit, not only in L1. That's why we need to construct vmcs02->msr_bitmap based on (vmcs01->msr_bitmap ORed with vmcs12->msr_bitmap). Thanks, Wincy Best regards, Yang
RE: [PATCH v4 0/6] KVM: nVMX: Enable nested apicv support
Wincy Van wrote on 2015-01-28:

v1 -> v2: Use spin lock to ensure vmcs12 is safe when doing nested posted interrupt delivery.

v2 -> v3:
1. Add a new field in nested_vmx to avoid the spin lock in v2.
2. Drop send eoi to L1 when doing nested interrupt delivery.
3. Use hardware MSR bitmap to enable nested virtualize x2apic mode.

v3 -> v4:
1. Optimize nested msr bitmap merging.
2. Allocate nested msr bitmap only when nested == 1.
3. Inline the nested vmx control checking functions.

This version looks good to me. Only a minor comment: the EXIT_REASON_APIC_WRITE vmexit is introduced by apic register virtualization, not virtual interrupt delivery, so it's better to add it in the 4th patch, not the 5th. (If there are no other comments, I guess Paolo can help do it when applying.)

Reviewed-by: Yang Zhang yang.z.zh...@intel.com

Wincy Van (6):
  KVM: nVMX: Use hardware MSR bitmap
  KVM: nVMX: Enable nested virtualize x2apic mode
  KVM: nVMX: Make nested control MSRs per-cpu
  KVM: nVMX: Enable nested apic register virtualization
  KVM: nVMX: Enable nested virtual interrupt delivery
  KVM: nVMX: Enable nested posted interrupt processing

 arch/x86/kvm/vmx.c | 580 +++-
 1 files changed, 480 insertions(+), 100 deletions(-)

Best regards, Yang
RE: [PATCH v4 2/6] KVM: nVMX: Enable nested virtualize x2apic mode
Wincy Van wrote on 2015-01-28: When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance, especially in apicv case. This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Signed-off-by: Wincy Van fanwenyi0...@gmail.com
---
 arch/x86/kvm/vmx.c | 114 +++-
 1 files changed, 112 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 787f886..9d11a93 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1108,6 +1108,11 @@ static inline bool nested_cpu_has_xsaves(struct vmcs12 *vmcs12)
 		vmx_xsaves_supported();
 }
 
+static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12)
+{
+	return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
+}
+
 static inline bool is_exception(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2395,6 +2400,7 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 	nested_vmx_secondary_ctls_low = 0;
 	nested_vmx_secondary_ctls_high &=
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
+		SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
 		SECONDARY_EXEC_WBINVD_EXITING |
 		SECONDARY_EXEC_XSAVES;
 
@@ -4155,6 +4161,52 @@ static void __vmx_enable_intercept_for_msr(unsigned long *msr_bitmap,
 	}
 }
 
+/*
+ * If a msr is allowed by L0, we should check whether it is allowed by L1.
+ * The corresponding bit will be cleared unless both of L0 and L1 allow it.
+ */
+static void nested_vmx_disable_intercept_for_msr(unsigned long *msr_bitmap_l1,
+					       unsigned long *msr_bitmap_nested,
+					       u32 msr, int type)
+{
+	int f = sizeof(unsigned long);
+
+	if (!cpu_has_vmx_msr_bitmap()) {
+		WARN_ON(1);
+		return;
+	}
+
+	/*
+	 * See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals
+	 * have the write-low and read-high bitmap offsets the wrong way round.
+	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
+	 */
+	if (msr <= 0x1fff) {
+		if (type & MSR_TYPE_R &&
+		   !test_bit(msr, msr_bitmap_l1 + 0x000 / f))
+			/* read-low */
+			__clear_bit(msr, msr_bitmap_nested + 0x000 / f);
+
+		if (type & MSR_TYPE_W &&
+		   !test_bit(msr, msr_bitmap_l1 + 0x800 / f))
+			/* write-low */
+			__clear_bit(msr, msr_bitmap_nested + 0x800 / f);
+
+	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
+		msr &= 0x1fff;
+		if (type & MSR_TYPE_R &&
+		   !test_bit(msr, msr_bitmap_l1 + 0x400 / f))
+			/* read-high */
+			__clear_bit(msr, msr_bitmap_nested + 0x400 / f);
+
+		if (type & MSR_TYPE_W &&
+		   !test_bit(msr, msr_bitmap_l1 + 0xc00 / f))
+			/* write-high */
+			__clear_bit(msr, msr_bitmap_nested + 0xc00 / f);
+
+	}
+}
+
 static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
 {
 	if (!longmode_only)
@@ -8350,7 +8402,59 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu,
 static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
 					       struct vmcs12 *vmcs12)
 {
-	return false;
+	struct page *page;
+	unsigned long *msr_bitmap;
+
+	if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
+		return false;
+
+	page = nested_get_page(vcpu, vmcs12->msr_bitmap);
+	if (!page) {
+		WARN_ON(1);
+		return false;
+	}
+	msr_bitmap = (unsigned long *)kmap(page);
+	if (!msr_bitmap) {
+		nested_release_page_clean(page);
+		WARN_ON(1);
+		return false;
+	}
+
+	if (nested_cpu_has_virt_x2apic_mode(vmcs12)) {
+		/* TPR is allowed */
+		nested_vmx_disable_intercept_for_msr(msr_bitmap,
+				vmx_msr_bitmap_nested,
+				APIC_BASE_MSR + (APIC_TASKPRI >> 4),
+				MSR_TYPE_R | MSR_TYPE_W);
+	} else
+		__vmx_enable_intercept_for_msr(
+				vmx_msr_bitmap_nested,
+				APIC_BASE_MSR + (APIC_TASKPRI >> 4),
+				MSR_TYPE_R | MSR_TYPE_W);
+	kunmap(page);
+	nested_release_page_clean(page);
+
+	return true;
+}
+
+static int nested_vmx_check_apicv_controls(struct kvm_vcpu *vcpu,
+
RE: [PATCH v4 2/6] KVM: nVMX: Enable nested virtualize x2apic mode
Wincy Van wrote on 2015-01-29: On Thu, Jan 29, 2015 at 10:54 AM, Zhang, Yang Z yang.z.zh...@intel.com wrote:

@@ -8646,7 +8750,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 		else
 			vmcs_write64(APIC_ACCESS_ADDR,
 				page_to_phys(vmx->nested.apic_access_page));
-	} else if (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) {
+	} else if (!(nested_cpu_has_virt_x2apic_mode(vmcs12)) &&
+		   (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))) {
 		exec_control |=

You don't load L2's apic_page correctly in your patch when x2apic mode is used. Here is the right change for prepare_vmcs02() (maybe other places also need to change):

@@ -8585,7 +8585,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 			CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
 			exec_control |= vmcs12->secondary_vm_exec_control;
 
-		if (exec_control & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) {
+		if (exec_control & (SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
+				    SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) {
 			/*
 			 * If translation failed, no matter: This feature asks
 			 * to exit when accessing the given address, and if it
@@ -8594,7 +8595,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 			 */
 			if (!vmx->nested.apic_access_page)
 				exec_control &=
-					~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+					~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
+					  SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
 			else
 				vmcs_write64(APIC_ACCESS_ADDR,
 					page_to_phys(vmx->nested.apic_access_page));

I think we don't need to do this: if L1 enables x2apic mode, we have already checked that vmcs12's SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES is 0. exec_control |= vmcs12->secondary_vm_exec_control; merged L1's settings, including x2apic mode. The special case is vm_need_virtualize_apic_accesses: if vm_need_virtualize_apic_accesses returns true, nested_cpu_has_virt_x2apic_mode will prevent us from setting the apic access bit. I just realized that we are talking about different things.
I am thinking about VIRTUAL_APIC_PAGE_ADDR(my mistake :)). And it has been handled correctly already. For APIC_ACCESS_ADDR, you are right. The current implementation can handle it well. Thanks, Wincy Best regards, Yang
RE: [PATCH v2 5/5] KVM: nVMX: Enable nested posted interrupt processing.
Wincy Van wrote on 2015-01-21:
> On Wed, Jan 21, 2015 at 4:07 PM, Zhang, Yang Z
> wrote:
>>> +	if (vector == vmcs12->posted_intr_nv &&
>>> +	    nested_cpu_has_posted_intr(vmcs12)) {
>>> +		if (vcpu->mode == IN_GUEST_MODE)
>>> +			apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
>>> +				POSTED_INTR_VECTOR);
>>> +		else {
>>> +			r = -1;
>>> +			goto out;
>>> +		}
>>> +
>>> +		/*
>>> +		 * if posted intr is done by hardware, the
>>> +		 * corresponding eoi was sent to L0. Thus
>>> +		 * we should send eoi to L1 manually.
>>> +		 */
>>> +		kvm_apic_set_eoi_accelerated(vcpu,
>>> +			vmcs12->posted_intr_nv);
>>
>> Why is this necessary? As your comments mentioned, it is done by
>> hardware, not L1, so why should L1 be aware of it?
>>
>
> According to SDM 29.6, if the processor recognizes a posted interrupt,
> it will send an EOI to the LAPIC.
> If the posted intr is done by hardware, the processor will send the eoi to
> the hardware LAPIC, not L1's, just like the non-nested case (the physical
> interrupt is dismissed). So we should take care of L1's LAPIC and send an
> eoi to it.

No. You are not emulating the PI feature. You just reuse the hardware's capability. So you don't need to let L1 know about it.

> Thanks,
> Wincy

Best regards, Yang
RE: [PATCH 2/5] KVM: nVMX: Enable nested virtualize x2apic mode.
Wincy Van wrote on 2015-01-16: > When L2 is using x2apic, we can use virtualize x2apic mode to gain higher > performance. > > This patch also introduces nested_vmx_check_apicv_controls for the nested > apicv patches. > > Signed-off-by: Wincy Van To enable x2apic, should you to consider the behavior changes to rdmsr and wrmsr. I didn't see your patch do it. Is it correct? BTW, this patch has nothing to do with APICv, it's better to not use x2apic here and change to apicv in following patch. > --- > arch/x86/kvm/vmx.c | 49 > - > 1 files changed, 48 insertions(+), 1 deletions(-) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 954dd54..10183ee > 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -1134,6 +1134,11 @@ static inline bool nested_cpu_has_xsaves(struct > vmcs12 *vmcs12) > vmx_xsaves_supported(); > } > > +static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 > +*vmcs12) { > + return nested_cpu_has2(vmcs12, > +SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); > +} > + > static inline bool is_exception(u32 intr_info) { > return (intr_info & (INTR_INFO_INTR_TYPE_MASK | > INTR_INFO_VALID_MASK)) @@ -2426,6 +2431,7 @@ static void > nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx) > vmx->nested.nested_vmx_secondary_ctls_low = 0; > vmx->nested.nested_vmx_secondary_ctls_high &= > SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | > + SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | > SECONDARY_EXEC_WBINVD_EXITING | > SECONDARY_EXEC_XSAVES; > > @@ -7333,6 +7339,9 @@ static bool nested_vmx_exit_handled(struct > kvm_vcpu *vcpu) > case EXIT_REASON_APIC_ACCESS: > return nested_cpu_has2(vmcs12, > > SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES); > + case EXIT_REASON_APIC_WRITE: > + /* apic_write should exit unconditionally. */ > + return 1; APIC_WRITE vmexit is introduced by APIC register virtualization not virtualize x2apic. Move it to next patch. > case EXIT_REASON_EPT_VIOLATION: > /* > * L0 always deals with the EPT violation. 
If nested EPT is > @@ -8356,6 +8365,38 @@ static void vmx_start_preemption_timer(struct > kvm_vcpu *vcpu) > ns_to_ktime(preemption_timeout), > HRTIMER_MODE_REL); } > > +static inline int nested_vmx_check_virt_x2apic(struct kvm_vcpu *vcpu, > + struct vmcs12 > *vmcs12) { > + if (nested_cpu_has2(vmcs12, > SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) > + return -EINVAL; > + return 0; > +} > + > +static int nested_vmx_check_apicv_controls(struct kvm_vcpu *vcpu, > + struct vmcs12 *vmcs12) { > + int r; > + > + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) > + return 0; > + > + r = nested_vmx_check_virt_x2apic(vcpu, vmcs12); > + if (r) > + goto fail; > + > + /* tpr shadow is needed by all apicv features. */ > + if (!nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) { > + r = -EINVAL; > + goto fail; > + } > + > + return 0; > + > +fail: > + return r; > +} > + > static int nested_vmx_check_msr_switch(struct kvm_vcpu *vcpu, >unsigned long count_field, >unsigned long addr_field, @@ > -8649,7 +8690,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, > struct vmcs12 *vmcs12) > else > vmcs_write64(APIC_ACCESS_ADDR, > > page_to_phys(vmx->nested.apic_access_page)); > - } else if > (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm)) { > + } else if (!(nested_cpu_has_virt_x2apic_mode(vmcs12)) && > + > + (vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))) { > exec_control |= > > SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; > kvm_vcpu_reload_apic_access_page(vcpu); > @@ -8856,6 +8898,11 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, > bool launch) > return 1; > } > > + if (nested_vmx_check_apicv_controls(vcpu, vmcs12)) { > + nested_vmx_failValid(vcpu, > VMXERR_ENTRY_INVALID_CONTROL_FIELD); > + return 1; > + } > + > if (nested_vmx_check_msr_switch_controls(vcpu, vmcs12)) { > nested_vmx_failValid(vcpu, > VMXERR_ENTRY_INVALID_CONTROL_FIELD); > return 1; > -- > 1.7.1 Best regards, Yang
RE: [PATCH 1/5] KVM: nVMX: Make nested control MSRs per-cpu.
Wincy Van wrote on 2015-01-16: > To enable nested apicv support, we need per-cpu vmx control MSRs: > 1. If in-kernel irqchip is enabled, we can enable nested > posted interrupt, we should set posted intr bit in the > nested_vmx_pinbased_ctls_high. 2. If in-kernel irqchip is disabled, > we can not enable nested posted interrupt, the posted intr bit in > the nested_vmx_pinbased_ctls_high will be cleared. > Since there would be different settings about in-kernel irqchip > between VMs, different nested control MSRs are needed. I'd suggest you check irqchip_in_kernel() instead of moving the whole ctrl msr to per-vcpu. Best regards, Yang
RE: [PATCH v2 5/5] KVM: nVMX: Enable nested posted interrupt processing.
Wincy Van wrote on 2015-01-20: > If vcpu has a interrupt in vmx non-root mode, we will kick that vcpu > to inject interrupt timely. With posted interrupt processing, the kick > intr is not needed, and interrupts are fully taken care of by hardware. > > In nested vmx, this feature avoids much more vmexits than non-nested vmx. > > This patch use L0's POSTED_INTR_NV to avoid unexpected interrupt if > L1's vector is different with L0's. If vcpu is in hardware's non-root > mode, we use a physical ipi to deliver posted interrupts, otherwise we > will deliver that interrupt to L1 and kick that vcpu out of nested non-root > mode. > > Signed-off-by: Wincy Van > --- > arch/x86/kvm/vmx.c | 136 > ++-- 1 files changed, > 132 insertions(+), 4 deletions(-) > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index > ea56e9f..cda9133 100644 --- a/arch/x86/kvm/vmx.c +++ > b/arch/x86/kvm/vmx.c @@ -215,6 +215,7 @@ struct __packed vmcs12 { > u64 tsc_offset; u64 virtual_apic_page_addr; u64 > apic_access_addr; + u64 posted_intr_desc_addr; u64 > ept_pointer; u64 eoi_exit_bitmap0; u64 eoi_exit_bitmap1; @@ > -334,6 +335,7 @@ struct __packed vmcs12 { u32 > vmx_preemption_timer_value; u32 padding32[7]; /* room for future > expansion */ u16 virtual_processor_id; + u16 > posted_intr_nv; u16 guest_es_selector; u16 guest_cs_selector; > u16 guest_ss_selector; @@ -387,6 +389,7 @@ struct nested_vmx { > /* The host-usable pointer to the above */ struct page > *current_vmcs12_page; struct vmcs12 *current_vmcs12; + > spinlock_t vmcs12_lock; struct vmcs *current_shadow_vmcs; /* > * Indicates if the shadow vmcs must be updated with the @@ > -406,6 +409,8 @@ struct nested_vmx { */ > struct page *apic_access_page; > struct page *virtual_apic_page; > + struct page *pi_desc_page; > + struct pi_desc *pi_desc; > u64 msr_ia32_feature_control; > > struct hrtimer preemption_timer; @@ -621,6 +626,7 @@ static > int max_shadow_read_write_fields = > > static const unsigned short vmcs_field_to_offset_table[] = { > 
FIELD(VIRTUAL_PROCESSOR_ID, virtual_processor_id), + > FIELD(POSTED_INTR_NV, posted_intr_nv), FIELD(GUEST_ES_SELECTOR, > guest_es_selector), FIELD(GUEST_CS_SELECTOR, guest_cs_selector), > FIELD(GUEST_SS_SELECTOR, guest_ss_selector), @@ -646,6 +652,7 @@ > static const unsigned short vmcs_field_to_offset_table[] = { > FIELD64(TSC_OFFSET, tsc_offset), FIELD64(VIRTUAL_APIC_PAGE_ADDR, > virtual_apic_page_addr), FIELD64(APIC_ACCESS_ADDR, > apic_access_addr), + FIELD64(POSTED_INTR_DESC_ADDR, > posted_intr_desc_addr), FIELD64(EPT_POINTER, ept_pointer), > FIELD64(EOI_EXIT_BITMAP0, eoi_exit_bitmap0), > FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1), @@ -798,6 +805,7 > @@ static void kvm_cpu_vmxon(u64 addr); static void > kvm_cpu_vmxoff(void); static bool vmx_mpx_supported(void); static > bool vmx_xsaves_supported(void); > +static int vmx_vm_has_apicv(struct kvm *kvm); > static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr); > static void vmx_set_segment(struct kvm_vcpu *vcpu, > struct kvm_segment *var, int seg); @@ > -1159,6 +1167,11 @@ static inline bool nested_cpu_has_vid(struct > vmcs12 *vmcs12) > return nested_cpu_has2(vmcs12, > SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY); > } > +static inline bool nested_cpu_has_posted_intr(struct vmcs12 *vmcs12) { > + return vmcs12->pin_based_vm_exec_control & > +PIN_BASED_POSTED_INTR; } > + > static inline bool is_exception(u32 intr_info) { > return (intr_info & (INTR_INFO_INTR_TYPE_MASK | > INTR_INFO_VALID_MASK)) @@ -2362,6 +2375,9 @@ static void > nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx) > vmx->nested.nested_vmx_pinbased_ctls_high |= > PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR | > PIN_BASED_VMX_PREEMPTION_TIMER; > + if (vmx_vm_has_apicv(vmx->vcpu.kvm)) > + vmx->nested.nested_vmx_pinbased_ctls_high |= > + PIN_BASED_POSTED_INTR; > > /* exit controls */ rdmsr(MSR_IA32_VMX_EXIT_CTLS, @@ -4267,6 > +4283,46 @@ static int vmx_vm_has_apicv(struct kvm *kvm) return > enable_apicv && irqchip_in_kernel(kvm); } > +static int 
vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
> +						int vector)
> +{
> +	int r = 0;
> +	struct vmcs12 *vmcs12;
> +
> +	/*
> +	 * Since posted intr delivery is async,
> +	 * we must aquire a spin-lock to avoid
> +	 * the race of vmcs12.
> +	 */
> +	spin_lock(&to_vmx(vcpu)->nested.vmcs12_lock);
> +	vmcs12 = get_vmcs12(vcpu);
> +	if (!is_guest_mode(vcpu) || !vmcs12) {
> +		r = -1;
> +
RE: [PATCH 0/5] KVM: nVMX: Enable nested apicv support.
Wincy Van wrote on 2015-01-20:
> Hi, Yang,
>
> Could you please have a look at this patch set?
> Your comment is very appreciated!

Sure. I will take a look.

> Thanks,
> Wincy

Best regards,
Yang
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Wu, Feng wrote on 2014-12-24:
>
> Zhang, Yang Z wrote on 2014-12-24:
>> Cc: io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org;
>> KVM list; Eric Auger
>> Subject: RE: [v3 06/26] iommu, x86: No need to migrating irq for
>> VT-d Posted-Interrupts
>>
>> Jiang Liu wrote on 2014-12-24:
>>> On 2014/12/24 9:38, Zhang, Yang Z wrote:
>>>> Paolo Bonzini wrote on 2014-12-23:
>>>>>
>>>>> On 23/12/2014 10:07, Wu, Feng wrote:
>>>>>>> On 23/12/2014 01:37, Zhang, Yang Z wrote:
>>>>>>>> I don't quite understand it. If user set an interrupt's affinity
>>>>>>>> to a CPU, but he still see the interrupt delivers to other CPUs
>>>>>>>> in host. Do you think it is a right behavior?
>>>>>>>
>>>>>>> No, the interrupt is not delivered at all in the host. Normally
>>>>>>> you'd have:
>>>>>>>
>>>>>>> - interrupt delivered to CPU from host affinity
>>>>>>> - VFIO interrupt handler writes to irqfd
>>>>>>> - interrupt delivered to vCPU from guest affinity
>>>>>>>
>>>>>>> Here, you just skip the first two steps. The interrupt is
>>>>>>> delivered to the thread that is running the vCPU directly, so
>>>>>>> the host affinity is bypassed entirely.
>>>>>>>
>>>>>>> ... unless you are considering the case where the vCPU is
>>>>>>> blocked and the host is processing the posted interrupt wakeup
>>>>>>> vector. In that case yes, it would be better to set NDST to a
>>>>>>> CPU matching the host affinity.
>>>>>>
>>>>>> In my understanding, wakeup vector should have no relationship
>>>>>> with the host affinity of the irq. Wakeup notification event
>>>>>> should be delivered to the pCPU which the vCPU was blocked on.
>>>>>> And in kernel's point of view, the irq is not associated with
>>>>>> the wakeup vector, right?
>>>>>
>>>>> That is correct indeed. It is not associated to the wakeup
>>>>> vector, hence this patch is right, I think.
>>>>>
>>>>> However, the wakeup vector has the same function as the VFIO
>>>>> interrupt handler, so you could argue that it is tied to the
>>>>> host affinity rather than the guest. Let's wait for Yang to answer.
>>>>
>>>> Actually, that's my original question too. I am wondering what
>>>> happens if the user changes the assigned device's affinity in
>>>> host's /proc/irq/? If ignore it is acceptable, then this patch is
>>>> ok. But it seems the discussion out of my scope, need some experts
>>>> to tell us their idea since it will impact the user experience.
>>>
>>> Hi Yang,
>>
>> Hi Jiang,
>>
>>> Originally we have a proposal to return failure when user sets
>>> IRQ affinity through native OS interfaces if an IRQ is in PI mode.
>>> But that proposal will break CPU hot-removal because OS needs to
>>> migrate away all IRQs binding to the CPU to be offlined. Then we
>>> propose saving user IRQ affinity setting without changing hardware
>>> configuration (keeping PI configuration). Later when PI mode is
>>> disabled, the cached affinity setting will be used to setup IRQ
>>> destination for native OS. On the other hand, for IRQ in PI mode,
>>> it won't be delivered to native OS, so user may not sense that the
>>> IRQ is delivered to CPUs other than those in the affinity set.
>>
>> The IRQ is still there but will be delivered to host in the form of
>> PI event (if the VCPU is running in root-mode). I am not sure whether
>> those interrupts should be reflected in /proc/interrupts? If the
>> answer is yes, then which entries should be used, a new PI entry or
>> use the original IRQ entry?
>
> Even though, setting the affinity of the IRQ in host should not affect
> the destination of the PI event (normal notification event or wakeup
> notification event), because the destination of the PI event is
> determined by the NDST field of the posted-interrupts descriptor and
> the PI notification vector is global. Just had a discussion with Jiang
> offline; maybe we can add the statistics information for the
> notification vector in /proc/interrupts just like any other global
> interrupts.

This is your implementation. To me, disabling PI if the VCPU is going to run on a CPU outside the IRQ affinity bitmap is also acceptable, and it will keep the user interface looking the same as before.

Hi Thomas, Ingo, Peter,

Can you guys help to review this patch? I would really appreciate it if you could give some feedback.

> Thanks,
> Feng
>
>>> In that aspect, I think it's acceptable:) Regards!
>>
>> Yes, if all of you guys (especially the IRQ maintainer) think it is
>> acceptable then we can follow current implementation and document it.
>>
>>> Gerry
>>>>
>>>>> Paolo

Best regards,
Yang

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Jiang Liu wrote on 2014-12-24:
> On 2014/12/24 9:38, Zhang, Yang Z wrote:
>> Paolo Bonzini wrote on 2014-12-23:
>>>
>>> On 23/12/2014 10:07, Wu, Feng wrote:
>>>>> On 23/12/2014 01:37, Zhang, Yang Z wrote:
>>>>>> I don't quite understand it. If user set an interrupt's affinity
>>>>>> to a CPU, but he still see the interrupt delivers to other CPUs in host.
>>>>>> Do you think it is a right behavior?
>>>>>
>>>>> No, the interrupt is not delivered at all in the host. Normally you'd
>>>>> have:
>>>>>
>>>>> - interrupt delivered to CPU from host affinity
>>>>> - VFIO interrupt handler writes to irqfd
>>>>> - interrupt delivered to vCPU from guest affinity
>>>>>
>>>>> Here, you just skip the first two steps. The interrupt is
>>>>> delivered to the thread that is running the vCPU directly, so the
>>>>> host affinity is bypassed entirely.
>>>>>
>>>>> ... unless you are considering the case where the vCPU is blocked
>>>>> and the host is processing the posted interrupt wakeup vector.
>>>>> In that case yes, it would be better to set NDST to a CPU
>>>>> matching the host affinity.
>>>>
>>>> In my understanding, wakeup vector should have no relationship
>>>> with the host affinity of the irq. Wakeup notification event
>>>> should be delivered to the pCPU which the vCPU was blocked on. And
>>>> in kernel's point of view, the irq is not associated with the wakeup
>>>> vector, right?
>>>
>>> That is correct indeed. It is not associated to the wakeup vector,
>>> hence this patch is right, I think.
>>>
>>> However, the wakeup vector has the same function as the VFIO
>>> interrupt handler, so you could argue that it is tied to the host
>>> affinity rather than the guest. Let's wait for Yang to answer.
>>
>> Actually, that's my original question too. I am wondering what
>> happens if the user changes the assigned device's affinity in host's
>> /proc/irq/? If ignore it is acceptable, then this patch is ok. But
>> it seems the discussion out of my scope, need some experts to tell
>> us their idea since it will impact the user experience.
>
> Hi Yang,

Hi Jiang,

> Originally we have a proposal to return failure when user sets IRQ
> affinity through native OS interfaces if an IRQ is in PI mode. But
> that proposal will break CPU hot-removal because OS needs to migrate
> away all IRQs binding to the CPU to be offlined. Then we propose
> saving user IRQ affinity setting without changing hardware
> configuration (keeping PI configuration). Later when PI mode is
> disabled, the cached affinity setting will be used to setup IRQ
> destination for native OS. On the other hand, for IRQ in PI mode, it
> won't be delivered to native OS, so user may not sense that the IRQ is
> delivered to CPUs other than those in the affinity set.

The IRQ is still there but will be delivered to the host in the form of a PI event (if the VCPU is running in root mode). I am not sure whether those interrupts should be reflected in /proc/interrupts. If the answer is yes, then which entry should be used: a new PI entry, or the original IRQ entry?

> In that aspect, I think it's acceptable:) Regards!

Yes, if all of you guys (especially the IRQ maintainer) think it is acceptable then we can follow the current implementation and document it.

> Gerry
>>
>>> Paolo

Best regards,
Yang
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Paolo Bonzini wrote on 2014-12-23:
>
> On 23/12/2014 10:07, Wu, Feng wrote:
>>> On 23/12/2014 01:37, Zhang, Yang Z wrote:
>>>> I don't quite understand it. If user set an interrupt's affinity
>>>> to a CPU, but he still see the interrupt delivers to other CPUs in host.
>>>> Do you think it is a right behavior?
>>>
>>> No, the interrupt is not delivered at all in the host. Normally you'd have:
>>>
>>> - interrupt delivered to CPU from host affinity
>>> - VFIO interrupt handler writes to irqfd
>>> - interrupt delivered to vCPU from guest affinity
>>>
>>> Here, you just skip the first two steps. The interrupt is
>>> delivered to the thread that is running the vCPU directly, so the
>>> host affinity is bypassed entirely.
>>>
>>> ... unless you are considering the case where the vCPU is blocked
>>> and the host is processing the posted interrupt wakeup vector. In
>>> that case yes, it would be better to set NDST to a CPU matching the
>>> host affinity.
>>
>> In my understanding, wakeup vector should have no relationship with
>> the host affinity of the irq. Wakeup notification event should be
>> delivered to the pCPU which the vCPU was blocked on. And in kernel's
>> point of view, the irq is not associated with the wakeup vector, right?
>
> That is correct indeed. It is not associated to the wakeup vector,
> hence this patch is right, I think.
>
> However, the wakeup vector has the same function as the VFIO interrupt
> handler, so you could argue that it is tied to the host affinity
> rather than the guest. Let's wait for Yang to answer.

Actually, that's my original question too. I am wondering what happens if the user changes the assigned device's affinity in the host's /proc/irq/. If ignoring it is acceptable, then this patch is ok. But the discussion seems out of my scope; we need some experts to tell us their opinion, since it will impact the user experience.

> Paolo

Best regards,
Yang
RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Paolo Bonzini wrote on 2014-12-23:
>> The problem is we still need to support PI with lowest priority
>> delivery mode even if guest does not configure irq affinity via
>> /proc/irq/. Don't we?
>
> Yes, but we can get the basic support working first.
>
> I and Feng talked on irc and agreed to start with a simple
> implementation. Then he can add support for vector hashing for all
> interrupts, both vt-d pi and normal LAPIC interrupts.

Agree. Obviously, current PI has some limitations in achieving the highest performance. We can start with a simple implementation, but we must ensure we don't change the hardware's behavior (from the guest's point of view). Feng's current implementation and the way I suggested are both ok, since neither of them solves the problem completely.

> Paolo

Best regards,
Yang
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Paolo Bonzini wrote on 2014-12-19:
>
> On 19/12/2014 02:46, Zhang, Yang Z wrote:
>>> If the IRQ is posted, its affinity is controlled by guest (irq
>>> <---> vCPU <---> pCPU), it has no effect when host changes its affinity.
>>
>> That's the problem: User is able to changes it in host but it never
>> takes effect since it is actually controlled by guest. I guess it
>> will break the IRQ balance too.
>
> I don't think that's a problem.
>
> Controlling the affinity in the host affects which CPU in the host
> takes care of signaling the guest.
>
> If this signaling is done directly by the chipset, there is no need to
> do anything in the host and thus the host affinity can be bypassed.

I don't quite understand it. If the user sets an interrupt's affinity to a CPU, but still sees the interrupt delivered to other CPUs in the host, do you think that is the right behavior?

> Paolo

Best regards,
Yang
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Wu, Feng wrote on 2014-12-19: > > > Zhang, Yang Z wrote on 2014-12-19: >> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is >> set >> >> Wu, Feng wrote on 2014-12-19: >>> >>> >>> Zhang, Yang Z wrote on 2014-12-19: >>>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' >>>> is set >>>> >>>> Wu, Feng wrote on 2014-12-19: >>>>> >>>>> >>>>> iommu-boun...@lists.linux-foundation.org wrote on >>>> mailto:iommu-boun...@lists.linux-foundation.org] On Behalf Of: >>>>>> Cc: io...@lists.linux-foundation.org; >>>>>> linux-kernel@vger.kernel.org; k...@vger.kernel.org >>>>>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' >>>>>> is set >>>>>> >>>>>> Paolo Bonzini wrote on 2014-12-18: >>>>>>> >>>>>>> >>>>>>> On 18/12/2014 04:14, Wu, Feng wrote: >>>>>>>> >>>>>>>> >>>>>>>> linux-kernel-ow...@vger.kernel.org wrote on >>>>>> mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Paolo: >>>>>>>>> x...@kernel.org; Gleb Natapov; Paolo Bonzini; >>>>>>>>> dw...@infradead.org; >>>>>>>>> joro-zlv9swrftaidnm+yrof...@public.gmane.org; Alex Williamson; >>>>>>>>> joro-zLv9SwRftAIdnm+Jiang Liu Cc: >>>>>>>>> io...@lists.linux-foundation.org; >>>>>>>>> linux-kernel-u79uwxl29ty76z2rm5m...@public.gmane.org; KVM list; >>>>>>>>> Eric Auger Subject: Re: [v3 25/26] KVM: Suppress >>>>>>>>> posted-interrupt when 'SN' is set >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On 12/12/2014 16:14, Feng Wu wrote: >>>>>>>>>> Currently, we don't support urgent interrupt, all >>>>>>>>>> interrupts are recognized as non-urgent interrupt, so we >>>>>>>>>> cannot send posted-interrupt when 'SN' is set. >>>>>>>>> >>>>>>>>> Can this happen? If the vcpu is in guest mode, it cannot >>>>>>>>> have been scheduled out, and that's the only case when SN is set. >>>>>>>>> >>>>>>>>> Paolo >>>>>>>> >>>>>>>> Currently, the only place where SN is set is vCPU is >>>>>>>> preempted and >>>>>> >>>>>> If the vCPU is preempted, shouldn't the subsequent be ignored? 
>>>>>> What happens if a PI is occurs when vCPU is preempted? >>>>> >>>>> If a vCPU is preempted, the 'SN' bit is set, the subsequent >>>>> interrupts are suppressed for posting. >>>> >>>> I mean what happens if we don't set SN bit. From my point, if >>>> preempter already disabled the interrupt, it is ok to leave SN >>>> bit as zero. But if preempter enabled the interrupt, doesn't this >>>> mean he allow interrupt to happen? BTW, since there already has >>>> ON bit, so this means there only have one interrupt arrived at >>>> most and it doesn't hurt performance. Do we really need to set SN bit? >>> >>> >>> See this scenario: >>> vCPU0 is running on pCPU0 >>> --> vCPU0 is preempted by vCPU1 >>> --> Then vCPU1 is running on pCPU0 and vCPU0 is waiting for >>> --> schedule in runqueue >>> >>> If the we don't set SN for vCPU0, then all subsequent interrupts >>> for >>> vCPU0 is posted to vCPU1, this will consume hardware and software >> >> The PI vector for vCPU1 is notification vector, but the PI vector >> for >> vCPU0 should be wakeup vector. Why vCPU1 will consume this PI event? > > Wakeup vector is only used for blocking case, when vCPU is preempted > and waiting in the runqueue, the NV is the notification vector. I see your point. But from performance point, if we can schedule the vCPU to another PCPU to handle the interrupt, it would helpful. But I remember current KVM will not schedule the vCPU in run queue (even though it got preempted) to another pCPU to run(Am I right?). So it may hard to do it. > > Thanks, > Feng > >> >>> efforts and in fact it is not needed at all. If SN is set for >>> vCPU0, VT-d hardware will not issue Notification Event for vCPU0 >>> when an interrupt is for it, but just setting the related PIR bit. >>> >>> Thanks, >>> Feng >>> >>>> >>>>> >>>>> Thanks, >>>>> Feng >>>>> >>>>>> >>>>>>>> waiting for the next scheduling in the runqueue. But I am not >>>>>>>> sure whether we need to set SN for other purpose in future. 
>>>>>>>> Adding SN checking here is just to follow the spec: >>>>>>>> non-urgent interrupts are suppressed >>>>>>> when SN is set. >>>>>>> >>>>>>> I would change that to a WARN_ON_ONCE then. Best regards, Yang
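The ON/SN semantics debated in this thread can be summarized with a small toy model. This is an illustrative sketch only, not kernel code: the struct and helper names are invented, and it only models the non-urgent-interrupt behaviour Feng describes (SN set while preempted means the vector is recorded in the PIR without a notification event).

```c
#include <assert.h>

/* Toy model of the VT-d posted-interrupt descriptor bits discussed in
 * this thread: 'ON' is the outstanding-notification bit, 'SN' the
 * suppress-notification bit.  Names are illustrative only. */
struct pi_desc_model {
    unsigned int on : 1;   /* a notification event is already outstanding */
    unsigned int sn : 1;   /* set while the vCPU is preempted */
};

/* Decide whether a notification event would be sent for a non-urgent
 * interrupt: with SN (or ON) set, the interrupt is only recorded in the
 * PIR, exactly the suppression behaviour described for a preempted vCPU. */
static int notify_for_nonurgent(struct pi_desc_model *pi)
{
    if (pi->sn || pi->on)
        return 0;       /* just set the PIR bit, no notification */
    pi->on = 1;         /* first interrupt raises ON and notifies */
    return 1;
}
```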
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Wu, Feng wrote on 2014-12-19: > > > Zhang, Yang Z wrote on 2014-12-19: >> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is >> set >> >> Wu, Feng wrote on 2014-12-19: >>> >>> >>> iommu-boun...@lists.linux-foundation.org wrote on >> mailto:iommu-boun...@lists.linux-foundation.org] On Behalf Of: >>>> Cc: io...@lists.linux-foundation.org; >>>> linux-kernel@vger.kernel.org; k...@vger.kernel.org >>>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' >>>> is set >>>> >>>> Paolo Bonzini wrote on 2014-12-18: >>>>> >>>>> >>>>> On 18/12/2014 04:14, Wu, Feng wrote: >>>>>> >>>>>> >>>>>> linux-kernel-ow...@vger.kernel.org wrote on >>>> mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Paolo: >>>>>>> x...@kernel.org; Gleb Natapov; Paolo Bonzini; dw...@infradead.org; >>>>>>> joro-zlv9swrftaidnm+yrof...@public.gmane.org; Alex Williamson; >>>>>>> joro-zLv9SwRftAIdnm+Jiang Liu Cc: >>>>>>> io...@lists.linux-foundation.org; >>>>>>> linux-kernel-u79uwxl29ty76z2rm5m...@public.gmane.org; KVM list; >>>>>>> Eric Auger Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt >>>>>>> when 'SN' is set >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 12/12/2014 16:14, Feng Wu wrote: >>>>>>>> Currently, we don't support urgent interrupt, all interrupts >>>>>>>> are recognized as non-urgent interrupt, so we cannot send >>>>>>>> posted-interrupt when 'SN' is set. >>>>>>> >>>>>>> Can this happen? If the vcpu is in guest mode, it cannot have >>>>>>> been scheduled out, and that's the only case when SN is set. >>>>>>> >>>>>>> Paolo >>>>>> >>>>>> Currently, the only place where SN is set is vCPU is preempted >>>>>> and >>>> >>>> If the vCPU is preempted, shouldn't the subsequent be ignored? >>>> What happens if a PI is occurs when vCPU is preempted? >>> >>> If a vCPU is preempted, the 'SN' bit is set, the subsequent >>> interrupts are suppressed for posting. >> >> I mean what happens if we don't set SN bit. 
From my point of view, if >> the preempter already disabled interrupts, it is OK to leave the SN bit >> as zero. But if the preempter enabled interrupts, doesn't this mean >> it allows interrupts to happen? BTW, since there is already an ON bit, >> at most one interrupt can arrive and it >> doesn't hurt performance. Do we really need to set the SN bit? > > > See this scenario: > vCPU0 is running on pCPU0 > --> vCPU0 is preempted by vCPU1 > --> Then vCPU1 is running on pCPU0 and vCPU0 is waiting for scheduling > --> in the runqueue > > If we don't set SN for vCPU0, then all subsequent interrupts for > vCPU0 are posted to vCPU1; this will consume hardware and software The PI vector for vCPU1 is the notification vector, but the PI vector for vCPU0 should be the wakeup vector. Why would vCPU1 consume this PI event? > efforts and in fact is not needed at all. If SN is set for vCPU0, > VT-d hardware will not issue a Notification Event for vCPU0 when an > interrupt arrives for it, but will just set the related PIR bit. > > Thanks, > Feng > >> >>> >>> Thanks, >>> Feng >>> >>>> >>>>>> waiting for the next scheduling in the runqueue. But I am not >>>>>> sure whether we need to set SN for other purposes in future. >>>>>> Adding SN checking here is just to follow the spec: non-urgent >>>>>> interrupts are suppressed >>>>> when SN is set. >>>>> >>>>> I would change that to a WARN_ON_ONCE then. Best regards, Yang
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Wu, Feng wrote on 2014-12-19: > > > iommu-boun...@lists.linux-foundation.org wrote on > [mailto:iommu-boun...@lists.linux-foundation.org] On Behalf Of: >> Cc: io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org; >> k...@vger.kernel.org >> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is >> set >> >> Paolo Bonzini wrote on 2014-12-18: >>> >>> >>> On 18/12/2014 04:14, Wu, Feng wrote: linux-kernel-ow...@vger.kernel.org wrote on >> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Paolo: > x...@kernel.org; Gleb Natapov; Paolo Bonzini; > dw...@infradead.org; > joro-zlv9swrftaidnm+yrof...@public.gmane.org; Alex Williamson; > joro-zLv9SwRftAIdnm+Jiang > Liu > Cc: io...@lists.linux-foundation.org; > linux-kernel-u79uwxl29ty76z2rm5m...@public.gmane.org; KVM list; > Eric Auger > Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' > is set > > > > On 12/12/2014 16:14, Feng Wu wrote: >> Currently, we don't support urgent interrupts; all interrupts >> are recognized as non-urgent interrupts, so we cannot send a >> posted-interrupt when 'SN' is set. > > Can this happen? If the vcpu is in guest mode, it cannot have > been scheduled out, and that's the only case when SN is set. > > Paolo Currently, the only place where SN is set is when the vCPU is preempted and >> >> If the vCPU is preempted, shouldn't the subsequent interrupts be ignored? What >> happens if a PI occurs while the vCPU is preempted? > > If a vCPU is preempted, the 'SN' bit is set and the subsequent interrupts > are suppressed for posting. I mean, what happens if we don't set the SN bit? From my point of view, if the preempter already disabled interrupts, it is OK to leave the SN bit as zero. But if the preempter enabled interrupts, doesn't this mean it allows interrupts to happen? BTW, since there is already an ON bit, at most one interrupt can arrive and it doesn't hurt performance. Do we really need to set the SN bit? > > Thanks, > Feng > >> waiting for the next scheduling in the runqueue. But I am not sure whether we need to set SN for other purposes in future. Adding SN checking here is just to follow the spec: non-urgent interrupts are suppressed >>> when SN is set. >>> >>> I would change that to a WARN_ON_ONCE then. Best regards, Yang
RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Wu, Feng wrote on 2014-12-19: > > > Paolo Bonzini wrote on 2014-12-19: >> jiang@linux.intel.com >> Cc: eric.au...@linaro.org; linux-kernel@vger.kernel.org; >> io...@lists.linux-foundation.org; k...@vger.kernel.org >> Subject: Re: [v3 13/26] KVM: Define a new interface >> kvm_find_dest_vcpu() for VT-d PI >> >> >> >> On 18/12/2014 15:49, Zhang, Yang Z wrote: >>>>> Here, we introduce a similar way with 'apic_arb_prio' to handle >>>>> guest lowest-priority interrupts when VT-d PI is used. Here is >>>>> the >>>>> idea: - Each vCPU has a counter 'round_robin_counter'. - When a >>>>> guest sets an interrupt to lowest priority, we choose the vCPU >>>>> with the smallest 'round_robin_counter' as the destination, then >>>>> increase it. >>> >>> How can this work well? All subsequent interrupts are delivered to >>> one vCPU? It isn't the best solution; it needs more consideration. >> >> Well, it's a hardware limitation. The alternative (which is easy to >> implement) is to only do PI for single-CPU interrupts. This should >> work well for multiqueue NICs (and of course for UP guests :)), so >> perhaps it's a good idea to only support that as a first attempt. >> >> Paolo > > Paolo, what do you mean by "single-CPU interrupts"? Do you mean we It should be the same idea I mentioned in another thread: deliver the interrupt to a single vCPU (maybe the first matched vCPU?). > don't support lowest-priority interrupts for PI? But the Linux OS uses > lowest priority in most cases. If so, a Linux guest OS can hardly benefit > from this feature. > > Thanks, > Feng > >> >>> Also, I think you should take apic_arb_prio into consideration, >>> since the priority is for the whole vCPU, not for one interrupt. Best regards, Yang
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Wu, Feng wrote on 2014-12-19: > > > Zhang, Yang Z wrote on 2014-12-18: >> jiang@linux.intel.com >> Cc: eric.au...@linaro.org; linux-kernel@vger.kernel.org; >> io...@lists.linux-foundation.org; k...@vger.kernel.org; Wu, Feng >> Subject: RE: [v3 06/26] iommu, x86: No need to migrating irq for >> VT-d Posted-Interrupts >> >> Feng Wu wrote on 2014-12-12: >>> We don't need to migrate the irqs for VT-d Posted-Interrupts here. >>> When 'pst' is set in IRTE, the associated irq will be posted to >>> guests instead of interrupt remapping. The destination of the >>> interrupt is set in Posted-Interrupts Descriptor, and the >>> migration happens during vCPU scheduling. >>> >>> However, we still update the cached irte here, which can be used >>> when changing back to remapping mode. >>> >>> Signed-off-by: Feng Wu >>> Reviewed-by: Jiang Liu >>> --- >>> drivers/iommu/intel_irq_remapping.c | 6 +- >>> 1 file changed, 5 insertions(+), 1 deletion(-) diff --git >>> a/drivers/iommu/intel_irq_remapping.c >>> b/drivers/iommu/intel_irq_remapping.c index 48c2051..ab9057a >>> 100644 >>> --- a/drivers/iommu/intel_irq_remapping.c +++ >>> b/drivers/iommu/intel_irq_remapping.c @@ -977,6 +977,7 @@ >>> intel_ir_set_affinity(struct irq_data *data, const struct cpumask >>> *mask, { >>> struct intel_ir_data *ir_data = data->chip_data;struct irte >>> *irte = >>> _data->irte_entry; +struct irte_pi *irte_pi = (struct irte_pi >>> *)irte;struct irq_cfg *cfg = irqd_cfg(data); struct irq_data *parent >>> = data->parent_data; int ret; >>> @@ -991,7 +992,10 @@ intel_ir_set_affinity(struct irq_data *data, >>> const struct cpumask *mask, >>> */ >>> irte->vector = cfg->vector; >>> irte->dest_id = IRTE_DEST(cfg->dest_apicid); >>> - modify_irte(_data->irq_2_iommu, irte); >>> + >>> + /* We don't need to modify irte if the interrupt is for posting. */ >>> + if (irte_pi->pst != 1) >>> + modify_irte(_data->irq_2_iommu, irte); >> >> What happens if user changes the IRQ affinity manually? 
> > If the IRQ is posted, its affinity is controlled by the guest (irq <---> > vCPU <---> pCPU); changing its affinity in the host has no effect. That's the problem: the user is able to change it in the host, but the change never takes effect since the affinity is actually controlled by the guest. I guess it will break IRQ balancing too. > > Thanks, > Feng > >> >>> >>> /* >>> * After this point, all the interrupts will start arriving Best regards, Yang
RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Paolo Bonzini wrote on 2014-12-19: > > > On 18/12/2014 15:49, Zhang, Yang Z wrote: >>>> Here, we introduce a similar way with 'apic_arb_prio' to handle >>>> guest lowest-priority interrupts when VT-d PI is used. Here is the >>>> idea: - Each vCPU has a counter 'round_robin_counter'. - When a >>>> guest sets an interrupt to lowest priority, we choose the vCPU >>>> with the smallest 'round_robin_counter' as the destination, then >>>> increase it. >> >> How can this work well? All subsequent interrupts are delivered to >> one vCPU? It isn't the best solution; it needs more consideration. > > Well, it's a hardware limitation. The alternative (which is easy to Agreed, it is limited by the hardware. But lowest priority distributes interrupts more efficiently than fixed mode, and the current implementation effectively turns lowest-priority mode into fixed mode. In an interrupt-intensive environment this may become a bottleneck, and the VM may not benefit greatly from VT-d PI. But again, it really is a hardware limitation. > implement) is to only do PI for single-CPU interrupts. This should > work well for multiqueue NICs (and of course for UP guests :)), so > perhaps it's a good idea to only support that as a first attempt. The easier way is to deliver the interrupt to the first matched vCPU we find. The round_robin_counter really doesn't help here, since the interrupt is delivered by the hardware directly. > > Paolo > >> Also, I think you should take apic_arb_prio into consideration, since >> the priority is for the whole vCPU, not for one interrupt. Best regards, Yang
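The "first matched vCPU" fallback Yang suggests can be sketched as follows. This is a hypothetical stand-alone illustration, not KVM code: `vcpu_model`, the flat array, and the crude logical-mode mask match are all invented stand-ins (real matching must honour the destination mode and APIC configuration).

```c
#include <stddef.h>

/* Simplified stand-in for a vCPU: just its APIC ID. */
struct vcpu_model {
    unsigned int apic_id;
};

/* Deliver a lowest-priority posted interrupt to the first vCPU whose
 * APIC ID matches the destination mask, as suggested above.  Sketch
 * only: a crude logical-mode bit test replaces real destination
 * matching. */
static struct vcpu_model *find_first_match(struct vcpu_model *vcpus,
                                           size_t n, unsigned int dest_mask)
{
    for (size_t i = 0; i < n; i++) {
        if (vcpus[i].apic_id & dest_mask)
            return &vcpus[i];
    }
    return NULL;   /* no match: fall back to non-posted delivery */
}
```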
RE: [v3 12/26] KVM: Initialize VT-d Posted-Interrupts Descriptor
Feng Wu wrote on 2014-12-12: > This patch initializes the VT-d Posted-Interrupts Descriptor. > > Signed-off-by: Feng Wu > --- > arch/x86/kvm/vmx.c | 27 +++ > 1 file changed, 27 insertions(+) > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index > 0b1383e..66ca275 100644 --- a/arch/x86/kvm/vmx.c +++ > b/arch/x86/kvm/vmx.c @@ -45,6 +45,7 @@ > #include > #include > #include > +#include > > #include "trace.h" > @@ -4433,6 +4434,30 @@ static void ept_set_mmio_spte_mask(void) > kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull); } > +static void pi_desc_init(struct vcpu_vmx *vmx) { > + unsigned int dest; > + > + if (!irq_remapping_cap(IRQ_POSTING_CAP)) > + return; > + > + /* > + * Initialize Posted-Interrupt Descriptor > + */ > + > + pi_clear_sn(&vmx->pi_desc); > + vmx->pi_desc.nv = POSTED_INTR_VECTOR; Here. > + > + /* Physical mode for Notification Event */ > + vmx->pi_desc.ndm = 0; And from here... > + dest = cpu_physical_id(vmx->vcpu.cpu); > + > + if (x2apic_enabled()) > + vmx->pi_desc.ndst = dest; > + else > + vmx->pi_desc.ndst = (dest << 8) & 0xFF00; } > + ...to here, the code is useless. The right place to update the PI descriptor is where the vCPU gets loaded, not in initialization. > /* > * Sets up the vmcs for emulated real mode. > */ > @@ -4476,6 +4501,8 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) > > vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR); > vmcs_write64(POSTED_INTR_DESC_ADDR, __pa(&vmx->pi_desc)); > + > + pi_desc_init(vmx); > } > > if (ple_gap) { Best regards, Yang
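Yang's point, that the notification destination should be recomputed where the vCPU gets loaded rather than once at init, can be illustrated with the destination encoding from the quoted patch. The stub type and helper name below are invented; the xAPIC/x2APIC split mirrors the patch hunk (full 32-bit APIC ID in x2APIC mode, 8-bit ID in bits 15:8 otherwise).

```c
#include <stdint.h>

/* Minimal stand-in for the NDST field of the PI descriptor. */
struct pi_desc_model {
    uint32_t ndst;
};

/* Recompute the notification destination each time the vCPU is loaded
 * on a (possibly different) pCPU.  In x2APIC mode NDST holds the full
 * 32-bit APIC ID; in xAPIC mode the 8-bit ID sits in bits 15:8. */
static void pi_update_ndst(struct pi_desc_model *pi,
                           uint32_t dest_apic_id, int x2apic)
{
    pi->ndst = x2apic ? dest_apic_id : ((dest_apic_id << 8) & 0xFF00u);
}
```

Calling such a helper from the vcpu-load path, instead of a one-time `pi_desc_init()`, keeps NDST correct across migrations between pCPUs.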
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Paolo Bonzini wrote on 2014-12-18: > > > On 18/12/2014 04:14, Wu, Feng wrote: >> >> >> linux-kernel-ow...@vger.kernel.org wrote on >> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Paolo: >>> x...@kernel.org; Gleb Natapov; Paolo Bonzini; dw...@infradead.org; >>> joro-zlv9swrftaidnm+yrof...@public.gmane.org; Alex Williamson; >>> joro-zLv9SwRftAIdnm+Jiang >>> Liu >>> Cc: io...@lists.linux-foundation.org; >>> linux-kernel-u79uwxl29ty76z2rm5m...@public.gmane.org; KVM list; >>> Eric Auger >>> Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is >>> set >>> >>> >>> >>> On 12/12/2014 16:14, Feng Wu wrote: Currently, we don't support urgent interrupts; all interrupts are recognized as non-urgent interrupts, so we cannot send a posted-interrupt when 'SN' is set. >>> >>> Can this happen? If the vcpu is in guest mode, it cannot have been >>> scheduled out, and that's the only case when SN is set. >>> >>> Paolo >> >> Currently, the only place where SN is set is when the vCPU is preempted and If the vCPU is preempted, shouldn't the subsequent interrupts be ignored? What happens if a PI occurs while the vCPU is preempted? >> waiting for the next scheduling in the runqueue. But I am not sure >> whether we need to set SN for other purposes in future. Adding SN >> checking here is just to follow the spec: non-urgent interrupts are >> suppressed > when SN is set. > > I would change that to a WARN_ON_ONCE then. Best regards, Yang
RE: [v3 21/26] x86, irq: Define a global vector for VT-d Posted-Interrupts
Feng Wu wrote on 2014-12-12: > Currently, we use a global vector as the Posted-Interrupts > Notification Event for all the vCPUs in the system. We need to > introduce another global vector for VT-d Posted-Interrtups, which will > be used to wakeup the sleep vCPU when an external interrupt from a > direct-assigned device happens for that vCPU. > Hi Feng, Since the idea of two global vectors mechanism is from me, please add me to the comments. > Signed-off-by: Feng Wu > --- > arch/x86/include/asm/entry_arch.h | 2 ++ > arch/x86/include/asm/hardirq.h | 1 + > arch/x86/include/asm/hw_irq.h | 2 ++ > arch/x86/include/asm/irq_vectors.h | 1 + > arch/x86/kernel/entry_64.S | 2 ++ > arch/x86/kernel/irq.c | 27 +++ > arch/x86/kernel/irqinit.c | 2 ++ > 7 files changed, 37 insertions(+) > diff --git a/arch/x86/include/asm/entry_arch.h > b/arch/x86/include/asm/entry_arch.h index dc5fa66..27ca0af 100644 --- > a/arch/x86/include/asm/entry_arch.h +++ > b/arch/x86/include/asm/entry_arch.h @@ -23,6 +23,8 @@ > BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR) #ifdef > CONFIG_HAVE_KVM BUILD_INTERRUPT3(kvm_posted_intr_ipi, POSTED_INTR_VECTOR, >smp_kvm_posted_intr_ipi) > +BUILD_INTERRUPT3(kvm_posted_intr_wakeup_ipi, POSTED_INTR_WAKEUP_VECTOR, > + smp_kvm_posted_intr_wakeup_ipi) > #endif > > /* > diff --git a/arch/x86/include/asm/hardirq.h > b/arch/x86/include/asm/hardirq.h index 0f5fb6b..9866065 100644 > --- a/arch/x86/include/asm/hardirq.h > +++ b/arch/x86/include/asm/hardirq.h > @@ -14,6 +14,7 @@ typedef struct { > #endif #ifdef CONFIG_HAVE_KVMunsigned int kvm_posted_intr_ipis; > +unsigned int kvm_posted_intr_wakeup_ipis; #endifunsigned int > x86_platform_ipis; /* arch dependent */unsigned int apic_perf_irqs; > diff --git a/arch/x86/include/asm/hw_irq.h > b/arch/x86/include/asm/hw_irq.h index e7ae6eb..38fac9b 100644 > --- a/arch/x86/include/asm/hw_irq.h > +++ b/arch/x86/include/asm/hw_irq.h > @@ -29,6 +29,7 @@ > extern asmlinkage void apic_timer_interrupt(void); extern asmlinkage 
> void x86_platform_ipi(void); extern asmlinkage void > kvm_posted_intr_ipi(void); +extern asmlinkage void > kvm_posted_intr_wakeup_ipi(void); > extern asmlinkage void error_interrupt(void); extern asmlinkage void > irq_work_interrupt(void); > > @@ -92,6 +93,7 @@ extern void > trace_call_function_single_interrupt(void); > #define trace_irq_move_cleanup_interrupt irq_move_cleanup_interrupt > #define trace_reboot_interrupt reboot_interrupt #define > trace_kvm_posted_intr_ipi kvm_posted_intr_ipi > +#define trace_kvm_posted_intr_wakeup_ipi kvm_posted_intr_wakeup_ipi > #endif /* CONFIG_TRACING */ > > struct irq_domain; > diff --git a/arch/x86/include/asm/irq_vectors.h > b/arch/x86/include/asm/irq_vectors.h index b26cb12..dca94f2 100644 --- > a/arch/x86/include/asm/irq_vectors.h +++ > b/arch/x86/include/asm/irq_vectors.h @@ -105,6 +105,7 @@ > /* Vector for KVM to deliver posted interrupt IPI */ #ifdef > CONFIG_HAVE_KVM #define POSTED_INTR_VECTOR 0xf2 +#define > POSTED_INTR_WAKEUP_VECTOR0xf1 #endif > > /* > diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S > index e61c14a..a598447 100644 --- a/arch/x86/kernel/entry_64.S +++ > b/arch/x86/kernel/entry_64.S @@ -960,6 +960,8 @@ apicinterrupt > X86_PLATFORM_IPI_VECTOR \ #ifdef CONFIG_HAVE_KVM > apicinterrupt3 POSTED_INTR_VECTOR \ > kvm_posted_intr_ipi smp_kvm_posted_intr_ipi > +apicinterrupt3 POSTED_INTR_WAKEUP_VECTOR \ > + kvm_posted_intr_wakeup_ipi smp_kvm_posted_intr_wakeup_ipi > #endif > > #ifdef CONFIG_X86_MCE_THRESHOLD > diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c index > 922d285..47408c3 100644 > --- a/arch/x86/kernel/irq.c > +++ b/arch/x86/kernel/irq.c > @@ -237,6 +237,9 @@ __visible void smp_x86_platform_ipi(struct pt_regs > *regs) } > > #ifdef CONFIG_HAVE_KVM > +void (*wakeup_handler_callback)(void) = NULL; > +EXPORT_SYMBOL_GPL(wakeup_handler_callback); + > /* > * Handler for POSTED_INTERRUPT_VECTOR. 
> */ > @@ -256,6 +259,30 @@ __visible void smp_kvm_posted_intr_ipi(struct > pt_regs *regs) > > set_irq_regs(old_regs); > } > + > +/* > + * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR. > + */ > +__visible void smp_kvm_posted_intr_wakeup_ipi(struct pt_regs *regs) { > + struct pt_regs *old_regs = set_irq_regs(regs); > + > + ack_APIC_irq(); > + > + irq_enter(); > + > + exit_idle(); > + > + inc_irq_stat(kvm_posted_intr_wakeup_ipis); > + > + if (wakeup_handler_callback) > + wakeup_handler_callback(); > + > + irq_exit(); > + > + set_irq_regs(old_regs); > +} > + > #endif > > __visible void smp_trace_x86_platform_ipi(struct pt_regs *regs) diff > --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c index > 70e181e..844673c 100644 --- a/arch/x86/kernel/irqinit.c +++ > b/arch/x86/kernel/irqinit.c @@ -144,6 +144,8
RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Feng Wu wrote on 2014-12-12: > This patch defines a new interface, kvm_find_dest_vcpu, for > VT-d PI, which returns the destination vCPU of the > interrupt for guests. > > Since VT-d PI cannot handle broadcast/multicast interrupts, > here we only handle Fixed and Lowest priority interrupts. > > The current method of handling guest lowest-priority interrupts > is to use a counter 'apic_arb_prio' for each vCPU: we choose the > vCPU with the smallest 'apic_arb_prio' and then increase it by 1. > However, for VT-d PI, we cannot re-use this, since we no longer > have control of 'apic_arb_prio' with posted-interrupt direct > delivery by hardware. > > Here, we introduce a similar way with 'apic_arb_prio' to handle guest > lowest-priority interrupts when VT-d PI is used. Here is the idea: - > Each vCPU has a counter 'round_robin_counter'. - When a guest sets an > interrupt to lowest priority, we choose the vCPU with the smallest > 'round_robin_counter' as the destination, then increase it. How can this work well? All subsequent interrupts are delivered to one vCPU? It isn't the best solution; it needs more consideration. Also, I think you should take apic_arb_prio into consideration, since the priority is for the whole vCPU, not for one interrupt. Best regards, Yang
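The round-robin scheme described in the patch can be sketched as follows. This is an illustrative stand-alone model, not the KVM implementation: `vcpu_stub`, the `matches_dest` flag, and the flat array are invented for the sketch.

```c
#include <limits.h>

/* Stand-in for a vCPU with the per-vCPU counter the patch proposes. */
struct vcpu_stub {
    int matches_dest;              /* vCPU is in the interrupt's dest set */
    unsigned int round_robin_counter;
};

/* Pick the matching vCPU with the smallest round_robin_counter as the
 * destination of a lowest-priority interrupt, then bump its counter so
 * the next interrupt rotates to a different vCPU. */
static int pick_lowest_priority(struct vcpu_stub *vcpus, int n)
{
    int best = -1;
    unsigned int best_cnt = UINT_MAX;

    for (int i = 0; i < n; i++) {
        if (!vcpus[i].matches_dest)
            continue;
        if (vcpus[i].round_robin_counter < best_cnt) {
            best_cnt = vcpus[i].round_robin_counter;
            best = i;
        }
    }
    if (best >= 0)
        vcpus[best].round_robin_counter++;
    return best;   /* index of chosen vCPU, or -1 if none matches */
}
```

Note that this only rotates the destination chosen at interrupt-setup time; once the IRTE is programmed, subsequent interrupts keep landing on that vCPU, which is exactly the weakness Yang raises.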
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Feng Wu wrote on 2014-12-12: > We don't need to migrate the irqs for VT-d Posted-Interrupts here. > When 'pst' is set in the IRTE, the associated irq will be posted to guests > instead of interrupt remapping. The destination of the interrupt is > set in the Posted-Interrupts Descriptor, and the migration happens during > vCPU scheduling. > > However, we still update the cached irte here, which can be used when > changing back to remapping mode. > > Signed-off-by: Feng Wu > Reviewed-by: Jiang Liu > --- > drivers/iommu/intel_irq_remapping.c | 6 +- > 1 file changed, 5 insertions(+), 1 deletion(-) > diff --git a/drivers/iommu/intel_irq_remapping.c > b/drivers/iommu/intel_irq_remapping.c index 48c2051..ab9057a 100644 --- > a/drivers/iommu/intel_irq_remapping.c +++ > b/drivers/iommu/intel_irq_remapping.c @@ -977,6 +977,7 @@ > intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask, > { > struct intel_ir_data *ir_data = data->chip_data; struct irte > *irte = > &ir_data->irte_entry; + struct irte_pi *irte_pi = (struct irte_pi > *)irte; struct irq_cfg *cfg = irqd_cfg(data); struct irq_data *parent > = data->parent_data; int ret; > @@ -991,7 +992,10 @@ intel_ir_set_affinity(struct irq_data *data, > const struct cpumask *mask, >*/ > irte->vector = cfg->vector; > irte->dest_id = IRTE_DEST(cfg->dest_apicid); > - modify_irte(&ir_data->irq_2_iommu, irte); > + > + /* We don't need to modify the irte if the interrupt is for posting. */ > + if (irte_pi->pst != 1) > + modify_irte(&ir_data->irq_2_iommu, irte); What happens if the user changes the IRQ affinity manually? > > /* >* After this point, all the interrupts will start arriving Best regards, Yang
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Feng Wu wrote on 2014-12-12: We don't need to migrate the irqs for VT-d Posted-Interrupts here. When 'pst' is set in IRTE, the associated irq will be posted to guests instead of interrupt remapping. The destination of the interrupt is set in Posted-Interrupts Descriptor, and the migration happens during vCPU scheduling. However, we still update the cached irte here, which can be used when changing back to remapping mode. Signed-off-by: Feng Wu feng...@intel.com Reviewed-by: Jiang Liu jiang@linux.intel.com --- drivers/iommu/intel_irq_remapping.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c index 48c2051..ab9057a 100644 --- a/drivers/iommu/intel_irq_remapping.c +++ b/drivers/iommu/intel_irq_remapping.c @@ -977,6 +977,7 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask, { struct intel_ir_data *ir_data = data-chip_data;struct irte *irte = ir_data-irte_entry; + struct irte_pi *irte_pi = (struct irte_pi *)irte; struct irq_cfg *cfg = irqd_cfg(data); struct irq_data *parent = data-parent_data; int ret; @@ -991,7 +992,10 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask, */ irte-vector = cfg-vector; irte-dest_id = IRTE_DEST(cfg-dest_apicid); - modify_irte(ir_data-irq_2_iommu, irte); + + /* We don't need to modify irte if the interrupt is for posting. */ + if (irte_pi-pst != 1) + modify_irte(ir_data-irq_2_iommu, irte); What happens if user changes the IRQ affinity manually? /* * After this point, all the interrupts will start arriving Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Feng Wu wrote on 2014-12-12: This patch defines a new interface kvm_find_dest_vcpu for VT-d PI, which can returns the destination vCPU of the interrupt for guests. Since VT-d PI cannot handle broadcast/multicast interrupt, Here we only handle Fixed and Lowest priority interrupts. The current method of handling guest lowest priority interrtups is to use a counter 'apic_arb_prio' for each vCPU, we choose the vCPU with smallest 'apic_arb_prio' and then increase it by 1. However, for VT-d PI, we cannot re-use this, since we no longer have control to 'apic_arb_prio' with posted interrupt direct delivery by Hardware. Here, we introduce a similar way with 'apic_arb_prio' to handle guest lowest priority interrtups when VT-d PI is used. Here is the ideas: - Each vCPU has a counter 'round_robin_counter'. - When guests sets an interrupts to lowest priority, we choose the vCPU with smallest 'round_robin_counter' as the destination, then increase it. How this can work well? All subsequent interrupts are delivered to one vCPU? It shouldn't be the best solution, need more consideration. Also, I think you should take the apic_arb_prio into consider since the priority is for the whole vCPU not for one interrupt. Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [v3 21/26] x86, irq: Define a global vector for VT-d Posted-Interrupts
Feng Wu wrote on 2014-12-12:

Currently, we use a global vector as the Posted-Interrupts Notification Event for all the vCPUs in the system. We need to introduce another global vector for VT-d Posted-Interrupts, which will be used to wake up a sleeping vCPU when an external interrupt from a direct-assigned device arrives for that vCPU.

Hi Feng, since the idea of the two global vectors mechanism is from me, please add me to the comments.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/include/asm/entry_arch.h  |  2 ++
 arch/x86/include/asm/hardirq.h     |  1 +
 arch/x86/include/asm/hw_irq.h      |  2 ++
 arch/x86/include/asm/irq_vectors.h |  1 +
 arch/x86/kernel/entry_64.S         |  2 ++
 arch/x86/kernel/irq.c              | 27 +++
 arch/x86/kernel/irqinit.c          |  2 ++
 7 files changed, 37 insertions(+)

diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index dc5fa66..27ca0af 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -23,6 +23,8 @@ BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR)
 #ifdef CONFIG_HAVE_KVM
 BUILD_INTERRUPT3(kvm_posted_intr_ipi, POSTED_INTR_VECTOR,
 		 smp_kvm_posted_intr_ipi)
+BUILD_INTERRUPT3(kvm_posted_intr_wakeup_ipi, POSTED_INTR_WAKEUP_VECTOR,
+		 smp_kvm_posted_intr_wakeup_ipi)
 #endif

 /*
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 0f5fb6b..9866065 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -14,6 +14,7 @@ typedef struct {
 #endif
 #ifdef CONFIG_HAVE_KVM
 	unsigned int kvm_posted_intr_ipis;
+	unsigned int kvm_posted_intr_wakeup_ipis;
 #endif
 	unsigned int x86_platform_ipis;	/* arch dependent */
 	unsigned int apic_perf_irqs;
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index e7ae6eb..38fac9b 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -29,6 +29,7 @@ extern asmlinkage void apic_timer_interrupt(void);
 extern asmlinkage void x86_platform_ipi(void);
 extern asmlinkage void kvm_posted_intr_ipi(void);
+extern asmlinkage void kvm_posted_intr_wakeup_ipi(void);
 extern asmlinkage void error_interrupt(void);
 extern asmlinkage void irq_work_interrupt(void);
@@ -92,6 +93,7 @@ extern void trace_call_function_single_interrupt(void);
 #define trace_irq_move_cleanup_interrupt irq_move_cleanup_interrupt
 #define trace_reboot_interrupt reboot_interrupt
 #define trace_kvm_posted_intr_ipi kvm_posted_intr_ipi
+#define trace_kvm_posted_intr_wakeup_ipi kvm_posted_intr_wakeup_ipi
 #endif /* CONFIG_TRACING */

 struct irq_domain;
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index b26cb12..dca94f2 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -105,6 +105,7 @@
 /* Vector for KVM to deliver posted interrupt IPI */
 #ifdef CONFIG_HAVE_KVM
 #define POSTED_INTR_VECTOR		0xf2
+#define POSTED_INTR_WAKEUP_VECTOR	0xf1
 #endif

 /*
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index e61c14a..a598447 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -960,6 +960,8 @@ apicinterrupt X86_PLATFORM_IPI_VECTOR \
 #ifdef CONFIG_HAVE_KVM
 apicinterrupt3 POSTED_INTR_VECTOR \
 	kvm_posted_intr_ipi smp_kvm_posted_intr_ipi
+apicinterrupt3 POSTED_INTR_WAKEUP_VECTOR \
+	kvm_posted_intr_wakeup_ipi smp_kvm_posted_intr_wakeup_ipi
 #endif

 #ifdef CONFIG_X86_MCE_THRESHOLD
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 922d285..47408c3 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -237,6 +237,9 @@ __visible void smp_x86_platform_ipi(struct pt_regs *regs)
 }

 #ifdef CONFIG_HAVE_KVM
+void (*wakeup_handler_callback)(void) = NULL;
+EXPORT_SYMBOL_GPL(wakeup_handler_callback);
+
 /*
  * Handler for POSTED_INTERRUPT_VECTOR.
  */
@@ -256,6 +259,30 @@ __visible void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
 	set_irq_regs(old_regs);
 }

+/*
+ * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
+ */
+__visible void smp_kvm_posted_intr_wakeup_ipi(struct pt_regs *regs)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+
+	ack_APIC_irq();
+
+	irq_enter();
+
+	exit_idle();
+
+	inc_irq_stat(kvm_posted_intr_wakeup_ipis);
+
+	if (wakeup_handler_callback)
+		wakeup_handler_callback();
+
+	irq_exit();
+
+	set_irq_regs(old_regs);
+}
+
 #endif

 __visible void smp_trace_x86_platform_ipi(struct pt_regs *regs)
diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
index 70e181e..844673c 100644
--- a/arch/x86/kernel/irqinit.c
+++ b/arch/x86/kernel/irqinit.c
@@ -144,6 +144,8 @@ static void __init apic_intr_init(void)
 #ifdef CONFIG_HAVE_KVM
 	/* IPI for KVM to deliver
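The patch's dispatch scheme - an exported function pointer that the kvm-intel module fills in, which the new wakeup-vector handler invokes if present - can be sketched as the following user-space model. This is an illustrative stand-in only (the counters, `kvm_wakeup_handler`, and the `_model` suffixes are invented names, and the real handler also does the APIC ack and irq_enter/irq_exit bookkeeping shown in the diff):

```c
#include <assert.h>
#include <stddef.h>

/* Model of the hook exported by irq.c: kvm-intel would install its
 * wakeup routine here at module load, and the handler for
 * POSTED_INTR_WAKEUP_VECTOR calls it only when one is registered. */
static void (*wakeup_handler_callback)(void) = NULL;
static unsigned int kvm_posted_intr_wakeup_ipis;
static int woken;

/* Stand-in for KVM's routine, e.g. moving a blocked vCPU back to the
 * run queue. */
static void kvm_wakeup_handler(void)
{
    woken = 1;
}

/* Mirrors the control flow of smp_kvm_posted_intr_wakeup_ipi():
 * account the IPI, then dispatch to KVM if a handler is present. */
static void smp_kvm_posted_intr_wakeup_ipi_model(void)
{
    kvm_posted_intr_wakeup_ipis++;
    if (wakeup_handler_callback)
        wakeup_handler_callback();
}
```

The function-pointer indirection is what keeps irq.c free of any KVM dependency: with no handler registered, the IPI is simply counted and dropped.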
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Paolo Bonzini wrote on 2014-12-18:

On 18/12/2014 04:14, Wu, Feng wrote:

Paolo Bonzini wrote on 12/12/2014 16:14, quoting Feng Wu:

Currently, we don't support urgent interrupts; all interrupts are recognized as non-urgent interrupts, so we cannot send a posted interrupt when 'SN' is set.

Can this happen? If the vcpu is in guest mode, it cannot have been scheduled out, and that's the only case when SN is set.

Paolo

Currently, the only place where SN is set is when a vCPU is preempted and waiting for the next scheduling in the runqueue.

If the vCPU is preempted, shouldn't subsequent interrupts be ignored anyway? What happens if a posted interrupt occurs while the vCPU is preempted?

But I am not sure whether we need to set SN for other purposes in the future. Adding the SN check here is just to follow the spec: non-urgent interrupts are suppressed when SN is set.

I would change that to a WARN_ON_ONCE then.

Best regards,
Yang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
RE: [v3 12/26] KVM: Initialize VT-d Posted-Interrupts Descriptor
Feng Wu wrote on 2014-12-12:

This patch initializes the VT-d Posted-Interrupts Descriptor.

Signed-off-by: Feng Wu feng...@intel.com
---
 arch/x86/kvm/vmx.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0b1383e..66ca275 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -45,6 +45,7 @@
 #include <asm/perf_event.h>
 #include <asm/debugreg.h>
 #include <asm/kexec.h>
+#include <asm/irq_remapping.h>

 #include "trace.h"
@@ -4433,6 +4434,30 @@ static void ept_set_mmio_spte_mask(void)
 	kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
 }

+static void pi_desc_init(struct vcpu_vmx *vmx)
+{
+	unsigned int dest;
+
+	if (!irq_remapping_cap(IRQ_POSTING_CAP))
+		return;
+
+	/*
+	 * Initialize Posted-Interrupt Descriptor
+	 */
+
+	pi_clear_sn(&vmx->pi_desc);
+	vmx->pi_desc.nv = POSTED_INTR_VECTOR;

Here.

+
+	/* Physical mode for Notification Event */
+	vmx->pi_desc.ndm = 0;

And from here..

+	dest = cpu_physical_id(vmx->vcpu.cpu);
+
+	if (x2apic_enabled())
+		vmx->pi_desc.ndst = dest;
+	else
+		vmx->pi_desc.ndst = (dest << 8) & 0xFF00;
+}
+

..to here is useless. The right place to update the PI descriptor is where the vCPU gets loaded, not at initialization.

 /*
  * Sets up the vmcs for emulated real mode.
  */
@@ -4476,6 +4501,8 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 		vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR);
 		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
+
+		pi_desc_init(vmx);
 	}

 	if (ple_gap) {

Best regards,
Yang
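The NDST encoding the patch computes follows from the two APIC destination formats: in x2APIC mode the field holds the full 32-bit APIC ID, while in xAPIC mode the 8-bit APIC ID occupies bits 15:8. A minimal sketch of that computation (the function name is illustrative, not a kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Compute the Posted-Interrupt Descriptor's NDST field for a physical
 * destination. x2APIC: full 32-bit APIC ID. xAPIC: 8-bit APIC ID
 * placed in bits 15:8, as in the (dest << 8) & 0xFF00 expression in
 * the patch above. */
static uint32_t pi_desc_ndst(bool x2apic, uint32_t apic_id)
{
    return x2apic ? apic_id : ((apic_id << 8) & 0xFF00);
}
```

Yang's review point is about *when* this value is written, not how: the pCPU a vCPU runs on changes at every vcpu_load, so computing NDST once at vCPU setup would leave a stale destination after the first migration.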
RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Paolo Bonzini wrote on 2014-12-19:

On 18/12/2014 15:49, Zhang, Yang Z wrote:

Here, we introduce a way similar to 'apic_arb_prio' to handle guest lowest-priority interrupts when VT-d PI is used. Here is the idea:
- Each vCPU has a counter 'round_robin_counter'.
- When a guest sets an interrupt to lowest priority, we choose the vCPU with the smallest 'round_robin_counter' as the destination, then increase it.

How can this work well? All subsequent interrupts are delivered to one vCPU? It doesn't look like the best solution; it needs more consideration.

Well, it's a hardware limitation. The alternative (which is easy to implement) is to only do PI for single-CPU interrupts. This should work well for multiqueue NICs (and of course for UP guests :)), so perhaps it's a good idea to only support that as a first attempt.

Agreed, it is limited by hardware. But lowest-priority mode distributes interrupts more efficiently than fixed mode, and the current implementation effectively turns lowest-priority mode into fixed mode. In an interrupt-intensive environment this may become a bottleneck, and the VM may not benefit greatly from VT-d PI. But agreed again, it really is a hardware limitation.

The easier way is to deliver the interrupt to the first matched vCPU we find. The round_robin_counter really helps nothing here, since the interrupt is delivered by hardware directly.

Paolo

Also, I think you should take apic_arb_prio into consideration, since that priority is for the whole vCPU, not for one interrupt.

Best regards,
Yang
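The round_robin_counter proposal being debated can be sketched as a toy model (all names are illustrative, not kernel code): among the vCPUs matching the interrupt's destination, pick the one with the smallest counter and bump it, so successive lowest-priority interrupts rotate across the set.

```c
#include <assert.h>

/* Toy model of the proposed scheme for guest lowest-priority
 * interrupts under VT-d PI. */
struct vcpu {
    int id;
    unsigned int round_robin_counter;
};

/* Pick the matched vCPU with the smallest round_robin_counter and
 * increment it, rotating delivery over repeated calls. */
static struct vcpu *pick_lowest_prio(struct vcpu *v, int n)
{
    struct vcpu *best = &v[0];
    for (int i = 1; i < n; i++)
        if (v[i].round_robin_counter < best->round_robin_counter)
            best = &v[i];
    best->round_robin_counter++;
    return best;
}
```

Yang's objection is visible in the model: the rotation happens only at the moment the IRTE destination is programmed. Once hardware posts the interrupt directly, every subsequent occurrence lands on the vCPU chosen at programming time, so the counter does not actually balance the interrupt load.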
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Wu, Feng wrote on 2014-12-19:

Zhang, Yang Z wrote on 2014-12-18, quoting Feng Wu (2014-12-12):

We don't need to migrate the irqs for VT-d Posted-Interrupts here. When 'pst' is set in the IRTE, the associated irq will be posted to guests instead of interrupt-remapped. The destination of the interrupt is set in the Posted-Interrupts Descriptor, and the migration happens during vCPU scheduling. However, we still update the cached irte here, which can be used when changing back to remapping mode.

Signed-off-by: Feng Wu feng...@intel.com
Reviewed-by: Jiang Liu jiang@linux.intel.com
---
 drivers/iommu/intel_irq_remapping.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 48c2051..ab9057a 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -977,6 +977,7 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask,
 {
 	struct intel_ir_data *ir_data = data->chip_data;
 	struct irte *irte = &ir_data->irte_entry;
+	struct irte_pi *irte_pi = (struct irte_pi *)irte;
 	struct irq_cfg *cfg = irqd_cfg(data);
 	struct irq_data *parent = data->parent_data;
 	int ret;
@@ -991,7 +992,10 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	 */
 	irte->vector = cfg->vector;
 	irte->dest_id = IRTE_DEST(cfg->dest_apicid);
-	modify_irte(&ir_data->irq_2_iommu, irte);
+
+	/* We don't need to modify the irte if the interrupt is for posting. */
+	if (irte_pi->pst != 1)
+		modify_irte(&ir_data->irq_2_iommu, irte);

What happens if the user changes the IRQ affinity manually?

If the IRQ is posted, its affinity is controlled by the guest (irq -> vCPU -> pCPU); a host-side affinity change has no effect.
That's the problem: the user is able to change it in the host, but the change never takes effect since the affinity is actually controlled by the guest. I guess it will break IRQ balancing too.

Thanks,
Feng

	/*
	 * After this point, all the interrupts will start arriving

Best regards,
Yang
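The behavior under discussion - always update the cached IRTE, but skip the hardware write when the entry is in posted (pst=1) mode - can be modeled minimally as follows. All names here are illustrative stand-ins, not the kernel's types:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified IRTE with just the fields relevant to the discussion. */
struct irte_model {
    uint8_t  pst;      /* 1 = posted mode, 0 = remapped mode */
    uint8_t  vector;
    uint32_t dest_id;
};

static int hw_irte_writes;  /* counts modify_irte() calls in this model */

/* Mirror of the affinity path: the cached copy always tracks the
 * requested affinity (needed when switching back to remapping mode),
 * but the hardware IRTE is only rewritten for remapped entries, since
 * a posted entry's effective destination comes from the PI descriptor. */
static void set_affinity_model(struct irte_model *irte,
                               uint8_t vector, uint32_t dest)
{
    irte->vector = vector;
    irte->dest_id = dest;
    if (irte->pst != 1)
        hw_irte_writes++;   /* stand-in for modify_irte() */
}
```

The model also makes Yang's complaint concrete: for a pst=1 entry, a host-initiated affinity change updates only the software cache, so from the host's perspective (and irqbalance's) the request silently does nothing.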
RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Wu, Feng wrote on 2014-12-19:

Paolo Bonzini wrote on 2014-12-19:

On 18/12/2014 15:49, Zhang, Yang Z wrote:

Here, we introduce a way similar to 'apic_arb_prio' to handle guest lowest-priority interrupts when VT-d PI is used. Here is the idea:
- Each vCPU has a counter 'round_robin_counter'.
- When a guest sets an interrupt to lowest priority, we choose the vCPU with the smallest 'round_robin_counter' as the destination, then increase it.

How can this work well? All subsequent interrupts are delivered to one vCPU? It doesn't look like the best solution; it needs more consideration.

Well, it's a hardware limitation. The alternative (which is easy to implement) is to only do PI for single-CPU interrupts. This should work well for multiqueue NICs (and of course for UP guests :)), so perhaps it's a good idea to only support that as a first attempt.

Paolo

Paolo, what do you mean by single-CPU interrupts? Do you mean we don't support lowest-priority interrupts for PI? But a Linux guest OS uses lowest priority in most cases, so if so, we can hardly get any benefit from this feature for Linux guests.

It should be the same idea I mentioned on the other thread: deliver the interrupt to a single vCPU (maybe the first matched vCPU?).

Thanks,
Feng

Also, I think you should take apic_arb_prio into consideration, since that priority is for the whole vCPU, not for one interrupt.

Best regards,
Yang
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Wu, Feng wrote on 2014-12-19:

Zhang, Yang Z wrote, quoting the earlier exchange:

Paolo Bonzini wrote on 2014-12-18:

On 18/12/2014 04:14, Wu, Feng wrote, quoting Feng Wu (12/12/2014 16:14):

Currently, we don't support urgent interrupts; all interrupts are recognized as non-urgent interrupts, so we cannot send a posted interrupt when 'SN' is set.

Can this happen? If the vcpu is in guest mode, it cannot have been scheduled out, and that's the only case when SN is set.

Paolo

Currently, the only place where SN is set is when a vCPU is preempted and waiting for the next scheduling in the runqueue.

If the vCPU is preempted, shouldn't subsequent interrupts be ignored anyway? What happens if a posted interrupt occurs while the vCPU is preempted?

If a vCPU is preempted, the 'SN' bit is set, and subsequent interrupts are suppressed for posting.

I mean, what happens if we don't set the SN bit? From my point of view, if the preempting task has already disabled interrupts, it is OK to leave the SN bit as zero. But if the preempting task has interrupts enabled, doesn't that mean it allows interrupts to happen? BTW, since there is already an ON bit, at most one notification can be outstanding, so it doesn't hurt performance. Do we really need to set the SN bit?

Thanks,
Feng

But I am not sure whether we need to set SN for other purposes in the future. Adding the SN check here is just to follow the spec: non-urgent interrupts are suppressed when SN is set.

I would change that to a WARN_ON_ONCE then.

Best regards,
Yang
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Wu, Feng wrote on 2014-12-19:

Zhang, Yang Z wrote on 2014-12-19, quoting the earlier exchange:

On 12/12/2014 16:14, Feng Wu wrote:

Currently, we don't support urgent interrupts; all interrupts are recognized as non-urgent interrupts, so we cannot send a posted interrupt when 'SN' is set.

Can this happen? If the vcpu is in guest mode, it cannot have been scheduled out, and that's the only case when SN is set.

Paolo

Currently, the only place where SN is set is when a vCPU is preempted and waiting for the next scheduling in the runqueue.

If the vCPU is preempted, shouldn't subsequent interrupts be ignored anyway? What happens if a posted interrupt occurs while the vCPU is preempted?

If a vCPU is preempted, the 'SN' bit is set, and subsequent interrupts are suppressed for posting.

I mean, what happens if we don't set the SN bit? From my point of view, if the preempting task has already disabled interrupts, it is OK to leave the SN bit as zero. But if the preempting task has interrupts enabled, doesn't that mean it allows interrupts to happen? BTW, since there is already an ON bit, at most one notification can be outstanding, so it doesn't hurt performance. Do we really need to set the SN bit?

See this scenario: vCPU0 is running on pCPU0 -> vCPU0 is preempted by vCPU1 -> vCPU1 is now running on pCPU0 while vCPU0 waits in the runqueue for its next scheduling.

If we don't set SN for vCPU0, then all subsequent interrupts for vCPU0 are posted to vCPU1; this consumes hardware and software effort and in fact is not needed at all. If SN is set for vCPU0, VT-d hardware will not issue a Notification Event for vCPU0 when an interrupt arrives for it, but will just set the related PIR bit.

The PI vector for vCPU1 is the notification vector, but the PI vector for vCPU0 should be the wakeup vector. Why would vCPU1 consume this PI event?

Thanks,
Feng

But I am not sure whether we need to set SN for other purposes in the future. Adding the SN check here is just to follow the spec: non-urgent interrupts are suppressed when SN is set.

I would change that to a WARN_ON_ONCE then.

Best regards,
Yang
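The delivery rule being argued over - the PIR bit is always recorded, but a notification is sent only when ON transitions from 0 to 1 and SN is clear - can be summarized in a toy software model. This is illustrative only (real posting is done by VT-d hardware, and the struct layout here is simplified, not the architectural descriptor):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified Posted-Interrupt Descriptor state: a 256-bit PIR plus
 * the ON (outstanding notification) and SN (suppress notification)
 * control bits. */
struct pi_desc_model {
    uint64_t pir[4];
    int on;
    int sn;
};

static int notifications;   /* notification events "sent" by the model */

/* Model the hardware posting rule: always latch the vector in PIR;
 * send a notification (and set ON) only if none is outstanding and
 * suppression is off. With SN set for a preempted vCPU, interrupts
 * accumulate silently in PIR. */
static void post_interrupt(struct pi_desc_model *pi, uint8_t vec)
{
    pi->pir[vec / 64] |= 1ull << (vec % 64);
    if (!pi->on && !pi->sn) {
        pi->on = 1;
        notifications++;    /* would be an IPI to the notification vector */
    }
}
```

This captures both halves of the thread: Feng's point that ON alone already limits the cost to one outstanding notification, and his scenario that with SN set the preempted vCPU's interrupts are merely latched in PIR, to be found later via the wakeup path rather than by a spurious notification to whoever now runs on that pCPU.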