RE: [PATCH] KVM: x86: always set accessed bit in shadow PTEs
Paolo Bonzini wrote on 2015-11-13: > Commit 7a1638ce4220 ("nEPT: Redefine EPT-specific link_shadow_page()", > 2013-08-05) says: > > Since nEPT doesn't support A/D bit, we should not set those bit > when building the shadow page table. > but this is not necessary. Even though nEPT doesn't support A/D > bits, and hence the vmcs12 EPT pointer will never enable them, > we always use them for shadow page tables if available (see > construct_eptp in vmx.c). So we can set the A/D bits freely > in the shadow page table. > > This patch hence basically reverts commit 7a1638ce4220. > > Cc: Yang Zhang> Cc: Takuya Yoshikawa > Signed-off-by: Paolo Bonzini Looks good to me. Best regards, Yang -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] kvm/vmx: EPTP switching test
Michael S. Tsirkin wrote on 2015-11-16: > This patch adds a new parameter: eptp_switching_test, which enables > testing EPT switching on VMX if supported by hardware. All EPT > entries are initialized to the same value so this adds no useful > functionality by itself, but can be used to test VMFUNC performance, > and serve as a basis for future features based on EPTP switching. > > Support for nested virt is not enabled. > > This was tested using the following code within guest: > #define VMX_VMFUNC ".byte 0x0f,0x01,0xd4" > static void vmfunc(unsigned int nr, unsigned int ept) > { > asm volatile(VMX_VMFUNC >: >: "a"(nr), "c"(ept) >: "memory"); > } > > VMFUNC instruction cost was measured at ~122 cycles. > (Note: recent versions of gnu toolchain support the vmfunc > instruction - removing the need for writing the bytecode manually). > > Signed-off-by: Michael S. Tsirkin> --- > > I think I'd like to put this upstream so future eptp switching work > can be implemented on top. Comments? We have a different version in hand which is using separate EPTP. As you known, the patch will be more complex if using separate EPTP. And there are still lots of thing need to do in our version. We will send out for comments soon. Best regards, Yang N�r��yb�X��ǧv�^�){.n�+h����ܨ}���Ơz�:+v���zZ+��+zf���h���~i���z��w���?�&�)ߢf
RE: [PATCH v3] KVM: x86: INIT and reset sequences are different
Paolo Bonzini wrote on 2015-10-01: Hi Paolo Sorry for the late reply. I am just back from vacation. > > > On 13/04/2015 13:34, Nadav Amit wrote: >> x86 architecture defines differences between the reset and INIT >> sequences. INIT does not initialize the FPU (including MMX, XMM, YMM, >> etc.), TSC, PMU, MSRs (in general), MTRRs machine-check, APIC ID, APIC >> arbitration ID and BSP. >> >> References (from Intel SDM): >> >> "If the MP protocol has completed and a BSP is chosen, subsequent INITs >> (either to a specific processor or system wide) do not cause the MP >> protocol to be repeated." [8.4.2: MP Initialization Protocol >> Requirements and Restrictions] >> >> [Table 9-1. IA-32 Processor States Following Power-up, Reset, or INIT] >> >> "If the processor is reset by asserting the INIT# pin, the x87 FPU state is >> not >> changed." [9.2: X87 FPU INITIALIZATION] >> >> "The state of the local APIC following an INIT reset is the same as it is >> after >> a power-up or hardware reset, except that the APIC ID and arbitration ID >> registers are not affected." [10.4.7.3: Local APIC State After an INIT Reset >> (“Wait-for-SIPI” State)] >> >> Signed-off-by: Nadav Amit>> >> --- >> >> v3: >> >> - Leave EFER unchanged on INIT. Instead, set cr0 correctly so vmx_set_cr0 > would >> recognize that paging was changed from on to off and clear LMA. > > I wonder if this change from v2 to v3 was correct. > > It means that a 32-bit firmware cannot enter paging mode without > clearing EFER.LME first (which it should not know about). > > Yang, can you check what real hardware does to EFER on an INIT? Perhaps > it only clears EFER.LME (in addition of course to EFER.LMA, which is > cleared as a side effect of writing CR0). Sure, I will check it with our hardware expert. > > Thanks, > > Paolo Best regards, Yang
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Zhang, Yang Z wrote on 2015-08-04: Paolo Bonzini wrote on 2015-08-04: On 04/08/2015 02:46, Zhang, Yang Z wrote: It is a problem for split irqchip, where the EOI exit bitmap can be inferred from the IOAPIC routes but the TMR cannot. The hardware behavior on the other hand can be implemented purely within the LAPIC. So updating the TMR within LAPIC is the only solution to handle it? It's the simplest and the one that makes most sense. Considering that TMR is a pretty obscure feature, it's unlikely that it will be accelerated in the future. You may be right. It is safe if no future hardware plans to use it. Let me check with our hardware team to see whether it will be used or not in future. After checking with Jun, there is no guarantee that the guest running on another CPU will operate properly if hypervisor modify the vTMR from another CPU. So the hypervisor should not to do it. Paolo Best regards, Yang Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-08-04: On 04/08/2015 02:46, Zhang, Yang Z wrote: It is a problem for split irqchip, where the EOI exit bitmap can be inferred from the IOAPIC routes but the TMR cannot. The hardware behavior on the other hand can be implemented purely within the LAPIC. So updating the TMR within LAPIC is the only solution to handle it? It's the simplest and the one that makes most sense. Considering that TMR is a pretty obscure feature, it's unlikely that it will be accelerated in the future. You may be right. It is safe if no future hardware plans to use it. Let me check with our hardware team to see whether it will be used or not in future. Paolo Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-08-03: On 03/08/2015 12:23, Zhang, Yang Z wrote: In any case, the TMR behavior introduced by the APICv patches is completely different from the hardware behavior, so it has to be fixed. But any real problem with it? It is a problem for split irqchip, where the EOI exit bitmap can be inferred from the IOAPIC routes but the TMR cannot. The hardware behavior on the other hand can be implemented purely within the LAPIC. So updating the TMR within LAPIC is the only solution to handle it? The alternative is to inject level-triggered interrupts synchronously, without using posted interrupts. I'll write some testcases to understand the functioning of TMR in the virtual-APIC page, but the manual seems clear to me. Currently, no existing hardware will use TMR and will not cause any problem.(That's the reason why we leave it in Xen).But we don't know whether future hardware will use it or not(SDM always keeps changing :)). But that would be covered by a different execution control (for backwards compatibility). We'll get there when such a feature is introduced. Yes, we can leave it in future. But one concern is that it may hard to handle it at that time if someone also develops feature which rely on it (like current patch to split irqchip). And per 24.11.4's description, the perfect solution is don't modify it. btw, IIRC, only TMR doesn't follow the rule. All other VMCS accesses are issued in right VMCS context. Yes, that's correct. It's just the TMR. Paolo Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-08-03: On 03/08/2015 04:37, Zhang, Yang Z wrote: Only virtualized APIC register reads use the virtual TMR registers (SDM 29.4.2 or 29.5), but these just read data from the corresponding field in the virtual APIC page. 24.11.4 Software Access to Related Structures In addition to data in the VMCS region itself, VMX non-root operation can be controlled by data structures that are referenced by pointers in a VMCS (for example, the I/O bitmaps). While the pointers to these data structures are parts of the VMCS, the data structures themselves are not. They are not accessible using VMREAD and VMWRITE but by ordinary memory writes. I don't think the TMR fields of the virtual-APIC page are among the data structures that _controls_ VMX non-root operations, because: * it is not part of the virtualized APIC state is listed in 29.1.1 * read accesses from the APIC-access page simply return data from the corresponding page offset on the virtual-APIC page using the memory access type stored in IA32_VMX_BASIC_MSR. I think this explicitly says that the effects of 24.11.1 (especially non-deterministic behavior after a write) do not apply here. In any case, the TMR behavior introduced by the APICv patches is completely different from the hardware behavior, so it has to be fixed. But any real problem with it? The alternative is to inject level-triggered interrupts synchronously, without using posted interrupts. I'll write some testcases to understand the functioning of TMR in the virtual-APIC page, but the manual seems clear to me. Currently, no existing hardware will use TMR and will not cause any problem.(That's the reason why we leave it in Xen).But we don't know whether future hardware will use it or not(SDM always keeps changing :)).And per 24.11.4's description, the perfect solution is don't modify it. btw, IIRC, only TMR doesn't follow the rule. All other VMCS accesses are issued in right VMCS context. Paolo Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-07-31: On 31/07/2015 01:26, Zhang, Yang Z wrote: Do not compute TMR in advance. Instead, set the TMR just before the interrupt is accepted into the IRR. This limits the coupling between IOAPIC and LAPIC. Uh.., it back to original way which is wrong. You cannot modify the apic page(here is the TMR reg) directly when the corresponding VMCS may be used at same time. Where is this documented? The TMR is not part of the set of virtualized APIC registers (TPR, PPR, EOI, ISR, IRR, ICR+ICR2; SDM 29.1.1). Only virtualized APIC register reads use the virtual TMR registers (SDM 29.4.2 or 29.5), but these just read data from the corresponding field in the virtual APIC page. It's not only for virtual apic page. All data structures that is used by VMCS(either directly inside in VMCS or referenced by pointer in a VMCS) need follow this rule: 24.11.4 Software Access to Related Structures In addition to data in the VMCS region itself, VMX non-root operation can be controlled by data structures that are referenced by pointers in a VMCS (for example, the I/O bitmaps). While the pointers to these data structures are parts of the VMCS, the data structures themselves are not. They are not accessible using VMREAD and VMWRITE but by ordinary memory writes. Software should ensure that each such data structure is modified only when no logical processor with a current VMCS that references it is in VMX non-root operation. Doing otherwise may lead to unpredictable behavior (including behaviors identified in Section 24.11.1). Paolo Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-07-31: On 31/07/2015 04:49, Steve Rutherford wrote: Oh... Yeah. That's a damn good point, given that the interrupt can be injected from another thread while one is in that guest vcpu. Easiest time to update the TMR should be on guest entry through vcpu_scan_ioapic, as before. The best way to go is probably to ditch the new per vcpu EOI exit bitmap, and just update/use the TMR. There's no reason to duplicate that data in the representation of the apic (I suspect that the duplication was inspired by my mistaken notion of the TMR). The IOAPIC exit check can use the TMR instead. Based upon my reading of the SDM, the only reason that the eoi exit bitmaps are not the exact same as the TMR is that it is possible to have virtual-interrupt delivery enabled /without/ an apic access page (Note: V-ID = EOI exit bitmap enabled). Yang, do you happen to know if that is the case? [Note: Just looked back at the old code for updating the EOI exit bitmaps, which for some reason was configured to trigger EOI exits for all IOAPIC interrupts, not just level-triggered IOAPIC interrupts. Which is weird, and I believe totally unecessary.] The RTC interrupt needs to trigger an EOI exit with the in-kernel IOAPIC, in order to detect coalesced interrupts. This is necessary to avoid severe time drift in Windows guest. Correct! EOI exit bitmap is not absolutely same with TMR. Some interrupts (especially timer interrupt) which is edge trigger but need a notification when it got injected into guest. Paolo Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/2] KVM: x86: set TMR when the interrupt is accepted
Paolo Bonzini wrote on 2015-07-29: Do not compute TMR in advance. Instead, set the TMR just before the interrupt is accepted into the IRR. This limits the coupling between IOAPIC and LAPIC. Uh.., it back to original way which is wrong. You cannot modify the apic page(here is the TMR reg) directly when the corresponding VMCS may be used at same time. Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- arch/x86/kvm/ioapic.c | 9 ++--- arch/x86/kvm/ioapic.h | 3 +-- arch/x86/kvm/lapic.c | 19 ++- arch/x86/kvm/lapic.h | 1 - arch/x86/kvm/x86.c| 5 + 5 files changed, 14 insertions(+), 23 deletions(-) diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c index 856f79105bb5..eaf4ec38d980 100644 --- a/arch/x86/kvm/ioapic.c +++ b/arch/x86/kvm/ioapic.c @@ -246,8 +246,7 @@ static void update_handled_vectors(struct kvm_ioapic *ioapic) smp_wmb(); } -void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap, - u32 *tmr) +void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap) { struct kvm_ioapic *ioapic = vcpu-kvm-arch.vioapic; union kvm_ioapic_redirect_entry *e; @@ -260,13 +259,9 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap, kvm_irq_has_notifier(ioapic-kvm, KVM_IRQCHIP_IOAPIC, index) || index == RTC_GSI) { if (kvm_apic_match_dest(vcpu, NULL, 0, - e-fields.dest_id, e-fields.dest_mode)) { + e-fields.dest_id, e-fields.dest_mode)) __set_bit(e-fields.vector, (unsigned long *)eoi_exit_bitmap); - if (e-fields.trig_mode == IOAPIC_LEVEL_TRIG) - __set_bit(e-fields.vector, - (unsigned long *)tmr); - } } } spin_unlock(ioapic-lock); diff --git a/arch/x86/kvm/ioapic.h b/arch/x86/kvm/ioapic.h index ca0b0b4e6256..3dbd0e2aac4e 100644 --- a/arch/x86/kvm/ioapic.h +++ b/arch/x86/kvm/ioapic.h @@ -120,7 +120,6 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src, struct kvm_lapic_irq *irq, unsigned long *dest_map); int kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state); int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state); -void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap, - u32 *tmr); +void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap); #endif diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 2a5ca97c263b..9be64c77d6db 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -551,15 +551,6 @@ static void pv_eoi_clr_pending(struct kvm_vcpu *vcpu) __clear_bit(KVM_APIC_PV_EOI_PENDING, vcpu-arch.apic_attention); } -void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr) -{ - struct kvm_lapic *apic = vcpu-arch.apic; - int i; - - for (i = 0; i 8; i++) - apic_set_reg(apic, APIC_TMR + 0x10 * i, tmr[i]); -} - static void apic_update_ppr(struct kvm_lapic *apic) { u32 tpr, isrv, ppr, old_ppr; @@ -781,6 +772,9 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode, case APIC_DM_LOWEST: vcpu-arch.apic_arb_prio++; case APIC_DM_FIXED: + if (unlikely(trig_mode !level)) + break; + /* FIXME add logic for vcpu on reset */ if (unlikely(!apic_enabled(apic))) break; @@ -790,6 +784,13 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode, if (dest_map) __set_bit(vcpu-vcpu_id, dest_map); + if (apic_test_vector(vector, apic-regs + APIC_TMR) != !!trig_mode) { + if (trig_mode) + apic_set_vector(vector, apic-regs + APIC_TMR); + else + apic_clear_vector(vector, apic-regs + APIC_TMR); + } + if (kvm_x86_ops-deliver_posted_interrupt) kvm_x86_ops-deliver_posted_interrupt(vcpu, vector); else { diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 764037991d26..eb46d6bcaa75 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -57,7 +57,6 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value); u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu); void kvm_apic_set_version(struct kvm_vcpu *vcpu); -void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr); void __kvm_apic_update_irr(u32 *pir, void *regs); void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir); int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct
RE: [v6] kvm/fpu: Enable fully eager restore kvm FPU
Paolo Bonzini wrote on 2015-04-24: On 24/04/2015 09:46, Zhang, Yang Z wrote: On the other hand vmexit is lighter and lighter on newer processors; a Sandy Bridge has less than half the vmexit cost of a Core 2 (IIRC 1000 vs. 2500 clock cycles approximately). 1000 cycles? I remember it takes about 4000 cycle even in HSW server. I was going from memory, but I now measured it with the vmexit test of kvm-unit-tests. With both SNB Xeon E5 and IVB Core i7, returns about 1400 clock cycles for a vmcall exit. This includes the overhead of doing the cpuid itself. Thus the vmexit cost is around 1300 cycles. Of this the vmresume instruction is probably around 800 cycles, and the rest is introduced by KVM. There are at least 4-5 memory barriers and locked instructions. Yes, that's make sense. The average vmexit/vmentry handle cost is around 4000 cycles. But I guess xsaveopt doesn't take so many cycles. Does anyone have the xsaveopt cost data? Paolo Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [v6] kvm/fpu: Enable fully eager restore kvm FPU
Paolo Bonzini wrote on 2015-04-24: On 24/04/2015 03:16, Zhang, Yang Z wrote: This is interesting since previous measurements on KVM have had the exact opposite results. I think we need to understand this a lot more. What I can tell is that vmexit is heavy. So it is reasonable to see the improvement under some cases, especially kernel is using eager FPU now which means each schedule may trigger a vmexit. On the other hand vmexit is lighter and lighter on newer processors; a Sandy Bridge has less than half the vmexit cost of a Core 2 (IIRC 1000 vs. 2500 clock cycles approximately). 1000 cycles? I remember it takes about 4000 cycle even in HSW server. Also, measurement were done on Westmere but Sandy Bridge is the first processor to have XSAVEOPT and thus use eager FPU. Paolo Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [v6] kvm/fpu: Enable fully eager restore kvm FPU
H. Peter Anvin wrote on 2015-04-24: On 04/23/2015 08:28 AM, Dave Hansen wrote: On 04/23/2015 02:13 PM, Liang Li wrote: When compiling kernel on westmere, the performance of eager FPU is about 0.4% faster than lazy FPU. Do you have an theory why this is? What does the regression come from? This is interesting since previous measurements on KVM have had the exact opposite results. I think we need to understand this a lot more. What I can tell is that vmexit is heavy. So it is reasonable to see the improvement under some cases, especially kernel is using eager FPU now which means each schedule may trigger a vmexit. -hpa Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v4 6/6] KVM: nVMX: Enable nested posted interrupt processing
Paolo Bonzini wrote on 2015-02-03: On 02/02/2015 16:33, Wincy Van wrote: static void vmx_accomp_nested_posted_intr(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); if (is_guest_mode(vcpu) vmx-nested.posted_intr_nv != -1 pi_test_on(vmx-nested.pi_desc)) kvm_apic_set_irr(vcpu, vmx-nested.posted_intr_nv); } Then we will get an nested-vmexit in vmx_check_nested_events, that posted intr will be handled by L1 immediately. This mechanism will also emulate the hardware's behavior: If a posted intr was not accomplished by hardware, we will get an Actually, we cannot say not accomplished by hardware. It more like we don't do the job well. See my below answer. interrupt with POSTED_INTR_NV. Yes. This is not enough. From L1's point, L2 is in vmx non-root mode. So we should emulate the posted interrupt in L0 correctly, say: 1. clear ON bit 2. ack interrupt 3, syn pir to virr 4. update RVI. Then let the hardware(virtual interrupt delivery) to accomplish interrupt injection. Force a vmexit more like a trick. It's better to follow the hardware's behavior unless we cannot do it. Would this be better? I think you do not even need a new bit. You can use KVM_REQ_EVENT and (to complete my suggestion, which was not enough) do the above in vmx_check_nested_events. Paolo Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 4:00 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: @@ -5812,13 +5813,18 @@ static __init int hardware_setup(void) (unsigned long *)__get_free_page(GFP_KERNEL); if (!vmx_msr_bitmap_longmode_x2apic) goto out4; + + vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); + if (!vmx_msr_bitmap_nested) + goto out5; + Since the nested virtualization is off by default. It's better to allocate the page only when nested is true. Maybe adding the following check is better: if (nested) { vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); if (!vmx_msr_bitmap_nested) goto out5; } Agreed. Will do. ...snip... + +/* + * Merge L0's and L1's MSR bitmap, return false to indicate that + * we do not use the hardware. + */ +static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, + struct vmcs12 *vmcs12) { + return false; +} + The following patches have nothing to do with the MSR control. Why leave the function empty here? No. In patch Enable nested virtualize x2apic mode, we will return false if L1 disabled virt_x2apic_mode, then the hardware MSR-bitmap control is disabled. This is faster than rebuilding the vmx_msr_bitmap_nested. This function returns false here to indicate that we do not use the hardware. Since It is not only virtualize x2apic mode using this, other features may use this either. I think it isn't suitable to introduce this function in other patches. Yes, rebuilding is more costly. But your current implementation cannot leverage the APICv feature correctly. I replied in another thread. Best regards, Yang Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Zhang, Yang Z wrote on 2015-01-28: Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 7:52 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: If L0 wants to intercept a msr, we should set vmx_msr_bitmap_legacy(_x2apic) and vmx_msr_bitmap_longmode(_x2apic), and that bitmaps should only be loaded in non-nested entry. Currently we only clear the corresponding bits if L1 enables virtualize x2apic mode, all the other bits are all set. On nested entry, we load nested_msr_bitmap, on nested vmexit, we will restore L0's bitmap. In the nested entry, L0 only care about L2's msr access, not L1's. I think that in nested entry, nested_msr_bitmap is vmcs01-msr_bitmap as your mentioned above. Mmm... I think if the bit is setting in vmcs01-msr_bitmap means whenever the VCPU(no matter in nested guest mode or not) accesses this msr should cause vmexit, not only L1. That's why need to construct the vmcs02-msr_bitmap based on (vmcs01-msr_bitmap ORed vmcs12-msr_bitmap). You are right, but this is not fit for all the cases, we should custom the nested_msr_bitmap. e.g. Currently L0 wants to intercept some of the x2apic msrs' reading: if (enable_apicv) { for (msr = 0x800; msr = 0x8ff; msr++) vmx_disable_intercept_msr_read_x2apic(msr); /* According SDM, in x2apic mode, the whole id reg is used. * But in KVM, it only use the highest eight bits. Need to * intercept it */ vmx_enable_intercept_msr_read_x2apic(0x802); /* TMCCT */ vmx_enable_intercept_msr_read_x2apic(0x839); /* TPR */ vmx_disable_intercept_msr_write_x2apic(0x808); /* EOI */ vmx_disable_intercept_msr_write_x2apic(0x80b); /* SELF-IPI */ vmx_disable_intercept_msr_write_x2apic(0x83f); } But L1 may not want this. So I think we are better to deal with the Actually, from L0's point, it is totally unaware of the L2. The only thing L0 aware is that the CPU should follow L0's configuration when VCPU is running. So if L0 wants to trap a msr, then the read operation to this msr should cause vmexit unconditionally no matter who is running(who means L1, L2, L3.). msrs seperately, there is not a common way suit for all the cases. If other features want to intercept a msr in nested entry, they can put the custom code in nested_vmx_merge_msr_bitmap. Yes, if other features want to do it in 'nested' entry, they can fill nested_vmx_merge_msr_bitmap. But if in non-nested case, it should be our responsibly to handle it correctly, how about add following check: if (type MSR_TYPE_R !test_bit(msr, vmcs01_msr_bitmap) !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) __clear_bit(msr, msr_bitmap_nested + 0x000 / f); Anyway, this is not necessary for your current patch. We can consider it later if there really have other features will use it. Thanks, Wincy Best regards, Yang Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-24: Currently, if L1 enables MSR_BITMAP, we will emulate this feature, all of L2's msr access is intercepted by L0. Since many features like virtualize x2apic mode has a complicated logic and it is difficult for us to emulate, we should use hardware and merge the bitmap. This patch introduces nested_vmx_merge_msr_bitmap for future use. Signed-off-by: Wincy Van fanwenyi0...@gmail.com --- arch/x86/kvm/vmx.c | 71 +++ 1 files changed, 60 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index c987374..36d0724 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -798,6 +798,7 @@ static unsigned long *vmx_msr_bitmap_legacy; static unsigned long *vmx_msr_bitmap_longmode; static unsigned long *vmx_msr_bitmap_legacy_x2apic; static unsigned long *vmx_msr_bitmap_longmode_x2apic; +static unsigned long *vmx_msr_bitmap_nested; static unsigned long *vmx_vmread_bitmap; static unsigned long *vmx_vmwrite_bitmap; @@ -5812,13 +5813,18 @@ static __init int hardware_setup(void) (unsigned long *)__get_free_page(GFP_KERNEL); if (!vmx_msr_bitmap_longmode_x2apic) goto out4; + + vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); + if (!vmx_msr_bitmap_nested) + goto out5; + Since the nested virtualization is off by default. It's better to allocate the page only when nested is true. Maybe adding the following check is better: if (nested) { vmx_msr_bitmap_nested = (unsigned long *)__get_free_page(GFP_KERNEL); if (!vmx_msr_bitmap_nested) goto out5; } ...snip... + +/* + * Merge L0's and L1's MSR bitmap, return false to indicate that + * we do not use the hardware. + */ +static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, + struct vmcs12 *vmcs12) { + return false; +} + The following patches have nothing to do with the MSR control. Why leave the function empty here? Best regards, Yang N�r��yb�X��ǧv�^�){.n�+h����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Zhang, Yang Z wrote on 2015-01-28: Wincy Van wrote on 2015-01-24: When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance, especially in apicv case. This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Sorry, replied on the wrong thread.:( Signed-off-by: Wincy Van fanwenyi0...@gmail.com --- ...snip... static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only) { if (!longmode_only) @@ -8344,7 +8394,68 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu, static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { - return false; + struct page *page; + unsigned long *msr_bitmap; + + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) + return false; + + page = nested_get_page(vcpu, vmcs12-msr_bitmap); + if (!page) { + WARN_ON(1); + return false; + } + msr_bitmap = (unsigned long *)kmap(page); + if (!msr_bitmap) { + nested_release_page_clean(page); + WARN_ON(1); + return false; + } + + memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE); + + if (nested_cpu_has_virt_x2apic_mode(vmcs12)) + /* TPR is allowed */ + nested_vmx_disable_intercept_for_msr(msr_bitmap, + vmx_msr_bitmap_nested, + APIC_BASE_MSR + (APIC_TASKPRI 4), + MSR_TYPE_R | MSR_TYPE_W); I didn't understand what this function does? Per my understanding, you only need to set the (vmx_msr_bitmap_nested = vmcs01-msr_bitmap | vmcs12-msr_bitmap) and inject the nested vmexit to L1 if the bit in vmcs12-msr_bitmap is setting. Am I missing some patches? Best regards, Yang Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 4:05 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: @@ -8344,7 +8394,68 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu, static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { - return false; + struct page *page; + unsigned long *msr_bitmap; + + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) + return false; + + page = nested_get_page(vcpu, vmcs12-msr_bitmap); + if (!page) { + WARN_ON(1); + return false; + } + msr_bitmap = (unsigned long *)kmap(page); + if (!msr_bitmap) { + nested_release_page_clean(page); + WARN_ON(1); + return false; + } + + memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE); + + if (nested_cpu_has_virt_x2apic_mode(vmcs12)) + /* TPR is allowed */ + nested_vmx_disable_intercept_for_msr(msr_bitmap, + vmx_msr_bitmap_nested, + APIC_BASE_MSR + (APIC_TASKPRI 4), + MSR_TYPE_R | MSR_TYPE_W); I didn't understand what this function does? Per my understanding, you only need to set the (vmx_msr_bitmap_nested = vmcs01-msr_bitmap | vmcs12-msr_bitmap) and inject the nested vmexit to L1 if the bit in vmcs12-msr_bitmap is setting. Am I missing some patches? In the beginning, I want to do vmcs01-msr_bitmap | vmcs12-msr_bitmap, but I remember that there isn't a instruction to do a bit or operation in two pages effectively, so I do the bit or operation in nested_vmx_disable_intercept_for_msr. If the hardware do not support this, I think it is faster if we deal with the bits on demand. nested_vmx_merge_msr_bitmap is used to merge L0's and L1's bitmaps, any features can put their logic here. You construct the nested_msr_bitmap based on vmcs12-msr_bitmap, what happens if vmcs01-msr_bitmap want to trap this msr? If there is a faster way, please teach me how to do it : ) You are right. Interception should be much faster. Thanks, Wincy Best regards, Yang Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 7:52 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: If L0 wants to intercept a msr, we should set vmx_msr_bitmap_legacy(_x2apic) and vmx_msr_bitmap_longmode(_x2apic), and that bitmaps should only be loaded in non-nested entry. Currently we only clear the corresponding bits if L1 enables virtualize x2apic mode, all the other bits are all set. On nested entry, we load nested_msr_bitmap, on nested vmexit, we will restore L0's bitmap. In the nested entry, L0 only care about L2's msr access, not L1's. I think that in nested entry, nested_msr_bitmap is vmcs01-msr_bitmap as your mentioned above. Mmm... I think if the bit is setting in vmcs01-msr_bitmap means whenever the VCPU(no matter in nested guest mode or not) accesses this msr should cause vmexit, not only L1. That's why need to construct the vmcs02-msr_bitmap based on (vmcs01-msr_bitmap ORed vmcs12-msr_bitmap). You are right, but this is not fit for all the cases, we should custom the nested_msr_bitmap. e.g. Currently L0 wants to intercept some of the x2apic msrs' reading: if (enable_apicv) { for (msr = 0x800; msr = 0x8ff; msr++) vmx_disable_intercept_msr_read_x2apic(msr); /* According SDM, in x2apic mode, the whole id reg is used. * But in KVM, it only use the highest eight bits. Need to * intercept it */ vmx_enable_intercept_msr_read_x2apic(0x802); /* TMCCT */ vmx_enable_intercept_msr_read_x2apic(0x839); /* TPR */ vmx_disable_intercept_msr_write_x2apic(0x808); /* EOI */ vmx_disable_intercept_msr_write_x2apic(0x80b); /* SELF-IPI */ vmx_disable_intercept_msr_write_x2apic(0x83f); } But L1 may not want this. So I think we are better to deal with the Actually, from L0's point, it is totally unaware of the L2. The only thing L0 aware is that the CPU should follow L0's configuration when VCPU is running. So if L0 wants to trap a msr, then the read operation to this msr should cause vmexit unconditionally no matter who is running(who means L1, L2, L3.). msrs seperately, there is not a common way suit for all the cases. If other features want to intercept a msr in nested entry, they can put the custom code in nested_vmx_merge_msr_bitmap. Yes, if other features want to do it in 'nested' entry, they can fill nested_vmx_merge_msr_bitmap. But if in non-nested case, it should be our responsibly to handle it correctly, how about add following check: if (type MSR_TYPE_R !test_bit(msr, vmcs01_msr_bitmap) !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) __clear_bit(msr, msr_bitmap_nested + 0x000 / f); Thanks, Wincy Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 8:33 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: You are right, but this is not fit for all the cases, we should custom the nested_msr_bitmap. e.g. Currently L0 wants to intercept some of the x2apic msrs' reading: if (enable_apicv) { for (msr = 0x800; msr = 0x8ff; msr++) vmx_disable_intercept_msr_read_x2apic(msr); /* According SDM, in x2apic mode, the whole id reg is used. * But in KVM, it only use the highest eight bits. Need to * intercept it */ vmx_enable_intercept_msr_read_x2apic(0x802); /* TMCCT */ vmx_enable_intercept_msr_read_x2apic(0x839); /* TPR */ vmx_disable_intercept_msr_write_x2apic(0x808); /* EOI */ vmx_disable_intercept_msr_write_x2apic(0x80b); /* SELF-IPI */ vmx_disable_intercept_msr_write_x2apic(0x83f); } But L1 may not want this. So I think we are better to deal with the Actually, from L0's point, it is totally unaware of the L2. The only thing L0 aware is that the CPU should follow L0's configuration when VCPU is running. So if L0 wants to trap a msr, then the read operation to this msr should cause vmexit unconditionally no matter who is running(who means L1, L2, L3.). msrs seperately, there is not a common way suit for all the cases. If other features want to intercept a msr in nested entry, they can put the custom code in nested_vmx_merge_msr_bitmap. Yes, if other features want to do it in 'nested' entry, they can fill nested_vmx_merge_msr_bitmap. But if in non-nested case, it should be our responsibly to handle it correctly, how about add following check: if (type MSR_TYPE_R !test_bit(msr, vmcs01_msr_bitmap) !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) __clear_bit(msr, msr_bitmap_nested + 0x000 / f); Anyway, this is not necessary for your current patch. We can consider it later if there really have other features will use it. Yep, I know what you mean now, for other msrs which are not forwarded access by a mechanism like virtual-apic page, we should intercept it unconditionally. I think we should ensure the msr can be allowed before call nested_vmx_disable_intercept_for_msr, if L0 want to intercept it, just do not call nested_vmx_disable_intercept_for_msr. Yes, this is a solution. But I prefer to handle it in nested code path since others may not familiar with nested and may block it by mistake. !test_bit(msr, vmcs01_msr_bitmap) will introduce a problem that some of the msrs will be affcted by vmcs01_msr_bitmap, TMCCT and TPR, for example. Intercept reading for these msrs is okay, but it is not efficient. TMCCT is always trapped by most VMM. I don't think TPR is trapped in KVM. Thanks, Wincy Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-24: When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance, especially in apicv case. This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Signed-off-by: Wincy Van fanwenyi0...@gmail.com --- ...snip... static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only) { if (!longmode_only) @@ -8344,7 +8394,68 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu, static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { - return false; + struct page *page; + unsigned long *msr_bitmap; + + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) + return false; + + page = nested_get_page(vcpu, vmcs12-msr_bitmap); + if (!page) { + WARN_ON(1); + return false; + } + msr_bitmap = (unsigned long *)kmap(page); + if (!msr_bitmap) { + nested_release_page_clean(page); + WARN_ON(1); + return false; + } + + memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE); + + if (nested_cpu_has_virt_x2apic_mode(vmcs12)) + /* TPR is allowed */ + nested_vmx_disable_intercept_for_msr(msr_bitmap, + vmx_msr_bitmap_nested, + APIC_BASE_MSR + (APIC_TASKPRI 4), + MSR_TYPE_R | MSR_TYPE_W); I didn't understand what this function does? Per my understanding, you only need to set the (vmx_msr_bitmap_nested = vmcs01-msr_bitmap | vmcs12-msr_bitmap) and inject the nested vmexit to L1 if the bit in vmcs12-msr_bitmap is setting. Am I missing some patches? Best regards, Yang
RE: [PATCH v3 1/6] KVM: nVMX: Use hardware MSR bitmap
Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 7:25 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: Wincy Van wrote on 2015-01-28: On Wed, Jan 28, 2015 at 4:05 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: @@ -8344,7 +8394,68 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu, static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { - return false; + struct page *page; + unsigned long *msr_bitmap; + + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) + return false; + + page = nested_get_page(vcpu, vmcs12-msr_bitmap); + if (!page) { + WARN_ON(1); + return false; + } + msr_bitmap = (unsigned long *)kmap(page); + if (!msr_bitmap) { + nested_release_page_clean(page); + WARN_ON(1); + return false; + } + + memset(vmx_msr_bitmap_nested, 0xff, PAGE_SIZE); + + if (nested_cpu_has_virt_x2apic_mode(vmcs12)) + /* TPR is allowed */ + nested_vmx_disable_intercept_for_msr(msr_bitmap, + vmx_msr_bitmap_nested, + APIC_BASE_MSR + (APIC_TASKPRI 4), + MSR_TYPE_R | MSR_TYPE_W); I didn't understand what this function does? Per my understanding, you only need to set the (vmx_msr_bitmap_nested = vmcs01-msr_bitmap | vmcs12-msr_bitmap) and inject the nested vmexit to L1 if the bit vmcs12-in msr_bitmap is setting. Am I missing some patches? In the beginning, I want to do vmcs01-msr_bitmap | vmcs12-msr_bitmap, but I remember that there isn't a instruction vmcs12-to do a bit or operation in two pages effectively, so I do the bit or operation in nested_vmx_disable_intercept_for_msr. If the hardware do not support this, I think it is faster if we deal with the bits on demand. nested_vmx_merge_msr_bitmap is used to merge L0's and L1's bitmaps, any features can put their logic here. You construct the nested_msr_bitmap based on vmcs12-msr_bitmap, what happens if vmcs01-msr_bitmap want to trap this msr? If L0 wants to intercept a msr, we should set vmx_msr_bitmap_legacy(_x2apic) and vmx_msr_bitmap_longmode(_x2apic), and that bitmaps should only be loaded in non-nested entry. Currently we only clear the corresponding bits if L1 enables virtualize x2apic mode, all the other bits are all set. On nested entry, we load nested_msr_bitmap, on nested vmexit, we will restore L0's bitmap. In the nested entry, L0 only care about L2's msr access, not L1's. I think that in nested entry, nested_msr_bitmap is vmcs01-msr_bitmap as your mentioned above. Mmm... I think if the bit is setting in vmcs01-msr_bitmap means whenever the VCPU(no matter in nested guest mode or not) accesses this msr should cause vmexit, not only L1. That's why need to construct the vmcs02-msr_bitmap based on (vmcs01-msr_bitmap ORed vmcs12-msr_bitmap). Thanks, Wincy Best regards, Yang N�r��yb�X��ǧv�^�){.n�+h����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf
RE: [PATCH v4 2/6] KVM: nVMX: Enable nested virtualize x2apic mode
Wincy Van wrote on 2015-01-28: When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance, especially in apicv case. This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Signed-off-by: Wincy Van fanwenyi0...@gmail.com --- arch/x86/kvm/vmx.c | 114 +++- 1 files changed, 112 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 787f886..9d11a93 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1108,6 +1108,11 @@ static inline bool nested_cpu_has_xsaves(struct vmcs12 *vmcs12) vmx_xsaves_supported(); } +static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 +*vmcs12) { + return nested_cpu_has2(vmcs12, +SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); +} + static inline bool is_exception(u32 intr_info) { return (intr_info (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) @@ -2395,6 +2400,7 @@ static __init void nested_vmx_setup_ctls_msrs(void) nested_vmx_secondary_ctls_low = 0; nested_vmx_secondary_ctls_high = SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | + SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | SECONDARY_EXEC_WBINVD_EXITING | SECONDARY_EXEC_XSAVES; @@ -4155,6 +4161,52 @@ static void __vmx_enable_intercept_for_msr(unsigned long *msr_bitmap, } } +/* + * If a msr is allowed by L0, we should check whether it is allowed by L1. + * The corresponding bit will be cleared unless both of L0 and L1 allow it. + */ +static void nested_vmx_disable_intercept_for_msr(unsigned long *msr_bitmap_l1, + unsigned long *msr_bitmap_nested, + u32 msr, int type) { + int f = sizeof(unsigned long); + + if (!cpu_has_vmx_msr_bitmap()) { + WARN_ON(1); + return; + } + + /* +* See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals +* have the write-low and read-high bitmap offsets the wrong way round. +* We can control MSRs 0x-0x1fff and 0xc000-0xc0001fff. +*/ + if (msr = 0x1fff) { + if (type MSR_TYPE_R + !test_bit(msr, msr_bitmap_l1 + 0x000 / f)) + /* read-low */ + __clear_bit(msr, msr_bitmap_nested + 0x000 / f); + + if (type MSR_TYPE_W + !test_bit(msr, msr_bitmap_l1 + 0x800 / f)) + /* write-low */ + __clear_bit(msr, msr_bitmap_nested + 0x800 / f); + + } else if ((msr = 0xc000) (msr = 0xc0001fff)) { + msr = 0x1fff; + if (type MSR_TYPE_R + !test_bit(msr, msr_bitmap_l1 + 0x400 / f)) + /* read-high */ + __clear_bit(msr, msr_bitmap_nested + 0x400 / f); + + if (type MSR_TYPE_W + !test_bit(msr, msr_bitmap_l1 + 0xc00 / f)) + /* write-high */ + __clear_bit(msr, msr_bitmap_nested + 0xc00 / f); + + } +} + static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only) { if (!longmode_only) @@ -8350,7 +8402,59 @@ static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu, static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { - return false; + struct page *page; + unsigned long *msr_bitmap; + + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) + return false; + + page = nested_get_page(vcpu, vmcs12-msr_bitmap); + if (!page) { + WARN_ON(1); + return false; + } + msr_bitmap = (unsigned long *)kmap(page); + if (!msr_bitmap) { + nested_release_page_clean(page); + WARN_ON(1); + return false; + } + + if (nested_cpu_has_virt_x2apic_mode(vmcs12)) { + /* TPR is allowed */ + nested_vmx_disable_intercept_for_msr(msr_bitmap, + vmx_msr_bitmap_nested, + APIC_BASE_MSR + (APIC_TASKPRI 4), + MSR_TYPE_R | MSR_TYPE_W); + } else + __vmx_enable_intercept_for_msr( + vmx_msr_bitmap_nested, + APIC_BASE_MSR + (APIC_TASKPRI 4), + MSR_TYPE_R | MSR_TYPE_W); + kunmap(page); + nested_release_page_clean(page); + + return true; +} + +static int nested_vmx_check_apicv_controls(struct kvm_vcpu *vcpu, +
RE: [PATCH v4 2/6] KVM: nVMX: Enable nested virtualize x2apic mode
Wincy Van wrote on 2015-01-29: On Thu, Jan 29, 2015 at 10:54 AM, Zhang, Yang Z yang.z.zh...@intel.com wrote: -8646,7 +8750,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) else vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(vmx-nested.apic_access_page)); - } else if (vm_need_virtualize_apic_accesses(vmx-vcpu.kvm)) { + } else if + (!(nested_cpu_has_virt_x2apic_mode(vmcs12)) + + + (vm_need_virtualize_apic_accesses(vmx-vcpu.kvm))) { exec_control |= You don't load L2's apic_page in your patch correctly when x2apic mode is used. Here is the right change for prepare_vmcs02()(maybe other place also need change too): @@ -8585,7 +8585,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)) exec_control |= vmcs12-secondary_vm_exec_control; - if (exec_control SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) { + if (exec_control (SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | + + SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) { /* * If translation failed, no matter: This feature asks * to exit when accessing the given address, and if it @@ -8594,7 +8595,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) */ if (!vmx-nested.apic_access_page) exec_control = - ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; + ~ (SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | + + SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); else vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(vmx-nested.apic_access_page)); I think we don't need to do this, if L1 enables x2apic mode, we have already checked that the vmcs12-SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES is 0. exec_control |= vmcs12-secondary_vm_exec_control; merged L1's settings, including x2apic mode. the special case is vm_need_virtualize_apic_accesses, if vm_need_virtualize_apic_accesses returns true, the nested_cpu_has_virt_x2apic_mode will prevent us to set the apic access bit. I just realized that we are talking different thing. I am thinking about VIRTUAL_APIC_PAGE_ADDR(my mistake :)). And it has been handled correctly already. For APIC_ACCESS_ADDR, you are right. The current implementation can handle it well. Thanks, Wincy Best regards, Yang
RE: [PATCH v4 0/6] KVM: nVMX: Enable nested apicv support
Wincy Van wrote on 2015-01-28: v1 --- v2: Use spin lock to ensure vmcs12 is safe when doing nested posted interrupt delivery. v2 --- v3: 1. Add a new field in nested_vmx to avoid the spin lock in v2. 2. Drop send eoi to L1 when doing nested interrupt delivery. 3. Use hardware MSR bitmap to enable nested virtualize x2apic mode. v3 --- v4: 1. Optimize nested msr bitmap merging. 2. Allocate nested msr bitmap only when nested == 1. 3. Inline the nested vmx control checking functions. This version looks good to me. Only minor comment: EXIT_REASON_APIC_WRITE vmexit is introduced by apic register virtualization not virtual interrupt delivery, so it's better add it in 4th patch not 5th patch.(If no other comments, I guess Paolo can help do it when applying it). Reviewed-by: Yang Zhang yang.z.zh...@intel.com Wincy Van (6): KVM: nVMX: Use hardware MSR bitmap KVM: nVMX: Enable nested virtualize x2apic mode KVM: nVMX: Make nested control MSRs per-cpu KVM: nVMX: Enable nested apic register virtualization KVM: nVMX: Enable nested virtual interrupt delivery KVM: nVMX: Enable nested posted interrupt processing arch/x86/kvm/vmx.c | 580 +++- 1 files changed, 480 insertions(+), 100 deletions(-) Best regards, Yang
RE: [PATCH 1/5] KVM: nVMX: Make nested control MSRs per-cpu.
Wincy Van wrote on 2015-01-16: To enable nested apicv support, we need per-cpu vmx control MSRs: 1. If in-kernel irqchip is enabled, we can enable nested posted interrupt, we should set posted intr bit in the nested_vmx_pinbased_ctls_high. 2. If in-kernel irqchip is disabled, we can not enable nested posted interrupt, the posted intr bit in the nested_vmx_pinbased_ctls_high will be cleared. Since there would be different settings about in-kernel irqchip between VMs, different nested control MSRs are needed. I'd suggest you to check irqchip_in_kernel() instead moving the whole ctrl msr to per vcpu. Best regards, Yang
RE: [PATCH v2 5/5] KVM: nVMX: Enable nested posted interrupt processing.
Wincy Van wrote on 2015-01-21: On Wed, Jan 21, 2015 at 4:07 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: + if (vector == vmcs12-posted_intr_nv + nested_cpu_has_posted_intr(vmcs12)) { + if (vcpu-mode == IN_GUEST_MODE) + apic-send_IPI_mask(get_cpu_mask(vcpu-cpu), + POSTED_INTR_VECTOR); + else { + r = -1; + goto out; + } + + /* +* if posted intr is done by hardware, the +* corresponding eoi was sent to L0. Thus +* we should send eoi to L1 manually. + */ + kvm_apic_set_eoi_accelerated(vcpu, + vmcs12-posted_intr_nv); Why this is necessary? As your comments mentioned, it is done by hardware not L1, why L1 should aware of it? According to SDM 29.6, if the processor recognizes a posted interrupt, it will send an EOI to LAPIC. If the posted intr is done by hardware, the processor will send eoi to hardware LAPIC, not L1's, just like the none-nested case(the physical interrupt is dismissed). So we should take care of the L1's LAPIC and send an eoi to it. No. You are not emulating the PI feature. You just reuse the hardware's capability. So you don't need to let L1 know it. Thanks, Wincy Best regards, Yang N�r��yb�X��ǧv�^�){.n�+h����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf
RE: [PATCH v2 5/5] KVM: nVMX: Enable nested posted interrupt processing.
Wincy Van wrote on 2015-01-20: If vcpu has a interrupt in vmx non-root mode, we will kick that vcpu to inject interrupt timely. With posted interrupt processing, the kick intr is not needed, and interrupts are fully taken care of by hardware. In nested vmx, this feature avoids much more vmexits than non-nested vmx. This patch use L0's POSTED_INTR_NV to avoid unexpected interrupt if L1's vector is different with L0's. If vcpu is in hardware's non-root mode, we use a physical ipi to deliver posted interrupts, otherwise we will deliver that interrupt to L1 and kick that vcpu out of nested non-root mode. Signed-off-by: Wincy Van fanwenyi0...@gmail.com --- arch/x86/kvm/vmx.c | 136 ++-- 1 files changed, 132 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index ea56e9f..cda9133 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -215,6 +215,7 @@ struct __packed vmcs12 { u64 tsc_offset; u64 virtual_apic_page_addr; u64 apic_access_addr; + u64 posted_intr_desc_addr; u64 ept_pointer; u64 eoi_exit_bitmap0; u64 eoi_exit_bitmap1; @@ -334,6 +335,7 @@ struct __packed vmcs12 { u32 vmx_preemption_timer_value; u32 padding32[7]; /* room for future expansion */ u16 virtual_processor_id; + u16 posted_intr_nv; u16 guest_es_selector; u16 guest_cs_selector; u16 guest_ss_selector; @@ -387,6 +389,7 @@ struct nested_vmx { /* The host-usable pointer to the above */ struct page *current_vmcs12_page; struct vmcs12 *current_vmcs12; + spinlock_t vmcs12_lock; struct vmcs *current_shadow_vmcs; /* * Indicates if the shadow vmcs must be updated with the @@ -406,6 +409,8 @@ struct nested_vmx { */ struct page *apic_access_page; struct page *virtual_apic_page; + struct page *pi_desc_page; + struct pi_desc *pi_desc; u64 msr_ia32_feature_control; struct hrtimer preemption_timer; @@ -621,6 +626,7 @@ static int max_shadow_read_write_fields = static const unsigned short vmcs_field_to_offset_table[] = { FIELD(VIRTUAL_PROCESSOR_ID, virtual_processor_id), + FIELD(POSTED_INTR_NV, posted_intr_nv), FIELD(GUEST_ES_SELECTOR, guest_es_selector), FIELD(GUEST_CS_SELECTOR, guest_cs_selector), FIELD(GUEST_SS_SELECTOR, guest_ss_selector), @@ -646,6 +652,7 @@ static const unsigned short vmcs_field_to_offset_table[] = { FIELD64(TSC_OFFSET, tsc_offset), FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr), FIELD64(APIC_ACCESS_ADDR, apic_access_addr), + FIELD64(POSTED_INTR_DESC_ADDR, posted_intr_desc_addr), FIELD64(EPT_POINTER, ept_pointer), FIELD64(EOI_EXIT_BITMAP0, eoi_exit_bitmap0), FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1), @@ -798,6 +805,7 @@ static void kvm_cpu_vmxon(u64 addr); static void kvm_cpu_vmxoff(void); static bool vmx_mpx_supported(void); static bool vmx_xsaves_supported(void); +static int vmx_vm_has_apicv(struct kvm *kvm); static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr); static void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg); @@ -1159,6 +1167,11 @@ static inline bool nested_cpu_has_vid(struct vmcs12 *vmcs12) return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY); } +static inline bool nested_cpu_has_posted_intr(struct vmcs12 *vmcs12) { + return vmcs12-pin_based_vm_exec_control +PIN_BASED_POSTED_INTR; } + static inline bool is_exception(u32 intr_info) { return (intr_info (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) @@ -2362,6 +2375,9 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx) vmx-nested.nested_vmx_pinbased_ctls_high |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR | PIN_BASED_VMX_PREEMPTION_TIMER; + if (vmx_vm_has_apicv(vmx-vcpu.kvm)) + vmx-nested.nested_vmx_pinbased_ctls_high |= + PIN_BASED_POSTED_INTR; /* exit controls */ rdmsr(MSR_IA32_VMX_EXIT_CTLS, @@ -4267,6 +4283,46 @@ static int vmx_vm_has_apicv(struct kvm *kvm) return enable_apicv irqchip_in_kernel(kvm); } +static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu, + int vector) { + int r = 0; + struct vmcs12 *vmcs12; + + /* +* Since posted intr delivery is async, +* we must aquire a spin-lock to avoid +* the race of vmcs12. +*/ + spin_lock(to_vmx(vcpu)-nested.vmcs12_lock); + vmcs12 = get_vmcs12(vcpu); + if (!is_guest_mode(vcpu) || !vmcs12) { + r = -1; + goto out; + } + if (vector == vmcs12-posted_intr_nv +
RE: [PATCH 2/5] KVM: nVMX: Enable nested virtualize x2apic mode.
Wincy Van wrote on 2015-01-16: When L2 is using x2apic, we can use virtualize x2apic mode to gain higher performance. This patch also introduces nested_vmx_check_apicv_controls for the nested apicv patches. Signed-off-by: Wincy Van fanwenyi0...@gmail.com To enable x2apic, should you to consider the behavior changes to rdmsr and wrmsr. I didn't see your patch do it. Is it correct? BTW, this patch has nothing to do with APICv, it's better to not use x2apic here and change to apicv in following patch. --- arch/x86/kvm/vmx.c | 49 - 1 files changed, 48 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 954dd54..10183ee 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1134,6 +1134,11 @@ static inline bool nested_cpu_has_xsaves(struct vmcs12 *vmcs12) vmx_xsaves_supported(); } +static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 +*vmcs12) { + return nested_cpu_has2(vmcs12, +SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE); +} + static inline bool is_exception(u32 intr_info) { return (intr_info (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) @@ -2426,6 +2431,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx) vmx-nested.nested_vmx_secondary_ctls_low = 0; vmx-nested.nested_vmx_secondary_ctls_high = SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | + SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | SECONDARY_EXEC_WBINVD_EXITING | SECONDARY_EXEC_XSAVES; @@ -7333,6 +7339,9 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu) case EXIT_REASON_APIC_ACCESS: return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES); + case EXIT_REASON_APIC_WRITE: + /* apic_write should exit unconditionally. */ + return 1; APIC_WRITE vmexit is introduced by APIC register virtualization not virtualize x2apic. Move it to next patch. case EXIT_REASON_EPT_VIOLATION: /* * L0 always deals with the EPT violation. If nested EPT is @@ -8356,6 +8365,38 @@ static void vmx_start_preemption_timer(struct kvm_vcpu *vcpu) ns_to_ktime(preemption_timeout), HRTIMER_MODE_REL); } +static inline int nested_vmx_check_virt_x2apic(struct kvm_vcpu *vcpu, + struct vmcs12 *vmcs12) { + if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) + return -EINVAL; + return 0; +} + +static int nested_vmx_check_apicv_controls(struct kvm_vcpu *vcpu, + struct vmcs12 *vmcs12) { + int r; + + if (!nested_cpu_has_virt_x2apic_mode(vmcs12)) + return 0; + + r = nested_vmx_check_virt_x2apic(vcpu, vmcs12); + if (r) + goto fail; + + /* tpr shadow is needed by all apicv features. */ + if (!nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) { + r = -EINVAL; + goto fail; + } + + return 0; + +fail: + return r; +} + static int nested_vmx_check_msr_switch(struct kvm_vcpu *vcpu, unsigned long count_field, unsigned long addr_field, @@ -8649,7 +8690,8 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) else vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(vmx-nested.apic_access_page)); - } else if (vm_need_virtualize_apic_accesses(vmx-vcpu.kvm)) { + } else if (!(nested_cpu_has_virt_x2apic_mode(vmcs12)) + + (vm_need_virtualize_apic_accesses(vmx-vcpu.kvm))) { exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; kvm_vcpu_reload_apic_access_page(vcpu); @@ -8856,6 +8898,11 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch) return 1; } + if (nested_vmx_check_apicv_controls(vcpu, vmcs12)) { + nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); + return 1; + } + if (nested_vmx_check_msr_switch_controls(vcpu, vmcs12)) { nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); return 1; -- 1.7.1 Best regards, Yang
RE: [PATCH 0/5] KVM: nVMX: Enable nested apicv support.
Wincy Van wrote on 2015-01-20: Hi, Yang, Could you please have a look at this patch set? Your comment is very appreciated! Sure. I will take a look. Thanks, Wincy Best regards, Yang
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Paolo Bonzini wrote on 2014-12-23: On 23/12/2014 10:07, Wu, Feng wrote: On 23/12/2014 01:37, Zhang, Yang Z wrote: I don't quite understand it. If user set an interrupt's affinity to a CPU, but he still see the interrupt delivers to other CPUs in host. Do you think it is a right behavior? No, the interrupt is not delivered at all in the host. Normally you'd have: - interrupt delivered to CPU from host affinity - VFIO interrupt handler writes to irqfd - interrupt delivered to vCPU from guest affinity Here, you just skip the first two steps. The interrupt is delivered to the thread that is running the vCPU directly, so the host affinity is bypassed entirely. ... unless you are considering the case where the vCPU is blocked and the host is processing the posted interrupt wakeup vector. In that case yes, it would be better to set NDST to a CPU matching the host affinity. In my understanding, wakeup vector should have no relationship with the host affinity of the irq. Wakeup notification event should be delivered to the pCPU which the vCPU was blocked on. And in kernel's point of view, the irq is not associated with the wakeup vector, right? That is correct indeed. It is not associated to the wakeup vector, hence this patch is right, I think. However, the wakeup vector has the same function as the VFIO interrupt handler, so you could argue that it is tied to the host affinity rather than the guest. Let's wait for Yang to answer. Actually, that's my original question too. I am wondering what happens if the user changes the assigned device's affinity in host's /proc/irq/? If ignore it is acceptable, then this patch is ok. But it seems the discussion out of my scope, need some experts to tell us their idea since it will impact the user experience. Paolo Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Jiang Liu wrote on 2014-12-24: On 2014/12/24 9:38, Zhang, Yang Z wrote: Paolo Bonzini wrote on 2014-12-23: On 23/12/2014 10:07, Wu, Feng wrote: On 23/12/2014 01:37, Zhang, Yang Z wrote: I don't quite understand it. If user set an interrupt's affinity to a CPU, but he still see the interrupt delivers to other CPUs in host. Do you think it is a right behavior? No, the interrupt is not delivered at all in the host. Normally you'd have: - interrupt delivered to CPU from host affinity - VFIO interrupt handler writes to irqfd - interrupt delivered to vCPU from guest affinity Here, you just skip the first two steps. The interrupt is delivered to the thread that is running the vCPU directly, so the host affinity is bypassed entirely. ... unless you are considering the case where the vCPU is blocked and the host is processing the posted interrupt wakeup vector. In that case yes, it would be better to set NDST to a CPU matching the host affinity. In my understanding, wakeup vector should have no relationship with the host affinity of the irq. Wakeup notification event should be delivered to the pCPU which the vCPU was blocked on. And in kernel's point of view, the irq is not associated with the wakeup vector, right? That is correct indeed. It is not associated to the wakeup vector, hence this patch is right, I think. However, the wakeup vector has the same function as the VFIO interrupt handler, so you could argue that it is tied to the host affinity rather than the guest. Let's wait for Yang to answer. Actually, that's my original question too. I am wondering what happens if the user changes the assigned device's affinity in host's /proc/irq/? If ignore it is acceptable, then this patch is ok. But it seems the discussion out of my scope, need some experts to tell us their idea since it will impact the user experience. Hi Yang, Hi Jiang, Originally we have a proposal to return failure when user sets IRQ affinity through native OS interfaces if an IRQ is in PI mode. But that proposal will break CPU hot-removal because OS needs to migrate away all IRQs binding to the CPU to be offlined. Then we propose saving user IRQ affinity setting without changing hardware configuration (keeping PI configuration). Later when PI mode is disabled, the cached affinity setting will be used to setup IRQ destination for native OS. On the other hand, for IRQ in PI mode, it won't be delivered to native OS, so user may not sense that the IRQ is delivered to CPUs other than those in the affinity set. The IRQ is still there but will be delivered to host in the form of PI event(if the VCPU is running in root-mode). I am not sure whether those interrupts should be reflected in /proc/interrupts? If the answer is yes, then which entries should be used, a new PI entry or use the original IRQ entry? In that aspect, I think it's acceptable:) Regards! Yes, if all of you guys(especially the IRQ maintainer) are think it is acceptable then we can follow current implementation and document it. Gerry Paolo Best regards, Yang Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Wu, Feng wrote on 2014-12-24: Zhang, Yang Z wrote on 2014-12-24: Cc: io...@lists.linux-foundation.org; linux-ker...@vger.kernel.org; KVM list; Eric Auger Subject: RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts Jiang Liu wrote on 2014-12-24: On 2014/12/24 9:38, Zhang, Yang Z wrote: Paolo Bonzini wrote on 2014-12-23: On 23/12/2014 10:07, Wu, Feng wrote: On 23/12/2014 01:37, Zhang, Yang Z wrote: I don't quite understand it. If user set an interrupt's affinity to a CPU, but he still see the interrupt delivers to other CPUs in host. Do you think it is a right behavior? No, the interrupt is not delivered at all in the host. Normally you'd have: - interrupt delivered to CPU from host affinity - VFIO interrupt handler writes to irqfd - interrupt delivered to vCPU from guest affinity Here, you just skip the first two steps. The interrupt is delivered to the thread that is running the vCPU directly, so the host affinity is bypassed entirely. ... unless you are considering the case where the vCPU is blocked and the host is processing the posted interrupt wakeup vector. In that case yes, it would be better to set NDST to a CPU matching the host affinity. In my understanding, wakeup vector should have no relationship with the host affinity of the irq. Wakeup notification event should be delivered to the pCPU which the vCPU was blocked on. And in kernel's point of view, the irq is not associated with the wakeup vector, right? That is correct indeed. It is not associated to the wakeup vector, hence this patch is right, I think. However, the wakeup vector has the same function as the VFIO interrupt handler, so you could argue that it is tied to the host affinity rather than the guest. Let's wait for Yang to answer. Actually, that's my original question too. I am wondering what happens if the user changes the assigned device's affinity in host's /proc/irq/? If ignore it is acceptable, then this patch is ok. But it seems the discussion out of my scope, need some experts to tell us their idea since it will impact the user experience. Hi Yang, Hi Jiang, Originally we have a proposal to return failure when user sets IRQ affinity through native OS interfaces if an IRQ is in PI mode. But that proposal will break CPU hot-removal because OS needs to migrate away all IRQs binding to the CPU to be offlined. Then we propose saving user IRQ affinity setting without changing hardware configuration (keeping PI configuration). Later when PI mode is disabled, the cached affinity setting will be used to setup IRQ destination for native OS. On the other hand, for IRQ in PI mode, it won't be delivered to native OS, so user may not sense that the IRQ is delivered to CPUs other than those in the affinity set. The IRQ is still there but will be delivered to host in the form of PI event(if the VCPU is running in root-mode). I am not sure whether those interrupts should be reflected in /proc/interrupts? If the answer is yes, then which entries should be used, a new PI entry or use the original IRQ entry? Even though, setting the affinity of the IRQ in host should not affect the destination of the PI event (normal notification event of wakeup This is your implementation. To me, disable PI if the VCPU is going to run in the CPU out of IRQ affinity bitmap also is acceptable. And it will keep the user interface looks the same as before. Hi Thomas, Ingo, Peter Can you guys help to review this patch? Really appreciate if you can give some feedbacks. notification event), because the destination of the PI event is determined in NDST field of Posted-interrupts descriptor and PI notification vector is global. Just had a discussion with Jiang offline, maybe we can add the statistics information for the notification vector in /proc/interrupts just like any other global interrupts. Thanks, Feng In that aspect, I think it's acceptable:) Regards! Yes, if all of you guys(especially the IRQ maintainer) are think it is acceptable then we can follow current implementation and document it. Gerry Paolo Best regards, Yang Best regards, Yang Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Paolo Bonzini wrote on 2014-12-19: On 19/12/2014 02:46, Zhang, Yang Z wrote: If the IRQ is posted, its affinity is controlled by guest (irq --- vCPU pCPU), it has no effect when host changes its affinity. That's the problem: User is able to changes it in host but it never takes effect since it is actually controlled by guest. I guess it will break the IRQ balance too. I don't think that's a problem. Controlling the affinity in the host affects which CPU in the host takes care of signaling the guest. If this signaling is done directly by the chipset, there is no need to do anything in the host and thus the host affinity can be bypassed. I don't quite understand it. If user set an interrupt's affinity to a CPU, but he still see the interrupt delivers to other CPUs in host. Do you think it is a right behavior? Paolo Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Paolo Bonzini wrote on 2014-12-23: The problem is we still need to support PI with lowest priority delivery mode even if guest does not configure irq affinity via /proc/irq/. Don't we? Yes, but we can get the basic support working first. I and Feng talked on irc and agreed to start with a simple implementation. Then he can add support for vector hashing for all interrupts, both vt-d pi and normal LAPIC interrupts. Agree. Obviously, current PI has some limitations to achieve highest performance. We can start with a simple implementation but must ensure we don't change hardware's behavior(From guest's point). Feng's current implementation or use the way I suggested are both ok since both of them cannot solve the problem well. Paolo Best regards, Yang N�r��yb�X��ǧv�^�){.n�+h����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Feng Wu wrote on 2014-12-12: We don't need to migrate the irqs for VT-d Posted-Interrupts here. When 'pst' is set in IRTE, the associated irq will be posted to guests instead of interrupt remapping. The destination of the interrupt is set in Posted-Interrupts Descriptor, and the migration happens during vCPU scheduling. However, we still update the cached irte here, which can be used when changing back to remapping mode. Signed-off-by: Feng Wu feng...@intel.com Reviewed-by: Jiang Liu jiang@linux.intel.com --- drivers/iommu/intel_irq_remapping.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c index 48c2051..ab9057a 100644 --- a/drivers/iommu/intel_irq_remapping.c +++ b/drivers/iommu/intel_irq_remapping.c @@ -977,6 +977,7 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask, { struct intel_ir_data *ir_data = data-chip_data;struct irte *irte = ir_data-irte_entry; + struct irte_pi *irte_pi = (struct irte_pi *)irte; struct irq_cfg *cfg = irqd_cfg(data); struct irq_data *parent = data-parent_data; int ret; @@ -991,7 +992,10 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask, */ irte-vector = cfg-vector; irte-dest_id = IRTE_DEST(cfg-dest_apicid); - modify_irte(ir_data-irq_2_iommu, irte); + + /* We don't need to modify irte if the interrupt is for posting. */ + if (irte_pi-pst != 1) + modify_irte(ir_data-irq_2_iommu, irte); What happens if user changes the IRQ affinity manually? /* * After this point, all the interrupts will start arriving Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Feng Wu wrote on 2014-12-12: This patch defines a new interface kvm_find_dest_vcpu for VT-d PI, which can returns the destination vCPU of the interrupt for guests. Since VT-d PI cannot handle broadcast/multicast interrupt, Here we only handle Fixed and Lowest priority interrupts. The current method of handling guest lowest priority interrtups is to use a counter 'apic_arb_prio' for each vCPU, we choose the vCPU with smallest 'apic_arb_prio' and then increase it by 1. However, for VT-d PI, we cannot re-use this, since we no longer have control to 'apic_arb_prio' with posted interrupt direct delivery by Hardware. Here, we introduce a similar way with 'apic_arb_prio' to handle guest lowest priority interrtups when VT-d PI is used. Here is the ideas: - Each vCPU has a counter 'round_robin_counter'. - When guests sets an interrupts to lowest priority, we choose the vCPU with smallest 'round_robin_counter' as the destination, then increase it. How this can work well? All subsequent interrupts are delivered to one vCPU? It shouldn't be the best solution, need more consideration. Also, I think you should take the apic_arb_prio into consider since the priority is for the whole vCPU not for one interrupt. Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [v3 21/26] x86, irq: Define a global vector for VT-d Posted-Interrupts
Feng Wu wrote on 2014-12-12: Currently, we use a global vector as the Posted-Interrupts Notification Event for all the vCPUs in the system. We need to introduce another global vector for VT-d Posted-Interrtups, which will be used to wakeup the sleep vCPU when an external interrupt from a direct-assigned device happens for that vCPU. Hi Feng, Since the idea of two global vectors mechanism is from me, please add me to the comments. Signed-off-by: Feng Wu feng...@intel.com --- arch/x86/include/asm/entry_arch.h | 2 ++ arch/x86/include/asm/hardirq.h | 1 + arch/x86/include/asm/hw_irq.h | 2 ++ arch/x86/include/asm/irq_vectors.h | 1 + arch/x86/kernel/entry_64.S | 2 ++ arch/x86/kernel/irq.c | 27 +++ arch/x86/kernel/irqinit.c | 2 ++ 7 files changed, 37 insertions(+) diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h index dc5fa66..27ca0af 100644 --- a/arch/x86/include/asm/entry_arch.h +++ b/arch/x86/include/asm/entry_arch.h @@ -23,6 +23,8 @@ BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR) #ifdef CONFIG_HAVE_KVM BUILD_INTERRUPT3(kvm_posted_intr_ipi, POSTED_INTR_VECTOR, smp_kvm_posted_intr_ipi) +BUILD_INTERRUPT3(kvm_posted_intr_wakeup_ipi, POSTED_INTR_WAKEUP_VECTOR, + smp_kvm_posted_intr_wakeup_ipi) #endif /* diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h index 0f5fb6b..9866065 100644 --- a/arch/x86/include/asm/hardirq.h +++ b/arch/x86/include/asm/hardirq.h @@ -14,6 +14,7 @@ typedef struct { #endif #ifdef CONFIG_HAVE_KVMunsigned int kvm_posted_intr_ipis; +unsigned int kvm_posted_intr_wakeup_ipis; #endifunsigned int x86_platform_ipis; /* arch dependent */unsigned int apic_perf_irqs; diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h index e7ae6eb..38fac9b 100644 --- a/arch/x86/include/asm/hw_irq.h +++ b/arch/x86/include/asm/hw_irq.h @@ -29,6 +29,7 @@ extern asmlinkage void apic_timer_interrupt(void); extern asmlinkage void x86_platform_ipi(void); extern asmlinkage void kvm_posted_intr_ipi(void); +extern asmlinkage void kvm_posted_intr_wakeup_ipi(void); extern asmlinkage void error_interrupt(void); extern asmlinkage void irq_work_interrupt(void); @@ -92,6 +93,7 @@ extern void trace_call_function_single_interrupt(void); #define trace_irq_move_cleanup_interrupt irq_move_cleanup_interrupt #define trace_reboot_interrupt reboot_interrupt #define trace_kvm_posted_intr_ipi kvm_posted_intr_ipi +#define trace_kvm_posted_intr_wakeup_ipi kvm_posted_intr_wakeup_ipi #endif /* CONFIG_TRACING */ struct irq_domain; diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h index b26cb12..dca94f2 100644 --- a/arch/x86/include/asm/irq_vectors.h +++ b/arch/x86/include/asm/irq_vectors.h @@ -105,6 +105,7 @@ /* Vector for KVM to deliver posted interrupt IPI */ #ifdef CONFIG_HAVE_KVM #define POSTED_INTR_VECTOR 0xf2 +#define POSTED_INTR_WAKEUP_VECTOR0xf1 #endif /* diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index e61c14a..a598447 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -960,6 +960,8 @@ apicinterrupt X86_PLATFORM_IPI_VECTOR \ #ifdef CONFIG_HAVE_KVM apicinterrupt3 POSTED_INTR_VECTOR \ kvm_posted_intr_ipi smp_kvm_posted_intr_ipi +apicinterrupt3 POSTED_INTR_WAKEUP_VECTOR \ + kvm_posted_intr_wakeup_ipi smp_kvm_posted_intr_wakeup_ipi #endif #ifdef CONFIG_X86_MCE_THRESHOLD diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c index 922d285..47408c3 100644 --- a/arch/x86/kernel/irq.c +++ b/arch/x86/kernel/irq.c @@ -237,6 +237,9 @@ __visible void smp_x86_platform_ipi(struct pt_regs *regs) } #ifdef CONFIG_HAVE_KVM +void (*wakeup_handler_callback)(void) = NULL; +EXPORT_SYMBOL_GPL(wakeup_handler_callback); + /* * Handler for POSTED_INTERRUPT_VECTOR. */ @@ -256,6 +259,30 @@ __visible void smp_kvm_posted_intr_ipi(struct pt_regs *regs) set_irq_regs(old_regs); } + +/* + * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR. + */ +__visible void smp_kvm_posted_intr_wakeup_ipi(struct pt_regs *regs) { + struct pt_regs *old_regs = set_irq_regs(regs); + + ack_APIC_irq(); + + irq_enter(); + + exit_idle(); + + inc_irq_stat(kvm_posted_intr_wakeup_ipis); + + if (wakeup_handler_callback) + wakeup_handler_callback(); + + irq_exit(); + + set_irq_regs(old_regs); +} + #endif __visible void smp_trace_x86_platform_ipi(struct pt_regs *regs) diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c index 70e181e..844673c 100644 --- a/arch/x86/kernel/irqinit.c +++ b/arch/x86/kernel/irqinit.c @@ -144,6 +144,8 @@ static void __init apic_intr_init(void) #ifdef CONFIG_HAVE_KVM /* IPI for KVM to deliver
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Paolo Bonzini wrote on 2014-12-18: On 18/12/2014 04:14, Wu, Feng wrote: linux-kernel-ow...@vger.kernel.org wrote on mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Paolo: x...@kernel.org; Gleb Natapov; Paolo Bonzini; dw...@infradead.org; joro-zlv9swrftaidnm+yrof...@public.gmane.org; Alex Williamson; joro-zLv9SwRftAIdnm+Jiang Liu Cc: io...@lists.linux-foundation.org; linux-kernel-u79uwxl29ty76z2rm5m...@public.gmane.org; KVM list; Eric Auger Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set On 12/12/2014 16:14, Feng Wu wrote: Currently, we don't support urgent interrupt, all interrupts are recognized as non-urgent interrupt, so we cannot send posted-interrupt when 'SN' is set. Can this happen? If the vcpu is in guest mode, it cannot have been scheduled out, and that's the only case when SN is set. Paolo Currently, the only place where SN is set is vCPU is preempted and If the vCPU is preempted, shouldn't the subsequent be ignored? What happens if a PI is occurs when vCPU is preempted? waiting for the next scheduling in the runqueue. But I am not sure whether we need to set SN for other purpose in future. Adding SN checking here is just to follow the Spec. non-urgent interrupts are suppressed when SN is set. I would change that to a WARN_ON_ONCE then. Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [v3 12/26] KVM: Initialize VT-d Posted-Interrupts Descriptor
Feng Wu wrote on 2014-12-12: This patch initializes the VT-d Posted-Interrupts Descriptor. Signed-off-by: Feng Wu feng...@intel.com --- arch/x86/kvm/vmx.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 0b1383e..66ca275 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -45,6 +45,7 @@ #include asm/perf_event.h #include asm/debugreg.h #include asm/kexec.h +#include asm/irq_remapping.h #include trace.h @@ -4433,6 +4434,30 @@ static void ept_set_mmio_spte_mask(void) kvm_mmu_set_mmio_spte_mask((0x3ull 62) | 0x6ull); } +static void pi_desc_init(struct vcpu_vmx *vmx) { + unsigned int dest; + + if (!irq_remapping_cap(IRQ_POSTING_CAP)) + return; + + /* + * Initialize Posted-Interrupt Descriptor + */ + + pi_clear_sn(vmx-pi_desc); + vmx-pi_desc.nv = POSTED_INTR_VECTOR; Here. + + /* Physical mode for Notificaiton Event */ + vmx-pi_desc.ndm = 0; And from here.. + dest = cpu_physical_id(vmx-vcpu.cpu); + + if (x2apic_enabled()) + vmx-pi_desc.ndst = dest; + else + vmx-pi_desc.ndst = (dest 8) 0xFF00; } + ..to here are useless. The right place to update PI descriptor is where vcpu got loaded not in initialization. /* * Sets up the vmcs for emulated real mode. */ @@ -4476,6 +4501,8 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR); vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((vmx-pi_desc))); + + pi_desc_init(vmx); } if (ple_gap) { Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Paolo Bonzini wrote on 2014-12-19: On 18/12/2014 15:49, Zhang, Yang Z wrote: Here, we introduce a similar way with 'apic_arb_prio' to handle guest lowest priority interrtups when VT-d PI is used. Here is the ideas: - Each vCPU has a counter 'round_robin_counter'. - When guests sets an interrupts to lowest priority, we choose the vCPU with smallest 'round_robin_counter' as the destination, then increase it. How this can work well? All subsequent interrupts are delivered to one vCPU? It shouldn't be the best solution, need more consideration. Well, it's a hardware limitation. The alternative (which is easy to Agree, it is limited by hardware. But lowest priority distributes the interrupt more efficient than fixed mode. And current implementation more likes to switch the lowest priority mode to fixed mode. In case of interrupt intensive environment, this may be a bottleneck and VM may not benefit greatly from VT-d PI. But agree again, it is really a hardware limitation. implement) is to only do PI for single-CPU interrupts. This should work well for multiqueue NICs (and of course for UP guests :)), so perhaps it's a good idea to only support that as a first attempt. The more easy way is to deliver the interrupt to the first matched VCPU we find. The round_robin_counter really helps nothing here since the interrupt is delivered by hardware directly. Paolo Also, I think you should take the apic_arb_prio into consider since the priority is for the whole vCPU not for one interrupt. Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
Wu, Feng wrote on 2014-12-19: Zhang, Yang Z wrote on 2014-12-18: jiang@linux.intel.com Cc: eric.au...@linaro.org; linux-ker...@vger.kernel.org; io...@lists.linux-foundation.org; kvm@vger.kernel.org; Wu, Feng Subject: RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d Posted-Interrupts Feng Wu wrote on 2014-12-12: We don't need to migrate the irqs for VT-d Posted-Interrupts here. When 'pst' is set in IRTE, the associated irq will be posted to guests instead of interrupt remapping. The destination of the interrupt is set in Posted-Interrupts Descriptor, and the migration happens during vCPU scheduling. However, we still update the cached irte here, which can be used when changing back to remapping mode. Signed-off-by: Feng Wu feng...@intel.com Reviewed-by: Jiang Liu jiang@linux.intel.com --- drivers/iommu/intel_irq_remapping.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c index 48c2051..ab9057a 100644 --- a/drivers/iommu/intel_irq_remapping.c +++ b/drivers/iommu/intel_irq_remapping.c @@ -977,6 +977,7 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask, { struct intel_ir_data *ir_data = data-chip_data;struct irte *irte = ir_data-irte_entry; +struct irte_pi *irte_pi = (struct irte_pi *)irte;struct irq_cfg *cfg = irqd_cfg(data); struct irq_data *parent = data-parent_data; int ret; @@ -991,7 +992,10 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask, */ irte-vector = cfg-vector; irte-dest_id = IRTE_DEST(cfg-dest_apicid); - modify_irte(ir_data-irq_2_iommu, irte); + + /* We don't need to modify irte if the interrupt is for posting. */ + if (irte_pi-pst != 1) + modify_irte(ir_data-irq_2_iommu, irte); What happens if user changes the IRQ affinity manually? If the IRQ is posted, its affinity is controlled by guest (irq --- vCPU pCPU), it has no effect when host changes its affinity. That's the problem: User is able to changes it in host but it never takes effect since it is actually controlled by guest. I guess it will break the IRQ balance too. Thanks, Feng /* * After this point, all the interrupts will start arriving Best regards, Yang Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Wu, Feng wrote on 2014-12-19: Paolo Bonzini wrote on 2014-12-19: jiang@linux.intel.com Cc: eric.au...@linaro.org; linux-ker...@vger.kernel.org; io...@lists.linux-foundation.org; kvm@vger.kernel.org Subject: Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI On 18/12/2014 15:49, Zhang, Yang Z wrote: Here, we introduce a similar way with 'apic_arb_prio' to handle guest lowest priority interrtups when VT-d PI is used. Here is the ideas: - Each vCPU has a counter 'round_robin_counter'. - When guests sets an interrupts to lowest priority, we choose the vCPU with smallest 'round_robin_counter' as the destination, then increase it. How this can work well? All subsequent interrupts are delivered to one vCPU? It shouldn't be the best solution, need more consideration. Well, it's a hardware limitation. The alternative (which is easy to implement) is to only do PI for single-CPU interrupts. This should work well for multiqueue NICs (and of course for UP guests :)), so perhaps it's a good idea to only support that as a first attempt. Paolo Paolo, what do you mean by single-CPU interrupts? Do you mean we It should be same idea as I mentioned on another thread: deliver the interrupt to a single CPU(maybe the first matched VCPU?) don't support lowest priority interrupts for PI? But Linux OS uses lowest priority for most of the case? If so, we can hardly get benefit from this feature for Linux guest OS. Thanks, Feng Also, I think you should take the apic_arb_prio into consider since the priority is for the whole vCPU not for one interrupt. Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Wu, Feng wrote on 2014-12-19: iommu-boun...@lists.linux-foundation.org wrote on mailto:iommu-boun...@lists.linux-foundation.org] On Behalf Of: Cc: io...@lists.linux-foundation.org; linux-ker...@vger.kernel.org; kvm@vger.kernel.org Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set Paolo Bonzini wrote on 2014-12-18: On 18/12/2014 04:14, Wu, Feng wrote: linux-kernel-ow...@vger.kernel.org wrote on mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Paolo: x...@kernel.org; Gleb Natapov; Paolo Bonzini; dw...@infradead.org; joro-zlv9swrftaidnm+yrof...@public.gmane.org; Alex Williamson; joro-zLv9SwRftAIdnm+Jiang Liu Cc: io...@lists.linux-foundation.org; linux-kernel-u79uwxl29ty76z2rm5m...@public.gmane.org; KVM list; Eric Auger Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set On 12/12/2014 16:14, Feng Wu wrote: Currently, we don't support urgent interrupt, all interrupts are recognized as non-urgent interrupt, so we cannot send posted-interrupt when 'SN' is set. Can this happen? If the vcpu is in guest mode, it cannot have been scheduled out, and that's the only case when SN is set. Paolo Currently, the only place where SN is set is vCPU is preempted and If the vCPU is preempted, shouldn't the subsequent be ignored? What happens if a PI is occurs when vCPU is preempted? If a vCPU is preempted, the 'SN' bit is set, the subsequent interrupts are suppressed for posting. I mean what happens if we don't set SN bit. From my point, if preempter already disabled the interrupt, it is ok to leave SN bit as zero. But if preempter enabled the interrupt, doesn't this mean he allow interrupt to happen? BTW, since there already has ON bit, so this means there only have one interrupt arrived at most and it doesn't hurt performance. Do we really need to set SN bit? Thanks, Feng waiting for the next scheduling in the runqueue. But I am not sure whether we need to set SN for other purpose in future. Adding SN checking here is just to follow the Spec. non-urgent interrupts are suppressed when SN is set. I would change that to a WARN_ON_ONCE then. Best regards, Yang ___ iommu mailing list io...@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Wu, Feng wrote on 2014-12-19: Zhang, Yang Z wrote on 2014-12-19: Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set Wu, Feng wrote on 2014-12-19: iommu-boun...@lists.linux-foundation.org wrote on mailto:iommu-boun...@lists.linux-foundation.org] On Behalf Of: Cc: io...@lists.linux-foundation.org; linux-ker...@vger.kernel.org; kvm@vger.kernel.org Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set Paolo Bonzini wrote on 2014-12-18: On 18/12/2014 04:14, Wu, Feng wrote: linux-kernel-ow...@vger.kernel.org wrote on mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Paolo: x...@kernel.org; Gleb Natapov; Paolo Bonzini; dw...@infradead.org; joro-zlv9swrftaidnm+yrof...@public.gmane.org; Alex Williamson; joro-zLv9SwRftAIdnm+Jiang Liu Cc: io...@lists.linux-foundation.org; linux-kernel-u79uwxl29ty76z2rm5m...@public.gmane.org; KVM list; Eric Auger Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set On 12/12/2014 16:14, Feng Wu wrote: Currently, we don't support urgent interrupt, all interrupts are recognized as non-urgent interrupt, so we cannot send posted-interrupt when 'SN' is set. Can this happen? If the vcpu is in guest mode, it cannot have been scheduled out, and that's the only case when SN is set. Paolo Currently, the only place where SN is set is vCPU is preempted and If the vCPU is preempted, shouldn't the subsequent be ignored? What happens if a PI is occurs when vCPU is preempted? If a vCPU is preempted, the 'SN' bit is set, the subsequent interrupts are suppressed for posting. I mean what happens if we don't set SN bit. From my point, if preempter already disabled the interrupt, it is ok to leave SN bit as zero. But if preempter enabled the interrupt, doesn't this mean he allow interrupt to happen? BTW, since there already has ON bit, so this means there only have one interrupt arrived at most and it doesn't hurt performance. Do we really need to set SN bit? See this scenario: vCPU0 is running on pCPU0 -- vCPU0 is preempted by vCPU1 -- Then vCPU1 is running on pCPU0 and vCPU0 is waiting for schedule -- in runqueue If the we don't set SN for vCPU0, then all subsequent interrupts for vCPU0 is posted to vCPU1, this will consume hardware and software The PI vector for vCPU1 is notification vector, but the PI vector for vCPU0 should be wakeup vector. Why vCPU1 will consume this PI event? efforts and in fact it is not needed at all. If SN is set for vCPU0, VT-d hardware will not issue Notification Event for vCPU0 when an interrupt is for it, but just setting the related PIR bit. Thanks, Feng Thanks, Feng waiting for the next scheduling in the runqueue. But I am not sure whether we need to set SN for other purpose in future. Adding SN checking here is just to follow the Spec. non-urgent interrupts are suppressed when SN is set. I would change that to a WARN_ON_ONCE then. Best regards, Yang ___ iommu mailing list io...@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu Best regards, Yang Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Wu, Feng wrote on 2014-12-19: Zhang, Yang Z wrote on 2014-12-19: Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set Wu, Feng wrote on 2014-12-19: Zhang, Yang Z wrote on 2014-12-19: Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set Wu, Feng wrote on 2014-12-19: iommu-boun...@lists.linux-foundation.org wrote on mailto:iommu-boun...@lists.linux-foundation.org] On Behalf Of: Cc: io...@lists.linux-foundation.org; linux-ker...@vger.kernel.org; kvm@vger.kernel.org Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set Paolo Bonzini wrote on 2014-12-18: On 18/12/2014 04:14, Wu, Feng wrote: linux-kernel-ow...@vger.kernel.org wrote on mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Paolo: x...@kernel.org; Gleb Natapov; Paolo Bonzini; dw...@infradead.org; joro-zlv9swrftaidnm+yrof...@public.gmane.org; Alex Williamson; joro-zLv9SwRftAIdnm+Jiang Liu Cc: io...@lists.linux-foundation.org; linux-kernel-u79uwxl29ty76z2rm5m...@public.gmane.org; KVM list; Eric Auger Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set On 12/12/2014 16:14, Feng Wu wrote: Currently, we don't support urgent interrupt, all interrupts are recognized as non-urgent interrupt, so we cannot send posted-interrupt when 'SN' is set. Can this happen? If the vcpu is in guest mode, it cannot have been scheduled out, and that's the only case when SN is set. Paolo Currently, the only place where SN is set is vCPU is preempted and If the vCPU is preempted, shouldn't the subsequent be ignored? What happens if a PI is occurs when vCPU is preempted? If a vCPU is preempted, the 'SN' bit is set, the subsequent interrupts are suppressed for posting. I mean what happens if we don't set SN bit. From my point, if preempter already disabled the interrupt, it is ok to leave SN bit as zero. But if preempter enabled the interrupt, doesn't this mean he allow interrupt to happen? BTW, since there already has ON bit, so this means there only have one interrupt arrived at most and it doesn't hurt performance. Do we really need to set SN bit? See this scenario: vCPU0 is running on pCPU0 -- vCPU0 is preempted by vCPU1 -- Then vCPU1 is running on pCPU0 and vCPU0 is waiting for -- schedule in runqueue If the we don't set SN for vCPU0, then all subsequent interrupts for vCPU0 is posted to vCPU1, this will consume hardware and software The PI vector for vCPU1 is notification vector, but the PI vector for vCPU0 should be wakeup vector. Why vCPU1 will consume this PI event? Wakeup vector is only used for blocking case, when vCPU is preempted and waiting in the runqueue, the NV is the notification vector. I see your point. But from performance point, if we can schedule the vCPU to another PCPU to handle the interrupt, it would helpful. But I remember current KVM will not schedule the vCPU in run queue (even though it got preempted) to another pCPU to run(Am I right?). So it may hard to do it. Thanks, Feng efforts and in fact it is not needed at all. If SN is set for vCPU0, VT-d hardware will not issue Notification Event for vCPU0 when an interrupt is for it, but just setting the related PIR bit. Thanks, Feng Thanks, Feng waiting for the next scheduling in the runqueue. But I am not sure whether we need to set SN for other purpose in future. Adding SN checking here is just to follow the Spec. non-urgent interrupts are suppressed when SN is set. I would change that to a WARN_ON_ONCE then. Best regards, Yang ___ iommu mailing list io...@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu Best regards, Yang Best regards, Yang Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
Wu, Feng wrote on 2014-12-19: Zhang, Yang Z wrote on 2014-12-19: Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set Wu, Feng wrote on 2014-12-19: Zhang, Yang Z wrote on 2014-12-19: Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set Wu, Feng wrote on 2014-12-19: Zhang, Yang Z wrote on 2014-12-19: Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set Wu, Feng wrote on 2014-12-19: iommu-boun...@lists.linux-foundation.org wrote on mailto:iommu-boun...@lists.linux-foundation.org] On Behalf Of: Cc: io...@lists.linux-foundation.org; linux-ker...@vger.kernel.org; kvm@vger.kernel.org Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set Paolo Bonzini wrote on 2014-12-18: On 18/12/2014 04:14, Wu, Feng wrote: linux-kernel-ow...@vger.kernel.org wrote on mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Paolo: x...@kernel.org; Gleb Natapov; Paolo Bonzini; dw...@infradead.org; joro-zlv9swrftaidnm+yrof...@public.gmane.org; Alex Williamson; joro-zLv9SwRftAIdnm+Jiang Liu Cc: io...@lists.linux-foundation.org; linux-kernel-u79uwxl29ty76z2rm5m...@public.gmane.org; KVM list; Eric Auger Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set On 12/12/2014 16:14, Feng Wu wrote: Currently, we don't support urgent interrupt, all interrupts are recognized as non-urgent interrupt, so we cannot send posted-interrupt when 'SN' is set. Can this happen? If the vcpu is in guest mode, it cannot have been scheduled out, and that's the only case when SN is set. Paolo Currently, the only place where SN is set is vCPU is preempted and If the vCPU is preempted, shouldn't the subsequent be ignored? What happens if a PI is occurs when vCPU is preempted? If a vCPU is preempted, the 'SN' bit is set, the subsequent interrupts are suppressed for posting. I mean what happens if we don't set SN bit. From my point, if preempter already disabled the interrupt, it is ok to leave SN bit as zero. But if preempter enabled the interrupt, doesn't this mean he allow interrupt to happen? BTW, since there already has ON bit, so this means there only have one interrupt arrived at most and it doesn't hurt performance. Do we really need to set SN bit? See this scenario: vCPU0 is running on pCPU0 -- vCPU0 is preempted by vCPU1 -- Then vCPU1 is running on pCPU0 and vCPU0 is waiting for -- schedule in runqueue If the we don't set SN for vCPU0, then all subsequent interrupts for vCPU0 is posted to vCPU1, this will consume hardware and software The PI vector for vCPU1 is notification vector, but the PI vector for vCPU0 should be wakeup vector. Why vCPU1 will consume this PI event? Wakeup vector is only used for blocking case, when vCPU is preempted and waiting in the runqueue, the NV is the notification vector. I see your point. But from performance point, if we can schedule the vCPU to another PCPU to handle the interrupt, it would helpful. But I remember current KVM will not schedule the vCPU in run queue (even though it got preempted) to another pCPU to run(Am I right?). So it may hard to do it. KVM is using the Linux scheduler, when the preempted vCPU (in runqueue) is scheduled again depends on the scheduling algorithm itself, I think it is a little hard for us to get involved. I think what you mentioned is a little like the urgent interrupt in VT-d PI Spec. For this kind of interrupts, if an interrupt is coming for an preempted vCPU (waiting in the run queue), we need to schedule the vCPU immediately. This is some real time things. And we don't support urgent interrupt so far. Yes. IIRC, if we use two global vectors mechanism properly, there should no need to use hardware urgent interrupt mechanism. :) Thanks, Feng Thanks, Feng efforts and in fact it is not needed at all. If SN is set for vCPU0, VT-d hardware will not issue Notification Event for vCPU0 when an interrupt is for it, but just setting the related PIR bit. Thanks, Feng Thanks, Feng waiting for the next scheduling in the runqueue. But I am not sure whether we need to set SN for other purpose in future. Adding SN checking here is just to follow the Spec. non-urgent interrupts are suppressed when SN is set. I would change that to a WARN_ON_ONCE then. Best regards, Yang ___ iommu mailing list io...@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu Best regards, Yang Best regards, Yang Best regards, Yang Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] KVM: x86: reset RVI upon system reset
Zhang Haoyu wrote on 2014-12-11: Then? It's already in upstream KVM commit 4114c27d450bef228be9c7b0c40a888e18a3a636 Author: Wei Wang wei.w.w...@intel.com Date: Wed Nov 5 10:53:43 2014 +0800 KVM: x86: reset RVI upon system reset A bug was reported as follows: when running Windows 7 32-bit guests on qemu-kvm, sometimes the guests run into blue screen during reboot. The problem was that a guest's RVI was not cleared when it rebooted. This patch has fixed the problem. Signed-off-by: Wei Wang wei.w.w...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com Tested-by: Rongrong Liu rongrongx@intel.com, Da Chun ng...@qq.com Signed-off-by: Paolo Bonzini pbonz...@redhat.com On 05/11/2014 10:02, Chen, Tiejun wrote: I think both are ok. If we zero max_irr in vmx_set_rvi(), we still need this check: if ((is_guest_mode(vcpu) nested_exit_on_intr(vcpu)) || max_irr == -1) No, I don't think we need to add this. You don't, because the code will look like: if (is_guest_mode(vcpu) nested_exit_on_intr(vcpu)) return; if (!is_guest_mode(vcpu)) { vmx_set_rvi(max_irr); return; } if (max_irr == -1) return; and thus vmx_set_rvi() is never reached if is_guest_mode(vcpu) !nested_exit_on_intr(vcpu). I don't think the above code is perfect. Since hwapic_irr_update() is a hot point, it's better to move the first check after the second check. In this case, Wei's patch looks more reasonable. I applied the lapic.c part of Wei's patch, and the vmx.c part of Tiejun's patch. Paolo Best regards, Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes
Wu, Feng wrote on 2014-11-13: kvm-ow...@vger.kernel.org wrote on 2014-11-12: kvm@vger.kernel.org; io...@lists.linux-foundation.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes On 12/11/2014 10:19, Wu, Feng wrote: You can certainly backport these patches to distros that do not have VFIO. But upstream we should work on VFIO first. VFIO has feature parity with legacy device assignment, and adding a new feature that is not in VFIO would be a bad idea. By the way, do you have benchmark results for it? We have not been able to see any performance improvement for APICv on e.g. netperf. Do you mean benchmark results for APICv itself or VT-d Posted-Interrtups? Especially for VT-d posted interrupts---but it'd be great to know which workloads see the biggest speedup from APICv. We have some draft performance data internally, please see the attached. For VT-d PI, I think we can get the biggest performance gain if the VCPU is running in non-root mode for most of the time (not in HLT state), since external interrupt from assigned devices will be delivered by guest directly in this case. That means we can run some cpu intensive workload in the guests. Have you check that the CPU side posted interrupt is taking effect in w/o VT-D PI case? Per my understanding, the performance gap should be so large if you use CPU side posted interrupt. This data more like the VT-d PI vs non PI(both VT-d and CPU). Thanks, Feng Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Best regards, Yang
RE: [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes
Wu, Feng wrote on 2014-11-13: Zhang, Yang Z wrote on 2014-11-13: kvm@vger.kernel.org; io...@lists.linux-foundation.org; linux-ker...@vger.kernel.org Subject: RE: [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes Wu, Feng wrote on 2014-11-13: kvm-ow...@vger.kernel.org wrote on 2014-11-12: kvm@vger.kernel.org; io...@lists.linux-foundation.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes On 12/11/2014 10:19, Wu, Feng wrote: You can certainly backport these patches to distros that do not have VFIO. But upstream we should work on VFIO first. VFIO has feature parity with legacy device assignment, and adding a new feature that is not in VFIO would be a bad idea. By the way, do you have benchmark results for it? We have not been able to see any performance improvement for APICv on e.g. netperf. Do you mean benchmark results for APICv itself or VT-d Posted-Interrtups? Especially for VT-d posted interrupts---but it'd be great to know which workloads see the biggest speedup from APICv. We have some draft performance data internally, please see the attached. For VT-d PI, I think we can get the biggest performance gain if the VCPU is running in non-root mode for most of the time (not in HLT state), since external interrupt from assigned devices will be delivered by guest directly in this case. That means we can run some cpu intensive workload in the guests. Have you check that the CPU side posted interrupt is taking effect in w/o VT-D PI case? Per my understanding, the performance gap should be so large if you use CPU side posted interrupt. This data more like the VT-d PI vs non PI(both VT-d and CPU). Yes, this data is VT-d PI vs Non VT-d PI. The CPU side APICv mechanism (including CPU side Posted-Interrtups) is enabled. From the CPU utilization data, it seems the environment of APICv is not reasonable to me. with current APICv, the interrupt should not deliver to the PCPU where vcpu is running. Otherwise, it will force the vcpu vmexit and the CPU side posted interrupt cannot take effect. Do you set the interrupt affinity manually? Thanks, Feng Thanks, Feng Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Best regards, Yang Best regards, Yang
RE: [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes
Paolo Bonzini wrote on 2014-11-11: On 11/11/2014 10:20, Wu, Feng wrote: Since legacy KVM device assignment is effectively deprecated, have you considered how we might do this with VFIO? Thanks, I haven't thought about how to enable this in VFIO so far. I think I can continue to implement that if needed after this patch set is finished. What do you think of this? Hi Feng, we are not applying new features to legacy KVM device assignment, since it is unsafe (it does not honor ACS). Personally, I think this feature will be helpful to the legacy device assignment. Agree, vfio is the right solution for future feature enabling. But the old kvm without the good vfio supporting is still used largely today. The user really looking for this feature but they will not upgrade their kernel. It's easy for us to backport this feature to old kvm with the legacy device assignment, but it is impossible to backport the whole vfio. So I think you guys can take a consider to add this feature to both vfio and legacy device assignment. I and Alex can help you with designing a way to interface VFIO with KVM posted interrupts. Give us a few days to study these patches more, or feel free to request comments if you have ideas about it yourself. Paolo Best regards, Yang
RE: [PATCH] KVM: x86: reset RVI upon system reset
Paolo Bonzini wrote on 2014-11-05: On 05/11/2014 10:02, Chen, Tiejun wrote: I think both are ok. If we zero max_irr in vmx_set_rvi(), we still need this check: if ((is_guest_mode(vcpu) nested_exit_on_intr(vcpu)) || max_irr == -1) No, I don't think we need to add this. You don't, because the code will look like: if (is_guest_mode(vcpu) nested_exit_on_intr(vcpu)) return; if (!is_guest_mode(vcpu)) { vmx_set_rvi(max_irr); return; } if (max_irr == -1) return; and thus vmx_set_rvi() is never reached if is_guest_mode(vcpu) !nested_exit_on_intr(vcpu). I don't think the above code is perfect. Since hwapic_irr_update() is a hot point, it's better to move the first check after the second check. In this case, Wei's patch looks more reasonable. I applied the lapic.c part of Wei's patch, and the vmx.c part of Tiejun's patch. Paolo Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: nVMX: APICv seems to cause breakages
Jan Kiszka wrote on 2014-10-20: Hi all, we just started a test with Jailhouse in a VM on a decently recent Intel box. It has APICv / posted interrupts support. And it breaks Jailhouse activation (L1 breakage, host seems to be fine). Loading kvm-intel with enable_apicv=0 resolves the issue. Henning started to debug from Jailhouse POV and found out it may be NMI IPI related. We will try to dig deeper, but any support is welcome regarding the APICv specialties. It is interesting. I will spend some time to look this issue too. Could this also explain some of the incompatibilities you found, Yang? Maybe, I will do a double check. Maybe you want to check if more hypervisor work when APICv is off. Jan Best regards, Yang
RE: [PATCH v2] KVM: x86: keep eoi exit bitmap accurate before loading it.
Paolo Bonzini wrote on 2014-08-27: Il 27/08/2014 16:05, Wei Wang ha scritto: Guest may mask the IOAPIC entry before issue EOI. In such case, EOI will not be intercepted by the hypervisor, since the corresponding bit in eoi_exit_bitmap is not set after the masking of IOAPIC entry. The solution here is to OR eoi_exit_bitmap with tmr to make sure that all level-triggered interrupts have their bits in eoi_exit_bitmap set. This commit message does not explain why this change is necessary, and the relationship between this patch and the previous one. For example: -- Commit 0f6c0a740b7d (KVM: x86: always exit on EOIs for interrupts listed in the IOAPIC redir table, 2014-07-30) fixed an APICv bug where an incorrect EOI exit bitmap triggered an interrupt storm inside the guest. There is a corner case for which that patch would have disabled accelerated EOI unnecessarily. Suppose you have: - a device that was the sole user of an INTx interrupt and is hot-unplugged - an OS that masks the INTx interrupt entry in the IOAPIC after the unplug - another device that uses MSI and is subsequently hot-plugged If the OS chooses to reuse the same LAPIC interrupt vector for the two interrupts, the patch would have left the vector in the EOI exit bitmap, because KVM takes into account the stale entry in the IOAPIC redirection table. We do know exactly which masked interrupts are still in-service and thus require broadcasting an EOI to the IOAPIC: this information is in the TMR. So, this patch ORs the EOI exit bitmap provided by the ioapic with the TMR register. Thanks to the previous patch, an active level-triggered interrupt will always be included in the EOI exit bitmap. -- However, see below. Tested-by: Rongrong Liu rongrongx@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Wei Wang wei.w.w...@intel.com --- arch/x86/kvm/lapic.c | 12 arch/x86/kvm/lapic.h |1 + arch/x86/kvm/x86.c |1 + virt/kvm/ioapic.c|7 --- 4 files changed, 18 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 8c1162d..0fcac3c 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -539,6 +539,18 @@ void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr) } } +void kvm_apic_update_eoi_exitmap(struct kvm_vcpu *vcpu, u64 +*eoi_exit_bitmap) { + struct kvm_lapic *apic = vcpu-arch.apic; + u32 i; + u32 tmr; + + for (i = 0; i 8; i++) { + tmr = kvm_apic_get_reg(apic, APIC_TMR + 0x10 * i); + *((u32 *)eoi_exit_bitmap + i) |= tmr; + } +} + static void apic_update_ppr(struct kvm_lapic *apic) { u32 tpr, isrv, ppr, old_ppr; diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 6a11845..d2b96f2 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -55,6 +55,7 @@ void kvm_apic_set_version(struct kvm_vcpu *vcpu); void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr); void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir); +void kvm_apic_update_eoi_exitmap(struct kvm_vcpu *vcpu, u64 +*eoi_exit_bitmap); int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest); int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda); int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d401684..d23b558 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5992,6 +5992,7 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu) kvm_ioapic_scan_entry(vcpu, eoi_exit_bitmap, tmr); kvm_apic_update_tmr(vcpu, tmr); + kvm_apic_update_eoi_exitmap(vcpu, eoi_exit_bitmap); kvm_x86_ops-load_eoi_exitmap(vcpu, eoi_exit_bitmap); } diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index e8ce34c..ed15936 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -254,9 +254,10 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap, spin_lock(ioapic-lock); for (index = 0; index IOAPIC_NUM_PINS; index++) { e = ioapic-redirtbl[index]; - if (e-fields.trig_mode == IOAPIC_LEVEL_TRIG || - kvm_irq_has_notifier(ioapic-kvm, KVM_IRQCHIP_IOAPIC, index) || - index == RTC_GSI) { + if ((!e-fields.mask +e-fields.trig_mode == IOAPIC_LEVEL_TRIG) + || kvm_irq_has_notifier(ioapic-kvm, KVM_IRQCHIP_IOAPIC, + index) || index == RTC_GSI) { if (kvm_apic_match_dest(vcpu, NULL, 0, e-fields.dest_id, e-fields.dest_mode)) { __set_bit(e-fields.vector, There's still something missing here. Suppose you have the following: Program edge-triggered MSI for vector 123 Interrupt comes
RE: [PATCH v2] KVM: x86: check ISR and TMR to construct eoi exit bitmap
Chen, Tiejun wrote on 2014-08-15: On 2014/8/14 3:16, Wei Wang wrote: From: Yang Zhang yang.z.zh...@intel.com Guest may mask the IOAPIC entry before issue EOI. In such case, EOI will not be intercepted by hypervisor due to the corrensponding s/corrensponding/corresponding bit in eoi exit bitmap is not setting. The solution is to check ISR + TMR to construct the EOI exit bitmap. This patch is a better fixing for the issue that commit 0f6c0a740b tries to solve. Hi Paolo and Alex, Any other comments with this version? Tested-by: Alex Williamson alex.william...@redhat.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Wei Wang wei.w.w...@intel.com --- arch/x86/kvm/lapic.c | 17 + arch/x86/kvm/lapic.h |2 ++ arch/x86/kvm/x86.c |9 + virt/kvm/ioapic.c|7 --- 4 files changed, 32 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 08e8a89..0ed4bcb 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -515,6 +515,23 @@ static void pv_eoi_clr_pending(struct kvm_vcpu *vcpu) __clear_bit(KVM_APIC_PV_EOI_PENDING, vcpu-arch.apic_attention); } +void kvm_apic_zap_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap, + u32 *tmr) +{ + u32 i, reg_off, intr_in_service; +struct kvm_lapic *apic = vcpu-arch.apic; + + for (i = 0; i 8; i++) { + reg_off = 0x10 * i; + intr_in_service = apic_read_reg(apic, APIC_ISR + reg_off) + kvm_apic_get_reg(apic, APIC_TMR + reg_off); +if (intr_in_service) { +*((u32 *)eoi_exit_bitmap + i) |= intr_in_service; + tmr[i] |= intr_in_service; + } + } +} + void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr) { struct kvm_lapic *apic = vcpu-arch.apic; diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 6a11845..4ee3d70 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -53,6 +53,8 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value); u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu); void kvm_apic_set_version(struct kvm_vcpu *vcpu); +void kvm_apic_zap_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap, + u32 *tmr); void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr); void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir); int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 204422d..755b556 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6005,6 +6005,15 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu) memset(tmr, 0, 32); kvm_ioapic_scan_entry(vcpu, eoi_exit_bitmap, tmr); +/* + * Guest may mask the IOAPIC entry before issue EOI. In such case, + * EOI will not be intercepted by hypervisor due to the corrensponding s/corrensponding/corresponding + * bit in eoi exit bitmap is not setting. + * + * The solution is to check ISR + TMR to construct the EOI exit bitmap. + */ +kvm_apic_zap_eoi_exitmap(vcpu, eoi_exit_bitmap, tmr); + kvm_x86_ops-load_eoi_exitmap(vcpu, eoi_exit_bitmap); kvm_apic_update_tmr(vcpu, tmr); } diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index e8ce34c..2458a1d 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -254,9 +254,10 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap, spin_lock(ioapic-lock); for (index = 0; index IOAPIC_NUM_PINS; index++) { e = ioapic-redirtbl[index]; -if (e-fields.trig_mode == IOAPIC_LEVEL_TRIG || - kvm_irq_has_notifier(ioapic-kvm, KVM_IRQCHIP_IOAPIC, index) || - index == RTC_GSI) { +if (!e-fields.mask + (e-fields.trig_mode == IOAPIC_LEVEL_TRIG || + kvm_irq_has_notifier(ioapic-kvm, KVM_IRQCHIP_IOAPIC, + index) || index == RTC_GSI)) { if (kvm_apic_match_dest(vcpu, NULL, 0, e-fields.dest_id, e-fields.dest_mode)) { __set_bit(e-fields.vector, Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] KVM: x86: always exit on EOIs for interrupts listed in the IOAPIC redir table
Paolo Bonzini wrote on 2014-08-07: Il 07/08/2014 03:31, Zhang, Yang Z ha scritto: Let me give an example to see whether my concern is a real problem: Guest allocates a vector and set it in IOAPIC entry to deliver interrupt. Later it masks the IOAPIC entry(means stop the corresponding device) and assign this vector to a MSI device. With this patch, even the vector is not used by IOAPIC, but it still set eoi exit bitmap unconditionally. The subsequent EOIs to MSI device will force vmexit. Could this happen? Yes, I guess it could. I'm not sure whether it could on Linux or Windows. I think the right fixing is to check the ISR plus TMR to construct the eoi exit bitmap. Do you care enough to propose a patch? :) Sure. I will do it. Paolo Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v3] KVM: nVMX: nested TPR shadow/threshold emulation
Paolo Bonzini wrote on 2014-08-05: Il 05/08/2014 09:56, Zhang, Yang Z ha scritto: Wanpeng Li wrote on 2014-08-04: This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=61411 TPR shadow/threshold feature is important to speed up the Windows guest. Besides, it is a must feature for certain VMM. We map virtual APIC page address and TPR threshold from L1 VMCS. If TPR_BELOW_THRESHOLD VM exit is triggered by L2 guest and L1 interested in, we inject it into L1 VMM for handling. Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com --- v2 - v3: * nested vm entry failure if both tpr shadow and cr8 exiting bits are not set v1 - v2: * don't take L0's virtualize APIC accesses setting into account * virtual_apic_page do exactly the same thing that is done for apic_access_page * add the tpr threshold field to the read-write fields for shadow VMCS arch/x86/kvm/vmx.c | 38 -- 1 file changed, 36 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index c604f3c..7a56e2c 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -379,6 +379,7 @@ struct nested_vmx { * we must keep them pinned while L2 runs. */ struct page *apic_access_page; + struct page *virtual_apic_page; u64 msr_ia32_feature_control; struct hrtimer preemption_timer; @@ -533,6 +534,7 @@ static int max_shadow_read_only_fields = ARRAY_SIZE(shadow_read_only_fields); static unsigned long shadow_read_write_fields[] = { + TPR_THRESHOLD, GUEST_RIP, GUEST_RSP, GUEST_CR0, @@ -2330,7 +2332,7 @@ static __init void nested_vmx_setup_ctls_msrs(void) CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_EXITING | CPU_BASED_RDPMC_EXITING | CPU_BASED_RDTSC_EXITING | - CPU_BASED_PAUSE_EXITING | + CPU_BASED_PAUSE_EXITING | CPU_BASED_TPR_SHADOW | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; /* * We can allow some features even when not supported by the @@ -6148,6 +6150,10 @@ static void free_nested(struct vcpu_vmx *vmx) nested_release_page(vmx-nested.apic_access_page); vmx-nested.apic_access_page = 0; } + if (vmx-nested.virtual_apic_page) { + nested_release_page(vmx-nested.virtual_apic_page); + vmx-nested.virtual_apic_page = 0; + } nested_free_all_saved_vmcss(vmx); } @@ -6936,7 +6942,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)case EXIT_REASON_MCE_DURING_VMENTRY:return 0; case EXIT_REASON_TPR_BELOW_THRESHOLD: - return 1; + return nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW); case EXIT_REASON_APIC_ACCESS: return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES); @@ -7057,6 +7063,9 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu) static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr) { + if (is_guest_mode(vcpu)) + return; + if (irr == -1 || tpr irr) { vmcs_write32(TPR_THRESHOLD, 0); return; @@ -8024,6 +8033,27 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) exec_control = ~CPU_BASED_VIRTUAL_NMI_PENDING; exec_control = ~CPU_BASED_TPR_SHADOW; exec_control |= vmcs12-cpu_based_vm_exec_control; + + if (exec_control CPU_BASED_TPR_SHADOW) { + if (vmx-nested.virtual_apic_page) + nested_release_page(vmx-nested.virtual_apic_page); + vmx-nested.virtual_apic_page = + nested_get_page(vcpu, vmcs12-virtual_apic_page_addr); + if (!vmx-nested.virtual_apic_page) + exec_control = + ~CPU_BASED_TPR_SHADOW; + else + vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, + page_to_phys(vmx-nested.virtual_apic_page)); + + if (!(exec_control CPU_BASED_TPR_SHADOW) + !((exec_control CPU_BASED_CR8_LOAD_EXITING) + (exec_control CPU_BASED_CR8_STORE_EXITING))) + nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); I think this is not correct. The vmx-nested.virtual_apic_page may not valid due to two reasons: 1. The virtual_apic_page_addr is not a valid gfn. In this case, the vmx failure must be injected to L1 unconditionally regardless of the setting of CR8 load/store exiting. You're right that accesses to the APIC-access page may also end up writing to the virtual-APIC page. Hence, if the virtualize APIC accesses setting is 1 in the secondary exec controls, you also have to fail the vmentry. Doing it unconditionally is not correct, but it is the simplest thing Why not correct? to do and it would be okay
RE: [PATCH] KVM: x86: always exit on EOIs for interrupts listed in the IOAPIC redir table
Paolo Bonzini wrote on 2014-07-31: Currently, the EOI exit bitmap (used for APICv) does not include interrupts that are masked. However, this can cause a bug that manifests as an interrupt storm inside the guest. Alex Williamson reported the bug and is the one who really debugged this; I only wrote the patch. :) The scenario involves a multi-function PCI device with OHCI and EHCI USB functions and an audio function, all assigned to the guest, where both USB functions use legacy INTx interrupts. As soon as the guest boots, interrupts for these devices turn into an interrupt storm in the guest; the host does not see the interrupt storm. Basically the EOI path does not work, and the guest continues to see the interrupt over and over, even after it attempts to mask it at the APIC. The bug is only visible with older kernels (RHEL6.5, based on 2.6.32 with not many changes in the area of APIC/IOAPIC handling). Alex then tried forcing bit 59 (corresponding to the USB functions' IRQ) on in the eoi_exit_bitmap and TMR, and things then work. What happens is that VFIO asserts IRQ11, then KVM recomputes the EOI exit bitmap. It does not have set bit 59 because the RTE was masked, so the IOAPIC never sees the EOI and the interrupt continues to fire in the guest. Probably, the guest is masking the interrupt in the redirection table in the interrupt routine, i.e. while the interrupt is set in a LAPIC's ISR. The simplest fix is to ignore the masking state, we would rather have an unnecessary exit rather than a missed IRQ ACK and anyway IOAPIC interrupts are not as performance-sensitive as for example MSIs. I feel this fixing may hurt performance in some cases. If the mask bit is set, this means the vector in this entry may be used by other devices(like a assigned device). But here you set it in eoi exit bitmap and this will cause vmexit on each EOI which should not happen. [Thanks to Alex for his precise description of the problem and initial debugging effort. A lot of the text above is based on emails exchanged with him.] Reported-by: Alex Williamson alex.william...@redhat.com Cc: sta...@vger.kernel.org Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- virt/kvm/ioapic.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index 2458a1dc2ba9..e8ce34c9db32 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -254,10 +254,9 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap, spin_lock(ioapic-lock); for (index = 0; index IOAPIC_NUM_PINS; index++) { e = ioapic-redirtbl[index]; - if (!e-fields.mask - (e-fields.trig_mode == IOAPIC_LEVEL_TRIG || - kvm_irq_has_notifier(ioapic-kvm, KVM_IRQCHIP_IOAPIC, - index) || index == RTC_GSI)) { + if (e-fields.trig_mode == IOAPIC_LEVEL_TRIG || + kvm_irq_has_notifier(ioapic-kvm, KVM_IRQCHIP_IOAPIC, index) || + index == RTC_GSI) { if (kvm_apic_match_dest(vcpu, NULL, 0, e-fields.dest_id, e-fields.dest_mode)) { __set_bit(e-fields.vector, Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] KVM: x86: always exit on EOIs for interrupts listed in the IOAPIC redir table
Paolo Bonzini wrote on 2014-08-06: Il 06/08/2014 16:03, Zhang, Yang Z ha scritto: Paolo Bonzini wrote on 2014-07-31: Probably, the guest is masking the interrupt in the redirection table in the interrupt routine, i.e. while the interrupt is set in a LAPIC's ISR. The simplest fix is to ignore the masking state, we would rather have an unnecessary exit rather than a missed IRQ ACK and anyway IOAPIC interrupts are not as performance-sensitive as for example MSIs. I feel this fixing may hurt performance in some cases. If the mask bit is set, this means the vector in this entry may be used by other devices(like a assigned device). But here you set it in eoi exit bitmap and this will cause vmexit on each EOI which should not happen. Note that this *was* reported on an assigned device. IOAPIC should not be a performance-sensitive path. High-performance assigned devices should be using MSIs. Let me give an example to see whether my concern is a real problem: Guest allocates a vector and set it in IOAPIC entry to deliver interrupt. Later it masks the IOAPIC entry(means stop the corresponding device) and assign this vector to a MSI device. With this patch, even the vector is not used by IOAPIC, but it still set eoi exit bitmap unconditionally. The subsequent EOIs to MSI device will force vmexit. Could this happen? I think the right fixing is to check the ISR plus TMR to construct the eoi exit bitmap. Paolo Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v3] KVM: nVMX: nested TPR shadow/threshold emulation
Wanpeng Li wrote on 2014-08-04: This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=61411 TPR shadow/threshold feature is important to speed up the Windows guest. Besides, it is a must feature for certain VMM. We map virtual APIC page address and TPR threshold from L1 VMCS. If TPR_BELOW_THRESHOLD VM exit is triggered by L2 guest and L1 interested in, we inject it into L1 VMM for handling. Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com --- v2 - v3: * nested vm entry failure if both tpr shadow and cr8 exiting bits are not set v1 - v2: * don't take L0's virtualize APIC accesses setting into account * virtual_apic_page do exactly the same thing that is done for apic_access_page * add the tpr threshold field to the read-write fields for shadow VMCS arch/x86/kvm/vmx.c | 38 -- 1 file changed, 36 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index c604f3c..7a56e2c 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -379,6 +379,7 @@ struct nested_vmx { * we must keep them pinned while L2 runs. */ struct page *apic_access_page; + struct page *virtual_apic_page; u64 msr_ia32_feature_control; struct hrtimer preemption_timer; @@ -533,6 +534,7 @@ static int max_shadow_read_only_fields = ARRAY_SIZE(shadow_read_only_fields); static unsigned long shadow_read_write_fields[] = { + TPR_THRESHOLD, GUEST_RIP, GUEST_RSP, GUEST_CR0, @@ -2330,7 +2332,7 @@ static __init void nested_vmx_setup_ctls_msrs(void) CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_EXITING | CPU_BASED_RDPMC_EXITING | CPU_BASED_RDTSC_EXITING | - CPU_BASED_PAUSE_EXITING | + CPU_BASED_PAUSE_EXITING | CPU_BASED_TPR_SHADOW | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; /* * We can allow some features even when not supported by the @@ -6148,6 +6150,10 @@ static void free_nested(struct vcpu_vmx *vmx) nested_release_page(vmx-nested.apic_access_page); vmx-nested.apic_access_page = 0; } + if (vmx-nested.virtual_apic_page) { + nested_release_page(vmx-nested.virtual_apic_page); + vmx-nested.virtual_apic_page = 0; + } nested_free_all_saved_vmcss(vmx); } @@ -6936,7 +6942,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu) case EXIT_REASON_MCE_DURING_VMENTRY: return 0; case EXIT_REASON_TPR_BELOW_THRESHOLD: - return 1; + return nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW); case EXIT_REASON_APIC_ACCESS: return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES); @@ -7057,6 +7063,9 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu) static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr) { + if (is_guest_mode(vcpu)) + return; + if (irr == -1 || tpr irr) { vmcs_write32(TPR_THRESHOLD, 0); return; @@ -8024,6 +8033,27 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) exec_control = ~CPU_BASED_VIRTUAL_NMI_PENDING; exec_control = ~CPU_BASED_TPR_SHADOW; exec_control |= vmcs12-cpu_based_vm_exec_control; + + if (exec_control CPU_BASED_TPR_SHADOW) { + if (vmx-nested.virtual_apic_page) + nested_release_page(vmx-nested.virtual_apic_page); + vmx-nested.virtual_apic_page = +nested_get_page(vcpu, vmcs12-virtual_apic_page_addr); + if (!vmx-nested.virtual_apic_page) + exec_control = + ~CPU_BASED_TPR_SHADOW; + else + vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, + page_to_phys(vmx-nested.virtual_apic_page)); + + if (!(exec_control CPU_BASED_TPR_SHADOW) + !((exec_control CPU_BASED_CR8_LOAD_EXITING) + (exec_control CPU_BASED_CR8_STORE_EXITING))) + nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); I think this is not correct. The vmx-nested.virtual_apic_page may not valid due to two reasons: 1. The virtual_apic_page_addr is not a valid gfn. In this case, the vmx failure must be injected to L1 unconditionally regardless of the setting of CR8 load/store exiting. 2. The virtual_apic_page is swapped by L0. In this case, we should not inject failure to L1. + + vmcs_write32(TPR_THRESHOLD, vmcs12-tpr_threshold); + } Miss else here: If L2 owns the APIC and doesn't use TPR_SHADOW, we need to setup the vmcs02 based on vmcs01. For example, if vmcs01 is using TPR_SHADOW, then vmcs02 must set
RE: [PATCH] KVM: nVMX: nested TPR shadow/threshold emulation
Paolo Bonzini wrote on 2014-08-01: Il 01/08/2014 02:57, Zhang, Yang Z ha scritto: TPR_THRESHOLD will be likely written as zero, but the processor will never use it anyway. It's just a small optimization because nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW) will almost always be true. Theoretically, you are right. But we should not expect all VMMs follow it. It is not worth to violate the SDM just for saving two or three instructions' cost. Yes, you do need an if (cpu_has_vmx_tpr_shadow()) around the vmcs_write32. But still, checking nested_cpu_has is not strictly necessary. Right now they both are a single AND, but I have plans to change all of the cpu_has_*() checks to static keys. See v2 patch. It isn't a problem anymore. Paolo Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] KVM: nVMX: nested TPR shadow/threshold emulation
Paolo Bonzini wrote on 2014-07-31: Il 31/07/2014 10:03, Wanpeng Li ha scritto: One thing: + if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) + vmcs_write32(TPR_THRESHOLD, vmcs12-tpr_threshold); I think you can just do this write unconditionally, since most hypervisors will enable this. Also, you probably can add the tpr What will happen if a hypervisor doesn't enable it? I make it more cleaner in version two. TPR_THRESHOLD will be likely written as zero, but the processor will never use it anyway. It's just a small optimization because nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW) will almost always be true. Theoretically, you are right. But we should not expect all VMMs follow it. It is not worth to violate the SDM just for saving two or three instructions' cost. Paolo threshold field to the read-write fields for shadow VMCS. Agreed. Regards, Wanpeng Li Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
deadline of CFP for kvm forum
Hi all, I see the deadline of CFP for KVM forum is July 27, 2014. But I found there is no kvm forum selection list when I tried to submit a presentation yesterday. Is the CFP closed early than expected? BTW, it is in July 27 in US when sending this mail. best regards yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/3] KVM: nVMX: Fix virtual interrupt delivery injection
Paolo Bonzini wrote on 2014-07-17: Il 17/07/2014 06:56, Wanpeng Li ha scritto: This patch fix bug reported in https://bugzilla.kernel.org/show_bug.cgi?id=73331, after the patch http://www.spinics.net/lists/kvm/msg105230.html applied, there is some progress and the L2 can boot up, however, slowly. The original idea of this fix vid injection patch is from Zhang, Yang Z yang.z.zh...@intel.com. Interrupt which delivered by vid should be injected to L1 by L0 if current is in L1, or should be injected to L2 by L0 through the old injection way if L1 doesn't have set VM_EXIT_ACK_INTR_ON_EXIT. The current logic doen't consider these cases. This patch fix it by vid intr to L1 if current is L1 or L2 through old injection way if L1 doen't have VM_EXIT_ACK_INTR_ON_EXIT set. Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com Signed-off-by: Zhang, Yang Z yang.z.zh...@intel.com --- arch/x86/kvm/vmx.c | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 021d84a..ad36646 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -7112,8 +7112,22 @@ static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr) { if (max_irr == -1) return; - -vmx_set_rvi(max_irr); +if (!is_guest_mode(vcpu)) { +vmx_set_rvi(max_irr); +} else if (is_guest_mode(vcpu) !nested_exit_on_intr(vcpu)) { +/* + * Fall back to old way to inject the interrupt since there + * is no vAPIC-v for L2. + */ +if (vcpu-arch.exception.pending || +vcpu-arch.nmi_injected || +vcpu-arch.interrupt.pending) +return; +else if (vmx_interrupt_allowed(vcpu)) { +kvm_queue_interrupt(vcpu, max_irr, false); +vmx_inject_irq(vcpu); +} +} } static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap) What hypervisor did you test with? nested_exit_on_intr(vcpu) will Jailhouse will clear External-interrupt exiting bit. Am I right? Jan. return true for both Xen and KVM (nested_exit_on_intr is not the same thing as ACK_INTR_ON_EXIT). I guess he want to say External-interrupt exiting bit not ACK_INTR_ON_EXIT. Paolo Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/3] KVM: nVMX: Fix fail to get nested ack intr's vector during nested vmexit
Paolo Bonzini wrote on 2014-07-17: Il 17/07/2014 06:56, Wanpeng Li ha scritto: nested_exit_intr_ack_set(vcpu)) { int irq = kvm_cpu_get_interrupt(vcpu); + +if (irq 0 kvm_apic_vid_enabled(vcpu-kvm)) +irq = kvm_get_apic_interrupt(vcpu); There's something weird in this patch. If you inline kvm_cpu_get_interrupt, what you get is this: int irq; /* Beginning of kvm_cpu_get_interrupt... */ if (!irqchip_in_kernel(v-kvm)) irq = v-arch.interrupt.nr; else { irq = kvm_cpu_get_extint(v); /* PIC */ if (!kvm_apic_vid_enabled(v-kvm) irq == -1) irq = kvm_get_apic_interrupt(v); /* APIC */ } /* kvm_cpu_get_interrupt done. */ if (irq 0 kvm_apic_vid_enabled(vcpu-kvm)) irq = kvm_get_apic_interrupt(vcpu); There are just two callers of kvm_cpu_get_interrupt, and the other is protected by kvm_cpu_has_injectable_intr so it won't be executed if virtual interrupt delivery is enabled. So you patch is effectively the same as this: diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c index bd0da43..a1ec6a5 100644 --- a/arch/x86/kvm/irq.c +++ b/arch/x86/kvm/irq.c @@ -108,7 +108,7 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v) vector = kvm_cpu_get_extint(v); - if (kvm_apic_vid_enabled(v-kvm) || vector != -1) + if (vector != -1) return vector; /* PIC */ return kvm_get_apic_interrupt(v); /* APIC */ But in kvm_get_apic_interrupt I have just added this comment: /* Note that we never get here with APIC virtualization * enabled. */ because kvm_get_apic_interrupt calls apic_set_isr, and apic_set_isr must never be called with APIC virtualization enabled either. With APIC virtualization enabled, isr_count is always 1, and highest_isr_cache is always -1, and apic_set_isr breaks both of these invariants. You are right. kvm_lapic_find_highest_irr should be the right one. Paolo Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/3] KVM: nVMX: Fix fail to get nested ack intr's vector during nested vmexit
Wanpeng Li wrote on 2014-07-17: WARNING: CPU: 9 PID: 7251 at arch/x86/kvm/vmx.c:8719 nested_vmx_vmexit+0xa4/0x233 [kvm_intel]() Modules linked in: tun nfsv3 nfs_acl auth_rpcgss oid_registry nfsv4 dns_resolver nfs fscache lockd sunrpc pci_stub netconsole kvm_intel kvm bridge stp llc autofs4 8021q ipv6 uinput joydev microcode pcspkr igb i2c_algo_bit ehci_pci ehci_hcd e1000e ixgbe ptp pps_core hwmon mdio i2c_i801 i2c_core tpm_tis tpm ipmi_si ipmi_msghandler isci libsas scsi_transport_sas button dm_mirror dm_region_hash dm_log dm_mod CPU: 9 PID: 7251 Comm: qemu-system-x86 Tainted: GW 3.16.0-rc1 #2 Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.131329 11/11/2013 220f 880ffd107bf8 81493563 220f 880ffd107c38 8103f0eb 880ffd107c48 a059709a 881ffc9e0040 8800b74b8000 Call Trace: [81493563] dump_stack+0x49/0x5e [8103f0eb] warn_slowpath_common+0x7c/0x96 [a059709a] ? nested_vmx_vmexit+0xa4/0x233 [kvm_intel] [8103f11a] warn_slowpath_null+0x15/0x17 [a059709a] nested_vmx_vmexit+0xa4/0x233 [kvm_intel] [a0594295] ? nested_vmx_exit_handled+0x6a/0x39e [kvm_intel] [a0537931] ? kvm_apic_has_interrupt+0x80/0xd5 [kvm] [a05972ec] vmx_check_nested_events+0xc3/0xd3 [kvm_intel] [a051ebe9] inject_pending_event+0xd0/0x16e [kvm] [a051efa0] vcpu_enter_guest+0x319/0x704 [kvm] After commit 77b0f5d (KVM: nVMX: Ack and write vector info to intr_info if L1 asks us to), Acknowledge interrupt on exit behavior can be emulated. Current logic will ask for intr vector if it is nested vmexit and VM_EXIT_ACK_INTR_ON_EXIT is set by L1. However, intr vector for posted intr can't be got by generic read pending interrupt vector and intack routine, there is a requirement to sync from pir to irr. This patch fix it by ask the intr vector after sync pir to irr. Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com Reviewed-by: Yang Zhang yang.z.zh...@intel.com --- arch/x86/kvm/lapic.c | 1 + arch/x86/kvm/vmx.c | 3 +++ 2 files changed, 4 insertions(+) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 0069118..b7d45dc 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1637,6 +1637,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu) apic_clear_irr(vector, apic); return vector; } +EXPORT_SYMBOL_GPL(kvm_get_apic_interrupt); void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 4ae5ad8..31f1479 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -8697,6 +8697,9 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason, if ((exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT) nested_exit_intr_ack_set(vcpu)) { int irq = kvm_cpu_get_interrupt(vcpu); + + if (irq 0 kvm_apic_vid_enabled(vcpu-kvm)) + irq = kvm_get_apic_interrupt(vcpu); WARN_ON(irq 0); vmcs12-vm_exit_intr_info = irq | INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR; Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3/4] KVM: Add SMAP support when setting CR4
Paolo Bonzini wrote on 2014-03-27: Il 27/03/2014 13:25, Feng Wu ha scritto: +void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, bool ept) { unsigned bit, byte, pfec; u8 map; -bool fault, x, w, u, wf, uf, ff, smep; +bool fault, x, w, u, wf, uf, ff, smep, smap; smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP); + smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP); for (byte = 0; byte ARRAY_SIZE(mmu-permissions); ++byte) { pfec = byte 1; map = 0; @@ -3617,11 +3618,26 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu, w |= !is_write_protection(vcpu) !uf; /* Disallow supervisor fetches of user code if cr4.smep */ x = !(smep u !uf); + +/* + * SMAP:kernel-mode data accesses from user-mode + * mappings should fault. A fault is considered + * as a SMAP violation if all of the following + * conditions are ture: + * - X86_CR4_SMAP is set in CR4 + * - An user page is accessed + * - !(CPL3 X86_EFLAGS_AC is set) + * - Page fault in kernel mode + */ +smap = smap u !uf +!((kvm_x86_ops-get_cpl(vcpu) 3) +((kvm_x86_ops-get_rflags(vcpu) +X86_EFLAGS_AC) == 1)); Unfortunately this doesn't work. The reason is that changing X86_EFLAGS_AC doesn't trigger update_permission_bitmask. So the value of CPL 3 AC = 1 must not be checked in update_permission_bitmask; instead, it must be included in the index into the permissions array. You can reuse the PFERR_RSVD_MASK bit, like smapf = pfec PFERR_RSVD_MASK; ... smap = smap smapf u !uf; The VCPU can then be passed to permission_fault in order to get the value of the CPL and the AC bit. Is CPL check needed? Shouldn't it already have been covered by U bit? Please test nested virtualization too. I think PFERR_RSVD_MASK should be removed in translate_nested_gpa. Paolo Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: hyper-v support in KVM
Vadim Rozenfeld wrote on 2014-02-24: On Mon, 2014-02-24 at 03:01 +, Zhang, Yang Z wrote: Vadim Rozenfeld wrote on 2014-02-14: On Fri, 2014-02-14 at 02:35 +, Liu, RongrongX wrote: Vadim Rozenfeld wrote on 2014-02-12: On Wed, 2014-02-12 at 01:33 +, Zhang, Yang Z wrote: Vadim Rozenfeld wrote on 2014-02-10: On Mon, 2014-02-10 at 08:21 +, Zhang, Yang Z wrote: Hi Vadim, Do you know the latest status of Hyper-v Enlightenments supporting in KVM? Like how many Hyper-v interfaces are supported in KVM? Hi Yang, There is no many at the moment. KVM currently supports the following Hyper-V features: Guest Spinlocks http://msdn.microsoft.com/en-us/library/windows/hardware/ff539 08 1% 28v=vs.85%29.aspx Local APIC MSR Accesses http://msdn.microsoft.com/en-us/library/windows/hardware/ff542 39 6% 28v=vs.85%29.aspx Reference Time Counter http://msdn.microsoft.com/en-us/library/windows/hardware/ff542 63 7% 28v=vs.85%29.aspx We are going to add: Reference TSC Page http://msdn.microsoft.com/en-us/library/windows/hardware/ff542 64 3% 28v=vs.85%29.aspx Lazy EOI support, maybe more. Thanks for your update. More questions: I want to measure the performance improvement with hyper-v features enabled. So I want to know: Are those features enabled by default in KVM? In KVM - yes, but you also need to specify them in QEMU command line. They can be enabled by -cpu features hv_vapic, hv_spinlocks, hv_time, and hv_relaxed Hi Vadim, in QEMU command line, how to enable these feature? I try it with 1. [root@vt-snb9 ]#qemu-system-x86_64 -enable-kvm -m 2048 -smp 2 -net none win8.1.img -cpu feature hv_vapic,+hv_spinlocks,+hv_time,+hv_relaxed qemu-system-x86_64: -cpu feature: drive with bus=0, unit=0 (index=0) exists 2. [root@vt-snb9]# qemu-system-x86_64 -enable-kvm -m 2048 -smp 2 -net none win8.1.img -cpu qemu64,+hv_vapic,+hv_spinlocks,+hv_time,+hv_relaxed something like this: -cpu qemu64, +x2apic,family=0xf,hv_vapic,hv_spinlocks=0xfff,hv_relaxed,hv_time (for hv_vapic we also need x2apic to be enabled) I saw the win8.1 guest boot up fail with error code 0x005c after enabling hv_vapic on my ivy bridge-EP box. But it works well on my Sandy bridge-EP. Any thought? What are the bug check parameters coming with HAL_INITIALIZATION_FAILED bug check? Is your guest 32 or 64-bit? How does it work with Win8? Parameters: 0x110, 0xffd11000, 0x19, 0xc001 Also saw this issue with Win8. Thanks, Vadim. Best regards, Vadim. CPU feature hv_vapic not found CPU feature hv_spinlocks not found CPU feature hv_time not found CPU feature hv_relaxed not found CPU feature hv_vapic not found CPU feature hv_spinlocks not found CPU feature hv_time not found CPU feature hv_relaxed not found VNC server running on `::1:5900' How to turn off/on it manually? Yes. From the QEMU command line. And how can I know whether guest is using it really? There are two options - printk from the KVM side or WinDbg from the guest side. But in case of hv_time you can check the value returned by QueryPerformanceFrequency http://msdn.microsoft.com/en-us/library/windows/desktop/ms644905 % 28v=vs.85%29.aspx it should be 10MHz Also, Do you have any performance data? http://www.linux-kvm.org/wiki/images/0/0a/2012-forum-kvm_hyperv. pd f pp 16, 18 I compared DPC and ISR times with xperf for two cases - with and without enlightenment. I also have seen reports mentioned around 5-10 percent CPU usage drop on the host side, when loading guest with some disk-stress tests. Kind regards. Vadim. best regards yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Best regards, Yang Best regards, Yang Best regards, Yang
RE: hyper-v support in KVM
Vadim Rozenfeld wrote on 2014-02-14: On Fri, 2014-02-14 at 02:35 +, Liu, RongrongX wrote: Vadim Rozenfeld wrote on 2014-02-12: On Wed, 2014-02-12 at 01:33 +, Zhang, Yang Z wrote: Vadim Rozenfeld wrote on 2014-02-10: On Mon, 2014-02-10 at 08:21 +, Zhang, Yang Z wrote: Hi Vadim, Do you know the latest status of Hyper-v Enlightenments supporting in KVM? Like how many Hyper-v interfaces are supported in KVM? Hi Yang, There is no many at the moment. KVM currently supports the following Hyper-V features: Guest Spinlocks http://msdn.microsoft.com/en-us/library/windows/hardware/ff539 08 1% 28v=vs.85%29.aspx Local APIC MSR Accesses http://msdn.microsoft.com/en-us/library/windows/hardware/ff542 39 6% 28v=vs.85%29.aspx Reference Time Counter http://msdn.microsoft.com/en-us/library/windows/hardware/ff542 63 7% 28v=vs.85%29.aspx We are going to add: Reference TSC Page http://msdn.microsoft.com/en-us/library/windows/hardware/ff542 64 3% 28v=vs.85%29.aspx Lazy EOI support, maybe more. Thanks for your update. More questions: I want to measure the performance improvement with hyper-v features enabled. So I want to know: Are those features enabled by default in KVM? In KVM - yes, but you also need to specify them in QEMU command line. They can be enabled by -cpu features hv_vapic, hv_spinlocks, hv_time, and hv_relaxed Hi Vadim, in QEMU command line, how to enable these feature? I try it with 1. [root@vt-snb9 ]#qemu-system-x86_64 -enable-kvm -m 2048 -smp 2 -net none win8.1.img -cpu feature hv_vapic,+hv_spinlocks,+hv_time,+hv_relaxed qemu-system-x86_64: -cpu feature: drive with bus=0, unit=0 (index=0) exists 2. [root@vt-snb9]# qemu-system-x86_64 -enable-kvm -m 2048 -smp 2 -net none win8.1.img -cpu qemu64,+hv_vapic,+hv_spinlocks,+hv_time,+hv_relaxed something like this: -cpu qemu64, +x2apic,family=0xf,hv_vapic,hv_spinlocks=0xfff,hv_relaxed,hv_time (for hv_vapic we also need x2apic to be enabled) I saw the win8.1 guest boot up fail with error code 0x005c after enabling hv_vapic on my ivy bridge-EP box. But it works well on my Sandy bridge-EP. Any thought? Best regards, Vadim. CPU feature hv_vapic not found CPU feature hv_spinlocks not found CPU feature hv_time not found CPU feature hv_relaxed not found CPU feature hv_vapic not found CPU feature hv_spinlocks not found CPU feature hv_time not found CPU feature hv_relaxed not found VNC server running on `::1:5900' How to turn off/on it manually? Yes. From the QEMU command line. And how can I know whether guest is using it really? There are two options - printk from the KVM side or WinDbg from the guest side. But in case of hv_time you can check the value returned by QueryPerformanceFrequency http://msdn.microsoft.com/en-us/library/windows/desktop/ms644905% 28v=vs.85%29.aspx it should be 10MHz Also, Do you have any performance data? http://www.linux-kvm.org/wiki/images/0/0a/2012-forum-kvm_hyperv.pd f pp 16, 18 I compared DPC and ISR times with xperf for two cases - with and without enlightenment. I also have seen reports mentioned around 5-10 percent CPU usage drop on the host side, when loading guest with some disk-stress tests. Kind regards. Vadim. best regards yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Best regards, Yang Best regards, Yang
RE: hyper-v support in KVM
Vadim Rozenfeld wrote on 2014-02-10: On Mon, 2014-02-10 at 08:21 +, Zhang, Yang Z wrote: Hi Vadim, Do you know the latest status of Hyper-v Enlightenments supporting in KVM? Like how many Hyper-v interfaces are supported in KVM? Hi Yang, There is no many at the moment. KVM currently supports the following Hyper-V features: Guest Spinlocks http://msdn.microsoft.com/en-us/library/windows/hardware/ff539081% 28v=vs.85%29.aspx Local APIC MSR Accesses http://msdn.microsoft.com/en-us/library/windows/hardware/ff542396% 28v=vs.85%29.aspx Reference Time Counter http://msdn.microsoft.com/en-us/library/windows/hardware/ff542637% 28v=vs.85%29.aspx We are going to add: Reference TSC Page http://msdn.microsoft.com/en-us/library/windows/hardware/ff542643% 28v=vs.85%29.aspx Lazy EOI support, maybe more. Thanks for your update. More questions: I want to measure the performance improvement with hyper-v features enabled. So I want to know: Are those features enabled by default in KVM? How to turn off/on it manually? And how can I know whether guest is using it really? Also, Do you have any performance data? Kind regards. Vadim. best regards yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Best regards, Yang
hyper-v support in KVM
Hi Vadim, Do you know the latest status of Hyper-v Enlightenments supporting in KVM? Like how many Hyper-v interfaces are supported in KVM? best regards yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] iommu/intel: SNP bit is not dependent on iommu domain coherency
Alex Williamson wrote on 2013-12-24: David, Any comments on this patch? Thanks, Hi Alex, There do have some IOMMUs will treat SNP bit in the PTE as reserved (0) and will cause a reserved field violation fault if it is set but hardware not support snoop-control(bit 7 in ECAP_REG is 0). So your patch seems wrong to me. Alex On Tue, 2013-10-29 at 10:21 -0600, Alex Williamson wrote: The setting of the SNP bit in the intel-iommu page tables should not be dependent on the current capability of the iommu domain. The current VT-d spec (2.2) indicates the SNP bit is treated as reserved[0] by hardware implementations not supporting Snoop Control. Furthermore, section 3.7.3 indicates: If the Snoop Control (SC) field in extended capability Register is reported as 0, snoop behavior for access to the page mapped through second-level translation is determined by the no-snoop attribute in the request. This all seems to indicate that hardware incapable of Snoop Control will handle the SNP bit as zero regardless of the value stored in the PTE. The trouble with the current implementation is that mapping flags depend on the state of the iommu domain at the time of the mapping, yet no attempt is made to update existing mappings when the iommu domain composition changes. This leaves the iommu domain in a state where some mappings may enforce coherency, others do not, and the user of the IOMMU API has no ability to later enable the desired flags atomically with respect to DMA. If we always honor the IOMMU_CACHE flag then an IOMMU API user who specifies IOMMU_CACHE for all mappings can assume that the coherency of the mappings within a domain follow the coherency capability of the domain itself. Signed-off-by: Alex Williamson alex.william...@redhat.com --- drivers/iommu/intel-iommu.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 15e9b57..c46c6a6 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -4084,7 +4084,7 @@ static int intel_iommu_map(struct iommu_domain *domain, prot |= DMA_PTE_READ; if (iommu_prot IOMMU_WRITE) prot |= DMA_PTE_WRITE; -if ((iommu_prot IOMMU_CACHE) dmar_domain-iommu_snooping) +if (iommu_prot IOMMU_CACHE) prot |= DMA_PTE_SNP; max_addr = iova + size; Best regards, Yang N�r��yb�X��ǧv�^�){.n�+h����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf
RE: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer
Arthur Chunqi Li wrote on 2013-09-04: This patch contains the following two changes: 1. Fix the bug in nested preemption timer support. If vmexit L2-L0 with some reasons not emulated by L1, preemption timer value should be save in such exits. 2. Add support of Save VMX-preemption timer value VM-Exit controls to nVMX. With this patch, nested VMX preemption timer features are fully supported. Signed-off-by: Arthur Chunqi Li yzt...@gmail.com --- This series depends on queue. arch/x86/include/uapi/asm/msr-index.h |1 + arch/x86/kvm/vmx.c| 51 ++--- 2 files changed, 48 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h index bb04650..b93e09a 100644 --- a/arch/x86/include/uapi/asm/msr-index.h +++ b/arch/x86/include/uapi/asm/msr-index.h @@ -536,6 +536,7 @@ /* MSR_IA32_VMX_MISC bits */ #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL 29) +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE 0x1F /* AMD-V MSRs */ #define MSR_VM_CR 0xc0010114 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1f1da43..870caa8 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2204,7 +2204,14 @@ static __init void nested_vmx_setup_ctls_msrs(void) #ifdef CONFIG_X86_64 VM_EXIT_HOST_ADDR_SPACE_SIZE | #endif - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT; + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT | + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER; + if (!(nested_vmx_pinbased_ctls_high PIN_BASED_VMX_PREEMPTION_TIMER)) + nested_vmx_exit_ctls_high = + (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); + if (!(nested_vmx_exit_ctls_high VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) + nested_vmx_pinbased_ctls_high = + (~PIN_BASED_VMX_PREEMPTION_TIMER); The following logic is more clearly: if(nested_vmx_pinbased_ctls_high PIN_BASED_VMX_PREEMPTION_TIMER) nested_vmx_exit_ctls_high |= VM_EXIT_SAVE_VMX_PREEMPTION_TIMER BTW: I don't see nested_vmx_setup_ctls_msrs() considers the hardware's capability when expose those vmx features(not just preemption timer) to L1. nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | VM_EXIT_LOAD_IA32_EFER); @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2) *info2 = vmcs_read32(VM_EXIT_INTR_INFO); } +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) { + u64 delta_tsc_l1; + u32 preempt_val_l1, preempt_val_l2, preempt_scale; + + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) + MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE; + preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE); + delta_tsc_l1 = kvm_x86_ops-read_l1_tsc(vcpu, + native_read_tsc()) - vcpu-arch.last_guest_tsc; + preempt_val_l1 = delta_tsc_l1 preempt_scale; + if (preempt_val_l2 - preempt_val_l1 0) + preempt_val_l2 = 0; + else + preempt_val_l2 -= preempt_val_l1; + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2); } /* * The guest has exited. See if we can fix it or if we need userspace * assistance. @@ -6716,6 +6740,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu) struct vcpu_vmx *vmx = to_vmx(vcpu); u32 exit_reason = vmx-exit_reason; u32 vectoring_info = vmx-idt_vectoring_info; + int ret; /* If guest state is invalid, start emulating */ if (vmx-emulation_required) @@ -6795,12 +6820,15 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu) if (exit_reason kvm_vmx_max_exit_handlers kvm_vmx_exit_handlers[exit_reason]) - return kvm_vmx_exit_handlers[exit_reason](vcpu); + ret = kvm_vmx_exit_handlers[exit_reason](vcpu); else { vcpu-run-exit_reason = KVM_EXIT_UNKNOWN; vcpu-run-hw.hardware_exit_reason = exit_reason; + ret = 0; } - return 0; + if (is_guest_mode(vcpu)) + nested_adjust_preemption_timer(vcpu); Move this forward to avoid the changes for ret. + return ret; } static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr) @@ -7518,6 +7546,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { struct vcpu_vmx *vmx = to_vmx(vcpu); u32 exec_control; + u32 exit_control; vmcs_write16(GUEST_ES_SELECTOR, vmcs12-guest_es_selector); vmcs_write16(GUEST_CS_SELECTOR, vmcs12-guest_cs_selector); @@ -7691,7 +7720,10 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER * bits are further modified by
RE: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer
Arthur Chunqi Li wrote on 2013-09-05: On Thu, Sep 5, 2013 at 3:45 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: Arthur Chunqi Li wrote on 2013-09-04: This patch contains the following two changes: 1. Fix the bug in nested preemption timer support. If vmexit L2-L0 with some reasons not emulated by L1, preemption timer value should be save in such exits. 2. Add support of Save VMX-preemption timer value VM-Exit controls to nVMX. With this patch, nested VMX preemption timer features are fully supported. Signed-off-by: Arthur Chunqi Li yzt...@gmail.com --- This series depends on queue. arch/x86/include/uapi/asm/msr-index.h |1 + arch/x86/kvm/vmx.c| 51 ++--- 2 files changed, 48 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h index bb04650..b93e09a 100644 --- a/arch/x86/include/uapi/asm/msr-index.h +++ b/arch/x86/include/uapi/asm/msr-index.h @@ -536,6 +536,7 @@ /* MSR_IA32_VMX_MISC bits */ #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL 29) +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE 0x1F /* AMD-V MSRs */ #define MSR_VM_CR 0xc0010114 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1f1da43..870caa8 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2204,7 +2204,14 @@ static __init void nested_vmx_setup_ctls_msrs(void) #ifdef CONFIG_X86_64 VM_EXIT_HOST_ADDR_SPACE_SIZE | #endif - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT; + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT | + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER; + if (!(nested_vmx_pinbased_ctls_high PIN_BASED_VMX_PREEMPTION_TIMER)) + nested_vmx_exit_ctls_high = + (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); + if (!(nested_vmx_exit_ctls_high VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) + nested_vmx_pinbased_ctls_high = + (~PIN_BASED_VMX_PREEMPTION_TIMER); The following logic is more clearly: if(nested_vmx_pinbased_ctls_high PIN_BASED_VMX_PREEMPTION_TIMER) nested_vmx_exit_ctls_high |= VM_EXIT_SAVE_VMX_PREEMPTION_TIMER Here I have such consideration: this logic is wrong if CPU support PIN_BASED_VMX_PREEMPTION_TIMER but doesn't support VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, though I don't know if this does occurs. So the codes above reads the MSR and reserves the features it supports, and here I just check if these two features are supported simultaneously. No. Only VM_EXIT_SAVE_VMX_PREEMPTION_TIMER depends on PIN_BASED_VMX_PREEMPTION_TIMER. PIN_BASED_VMX_PREEMPTION_TIMER is an independent feature You remind that this piece of codes can write like this: if (!(nested_vmx_pin_based_ctls_high PIN_BASED_VMX_PREEMPTION_TIMER) || !(nested_vmx_exit_ctls_high VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) { nested_vmx_exit_ctls_high =(~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); nested_vmx_pinbased_ctls_high = (~PIN_BASED_VMX_PREEMPTION_TIMER); } This may reflect the logic I describe above that these two flags should support simultaneously, and brings less confusion. BTW: I don't see nested_vmx_setup_ctls_msrs() considers the hardware's capability when expose those vmx features(not just preemption timer) to L1. The codes just above here, when setting pinbased control for nested vmx, it firstly rdmsr MSR_IA32_VMX_PINBASED_CTLS, then use this to mask the features hardware not support. So does other control fields. Yes, I saw it. nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | VM_EXIT_LOAD_IA32_EFER); @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2) *info2 = vmcs_read32(VM_EXIT_INTR_INFO); } +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) { + u64 delta_tsc_l1; + u32 preempt_val_l1, preempt_val_l2, preempt_scale; + + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) + MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE; + preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE); + delta_tsc_l1 = kvm_x86_ops-read_l1_tsc(vcpu, + native_read_tsc()) - vcpu-arch.last_guest_tsc; + preempt_val_l1 = delta_tsc_l1 preempt_scale; + if (preempt_val_l2 - preempt_val_l1 0) + preempt_val_l2 = 0; + else + preempt_val_l2 -= preempt_val_l1; + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2); } /* * The guest has exited. See if we can fix it or if we need userspace * assistance. @@ -6716,6 +6740,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu) struct vcpu_vmx *vmx = to_vmx(vcpu); u32 exit_reason = vmx-exit_reason; u32
RE: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer
Arthur Chunqi Li wrote on 2013-09-05: Arthur Chunqi Li wrote on 2013-09-05: On Thu, Sep 5, 2013 at 3:45 PM, Zhang, Yang Z yang.z.zh...@intel.com wrote: Arthur Chunqi Li wrote on 2013-09-04: This patch contains the following two changes: 1. Fix the bug in nested preemption timer support. If vmexit L2-L0 with some reasons not emulated by L1, preemption timer value should be save in such exits. 2. Add support of Save VMX-preemption timer value VM-Exit controls to nVMX. With this patch, nested VMX preemption timer features are fully supported. Signed-off-by: Arthur Chunqi Li yzt...@gmail.com --- This series depends on queue. arch/x86/include/uapi/asm/msr-index.h |1 + arch/x86/kvm/vmx.c| 51 ++--- 2 files changed, 48 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h index bb04650..b93e09a 100644 --- a/arch/x86/include/uapi/asm/msr-index.h +++ b/arch/x86/include/uapi/asm/msr-index.h @@ -536,6 +536,7 @@ /* MSR_IA32_VMX_MISC bits */ #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL 29) +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE 0x1F /* AMD-V MSRs */ #define MSR_VM_CR 0xc0010114 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1f1da43..870caa8 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2204,7 +2204,14 @@ static __init void nested_vmx_setup_ctls_msrs(void) #ifdef CONFIG_X86_64 VM_EXIT_HOST_ADDR_SPACE_SIZE | #endif - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT; + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT | + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER; + if (!(nested_vmx_pinbased_ctls_high PIN_BASED_VMX_PREEMPTION_TIMER)) + nested_vmx_exit_ctls_high = + (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); + if (!(nested_vmx_exit_ctls_high VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) + nested_vmx_pinbased_ctls_high = + (~PIN_BASED_VMX_PREEMPTION_TIMER); The following logic is more clearly: if(nested_vmx_pinbased_ctls_high PIN_BASED_VMX_PREEMPTION_TIMER) nested_vmx_exit_ctls_high |= VM_EXIT_SAVE_VMX_PREEMPTION_TIMER Here I have such consideration: this logic is wrong if CPU support PIN_BASED_VMX_PREEMPTION_TIMER but doesn't support VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, though I don't know if this does occurs. So the codes above reads the MSR and reserves the features it supports, and here I just check if these two features are supported simultaneously. No. Only VM_EXIT_SAVE_VMX_PREEMPTION_TIMER depends on PIN_BASED_VMX_PREEMPTION_TIMER. PIN_BASED_VMX_PREEMPTION_TIMER is an independent feature You remind that this piece of codes can write like this: if (!(nested_vmx_pin_based_ctls_high PIN_BASED_VMX_PREEMPTION_TIMER) || !(nested_vmx_exit_ctls_high VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) { nested_vmx_exit_ctls_high =(~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); nested_vmx_pinbased_ctls_high = (~PIN_BASED_VMX_PREEMPTION_TIMER); } This may reflect the logic I describe above that these two flags should support simultaneously, and brings less confusion. BTW: I don't see nested_vmx_setup_ctls_msrs() considers the hardware's capability when expose those vmx features(not just preemption timer) to L1. The codes just above here, when setting pinbased control for nested vmx, it firstly rdmsr MSR_IA32_VMX_PINBASED_CTLS, then use this to mask the features hardware not support. So does other control fields. Yes, I saw it. nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | VM_EXIT_LOAD_IA32_EFER); @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2) *info2 = vmcs_read32(VM_EXIT_INTR_INFO); } +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) { + u64 delta_tsc_l1; + u32 preempt_val_l1, preempt_val_l2, preempt_scale; + + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) + MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE; + preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE); + delta_tsc_l1 = kvm_x86_ops-read_l1_tsc(vcpu, + native_read_tsc()) - vcpu-arch.last_guest_tsc; + preempt_val_l1 = delta_tsc_l1 preempt_scale; + if (preempt_val_l2 - preempt_val_l1 0) + preempt_val_l2 = 0; + else + preempt_val_l2 -= preempt_val_l1; + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2); } /* * The guest has exited. See if we can fix it or if we need userspace * assistance. @@ -6716,6 +6740,7 @@ static int vmx_handle_exit(struct kvm_vcpu
RE: [PATCH v2 5/8] KVM: nVMX: Fix guest CR3 read-back on VM-exit
Gleb Natapov wrote on 2013-08-06: On Tue, Aug 06, 2013 at 10:39:59AM +0200, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com If nested EPT is enabled, the L2 guest may change CR3 without any exits. We therefore have to read the current value from the VMCS when switching to L1. However, if paging wasn't enabled, L0 tracks L2's CR3, and GUEST_CR3 rather contains the real-mode identity map. So we need to retrieve CR3 from the architectural state after conditionally updating it - and this is what kvm_read_cr3 does. I have a headache from trying to think about it already, but shouldn't L1 be the one who setups identity map for L2? I traced what vmcs_read64(GUEST_CR3)/kvm_read_cr3(vcpu) return here and do not see Here is my understanding: In vmx_set_cr3(), if enabled ept, it will check whether target vcpu is enabling paging. When L2 running in real mode, then target vcpu is not enabling paging and it will use L0's identity map for L2. If you read GUEST_CR3 from VMCS, then you may get the L2's identity map not L1's. different values in real mode. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- arch/x86/kvm/vmx.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index b482d47..09666aa 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -8106,7 +8106,7 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) * Additionally, restore L2's PDPTR to vmcs12. */ if (enable_ept) { -vmcs12-guest_cr3 = vmcs_read64(GUEST_CR3); +vmcs12-guest_cr3 = kvm_read_cr3(vcpu); vmcs12-guest_pdptr0 = vmcs_read64(GUEST_PDPTR0); vmcs12-guest_pdptr1 = vmcs_read64(GUEST_PDPTR1); vmcs12-guest_pdptr2 = vmcs_read64(GUEST_PDPTR2); -- 1.7.3.4 Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v2 5/8] KVM: nVMX: Fix guest CR3 read-back on VM-exit
Gleb Natapov wrote on 2013-08-06: On Tue, Aug 06, 2013 at 11:44:41AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-08-06: On Tue, Aug 06, 2013 at 10:39:59AM +0200, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com If nested EPT is enabled, the L2 guest may change CR3 without any exits. We therefore have to read the current value from the VMCS when switching to L1. However, if paging wasn't enabled, L0 tracks L2's CR3, and GUEST_CR3 rather contains the real-mode identity map. So we need to retrieve CR3 from the architectural state after conditionally updating it - and this is what kvm_read_cr3 does. I have a headache from trying to think about it already, but shouldn't L1 be the one who setups identity map for L2? I traced what vmcs_read64(GUEST_CR3)/kvm_read_cr3(vcpu) return here and do not see Here is my understanding: In vmx_set_cr3(), if enabled ept, it will check whether target vcpu is enabling paging. When L2 running in real mode, then target vcpu is not enabling paging and it will use L0's identity map for L2. If you read GUEST_CR3 from VMCS, then you may get the L2's identity map not L1's. Yes, but why it makes sense to use L0 identity map for L2? I didn't see different vmcs_read64(GUEST_CR3)/kvm_read_cr3(vcpu) values because L0 and L1 use the same identity map address. When I changed identity address L1 configures vmcs_read64(GUEST_CR3)/kvm_read_cr3(vcpu) are indeed different, but the real CR3 L2 uses points to L0 identity map. If I zero L1 identity map page L2 still works. If L2 in real mode, then L2PA == L1PA. So L0's identity map also works if L2 is in real mode. Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v2 5/8] KVM: nVMX: Fix guest CR3 read-back on VM-exit
Gleb Natapov wrote on 2013-08-06: On Tue, Aug 06, 2013 at 02:12:51PM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-08-06: On Tue, Aug 06, 2013 at 11:44:41AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-08-06: On Tue, Aug 06, 2013 at 10:39:59AM +0200, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com If nested EPT is enabled, the L2 guest may change CR3 without any exits. We therefore have to read the current value from the VMCS when switching to L1. However, if paging wasn't enabled, L0 tracks L2's CR3, and GUEST_CR3 rather contains the real-mode identity map. So we need to retrieve CR3 from the architectural state after conditionally updating it - and this is what kvm_read_cr3 does. I have a headache from trying to think about it already, but shouldn't L1 be the one who setups identity map for L2? I traced what vmcs_read64(GUEST_CR3)/kvm_read_cr3(vcpu) return here and do not see Here is my understanding: In vmx_set_cr3(), if enabled ept, it will check whether target vcpu is enabling paging. When L2 running in real mode, then target vcpu is not enabling paging and it will use L0's identity map for L2. If you read GUEST_CR3 from VMCS, then you may get the L2's identity map not L1's. Yes, but why it makes sense to use L0 identity map for L2? I didn't see different vmcs_read64(GUEST_CR3)/kvm_read_cr3(vcpu) values because L0 and L1 use the same identity map address. When I changed identity address L1 configures vmcs_read64(GUEST_CR3)/kvm_read_cr3(vcpu) are indeed different, but the real CR3 L2 uses points to L0 identity map. If I zero L1 identity map page L2 still works. If L2 in real mode, then L2PA == L1PA. So L0's identity map also works if L2 is in real mode. That not the point. It may work accidentally for kvm on kvm, but what if other hypervisor plays different tricks and builds different ident map for its guest? Yes, if other hypervisor doesn't build the 1:1 mapping for its guest, it will fail to work. But I cannot imagine what kind of hypervisor will do this and what the purpose is. Anyway, current logic is definitely wrong. It should use L1's identity map instead L0's. Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v6 01/15] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
Jan Kiszka wrote on 2013-08-02: On 2013-08-02 05:04, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-08-01: From: Nadav Har'El n...@il.ibm.com Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577 switch the EFER MSR when EPT is used and the host and guest have different NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2) and want to be able to run recent KVM as L1, we need to allow L1 to use this EFER switching feature. To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available, and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former (the latter is still unsupported). Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state, respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all that's left to do in this patch is to properly advertise this feature to L1. Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using vmx_set_efer (which itself sets one of several vmcs02 fields), so we always support this feature, regardless of whether the host supports it. Reviewed-by: Orit Wasserman owass...@redhat.com Signed-off-by: Nadav Har'El n...@il.ibm.com Signed-off-by: Jun Nakajima jun.nakaj...@intel.com Signed-off-by: Xinhao Xu xinhao...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/vmx.c | 23 --- 1 file changed, 16 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e999dc7..27efa6a 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2198,7 +2198,8 @@ static __init void nested_vmx_setup_ctls_msrs(void) #else nested_vmx_exit_ctls_high = 0; #endif - nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR; + nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | + VM_EXIT_LOAD_IA32_EFER); /* entry controls */ rdmsr(MSR_IA32_VMX_ENTRY_CTLS, @@ -2207,8 +2208,8 @@ static __init void nested_vmx_setup_ctls_msrs(void) nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; nested_vmx_entry_ctls_high = VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE; - nested_vmx_entry_ctls_high |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; - + nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | + VM_ENTRY_LOAD_IA32_EFER); Just saw it, we didn't expose bit22 save VMX-preemption timer in vm-exit control but we already allowed guest to set active VMX-preemption timer in pin based vm-execution conrols. This is wrong. Does the presence of preemption timer support imply that saving its value is also supported? Then we could demand this combination (ie. do not expose preemption timer support at all to L1 if value saving is missing) and build our preemption timer emulation on top. I don't see we saved the preemption timer value to vmcs12 in prepare_vmcs12(). Will it be saved automatically? There is more broken /wrt VMX preemption timer, patches are welcome. Arthur will also try to develop test cases for it. But that topic is unrelated to this series. Jan /* cpu-based controls */ rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high); @@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) vcpu-arch.cr0_guest_owned_bits = ~vmcs12-cr0_guest_host_mask; vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu-arch.cr0_guest_owned_bits); - /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */ - vmcs_write32(VM_EXIT_CONTROLS, - vmcs12-vm_exit_controls | vmcs_config.vmexit_ctrl); - vmcs_write32(VM_ENTRY_CONTROLS, vmcs12-vm_entry_controls | + /* L2-L1 exit controls are emulated - the hardware exit is to L0 so +* we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER +* bits are further modified by vmx_set_efer() below. +*/ + vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl); Should we mentioned that save vmx preemption bit must use host|guest, not just host? + + /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are +* emulated by vmx_set_efer(), below. +*/ + vmcs_write32(VM_ENTRY_CONTROLS, + (vmcs12-vm_entry_controls ~VM_ENTRY_LOAD_IA32_EFER + ~VM_ENTRY_IA32E_MODE) | (vmcs_config.vmentry_ctrl ~VM_ENTRY_IA32E_MODE)); if (vmcs12-vm_entry_controls VM_ENTRY_LOAD_IA32_PAT) -- 1.7.10.4 Best regards, Yang Best regards, Yang
RE: [PATCH v6 01/15] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
Gleb Natapov wrote on 2013-08-01: From: Nadav Har'El n...@il.ibm.com Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577 switch the EFER MSR when EPT is used and the host and guest have different NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2) and want to be able to run recent KVM as L1, we need to allow L1 to use this EFER switching feature. To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available, and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former (the latter is still unsupported). Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state, respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all that's left to do in this patch is to properly advertise this feature to L1. Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using vmx_set_efer (which itself sets one of several vmcs02 fields), so we always support this feature, regardless of whether the host supports it. Reviewed-by: Orit Wasserman owass...@redhat.com Signed-off-by: Nadav Har'El n...@il.ibm.com Signed-off-by: Jun Nakajima jun.nakaj...@intel.com Signed-off-by: Xinhao Xu xinhao...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Gleb Natapov g...@redhat.com --- arch/x86/kvm/vmx.c | 23 --- 1 file changed, 16 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e999dc7..27efa6a 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2198,7 +2198,8 @@ static __init void nested_vmx_setup_ctls_msrs(void) #else nested_vmx_exit_ctls_high = 0; #endif - nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR; + nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | + VM_EXIT_LOAD_IA32_EFER); /* entry controls */ rdmsr(MSR_IA32_VMX_ENTRY_CTLS, @@ -2207,8 +2208,8 @@ static __init void nested_vmx_setup_ctls_msrs(void) nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; nested_vmx_entry_ctls_high = VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE; - nested_vmx_entry_ctls_high |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; - + nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | +VM_ENTRY_LOAD_IA32_EFER); Just saw it, we didn't expose bit22 save VMX-preemption timer in vm-exit control but we already allowed guest to set active VMX-preemption timer in pin based vm-execution conrols. This is wrong. /* cpu-based controls */ rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high); @@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) vcpu-arch.cr0_guest_owned_bits = ~vmcs12-cr0_guest_host_mask; vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu-arch.cr0_guest_owned_bits); - /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */ - vmcs_write32(VM_EXIT_CONTROLS, - vmcs12-vm_exit_controls | vmcs_config.vmexit_ctrl); - vmcs_write32(VM_ENTRY_CONTROLS, vmcs12-vm_entry_controls | + /* L2-L1 exit controls are emulated - the hardware exit is to L0 so + * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER + * bits are further modified by vmx_set_efer() below. + */ + vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl); Should we mentioned that save vmx preemption bit must use host|guest, not just host? + + /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are + * emulated by vmx_set_efer(), below. + */ + vmcs_write32(VM_ENTRY_CONTROLS, + (vmcs12-vm_entry_controls ~VM_ENTRY_LOAD_IA32_EFER + ~VM_ENTRY_IA32E_MODE) | (vmcs_config.vmentry_ctrl ~VM_ENTRY_IA32E_MODE)); if (vmcs12-vm_entry_controls VM_ENTRY_LOAD_IA32_PAT) -- 1.7.10.4 Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
-Original Message- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Monday, July 08, 2013 8:38 PM To: Zhang, Yang Z Cc: Jan Kiszka; Paolo Bonzini; Nakajima, Jun; kvm@vger.kernel.org Subject: Re: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 On Thu, Jul 04, 2013 at 08:42:53AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-07-02: On Tue, Jul 02, 2013 at 05:34:56PM +0200, Jan Kiszka wrote: On 2013-07-02 17:15, Gleb Natapov wrote: On Tue, Jul 02, 2013 at 04:28:56PM +0200, Jan Kiszka wrote: On 2013-07-02 15:59, Gleb Natapov wrote: On Tue, Jul 02, 2013 at 03:01:24AM +, Zhang, Yang Z wrote: Since this series is pending in mail list for long time. And it's really a big feature for Nested. Also, I doubt the original authors(Jun and Nahav)should not have enough time to continue it. So I will pick it up. :) See comments below: Paolo Bonzini wrote on 2013-05-20: Il 19/05/2013 06:52, Jun Nakajima ha scritto: From: Nadav Har'El n...@il.ibm.com Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577 switch the EFER MSR when EPT is used and the host and guest have different NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2) and want to be able to run recent KVM as L1, we need to allow L1 to use this EFER switching feature. To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available, and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former (the latter is still unsupported). Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state, respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all that's left to do in this patch is to properly advertise this feature to L1. Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using vmx_set_efer (which itself sets one of several vmcs02 fields), so we always support this feature, regardless of whether the host supports it. Signed-off-by: Nadav Har'El n...@il.ibm.com Signed-off-by: Jun Nakajima jun.nakaj...@intel.com Signed-off-by: Xinhao Xu xinhao...@intel.com --- arch/x86/kvm/vmx.c | 23 --- 1 file changed, 16 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 260a919..fb9cae5 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2192,7 +2192,8 @@ static __init void nested_vmx_setup_ctls_msrs(void) #else nested_vmx_exit_ctls_high = 0; #endif - nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR; + nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | + VM_EXIT_LOAD_IA32_EFER); /* entry controls */ rdmsr(MSR_IA32_VMX_ENTRY_CTLS, @@ -2201,8 +2202,8 @@ static __init void nested_vmx_setup_ctls_msrs(void) nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; nested_vmx_entry_ctls_high = VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE; - nested_vmx_entry_ctls_high |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; - + nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | + VM_ENTRY_LOAD_IA32_EFER); /* cpu-based controls */ rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high); @@ -7492,10 +7493,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) vcpu-arch.cr0_guest_owned_bits = ~vmcs12-cr0_guest_host_mask; vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu-arch.cr0_guest_owned_bits); - /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */ - vmcs_write32(VM_EXIT_CONTROLS, - vmcs12-vm_exit_controls | vmcs_config.vmexit_ctrl); -vmcs_write32(VM_ENTRY_CONTROLS, vmcs12-vm_entry_controls | + /* L2-L1 exit controls are emulated - the hardware exit is +to L0 so + * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER +* bits are further modified by vmx_set_efer() below. + */ + vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl); This is wrong. We cannot use L0 exit control directly. LOAD_PERF_GLOBAL_CTRL, LOAD_HOST_EFE, LOAD_HOST_PAT, ACK_INTR_ON_EXIT should use host's exit control. But others, still need use (vmcs12|host). I do not see why. We always intercept DR7/PAT/EFER, so save is emulated too. Host address space size always come from L0 and preemption timer is not supported for nested IIRC and when it will be host will have to save it on exit anyway for correct emulation. Preemption timer is already supported and works fine as far as I tested. KVM doesn't use it for L1, so we do not need to save/restore it - IIRC. So what
RE: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
Gleb Natapov wrote on 2013-07-02: On Tue, Jul 02, 2013 at 05:34:56PM +0200, Jan Kiszka wrote: On 2013-07-02 17:15, Gleb Natapov wrote: On Tue, Jul 02, 2013 at 04:28:56PM +0200, Jan Kiszka wrote: On 2013-07-02 15:59, Gleb Natapov wrote: On Tue, Jul 02, 2013 at 03:01:24AM +, Zhang, Yang Z wrote: Since this series is pending in mail list for long time. And it's really a big feature for Nested. Also, I doubt the original authors(Jun and Nahav)should not have enough time to continue it. So I will pick it up. :) See comments below: Paolo Bonzini wrote on 2013-05-20: Il 19/05/2013 06:52, Jun Nakajima ha scritto: From: Nadav Har'El n...@il.ibm.com Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577 switch the EFER MSR when EPT is used and the host and guest have different NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2) and want to be able to run recent KVM as L1, we need to allow L1 to use this EFER switching feature. To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available, and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former (the latter is still unsupported). Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state, respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all that's left to do in this patch is to properly advertise this feature to L1. Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using vmx_set_efer (which itself sets one of several vmcs02 fields), so we always support this feature, regardless of whether the host supports it. Signed-off-by: Nadav Har'El n...@il.ibm.com Signed-off-by: Jun Nakajima jun.nakaj...@intel.com Signed-off-by: Xinhao Xu xinhao...@intel.com --- arch/x86/kvm/vmx.c | 23 --- 1 file changed, 16 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 260a919..fb9cae5 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2192,7 +2192,8 @@ static __init void nested_vmx_setup_ctls_msrs(void) #else nested_vmx_exit_ctls_high = 0; #endif - nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR; + nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | + VM_EXIT_LOAD_IA32_EFER); /* entry controls */ rdmsr(MSR_IA32_VMX_ENTRY_CTLS, @@ -2201,8 +2202,8 @@ static __init void nested_vmx_setup_ctls_msrs(void) nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; nested_vmx_entry_ctls_high = VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE; - nested_vmx_entry_ctls_high |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; - + nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | + VM_ENTRY_LOAD_IA32_EFER); /* cpu-based controls */ rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high); @@ -7492,10 +7493,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) vcpu-arch.cr0_guest_owned_bits = ~vmcs12-cr0_guest_host_mask; vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu-arch.cr0_guest_owned_bits); - /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */ - vmcs_write32(VM_EXIT_CONTROLS, - vmcs12-vm_exit_controls | vmcs_config.vmexit_ctrl); -vmcs_write32(VM_ENTRY_CONTROLS, vmcs12-vm_entry_controls | + /* L2-L1 exit controls are emulated - the hardware exit is +to L0 so + * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER +* bits are further modified by vmx_set_efer() below. + */ + vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl); This is wrong. We cannot use L0 exit control directly. LOAD_PERF_GLOBAL_CTRL, LOAD_HOST_EFE, LOAD_HOST_PAT, ACK_INTR_ON_EXIT should use host's exit control. But others, still need use (vmcs12|host). I do not see why. We always intercept DR7/PAT/EFER, so save is emulated too. Host address space size always come from L0 and preemption timer is not supported for nested IIRC and when it will be host will have to save it on exit anyway for correct emulation. Preemption timer is already supported and works fine as far as I tested. KVM doesn't use it for L1, so we do not need to save/restore it - IIRC. So what happens if L1 configures it to value X after X/2 ticks L0 exit happen and L0 gets back to L2 directly. The counter will be X again instead of X/2. Likely. Yes, we need to improve our emulation by setting Save VMX-preemption timer value or emulate this in software if the hardware lacks support for it (was this flag introduced after the preemption timer itself?). Not sure, but my point was that for correct emulation host needs to set save preempt timer
RE: IA32_FEATURE_CONTROL MSR in nested virt
Il 03/07/2013 10:24, Arthur Chunqi Li ha scritto: Hi Gleb and Paolo, When I write test cases for nested virt and found that reading/writing IA32_FEATURE_CONTROL will be simply ignored or return 0 (in arch/x86/kvm/vmx.c) in VM. Checking this MSR will be done by some hypervisors (e.g. NOVA) and may cause error then, so it is necessary to behave right when read/write it in VM. Are there any difficulties to handle this MSR? I have two solutions. The first one is return the value of physical CPU's and always return true when write. This is simple but may behave as if it is a VM because write to it after VMXON will not return GP exception. This solution can solve most basic problems since this MSR is not commonly used. Another solution is adding a field in VCPU to handle this MSR. This is a complex but better method. I think I can complete this if needed. Would it be enough to return 5 (binary 101) on reads if nested VMX is enabled, and #GP on writes? Agree!, It should be enough. You cannot return physical value directly since there is no SMX in nested. Best regards, Yang
RE: [PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
Since this series is pending in mail list for long time. And it's really a big feature for Nested. Also, I doubt the original authors(Jun and Nahav)should not have enough time to continue it. So I will pick it up. :) See comments below: Paolo Bonzini wrote on 2013-05-20: Il 19/05/2013 06:52, Jun Nakajima ha scritto: From: Nadav Har'El n...@il.ibm.com Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577 switch the EFER MSR when EPT is used and the host and guest have different NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2) and want to be able to run recent KVM as L1, we need to allow L1 to use this EFER switching feature. To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available, and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former (the latter is still unsupported). Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state, respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all that's left to do in this patch is to properly advertise this feature to L1. Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using vmx_set_efer (which itself sets one of several vmcs02 fields), so we always support this feature, regardless of whether the host supports it. Signed-off-by: Nadav Har'El n...@il.ibm.com Signed-off-by: Jun Nakajima jun.nakaj...@intel.com Signed-off-by: Xinhao Xu xinhao...@intel.com --- arch/x86/kvm/vmx.c | 23 --- 1 file changed, 16 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 260a919..fb9cae5 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2192,7 +2192,8 @@ static __init void nested_vmx_setup_ctls_msrs(void) #else nested_vmx_exit_ctls_high = 0; #endif - nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR; + nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | + VM_EXIT_LOAD_IA32_EFER); /* entry controls */ rdmsr(MSR_IA32_VMX_ENTRY_CTLS, @@ -2201,8 +2202,8 @@ static __init void nested_vmx_setup_ctls_msrs(void) nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; nested_vmx_entry_ctls_high = VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE; - nested_vmx_entry_ctls_high |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; - + nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | + VM_ENTRY_LOAD_IA32_EFER); /* cpu-based controls */ rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high); @@ -7492,10 +7493,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) vcpu-arch.cr0_guest_owned_bits = ~vmcs12-cr0_guest_host_mask; vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu-arch.cr0_guest_owned_bits); - /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */ - vmcs_write32(VM_EXIT_CONTROLS, - vmcs12-vm_exit_controls | vmcs_config.vmexit_ctrl); - vmcs_write32(VM_ENTRY_CONTROLS, vmcs12-vm_entry_controls | + /* L2-L1 exit controls are emulated - the hardware exit is to L0 so +* we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER +* bits are further modified by vmx_set_efer() below. +*/ + vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl); This is wrong. We cannot use L0 exit control directly. LOAD_PERF_GLOBAL_CTRL, LOAD_HOST_EFE, LOAD_HOST_PAT, ACK_INTR_ON_EXIT should use host's exit control. But others, still need use (vmcs12|host). + + /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are +* emulated by vmx_set_efer(), below. VM_ENTRY_LOAD_IA32_EFER is not emulated by vmx_set_efer, so: VM_ENTRY_LOAD_IA32_EFER is hanlded in setup_msrs(), and vmx_set_efer already call it. /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE * are emulated below. VM_ENTRY_IA32E_MODE is handled in * vmx_set_efer(). */ Paolo +*/ + vmcs_write32(VM_ENTRY_CONTROLS, + (vmcs12-vm_entry_controls ~VM_ENTRY_LOAD_IA32_EFER + ~VM_ENTRY_IA32E_MODE) | (vmcs_config.vmentry_ctrl ~VM_ENTRY_IA32E_MODE)); if (vmcs12-vm_entry_controls VM_ENTRY_LOAD_IA32_PAT) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v10 7/7] KVM: VMX: Use posted interrupt to deliver virtual interrupt
Yangminqiang wrote on 2013-05-03: Nakajima, Jun wrote on 2013-04-26: Subject: Re: [PATCH v10 7/7] KVM: VMX: Use posted interrupt to deliver virtual interrupt On Fri, Apr 26, 2013 at 2:29 AM, Yangminqiang yangminqi...@huawei.com wrote: Ivytown or newer platform supported it. Ivytown? Do you mean Ivy Bridge? Ivy Town is the codename of Ivy Bridge-based servers. One more question, what is the relationship between x2APIC and APIC virtualization? APIC-v requires x2APIC or APIC-v includes x2APIC? If you are using x2apic way(MSR base access)inside guest and want to benefit from apic virtualization technology, then you should set virtual x2apic bit in Secondary Processor-Based VM-Execution Controls. Best regards, Yang
RE: [PATCH v10 7/7] KVM: VMX: Use posted interrupt to deliver virtual interrupt
Yangminqiang wrote on 2013-04-26: Hi Yang Zhang, Could you please let me know your CPU model or the CPU models which supports apic-v which your patch requires()? So that I could try you patches. Intel Software Developer's Manualm, Volume 3C, System Programming Guide, Part 3. Ch29, APIC VIRTUALIZATION AND VIRTUAL INTERRUPTS Or how can I know whether my hardware support those features listed in the manual above? Ivytown or newer platform supported it. Thanks, Steven kvm-ow...@vger.kernel.org wrote on 2013-04-11: Subject: [PATCH v10 7/7] KVM: VMX: Use posted interrupt to deliver virtual interrupt From: Yang Zhang yang.z.zh...@intel.com If posted interrupt is avaliable, then uses it to inject virtual interrupt to guest. Signed-off-by: Yang Zhang yang.z.zh...@intel.com --- arch/x86/kvm/lapic.c | 30 +++--- arch/x86/kvm/vmx.c |2 +- arch/x86/kvm/x86.c |1 + 3 files changed, 21 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index dbf74c9..e29883c 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -353,6 +353,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic) if (!apic-irr_pending) return -1; +kvm_x86_ops-sync_pir_to_irr(apic-vcpu); result = apic_search_irr(apic); ASSERT(result == -1 || result = 16); @@ -683,18 +684,25 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode, if (dest_map) __set_bit(vcpu-vcpu_id, dest_map); -result = !apic_test_and_set_irr(vector, apic); -trace_kvm_apic_accept_irq(vcpu-vcpu_id, delivery_mode, - trig_mode, vector, !result); -if (!result) { -if (trig_mode) -apic_debug(level trig mode repeatedly for -vector %d, vector); -break; -} +if (kvm_x86_ops-deliver_posted_interrupt) { +result = 1; +kvm_x86_ops-deliver_posted_interrupt(vcpu, vector); +} else { +result = !apic_test_and_set_irr(vector, apic); -kvm_make_request(KVM_REQ_EVENT, vcpu); -kvm_vcpu_kick(vcpu); +if (!result) { +if (trig_mode) +apic_debug(level trig mode repeatedly +for vector %d, vector); +goto out; +} + +kvm_make_request(KVM_REQ_EVENT, vcpu); +kvm_vcpu_kick(vcpu); +} +out: +trace_kvm_apic_accept_irq(vcpu-vcpu_id, delivery_mode, +trig_mode, vector, !result); break; case APIC_DM_REMRD: diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 314b2ed..52b21da 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -84,7 +84,7 @@ module_param(vmm_exclusive, bool, S_IRUGO); static bool __read_mostly fasteoi = 1; module_param(fasteoi, bool, S_IRUGO); -static bool __read_mostly enable_apicv; +static bool __read_mostly enable_apicv = 1; module_param(enable_apicv, bool, S_IRUGO); /* diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6147d24..628582f 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2685,6 +2685,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s) { + kvm_x86_ops-sync_pir_to_irr(vcpu); memcpy(s-regs, vcpu-arch.apic-regs, sizeof *s); return 0; -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Best regards, Yang
RE: [patch] x86, kvm: fix build failure with CONFIG_SMP disabled
David Rientjes wrote on 2013-04-18: On Wed, 17 Apr 2013, Randy Dunlap wrote: On 04/17/13 16:12, David Rientjes wrote: The build fails when CONFIG_SMP is disabled: arch/x86/kvm/vmx.c: In function 'vmx_deliver_posted_interrupt': arch/x86/kvm/vmx.c:3950:3: error: 'apic' undeclared (first use in this function) Fix it by including the necessary header. Sorry, i386 build still fails with the same error message plus this one: ERROR: apic [arch/x86/kvm/kvm-intel.ko] undefined! Ahh, that's because you don't have CONFIG_X86_LOCAL_APIC as you already mentioned. So it looks like this error can manifest in two different ways and we got different reports. This failure came from KVM: VMX: Add the deliver posted interrupt algorithm, so adding Yang to the cc to specify the dependency this has on apic and how it can be protected without CONFIG_X86_LOCAL_APIC on i386. How about the follow patch? commit a49dd819f502c1029c5a857e87201ef25ec06ce6 Author: Yang Zhang yang.z.zh...@intel.com Date: Wed Apr 17 05:34:07 2013 -0400 KVM: x86: Don't sending posted interrupt if not config CONFIG_SMP In UP, posted interrupt logic will not work. So we should not send posted interrupt and let vcpu to pick the pending interrupt before vmentry. Signed-off-by: Yang Zhang yang.z.zh...@intel.com diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 52b21da..d5c6b95 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3946,10 +3946,12 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector) r = pi_test_and_set_on(vmx-pi_desc); kvm_make_request(KVM_REQ_EVENT, vcpu); +#ifdef CONFIG_SMP if (!r (vcpu-mode == IN_GUEST_MODE)) apic-send_IPI_mask(get_cpu_mask(vcpu-cpu), POSTED_INTR_VECTOR); else +#endif kvm_vcpu_kick(vcpu); } Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [patch] x86, kvm: fix build failure with CONFIG_SMP disabled
Randy Dunlap wrote on 2013-04-18: On 04/17/13 17:35, Zhang, Yang Z wrote: David Rientjes wrote on 2013-04-18: On Wed, 17 Apr 2013, Randy Dunlap wrote: On 04/17/13 16:12, David Rientjes wrote: The build fails when CONFIG_SMP is disabled: arch/x86/kvm/vmx.c: In function 'vmx_deliver_posted_interrupt': arch/x86/kvm/vmx.c:3950:3: error: 'apic' undeclared (first use in this function) Fix it by including the necessary header. Sorry, i386 build still fails with the same error message plus this one: ERROR: apic [arch/x86/kvm/kvm-intel.ko] undefined! Ahh, that's because you don't have CONFIG_X86_LOCAL_APIC as you already mentioned. So it looks like this error can manifest in two different ways and we got different reports. This failure came from KVM: VMX: Add the deliver posted interrupt algorithm, so adding Yang to the cc to specify the dependency this has on apic and how it can be protected without CONFIG_X86_LOCAL_APIC on i386. How about the follow patch? commit a49dd819f502c1029c5a857e87201ef25ec06ce6 Author: Yang Zhang yang.z.zh...@intel.com Date: Wed Apr 17 05:34:07 2013 -0400 KVM: x86: Don't sending posted interrupt if not config CONFIG_SMP In UP, posted interrupt logic will not work. So we should not send posted interrupt and let vcpu to pick the pending interrupt before vmentry. Signed-off-by: Yang Zhang yang.z.zh...@intel.com Missing Reported-by: and the patch does not apply cleanly (looks like lots of spaces instead of tabs in it)... but it does build now after massaging the patch. Thanks. Just copy it to you for a quick testing. I will resend a formal patch. Thanks. Acked-by: Randy Dunlap rdun...@infradead.org diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 52b21da..d5c6b95 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3946,10 +3946,12 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector) r = pi_test_and_set_on(vmx-pi_desc); kvm_make_request(KVM_REQ_EVENT, vcpu); +#ifdef CONFIG_SMP if (!r (vcpu-mode == IN_GUEST_MODE)) apic-send_IPI_mask(get_cpu_mask(vcpu-cpu), POSTED_INTR_VECTOR); else +#endif kvm_vcpu_kick(vcpu); } Best regards, Yang -- ~Randy -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v9 0/7] KVM: VMX: Add Posted Interrupt supporting
Gleb Natapov wrote on 2013-04-11: On Thu, Apr 11, 2013 at 01:03:30AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-04-10: On Wed, Apr 10, 2013 at 09:22:50PM +0800, Yang Zhang wrote: From: Yang Zhang yang.z.zh...@intel.com The follwoing patches are adding the Posted Interrupt supporting to KVM: The first patch enables the feature 'acknowledge interrupt on vmexit'.Since it is required by Posted interrupt, we need to enable it firstly. And the subsequent patches are adding the posted interrupt supporting: Posted Interrupt allows APIC interrupts to inject into guest directly without any vmexit. - When delivering a interrupt to guest, if target vcpu is running, update Posted-interrupt requests bitmap and send a notification event to the vcpu. Then the vcpu will handle this interrupt automatically, without any software involvemnt. - If target vcpu is not running or there already a notification event pending in the vcpu, do nothing. The interrupt will be handled by next vm entry Changes from v8 to v9: * Add tracing in PI case when deliver interrupt. * Scan ioapic when updating SPIV register. Do not see it at the patch series. Have I missed it? The change is in forth patch: diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 6796218..4ccdc94 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -134,11 +134,7 @@ static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val) static_key_slow_inc(apic_sw_disabled.key); } apic_set_reg(apic, APIC_SPIV, val); -} - -static inline int apic_enabled(struct kvm_lapic *apic) -{ -return kvm_apic_sw_enabled(apic) kvm_apic_hw_enabled(apic); +kvm_make_request(KVM_REQ_SCAN_IOAPIC, apic-vcpu); } OK, see it now. Thanks. As you mentioned, since it will call apic_enabled() to check whether apic is enabled in vcpu_scan_ioapic. So we must ensure rescan ioapic when apic state changed. And I found recalculate_apic_map() doesn't track the enable/disable apic by software approach. So make_scan_ioapic_request in recalculate_apic_map() is not enough. We also should force rescan ioapic when apic state is changed via software approach(update spiv reg). 10.4.7.2 Local APIC State After It Has Been Software Disabled says: Pending interrupts in the IRR and ISR registers are held and require masking or handling by the CPU. My understanding is that we should treat software disabled APIC as a valid target for an interrupt. vcpu_scan_ioapic() should check kvm_apic_hw_enabled() only. Indeed. kvm_apic_hw_enabled() is the right one. Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v9 7/7] KVM: Use eoi to track RTC interrupt delivery status
Gleb Natapov wrote on 2013-04-11: On Wed, Apr 10, 2013 at 09:22:20PM +0800, Yang Zhang wrote: From: Yang Zhang yang.z.zh...@intel.com Current interrupt coalescing logci which only used by RTC has conflict with Posted Interrupt. This patch introduces a new mechinism to use eoi to track interrupt: When delivering an interrupt to vcpu, the pending_eoi set to number of vcpu that received the interrupt. And decrease it when each vcpu writing eoi. No subsequent RTC interrupt can deliver to vcpu until all vcpus write eoi. Signed-off-by: Yang Zhang yang.z.zh...@intel.com --- virt/kvm/ioapic.c | 39 ++- 1 files changed, 38 insertions(+), 1 deletions(-) diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index a49fcd5..aeac154 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -147,6 +147,26 @@ static void kvm_rtc_eoi_tracking_restore_all(struct kvm_ioapic *ioapic) __rtc_irq_eoi_tracking_restore_one(vcpu); } +static void rtc_irq_eoi(struct kvm_ioapic *ioapic, struct kvm_vcpu *vcpu) +{ +if (test_and_clear_bit(vcpu-vcpu_id, ioapic-rtc_status.dest_map)) +--ioapic-rtc_status.pending_eoi; + +WARN_ON(ioapic-rtc_status.pending_eoi 0); +} + +static bool rtc_irq_check_coalesced(struct kvm_ioapic *ioapic, int irq, +bool line_status) +{ +if (irq != RTC_GSI || !line_status) +return false; Please move the check from rtc_irq_check_coalesced() to kvm_ioapic_set_irq() like this: if (irq == RTC_GSI line_status rtc_irq_check_coalesced(ioapic, irq, line_status)) I was going to fix it myself while applying, but since there will be new posted interrupt series anyway you can as well fix this one too. You mean fix it and send out it with posted interrupt series? Or just rebase the posted interrupt series on the top of this fix, but needn't to send out it? + +if (ioapic-rtc_status.pending_eoi 0) +return true; /* coalesced */ + +return false; +} + static int ioapic_service(struct kvm_ioapic *ioapic, unsigned int idx, bool line_status) { @@ -260,6 +280,7 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq, bool line_status) { union kvm_ioapic_redirect_entry *entry = ioapic-redirtbl[irq]; struct kvm_lapic_irq irqe; +int ret; ioapic_debug(dest=%x dest_mode=%x delivery_mode=%x vector=%x trig_mode=%x\n, @@ -275,7 +296,15 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq, bool line_status) irqe.level = 1; irqe.shorthand = 0; -return kvm_irq_delivery_to_apic(ioapic-kvm, NULL, irqe, NULL); +if (irq == RTC_GSI line_status) { +BUG_ON(ioapic-rtc_status.pending_eoi != 0); +ret = kvm_irq_delivery_to_apic(ioapic-kvm, NULL, irqe, +ioapic-rtc_status.dest_map); +ioapic-rtc_status.pending_eoi = ret; +} else +ret = kvm_irq_delivery_to_apic(ioapic-kvm, NULL, irqe, NULL); + +return ret; } int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int irq_source_id, @@ -299,6 +328,11 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int irq_source_id, ret = 1; } else { int edge = (entry.fields.trig_mode == IOAPIC_EDGE_TRIG); + +if (rtc_irq_check_coalesced(ioapic, irq, line_status)) { +ret = 0; /* coalesced */ +goto out; +} ioapic-irr |= mask; if ((edge old_irr != ioapic-irr) || (!edge !entry.fields.remote_irr)) @@ -306,6 +340,7 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int irq_source_id, elseret = 0; /* report coalesced interrupt */ } +out: trace_kvm_ioapic_set_irq(entry.bits, irq, ret == 0); spin_unlock(ioapic-lock); @@ -333,6 +368,8 @@ static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu, if (ent-fields.vector != vector) continue; +if (i == RTC_GSI) +rtc_irq_eoi(ioapic, vcpu); /* * We are dropping lock while calling ack notifiers because ack * notifier callbacks for assigned devices call into IOAPIC -- 1.7.1 -- Gleb. Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v9 7/7] KVM: Use eoi to track RTC interrupt delivery status
Gleb Natapov wrote on 2013-04-11: On Thu, Apr 11, 2013 at 07:54:01AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-04-11: On Wed, Apr 10, 2013 at 09:22:20PM +0800, Yang Zhang wrote: From: Yang Zhang yang.z.zh...@intel.com Current interrupt coalescing logci which only used by RTC has conflict with Posted Interrupt. This patch introduces a new mechinism to use eoi to track interrupt: When delivering an interrupt to vcpu, the pending_eoi set to number of vcpu that received the interrupt. And decrease it when each vcpu writing eoi. No subsequent RTC interrupt can deliver to vcpu until all vcpus write eoi. Signed-off-by: Yang Zhang yang.z.zh...@intel.com --- virt/kvm/ioapic.c | 39 ++- 1 files changed, 38 insertions(+), 1 deletions(-) diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index a49fcd5..aeac154 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -147,6 +147,26 @@ static void kvm_rtc_eoi_tracking_restore_all(struct kvm_ioapic *ioapic) __rtc_irq_eoi_tracking_restore_one(vcpu); } +static void rtc_irq_eoi(struct kvm_ioapic *ioapic, struct kvm_vcpu *vcpu) +{ + if (test_and_clear_bit(vcpu-vcpu_id, ioapic-rtc_status.dest_map)) + --ioapic-rtc_status.pending_eoi; + + WARN_ON(ioapic-rtc_status.pending_eoi 0); +} + +static bool rtc_irq_check_coalesced(struct kvm_ioapic *ioapic, int irq, + bool line_status) +{ + if (irq != RTC_GSI || !line_status) + return false; Please move the check from rtc_irq_check_coalesced() to kvm_ioapic_set_irq() like this: if (irq == RTC_GSI line_status rtc_irq_check_coalesced(ioapic, irq, line_status)) I was going to fix it myself while applying, but since there will be new posted interrupt series anyway you can as well fix this one too. You mean fix it and send out it with posted interrupt series? Or just rebase the posted interrupt series on the top of this fix, but needn't to send out it? Send both series. RTC one with this change. Sure. Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v9 7/7] KVM: VMX: Use posted interrupt to deliver virtual interrupt
Gleb Natapov wrote on 2013-04-10: On Wed, Apr 10, 2013 at 09:22:57PM +0800, Yang Zhang wrote: From: Yang Zhang yang.z.zh...@intel.com If posted interrupt is avaliable, then uses it to inject virtual interrupt to guest. Signed-off-by: Yang Zhang yang.z.zh...@intel.com --- arch/x86/kvm/lapic.c | 32 +--- arch/x86/kvm/vmx.c |2 +- arch/x86/kvm/x86.c |1 + 3 files changed, 23 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 42a87ac..4fdb984 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -349,6 +349,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic) if (!apic-irr_pending) return -1; +kvm_x86_ops-sync_pir_to_irr(apic-vcpu); result = apic_search_irr(apic); ASSERT(result == -1 || result = 16); @@ -679,18 +680,27 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode, if (dest_map) __set_bit(vcpu-vcpu_id, dest_map); -result = !apic_test_and_set_irr(vector, apic); -trace_kvm_apic_accept_irq(vcpu-vcpu_id, delivery_mode, - trig_mode, vector, !result); -if (!result) { -if (trig_mode) -apic_debug(level trig mode repeatedly for -vector %d, vector); -break; -} +if (kvm_x86_ops-deliver_posted_interrupt) { +result = 1; +kvm_x86_ops-deliver_posted_interrupt(vcpu, vector); +} else { +result = !apic_test_and_set_irr(vector, apic); + +trace_kvm_apic_accept_irq(vcpu-vcpu_id, delivery_mode, +trig_mode, vector, !result); +if (!result) { +if (trig_mode) +apic_debug(level trig mode repeatedly +for vector %d, vector); +goto out; +} -kvm_make_request(KVM_REQ_EVENT, vcpu); -kvm_vcpu_kick(vcpu); +kvm_make_request(KVM_REQ_EVENT, vcpu); +kvm_vcpu_kick(vcpu); +} +out: +trace_kvm_apic_accept_irq(vcpu-vcpu_id, delivery_mode, +trig_mode, vector, !result); Sigh, now you trace it twice. Yes. Forget to remove the old one. :( break; case APIC_DM_REMRD: diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 314b2ed..52b21da 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -84,7 +84,7 @@ module_param(vmm_exclusive, bool, S_IRUGO); static bool __read_mostly fasteoi = 1; module_param(fasteoi, bool, S_IRUGO); -static bool __read_mostly enable_apicv; +static bool __read_mostly enable_apicv = 1; module_param(enable_apicv, bool, S_IRUGO); /* diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 72be079..486f627 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2685,6 +2685,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s) { + kvm_x86_ops-sync_pir_to_irr(vcpu); memcpy(s-regs, vcpu-arch.apic-regs, sizeof *s); return 0; -- 1.7.1 -- Gleb. Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v9 0/7] KVM: VMX: Add Posted Interrupt supporting
Gleb Natapov wrote on 2013-04-10: On Wed, Apr 10, 2013 at 09:22:50PM +0800, Yang Zhang wrote: From: Yang Zhang yang.z.zh...@intel.com The follwoing patches are adding the Posted Interrupt supporting to KVM: The first patch enables the feature 'acknowledge interrupt on vmexit'.Since it is required by Posted interrupt, we need to enable it firstly. And the subsequent patches are adding the posted interrupt supporting: Posted Interrupt allows APIC interrupts to inject into guest directly without any vmexit. - When delivering a interrupt to guest, if target vcpu is running, update Posted-interrupt requests bitmap and send a notification event to the vcpu. Then the vcpu will handle this interrupt automatically, without any software involvemnt. - If target vcpu is not running or there already a notification event pending in the vcpu, do nothing. The interrupt will be handled by next vm entry Changes from v8 to v9: * Add tracing in PI case when deliver interrupt. * Scan ioapic when updating SPIV register. Do not see it at the patch series. Have I missed it? The change is in forth patch: diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 6796218..4ccdc94 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -134,11 +134,7 @@ static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val) static_key_slow_inc(apic_sw_disabled.key); } apic_set_reg(apic, APIC_SPIV, val); -} - -static inline int apic_enabled(struct kvm_lapic *apic) -{ - return kvm_apic_sw_enabled(apic) kvm_apic_hw_enabled(apic); + kvm_make_request(KVM_REQ_SCAN_IOAPIC, apic-vcpu); } As you mentioned, since it will call apic_enabled() to check whether apic is enabled in vcpu_scan_ioapic. So we must ensure rescan ioapic when apic state changed. And I found recalculate_apic_map() doesn't track the enable/disable apic by software approach. So make_scan_ioapic_request in recalculate_apic_map() is not enough. We also should force rescan ioapic when apic state is changed via software approach(update spiv reg). * Rebase on top of KVM upstream + RTC eoi tracking patch. Changes from v7 to v8: * Remove unused memeber 'on' from struct pi_desc. * Register a dummy function to sync_pir_to_irr is apicv is disabled. * Minor fixup. * Rebase on top of KVM upstream + RTC eoi tracking patch. Yang Zhang (7): KVM: VMX: Enable acknowledge interupt on vmexit KVM: VMX: Register a new IPI for posted interrupt KVM: VMX: Check the posted interrupt capability KVM: Call common update function when ioapic entry changed. KVM: Set TMR when programming ioapic entry KVM: VMX: Add the algorithm of deliver posted interrupt KVM: VMX: Use posted interrupt to deliver virtual interrupt arch/ia64/kvm/lapic.h |6 - arch/x86/include/asm/entry_arch.h |4 + arch/x86/include/asm/hardirq.h |3 + arch/x86/include/asm/hw_irq.h |1 + arch/x86/include/asm/irq_vectors.h |5 + arch/x86/include/asm/kvm_host.h|3 + arch/x86/include/asm/vmx.h |4 + arch/x86/kernel/entry_64.S |5 + arch/x86/kernel/irq.c | 22 arch/x86/kernel/irqinit.c |4 + arch/x86/kvm/lapic.c | 66 arch/x86/kvm/lapic.h |7 ++ arch/x86/kvm/svm.c | 12 ++ arch/x86/kvm/vmx.c | 207 +++- arch/x86/kvm/x86.c | 19 +++- include/linux/kvm_host.h |4 +- virt/kvm/ioapic.c | 32 -- virt/kvm/ioapic.h |7 +- virt/kvm/irq_comm.c|4 +- virt/kvm/kvm_main.c |5 +- 20 files changed, 341 insertions(+), 79 deletions(-) -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v8 4/7] KVM: Add reset/restore rtc_status support
Gleb Natapov wrote on 2013-04-09: On Mon, Apr 08, 2013 at 10:17:46PM +0800, Yang Zhang wrote: From: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com --- arch/x86/kvm/lapic.c |9 +++ arch/x86/kvm/lapic.h |2 + virt/kvm/ioapic.c| 60 ++ virt/kvm/ioapic.h |1 + 4 files changed, 72 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 0b73402..6796218 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -94,6 +94,14 @@ static inline int apic_test_vector(int vec, void *bitmap) return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec)); } +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector) +{ +struct kvm_lapic *apic = vcpu-arch.apic; + +return apic_test_vector(vector, apic-regs + APIC_ISR) || +apic_test_vector(vector, apic-regs + APIC_IRR); +} + static inline void apic_set_vector(int vec, void *bitmap) { set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec)); @@ -1618,6 +1626,7 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu, apic-highest_isr_cache = -1; kvm_x86_ops-hwapic_isr_update(vcpu-kvm, apic_find_highest_isr(apic)); kvm_make_request(KVM_REQ_EVENT, vcpu); + kvm_rtc_eoi_tracking_restore_one(vcpu); } void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 3e5a431..16304b1 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -166,4 +166,6 @@ static inline bool kvm_apic_has_events(struct kvm_vcpu *vcpu) return vcpu-arch.apic-pending_events; } +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector); + #endif diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index 27ae8dd..4699180 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -90,6 +90,64 @@ static unsigned long ioapic_read_indirect(struct kvm_ioapic *ioapic, return result; } +static void rtc_irq_reset(struct kvm_ioapic *ioapic) rtc_irq_eoi_tracking_reset() Sure. +{ +ioapic-rtc_status.pending_eoi = 0; +bitmap_zero(ioapic-rtc_status.dest_map, KVM_MAX_VCPUS); +} + +static void __rtc_irq_eoi_tracking_restore_one(struct kvm_vcpu *vcpu, +int vector) +{ +bool new_val, old_val; +struct kvm_ioapic *ioapic = vcpu-kvm-arch.vioapic; +union kvm_ioapic_redirect_entry *e; + +e = ioapic-redirtbl[RTC_GSI]; +if (!kvm_apic_match_dest(vcpu, NULL, 0, e-fields.dest_id, +e-fields.dest_mode)) +return; + +new_val = kvm_apic_pending_eoi(vcpu, vector); +old_val = test_bit(vcpu-vcpu_id, ioapic-rtc_status.dest_map); + +if (new_val == old_val) +return; + +if (new_val) { +__set_bit(vcpu-vcpu_id, ioapic-rtc_status.dest_map); +ioapic-rtc_status.pending_eoi++; +} else { +__clear_bit(vcpu-vcpu_id, ioapic-rtc_status.dest_map); +ioapic-rtc_status.pending_eoi--; +} WARN_ON(ioapic-rtc_status.pending_eoi 0); Sure. +} + +void kvm_rtc_eoi_tracking_restore_one(struct kvm_vcpu *vcpu) +{ +struct kvm_ioapic *ioapic = vcpu-kvm-arch.vioapic; +int vector; + +vector = ioapic-redirtbl[RTC_GSI].fields.vector; Do not access ioapic outside of the lock. Also since you access ioapic-redirtbl[RTC_GSI] in __rtc_irq_eoi_tracking_restore_one() anyway what's the point passing vector to it? Right. +spin_lock(ioapic-lock); +__rtc_irq_eoi_tracking_restore_one(vcpu, vector); +spin_unlock(ioapic-lock); +} + +static void kvm_rtc_eoi_tracking_restore_all(struct kvm_ioapic *ioapic) +{ +struct kvm_vcpu *vcpu; +int i, vector; + +if (RTC_GSI = IOAPIC_NUM_PINS) +return; + +rtc_irq_reset(ioapic); +vector = ioapic-redirtbl[RTC_GSI].fields.vector; +kvm_for_each_vcpu(i, vcpu, ioapic-kvm) +__rtc_irq_eoi_tracking_restore_one(vcpu, vector); +} + static int ioapic_service(struct kvm_ioapic *ioapic, unsigned int idx) { union kvm_ioapic_redirect_entry *pent; @@ -428,6 +486,7 @@ void kvm_ioapic_reset(struct kvm_ioapic *ioapic) ioapic-ioregsel = 0; ioapic-irr = 0;ioapic-id = 0; + rtc_irq_reset(ioapic); update_handled_vectors(ioapic); } @@ -494,6 +553,7 @@ int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state) memcpy(ioapic, state, sizeof(struct kvm_ioapic_state)); update_handled_vectors(ioapic); kvm_ioapic_make_eoibitmap_request(kvm); + kvm_rtc_eoi_tracking_restore_all(ioapic); spin_unlock(ioapic-lock); return 0; } diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h index 761e5b5..313fc4e 100644 --- a/virt/kvm/ioapic.h +++ b/virt/kvm/ioapic.h @@ -79,6 +79,7 @@ static inline struct kvm_ioapic *ioapic_irqchip(struct kvm *kvm) return
RE: [PATCH v8 7/7] KVM: VMX: Use posted interrupt to deliver virtual interrupt
Gleb Natapov wrote on 2013-04-09: On Mon, Apr 08, 2013 at 10:23:22PM +0800, Yang Zhang wrote: From: Yang Zhang yang.z.zh...@intel.com If posted interrupt is avaliable, then uses it to inject virtual interrupt to guest. Signed-off-by: Yang Zhang yang.z.zh...@intel.com --- arch/x86/kvm/lapic.c | 29 ++--- arch/x86/kvm/vmx.c |2 +- arch/x86/kvm/x86.c |1 + 3 files changed, 20 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 8948979..46a4cca 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -353,6 +353,7 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic) if (!apic-irr_pending) return -1; +kvm_x86_ops-sync_pir_to_irr(apic-vcpu); result = apic_search_irr(apic); ASSERT(result == -1 || result = 16); @@ -683,18 +684,24 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode, if (dest_map) __set_bit(vcpu-vcpu_id, dest_map); -result = !apic_test_and_set_irr(vector, apic); -trace_kvm_apic_accept_irq(vcpu-vcpu_id, delivery_mode, - trig_mode, vector, !result); -if (!result) { -if (trig_mode) -apic_debug(level trig mode repeatedly for -vector %d, vector); -break; -} +if (kvm_x86_ops-deliver_posted_interrupt) { +result = 1; +kvm_x86_ops-deliver_posted_interrupt(vcpu, vector); +} else { +result = !apic_test_and_set_irr(vector, apic); + +trace_kvm_apic_accept_irq(vcpu-vcpu_id, delivery_mode, +trig_mode, vector, !result); Missed that in previous review. Do no drop tracing for PI case. Hmm. I remember I have added the tracing for PI case. Don't know why it is not existing in this patch. Anyway, I will add it again. +if (!result) { +if (trig_mode) +apic_debug(level trig mode repeatedly +for vector %d, vector); +break; +} -kvm_make_request(KVM_REQ_EVENT, vcpu); -kvm_vcpu_kick(vcpu); +kvm_make_request(KVM_REQ_EVENT, vcpu); +kvm_vcpu_kick(vcpu); +} break; case APIC_DM_REMRD: diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 3de2d7f..cd1c6ff 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -84,7 +84,7 @@ module_param(vmm_exclusive, bool, S_IRUGO); static bool __read_mostly fasteoi = 1; module_param(fasteoi, bool, S_IRUGO); -static bool __read_mostly enable_apicv; +static bool __read_mostly enable_apicv = 1; module_param(enable_apicv, bool, S_IRUGO); /* diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 72be079..486f627 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2685,6 +2685,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s) { + kvm_x86_ops-sync_pir_to_irr(vcpu); memcpy(s-regs, vcpu-arch.apic-regs, sizeof *s); return 0; -- 1.7.1 -- Gleb. Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v8 6/7] KVM: VMX: Add the algorithm of deliver posted interrupt
Paolo Bonzini wrote on 2013-04-10: Il 08/04/2013 16:23, Yang Zhang ha scritto: + * interrupt from PIR in next vmentry. + */ +static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector) +{ +struct vcpu_vmx *vmx = to_vmx(vcpu); +int r; + +if (pi_test_and_set_pir(vector, vmx-pi_desc)) +return; + +r = pi_test_and_set_on(vmx-pi_desc); +kvm_make_request(KVM_REQ_EVENT, vcpu); +if (!r (vcpu-mode == IN_GUEST_MODE)) +apic-send_IPI_mask(get_cpu_mask(vcpu-cpu), +POSTED_INTR_VECTOR); +else +kvm_vcpu_kick(vcpu); + +return; +} No need for this return. Right. Will remove it in next version. Best regards, Yang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v7 4/7] KVM: Add reset/restore rtc_status support
Gleb Natapov wrote on 2013-04-07: On Sun, Apr 07, 2013 at 01:16:51PM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-04-07: On Sun, Apr 07, 2013 at 01:05:02PM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-04-07: On Sun, Apr 07, 2013 at 12:39:32PM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-04-07: On Sun, Apr 07, 2013 at 02:30:15AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-04-04: On Mon, Apr 01, 2013 at 08:40:13AM +0800, Yang Zhang wrote: From: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com --- arch/x86/kvm/lapic.c |9 + arch/x86/kvm/lapic.h | 2 ++ virt/kvm/ioapic.c| 43 +++ virt/kvm/ioapic.h | 1 + 4 files changed, 55 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 96ab160..9c041fa 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -94,6 +94,14 @@ static inline int apic_test_vector(int vec, void *bitmap) return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec)); } +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector) +{ +struct kvm_lapic *apic = vcpu-arch.apic; + +return apic_test_vector(vector, apic-regs + APIC_ISR) || +apic_test_vector(vector, apic-regs + APIC_IRR); +} + static inline void apic_set_vector(int vec, void *bitmap) { set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec)); @@ -1665,6 +1673,7 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu, apic-highest_isr_cache = -1; kvm_x86_ops-hwapic_isr_update(vcpu-kvm, apic_find_highest_isr(apic)); kvm_make_request(KVM_REQ_EVENT, vcpu); +kvm_rtc_irq_restore(vcpu); } void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 967519c..004d2ad 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -170,4 +170,6 @@ static inline bool kvm_apic_has_events(struct kvm_vcpu *vcpu) return vcpu-arch.apic-pending_events; } +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector); + #endif diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index 8664812..0b12b17 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -90,6 +90,47 @@ static unsigned long ioapic_read_indirect(struct kvm_ioapic *ioapic, return result; } +static void rtc_irq_reset(struct kvm_ioapic *ioapic) +{ +ioapic-rtc_status.pending_eoi = 0; +bitmap_zero(ioapic-rtc_status.dest_map, KVM_MAX_VCPUS); +} + +static void rtc_irq_restore(struct kvm_ioapic *ioapic) +{ +struct kvm_vcpu *vcpu; +int vector, i, pending_eoi = 0; + +if (RTC_GSI = IOAPIC_NUM_PINS) +return; + +vector = ioapic-redirtbl[RTC_GSI].fields.vector; +kvm_for_each_vcpu(i, vcpu, ioapic-kvm) { +if (kvm_apic_pending_eoi(vcpu, vector)) { +pending_eoi++; +__set_bit(vcpu-vcpu_id, ioapic-rtc_status.dest_map); You should cleat dest_map at the beginning to get rid of stale bits. I thought kvm_set_ioapic is called only after save/restore or migration. And the ioapic should be reset successfully before call it. So the dest_map is empty before call rtc_irq_restore(). But it is possible kvm_set_ioapic is called beside save/restore or migration. Right? First of all userspace should not care when it calls kvm_set_ioapic() the kernel need to do the right thing. Second, believe it or not, kvm_ioapic_reset() is not called during system reset. Instead userspace reset it by calling kvm_set_ioapic() with ioapic state after reset. Ok. I see. As the logic you suggested, it will clear dest_map if no pending eoi in vcpu, so we don't need to do it again. You again rely on userspace doing thing in certain manner. What is set_lapic() is never called? Kernel internal state have to be correct after each ioctl call. Sorry. I cannot figure out what's the problem if don't clear dest_map? Can you elaborate it? What is not obvious about it? If there is a bit in dest_map that should be cleared after rtc_irq_restore() it will not. Why it will not? If new_val is false, and the old_val is true. __clear_bit() will clear the dest_map. Am I wrong? Ah, yes with new logic since we go over all vcpus and calculate new value for each one in theory it should be fine, but if we add cpu destruction this will be no longer true. new_val = kvm_apic_pending_eoi(vcpu, vector); Which reminds me there are more bugs in the current code. It is not enough to call kvm_apic_pending_eoi() to check the new value. You need to see if the entry is masked and vcpu is the destination of the interrupt too. No. kvm_apic_pending_eoi() is the right way. IOAPIC entry may change after vcpu received it before issue EOI, and we should not rely on the entry. For example: vcpu A received the interrupt, pending in IRR mask entry migration happened The only
RE: [PATCH v7 4/7] KVM: Add reset/restore rtc_status support
Gleb Natapov wrote on 2013-04-08: On Mon, Apr 08, 2013 at 11:21:34AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-04-07: On Sun, Apr 07, 2013 at 01:16:51PM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-04-07: On Sun, Apr 07, 2013 at 01:05:02PM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-04-07: On Sun, Apr 07, 2013 at 12:39:32PM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-04-07: On Sun, Apr 07, 2013 at 02:30:15AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-04-04: On Mon, Apr 01, 2013 at 08:40:13AM +0800, Yang Zhang wrote: From: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com --- arch/x86/kvm/lapic.c |9 + arch/x86/kvm/lapic.h | 2 ++ virt/kvm/ioapic.c| 43 +++ virt/kvm/ioapic.h | 1 + 4 files changed, 55 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 96ab160..9c041fa 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -94,6 +94,14 @@ static inline int apic_test_vector(int vec, void *bitmap) return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec)); } +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector) +{ + struct kvm_lapic *apic = vcpu-arch.apic; + + return apic_test_vector(vector, apic-regs + APIC_ISR) || + apic_test_vector(vector, apic-regs + APIC_IRR); +} + static inline void apic_set_vector(int vec, void *bitmap) { set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec)); @@ -1665,6 +1673,7 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu, apic-highest_isr_cache = -1; kvm_x86_ops-hwapic_isr_update(vcpu-kvm, apic_find_highest_isr(apic)); kvm_make_request(KVM_REQ_EVENT, vcpu); + kvm_rtc_irq_restore(vcpu); } void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 967519c..004d2ad 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -170,4 +170,6 @@ static inline bool kvm_apic_has_events(struct kvm_vcpu *vcpu) return vcpu-arch.apic-pending_events; } +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector); + #endif diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index 8664812..0b12b17 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -90,6 +90,47 @@ static unsigned long ioapic_read_indirect(struct kvm_ioapic *ioapic, return result; } +static void rtc_irq_reset(struct kvm_ioapic *ioapic) +{ + ioapic-rtc_status.pending_eoi = 0; + bitmap_zero(ioapic-rtc_status.dest_map, KVM_MAX_VCPUS); +} + +static void rtc_irq_restore(struct kvm_ioapic *ioapic) +{ + struct kvm_vcpu *vcpu; +int vector, i, pending_eoi = 0; + + if (RTC_GSI = IOAPIC_NUM_PINS) + return; + + vector = ioapic-redirtbl[RTC_GSI].fields.vector; + kvm_for_each_vcpu(i, vcpu, ioapic-kvm) { + if (kvm_apic_pending_eoi(vcpu, vector)) { + pending_eoi++; + __set_bit(vcpu-vcpu_id, ioapic-rtc_status.dest_map); You should cleat dest_map at the beginning to get rid of stale bits. I thought kvm_set_ioapic is called only after save/restore or migration. And the ioapic should be reset successfully before call it. So the dest_map is empty before call rtc_irq_restore(). But it is possible kvm_set_ioapic is called beside save/restore or migration. Right? First of all userspace should not care when it calls kvm_set_ioapic() the kernel need to do the right thing. Second, believe it or not, kvm_ioapic_reset() is not called during system reset. Instead userspace reset it by calling kvm_set_ioapic() with ioapic state after reset. Ok. I see. As the logic you suggested, it will clear dest_map if no pending eoi in vcpu, so we don't need to do it again. You again rely on userspace doing thing in certain manner. What is set_lapic() is never called? Kernel internal state have to be correct after each ioctl call. Sorry. I cannot figure out what's the problem if don't clear dest_map? Can you elaborate it? What is not obvious about it? If there is a bit in dest_map that should be cleared after rtc_irq_restore() it will not. Why it will not? If new_val is false, and the old_val is true. __clear_bit() will clear the dest_map. Am I wrong? Ah, yes with new logic since we go over all vcpus and calculate new value for each one in theory it should be fine, but if we add cpu destruction this will be no longer true. new_val = kvm_apic_pending_eoi(vcpu, vector); Which reminds me there are more bugs in the current code. It is not enough to call kvm_apic_pending_eoi() to check the new value. You need to see if the entry is masked and vcpu is the destination of the interrupt too. No. kvm_apic_pending_eoi() is the right way. IOAPIC entry may change after vcpu received it before issue EOI, and we should not rely on the entry. For example: vcpu A received
RE: [PATCH v7 4/7] KVM: Add reset/restore rtc_status support
Gleb Natapov wrote on 2013-04-07: On Sun, Apr 07, 2013 at 02:30:15AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-04-04: On Mon, Apr 01, 2013 at 08:40:13AM +0800, Yang Zhang wrote: From: Yang Zhang yang.z.zh...@intel.com Signed-off-by: Yang Zhang yang.z.zh...@intel.com --- arch/x86/kvm/lapic.c |9 + arch/x86/kvm/lapic.h |2 ++ virt/kvm/ioapic.c| 43 +++ virt/kvm/ioapic.h | 1 + 4 files changed, 55 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 96ab160..9c041fa 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -94,6 +94,14 @@ static inline int apic_test_vector(int vec, void *bitmap) return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec)); } +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector) +{ + struct kvm_lapic *apic = vcpu-arch.apic; + + return apic_test_vector(vector, apic-regs + APIC_ISR) || + apic_test_vector(vector, apic-regs + APIC_IRR); +} + static inline void apic_set_vector(int vec, void *bitmap) { set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec)); @@ -1665,6 +1673,7 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu, apic-highest_isr_cache = -1; kvm_x86_ops-hwapic_isr_update(vcpu-kvm, apic_find_highest_isr(apic)); kvm_make_request(KVM_REQ_EVENT, vcpu); + kvm_rtc_irq_restore(vcpu); } void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 967519c..004d2ad 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -170,4 +170,6 @@ static inline bool kvm_apic_has_events(struct kvm_vcpu *vcpu) return vcpu-arch.apic-pending_events; } +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector); + #endif diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index 8664812..0b12b17 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -90,6 +90,47 @@ static unsigned long ioapic_read_indirect(struct kvm_ioapic *ioapic, return result; } +static void rtc_irq_reset(struct kvm_ioapic *ioapic) +{ + ioapic-rtc_status.pending_eoi = 0; + bitmap_zero(ioapic-rtc_status.dest_map, KVM_MAX_VCPUS); +} + +static void rtc_irq_restore(struct kvm_ioapic *ioapic) +{ + struct kvm_vcpu *vcpu; + int vector, i, pending_eoi = 0; + + if (RTC_GSI = IOAPIC_NUM_PINS) + return; + + vector = ioapic-redirtbl[RTC_GSI].fields.vector; + kvm_for_each_vcpu(i, vcpu, ioapic-kvm) { + if (kvm_apic_pending_eoi(vcpu, vector)) { + pending_eoi++; + __set_bit(vcpu-vcpu_id, ioapic-rtc_status.dest_map); You should cleat dest_map at the beginning to get rid of stale bits. I thought kvm_set_ioapic is called only after save/restore or migration. And the ioapic should be reset successfully before call it. So the dest_map is empty before call rtc_irq_restore(). But it is possible kvm_set_ioapic is called beside save/restore or migration. Right? First of all userspace should not care when it calls kvm_set_ioapic() the kernel need to do the right thing. Second, believe it or not, kvm_ioapic_reset() is not called during system reset. Instead userspace reset it by calling kvm_set_ioapic() with ioapic state after reset. Ok. I see. As the logic you suggested, it will clear dest_map if no pending eoi in vcpu, so we don't need to do it again. + } + } + ioapic-rtc_status.pending_eoi = pending_eoi; +} + +void kvm_rtc_irq_restore(struct kvm_vcpu *vcpu) +{ + struct kvm_ioapic *ioapic = vcpu-kvm-arch.vioapic; + int vector; + + if (!ioapic) + return; + Can this be called if ioapic == NULL? Yes. IIRC, unit test will test lapic function without ioapic. Unit test is a guest code. This has nothing to do with a guest code. Unit test is not the one who creates lapic. Ok. Then !ioapic is unnecessary. Should check for if (RTC_GSI = IOAPIC_NUM_PINS) here too. Not necessary. kvm_rtc_irq_restore is called from arch/x86/ and we have the defination: #ifdef CONFIG_X86 #define RTC_GSI 8 The check will be false always. As the logic you suggested below, this check is necessary for _all() not _one(); OK. + spin_lock(ioapic-lock); + vector = ioapic-redirtbl[RTC_GSI].fields.vector; + if (kvm_apic_pending_eoi(vcpu, vector)) { + __set_bit(vcpu-vcpu_id, ioapic-rtc_status.dest_map); + ioapic-rtc_status.pending_eoi++; The bit may have been set already. The logic should be: Right. new_val = kvm_apic_pending_eoi(vcpu, vector) old_val = set_bit(vcpu_id, dest_map) if (new_val == old_val) return; if (new_val) { __set_bit(vcpu_id, dest_map); pending_eoi++; } else { __clear_bit(vcpu_id, dest_map); pending_eoi--; } The naming of above two functions are not good either. Call them something like kvm_rtc_eoi_tracking_restore_all