Re: KVM call for agenda for 2013-10-01
On 09/24/2013 05:09 PM, Juan Quintela wrote:

Hi,

Please send any topic that you are interested in covering. Last week I forgot to send the call for topics, so we still have a topic there.

Thanks, Juan.

Agenda so far:
- Talk about qemu reverse execution (the first description was done this week):
  how to handle IO when we want to do reverse execution, and how does this
  relate to Kemari's needs? And to the icount changes?

Hi Juan,

I'm sorry, but I won't be able to attend the call tomorrow due to travel.

Regards,
Orit

Call details:
10:00 AM to 11:00 AM EDT
Every two weeks

If you need phone number details, contact me privately.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 01/14] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
On 07/31/2013 05:48 PM, Gleb Natapov wrote:

From: Nadav Har'El n...@il.ibm.com

Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577, switches the EFER MSR when EPT is used and the host and guest have different NX bits. So if we add support for nested EPT (an L1 guest using EPT to run L2) and want to be able to run recent KVM as L1, we need to allow L1 to use this EFER switching feature.

To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available, and if it isn't, it falls back to the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former (the latter is still unsupported).

Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state, respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all that's left to do in this patch is to properly advertise this feature to L1.

Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using vmx_set_efer (which itself sets one of several vmcs02 fields), so we always support this feature, regardless of whether the host supports it.
Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/vmx.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e999dc7..27efa6a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2198,7 +2198,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 #else
 	nested_vmx_exit_ctls_high = 0;
 #endif
-	nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
+	nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
+				      VM_EXIT_LOAD_IA32_EFER);

 	/* entry controls */
 	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
@@ -2207,8 +2208,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 	nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
 	nested_vmx_entry_ctls_high &=
 		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
-	nested_vmx_entry_ctls_high |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
-
+	nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
+				       VM_ENTRY_LOAD_IA32_EFER);
 	/* cpu-based controls */
 	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
 		nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
@@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);

-	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
-	vmcs_write32(VM_EXIT_CONTROLS,
-		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
-	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
+	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
+	 * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
+	 * bits are further modified by vmx_set_efer() below.
+	 */
+	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
+
+	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
+	 * emulated by vmx_set_efer(), below.
+	 */
+	vmcs_write32(VM_ENTRY_CONTROLS,
+		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
+			~VM_ENTRY_IA32E_MODE) |
+		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));

 	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)

Reviewed-by: Orit Wasserman owass...@redhat.com
Re: [PATCH v5 02/14] nEPT: Fix cr3 handling in nested exit and entry
On 07/31/2013 05:48 PM, Gleb Natapov wrote:

From: Nadav Har'El n...@il.ibm.com

The existing code for handling cr3 and related VMCS fields during nested exit and entry wasn't correct in all cases:

If L2 is allowed to control cr3 (and this is indeed the case in nested EPT), during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and we forgot to do so. This patch adds this copy.

If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and whoever does control cr3 (L1 or L2) is using PAE, the processor might have saved PDPTEs and we should also save them in vmcs12 (and restore them later).

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/vmx.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 27efa6a..4e98764 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7595,6 +7595,16 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
 	kvm_mmu_reset_context(vcpu);

+	/*
+	 * L1 may access the L2's PDPTR, so save them to construct vmcs12
+	 */
+	if (enable_ept) {
+		vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
+		vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
+		vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
+		vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
+	}
+
 	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
 	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
 }
@@ -7917,6 +7927,22 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	vmcs12->guest_pending_dbg_exceptions =
 		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);

+	/*
+	 * In some cases (usually, nested EPT), L2 is allowed to change its
+	 * own CR3 without exiting. If it has changed it, we must keep it.
+	 * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
+	 * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
+	 *
+	 * Additionally, restore L2's PDPTR to vmcs12.
+	 */
+	if (enable_ept) {
+		vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
+		vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+		vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+		vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+		vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+	}
+
 	vmcs12->vm_entry_controls =
 		(vmcs12->vm_entry_controls & ~VM_ENTRY_IA32E_MODE) |
 		(vmcs_read32(VM_ENTRY_CONTROLS) & VM_ENTRY_IA32E_MODE);

Reviewed-by: Orit Wasserman owass...@redhat.com
Re: [PATCH v5 03/14] nEPT: Fix wrong test in kvm_set_cr3
On 07/31/2013 05:48 PM, Gleb Natapov wrote:

From: Nadav Har'El n...@il.ibm.com

kvm_set_cr3() attempts to check whether the new cr3 is a valid guest physical address. The problem is that with nested EPT, cr3 is an *L2* physical address, not an L1 physical address as this test expects. As the comment above this test explains, it isn't necessary, and doesn't correspond to anything a real processor would do. So this patch removes it.

Note that this wrong test could also have theoretically caused problems in nested NPT, not just in nested EPT. However, in practice, the problem was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus circumventing the problem. Additional potential calls to the buggy function are avoided in that we don't trap cr3 modifications when nested NPT is enabled. However, because in nested VMX we did want to use kvm_set_cr3() (as requested in Avi Kivity's review of the original nested VMX patches), we can't avoid this problem and need to fix it.

Reviewed-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/x86.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d2caeb9..e2fef8b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -682,17 +682,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 		 */
 	}

-	/*
-	 * Does the new cr3 value map to physical memory? (Note, we
-	 * catch an invalid cr3 even in real-mode, because it would
-	 * cause trouble later on when we turn on paging anyway.)
-	 *
-	 * A real CPU would silently accept an invalid cr3 and would
-	 * attempt to use it - with largely undefined (and often hard
-	 * to debug) behavior on the guest side.
-	 */
-	if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
-		return 1;
 	vcpu->arch.cr3 = cr3;
 	__set_bit(VCPU_EXREG_CR3, (ulong *)vcpu->arch.regs_avail);
 	vcpu->arch.mmu.new_cr3(vcpu);

Reviewed-by: Orit Wasserman owass...@redhat.com
Re: [PATCH 0/11] KVM: nVMX: shadow VMCS support, v5
On 04/18/2013 02:34 PM, Abel Gordon wrote:

This series of patches implements shadow-vmcs capability for nested VMX.

Shadow-vmcs - background and overview:

In Intel VMX, the vmread and vmwrite privileged instructions are used by the hypervisor to read and modify the guest and host specifications (VMCS). In a nested virtualization environment, L1 executes multiple vmread and vmwrite instructions to handle a single L2 exit. Each vmread and vmwrite executed by L1 traps (causes an exit) to the L0 hypervisor (KVM). L0 emulates the instruction behaviour and resumes L1 execution. Removing the need to trap and emulate these special instructions reduces the number of exits and improves nested virtualization performance. As first evaluated in [1], exit-less vmread and vmwrite can reduce nested virtualization overhead by up to 40%.

Intel introduced a new feature to their processors called shadow-vmcs. Using shadow-vmcs, L0 can configure the processor to let L1, running in guest mode, access VMCS12 fields using vmread and vmwrite instructions but without causing an exit to L0. The VMCS12 fields' data is stored in a shadow-vmcs controlled by L0.

Shadow-vmcs - design considerations:

A shadow-vmcs is processor-dependent and must be accessed by L0 or L1 using vmread and vmwrite instructions. With nested virtualization we aim to abstract the hardware from the L1 hypervisor. Thus, to avoid hardware dependencies we preferred to keep the software-defined VMCS12 format as part of the L1 address space and hold the processor-specific shadow-vmcs format only in the L0 address space. In other words, the shadow-vmcs is used by L0 as an accelerator, but the format and content are never exposed to L1 directly. L0 syncs the content of the processor-specific shadow vmcs with the content of the software-controlled VMCS12 format.
We could have kept the processor-specific shadow-vmcs format in the L1 address space to avoid using the software-defined VMCS12 format; however, that type of design/implementation would have created hardware dependencies and complicated other capabilities (e.g. live migration of L1).

Changes since v1:
1) Added a sync_shadow_vmcs flag used to indicate when the content of VMCS12 must be copied to the shadow vmcs. The flag value is checked during vmx_vcpu_run.
2) Code quality improvements

Changes since v2:
1) Allocate the shadow vmcs only once per VCPU on handle_vmxon and re-use the same instance for multiple VMCS12s
2) More code quality improvements

Changes since v3:
1) Fixed VMXON emulation (new patch). Previous nVMX code didn't verify whether L1 is already in root mode (VMXON was previously called). Now we call nested_vmx_failValid if VMX is already ON. This is required to avoid host leaks (due to shadow vmcs allocation) if L1 repeatedly executes VMXON.
2) Improved comment: clarified that we do not shadow fields that are modified when L1 executes vmx instructions, like the VM_INSTRUCTION_ERROR field.

Changes since v4:
1) Fixed free_nested: we now free the shadow vmcs also when there is no current vmcs.

Acknowledgments: many thanks to
Natapov, Gleb g...@redhat.com
Xu, Dongxiao dongxiao...@intel.com
Nakajima, Jun jun.nakaj...@intel.com
Har'El, Nadav na...@harel.org.il
for the insightful discussions, comments and reviews.

These patches were easily created and maintained using Patchouli -- patch creator
http://patchouli.sourceforge.net/

[1] The Turtles Project: Design and Implementation of Nested Virtualization,
http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf

Hard to keep up :)

Reviewed-by: Orit Wasserman owass...@redhat.com
Re: [PATCH 0/11] KVM: nVMX: shadow VMCS support, v1
On 03/10/2013 06:03 PM, Abel Gordon wrote:

This series of patches implements shadow-vmcs capability for nested VMX.

Shadow-vmcs - background and overview:

In Intel VMX, the vmread and vmwrite privileged instructions are used by the hypervisor to read and modify the guest and host specifications (VMCS). In a nested virtualization environment, L1 executes multiple vmread and vmwrite instructions to handle a single L2 exit. Each vmread and vmwrite executed by L1 traps (causes an exit) to the L0 hypervisor (KVM). L0 emulates the instruction behaviour and resumes L1 execution. Removing the need to trap and emulate these special instructions reduces the number of exits and improves nested virtualization performance. As first evaluated in [1], exit-less vmread and vmwrite can reduce nested virtualization overhead by up to 40%.

Intel introduced a new feature to their processors called shadow-vmcs. Using shadow-vmcs, L0 can configure the processor to let L1, running in guest mode, access VMCS12 fields using vmread and vmwrite instructions but without causing an exit to L0. The VMCS12 fields' data is stored in a shadow-vmcs controlled by L0.

Shadow-vmcs - design considerations:

A shadow-vmcs is processor-dependent and must be accessed by L0 or L1 using vmread and vmwrite instructions. With nested virtualization we aim to abstract the hardware from the L1 hypervisor. Thus, to avoid hardware dependencies we preferred to keep the software-defined VMCS12 format as part of the L1 address space and hold the processor-specific shadow-vmcs format only in the L0 address space. In other words, the shadow-vmcs is used by L0 as an accelerator, but the format and content are never exposed to L1 directly. L0 syncs the content of the processor-specific shadow vmcs with the content of the software-controlled VMCS12 format.
We could have kept the processor-specific shadow-vmcs format in the L1 address space to avoid using the software-defined VMCS12 format; however, that type of design/implementation would have created hardware dependencies and complicated other capabilities (e.g. live migration of L1).

Acknowledgments: many thanks to
Xu, Dongxiao dongxiao...@intel.com
Nakajima, Jun jun.nakaj...@intel.com
Har'El, Nadav na...@harel.org.il
for the insightful discussions, comments and reviews.

These patches were easily created and maintained using Patchouli -- patch creator
http://patchouli.sourceforge.net/

[1] The Turtles Project: Design and Implementation of Nested Virtualization,
http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf

Reviewed-by: Orit Wasserman owass...@redhat.com

By the way, do you have some performance results? How much does it improve nested?

Orit
Re: [PATCH v2 4/4] nested vmx: use a list to store the launched vmcs12 for L1 VMM
On 11/28/2012 02:29 AM, Marcelo Tosatti wrote:

On Thu, Nov 22, 2012 at 12:51:59PM +0800, Dongxiao Xu wrote:
The launch state is not a member in the VMCS area; use a separate variable (list) to store it instead.
Signed-off-by: Dongxiao Xu dongxiao...@intel.com

1. What is the problem with keeping the launched state in the VMCS?

Assuming there is a positive answer to the above:

2. Don't you have to change the VMCS ID?

3. Can't it be kept somewhere other than a list? The current scheme allows the guest to allocate unlimited amounts of host memory.

I agree with Marcelo: you have to limit the number of VMCSs in the list, otherwise it will be easy to attack a host with nested :)

4. What is the state of migration / nested vmx again? If vmcs12 is migrated, this means the launched state is not migrated anymore.

Patches 1-3 seem fine.
Re: [PATCH 1/4] nested vmx: clean up for vmcs12 read and write
On 11/21/2012 11:04 AM, Dongxiao Xu wrote:

Abstract vmcs12_read and vmcs12_write functions to do the vmcs12 read/write operations.

Signed-off-by: Dongxiao Xu dongxiao...@intel.com
---
 arch/x86/kvm/vmx.c | 86 ++++++++++++++++++++++++----------------------
 1 files changed, 45 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f858159..d8670e4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5407,32 +5407,67 @@ static inline int vmcs_field_readonly(unsigned long field)
  * some of the bits we return here (e.g., on 32-bit guests, only 32 bits of
  * 64-bit fields are to be returned).
  */
-static inline bool vmcs12_read_any(struct kvm_vcpu *vcpu,
-				   unsigned long field, u64 *ret)
+static inline u64 vmcs12_read(struct kvm_vcpu *vcpu, unsigned long field)
 {
 	short offset = vmcs_field_to_offset(field);
 	char *p;

-	if (offset < 0)
+	if (offset < 0) {
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
 		return 0;
+	}

 	p = ((char *)(get_vmcs12(vcpu))) + offset;

 	switch (vmcs_field_type(field)) {
 	case VMCS_FIELD_TYPE_NATURAL_WIDTH:
-		*ret = *((natural_width *)p);
+		return *((natural_width *)p);
+	case VMCS_FIELD_TYPE_U16:
+		return *((u16 *)p);
+	case VMCS_FIELD_TYPE_U32:
+		return *((u32 *)p);
+	case VMCS_FIELD_TYPE_U64:
+		return *((u64 *)p);
+	default:
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 0; /* can never happen. */
+	}
+}
+
+static inline int vmcs12_write(struct kvm_vcpu *vcpu,
+			       unsigned long field,
+			       u64 value)
+{
+	short offset = vmcs_field_to_offset(field);
+	char *p;
+
+	if (offset < 0) {
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 0;
+	}
+
+	p = ((char *)(get_vmcs12(vcpu))) + offset;
+
+	switch (vmcs_field_type(field)) {
+	case VMCS_FIELD_TYPE_NATURAL_WIDTH:
+		*(natural_width *)p = value;
 		return 1;
 	case VMCS_FIELD_TYPE_U16:
-		*ret = *((u16 *)p);
+		*(u16 *)p = value;
 		return 1;
 	case VMCS_FIELD_TYPE_U32:
-		*ret = *((u32 *)p);
+		*(u32 *)p = value;
 		return 1;
 	case VMCS_FIELD_TYPE_U64:
-		*ret = *((u64 *)p);
+		*(u64 *)p = value;
 		return 1;
 	default:
-		return 0; /* can never happen. */
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 0;
 	}
 }

@@ -5466,11 +5501,7 @@ static int handle_vmread(struct kvm_vcpu *vcpu)
 	/* Decode instruction info and find the field to read */
 	field = kvm_register_read(vcpu, (((vmx_instruction_info) >> 28) & 0xf));
 	/* Read the field, zero-extended to a u64 field_value */
-	if (!vmcs12_read_any(vcpu, field, &field_value)) {
-		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
-		skip_emulated_instruction(vcpu);
-		return 1;
-	}
+	field_value = vmcs12_read(vcpu, field);

If you had an error here, you will still continue running the function, you will override the failure (by calling nested_vmx_succeed) and skip the emulated instruction twice. You need to detect that there was an error and return (like in vmwrite).

Orit

 	/*
 	 * Now copy part of this value to register or memory, as requested.
 	 * Note that the number of bits actually copied is 32 or 64 depending
@@ -5500,8 +5531,6 @@ static int handle_vmwrite(struct kvm_vcpu *vcpu)
 	gva_t gva;
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
-	char *p;
-	short offset;
 	/* The value to write might be 32 or 64 bits, depending on L1's long
 	 * mode, and eventually we need to write that into a field of several
 	 * possible lengths. The code below first zero-extends the value to 64
@@ -5537,33 +5566,8 @@ static int handle_vmwrite(struct kvm_vcpu *vcpu)
 		skip_emulated_instruction(vcpu);
 		return 1;
 	}
-
-	offset = vmcs_field_to_offset(field);
-	if (offset < 0) {
-		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
-		skip_emulated_instruction(vcpu);
-		return 1;
-	}
-	p = ((char *) get_vmcs12(vcpu)) + offset;
-
-	switch (vmcs_field_type(field)) {
-	case VMCS_FIELD_TYPE_U16:
-
Re: [PATCH v3 00/35] postcopy live migration
Hi,

I didn't have time yet to review your patches in detail, but I have one general comment about the interface used to activate postcopy. As postcopy needs to be supported by both the source and destination QEMU, for these kinds of features we have the migration capabilities interface; you can look at the XBZRLE patch series for more details. So in order to activate postcopy the user will need to do:

  migrate_set_capability postcopy on

on the source and destination QEMU before starting the migration process.

Regards,
Orit

On 10/30/2012 10:32 AM, Isaku Yamahata wrote:

This is the v3 patch series of postcopy migration.

The trees are available at
git://github.com/yamahata/qemu.git qemu-postcopy-oct-30-2012
git://github.com/yamahata/linux-umem.git linux-umem-oct-29-2012

Major changes v2 -> v3:
- implemented pre+post optimization
- auto detection of postcopy by the incoming side
- using threads on the destination instead of fork
- using blocking io instead of select + non-blocking io loop
- less memory overhead
- various improvements and code simplification
- kernel module name change umem -> uvmem to avoid a name conflict

Patch organization:
1-2: trivial fixes
3-5: preparation for threading; cherry-picked from the migration tree
6-18: refactoring existing code and preparation
19-25: implement postcopy live migration itself (essential part)
26-35: optimization/heuristics for postcopy

Usage
=====
You need to load the uvmem character device on the host before starting migration. Postcopy can be used with the tcg and kvm accelerators. The implementation depends only on the linux uvmem character device, but the driver-dependent code is split into its own file. I tested only the host page size == guest page size case, but the implementation allows host page size != guest page size.

The following options are added with this patch series.

- incoming part: use -incoming as usual. Postcopy is automatically detected.
  example:
  qemu -incoming tcp:0: -monitor stdio -machine accel=kvm

- outgoing part: options for the migrate command
  migrate [-p [-n] [-m]] URI [precopy count [prefault forward [prefault backward]]]

  Newly added options/arguments:
  -p: indicate postcopy migration
  -n: disable background transferring of pages (for benchmark/debugging)
  -m: move background transfer of postcopy mode
  precopy count: the number of precopy RAM scans before postcopy; default 0
                 (0 means no precopy)
  prefault forward: the number of forward pages which are sent with on-demand
  prefault backward: the number of backward pages which are sent with on-demand

  example:
  migrate -p -n tcp:<dest ip address>:
  migrate -p -n -m tcp:<dest ip address>: 42 42 0

TODO
- benchmark/evaluation
- improvement/optimization. At the moment, what I'm aware of at least is:
  - pre+post case: on the destination side, reading the dirty bitmap would
    cause long latency; create a thread for that
  - consider the FUSE/CUSE possibility

Basic postcopy work flow:
1. qemu on the destination opens /dev/uvmem and issues UVMEM_INIT; it then
   holds two file descriptors, one to the uvmem device and one to a shmem file.
2. qemu creates a pipe to communicate with the umem threads and creates the
   threads; the threads mmap the shmem file, while qemu mmaps the uvmem
   device for guest RAM and closes the shmem file.
3. qemu waits for a ready message from the threads over the pipe and answers
   ok; the threads then take over ownership of the socket to the source, and
   the postcopy stage is entered.
4. Guest execution starts. When the guest accesses a not-yet-transferred page,
   the page fault blocks the vcpu; a read() on the uvmem device returns the
   faulted page offset, which is used to request that page on demand.
Re: [Qemu-devel] Block Migration and xbzrle
On 10/02/2012 10:33 AM, lieven-li...@dlh.net wrote:

Orit Wasserman wrote:
On 09/16/2012 01:39 PM, Peter Lieven wrote:
Hi,
I remember that this was broken some time ago, and currently with qemu-kvm 1.2.0 I am still not able to use block migration plus XBZRLE. The migration fails if both are used together. XBZRLE without block migration works. Can someone please advise what the current expected behaviour is?

XBZRLE only works on guest memory, so it shouldn't be affected by block migration. What is the error you are getting? What command line?

Meanwhile I can confirm that it happens both with and without block migration. I observe 2 errors:
a) qemu: warning: error while loading state section id 2
   load of migration failed
b) the vm does not enter running state after migration.

The command line:
/usr/bin/qemu-kvm-1.2.0 -net tap,vlan=798,script=no,downscript=no,ifname=tap1 -net nic,vlan=798,model=e1000,macaddr=52:54:00:ff:01:15 -drive format=host_device,file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-d85f4e007-3f30017ce11505df-ubuntu-tools-hd0,if=virtio,cache=none,aio=native -m 4096 -smp 2,sockets=1,cores=2,threads=1 -monitor tcp:0:4002,server,nowait -vnc :2 -qmp tcp:0:3002,server,nowait -name 'Ubuntu-Tools' -boot order=dc,menu=off -k de -incoming tcp:172.21.55.34:5002 -pidfile /var/run/qemu/vm-250.pid -mem-path /hugepages -mem-prealloc -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus -cpu host,+x2apic,model_id='Intel(R) Xeon(R) CPU L5640 @ 2.27GHz',-tsc

Migration with -cpu host is very problematic, because the source and destination can have different CPUs, resulting in different CPU features. Does regular migration work with this setup? Can you try with a different CPU type? What are the source and destination /proc/cpuinfo outputs?

Cheers,
Orit

Thanks,
Peter
Re: [Qemu-devel] Block Migration and xbzrle
On 10/02/2012 11:30 AM, Peter Lieven wrote:

On 02.10.2012 at 11:28, Orit Wasserman wrote:
On 10/02/2012 10:33 AM, lieven-li...@dlh.net wrote:
Orit Wasserman wrote:
On 09/16/2012 01:39 PM, Peter Lieven wrote:
Hi,
I remember that this was broken some time ago, and currently with qemu-kvm 1.2.0 I am still not able to use block migration plus XBZRLE. The migration fails if both are used together. XBZRLE without block migration works. Can someone please advise what the current expected behaviour is?

XBZRLE only works on guest memory, so it shouldn't be affected by block migration. What is the error you are getting? What command line?

Meanwhile I can confirm that it happens both with and without block migration. I observe 2 errors:
a) qemu: warning: error while loading state section id 2
   load of migration failed

Did you enable XBZRLE on the destination also? (migrate_set_capability xbzrle on)

Orit

b) the vm does not enter running state after migration.

The command line:
/usr/bin/qemu-kvm-1.2.0 -net tap,vlan=798,script=no,downscript=no,ifname=tap1 -net nic,vlan=798,model=e1000,macaddr=52:54:00:ff:01:15 -drive format=host_device,file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-d85f4e007-3f30017ce11505df-ubuntu-tools-hd0,if=virtio,cache=none,aio=native -m 4096 -smp 2,sockets=1,cores=2,threads=1 -monitor tcp:0:4002,server,nowait -vnc :2 -qmp tcp:0:3002,server,nowait -name 'Ubuntu-Tools' -boot order=dc,menu=off -k de -incoming tcp:172.21.55.34:5002 -pidfile /var/run/qemu/vm-250.pid -mem-path /hugepages -mem-prealloc -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus -cpu host,+x2apic,model_id='Intel(R) Xeon(R) CPU L5640 @ 2.27GHz',-tsc

Migration with -cpu host is very problematic, because the source and destination can have different CPUs, resulting in different CPU features. Does regular migration work with this setup? Can you try with a different CPU type? What are the source and destination /proc/cpuinfo outputs?

The CPUs are identical; we also check that the flags and cpu types match if the cpu type is set to host. Regular migration does work.

BR,
Peter
Re: [Qemu-devel] Block Migration and xbzrle
On 09/16/2012 01:39 PM, Peter Lieven wrote:

Hi,
I remember that this was broken some time ago, and currently with qemu-kvm 1.2.0 I am still not able to use block migration plus XBZRLE. The migration fails if both are used together. XBZRLE without block migration works. Can someone please advise what the current expected behaviour is?

XBZRLE only works on guest memory, so it shouldn't be affected by block migration. What is the error you are getting? What command line?

Regards,
Orit

Thanks,
Peter
Re: [Qemu-devel] [PATCH v2 00/41] postcopy live migration
On 06/04/2012 03:37 PM, Anthony Liguori wrote:

On 06/04/2012 05:57 PM, Isaku Yamahata wrote:

After a long time, we have v2. This is the qemu part; the linux kernel part is sent separately.

Changes v1 -> v2:
- split up patches for review
- buffered file refactored
- many bug fixes; especially, PV drivers can now work with postcopy
- optimization/heuristics

Patches
1 - 30: refactoring existing code and preparation
31 - 37: implement postcopy itself (essential part)
38 - 41: some optimization/heuristics for postcopy

Intro
=====
This patch series implements postcopy live migration.[1] As discussed at KVM Forum 2011, a dedicated character device is used for distributed shared memory between the migration source and destination. Now we can discuss/benchmark/compare with precopy. I believe there is much room for improvement.

[1] http://wiki.qemu.org/Features/PostCopyLiveMigration

Usage
=====
You need to load the umem character device on the host before starting migration. Postcopy can be used with the tcg and kvm accelerators. The implementation depends only on the linux umem character device, but the driver-dependent code is split into its own file. I tested only the host page size == guest page size case, but the implementation allows host page size != guest page size.

The following options are added with this patch series.
- incoming part command line options -postcopy [-postcopy-flags <flags>] where flags is for changing behavior for benchmark/debugging Currently the following flags are available 0: default 1: enable touching page request example: qemu -postcopy -incoming tcp:0: -monitor stdio -machine accel=kvm - outgoing part options for migrate command migrate [-p [-n] [-m]] URI [prefault forward [prefault backward]] -p: indicate postcopy migration -n: disable background transferring pages: This is for benchmark/debugging -m: move background transfer of postcopy mode prefault forward: The number of forward pages which are sent on demand prefault backward: The number of backward pages which are sent on demand example: migrate -p -n tcp:dest ip address: migrate -p -n -m tcp:dest ip address: 32 0 TODO - benchmark/evaluation. Especially how async page fault affects the result. I don't mean to beat on a dead horse, but I really don't understand the point of postcopy migration other than the fact that it's possible. It's a lot of code and a new ABI in an area where we already have too much difficulty maintaining our ABI. Without a compelling real world case with supporting benchmarks for why we need postcopy and cannot improve precopy, I'm against merging this. Hi Anthony, The example is quite simple: let's look at a 300G guest that is dirtying 10 percent of its memory every second (for example SAP ...). Even if we have a 30G/s network we will need 1 second of downtime for this guest, and many workloads time out with that kind of downtime. The guests are getting bigger and bigger, so for those big guests the only way to do live migration is using post copy.
I agree we are losing reliability with post copy, but we can try to limit the risk: - do a full copy of the guest ram (precopy) and then switch to post copy only for the updates - the user will use a private LAN, maybe with redundancy, which is much safer - maybe back up the memory to storage so in case of network failure we can recover In the end it is up to the user; he can decide what he is willing to risk. The default of course should always be precopy live migration; maybe we should even have a different command for post copy. In the end I can see some users that will have no choice but to use post copy live migration, or stop the guests in order to move them to another host. Regards, Orit Regards, Anthony Liguori
[PATCH] Fix migration of real mode guests from a host with unrestricted guest to a host without it.
For example migration between Westmere and Nehalem hosts. The patch fixes the guest segments similarly to the enter_rmode function. Signed-off-by: Orit Wasserman owass...@rehdat.com --- arch/x86/kvm/vmx.c | 38 ++ 1 files changed, 38 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 32eb588..2eca18f 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3229,6 +3229,44 @@ static void vmx_set_segment(struct kvm_vcpu *vcpu, vmcs_write32(sf->ar_bytes, ar); __clear_bit(VCPU_EXREG_CPL, (ulong *)vcpu->arch.regs_avail); + + /* +* Fix segments for real mode guest in hosts that don't have +* unrestricted_mode or it was disabled. +* This is done to allow migration of the guests from hosts with +* unrestricted guest, like Westmere, to older hosts that don't have +* unrestricted guest, like Nehalem. +*/ + if (!enable_unrestricted_guest && vmx->rmode.vm86_active) { + switch (seg) { + case VCPU_SREG_CS: + vmcs_write32(GUEST_CS_AR_BYTES, 0xf3); + vmcs_write32(GUEST_CS_LIMIT, 0xffff); + if (vmcs_readl(GUEST_CS_BASE) == 0xffff0000) + vmcs_writel(GUEST_CS_BASE, 0xf0000); + vmcs_write16(GUEST_CS_SELECTOR, + vmcs_readl(GUEST_CS_BASE) >> 4); + break; + case VCPU_SREG_ES: + fix_rmode_seg(VCPU_SREG_ES, &vmx->rmode.es); + break; + case VCPU_SREG_DS: + fix_rmode_seg(VCPU_SREG_DS, &vmx->rmode.ds); + break; + case VCPU_SREG_GS: + fix_rmode_seg(VCPU_SREG_GS, &vmx->rmode.gs); + break; + case VCPU_SREG_FS: + fix_rmode_seg(VCPU_SREG_FS, &vmx->rmode.fs); + break; + case VCPU_SREG_SS: + vmcs_write16(GUEST_SS_SELECTOR, + vmcs_readl(GUEST_SS_BASE) >> 4); + vmcs_write32(GUEST_SS_LIMIT, 0xffff); + vmcs_write32(GUEST_SS_AR_BYTES, 0xf3); + break; + } + } } static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l) -- 1.7.7.6
[PATCH v2] Fix migration of real mode guests from a host with unrestricted guest to a host without it.
For example migration between Westmere and Nehalem hosts. The code that fixes the segments for a real mode guest was moved from enter_rmode to vmx_set_segment. enter_rmode now calls vmx_set_segment for each segment. Signed-off-by: Orit Wasserman owass...@rehdat.com --- arch/x86/kvm/vmx.c | 70 +++- 1 files changed, 58 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 32eb588..548c9b5 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -615,6 +615,10 @@ static void kvm_cpu_vmxon(u64 addr); static void kvm_cpu_vmxoff(void); static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3); static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr); +static void vmx_set_segment(struct kvm_vcpu *vcpu, + struct kvm_segment *var, int seg); +static void vmx_get_segment(struct kvm_vcpu *vcpu, + struct kvm_segment *var, int seg); static DEFINE_PER_CPU(struct vmcs *, vmxarea); static DEFINE_PER_CPU(struct vmcs *, current_vmcs); @@ -2770,6 +2774,7 @@ static void enter_rmode(struct kvm_vcpu *vcpu) { unsigned long flags; struct vcpu_vmx *vmx = to_vmx(vcpu); + struct kvm_segment var; if (enable_unrestricted_guest) return; @@ -2813,20 +2818,23 @@ static void enter_rmode(struct kvm_vcpu *vcpu) if (emulate_invalid_guest_state) goto continue_rmode; - vmcs_write16(GUEST_SS_SELECTOR, vmcs_readl(GUEST_SS_BASE) >> 4); - vmcs_write32(GUEST_SS_LIMIT, 0xffff); - vmcs_write32(GUEST_SS_AR_BYTES, 0xf3); + vmx_get_segment(vcpu, &var, VCPU_SREG_SS); + vmx_set_segment(vcpu, &var, VCPU_SREG_SS); + + vmx_get_segment(vcpu, &var, VCPU_SREG_CS); + vmx_set_segment(vcpu, &var, VCPU_SREG_CS); + + vmx_get_segment(vcpu, &var, VCPU_SREG_ES); + vmx_set_segment(vcpu, &var, VCPU_SREG_ES); - vmcs_write32(GUEST_CS_AR_BYTES, 0xf3); - vmcs_write32(GUEST_CS_LIMIT, 0xffff); - if (vmcs_readl(GUEST_CS_BASE) == 0xffff0000) - vmcs_writel(GUEST_CS_BASE, 0xf0000); - vmcs_write16(GUEST_CS_SELECTOR, vmcs_readl(GUEST_CS_BASE) >> 4); + vmx_get_segment(vcpu, &var, VCPU_SREG_DS); + vmx_set_segment(vcpu, &var, VCPU_SREG_DS); - fix_rmode_seg(VCPU_SREG_ES, &vmx->rmode.es); - fix_rmode_seg(VCPU_SREG_DS, &vmx->rmode.ds); - fix_rmode_seg(VCPU_SREG_GS, &vmx->rmode.gs); - fix_rmode_seg(VCPU_SREG_FS, &vmx->rmode.fs); + vmx_get_segment(vcpu, &var, VCPU_SREG_GS); + vmx_set_segment(vcpu, &var, VCPU_SREG_GS); + + vmx_get_segment(vcpu, &var, VCPU_SREG_FS); + vmx_set_segment(vcpu, &var, VCPU_SREG_FS); continue_rmode: kvm_mmu_reset_context(vcpu); @@ -3229,6 +3237,44 @@ static void vmx_set_segment(struct kvm_vcpu *vcpu, vmcs_write32(sf->ar_bytes, ar); __clear_bit(VCPU_EXREG_CPL, (ulong *)vcpu->arch.regs_avail); + + /* +* Fix segments for real mode guest in hosts that don't have +* unrestricted_mode or it was disabled. +* This is done to allow migration of the guests from hosts with +* unrestricted guest, like Westmere, to older hosts that don't have +* unrestricted guest, like Nehalem. +*/ + if (!enable_unrestricted_guest && vmx->rmode.vm86_active) { + switch (seg) { + case VCPU_SREG_CS: + vmcs_write32(GUEST_CS_AR_BYTES, 0xf3); + vmcs_write32(GUEST_CS_LIMIT, 0xffff); + if (vmcs_readl(GUEST_CS_BASE) == 0xffff0000) + vmcs_writel(GUEST_CS_BASE, 0xf0000); + vmcs_write16(GUEST_CS_SELECTOR, + vmcs_readl(GUEST_CS_BASE) >> 4); + break; + case VCPU_SREG_ES: + fix_rmode_seg(VCPU_SREG_ES, &vmx->rmode.es); + break; + case VCPU_SREG_DS: + fix_rmode_seg(VCPU_SREG_DS, &vmx->rmode.ds); + break; + case VCPU_SREG_GS: + fix_rmode_seg(VCPU_SREG_GS, &vmx->rmode.gs); + break; + case VCPU_SREG_FS: + fix_rmode_seg(VCPU_SREG_FS, &vmx->rmode.fs); + break; + case VCPU_SREG_SS: + vmcs_write16(GUEST_SS_SELECTOR, + vmcs_readl(GUEST_SS_BASE) >> 4); + vmcs_write32(GUEST_SS_LIMIT, 0xffff); + vmcs_write32(GUEST_SS_AR_BYTES, 0xf3); + break; + } + } } static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l) -- 1.7.7.6
Re: Nested virtualization on Intel does not work - second level freezes when third level is starting
On 04/11/2012 03:44 PM, Guido Winkelmann wrote: Hi, Nested virtualization on Intel does not work for me with qemu-kvm. As soon as the third layer OS (second virtualised) is starting the Linux kernel, the entire second layer freezes up. The last thing I can see on the console of the third layer system before it freezes is "Decompressing Linux..." (no "done", though). When starting without the nofb option, the kernel still manages to set the screen resolution before freezing. Grub/Syslinux still work, but are extremely slow. Both the first layer OS (i.e. the one running on bare metal) and the second layer OS are 64-bit Fedora 16 with Kernel 3.3.1-3.fc16.x86_64. On both the first and second layer OS, the kvm_intel modules are loaded with the nested=Y parameter. (I've also tried with nested=N in the second layer. Didn't change anything.) Qemu-kvm was originally the Fedora-shipped 0.14, but I have since upgraded to 1.0. (Using rpmbuild with the specfile and patches from http://pkgs.fedoraproject.org/gitweb/?p=qemu.git;a=blob;f=qemu.spec;hb=HEAD) The second layer machine has this CPU specification in libvirt on the first layer OS: <cpu mode='custom' match='exact'> <model fallback='allow'>Nehalem</model> <feature policy='require' name='vmx'/> </cpu> which results in this qemu command line (from libvirt's logs): LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-kvm -S -M pc-0.15 -cpu kvm64,+lahf_lm,+popcnt,+sse4.2,+sse4.1,+ssse3,+vmx -enable-kvm -m 8192 -smp 8,sockets=8,cores=1,threads=1 -name vshost1 -uuid 192b8c4b-0ded-07aa-2545-d7fef4cd897f -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vshost1.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -no-acpi -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/data/vshost1.img,if=none,id=drive-virtio-disk0,format=qcow2 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-drive file=/data/Fedora-16-x86_64-netinst.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=21,id=hostnet0,vhost=on,vhostfd=22 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:84:7d:46,bus=pci.0,addr=0x3 -netdev tap,fd=23,id=hostnet1,vhost=on,vhostfd=24 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:84:8d:46,bus=pci.0,addr=0x4 -vnc 127.0.0.1:0,password -k de -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 I have also tried some other combinations for the cpu element, like changing the model to core2duo and/or including all the features reported by libvirt's capabilities command. The third level machine does not have a cpu element in libvirt, and its command line looks like this: LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-kvm -S -M pc-0.14 -enable-kvm -m 8192 -smp 4,sockets=4,cores=1,threads=1 -name gentoo -uuid 3cdcc902-4520-df25-92ac-31ca5c707a50 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/gentoo.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-acpi -drive file=/data/gentoo.img,if=none,id=drive-virtio-disk0,format=qcow2 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=/data/install-amd64-minimal-20120223.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1 -netdev tap,fd=23,id=hostnet0,vhost=on,vhostfd=24 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:84:6d:46,bus=pci.0,addr=0x3 -usb -vnc 127.0.0.1:0,password -k de -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 The third layer OS is a recent Gentoo minimal install (amd64), but somehow I don't think that matters at this point...
The metal is a Dell PowerEdge R710 server with two Xeon E5520 CPUs. I've tried updating the machine's BIOS and other firmware to the latest version. That took a lot of time and a lot of searching on Dell websites, but didn't change anything. Does anyone have any idea what might be going wrong here, or how I could debug this further? I'm not sure if this is the problem, but I noticed that the second layer and the third layer have the same memory size (8G); how about trying to reduce the memory for the third layer? Orit Guido
Re: Nested virtualization on Intel does not work - second level freezes when third level is starting
On 04/11/2012 04:43 PM, Guido Winkelmann wrote: On Wednesday, 11 April 2012, 16:29:55, you wrote: I'm not sure if this is the problem, but I noticed that the second layer and the third layer have the same memory size (8G); how about trying to reduce the memory for the third layer? I tried reducing the third layer to 1G. That didn't change anything. There is a patch fixing nVMX in 3.4: http://www.mail-archive.com/kvm@vger.kernel.org/msg68951.html you can try it. Orit Guido
Re: Nested virtualization on Intel does not work - second level freezes when third level is starting
On 04/11/2012 09:37 PM, Guido Winkelmann wrote: On Wednesday, 11 April 2012, 17:25:12, Orit Wasserman wrote: On 04/11/2012 04:43 PM, Guido Winkelmann wrote: On Wednesday, 11 April 2012, 16:29:55, you wrote: I'm not sure if this is the problem, but I noticed that the second layer and the third layer have the same memory size (8G); how about trying to reduce the memory for the third layer? I tried reducing the third layer to 1G. That didn't change anything. There is a patch fixing nVMX in 3.4: http://www.mail-archive.com/kvm@vger.kernel.org/msg68951.html you can try it. That worked, though the VM inside the VM is still very slow... You can try Nadav's nEPT (nested EPT) patches; they should help with the performance. Orit Guido
Re: Nested virtualization on Intel does not work - second level freezes when third level is starting
On 04/11/2012 08:00 PM, Guido Winkelmann wrote: On Wednesday, 11 April 2012, 17:25:12, you wrote: On 04/11/2012 04:43 PM, Guido Winkelmann wrote: On Wednesday, 11 April 2012, 16:29:55, you wrote: I'm not sure if this is the problem, but I noticed that the second layer and the third layer have the same memory size (8G); how about trying to reduce the memory for the third layer? I tried reducing the third layer to 1G. That didn't change anything. There is a patch fixing nVMX in 3.4: http://www.mail-archive.com/kvm@vger.kernel.org/msg68951.html you can try it. Should I use that patch on the host, or on the first virtualized layer, or on both? (Compiling 3.4-rc2 for both for now...) On the host (we refer to it as L0, by the way). Orit Guido
Re: [Qemu-devel] [PATCH v5 1/4] sockets: introduce set_socket_error()
On 03/22/2012 05:52 AM, Amos Kong wrote: Introduce set_socket_error() to set the errno, using WSASetLastError() for win32. Sometimes cleanup work would rewrite errno in the error path; we can use this function to restore the real errno. Signed-off-by: Amos Kong ak...@redhat.com --- qemu_socket.h | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/qemu_socket.h b/qemu_socket.h index fe4cf6c..a4c5170 100644 --- a/qemu_socket.h +++ b/qemu_socket.h @@ -8,6 +8,7 @@ #include <ws2tcpip.h> #define socket_error() WSAGetLastError() +#define set_socket_error(e) WSASetLastError(e) #undef EINTR #define EWOULDBLOCK WSAEWOULDBLOCK #define EINTR WSAEINTR @@ -26,6 +27,7 @@ int inet_aton(const char *cp, struct in_addr *ia); #include <sys/un.h> #define socket_error() errno +#define set_socket_error(e) errno = e How about creating a function set_errno(int e) and using it in the macro? Orit #define closesocket(s) close(s) #endif /* !_WIN32 */
Re: [Qemu-devel] [PATCH v5 2/4] qemu-socket: change inet_connect() to support nonblock socket
On 03/22/2012 05:52 AM, Amos Kong wrote: Change inet_connect(const char *str, int socktype) to inet_connect(const char *str, bool block); socktype is unused, and block selects whether the socket is blocking or nonblocking. Retry the connect when -EINTR/-EWOULDBLOCK is returned. Connect is successful for a nonblocking socket when the following errors are returned: -EINPROGRESS -WSAEALREADY/-WSAEINVAL (win32) Add a bool entry (block) to dummy_opts to tag the block type. Use set_socket_error() to set the real errno; set errno to EINVAL for a parse error. Change nbd, vnc to use the new interface. Signed-off-by: Amos Kong ak...@redhat.com --- nbd.c | 2 +- qemu-sockets.c | 66 +++- qemu_socket.h | 2 +- ui/vnc.c | 2 +- 4 files changed, 54 insertions(+), 18 deletions(-) diff --git a/nbd.c b/nbd.c index 567e94e..3618344 100644 --- a/nbd.c +++ b/nbd.c @@ -146,7 +146,7 @@ int tcp_socket_outgoing(const char *address, uint16_t port) int tcp_socket_outgoing_spec(const char *address_and_port) { -return inet_connect(address_and_port, SOCK_STREAM); +return inet_connect(address_and_port, true); } int tcp_socket_incoming(const char *address, uint16_t port) diff --git a/qemu-sockets.c b/qemu-sockets.c index 6bcb8e3..908479e 100644 --- a/qemu-sockets.c +++ b/qemu-sockets.c @@ -51,6 +51,9 @@ static QemuOptsList dummy_opts = { },{ .name = "ipv6", .type = QEMU_OPT_BOOL, +},{ +.name = "block", +.type = QEMU_OPT_BOOL, }, { /* end if list */ } }, @@ -201,7 +204,8 @@ int inet_connect_opts(QemuOpts *opts) const char *port; char uaddr[INET6_ADDRSTRLEN+1]; char uport[33]; -int sock,rc; +int sock, rc, err; +bool block; memset(&ai, 0, sizeof(ai)); ai.ai_flags = AI_CANONNAME | AI_ADDRCONFIG; @@ -210,9 +214,11 @@ int inet_connect_opts(QemuOpts *opts) addr = qemu_opt_get(opts, "host"); port = qemu_opt_get(opts, "port"); +block = qemu_opt_get_bool(opts, "block", 0); if (addr == NULL || port == NULL) { fprintf(stderr, "inet_connect: host and/or port not specified\n"); -return -1; +err = -EINVAL; +goto err; I would call set_socket_error with EINVAL and then return -1; } if (qemu_opt_get_bool(opts, "ipv4", 0)) @@ -224,7 +230,8 @@ int inet_connect_opts(QemuOpts *opts) if (0 != (rc = getaddrinfo(addr, port, &ai, &res))) { fprintf(stderr, "getaddrinfo(%s,%s): %s\n", addr, port, gai_strerror(rc)); - return -1; +err = -EINVAL; +goto err; I would call set_socket_error with EINVAL and then return -1; } for (e = res; e != NULL; e = e->ai_next) { @@ -241,21 +248,44 @@ int inet_connect_opts(QemuOpts *opts) continue; } setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, (void*)&on, sizeof(on)); - +if (!block) { +socket_set_nonblock(sock); +} /* connect to peer */ -if (connect(sock, e->ai_addr, e->ai_addrlen) < 0) { -if (NULL == e->ai_next) -fprintf(stderr, "%s: connect(%s,%s,%s,%s): %s\n", __FUNCTION__, -inet_strfamily(e->ai_family), -e->ai_canonname, uaddr, uport, strerror(errno)); -closesocket(sock); -continue; +do { +err = 0; +if (connect(sock, e->ai_addr, e->ai_addrlen) < 0) { +err = -socket_error(); +} +} while (err == -EINTR || err == -EWOULDBLOCK); + This is only relevant for nonblocking sockets; it should look something like this: while ((rc = connect(sock, e->ai_addr, e->ai_addrlen)) < 0 && !block && (socket_error() == EINTR || socket_error() == EWOULDBLOCK)) +if (err >= 0) { +goto success; +} else if (!block && err == -EINPROGRESS) { +goto success; +#ifdef _WIN32 +} else if (!block && (err == -WSAEALREADY || err == -WSAEINVAL)) { +goto success; +#endif I prefer to check for error; the code should look like: if (rc < 0 && (block || #ifdef _WIN32 (socket_error() != WSAEALREADY && socket_error() != WSAEINVAL))) { #else (socket_error() != EINPROGRESS))) { #endif if (NULL == e->ai_next) fprintf(stderr, "%s: connect(%s,%s,%s,%s): %s\n", __func__, inet_strfamily(e->ai_family), e->ai_canonname, uaddr, uport, strerror(errno)); closesocket(sock); continue; } ... Orit } -freeaddrinfo(res); -return sock; + +if (NULL == e->ai_next) { +fprintf(stderr, "%s: connect(%s,%s,%s,%s): %s\n", __func__, +inet_strfamily(e->ai_family), +e->ai_canonname, uaddr, uport,
Re: [Qemu-devel] [PATCH v4 1/2] qemu-socket: change inet_connect() to support nonblock socket
On 03/19/2012 04:11 PM, Amos Kong wrote: Change inet_connect(const char *str, int socktype) to inet_connect(const char *str, bool block, int *sock_err); socktype is unused, block selects whether the socket is blocking or nonblocking, and sock_err is used to restore the socket error. It is more common to call the parameter blocking/nonblocking. You removed socktype (I noticed it was not implemented); can you keep the same API but fix the implementation? I don't like the use of sock_err. How about adding a function set_socket_error that will set the errno; for win32 you can use SetLastError, and you will be able to read the error using the socket_error function. Cheers, Orit Connect is successful for a nonblocking socket when the following errors are returned: -EINPROGRESS or -EWOULDBLOCK -WSAEALREADY or -WSAEINVAL (win32) Also change the wrapper function inet_connect_opts(QemuOpts *opts) to inet_connect_opts(QemuOpts *opts, int *sock_err). Add a bool entry (block) to dummy_opts to tag the block type. Change nbd, vnc to use the new interface.
Signed-off-by: Amos Kong ak...@redhat.com --- nbd.c | 2 +- qemu-char.c | 2 +- qemu-sockets.c | 73 qemu_socket.h | 4 ++- ui/vnc.c | 2 +- 5 files changed, 62 insertions(+), 21 deletions(-) diff --git a/nbd.c b/nbd.c index 567e94e..ad4de06 100644 --- a/nbd.c +++ b/nbd.c @@ -146,7 +146,7 @@ int tcp_socket_outgoing(const char *address, uint16_t port) int tcp_socket_outgoing_spec(const char *address_and_port) { -return inet_connect(address_and_port, SOCK_STREAM); +return inet_connect(address_and_port, true, NULL); } int tcp_socket_incoming(const char *address, uint16_t port) diff --git a/qemu-char.c b/qemu-char.c index bb9e3f5..d3543ea 100644 --- a/qemu-char.c +++ b/qemu-char.c @@ -2443,7 +2443,7 @@ static CharDriverState *qemu_chr_open_socket(QemuOpts *opts) if (is_listen) { fd = inet_listen_opts(opts, 0); } else { -fd = inet_connect_opts(opts); +fd = inet_connect_opts(opts, NULL); } } if (fd < 0) { diff --git a/qemu-sockets.c b/qemu-sockets.c index 6bcb8e3..8ed45f8 100644 --- a/qemu-sockets.c +++ b/qemu-sockets.c @@ -51,6 +51,9 @@ static QemuOptsList dummy_opts = { },{ .name = "ipv6", .type = QEMU_OPT_BOOL, +},{ +.name = "block", +.type = QEMU_OPT_BOOL, }, { /* end if list */ } }, @@ -194,14 +197,15 @@ listen: return slisten; } -int inet_connect_opts(QemuOpts *opts) +int inet_connect_opts(QemuOpts *opts, int *sock_err) { struct addrinfo ai,*res,*e; const char *addr; const char *port; char uaddr[INET6_ADDRSTRLEN+1]; char uport[33]; -int sock,rc; +int sock, rc, err; +bool block; memset(&ai, 0, sizeof(ai)); ai.ai_flags = AI_CANONNAME | AI_ADDRCONFIG; @@ -210,9 +214,11 @@ int inet_connect_opts(QemuOpts *opts) addr = qemu_opt_get(opts, "host"); port = qemu_opt_get(opts, "port"); +block = qemu_opt_get_bool(opts, "block", 0); if (addr == NULL || port == NULL) { fprintf(stderr, "inet_connect: host and/or port not specified\n"); -return -1; +err = -EINVAL; +goto err; } if (qemu_opt_get_bool(opts, "ipv4", 0)) @@ -224,7 +230,8 @@ int inet_connect_opts(QemuOpts *opts) if (0 != (rc = getaddrinfo(addr, port, &ai, &res))) { fprintf(stderr, "getaddrinfo(%s,%s): %s\n", addr, port, gai_strerror(rc)); - return -1; +err = -EINVAL; +goto err; } for (e = res; e != NULL; e = e->ai_next) { @@ -241,21 +248,52 @@ int inet_connect_opts(QemuOpts *opts) continue; } setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, (void*)&on, sizeof(on)); - +if (!block) { +socket_set_nonblock(sock); +} /* connect to peer */ -if (connect(sock, e->ai_addr, e->ai_addrlen) < 0) { -if (NULL == e->ai_next) -fprintf(stderr, "%s: connect(%s,%s,%s,%s): %s\n", __FUNCTION__, -inet_strfamily(e->ai_family), -e->ai_canonname, uaddr, uport, strerror(errno)); -closesocket(sock); -continue; +do { +err = 0; +if (connect(sock, e->ai_addr, e->ai_addrlen) < 0) { +err = -socket_error(); +if (block) { +break; +} +} +} while (err == -EINTR); + +if (err >= 0) { +goto success; +} else if (!block && (err == -EINPROGRESS || err == -EWOULDBLOCK)) { +goto success; +#ifdef _WIN32 +} else if (!block && (sock_err == -WSAEALREADY || + sock_err == -WSAEINVAL)) {
Re: [Qemu-devel] [PATCH v3 1/9] net: introduce tcp_server_start()
On 03/07/2012 12:47 AM, Amos Kong wrote: Introduce tcp_server_start() by moving the original code in tcp_start_incoming_migration(). Signed-off-by: Amos Kong ak...@redhat.com --- net.c | 28 qemu_socket.h | 2 ++ 2 files changed, 30 insertions(+), 0 deletions(-) diff --git a/net.c b/net.c index c34474f..e90ff23 100644 --- a/net.c +++ b/net.c @@ -99,6 +99,34 @@ static int get_str_sep(char *buf, int buf_size, const char **pp, int sep) return 0; } +int tcp_server_start(const char *str, int *fd) +{ +int val, ret; +struct sockaddr_in saddr; + +if (parse_host_port(&saddr, str) < 0) { +error_report("invalid host/port combination: %s", str); +return -EINVAL; +} + +*fd = qemu_socket(PF_INET, SOCK_STREAM, 0); +if (fd < 0) { +perror("socket"); +return -1; return -socket_error(); +} +socket_set_nonblock(*fd); In incoming migration we don't use a non-blocking socket. How about a flag to the function for non-blocking? + +/* allow fast reuse */ +val = 1; +setsockopt(*fd, SOL_SOCKET, SO_REUSEADDR, (const char *)&val, sizeof(val)); + +ret = bind(*fd, (struct sockaddr *)&saddr, sizeof(saddr)); +if (ret < 0) { +closesocket(*fd); +} +return ret; if (bind(*fd, (struct sockaddr *)&saddr, sizeof(saddr)) < 0) { closesocket(*fd); return -socket_error(); } return 0; and then you will not need ret +} + int parse_host_port(struct sockaddr_in *saddr, const char *str) { char buf[512]; diff --git a/qemu_socket.h b/qemu_socket.h index fe4cf6c..d612793 100644 --- a/qemu_socket.h +++ b/qemu_socket.h @@ -54,6 +54,8 @@ int unix_listen(const char *path, char *ostr, int olen); int unix_connect_opts(QemuOpts *opts); int unix_connect(const char *path); +int tcp_server_start(const char *str, int *fd); + /* Old, ipv4 only bits. Don't use for new code. */ int parse_host_port(struct sockaddr_in *saddr, const char *str); int socket_init(void); Orit
Re: [Qemu-devel] [PATCH v3 3/9] net: introduce tcp_client_start()
On 03/07/2012 12:48 AM, Amos Kong wrote: Introduce tcp_client_start() by moving the original code in tcp_start_outgoing_migration(). Signed-off-by: Amos Kong ak...@redhat.com --- net.c | 41 + qemu_socket.h | 1 + 2 files changed, 42 insertions(+), 0 deletions(-) diff --git a/net.c b/net.c index e90ff23..9afb0d1 100644 --- a/net.c +++ b/net.c @@ -127,6 +127,47 @@ int tcp_server_start(const char *str, int *fd) return ret; } +int tcp_client_start(const char *str, int *fd) +{ +struct sockaddr_in saddr; +int ret; + +*fd = -1; +if (parse_host_port(&saddr, str) < 0) { +error_report("invalid host/port combination: %s", str); +return -EINVAL; +} + +*fd = qemu_socket(PF_INET, SOCK_STREAM, 0); +if (fd < 0) { +perror("socket"); +return -1; return -socket_error(); +} +socket_set_nonblock(*fd); + +for (;;) { +ret = connect(*fd, (struct sockaddr *)&saddr, sizeof(saddr)); +if (ret < 0) { +ret = -socket_error(); +if (ret == -EINPROGRESS) { +break; +#ifdef _WIN32 +} else if (ret == -WSAEALREADY || ret == -WSAEINVAL) { +break; +#endif +} else if (ret != -EINTR && ret != -EWOULDBLOCK) { +perror("connect"); +closesocket(*fd); +return ret; +} +} else { +break; +} +} + +return ret; +} + int parse_host_port(struct sockaddr_in *saddr, const char *str) { char buf[512]; diff --git a/qemu_socket.h b/qemu_socket.h index d612793..9246578 100644 --- a/qemu_socket.h +++ b/qemu_socket.h @@ -55,6 +55,7 @@ int unix_connect_opts(QemuOpts *opts); int unix_connect(const char *path); int tcp_server_start(const char *str, int *fd); +int tcp_client_start(const char *str, int *fd); /* Old, ipv4 only bits. Don't use for new code. */ int parse_host_port(struct sockaddr_in *saddr, const char *str); Orit
Re: [Qemu-devel] [PATCH v3 1/9] net: introduce tcp_server_start()
On 03/14/2012 09:51 AM, Amos Kong wrote: On 14/03/12 15:27, Paolo Bonzini wrote: Hi Paolo, On 14/03/2012 08:14, Orit Wasserman wrote: if (bind(*fd, (struct sockaddr *)&saddr, sizeof(saddr)) < 0) { closesocket(*fd); return -socket_error(); } return 0; and then you will not need ret But closesocket could clobber socket_error(), no? Completely correct. Yes, it will affect socket_error(). How about this fix? ret = bind(*fd, (struct sockaddr *)&saddr, sizeof(saddr)); if (ret < 0) { ret = -socket_error(); closesocket(*fd); } return ret; } Looks good. Orit
Re: [Qemu-devel] [PATCH v2 1/9] net: introduce tcp_server_start()
On 03/05/2012 12:03 PM, Amos Kong wrote: Introduce tcp_server_start() by moving the original code in tcp_start_incoming_migration(). Signed-off-by: Amos Kong ak...@redhat.com --- net.c | 27 +++ qemu_socket.h | 2 ++ 2 files changed, 29 insertions(+), 0 deletions(-) diff --git a/net.c b/net.c index c34474f..0260968 100644 --- a/net.c +++ b/net.c @@ -99,6 +99,33 @@ static int get_str_sep(char *buf, int buf_size, const char **pp, int sep) return 0; } +int tcp_server_start(const char *str, int *fd) +{ +int val, ret; +struct sockaddr_in saddr; + +if (parse_host_port(&saddr, str) < 0) { An error message would be nice. +return -1; +} + +*fd = qemu_socket(PF_INET, SOCK_STREAM, 0); +if (fd < 0) { +perror("socket"); +return -1; +} This is actually the net_socket_listen_init version; tcp_start_incoming_migration returns the error -socket_error(). I prefer not to lose the errno. I know that when calling net_socket_listen_init, for some unknown reason there is an explicit check for -1: if (net_socket_listen_init(vlan, "socket", name, listen) == -1) { I think it is a good opportunity to change this check. Orit +socket_set_nonblock(*fd); + +/* allow fast reuse */ +val = 1; +setsockopt(*fd, SOL_SOCKET, SO_REUSEADDR, (const char *)&val, sizeof(val)); + +ret = bind(*fd, (struct sockaddr *)&saddr, sizeof(saddr)); +if (ret < 0) { +closesocket(*fd); +} +return ret; +} + int parse_host_port(struct sockaddr_in *saddr, const char *str) { char buf[512]; diff --git a/qemu_socket.h b/qemu_socket.h index fe4cf6c..d612793 100644 --- a/qemu_socket.h +++ b/qemu_socket.h @@ -54,6 +54,8 @@ int unix_listen(const char *path, char *ostr, int olen); int unix_connect_opts(QemuOpts *opts); int unix_connect(const char *path); +int tcp_server_start(const char *str, int *fd); + /* Old, ipv4 only bits. Don't use for new code.
 */
int parse_host_port(struct sockaddr_in *saddr, const char *str);
int socket_init(void);
Re: [Qemu-devel] [PATCH v2 3/9] net: introduce tcp_client_start()
On 03/05/2012 12:03 PM, Amos Kong wrote: Introduce tcp_client_start() by moving original code in tcp_start_outgoing_migration(). Signed-off-by: Amos Kong ak...@redhat.com --- net.c | 39 +++ qemu_socket.h | 1 + 2 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/net.c b/net.c
index 0260968..5c20e22 100644
--- a/net.c
+++ b/net.c
@@ -126,6 +126,45 @@ int tcp_server_start(const char *str, int *fd)
     return ret;
 }

+int tcp_client_start(const char *str, int *fd)
+{
+    struct sockaddr_in saddr;
+    int ret;
+
+    if (parse_host_port(&saddr, str) < 0) {
+        return -EINVAL;
+    }
+
+    *fd = qemu_socket(PF_INET, SOCK_STREAM, 0);
+    if (fd < 0) {
+        perror("socket");
+        return -1;
+    }
+    socket_set_nonblock(*fd);
+
+    for (;;) {
+        ret = connect(*fd, (struct sockaddr *)&saddr, sizeof(saddr));
+        if (ret < 0) {
+            ret = -socket_error();
+            if (ret == -EINPROGRESS) {
+                break;
+#ifdef _WIN32
+            } else if (ret == -WSAEALREADY || ret == -WSAEINVAL) {
+                break;
+#endif
+            } else if (ret != -EINTR && ret != -EWOULDBLOCK) {
+                perror("connect");
+                closesocket(*fd);
+                return -1;

I think it should be: return ret (otherwise you lose the error code). And you need it. Orit

+            }
+        } else {
+            break;
+        }
+    }
+
+    return ret;
+}
+
 int parse_host_port(struct sockaddr_in *saddr, const char *str)
 {
     char buf[512];
diff --git a/qemu_socket.h b/qemu_socket.h
index d612793..9246578 100644
--- a/qemu_socket.h
+++ b/qemu_socket.h
@@ -55,6 +55,7 @@ int unix_connect_opts(QemuOpts *opts);
 int unix_connect(const char *path);

 int tcp_server_start(const char *str, int *fd);
+int tcp_client_start(const char *str, int *fd);

 /* Old, ipv4 only bits. Don't use for new code. */
 int parse_host_port(struct sockaddr_in *saddr, const char *str);
Re: [Qemu-devel] [PATCH v2 2/9] net: use tcp_server_start() for tcp server creation
On 03/05/2012 12:03 PM, Amos Kong wrote: Use tcp_server_start in those two functions: tcp_start_incoming_migration() net_socket_listen_init() Signed-off-by: Amos Kong ak...@redhat.com --- migration-tcp.c | 21 + net/socket.c | 23 +++ 2 files changed, 8 insertions(+), 36 deletions(-)

diff --git a/migration-tcp.c b/migration-tcp.c
index 35a5781..ecadd10 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -157,28 +157,17 @@ out2:
 int tcp_start_incoming_migration(const char *host_port)
 {
-    struct sockaddr_in addr;
-    int val;
+    int ret;
     int s;

     DPRINTF("Attempting to start an incoming migration\n");

-    if (parse_host_port(&addr, host_port) < 0) {
-        fprintf(stderr, "invalid host/port combination: %s\n", host_port);
-        return -EINVAL;
-    }
-
-    s = qemu_socket(PF_INET, SOCK_STREAM, 0);
-    if (s == -1) {
-        return -socket_error();
+    ret = tcp_server_start(host_port, &s);
+    if (ret < 0) {
+        fprintf(stderr, "tcp_server_start: %s\n", strerror(-ret));
+        return ret;
     }

-    val = 1;
-    setsockopt(s, SOL_SOCKET, SO_REUSEADDR, (const char *)&val, sizeof(val));
-
-    if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
-        goto err;
-    }
     if (listen(s, 1) == -1) {
         goto err;
     }
diff --git a/net/socket.c b/net/socket.c
index 0bcf229..5feb3d2 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -403,31 +403,14 @@ static int net_socket_listen_init(VLANState *vlan, const char *host_str)
 {
     NetSocketListenState *s;
-    int fd, val, ret;
-    struct sockaddr_in saddr;
-
-    if (parse_host_port(&saddr, host_str) < 0)
-        return -1;
+    int fd, ret;

     s = g_malloc0(sizeof(NetSocketListenState));

-    fd = qemu_socket(PF_INET, SOCK_STREAM, 0);
-    if (fd < 0) {
-        perror("socket");
-        g_free(s);
-        return -1;
-    }
-    socket_set_nonblock(fd);
-
-    /* allow fast reuse */
-    val = 1;
-    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, (const char *)&val, sizeof(val));
-
-    ret = bind(fd, (struct sockaddr *)&saddr, sizeof(saddr));
+    ret = tcp_server_start(host_str, &fd);
     if (ret < 0) {
-        perror("bind");
+        error_report("tcp_server_start: %s", strerror(-ret));

If the return value is always -1 this has no meaning
Orit

         g_free(s);
-        closesocket(fd);
         return -1;
     }
     ret = listen(fd, 0);
Re: [Qemu-devel] [PATCH v2 3/9] net: introduce tcp_client_start()
On 03/05/2012 12:03 PM, Amos Kong wrote: Introduce tcp_client_start() by moving original code in tcp_start_outgoing_migration(). Signed-off-by: Amos Kong ak...@redhat.com --- net.c | 39 +++ qemu_socket.h | 1 + 2 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/net.c b/net.c
index 0260968..5c20e22 100644
--- a/net.c
+++ b/net.c
@@ -126,6 +126,45 @@ int tcp_server_start(const char *str, int *fd)
     return ret;
 }

+int tcp_client_start(const char *str, int *fd)
+{
+    struct sockaddr_in saddr;
+    int ret;
+
+    if (parse_host_port(&saddr, str) < 0) {
+        return -EINVAL;

You use this in order to know when to call migrate_fd_error; this is problematic, as another error can return this error code. I think that setting *fd = -1 in the beginning of the function would be enough. Orit

+    }
+
+    *fd = qemu_socket(PF_INET, SOCK_STREAM, 0);
+    if (fd < 0) {
+        perror("socket");
+        return -1;
+    }
+    socket_set_nonblock(*fd);
+
+    for (;;) {
+        ret = connect(*fd, (struct sockaddr *)&saddr, sizeof(saddr));
+        if (ret < 0) {
+            ret = -socket_error();
+            if (ret == -EINPROGRESS) {
+                break;
+#ifdef _WIN32
+            } else if (ret == -WSAEALREADY || ret == -WSAEINVAL) {
+                break;
+#endif
+            } else if (ret != -EINTR && ret != -EWOULDBLOCK) {
+                perror("connect");
+                closesocket(*fd);
+                return -1;

should be: return ret;

+            }
+        } else {
+            break;
+        }
+    }
+
+    return ret;
+}
+
 int parse_host_port(struct sockaddr_in *saddr, const char *str)
 {
     char buf[512];
diff --git a/qemu_socket.h b/qemu_socket.h
index d612793..9246578 100644
--- a/qemu_socket.h
+++ b/qemu_socket.h
@@ -55,6 +55,7 @@ int unix_connect_opts(QemuOpts *opts);
 int unix_connect(const char *path);

 int tcp_server_start(const char *str, int *fd);
+int tcp_client_start(const char *str, int *fd);

 /* Old, ipv4 only bits. Don't use for new code. */
 int parse_host_port(struct sockaddr_in *saddr, const char *str);
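The recurring theme of this review is returning -socket_error() instead of a bare -1, so callers can tell "connection still in progress" apart from a hard failure. That classification is a pure function of the negative-errno return value, sketched here standalone with POSIX errno values instead of QEMU's socket_error() (the helper names are invented for illustration):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Interpret the result of a non-blocking connect() attempt.
 * 'ret' is 0 on immediate success, or -errno on failure, mirroring
 * the -socket_error() convention argued for in the thread above. */
static bool connect_still_in_progress(int ret)
{
    return ret == -EINPROGRESS || ret == -EWOULDBLOCK;
}

/* A hard failure is any negative result that is neither "in progress"
 * nor a retryable interruption. */
static bool connect_hard_failure(int ret)
{
    return ret < 0 && !connect_still_in_progress(ret) && ret != -EINTR;
}
```

With a bare -1 return, this distinction is impossible at the caller, which is exactly why tcp_start_outgoing_migration needs the real error code to decide between registering a wait handler and calling migrate_fd_error().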
Re: [Qemu-devel] [PATCH v2 4/9] net: use tcp_client_start for tcp client creation
On 03/05/2012 12:03 PM, Amos Kong wrote: Use tcp_client_start() in those two functions: tcp_start_outgoing_migration() net_socket_connect_init() Signed-off-by: Amos Kong ak...@redhat.com --- migration-tcp.c | 41 + net/socket.c | 41 +++-- 2 files changed, 24 insertions(+), 58 deletions(-)

diff --git a/migration-tcp.c b/migration-tcp.c
index ecadd10..4f89bff 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -81,43 +81,28 @@ static void tcp_wait_for_connect(void *opaque)
 int tcp_start_outgoing_migration(MigrationState *s, const char *host_port)
 {
-    struct sockaddr_in addr;
     int ret;
-
-    ret = parse_host_port(&addr, host_port);
-    if (ret < 0) {
-        return ret;
-    }
+    int fd;

     s->get_error = socket_errno;
     s->write = socket_write;
     s->close = tcp_close;

-    s->fd = qemu_socket(PF_INET, SOCK_STREAM, 0);
-    if (s->fd == -1) {
-        DPRINTF("Unable to open socket");
-        return -socket_error();
-    }
-
-    socket_set_nonblock(s->fd);
-
-    do {
-        ret = connect(s->fd, (struct sockaddr *)&addr, sizeof(addr));
-        if (ret == -1) {
-            ret = -socket_error();
-        }
-        if (ret == -EINPROGRESS || ret == -EWOULDBLOCK) {
-            qemu_set_fd_handler2(s->fd, NULL, NULL, tcp_wait_for_connect, s);
-            return 0;
-        }
-    } while (ret == -EINTR);
-
-    if (ret < 0) {
+    ret = tcp_client_start(host_port, &fd);
+    s->fd = fd;

you don't need fd, you can pass &s->fd to the function.
Orit

+    if (ret == -EINPROGRESS || ret == -EWOULDBLOCK) {
+        DPRINTF("connect in progress");
+        qemu_set_fd_handler2(s->fd, NULL, NULL, tcp_wait_for_connect, s);
+    } else if (ret < 0) {
         DPRINTF("connect failed\n");
-        migrate_fd_error(s);
+        if (ret != -EINVAL) {
+            migrate_fd_error(s);
+        }
         return ret;
+    } else {
+        migrate_fd_connect(s);
     }
-    migrate_fd_connect(s);
+
     return 0;
 }
diff --git a/net/socket.c b/net/socket.c
index 5feb3d2..b7cd8ec 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -434,41 +434,22 @@ static int net_socket_connect_init(VLANState *vlan, const char *host_str)
 {
     NetSocketState *s;
-    int fd, connected, ret, err;
+    int fd, connected, ret;
     struct sockaddr_in saddr;

-    if (parse_host_port(&saddr, host_str) < 0)
-        return -1;
-
-    fd = qemu_socket(PF_INET, SOCK_STREAM, 0);
-    if (fd < 0) {
-        perror("socket");
-        return -1;
-    }
-    socket_set_nonblock(fd);
-
-    connected = 0;
-    for (;;) {
-        ret = connect(fd, (struct sockaddr *)&saddr, sizeof(saddr));
-        if (ret < 0) {
-            err = socket_error();
-            if (err == EINTR || err == EWOULDBLOCK) {
-            } else if (err == EINPROGRESS) {
-                break;
+    ret = tcp_client_start(host_str, &fd);
+    if (ret == -EINPROGRESS || ret == -EWOULDBLOCK) {
+        connected = 0;
 #ifdef _WIN32
-            } else if (err == WSAEALREADY || err == WSAEINVAL) {
-                break;
+    } else if (ret == -WSAEALREADY || ret == -WSAEINVAL) {
+        connected = 0;
 #endif
-            } else {
-                perror("connect");
-                closesocket(fd);
-                return -1;
-            }
-        } else {
-            connected = 1;
-            break;
-        }
+    } else if (ret < 0) {
+        return -1;
+    } else {
+        connected = 1;
     }
+
     s = net_socket_fd_init(vlan, model, name, fd, connected);
     if (!s)
         return -1;
Re: VMXON region vs VMCS region?
On 01/31/2012 05:35 AM, Zhi Yong Wu wrote: Hi, Can anyone let me know the difference between the VMXON region and the VMCS region? Their relationship? There is no relationship between them: the VMXON region is created per logical processor and used by it for VMX operations. A VMCS region is created for each guest vcpu and used both by the hypervisor and the processor. It would be appreciated if you could make some comments.
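One thing the two regions do share is their basic framing: each is a naturally aligned block of memory whose first 32 bits hold the VMCS revision identifier read from the IA32_VMX_BASIC MSR, with the rest in a processor-private format. A simplified layout sketch (the 4 KB size is an assumption for illustration; the real size and revision id must be taken from IA32_VMX_BASIC):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified common framing of the per-logical-CPU VMXON region and the
 * per-vCPU VMCS region. Only the leading revision id is architecturally
 * visible software-readable structure; the body is processor-private.
 * 4096 bytes is assumed here; the actual size comes from IA32_VMX_BASIC. */
struct vmx_region {
    uint32_t revision_id;       /* from IA32_VMX_BASIC bits 30:0 */
    uint8_t  data[4096 - sizeof(uint32_t)];  /* opaque to software */
};
```

The distinction the reply draws is purely about ownership: one VMXON region per logical processor (set up once at vmxon time), one VMCS region per guest vcpu (loaded with vmptrld before each use).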
Re: [Qemu-devel] [PATCH 00/21][RFC] postcopy live migration
On 12/30/2011 12:39 AM, Anthony Liguori wrote: On 12/28/2011 07:25 PM, Isaku Yamahata wrote: Intro = This patch series implements postcopy live migration.[1] As discussed at KVM Forum 2011, a dedicated character device is used for distributed shared memory between the migration source and destination. Now we can discuss/benchmark/compare with precopy. I believe there is much room for improvement. [1] http://wiki.qemu.org/Features/PostCopyLiveMigration Usage = You need to load the umem character device on the host before starting migration. Postcopy can be used with the tcg and kvm accelerators. The implementation depends only on the linux umem character device, and the driver-dependent code is split into its own file. I tested only the host page size == guest page size case, but the implementation allows host page size != guest page size. The following options are added with this patch series. - incoming part: command line options -postcopy [-postcopy-flags flags] where flags is for changing behavior for benchmark/debugging. Currently the following flags are available: 0: default 1: enable touching page request. example: qemu -postcopy -incoming tcp:0: -monitor stdio -machine accel=kvm - outgoing part: options for the migrate command migrate [-p [-n]] URI -p: indicate postcopy migration -n: disable background transferring pages; this is for benchmark/debugging. example: migrate -p -n tcp:dest ip address: TODO - benchmark/evaluation. Especially how async page fault affects the result. I'll review this series next week (Mike/Juan, please also review when you can). But we really need to think hard about whether this is the right thing to take into the tree. I worry a lot about the fact that we don't test pre-copy migration nearly enough, and adding a second form just introduces more things to test. It's also not clear to me why post-copy is better.
If you were going to sit down and explain to someone building a management tool when they should use pre-copy and when they should use post-copy, what would you tell them? Start with pre-copy; if it doesn't converge, switch to post-copy. Orit Regards, Anthony Liguori
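Orit's rule of thumb can be expressed as a tiny per-iteration policy check a management tool might run; the threshold and all names here are invented for illustration and are not part of QEMU:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical policy sketch: start with pre-copy; if the guest dirties
 * pages faster than they are transferred for several iterations in a row,
 * declare non-convergence and fall back to post-copy. */
enum migration_mode { MODE_PRECOPY, MODE_POSTCOPY };

static enum migration_mode pick_mode(unsigned iterations,
                                     unsigned long dirty_pages_per_iter,
                                     unsigned long sent_pages_per_iter)
{
    bool converging = sent_pages_per_iter > dirty_pages_per_iter;
    /* 5 iterations is an arbitrary illustrative cut-off */
    if (!converging && iterations >= 5)
        return MODE_POSTCOPY;
    return MODE_PRECOPY;
}
```

The trade-off the decision encodes: pre-copy can always be aborted safely, while once post-copy starts the destination holds the only up-to-date copy of some pages, so the fallback is attempted only after pre-copy has demonstrably failed to converge.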
Re: [PATCH 02/10] nEPT: MMU context for nested EPT
On 11/12/2011 11:37 PM, Nadav Har'El wrote: On Sat, Nov 12, 2011, Avi Kivity wrote about Re: [PATCH 02/10] nEPT: MMU context for nested EPT: host may write-protect a page. Second, the shadow and guest ptes may be in different formats (ept vs ia32). I'm afraid I've lost you here... The shadow table and the to-be-shadowed table are both ia32 (this is the normal shadow table code), or both ept (the nested tdp code). When are they supposed to be in different formats (ept vs ia32)? I'm also puzzled in what situation the host will write-protect an EPT02 (shadow EPT) page. In fact that happens to accidentally work, no? Intermediate ptes are always present/write/user, which translates to read/write/execute in EPT. It didn't work, because it also used to set the accessed bit, bit 5, which on EPT is reserved and caused an EPT misconfiguration. So I had to fix link_shadow_page, or nested EPT would not work at all. Don't optimize for least changes, optimize for best result afterwards. As I'm sure you remember, two years ago, on September 6, 2009, you wrote in your blog about the newly contributed nested VMX patch set, and in particular its nested EPT (which predated the nested NPT contribution). Nested EPT was, for some workloads, a huge performance improvement, but you (if I understand correctly) did not want that code in KVM because it, basically, optimized for getting the job done, in the most correct and most efficient manner - but without regard for how cleanly this fit with other types of shadowing (normal shadow page tables, and nested NPT), or how much of the code was being duplicated or circumvented. So this time around, I couldn't really not optimize for least changes. This time, nested EPT had to fit (like a square peg in a round hole ;-)) into the preexisting MMU and NPT shadowing. I couldn't really just write the most correct and most efficient code (which Orit Wasserman already did, two years earlier).
This time I needed to figure out the least obtrusive way of changing the existing code. The hardest thing about doing this was trying to understand all the complexities and subtleties of the existing MMU code in KVM, which already handles 101 different cases in one overloaded piece of code, which is not commented or documented. And of course, add to that all the complexities (some might even say cruft) which the underlying x86 architecture itself has accrued over the years. So it's not surprising I've missed some of the important subtleties which didn't have any effect in the typical case I've tried. Like I said, in my tests nested EPT *did* work. And even getting to that point was hard enough :-) We need a third variant of walk_addr_generic that parses EPT format PTEs. Whether that's best done by writing paging_ept.h or modifying paging_tmpl.h, I don't know. Thanks. I'll think about everything you've said in this thread (I'm still not convinced I understood all your points, so just understanding them will be the first step). I'll see what I can do to improve the patch. But I have to be honest - I'm not sure how quickly I can finish this. I really appreciate all your comments about nested VMX in the last two years - most of them have been spot-on, 100% correct, and really helpful for making me understand things which I had previously misunderstood. However, since you are (of course) extremely familiar with every nook and cranny of KVM, what normally happens is that every comment which took you 5 minutes to figure out takes me 5 days to fully understand, and to actually write, debug and test the fixed code. Every review that takes you two days to go through (and is very much appreciated!) takes me several months to fix each and every thing you asked for. Don't get me wrong, I *am* planning to continue working (part-time) on nested VMX, and nested EPT in particular.
But if you want it to pick up the pace, I could use some help with actual coding from people who have much more intimate knowledge of the non-nested-VMX parts of KVM than I have. In the meantime, if anybody wants to experiment with a much faster nested VMX than we had before, you can try my current patch. It may not be perfect, but in many ways it is better than the old shadow-on-ept code. And in simple (64 bit, 4k page) kvm-over-kvm configurations like I tried, it works well. Nadav. Maybe this patch can help; this is roughly what Avi wants (I hope), done very quickly. I'm sorry, I don't have a setup to run nested VMX at the moment so I can't test it. Orit From 032ba0221ea690f61ecc4f6d946090d18d6f4dfb Mon Sep 17 00:00:00 2001 From: Orit Wasserman owass...@redhard.com Date: Sun, 13 Nov 2011 12:57:17 +0200 Subject: [PATCH] Add EPT_walk_addr_genric --- arch/x86/kvm/mmu.c | 13 + arch/x86/kvm/mmu.h | 5 + arch/x86/kvm/paging_tmpl.h | 44 3 files changed, 58
Re: [PATCH 02/10] nEPT: MMU context for nested EPT
On 11/13/2011 04:32 PM, Avi Kivity wrote: On 11/13/2011 01:30 PM, Orit Wasserman wrote: Maybe this patch can help; this is roughly what Avi wants (I hope), done very quickly. I'm sorry, I don't have a setup to run nested VMX at the moment so I can't test it. Orit

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9335e1b..bbe212f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3180,6 +3180,10 @@ static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, unsigned access,
 #include "paging_tmpl.h"
 #undef PTTYPE

+#define PTTYPE EPT
+#include "paging_tmpl.h"
+#undef PTTYPE
+

Yes, that's the key.

 int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
 int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 507e2b8..70d4cfd 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -39,6 +39,21 @@
 #define CMPXCHG cmpxchg64
 #define PT_MAX_FULL_LEVELS 2
 #endif
+#elif PTTYPE == EPT
+#define pt_element_t u64
+#define FNAME(name) EPT_##name
+#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
+#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
+#define PT_LEVEL_BITS PT64_LEVEL_BITS
+#ifdef CONFIG_X86_64
+#define PT_MAX_FULL_LEVELS 4
+#define CMPXCHG cmpxchg
+#else
+#define CMPXCHG cmpxchg64
+#define PT_MAX_FULL_LEVELS 2
+#endif

The various masks should be defined here, to avoid lots of #ifdefs later. That's what I did first, but then I was afraid that the MASK would be changed for mmu.c too, so I decided on ifdefs. The more I think about it, the more I think we need a wrapper function for mask checking (at least for this file). What do you think?
#elif PTTYPE == 32
#define pt_element_t u32
#define guest_walker guest_walker32

@@ -106,14 +121,19 @@ static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte,
 {
 	unsigned access;

+#if PTTYPE == EPT
+	access = (gpte & EPT_WRITABLE_MASK) | EPT_EXEC_MASK;
+#else
 	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
 	if (last && !is_dirty_gpte(gpte))
 		access &= ~ACC_WRITE_MASK;
+#endif

Like here, you could make is_dirty_gpte() local to paging_tmpl(), returning true for EPT and the dirty bit otherwise.

 #if PTTYPE == 64
 	if (vcpu->arch.mmu.nx)
 		access &= ~(gpte >> PT64_NX_SHIFT);

The ept X bit is lost. Could do something like access = (gpte >> PT_X_NX_SHIFT) ^ PT_X_NX_SENSE;

+#if PTTYPE == EPT
+	const int write_fault = access & EPT_WRITABLE_MASK;
+	const int user_fault = 0;
+	const int fetch_fault = 0;
+#else

EPT has fetch permissions (but not user permissions); anyway translate_nested_gpa() already does this.

 	const int write_fault = access & PFERR_WRITE_MASK;
 	const int user_fault = access & PFERR_USER_MASK;
 	const int fetch_fault = access & PFERR_FETCH_MASK;
+#endif
 	u16 errcode = 0;

 	trace_kvm_mmu_pagetable_walk(addr, write_fault, user_fault,
@@ -174,6 +200,9 @@ retry_walk:
 	       (mmu->get_cr3(vcpu) & CR3_NONPAE_RESERVED_BITS) == 0);

 	pt_access = ACC_ALL;
+#if PTTYPE == EPT
+	pt_access = PT_PRESENT_MASK | EPT_WRITABLE_MASK | EPT_EXEC_MASK;
+#endif

pt_access is not in EPT or ia32 format - it's our own format (xwu). So this doesn't need changing. Updating gpte_access() is sufficient.

 	for (;;) {
 		gfn_t real_gfn;
@@ -186,9 +215,14 @@ retry_walk:
 		pte_gpa = gfn_to_gpa(table_gfn) + offset;
 		walker->table_gfn[walker->level - 1] = table_gfn;
 		walker->pte_gpa[walker->level - 1] = pte_gpa;
-
+#if PTTYPE == EPT
+		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
+					      EPT_WRITABLE_MASK);
+#else
 		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
 					      PFERR_USER_MASK|PFERR_WRITE_MASK);
+#endif
+

Unneeded, I think. Is it because translate_nested_gpa always sets USER_MASK?
 		if (unlikely(real_gfn == UNMAPPED_GVA))
 			goto error;
 		real_gfn = gpa_to_gfn(real_gfn);
@@ -221,6 +255,7 @@ retry_walk:
 			eperm = true;
 #endif

+#if PTTYPE != EPT
 		if (!eperm && unlikely(!(pte & PT_ACCESSED_MASK))) {
 			int ret;
 			trace_kvm_mmu_set_accessed_bit(table_gfn, index,
@@ -235,7 +270,7 @@ retry_walk:
 			mark_page_dirty(vcpu->kvm, table_gfn);
 			pte |= PT_ACCESSED_MASK;
 		}
-
+#endif

If PT_ACCESSED_MASK is 0 for EPT
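The core difference the thread keeps circling is how guest-PTE permission bits map into KVM's internal access format: ia32 PTEs carry present/write/user/accessed/dirty bits, while EPT entries carry read/write/execute and no user or accessed bits at all. A simplified, self-contained sketch of the two mappings (mask values follow the x86 and VMX encodings; the ACC_* values mirror KVM's mmu internals but the helper functions are illustrative, not KVM's actual gpte_access):

```c
#include <assert.h>
#include <stdint.h>

/* ia32 page-table entry bits */
#define PT_PRESENT_MASK   (1ull << 0)
#define PT_WRITABLE_MASK  (1ull << 1)
#define PT_USER_MASK      (1ull << 2)
#define PT_ACCESSED_MASK  (1ull << 5)

/* EPT entry permission bits */
#define EPT_READABLE_MASK   (1ull << 0)
#define EPT_WRITABLE_MASK   (1ull << 1)
#define EPT_EXECUTABLE_MASK (1ull << 2)

/* KVM-internal access format (simplified) */
#define ACC_EXEC_MASK  1u
#define ACC_WRITE_MASK 2u
#define ACC_USER_MASK  4u

static unsigned ia32_gpte_access(uint64_t gpte)
{
    /* PT write/user bits (1 and 2) happen to coincide with
     * ACC_WRITE/ACC_USER, so they can be masked through directly;
     * exec is implied (NX handling omitted in this sketch). */
    return (unsigned)(gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
}

static unsigned ept_gpte_access(uint64_t gpte)
{
    unsigned access = 0;
    if (gpte & EPT_WRITABLE_MASK)
        access |= ACC_WRITE_MASK;
    if (gpte & EPT_EXECUTABLE_MASK)
        access |= ACC_EXEC_MASK;
    return access;   /* EPT has no user/supervisor distinction */
}
```

This is why a third walk_addr_generic variant (or a PTTYPE == EPT instantiation of paging_tmpl.h) is needed: the walker's bit tests, not just its masks, differ between the two formats.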
Re: [PATCH 04/10] nEPT: Fix page table format in nested EPT
On 11/10/2011 11:59 AM, Nadav Har'El wrote: When the existing KVM MMU code creates a shadow page table, it assumes it has the normal x86 page table format. This is obviously correct for normal shadow page tables, and also correct for AMD's NPT. Unfortunately, Intel's EPT page tables differ in subtle ways from ordinary page tables, so when we create a shadow EPT table (i.e., in nested EPT), we need to slightly modify the way in which this table is built. In particular, when mmu.c's link_shadow_page() creates non-leaf page table entries, it used to enable the present, accessed, writable and user flags on these entries. While this is correct for ordinary page tables, it is wrong in EPT tables - where these bits actually have completely different meanings (compare PT_*_MASK from mmu.h to VMX_EPT_*_MASK from vmx.h). In particular, leaving the code as-is causes bit 5 of the PTE to be turned on (supposedly for PT_ACCESSED_MASK), which is a reserved bit in EPT and causes an EPT Misconfiguration failure. So we must move link_shadow_page's list of extra bits to a new mmu context field, which is set differently for nested EPT.
Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu.c              | 16 +---
 arch/x86/kvm/paging_tmpl.h      |  6 --
 arch/x86/kvm/vmx.c              |  3 +++
 4 files changed, 17 insertions(+), 9 deletions(-)

--- .before/arch/x86/include/asm/kvm_host.h	2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/include/asm/kvm_host.h	2011-11-10 11:33:59.0 +0200
@@ -287,6 +287,7 @@ struct kvm_mmu {
 	bool nx;

 	u64 pdptrs[4]; /* pae */
+	u64 link_shadow_page_set_bits;
 };

 struct kvm_vcpu_arch {
--- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.0 +0200
@@ -6485,6 +6485,9 @@ static int nested_ept_init_mmu_context(s
 	vcpu->arch.mmu.get_pdptr = nested_ept_get_pdptr;
 	vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
 	vcpu->arch.mmu.shadow_root_level = get_ept_level();
+	vcpu->arch.mmu.link_shadow_page_set_bits =
+		VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
+		VMX_EPT_EXECUTABLE_MASK;

 	vcpu->arch.walk_mmu = &vcpu->arch.nested_mmu;
--- .before/arch/x86/kvm/mmu.c	2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/mmu.c	2011-11-10 11:33:59.0 +0200
@@ -1782,14 +1782,9 @@ static void shadow_walk_next(struct kvm_
 	return __shadow_walk_next(iterator, *iterator->sptep);
 }

-static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
+static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, u64 set_bits)
 {
-	u64 spte;
-
-	spte = __pa(sp->spt)
-		| PT_PRESENT_MASK | PT_ACCESSED_MASK
-		| PT_WRITABLE_MASK | PT_USER_MASK;
-	mmu_spte_set(sptep, spte);
+	mmu_spte_set(sptep, __pa(sp->spt) | set_bits);
 }

 static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
@@ -3366,6 +3361,13 @@ int kvm_init_shadow_mmu(struct kvm_vcpu
 	vcpu->arch.mmu.base_role.cr0_wp = is_write_protection(vcpu);
 	vcpu->arch.mmu.base_role.smep_andnot_wp = smep && !is_write_protection(vcpu);
+	/*
+	 * link_shadow() should apply these bits in shadow page tables, and
+	 * in shadow NPT tables (nested NPT). For nested EPT, different bits
+	 * apply.
+	 */
+	vcpu->arch.mmu.link_shadow_page_set_bits = PT_PRESENT_MASK |
+		PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;

 	return r;
 }
--- .before/arch/x86/kvm/paging_tmpl.h	2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/paging_tmpl.h	2011-11-10 11:33:59.0 +0200
@@ -515,7 +515,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
 			goto out_gpte_changed;

 		if (sp)
-			link_shadow_page(it.sptep, sp);
+			link_shadow_page(it.sptep, sp,
+					 vcpu->arch.mmu.link_shadow_page_set_bits);
 	}

 	for (;
@@ -535,7 +536,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
 		sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
 				      true, direct_access, it.sptep);
-		link_shadow_page(it.sptep, sp);
+		link_shadow_page(it.sptep, sp,
+				 vcpu->arch.mmu.link_shadow_page_set_bits);
 	}

 	clear_sp_write_flooding_count(it.sptep);

We need to consider the permission in L1's EPT table: what if an entry is read-only?
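The bug this patch fixes is concrete enough to check mechanically: the legacy bit set for non-leaf shadow entries includes PT_ACCESSED_MASK (bit 5), and bit 5 is reserved in EPT-format entries, which is what triggers the EPT misconfiguration. A small sketch of the two per-context bit sets (constants follow the x86/VMX encodings; the function names mirror the patch's intent but are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* ia32 page-table entry bits */
#define PT_PRESENT_MASK   (1ull << 0)
#define PT_WRITABLE_MASK  (1ull << 1)
#define PT_USER_MASK      (1ull << 2)
#define PT_ACCESSED_MASK  (1ull << 5)

/* EPT entry permission bits */
#define VMX_EPT_READABLE_MASK   (1ull << 0)
#define VMX_EPT_WRITABLE_MASK   (1ull << 1)
#define VMX_EPT_EXECUTABLE_MASK (1ull << 2)

/* Bits applied to non-leaf entries of an ia32/NPT shadow table. */
static uint64_t shadow_set_bits_ia32(void)
{
    return PT_PRESENT_MASK | PT_ACCESSED_MASK |
           PT_WRITABLE_MASK | PT_USER_MASK;
}

/* Bits applied to non-leaf entries of a shadow EPT table:
 * no accessed bit here - bit 5 is reserved in EPT entries,
 * and setting it causes an EPT misconfiguration exit. */
static uint64_t shadow_set_bits_ept(void)
{
    return VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
           VMX_EPT_EXECUTABLE_MASK;
}
```

Storing the set in a per-MMU-context field, as the patch does, lets the shared link_shadow_page() stay format-agnostic.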
Re: [PATCH 5/5] Nested VMX patch 5 implements vmlaunch and vmresume
Gleb Natapov g...@redhat.com wrote on 25/10/2009 11:44:31: From: Gleb Natapov g...@redhat.com To: Orit Wasserman/Haifa/i...@ibmil Cc: Abel Gordon/Haifa/i...@ibmil, aligu...@us.ibm.com, Ben-Ami Yassour1/Haifa/i...@ibmil, kvm@vger.kernel.org, md...@us.ibm.com, Muli Ben-Yehuda/Haifa/i...@ibmil Date: 25/10/2009 11:44 Subject: Re: [PATCH 5/5] Nested VMX patch 5 implements vmlaunch and vmresume On Thu, Oct 22, 2009 at 05:46:16PM +0200, Orit Wasserman wrote: Gleb Natapov g...@redhat.com wrote on 22/10/2009 11:04:58: On Wed, Oct 21, 2009 at 04:43:44PM +0200, Orit Wasserman wrote:

@@ -4641,10 +4955,13 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 	int type;
 	bool idtv_info_valid;

-	exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
-	vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
+	if (vmx->nested.nested_mode)
+		return;
+

Why return here? What does the function do that should not be done in nested mode? In nested mode, L0 injects an interrupt to L2 only in one scenario: if there is an IDT_VALID event and L0 decides to run L2 again and not to switch back to L1. In all other cases the injection is handled by L1. This is exactly the kind of scenario that is handled by vmx_complete_interrupts(). (vmx|svm)_complete_interrupts() store pending events in an arch-agnostic way, and re-injection is handled by x86.c. You bypass this logic by inserting a return here and introducing the nested_handle_valid_idt() function below. The only location where we can truly know if we are switching to L1 is in vmx_vcpu_run, because enable_irq_window (which is called after handling the exit) can decide to switch to L1 because of an interrupt.
enable_irq_window() will be called after the L2 VMCS has been set up for event re-injection by the previous call to inject_pending_event(). As far as I can see this should work for interrupt injection. For exceptions, we should probably require the L2 guest to re-execute the faulted instruction for now, like svm does. The main issue is that L0 doesn't inject events to L2; the L1 hypervisor does (we want to keep the nested hypervisor semantics as much as possible). Only if the event was caused by the fact that L2 is a nested guest, and L1 can't handle it, will L0 re-inject an event to L2 - for example, an IDT event with a page fault that is caused by a missing entry in SPT02 (the shadow page table L0 creates for L2). In this case, when vmx_complete_interrupts is called, L0 doesn't know if the page fault should be handled by it or by L1 (that is decided later, when handling the exit). In most other cases, L0 will switch to L1, and L1 will decide if there will be re-injection (depending on the L1 hypervisor logic) and update the L2 VMCS accordingly. In order to simplify our code, it was simpler to bypass vmx_complete_interrupts when it is called (after running L2) and to add nested_handle_valid_idt just before running L2.

+	exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+
 	/* Handle machine checks before interrupts are enabled */
 	if ((vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY)
 	    || (vmx->exit_reason == EXIT_REASON_EXCEPTION_NMI
@@ -4747,6 +5064,60 @@ static void fixup_rmode_irq(struct vcpu_vmx *vmx)
 		| vmx->rmode.irq.vector;
 }

+static int nested_handle_valid_idt(struct kvm_vcpu *vcpu)
+{

It seems by this function you are trying to bypass the general event reinjection logic. Why? See above. The logic implemented by this function is handled in x86.c in an arch-agnostic way. Is there something wrong with this? See my comment before. Sometimes it is wrong to reinject events from L0 to L2 directly.
If L2 was not able to handle the event because its IDT is not mapped by the L1 shadow page table, we should generate a PF vmexit with valid IDT vectoring info to L1 and let L1 handle the event reinjection. -- Gleb.
Re: [PATCH 1/5] Nested VMX patch 1 implements vmon and vmoff
Avi Kivity a...@redhat.com wrote on 20/10/2009 06:00:50: From: Avi Kivity a...@redhat.com To: Orit Wasserman/Haifa/i...@ibmil Cc: kvm@vger.kernel.org, Ben-Ami Yassour1/Haifa/i...@ibmil, Abel Gordon/Haifa/i...@ibmil, Muli Ben-Yehuda/Haifa/i...@ibmil, aligu...@us.ibm.com, md...@us.ibm.com Date: 20/10/2009 06:02 Subject: Re: [PATCH 1/5] Nested VMX patch 1 implements vmon and vmoff On 10/15/2009 11:41 PM, or...@il.ibm.com wrote:

+/*
+ * Handles msr read for nested virtualization
+ */
+static int nested_vmx_get_msr(struct kvm_vcpu *vcpu, u32 msr_index,
+			      u64 *pdata)
+{
+	u64 vmx_msr = 0;
+
+	switch (msr_index) {
+	case MSR_IA32_FEATURE_CONTROL:
+		*pdata = 0;
+		break;
+	case MSR_IA32_VMX_BASIC:
+		*pdata = 0;
+		rdmsrl(MSR_IA32_VMX_BASIC, vmx_msr);
+		*pdata = (vmx_msr & 0x00cf);
+		break;
+

This (and the rest of the msrs) must be controllable from userspace. Otherwise a live migration from a newer host to an older host would break. OK.

+/*
+ * Writes msr value for nested virtualization
+ * Returns 0 on success, non-0 otherwise.
+ */
+static int nested_vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
+{
+	switch (msr_index) {
+	case MSR_IA32_FEATURE_CONTROL:
+		if ((data & (FEATURE_CONTROL_LOCKED |
+			     FEATURE_CONTROL_VMXON_ENABLED))
+		    != (FEATURE_CONTROL_LOCKED |
+			FEATURE_CONTROL_VMXON_ENABLED))
+			return 1;
+		break;
+	default:
+		return 1;
+	}
+
+	return 0;
+}
+

Need to export this msr to userspace for live migration. See msrs_to_save[]. OK.

+/*
+ * Check to see if vcpu can execute vmx command
+ * Inject the corresponding exception
+ */
+static int nested_vmx_check_permission(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment cs;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct kvm_msr_entry *msr;
+
+	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+
+	if (!vmx->nested.vmxon) {
+		printk(KERN_DEBUG "%s: vmx not on\n", __func__);

pr_debug. I will change it.
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 0;
+	}
+
+	msr = find_msr_entry(vmx, MSR_EFER);
+
+	if ((vmx_get_rflags(vcpu) & X86_EFLAGS_VM) ||
+	    ((msr->data & EFER_LMA) && !cs.l)) {

is_long_mode()

I will change it.

 static int handle_vmx_insn(struct kvm_vcpu *vcpu)
 {
 	kvm_queue_exception(vcpu, UD_VECTOR);
 	return 1;
 }

+static int handle_vmoff(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	vmx->nested.vmxon = 0;
+
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
+static int handle_vmon(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment cs;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!nested) {
+		printk(KERN_DEBUG "%s: nested vmx not enabled\n", __func__);
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+
+	if (!(vcpu->arch.cr4 & X86_CR4_VMXE) ||
+	    !(vcpu->arch.cr0 & X86_CR0_PE) ||
+	    (vmx_get_rflags(vcpu) & X86_EFLAGS_VM)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		printk(KERN_INFO "%s invalid register state\n", __func__);
+		return 1;
+	}
+#ifdef CONFIG_X86_64
+	if (((find_msr_entry(to_vmx(vcpu),
+			     MSR_EFER)->data & EFER_LMA) && !cs.l)) {

is_long_mode(), and you can avoid the #ifdef.

I will change it.

VMXON is supposed to block INIT, please add that (in a separate patch).

I will add it.

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH 3/5] Nested VMX patch 3 implements vmptrld and vmptrst
Avi Kivity a...@redhat.com wrote on 20/10/2009 06:24:33:
From: Avi Kivity a...@redhat.com
To: Orit Wasserman/Haifa/i...@ibmil
Cc: kvm@vger.kernel.org, Ben-Ami Yassour1/Haifa/i...@ibmil, Abel Gordon/Haifa/i...@ibmil, Muli Ben-Yehuda/Haifa/i...@ibmil, aligu...@us.ibm.com, md...@us.ibm.com
Date: 20/10/2009 06:24
Subject: Re: [PATCH 3/5] Nested VMX patch 3 implements vmptrld and vmptrst

On 10/15/2009 11:41 PM, or...@il.ibm.com wrote:

+
+struct __attribute__ ((__packed__)) shadow_vmcs {

Since this is in guest memory, we need it packed so the binary format is preserved across migration. Please add a comment so it isn't changed (at least without changing the revision_id).

I will add a comment.

vmclear state should be here, that will help multiguest support.

Working on it ...

 struct nested_vmx {
 	/* Has the level1 guest done vmxon? */
 	bool vmxon;
-
+	/* What is the location of the vmcs l1 keeps for l2? (in level1 gpa) */
+	u64 vmptr;

Need to expose it for live migration.

I will look into it.

 	/*
 	 * Level 2 state : includes vmcs,registers and
 	 * a copy of vmcs12 for vmread/vmwrite
 	 */
 	struct level_state *l2_state;
+	/* Level 1 state for switching to level 2 and back */
+	struct level_state *l1_state;

This creates a ton of duplication. Some of the data is completely unnecessary, for example we can recalculate cr0 from HOST_CR0 and GUEST_CR0.

I will look into it.

+
+static int vmptrld(struct kvm_vcpu *vcpu,
+		   u64 phys_addr)
+{
+	u8 error;
+
+	asm volatile (__ex(ASM_VMX_VMPTRLD_RAX) "; setna %0"
+		      : "=g"(error) : "a"(&phys_addr), "m"(phys_addr)
+		      : "cc");
+	if (error) {
+		printk(KERN_ERR "kvm: %s vmptrld %llx failed\n",
+		       __func__, phys_addr);
+		return 1;
+	}
+
+	return 0;
+}
+
 /*
  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
  * vcpu mutex is already taken.
@@ -736,15 +923,8 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	}

 	if (per_cpu(current_vmcs, cpu) != vmx->vmcs) {
-		u8 error;
-
 		per_cpu(current_vmcs, cpu) = vmx->vmcs;
-		asm volatile (__ex(ASM_VMX_VMPTRLD_RAX) "; setna %0"
-			      : "=g"(error) : "a"(&phys_addr), "m"(phys_addr)
-			      : "cc");
-		if (error)
-			printk(KERN_ERR "kvm: vmptrld %p/%llx fail\n",
-			       vmx->vmcs, phys_addr);
+		vmptrld(vcpu, phys_addr);
 	}

This part of the patch is no longer needed.

I will remove it.

+	if (cpu_has_vmx_msr_bitmap())
+		vmx->nested.l2_state->msr_bitmap = vmcs_read64(MSR_BITMAP);
+	else
+		vmx->nested.l2_state->msr_bitmap = 0;
+
+	vmx->nested.l2_state->io_bitmap_a = vmcs_read64(IO_BITMAP_A);
+	vmx->nested.l2_state->io_bitmap_b = vmcs_read64(IO_BITMAP_B);
+

This no longer works, since we don't load the guest vmcs.

I will look into it.

+int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
+			struct kvm_vcpu *vcpu);

Isn't this in a header somewhere?

I will move it into the header.

+
+int read_guest_vmcs_gpa(struct kvm_vcpu *vcpu, u64 *gentry)
+{
+
+	int r = 0;
+
+	r = kvm_read_guest_virt(vcpu->arch.regs[VCPU_REGS_RAX], gentry,
+				sizeof(u64), vcpu);
+	if (r) {
+		printk(KERN_ERR "%s cannot read guest vmcs addr %lx : %d\n",
+		       __func__, vcpu->arch.regs[VCPU_REGS_RAX], r);
+		return r;
+	}
+
+	if (!IS_ALIGNED(*gentry, PAGE_SIZE)) {
+		printk(KERN_DEBUG "%s addr %llx not aligned\n",
+		       __func__, *gentry);
+		return 1;
+	}
+
+	return 0;
+}
+

Should go through the emulator to evaluate arguments.

OK.

+static int handle_vmptrld(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct page *vmcs_page;
+	u64 guest_vmcs_addr;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (read_guest_vmcs_gpa(vcpu, &guest_vmcs_addr))
+		return 1;
+
+	if (create_l1_state(vcpu)) {
+		printk(KERN_ERR "%s create_l1_state failed\n", __func__);
+		return 1;
+	}
+
+	if (create_l2_state(vcpu)) {
+		printk(KERN_ERR "%s create_l2_state failed\n", __func__);
+		return 1;
+	}

return errors here, so we see the problem.

OK.
+
+static int handle_vmptrst(struct kvm_vcpu *vcpu)
+{
+	int r = 0;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	r = kvm_write_guest_virt(vcpu->arch.regs[VCPU_REGS_RAX],
+				 (void *)&to_vmx(vcpu)->nested.vmptr,
+				 sizeof(u64), vcpu);

Emulator again.

OK.

+void save_vmcs(struct shadow_vmcs *dst)
+{
+	dst->io_bitmap_a = vmcs_read64(IO_BITMAP_A);
+	dst->io_bitmap_b = vmcs_read64(IO_BITMAP_B);

These (and many others) can never change due to a nested guest running, so no need to save
Re: [PATCH 4/5] Nested VMX patch 4 implements vmread and vmwrite
Avi Kivity a...@redhat.com wrote on 20/10/2009 06:44:41:
From: Avi Kivity a...@redhat.com
To: Orit Wasserman/Haifa/i...@ibmil
Cc: kvm@vger.kernel.org, Ben-Ami Yassour1/Haifa/i...@ibmil, Abel Gordon/Haifa/i...@ibmil, Muli Ben-Yehuda/Haifa/i...@ibmil, aligu...@us.ibm.com, md...@us.ibm.com
Date: 20/10/2009 06:44
Subject: Re: [PATCH 4/5] Nested VMX patch 4 implements vmread and vmwrite

On 10/15/2009 11:41 PM, or...@il.ibm.com wrote:

+static int nested_map_shadow_vmcs(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct page *vmcs_page = nested_get_page(vcpu, vmx->nested.vmptr);
+
+	if (vmcs_page == NULL) {
+		printk(KERN_INFO "%s: failure in nested_get_page\n", __func__);
+		return 0;
+	}
+
+	if (vmx->nested.l2_state->shadow_vmcs) {
+		printk(KERN_INFO "%s: shadow vmcs already mapped\n", __func__);
+		return 0;
+	}
+

Consider dropping shadow_vmcs from l2_state and just passing it everywhere. Less convenient but safer.

I will think about it, it is called from many places ...

+	vmx->nested.l2_state->shadow_vmcs = kmap_atomic(vmcs_page, KM_USER0);
+
+	if (!vmx->nested.l2_state->shadow_vmcs) {
+		printk(KERN_INFO "%s: error in kmap_atomic\n", __func__);
+		return 0;
+	}

kmap_atomic() can't fail.

I will remove the check.

+static int handle_vmread(struct kvm_vcpu *vcpu)
+{
+#ifndef CONFIG_X86_64
+	u64 value;
+#endif
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (!nested_map_shadow_vmcs(vcpu)) {
+		printk(KERN_INFO "%s invalid shadow vmcs\n", __func__);
+		set_rflags_to_vmx_fail_invalid(vcpu);
+		return 1;
+	}

return an error.

OK.

+
+	switch (vmcs_field_length(vcpu->arch.regs[VCPU_REGS_RDX])) {
+	case VMCS_FIELD_TYPE_U16:
+		vcpu->arch.regs[VCPU_REGS_RAX] =
+			nested_vmcs_read16(vcpu,
+					   vcpu->arch.regs[VCPU_REGS_RDX]);
+		break;

Use the emulator to decode operands.

OK.
Re: [PATCH 5/5] Nested VMX patch 5 implements vmlaunch and vmresume
Avi Kivity a...@redhat.com wrote on 20/10/2009 06:56:39:
From: Avi Kivity a...@redhat.com
To: Orit Wasserman/Haifa/i...@ibmil
Cc: kvm@vger.kernel.org, Ben-Ami Yassour1/Haifa/i...@ibmil, Abel Gordon/Haifa/i...@ibmil, Muli Ben-Yehuda/Haifa/i...@ibmil, aligu...@us.ibm.com, md...@us.ibm.com
Date: 20/10/2009 06:56
Subject: Re: [PATCH 5/5] Nested VMX patch 5 implements vmlaunch and vmresume

On 10/15/2009 11:41 PM, or...@il.ibm.com wrote:

From: Orit Wasserman or...@il.ibm.com
---
 arch/x86/kvm/vmx.c | 1173 ++ --
 1 files changed, 1148 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6a4c252..e814029 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -209,6 +209,7 @@ struct __attribute__ ((__packed__)) level_state {
 	struct vmcs *vmcs;
 	int cpu;
 	int launched;
+	bool first_launch;
 };

 struct nested_vmx {
@@ -216,6 +217,12 @@ struct nested_vmx {
 	bool vmxon;
 	/* What is the location of the vmcs l1 keeps for l2? (in level1 gpa) */
 	u64 vmptr;
+	/* Are we running nested guest */
+	bool nested_mode;
+	/* L1 requested VMLAUNCH or VMRESUME but we didn't run L2 yet */
+	bool nested_run_pending;
+	/* flag indicating if there was a valid IDT after exiting from l2 */
+	bool nested_valid_idt;

Did you mean valid_idt_vectoring_info?

yes.

No need to prefix everything with nested_ inside nested_vmx.

OK.
+void prepare_vmcs_12(struct kvm_vcpu *vcpu)
+{
+	struct shadow_vmcs *l2_shadow_vmcs =
+		get_shadow_vmcs(vcpu);
+
+	l2_shadow_vmcs->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+	l2_shadow_vmcs->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+	l2_shadow_vmcs->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+	l2_shadow_vmcs->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+	l2_shadow_vmcs->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+	l2_shadow_vmcs->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+	l2_shadow_vmcs->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+	l2_shadow_vmcs->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+
+	l2_shadow_vmcs->tsc_offset = vmcs_read64(TSC_OFFSET);
+	l2_shadow_vmcs->guest_physical_address =
+		vmcs_read64(GUEST_PHYSICAL_ADDRESS);
+	l2_shadow_vmcs->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);

Physical addresses need translation, no?

If you are referring to GUEST_PHYSICAL_ADDRESS, then there is no need for translation for L1. It needs to stay an L2 physical address.

+	l2_shadow_vmcs->guest_cr0 = vmcs_readl(GUEST_CR0);
+
+	l2_shadow_vmcs->guest_cr4 = vmcs_readl(GUEST_CR4);

We don't allow the guest to modify these, so no need to read them. If you do, you need to remove the bits that we modify.

You are correct, for example CR0.TS; it will be fixed in the next set of patches.

+
+int load_vmcs_common(struct shadow_vmcs *src)
+{
+
+	vmcs_write64(VMCS_LINK_POINTER, src->vmcs_link_pointer);

Why load this?

At the moment it is not used, maybe in the future. I can add a check to see if it was changed.

+	vmcs_write64(GUEST_IA32_DEBUGCTL, src->guest_ia32_debugctl);

I think some features there are dangerous.

I will look into it.

+	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, src->vm_entry_msr_load_count);

Need to verify? Also need to validate the loaded MSRs and run them through kvm_set_msr() instead of letting the cpu do it.

I will add the checks.
Re: [PATCH 5/5] Nested VMX patch 5 implements vmlaunch and vmresume
Gleb Natapov g...@redhat.com wrote on 22/10/2009 11:04:58:
From: Gleb Natapov g...@redhat.com
To: Orit Wasserman/Haifa/i...@ibmil
Cc: Abel Gordon/Haifa/i...@ibmil, aligu...@us.ibm.com, Ben-Ami Yassour1/Haifa/i...@ibmil, kvm@vger.kernel.org, md...@us.ibm.com, Muli Ben-Yehuda/Haifa/i...@ibmil
Date: 22/10/2009 11:05
Subject: Re: [PATCH 5/5] Nested VMX patch 5 implements vmlaunch and vmresume

On Wed, Oct 21, 2009 at 04:43:44PM +0200, Orit Wasserman wrote:

@@ -4641,10 +4955,13 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 	int type;
 	bool idtv_info_valid;

-	exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
-	vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
+	if (vmx->nested.nested_mode)
+		return;
+

Why return here? What does the function do that should not be done in nested mode?

In nested mode L0 injects an interrupt to L2 only in one scenario: if there is an IDT_VALID event and L0 decides to run L2 again rather than switch back to L1. In all other cases the injection is handled by L1.

This is exactly the kind of scenario that is handled by vmx_complete_interrupts(). (vmx|svm)_complete_interrupts() store the pending event in an arch-agnostic way, and re-injection is handled by x86.c. You bypass this logic by inserting a return here and introducing the nested_handle_valid_idt() function below.

The only location where we can truly know if we are switching to L1 is in vmx_vcpu_run, because enable_irq_window (which is called after handling the exit) can decide to switch to L1 because of an interrupt. In order to simplify our code it was easier to bypass vmx_complete_interrupts when it is called (after running L2) and to add nested_handle_valid_idt just before running L2.
+	exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+
 	/* Handle machine checks before interrupts are enabled */
 	if ((vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY)
 	    || (vmx->exit_reason == EXIT_REASON_EXCEPTION_NMI

@@ -4747,6 +5064,60 @@ static void fixup_rmode_irq(struct vcpu_vmx *vmx)
 		| vmx->rmode.irq.vector;
 }

+static int nested_handle_valid_idt(struct kvm_vcpu *vcpu)
+{

It seems by this function you are trying to bypass the general event reinjection logic. Why?

See above.

The logic implemented by this function is handled in x86.c in an arch-agnostic way. Is there something wrong with this?

See my comment before.

+	vmx->launched = vmx->nested.l2_state->launched;
+

Can you explain why the ->launched logic is needed?

It is possible that L1 called vmlaunch but we didn't actually run L2 (for example, there was an interrupt and enable_irq_window switched back to L1 before running L2). L1 thinks the vmlaunch was successful and calls vmresume the next time, but KVM needs to call vmlaunch for L2.

handle_vmlaunch() and handle_vmresume() are exactly the same. Why does KVM need to run one and not the other?

Yes, they are very similar (almost the same code); the only difference is the check of vmclear. We need to emulate the vmx hardware behavior for those two commands and check the VMCS12 state.

+static int nested_vmx_vmexit(struct kvm_vcpu *vcpu,
+			     bool is_interrupt)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int initial_pfu_active = vcpu->fpu_active;
+
+	if (!vmx->nested.nested_mode) {
+		printk(KERN_INFO "WARNING: %s called but not in nested mode\n",
+		       __func__);
+		return 0;
+	}
+
+	save_msrs(vmx->guest_msrs, vmx->save_nmsrs);
+
+	sync_cached_regs_to_vmcs(vcpu);
+
+	if (!nested_map_shadow_vmcs(vcpu)) {
+		printk(KERN_INFO "Error mapping shadow vmcs\n");
+		set_rflags_to_vmx_fail_valid(vcpu);

An error during vmexit should set the abort flag, not change flags.

I think this is more a vmlaunch/vmresume error (in the code that switches back to L1).

How is this a vmlaunch/vmresume error?
This function is called to exit from the L2 guest while on the L2 vmcs. It is called asynchronously with respect to the L2 guest, and you modify the L2 guest's rflags register at an unpredictable place here.

OK.

--
Gleb.
Re: [PATCH 3/5] Nested VMX patch 3 implements vmptrld and vmptrst
Gleb Natapov g...@redhat.com wrote on 19/10/2009 13:17:41:
From: Gleb Natapov g...@redhat.com
To: Orit Wasserman/Haifa/i...@ibmil
Cc: kvm@vger.kernel.org, Ben-Ami Yassour1/Haifa/i...@ibmil, Abel Gordon/Haifa/i...@ibmil, Muli Ben-Yehuda/Haifa/i...@ibmil, aligu...@us.ibm.com, md...@us.ibm.com
Date: 19/10/2009 13:17
Subject: Re: [PATCH 3/5] Nested VMX patch 3 implements vmptrld and vmptrst

On Thu, Oct 15, 2009 at 04:41:44PM +0200, or...@il.ibm.com wrote:

From: Orit Wasserman or...@il.ibm.com
---
 arch/x86/kvm/vmx.c | 468 +++ +++--
 arch/x86/kvm/x86.c |   3 +-
 2 files changed, 459 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 411cbdb..8c186e0 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -61,20 +61,168 @@ module_param_named(unrestricted_guest,
 static int __read_mostly emulate_invalid_guest_state = 0;
 module_param(emulate_invalid_guest_state, bool, S_IRUGO);

+
+struct __attribute__ ((__packed__)) shadow_vmcs {
+	u32 revision_id;
+	u32 abort;
+	u16 virtual_processor_id;
+	u16 guest_es_selector;
+	u16 guest_cs_selector;
+	u16 guest_ss_selector;
+	u16 guest_ds_selector;
+	u16 guest_fs_selector;
+	u16 guest_gs_selector;
+	u16 guest_ldtr_selector;
+	u16 guest_tr_selector;
+	u16 host_es_selector;
+	u16 host_cs_selector;
+	u16 host_ss_selector;
+	u16 host_ds_selector;
+	u16 host_fs_selector;
+	u16 host_gs_selector;
+	u16 host_tr_selector;
+	u64 io_bitmap_a;
+	u64 io_bitmap_b;
+	u64 msr_bitmap;
+	u64 vm_exit_msr_store_addr;
+	u64 vm_exit_msr_load_addr;
+	u64 vm_entry_msr_load_addr;
+	u64 tsc_offset;
+	u64 virtual_apic_page_addr;
+	u64 apic_access_addr;
+	u64 ept_pointer;
+	u64 guest_physical_address;
+	u64 vmcs_link_pointer;
+	u64 guest_ia32_debugctl;
+	u64 guest_ia32_pat;
+	u64 guest_pdptr0;
+	u64 guest_pdptr1;
+	u64 guest_pdptr2;
+	u64 guest_pdptr3;
+	u64 host_ia32_pat;
+	u32 pin_based_vm_exec_control;
+	u32 cpu_based_vm_exec_control;
+	u32 exception_bitmap;
+	u32 page_fault_error_code_mask;
+	u32 page_fault_error_code_match;
+	u32 cr3_target_count;
+	u32 vm_exit_controls;
+	u32 vm_exit_msr_store_count;
+	u32 vm_exit_msr_load_count;
+	u32 vm_entry_controls;
+	u32 vm_entry_msr_load_count;
+	u32 vm_entry_intr_info_field;
+	u32 vm_entry_exception_error_code;
+	u32 vm_entry_instruction_len;
+	u32 tpr_threshold;
+	u32 secondary_vm_exec_control;
+	u32 vm_instruction_error;
+	u32 vm_exit_reason;
+	u32 vm_exit_intr_info;
+	u32 vm_exit_intr_error_code;
+	u32 idt_vectoring_info_field;
+	u32 idt_vectoring_error_code;
+	u32 vm_exit_instruction_len;
+	u32 vmx_instruction_info;
+	u32 guest_es_limit;
+	u32 guest_cs_limit;
+	u32 guest_ss_limit;
+	u32 guest_ds_limit;
+	u32 guest_fs_limit;
+	u32 guest_gs_limit;
+	u32 guest_ldtr_limit;
+	u32 guest_tr_limit;
+	u32 guest_gdtr_limit;
+	u32 guest_idtr_limit;
+	u32 guest_es_ar_bytes;
+	u32 guest_cs_ar_bytes;
+	u32 guest_ss_ar_bytes;
+	u32 guest_ds_ar_bytes;
+	u32 guest_fs_ar_bytes;
+	u32 guest_gs_ar_bytes;
+	u32 guest_ldtr_ar_bytes;
+	u32 guest_tr_ar_bytes;
+	u32 guest_interruptibility_info;
+	u32 guest_activity_state;
+	u32 guest_sysenter_cs;
+	u32 host_ia32_sysenter_cs;
+	unsigned long cr0_guest_host_mask;
+	unsigned long cr4_guest_host_mask;
+	unsigned long cr0_read_shadow;
+	unsigned long cr4_read_shadow;
+	unsigned long cr3_target_value0;
+	unsigned long cr3_target_value1;
+	unsigned long cr3_target_value2;
+	unsigned long cr3_target_value3;
+	unsigned long exit_qualification;
+	unsigned long guest_linear_address;
+	unsigned long guest_cr0;
+	unsigned long guest_cr3;
+	unsigned long guest_cr4;
+	unsigned long guest_es_base;
+	unsigned long guest_cs_base;
+	unsigned long guest_ss_base;
+	unsigned long guest_ds_base;
+	unsigned long guest_fs_base;
+	unsigned long guest_gs_base;
+	unsigned long guest_ldtr_base;
+	unsigned long guest_tr_base;
+	unsigned long guest_gdtr_base;
+	unsigned long guest_idtr_base;
+	unsigned long guest_dr7;
+	unsigned long guest_rsp;
+	unsigned long guest_rip;
+	unsigned long guest_rflags;
+	unsigned long guest_pending_dbg_exceptions;
+	unsigned long guest_sysenter_esp;
+	unsigned long guest_sysenter_eip;
+	unsigned long host_cr0;
+	unsigned long host_cr3;
+	unsigned long host_cr4;
+	unsigned long host_fs_base;
+	unsigned long host_gs_base;
+	unsigned long host_tr_base;
+	unsigned long host_gdtr_base;
+	unsigned long host_idtr_base;
+	unsigned long host_ia32_sysenter_esp;
+	unsigned long host_ia32_sysenter_eip
Re: [PATCH 3/5] Nested VMX patch 3 implements vmptrld and vmptrst
Gleb Natapov g...@redhat.com wrote on 19/10/2009 14:59:53:
From: Gleb Natapov g...@redhat.com
To: Orit Wasserman/Haifa/i...@ibmil
Cc: kvm@vger.kernel.org, Ben-Ami Yassour1/Haifa/i...@ibmil, Abel Gordon/Haifa/i...@ibmil, Muli Ben-Yehuda/Haifa/i...@ibmil, aligu...@us.ibm.com, md...@us.ibm.com
Date: 19/10/2009 15:00
Subject: Re: [PATCH 3/5] Nested VMX patch 3 implements vmptrld and vmptrst

On Thu, Oct 15, 2009 at 04:41:44PM +0200, or...@il.ibm.com wrote:

+static struct page *nested_get_page(struct kvm_vcpu *vcpu,
+				    u64 vmcs_addr)
+{
+	struct page *vmcs_page = NULL;
+
+	down_read(&current->mm->mmap_sem);
+	vmcs_page = gfn_to_page(vcpu->kvm, vmcs_addr >> PAGE_SHIFT);
+	up_read(&current->mm->mmap_sem);

Why are you taking mmap_sem here? gup_fast() takes it if required.

I will remove it.

+
+	if (is_error_page(vmcs_page)) {
+		printk(KERN_ERR "%s error allocating page\n", __func__);
+		kvm_release_page_clean(vmcs_page);
+		return NULL;
+	}
+
+	return vmcs_page;
+
+}
+

--
Gleb.
Re: [PATCH 4/5] Nested VMX patch 4 implements vmread and vmwrite
Gleb Natapov g...@redhat.com wrote on 19/10/2009 15:17:20:
From: Gleb Natapov g...@redhat.com
To: Orit Wasserman/Haifa/i...@ibmil
Cc: kvm@vger.kernel.org, Ben-Ami Yassour1/Haifa/i...@ibmil, Abel Gordon/Haifa/i...@ibmil, Muli Ben-Yehuda/Haifa/i...@ibmil, aligu...@us.ibm.com, md...@us.ibm.com
Date: 19/10/2009 15:17
Subject: Re: [PATCH 4/5] Nested VMX patch 4 implements vmread and vmwrite

On Thu, Oct 15, 2009 at 04:41:45PM +0200, or...@il.ibm.com wrote:

From: Orit Wasserman or...@il.ibm.com
---
 arch/x86/kvm/vmx.c | 591 +++ -
 1 files changed, 589 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 8c186e0..6a4c252 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -225,6 +225,21 @@ struct nested_vmx {
 	struct level_state *l1_state;
 };

+enum vmcs_field_type {
+	VMCS_FIELD_TYPE_U16 = 0,
+	VMCS_FIELD_TYPE_U64 = 1,
+	VMCS_FIELD_TYPE_U32 = 2,
+	VMCS_FIELD_TYPE_ULONG = 3
+};
+
+#define VMCS_FIELD_LENGTH_OFFSET 13
+#define VMCS_FIELD_LENGTH_MASK 0x6000
+
+static inline int vmcs_field_length(unsigned long field)
+{
+	return (VMCS_FIELD_LENGTH_MASK & field) >> 13;
+}
+
 struct vmcs {
 	u32 revision_id;
 	u32 abort;
@@ -288,6 +303,404 @@ static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
 	return container_of(vcpu, struct vcpu_vmx, vcpu);
 }

+#define SHADOW_VMCS_OFFSET(x) offsetof(struct shadow_vmcs, x)
+
+static unsigned short vmcs_field_to_offset_table[HOST_RIP+1] = {
+
+	[VIRTUAL_PROCESSOR_ID] = SHADOW_VMCS_OFFSET(virtual_processor_id),
+	[GUEST_ES_SELECTOR] = SHADOW_VMCS_OFFSET(guest_es_selector),
+	[GUEST_CS_SELECTOR] = SHADOW_VMCS_OFFSET(guest_cs_selector),
+	[GUEST_SS_SELECTOR] = SHADOW_VMCS_OFFSET(guest_ss_selector),
+	[GUEST_DS_SELECTOR] = SHADOW_VMCS_OFFSET(guest_ds_selector),
+	[GUEST_FS_SELECTOR] = SHADOW_VMCS_OFFSET(guest_fs_selector),
+	[GUEST_GS_SELECTOR] = SHADOW_VMCS_OFFSET(guest_gs_selector),
+	[GUEST_LDTR_SELECTOR] = SHADOW_VMCS_OFFSET(guest_ldtr_selector),
+	[GUEST_TR_SELECTOR] = SHADOW_VMCS_OFFSET(guest_tr_selector),
+	[HOST_ES_SELECTOR] = SHADOW_VMCS_OFFSET(host_es_selector),
+	[HOST_CS_SELECTOR] = SHADOW_VMCS_OFFSET(host_cs_selector),
+	[HOST_SS_SELECTOR] = SHADOW_VMCS_OFFSET(host_ss_selector),
+	[HOST_DS_SELECTOR] = SHADOW_VMCS_OFFSET(host_ds_selector),
+	[HOST_FS_SELECTOR] = SHADOW_VMCS_OFFSET(host_fs_selector),
+	[HOST_GS_SELECTOR] = SHADOW_VMCS_OFFSET(host_gs_selector),
+	[HOST_TR_SELECTOR] = SHADOW_VMCS_OFFSET(host_tr_selector),
+	[IO_BITMAP_A] = SHADOW_VMCS_OFFSET(io_bitmap_a),
+	[IO_BITMAP_A_HIGH] = SHADOW_VMCS_OFFSET(io_bitmap_a)+4,
+	[IO_BITMAP_B] = SHADOW_VMCS_OFFSET(io_bitmap_b),
+	[IO_BITMAP_B_HIGH] = SHADOW_VMCS_OFFSET(io_bitmap_b)+4,
+	[MSR_BITMAP] = SHADOW_VMCS_OFFSET(msr_bitmap),
+	[MSR_BITMAP_HIGH] = SHADOW_VMCS_OFFSET(msr_bitmap)+4,
+	[VM_EXIT_MSR_STORE_ADDR] = SHADOW_VMCS_OFFSET(vm_exit_msr_store_addr),
+	[VM_EXIT_MSR_STORE_ADDR_HIGH] = SHADOW_VMCS_OFFSET(vm_exit_msr_store_addr)+4,
+	[VM_EXIT_MSR_LOAD_ADDR] = SHADOW_VMCS_OFFSET(vm_exit_msr_load_addr),
+	[VM_EXIT_MSR_LOAD_ADDR_HIGH] = SHADOW_VMCS_OFFSET(vm_exit_msr_load_addr)+4,
+	[VM_ENTRY_MSR_LOAD_ADDR] = SHADOW_VMCS_OFFSET(vm_entry_msr_load_addr),
+	[VM_ENTRY_MSR_LOAD_ADDR_HIGH] = SHADOW_VMCS_OFFSET(vm_entry_msr_load_addr)+4,
+	[TSC_OFFSET] = SHADOW_VMCS_OFFSET(tsc_offset),
+	[TSC_OFFSET_HIGH] = SHADOW_VMCS_OFFSET(tsc_offset)+4,
+	[VIRTUAL_APIC_PAGE_ADDR] = SHADOW_VMCS_OFFSET(virtual_apic_page_addr),
+	[VIRTUAL_APIC_PAGE_ADDR_HIGH] = SHADOW_VMCS_OFFSET(virtual_apic_page_addr)+4,
+	[APIC_ACCESS_ADDR] = SHADOW_VMCS_OFFSET(apic_access_addr),
+	[APIC_ACCESS_ADDR_HIGH] = SHADOW_VMCS_OFFSET(apic_access_addr)+4,
+	[EPT_POINTER] = SHADOW_VMCS_OFFSET(ept_pointer),
+	[EPT_POINTER_HIGH] = SHADOW_VMCS_OFFSET(ept_pointer)+4,
+	[GUEST_PHYSICAL_ADDRESS] = SHADOW_VMCS_OFFSET(guest_physical_address),
+	[GUEST_PHYSICAL_ADDRESS_HIGH] = SHADOW_VMCS_OFFSET(guest_physical_address)+4,
+	[VMCS_LINK_POINTER] = SHADOW_VMCS_OFFSET(vmcs_link_pointer),
+	[VMCS_LINK_POINTER_HIGH] = SHADOW_VMCS_OFFSET(vmcs_link_pointer)+4,
+	[GUEST_IA32_DEBUGCTL] = SHADOW_VMCS_OFFSET(guest_ia32_debugctl),
+	[GUEST_IA32_DEBUGCTL_HIGH] = SHADOW_VMCS_OFFSET(guest_ia32_debugctl)+4,
+	[GUEST_IA32_PAT] = SHADOW_VMCS_OFFSET(guest_ia32_pat),
+	[GUEST_IA32_PAT_HIGH] = SHADOW_VMCS_OFFSET
Re: Nested VMX support v3
Avi Kivity a...@redhat.com wrote on 20/10/2009 05:30:34:
From: Avi Kivity a...@redhat.com
To: Orit Wasserman/Haifa/i...@ibmil
Cc: kvm@vger.kernel.org, Ben-Ami Yassour1/Haifa/i...@ibmil, Abel Gordon/Haifa/i...@ibmil, Muli Ben-Yehuda/Haifa/i...@ibmil, aligu...@us.ibm.com, md...@us.ibm.com
Date: 20/10/2009 05:30
Subject: Re: Nested VMX support v3

On 10/15/2009 11:41 PM, or...@il.ibm.com wrote:

Avi, we have addressed all of the comments, please apply. The following patches implement nested VMX support. The patches enable a guest to use the VMX APIs in order to run its own nested guest (i.e., enable running other hypervisors which use VMX under KVM). The current patches support running Linux under a nested KVM using shadow page tables (with bypass_guest_pf disabled). SMP support was fixed. Reworking EPT support to mesh cleanly with the current shadow paging design per Avi's comments is a work in progress.

Why is bypass_guest_pf disabled?

It was not implemented. We need to modify the walk_addr code to handle the sptes that have invalid content (used only for the bypass_guest_pf optimization) and identify them as not present. Maybe remove some other validity checks too.

The current patches only support a single nested hypervisor, which can only run a single guest (multiple guests are a work in progress). Only 64-bit nested hypervisors are supported.

Multiple guests and 32-bit support are merge requirements. As far as I can tell there shouldn't be anything special required to support them?

Ok.

vpid allocation will be updated with the multiguest support (work in progress). We are working on fixing the cr0.TS handling; it works for nested kvm but not for vmware server.

Please either drop or fix vpid before merging. What's wrong with cr0.ts? I'd like to see that fixed as well.

Ok.
Re: [PATCH 3/6] Nested VMX patch 3 implements vmptrld and vmptrst
Avi Kivity a...@redhat.com wrote on 06/09/2009 12:25:17:
From: Avi Kivity a...@redhat.com
To: Orit Wasserman/Haifa/i...@ibmil
Cc: Abel Gordon/Haifa/i...@ibmil, aligu...@us.ibm.com, Ben-Ami Yassour1/Haifa/i...@ibmil, kvm@vger.kernel.org, mm...@us.ibm.com, Muli Ben-Yehuda/Haifa/i...@ibmil
Date: 06/09/2009 12:25
Subject: Re: [PATCH 3/6] Nested VMX patch 3 implements vmptrld and vmptrst

On 09/03/2009 05:25 PM, Orit Wasserman wrote:

+	/*
+	 * Level 2 state : includes vmcs,registers and
+	 * a copy of vmcs12 for vmread/vmwrite
+	 */
+	struct level_state *l2_state;
+
+	/* Level 1 state for switching to level 2 and back */
+	struct level_state *l1_state;

Can you explain why we need two of them? In the guest vmcs we have host and guest values, and in l1_state and l2_state we have more copies, and in struct vcpu we have yet another set of copies. We also have a couple of copies in the host vmcs. I'm getting dizzy...

l2_state stores all the L2 guest state:
vmcs - a pointer to VMCS02, the VMCS used by L0 to run L2.
shadow vmcs - a structure storing the values of VMCS12 (the vmcs L1 creates to run L2).

When we support multiple nested guests, we'll run into a problem of where to store shadow_vmcs. I see these options:
- maintain a cache of limited size of shadow_vmcs; when evicting, copy the shadow_vmcs into the guest's vmptr
- always put shadow_vmcs in the guest's vmptr, and write protect it so the guest can't play with it
- always put shadow_vmcs in the guest's vmptr, and verify everything you read (that's what nsvm does)

The second option looks a bit complicated; I prefer one of the other two.

cpu - the cpu id

Why is it needed?

This is a copy of the cpu id from the vcpu, to store the last cpu id the L2 guest ran on.

launched - launched flag

Can be part of shadow_vmcs

I prefer to keep the shadow_vmcs as a separate structure to store only VMCS fields.

vpid - the vpid allocated by L0 for L2 (we need to store it somewhere)

Note the guest can DoS the host by allocating a lot of vpids. So we need to allocate host vpids on demand and be able to flush them out.

The guest is not allocating the vpids, the host (L0) does, using allocate_vpid. I agree that with nested the danger of running out of them is bigger.

msr_bitmap - at the moment we use the L0 msr_bitmap (as we are running kvm on kvm); in the future we will use a merge of both bitmaps.

Note kvm uses two bitmaps (for long mode and legacy mode).

OK.

l1_state stores the L1 state:
vmcs - pointer to VMCS01

So it's the same as vmx->vmcs in normal operation?

yes, but with nested the vmx->vmcs is changed when running an L2 (nested) guest.

shadow vmcs - a structure storing the values of VMCS01. We use it when updating VMCS02, in order to avoid the need to switch between VMCS02 and VMCS01.

Sorry, don't understand.

VMCS02 is the VMCS L0 uses to run L2. When we create/update VMCS02 we need to read fields from VMCS01 (host state is taken fully, control fields ...). For L1 the shadow_vmcs is a copy of VMCS01 in a structure format; we used the same structure.

cpu - the cpu id
launched - launched flag

This is a copy of vmx->launched?

Exactly. The vmx->launched is updated when switching from L1/L2 and back, so we need to store it here.

+
+	if (vmx->nested.l1_cur_vmcs != guest_vmcs_addr) {
+		vmcs_page = nested_get_page(vcpu, guest_vmcs_addr);
+		if (vmcs_page == NULL)
+			return 1;
+
+		/* load nested vmcs to processor */
+		if (vmptrld(vcpu, page_to_phys(vmcs_page))) {

So, you're loading a guest page as the vmcs. This is dangerous as the guest can play with it. Much better to use inaccessible memory (and you do alloc_vmcs() earlier?)

We can copy the vmcs and then vmptrld it. As for the allocated vmcs, this is a memory leak and I will fix it (it should be allocated only once).

But why do it? Your approach is to store the guest vmcs in the same format as the processor (which we don't really know), so you have to use vmread/vmwrite to maintain it.

Instead, you can choose that the guest vmcs is a shadow_vmcs structure, and then you can access it using normal memory operations.

I got it now. We will need a way to distinguish between a processor-format VMCS and a structure-based VMCS; we can use the revision id field (create a unique revision id for nested like 0x or 0x0).

I see. You're using the processor's format when reading the guest vmcs. But we don't have to do that; we can use the shadow_vmcs structure (and a memcpy).

I'm sorry, I don't understand your comment, can you elaborate?

See previous comment. Basically you can do

	struct shadow_vmcs *svmcs = kmap_atomic(gpa_to_page(vmx->vmptr));
	printk("guest_cs = %x\n", svmcs->guest_cs_selector);

instead of vmptrld(gpa_to_hpa(vmx->vmptr
Re: [PATCH 5/6] Nested VMX patch 5 implements vmlaunch and vmresume
Avi Kivity a...@redhat.com wrote on 06/09/2009 12:29:58:

From: Avi Kivity a...@redhat.com
To: Orit Wasserman/Haifa/i...@ibmil
Cc: Abel Gordon/Haifa/i...@ibmil, aligu...@us.ibm.com, Ben-Ami Yassour1/Haifa/i...@ibmil, kvm@vger.kernel.org, mm...@us.ibm.com, Muli Ben-Yehuda/Haifa/i...@ibmil
Date: 06/09/2009 12:30
Subject: Re: [PATCH 5/6] Nested VMX patch 5 implements vmlaunch and vmresume

On 09/03/2009 05:53 PM, Orit Wasserman wrote:

Need to auto-ack the interrupt if requested by the guest.

is_interrupt means L1 has interrupts; kvm's regular code will handle it.

If the VM-Exit Controls bit 15 (Acknowledge interrupts on exit) is set, when the nested guest exits you need to run kvm_cpu_get_interrupt() and put the vector number in the VM-Exit interruption-information field. kvm doesn't set this bit, but I think Xen does.

VMware doesn't set it either. We have to run L2 with the bit off even if the L1 hypervisor set it, and emulate the L2 exit to the L1 hypervisor correctly. I will look at it.

--
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/6] Nested VMX patch 3 implements vmptrld and vmptrst
Avi Kivity a...@redhat.com wrote on 06/09/2009 16:52:56:

From: Avi Kivity a...@redhat.com
To: Orit Wasserman/Haifa/i...@ibmil
Cc: Abel Gordon/Haifa/i...@ibmil, aligu...@us.ibm.com, Ben-Ami Yassour1/Haifa/i...@ibmil, kvm@vger.kernel.org, m...@us.ibm.com, Muli Ben-Yehuda/Haifa/i...@ibmil
Date: 06/09/2009 16:53
Subject: Re: [PATCH 3/6] Nested VMX patch 3 implements vmptrld and vmptrst

On 09/06/2009 04:36 PM, Orit Wasserman wrote:

When we support multiple nested guests, we'll run into a problem of where to store shadow_vmcs. I see these options:

- maintain a cache of limited size of shadow_vmcs; when evicting, copy the shadow_vmcs into the guest's vmptr
- always put shadow_vmcs in the guest's vmptr, and write protect it so the guest can't play with it
- always put shadow_vmcs in the guest's vmptr, and verify everything you read (that's what nsvm does)

The second option looks a bit complicated; I prefer one of the other two.

I agree, the third option looks easiest, but I'm not sure how much verification is needed.

Note other things like the msr bitmaps may need write protection, otherwise you have to re-merge the bitmap on every guest entry, which can be very slow. So we may be forced to add write protection anyway.

We will also need to write-protect L1's EPT tables, to allow L1 to swap out its guests.

launched - launched flag

Can be part of shadow_vmcs.

I prefer to keep the shadow_vmcs as a separate structure to store only VMCS fields.

It is a vmcs field - it is manipulated by vmx instructions which operate on the vmcs. You'll need to store it in guest memory when you support multiple nested guests. You can put the vmcs fields in a sub-structure if you want to separate between explicit fields and implicit fields (I can only see one implicit field (launched), but maybe there are more).

OK.

vpid - the vpid allocated by L0 for L2 (we need to store it somewhere)

Note the guest can DoS the host by allocating a lot of vpids.
So we need to allocate host vpids on demand and be able to flush them out.

The guest is not allocating the vpids; the host (L0) does, using allocate_vpid.

I meant, the guest can force the host to allocate vpids if we don't protect against it.

You mean by launching a lot of guests? We can limit the number of guests as a very quick solution. More complicated is limiting the number of vpids per L1 hypervisor and reusing them. This means we will sometimes need to invalidate the vpid when switching between L2 guests. I agree that with nesting the danger of them running out is bigger.

Sorry, don't understand.

VMCS02 - the VMCS L0 uses to run L2. When we create/update VMCS02 we need to read fields from VMCS01 (host state is taken fully, control fields ...). For L1 the shadow_vmcs is a copy of VMCS01 in structure format; we used the same structure.

I don't understand why you need it. Host state shouldn't change. Only the control fields are interesting, and things like exception_bitmap.

I think that when KVM switches to Qemu the host state can change (the L0 host state). If this happens between different runs of L2 we will need to update the VMCS02 host state. Of course we can optimize and update it only then.

But why do it? Your approach is to store the guest vmcs in the same format as the processor (which we don't really know), so you have to use vmread/vmwrite to maintain it. Instead, you can choose that the guest vmcs is a shadow_vmcs structure, and then you can access it using normal memory operations.

I got it now. We will need a way to distinguish between processor-format VMCS and structure-based VMCS; we can use the revision id field (create a unique revision id for nested, like 0x or 0x0).

No, you can always store the guest vmcs in software format, since we'll never load it with vmptrld. We'll only load a real vmcs with vmptrld.

You are right; a new VMCS will be zeroed.
Note it also solves live migration, since now all guest vmcss are copied as part of normal guest memory (including their launched state).

Great.
Re: [PATCH 1/6] Nested VMX patch 1 implements vmon and vmoff
Avi Kivity a...@redhat.com wrote on 02/09/2009 22:34:58:

From: Avi Kivity a...@redhat.com
To: Orit Wasserman/Haifa/i...@ibmil
Cc: kvm@vger.kernel.org, Ben-Ami Yassour1/Haifa/i...@ibmil, Muli Ben-Yehuda/Haifa/i...@ibmil, Abel Gordon/Haifa/i...@ibmil, aligu...@us.ibm.com, mm...@us.ibm.com
Date: 02/09/2009 22:34
Subject: Re: [PATCH 1/6] Nested VMX patch 1 implements vmon and vmoff

On 09/02/2009 06:38 PM, or...@il.ibm.com wrote:

From: Orit Wasserman or...@il.ibm.com
---
 arch/x86/kvm/svm.c |   3 -
 arch/x86/kvm/vmx.c | 187 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c |   6 ++-
 arch/x86/kvm/x86.h |   2 +
 4 files changed, 192 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 2df9b45..3c1f22a 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -124,9 +124,6 @@ static int npt = 1;
 module_param(npt, int, S_IRUGO);

-static int nested = 1;
-module_param(nested, int, S_IRUGO);
-
 static void svm_flush_tlb(struct kvm_vcpu *vcpu);
 static void svm_complete_interrupts(struct vcpu_svm *svm);

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 78101dd..abba325 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -67,6 +67,11 @@ struct vmcs {
 	char data[0];
 };

+struct nested_vmx {
+	/* Has the level1 guest done vmon? */
+	bool vmon;
+};

vmxon

fixed

@@ -967,6 +975,69 @@ static void guest_write_tsc(u64 guest_tsc, u64 host_tsc)
 }

 /*
+ * Handles msr read for nested virtualization
+ */
+static int nested_vmx_get_msr(struct kvm_vcpu *vcpu, u32 msr_index,
+			      u64 *pdata)
+{
+	u32 vmx_msr_low = 0, vmx_msr_high = 0;
+
+	switch (msr_index) {
+	case MSR_IA32_FEATURE_CONTROL:
+		*pdata = 0;
+		break;
+	case MSR_IA32_VMX_BASIC:
+		rdmsr(MSR_IA32_VMX_BASIC, vmx_msr_low, vmx_msr_high);

Use rdmsrl, it's easier.

fixed

I think we need to mask it with the capabilities we support. Otherwise the guest can try to use some new feature which we don't support yet, and crash.

I agree, but I went over the Intel spec and didn't find any problematic feature.
We may need to consider it in the future.

+		*pdata = vmx_msr_low | ((u64)vmx_msr_high << 32);
+		break;
+	case MSR_IA32_VMX_PINBASED_CTLS:
+		*pdata = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING |
+			PIN_BASED_VIRTUAL_NMIS;

Need to mask with actual cpu capabilities in case we run on an older cpu.

fixed

+		break;
+	case MSR_IA32_VMX_PROCBASED_CTLS:
+		*pdata = CPU_BASED_HLT_EXITING |
+#ifdef CONFIG_X86_64
+			CPU_BASED_CR8_LOAD_EXITING |
+			CPU_BASED_CR8_STORE_EXITING |
+#endif
+			CPU_BASED_CR3_LOAD_EXITING |
+			CPU_BASED_CR3_STORE_EXITING |
+			CPU_BASED_USE_IO_BITMAPS |
+			CPU_BASED_MOV_DR_EXITING |
+			CPU_BASED_USE_TSC_OFFSETING |
+			CPU_BASED_INVLPG_EXITING;

Same here... or are all these guaranteed to be present?

fixed

+
+static int handle_vmon(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment cs;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!nested) {
+		printk(KERN_DEBUG "%s: nested vmx not enabled\n", __func__);
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+
+	if (!(vcpu->arch.cr4 & X86_CR4_VMXE) ||
+	    !(vcpu->arch.cr0 & X86_CR0_PE) ||
+	    (vmx_get_rflags(vcpu) & X86_EFLAGS_VM) ||
+	    ((find_msr_entry(to_vmx(vcpu),
+			     MSR_EFER)->data & EFER_LMA) && !cs.l)) {

Not everyone has EFER. Better to wrap this in an #ifdef CONFIG_X86_64.

fixed

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH 2/6] Nested VMX patch 2 implements vmclear
Avi Kivity a...@redhat.com wrote on 02/09/2009 22:38:22:

From: Avi Kivity a...@redhat.com
To: Orit Wasserman/Haifa/i...@ibmil
Cc: kvm@vger.kernel.org, Ben-Ami Yassour1/Haifa/i...@ibmil, Muli Ben-Yehuda/Haifa/i...@ibmil, Abel Gordon/Haifa/i...@ibmil, aligu...@us.ibm.com, mm...@us.ibm.com
Date: 02/09/2009 23:01
Subject: Re: [PATCH 2/6] Nested VMX patch 2 implements vmclear

On 09/02/2009 06:38 PM, or...@il.ibm.com wrote:

From: Orit Wasserman or...@il.ibm.com
---
 arch/x86/kvm/vmx.c | 24 +++-
 1 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index abba325..2b1fc3b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -70,6 +70,8 @@ struct vmcs {

 struct nested_vmx {
 	/* Has the level1 guest done vmon? */
 	bool vmon;
+	/* Has the level1 guest done vmclear? */
+	bool vmclear;
 };

Doesn't seem these two belong in the same structure - vmclear is per-vmcs... but you're probably aware of that with the multi-guest support coming.

You are right, the vmclear flag is part of the L2 guest state.
Re: [PATCH 3/6] Nested VMX patch 3 implements vmptrld and vmptrst
Avi Kivity a...@redhat.com wrote on 02/09/2009 23:05:09:

From: Avi Kivity a...@redhat.com
To: Orit Wasserman/Haifa/i...@ibmil
Cc: kvm@vger.kernel.org, Ben-Ami Yassour1/Haifa/i...@ibmil, Muli Ben-Yehuda/Haifa/i...@ibmil, Abel Gordon/Haifa/i...@ibmil, aligu...@us.ibm.com, mm...@us.ibm.com
Date: 02/09/2009 23:04
Subject: Re: [PATCH 3/6] Nested VMX patch 3 implements vmptrld and vmptrst

On 09/02/2009 06:38 PM, or...@il.ibm.com wrote:

+struct __attribute__ ((__packed__)) level_state {
+	struct shadow_vmcs *shadow_vmcs;
+
+	u16 vpid;
+	u64 shadow_efer;
+	unsigned long cr2;
+	unsigned long cr3;
+	unsigned long cr4;
+	unsigned long cr8;
+
+	u64 io_bitmap_a;
+	u64 io_bitmap_b;
+	u64 msr_bitmap;
+
+	struct vmcs *vmcs;
+	int cpu;
+	int launched;
+};
+
 struct vmcs {
 	u32 revision_id;
 	u32 abort;

@@ -72,6 +217,17 @@ struct nested_vmx {
 	bool vmon;
 	/* Has the level1 guest done vmclear? */
 	bool vmclear;
+	/* What is the location of the vmcs l1 keeps for l2? (in level1 gpa) */
+	u64 l1_cur_vmcs;

This is the vmptr (exactly as loaded by vmptrld), right? If so, please call it vmptr.

OK, I will change it.

+	/*
+	 * Level 2 state: includes vmcs, registers and
+	 * a copy of vmcs12 for vmread/vmwrite
+	 */
+	struct level_state *l2_state;
+
+	/* Level 1 state for switching to level 2 and back */
+	struct level_state *l1_state;

Can you explain why we need two of them? In the guest vmcs we have host and guest values, and in l1_state and l2_state we have more copies, and in struct vcpu we have yet another set of copies. We also have a couple of copies in the host vmcs. I'm getting dizzy...

l2_state stores all the L2 guest state:

vmcs - a pointer to VMCS02, the VMCS L0 uses to run it.
shadow vmcs - a structure storing the values of VMCS12 (the vmcs L1 created to run L2).
cpu - the cpu id
launched - launched flag
vpid - the vpid allocated by L0 for L2 (we need to store it somewhere)
msr_bitmap - at the moment we use the L0 msr_bitmap (as we are running kvm on kvm); in the future we will use a merge of both bitmaps.
io_bitmaps - at the moment we use the L0 io_bitmaps (as we are running kvm on kvm); in the future we will use a merge of both io_bitmaps.

l1_state stores the L1 state:

vmcs - pointer to VMCS01
shadow vmcs - a structure storing the values of VMCS01. We use it when updating VMCS02 in order to avoid the need to switch between VMCS02 and VMCS01.
cpu - the cpu id
launched - launched flag
vpid - the vpid allocated by L0 for L1 (we need to store it somewhere)
shadow_efer - until recently EFER wasn't part of the VMCS, so it may be needed for older processors.
cr0 - not used, I will remove it
cr2 - not used, I will remove it
cr3
cr4

We didn't use the state stored in the vcpu for L1 because it sometimes changes during an L2 run. The vmcs in the vcpu points to the active vmcs: VMCS01 when L1 is running and VMCS02 when an L2 guest is running.

 static int init_rmode(struct kvm *kvm);
 static u64 construct_eptp(unsigned long root_hpa);

+int read_guest_vmcs_gpa(struct kvm_vcpu *vcpu, u64 *gentry)
+{
+	gpa_t gpa;
+	struct page *page;
+	int r = 0;
+
+	gpa = vcpu->arch.mmu.gva_to_gpa(vcpu, vcpu->arch.regs[VCPU_REGS_RAX]);
+
+	/* checking guest gpa */
+	page = gfn_to_page(vcpu->kvm, gpa >> PAGE_SHIFT);
+	if (is_error_page(page)) {
+		printk(KERN_ERR "%s Invalid guest vmcs addr %llx\n",
+		       __func__, gpa);
+		r = 1;
+		goto out;
+	}
+
+	r = kvm_read_guest(vcpu->kvm, gpa, gentry, sizeof(u64));
+	if (r) {
+		printk(KERN_ERR "%s cannot read guest vmcs addr %llx : %d\n",
+		       __func__, gpa, r);
+		goto out;
+	}

You can use kvm_read_guest_virt() to simplify this.

I will fix it.
+
+	if (!IS_ALIGNED(*gentry, PAGE_SIZE)) {
+		printk(KERN_DEBUG "%s addr %llx not aligned\n",
+		       __func__, *gentry);
+		return 1;
+	}
+
+out:
+	kvm_release_page_clean(page);
+	return r;
+}
+
+static int handle_vmptrld(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct page *vmcs_page;
+	u64 guest_vmcs_addr;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (read_guest_vmcs_gpa(vcpu, &guest_vmcs_addr))
+		return 1;
+
+	if (create_l1_state(vcpu)) {
+		printk(KERN_ERR "%s create_l1_state failed\n", __func__);
+		return 1;
+	}
+
+	if (create_l2_state(vcpu)) {
+		printk(KERN_ERR "%s create_l2_state failed\n", __func__);
+		return 1;
+	}
+
+	vmx->nested.l2_state->vmcs = alloc_vmcs();
+	if (!vmx->nested.l2_state->vmcs) {
+		printk(KERN_ERR "%s error in creating level 2 vmcs", __func__);
+		return 1
Re: [PATCH 2/2] Nested VMX patch 4 implements vmread and vmwrite
Avi Kivity a...@redhat.com wrote on 02/09/2009 23:15:40:

From: Avi Kivity a...@redhat.com
To: Orit Wasserman/Haifa/i...@ibmil
Cc: kvm@vger.kernel.org, Ben-Ami Yassour1/Haifa/i...@ibmil, Muli Ben-Yehuda/Haifa/i...@ibmil, Abel Gordon/Haifa/i...@ibmil, aligu...@us.ibm.com, mm...@us.ibm.com
Date: 02/09/2009 23:15
Subject: Re: [PATCH 2/2] Nested VMX patch 4 implements vmread and vmwrite

On 09/02/2009 06:38 PM, or...@il.ibm.com wrote:

+
+static void init_vmcs_field_to_offset_table(void)
+{
+	memset(vmcs_field_to_offset_table, 0xff,
+	       sizeof(vmcs_field_to_offset_table));
+
+	vmcs_field_to_offset_table[VIRTUAL_PROCESSOR_ID] =
+		offsetof(struct shadow_vmcs, virtual_processor_id);
+	vmcs_field_to_offset_table[GUEST_ES_SELECTOR] =
+		offsetof(struct shadow_vmcs, guest_es_selector);
+	vmcs_field_to_offset_table[GUEST_CS_SELECTOR] =
+		offsetof(struct shadow_vmcs, guest_cs_selector);
+	vmcs_field_to_offset_table[GUEST_SS_SELECTOR] =
+		offsetof(struct shadow_vmcs, guest_ss_selector);
+	vmcs_field_to_offset_table[GUEST_DS_SELECTOR] =
+		offsetof(struct shadow_vmcs, guest_ds_selector);
+	vmcs_field_to_offset_table[GUEST_FS_SELECTOR] =
+		offsetof(struct shadow_vmcs, guest_fs_selector);
+	vmcs_field_to_offset_table[GUEST_GS_SELECTOR] =
+		offsetof(struct shadow_vmcs, guest_gs_selector);
+	vmcs_field_to_offset_table[GUEST_LDTR_SELECTOR] =
+		offsetof(struct shadow_vmcs, guest_ldtr_selector);
+	vmcs_field_to_offset_table[GUEST_TR_SELECTOR] =
+		offsetof(struct shadow_vmcs, guest_tr_selector);
+	vmcs_field_to_offset_table[HOST_ES_SELECTOR] =
+		offsetof(struct shadow_vmcs, host_es_selector);
+	vmcs_field_to_offset_table[HOST_CS_SELECTOR] =
+		offsetof(struct shadow_vmcs, host_cs_selector);
+	vmcs_field_to_offset_table[HOST_SS_SELECTOR] =
+		offsetof(struct shadow_vmcs, host_ss_selector);
+	vmcs_field_to_offset_table[HOST_DS_SELECTOR] =
+		offsetof(struct shadow_vmcs, host_ds_selector);
+	vmcs_field_to_offset_table[HOST_FS_SELECTOR] =
+		offsetof(struct shadow_vmcs, host_fs_selector);
+	vmcs_field_to_offset_table[HOST_GS_SELECTOR] =
+		offsetof(struct shadow_vmcs, host_gs_selector);
+	vmcs_field_to_offset_table[HOST_TR_SELECTOR] =
+		offsetof(struct shadow_vmcs, host_tr_selector);
+	vmcs_field_to_offset_table[IO_BITMAP_A] =
+		offsetof(struct shadow_vmcs, io_bitmap_a);
+	vmcs_field_to_offset_table[IO_BITMAP_A_HIGH] =
+		offsetof(struct shadow_vmcs, io_bitmap_a)+4;
+	vmcs_field_to_offset_table[IO_BITMAP_B] =
+		offsetof(struct shadow_vmcs, io_bitmap_b);
+	vmcs_field_to_offset_table[IO_BITMAP_B_HIGH] =
+		offsetof(struct shadow_vmcs, io_bitmap_b)+4;
+	vmcs_field_to_offset_table[MSR_BITMAP] =
+		offsetof(struct shadow_vmcs, msr_bitmap);
+	vmcs_field_to_offset_table[MSR_BITMAP_HIGH] =
+		offsetof(struct shadow_vmcs, msr_bitmap)+4;
+	vmcs_field_to_offset_table[VM_EXIT_MSR_STORE_ADDR] =
+		offsetof(struct shadow_vmcs, vm_exit_msr_store_addr);
+	vmcs_field_to_offset_table[VM_EXIT_MSR_STORE_ADDR_HIGH] =
+		offsetof(struct shadow_vmcs, vm_exit_msr_store_addr)+4;
+	vmcs_field_to_offset_table[VM_EXIT_MSR_LOAD_ADDR] =
+		offsetof(struct shadow_vmcs, vm_exit_msr_load_addr);
+	vmcs_field_to_offset_table[VM_EXIT_MSR_LOAD_ADDR_HIGH] =
+		offsetof(struct shadow_vmcs, vm_exit_msr_load_addr)+4;
+	vmcs_field_to_offset_table[VM_ENTRY_MSR_LOAD_ADDR] =
+		offsetof(struct shadow_vmcs, vm_entry_msr_load_addr);
+	vmcs_field_to_offset_table[VM_ENTRY_MSR_LOAD_ADDR_HIGH] =
+		offsetof(struct shadow_vmcs, vm_entry_msr_load_addr)+4;
+	vmcs_field_to_offset_table[TSC_OFFSET] =
+		offsetof(struct shadow_vmcs, tsc_offset);
+	vmcs_field_to_offset_table[TSC_OFFSET_HIGH] =
+		offsetof(struct shadow_vmcs, tsc_offset)+4;
+	vmcs_field_to_offset_table[VIRTUAL_APIC_PAGE_ADDR] =
+		offsetof(struct shadow_vmcs, virtual_apic_page_addr);
+	vmcs_field_to_offset_table[VIRTUAL_APIC_PAGE_ADDR_HIGH] =
+		offsetof(struct shadow_vmcs, virtual_apic_page_addr)+4;
+	vmcs_field_to_offset_table[APIC_ACCESS_ADDR] =
+		offsetof(struct shadow_vmcs, apic_access_addr);
+	vmcs_field_to_offset_table[APIC_ACCESS_ADDR_HIGH] =
+		offsetof(struct shadow_vmcs, apic_access_addr)+4;
+	vmcs_field_to_offset_table[EPT_POINTER] =
+		offsetof(struct shadow_vmcs, ept_pointer);
+	vmcs_field_to_offset_table[EPT_POINTER_HIGH] =
+		offsetof(struct shadow_vmcs, ept_pointer)+4;
+	vmcs_field_to_offset_table[GUEST_PHYSICAL_ADDRESS] =
+		offsetof(struct shadow_vmcs, guest_physical_address);
+	vmcs_field_to_offset_table[GUEST_PHYSICAL_ADDRESS_HIGH] =
+		offsetof(struct shadow_vmcs, guest_physical_address)+4;
+	vmcs_field_to_offset_table[VMCS_LINK_POINTER] =
+		offsetof(struct
Re: [PATCH 5/6] Nested VMX patch 5 implements vmlaunch and vmresume
Avi Kivity a...@redhat.com wrote on 03/09/2009 00:38:16:

From: Avi Kivity a...@redhat.com
To: Orit Wasserman/Haifa/i...@ibmil
Cc: kvm@vger.kernel.org, Ben-Ami Yassour1/Haifa/i...@ibmil, Muli Ben-Yehuda/Haifa/i...@ibmil, Abel Gordon/Haifa/i...@ibmil, aligu...@us.ibm.com, mm...@us.ibm.com
Date: 03/09/2009 00:38
Subject: Re: [PATCH 5/6] Nested VMX patch 5 implements vmlaunch and vmresume

On 09/02/2009 06:38 PM, or...@il.ibm.com wrote:

-struct nested_vmx {
-	/* Has the level1 guest done vmon? */
+struct nested_vmx { /* Has the level1 guest done vmon? */

A \n died here.

I will fix it.

 	bool vmon;
 	/* Has the level1 guest done vmclear? */
 	bool vmclear;
+
+	/* Are we running nested guest */
+	bool nested_mode;
+
+	/* L1 requested VMLAUNCH or VMRESUME but we didn't run L2 yet */
+	bool nested_run_pending;
+
+	/* flag indicating if there was a valid IDT after exiting from l2 */
+	bool nested_pending_valid_idt;

What does this mean? pending event?

I will rename it.

+
+static inline int nested_cpu_has_vmx_tpr_shadow(struct kvm_vcpu *vcpu)
+{
+	return to_vmx(vcpu)->nested.l2_state->shadow_vmcs->
+		cpu_based_vm_exec_control & CPU_BASED_TPR_SHADOW;
+}

Don't we need to check if the host supports it too?

We check it separately, but I can add it here.

+static inline bool nested_vm_need_virtualize_apic_accesses(struct kvm_vcpu
+*vcpu)
+{
+	struct shadow_vmcs *shadow = to_vmx(vcpu)->nested.l2_state->shadow_vmcs;
+
+	return (shadow->secondary_vm_exec_control &
+		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) &&
+		to_vmx(vcpu)->nested.l2_state->shadow_vmcs->apic_access_addr != 0;
+}

Why check apic_access_addr?

I will remove it.

+
+static inline int nested_cpu_has_vmx_ept(struct kvm_vcpu *vcpu)
+{
+	return to_vmx(vcpu)->nested.l2_state->shadow_vmcs->
+		secondary_vm_exec_control & SECONDARY_EXEC_ENABLE_EPT;
+}

Need to check if secondary controls enabled?

If the secondary controls are not enabled this field is zero.
+static void vmx_set_irq(struct kvm_vcpu *vcpu)
+{
+	if (to_vmx(vcpu)->nested.nested_mode)
+		return;

Why? Note if the guest didn't enable external interrupt exiting, we need to inject as usual.

I'll look into it.

+static int nested_handle_pending_idt(struct kvm_vcpu *vcpu)
+{

Again the name is confusing. pending_event_injection?

I will rename it.

+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int irq;
+	int type;
+	int errCodeValid;
+	u32 idt_vectoring_info;
+	u32 guest_intr;
+	bool nmi_window_open;
+	bool interrupt_window_open;
+
+	if (vmx->nested.nested_mode && vmx->nested.nested_pending_valid_idt) {
+		idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+		irq = idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
+		type = idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
+		errCodeValid = idt_vectoring_info &
+			VECTORING_INFO_DELIVER_CODE_MASK;
+
+		guest_intr = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
+		nmi_window_open =
+			!(guest_intr & (GUEST_INTR_STATE_STI |
+					GUEST_INTR_STATE_MOV_SS |
+					GUEST_INTR_STATE_NMI));
+
+		interrupt_window_open =
+			((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
+			 !(guest_intr & (GUEST_INTR_STATE_STI |
+					 GUEST_INTR_STATE_MOV_SS)));
+
+		if (type == INTR_TYPE_EXT_INTR && !interrupt_window_open) {
+			printk(KERN_INFO "IDT ignored, l2 interrupt window closed!\n");
+			return 0;
+		}

How can this happen? Unless it's on nested entry, in which case we need to abort the entry.

OK, I will fix it. The truth is I never saw it happen.

+
 #ifdef CONFIG_X86_64
 #define R r
 #define Q q
@@ -4646,6 +4842,15 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	nested_handle_pending_idt(vcpu);

You're not checking the return code (need to do that on entry).

I will fix it.

+
+	if (vmx->nested.nested_mode) {
+		vmcs_writel(GUEST_CR0, vmx->nested.l2_state->shadow_vmcs->guest_cr0);

Might not be legal. We may also want to force-enable caching. Lastly, don't we need to handle cr0.ts and cr0.mp specially to manage the fpu state?
We are working on implementing this correctly. kvm seems to handle it fine, but vmware doesn't like it.

+	if (vmx->nested.nested_mode)
+		vmx->nested.vmclear = 0;
+

Why?

I will check it.

 free_vmcs:
@@ -5122,6 +5339,228 @@ static int shadow_vmcs_load(struct kvm_vcpu *vcpu)
 	return 0;
 }

+void prepare_vmcs_12(struct kvm_vcpu *vcpu)
+{
+	struct shadow_vmcs *l2_shadow_vmcs =
+		to_vmx(vcpu)->nested.l2_state->shadow_vmcs;
+	struct shadow_vmcs *l1_shadow_vmcs =
+		to_vmx(vcpu)->nested.l1_state->shadow_vmcs;
+
+	l2_shadow_vmcs->guest_es_selector
Re: [PATCH 1/6] Nested VMX patch 1 implements vmon and vmoff
Avi Kivity a...@redhat.com wrote on 03/09/2009 16:39:09:

From: Avi Kivity a...@redhat.com
To: Orit Wasserman/Haifa/i...@ibmil
Cc: Abel Gordon/Haifa/i...@ibmil, aligu...@us.ibm.com, Ben-Ami Yassour1/Haifa/i...@ibmil, kvm@vger.kernel.org, mm...@us.ibm.com, Muli Ben-Yehuda/Haifa/i...@ibmil
Date: 03/09/2009 16:39
Subject: Re: [PATCH 1/6] Nested VMX patch 1 implements vmon and vmoff

On 09/03/2009 03:34 PM, Orit Wasserman wrote:

I think we need to mask it with the capabilities we support. Otherwise the guest can try to use some new feature which we don't support yet, and crash.

I agree, but I went over the Intel spec and didn't find any problematic feature. We may need to consider it in the future.

We need to do it, since we don't know anything about future processors.

OK