Re: KVM call for agenda for 2013-10-01

2013-09-30 Thread Orit Wasserman
On 09/24/2013 05:09 PM, Juan Quintela wrote:
 
 Hi
 
 Please, send any topic that you are interested in covering.
 
 Last week I forgot to send the call for topics.  We still have a topic there.
 
 Thanks, Juan.
 
 Agenda so far:
 - Talk about qemu reverse execution (1st description was done this week)
   How to handle IO when we want to do reverse execution.
   How does this relate to Kemari's needs?

Hi Juan,
I'm sorry, but I won't be able to attend the call tomorrow due to travel.

Regards,
Orit

   And to icount changes?
 
 Call details:
 
 10:00 AM to 11:00 AM EDT
 Every two weeks
 
 If you need phone number details,  contact me privately.


Re: [PATCH v5 01/14] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1

2013-08-01 Thread Orit Wasserman
On 07/31/2013 05:48 PM, Gleb Natapov wrote:
 From: Nadav Har'El n...@il.ibm.com
 
 Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
 switches the EFER MSR when EPT is used and the host and guest have different
 NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
 and want to be able to run recent KVM as L1, we need to allow L1 to use this
 EFER switching feature.
 
 To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available,
 and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
 support for the former (the latter is still unsupported).
 
 Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state,
 respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all
 that's left to do in this patch is to properly advertise this feature to L1.
 
 Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using
 vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
 support this feature, regardless of whether the host supports it.
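
For context, a rough stand-alone sketch (not kernel code) of the allowed-0/allowed-1 encoding used by the VMX capability MSRs that nested_vmx_setup_ctls_msrs() reads; the MSR value below is made up for illustration, and the bit position follows the SDM's VM-exit controls layout. As the description notes, this particular bit is advertised unconditionally because L0 emulates it:

    #include <stdint.h>
    #include <stdio.h>

    #define VM_EXIT_LOAD_IA32_EFER (1u << 21)   /* bit 21 of the VM-exit controls */

    int main(void)
    {
        /* Stand-in for rdmsr(MSR_IA32_VMX_EXIT_CTLS): low word = must-be-1
         * bits, high word = may-be-1 bits (hypothetical value). */
        uint64_t msr_vmx_exit_ctls = 0x007fffff00036dffULL;
        uint32_t exit_ctls_low  = (uint32_t)msr_vmx_exit_ctls;
        uint32_t exit_ctls_high = (uint32_t)(msr_vmx_exit_ctls >> 32);

        /* The patch ORs VM_EXIT_LOAD_IA32_EFER into the "high" word exposed
         * to L1, i.e. it tells L1 that it may set this control. */
        exit_ctls_high |= VM_EXIT_LOAD_IA32_EFER;

        printf("advertised to L1: low=%#x high=%#x\n", exit_ctls_low, exit_ctls_high);
        return 0;
    }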
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
 Signed-off-by: Xinhao Xu xinhao...@intel.com
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 Signed-off-by: Gleb Natapov g...@redhat.com
 ---
  arch/x86/kvm/vmx.c |   23 ---
  1 file changed, 16 insertions(+), 7 deletions(-)
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index e999dc7..27efa6a 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -2198,7 +2198,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
  #else
   nested_vmx_exit_ctls_high = 0;
  #endif
 - nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
 + nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 +   VM_EXIT_LOAD_IA32_EFER);
  
   /* entry controls */
   rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
 @@ -2207,8 +2208,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
   nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
   nested_vmx_entry_ctls_high =
   VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
 - nested_vmx_entry_ctls_high |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
 -
 + nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
 +VM_ENTRY_LOAD_IA32_EFER);
   /* cpu-based controls */
   rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
   nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
 @@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
  	vcpu->arch.cr0_guest_owned_bits = ~vmcs12->cr0_guest_host_mask;
  	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
  
 -	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
 -	vmcs_write32(VM_EXIT_CONTROLS,
 -		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
 -	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
 +	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
 +	 * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
 +	 * bits are further modified by vmx_set_efer() below.
 +	 */
 +	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
 +
 +	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
 +	 * emulated by vmx_set_efer(), below.
 +	 */
 +	vmcs_write32(VM_ENTRY_CONTROLS,
 +		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
 +			~VM_ENTRY_IA32E_MODE) |
  		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
  
  	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
 

Reviewed-by: Orit Wasserman owass...@redhat.com


Re: [PATCH v5 02/14] nEPT: Fix cr3 handling in nested exit and entry

2013-08-01 Thread Orit Wasserman
On 07/31/2013 05:48 PM, Gleb Natapov wrote:
 From: Nadav Har'El n...@il.ibm.com
 
 The existing code for handling cr3 and related VMCS fields during nested
 exit and entry wasn't correct in all cases:
 
 If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
 during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
 we forgot to do so. This patch adds this copy.
 
 If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
 whoever does control cr3 (L1 or L2) is using PAE, the processor might have
 saved PDPTEs and we should also save them in vmcs12 (and restore later).
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
 Signed-off-by: Xinhao Xu xinhao...@intel.com
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 Signed-off-by: Gleb Natapov g...@redhat.com
 ---
  arch/x86/kvm/vmx.c |   26 ++
  1 file changed, 26 insertions(+)
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 27efa6a..4e98764 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -7595,6 +7595,16 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
  	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
  	kvm_mmu_reset_context(vcpu);
  
 +	/*
 +	 * L1 may access the L2's PDPTR, so save them to construct vmcs12
 +	 */
 +	if (enable_ept) {
 +		vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
 +		vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
 +		vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
 +		vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
 +	}
 +
  	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
  	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
  }
 @@ -7917,6 +7927,22 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
  	vmcs12->guest_pending_dbg_exceptions =
  		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
  
 +	/*
 +	 * In some cases (usually, nested EPT), L2 is allowed to change its
 +	 * own CR3 without exiting. If it has changed it, we must keep it.
 +	 * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
 +	 * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
 +	 *
 +	 * Additionally, restore L2's PDPTR to vmcs12.
 +	 */
 +	if (enable_ept) {
 +		vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
 +		vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
 +		vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
 +		vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
 +		vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
 +	}
 +
  	vmcs12->vm_entry_controls =
  		(vmcs12->vm_entry_controls & ~VM_ENTRY_IA32E_MODE) |
  		(vmcs_read32(VM_ENTRY_CONTROLS) & VM_ENTRY_IA32E_MODE);
 

Reviewed-by: Orit Wasserman owass...@redhat.com


Re: [PATCH v5 03/14] nEPT: Fix wrong test in kvm_set_cr3

2013-08-01 Thread Orit Wasserman
On 07/31/2013 05:48 PM, Gleb Natapov wrote:
 From: Nadav Har'El n...@il.ibm.com
 
 kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
 address. The problem is that with nested EPT, cr3 is an *L2* physical
 address, not an L1 physical address as this test expects.
 
 As the comment above this test explains, it isn't necessary, and doesn't
 correspond to anything a real processor would do. So this patch removes it.
 
 Note that this wrong test could have also theoretically caused problems
 in nested NPT, not just in nested EPT. However, in practice, the problem
 was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
 nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
 circumventing the problem. Additional potential calls to the buggy function
 are avoided in that we don't trap cr3 modifications when nested NPT is
 enabled. However, because in nested VMX we did want to use kvm_set_cr3()
 (as requested in Avi Kivity's review of the original nested VMX patches),
 we can't avoid this problem and need to fix it.
 
 Reviewed-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
 Signed-off-by: Xinhao Xu xinhao...@intel.com
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 Signed-off-by: Gleb Natapov g...@redhat.com
 ---
  arch/x86/kvm/x86.c |   11 ---
  1 file changed, 11 deletions(-)
 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index d2caeb9..e2fef8b 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -682,17 +682,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
*/
   }
  
 - /*
 -  * Does the new cr3 value map to physical memory? (Note, we
 -  * catch an invalid cr3 even in real-mode, because it would
 -  * cause trouble later on when we turn on paging anyway.)
 -  *
 -  * A real CPU would silently accept an invalid cr3 and would
 -  * attempt to use it - with largely undefined (and often hard
 -  * to debug) behavior on the guest side.
 -  */
 -	if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
 -		return 1;
  	vcpu->arch.cr3 = cr3;
  	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
  	vcpu->arch.mmu.new_cr3(vcpu);
 

Reviewed-by: Orit Wasserman owass...@redhat.com


Re: [PATCH 0/11] KVM: nVMX: shadow VMCS support, v5

2013-04-18 Thread Orit Wasserman
On 04/18/2013 02:34 PM, Abel Gordon wrote:
 This series of patches implements shadow-vmcs capability for nested VMX.
 
 Shadow-vmcs - background and overview:
 
  In Intel VMX, vmread and vmwrite privileged instructions are used by the
  hypervisor to read and modify the guest and host specifications (VMCS). In a
  nested virtualization environment, L1 executes multiple vmread and vmwrite
  instruction to handle a single L2 exit. Each vmread and vmwrite executed by 
 L1
  traps (cause an exit) to the L0 hypervisor (KVM). L0 emulates the instruction
  behaviour and resumes L1 execution.
 
  Removing the need to trap and emulate these special instructions reduces the
  number of exits and improves nested virtualization performance. As it was 
 first
  evaluated in [1], exit-less vmread and vmwrite can reduce nested 
 virtualization
  overhead up-to 40%.
  
  Intel introduced a new feature to their processors called shadow-vmcs.  Using
  shadow-vmcs, L0 can configure the processor to let L1 running in guest-mode
  access VMCS12 fields using vmread and vmwrite instructions but without 
 causing
  an exit to L0. The VMCS12 fields' data is stored in a shadow-vmcs controlled
  by L0.
 
 Shadow-vmcs - design considerations: 
 
  A shadow-vmcs is processor-dependent and must be accessed by L0 or L1 using
  vmread and vmwrite instructions. With nested virtualization we aim to 
 abstract
  the hardware from the L1 hypervisor. Thus, to avoid hardware dependencies we
  prefered to keep the software defined VMCS12 format as part of L1 address 
 space
  and hold the processor-specific shadow-vmcs format only in L0 address space.
  In other words, the shadow-vmcs is used by L0 as an accelerator but the 
 format
  and content is never exposed to L1 directly. L0 syncs the content of the
  processor-specific shadow vmcs with the content of the software-controlled
  VMCS12 format.
 
  We could have kept the processor-specific shadow-vmcs format in L1 address
  space to avoid using the software-defined VMCS12 format; however, this type of
  design/implementation would have created hardware dependencies and
  would complicate other capabilities (e.g. live migration of L1).
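
To make the sync direction concrete, here is a rough, self-contained model (not the actual KVM code; vmcs_read()/vmcs_write() below are stand-ins for the real vmread/vmwrite against the shadow VMCS, and the field list is made up):

    #include <stdint.h>
    #include <stdio.h>

    enum { GUEST_RIP, GUEST_RSP, NR_FIELDS };                /* toy field set */

    static uint64_t shadow_vmcs[NR_FIELDS];                  /* models the hardware shadow VMCS */
    static uint64_t vmcs12[NR_FIELDS] = { 0x1000, 0x8000 };  /* software format, lives in L1 memory */

    static uint64_t vmcs_read(int f)              { return shadow_vmcs[f]; }
    static void     vmcs_write(int f, uint64_t v) { shadow_vmcs[f] = v; }

    /* Before resuming L1: vmcs12 -> shadow, so L1's exit-less vmreads see fresh values. */
    static void copy_vmcs12_to_shadow(void)
    {
        for (int f = 0; f < NR_FIELDS; f++)
            vmcs_write(f, vmcs12[f]);
    }

    /* When L0 next needs vmcs12: shadow -> vmcs12, so L1's exit-less vmwrites are captured. */
    static void copy_shadow_to_vmcs12(void)
    {
        for (int f = 0; f < NR_FIELDS; f++)
            vmcs12[f] = vmcs_read(f);
    }

    int main(void)
    {
        copy_vmcs12_to_shadow();
        vmcs_write(GUEST_RSP, 0x9000);      /* pretend L1 did an exit-less vmwrite */
        copy_shadow_to_vmcs12();
        printf("vmcs12 GUEST_RSP is now %#llx\n", (unsigned long long)vmcs12[GUEST_RSP]);
        return 0;
    }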
 
 Changes since v1:
  1) Added sync_shadow_vmcs flag used to indicate when the content of VMCS12
 must be copied to the shadow vmcs. The flag value is checked during 
 vmx_vcpu_run.
  2) Code quality improvements
 
 Changes since v2:
  1) Allocate shadow vmcs only once per VCPU on handle_vmxon and re-use the 
 same instance for multiple VMCS12s
  2) More code quality improvements
 
 Changes since v3:
  1) Fixed VMXON emulation (new patch). 
 Previous nVMX code didn't verify whether L1 is already in root mode (VMXON
 was previously called). Now we call nested_vmx_failValid if VMX is
 already ON. This is required to avoid host memory leaks (due to shadow vmcs
 allocation) if L1 repeatedly executes VMXON.
  2) Improved comment: clarified we do not shadow fields that are modified
 when L1 executes vmx instructions like the VM_INSTRUCTION_ERROR field.
 
 Changes since v4:
  1) Fixed free_nested: we now free the shadow vmcs also 
 when there is no current vmcs.
  
 Acknowledgments:
 
  Many thanks to
  Natapov, Gleb g...@redhat.com 
  Xu, Dongxiao dongxiao...@intel.com
  Nakajima, Jun jun.nakaj...@intel.com
  Har'El, Nadav na...@harel.org.il 
   
  for the insightful discussions, comments and reviews.
 
 
  These patches were easily created and maintained using
  Patchouli -- patch creator
  http://patchouli.sourceforge.net/
 
 
 [1] The Turtles Project: Design and Implementation of Nested Virtualization,
 http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
 

Hard to keep up :)

Reviewed-by: Orit Wasserman owass...@redhat.com



Re: [PATCH 0/11] KVM: nVMX: shadow VMCS support, v1

2013-03-21 Thread Orit Wasserman
On 03/10/2013 06:03 PM, Abel Gordon wrote:
 This series of patches implements shadow-vmcs capability for nested VMX.
 
 Shadow-vmcs - background and overview:
 
  In Intel VMX, vmread and vmwrite privileged instructions are used by the
  hypervisor to read and modify the guest and host specifications (VMCS). In a
  nested virtualization environment, L1 executes multiple vmread and vmwrite
  instruction to handle a single L2 exit. Each vmread and vmwrite executed by 
 L1
  traps (cause an exit) to the L0 hypervisor (KVM). L0 emulates the instruction
  behaviour and resumes L1 execution.
 
  Removing the need to trap and emulate these special instructions reduces the
  number of exits and improves nested virtualization performance. As it was 
 first
  evaluated in [1], exit-less vmread and vmwrite can reduce nested 
 virtualization
  overhead up-to 40%.
  
  Intel introduced a new feature to their processors called shadow-vmcs.  Using
  shadow-vmcs, L0 can configure the processor to let L1 running in guest-mode
  access VMCS12 fields using vmread and vmwrite instructions but without 
 causing
  an exit to L0. The VMCS12 fields' data is stored in a shadow-vmcs controlled
  by L0.
 
 Shadow-vmcs - design considerations: 
 
  A shadow-vmcs is processor-dependent and must be accessed by L0 or L1 using
  vmread and vmwrite instructions. With nested virtualization we aim to 
 abstract
  the hardware from the L1 hypervisor. Thus, to avoid hardware dependencies we
  prefered to keep the software defined VMCS12 format as part of L1 address 
 space
  and hold the processor-specific shadow-vmcs format only in L0 address space.
  In other words, the shadow-vmcs is used by L0 as an accelerator but the 
 format
  and content is never exposed to L1 directly. L0 syncs the content of the
  processor-specific shadow vmcs with the content of the software-controlled
  VMCS12 format.
 
  We could have been kept the processor-specific shadow-vmcs format in L1 
 address
  space to avoid using the software defined VMCS12 format, however, this type 
 of
  design/implementation would have been created hardware dependencies and
  would complicate other capabilities (e.g. Live Migration of L1).
  
 Acknowledgments:
 
  Many thanks to
  Xu, Dongxiao dongxiao...@intel.com
  Nakajima, Jun jun.nakaj...@intel.com
  Har'El, Nadav na...@harel.org.il 
   
  for the insightful discussions, comments and reviews.
 
 
  These patches were easily created and maintained using
  Patchouli -- patch creator
  http://patchouli.sourceforge.net/
 
 
 [1] The Turtles Project: Design and Implementation of Nested Virtualization,
 http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
 
Reviewed-by: Orit Wasserman owass...@redhat.com

By the way, do you have any performance results? How much does it improve nested?

Orit


Re: [PATCH v2 4/4] nested vmx: use a list to store the launched vmcs12 for L1 VMM

2012-11-28 Thread Orit Wasserman
On 11/28/2012 02:29 AM, Marcelo Tosatti wrote:
 On Thu, Nov 22, 2012 at 12:51:59PM +0800, Dongxiao Xu wrote:
 The launch state is not a member in the VMCS area, use a separate
 variable (list) to store it instead.

 Signed-off-by: Dongxiao Xu dongxiao...@intel.com
 
 1. What is the problem with keeping launched state in the VMCS?
 Assuming there is a positive answer to the above:
 
 2. Don't you have to change VMCS ID?
 
 3. Can't it be kept somewhere else other than a list? Current scheme 
 allows guest to allocate unlimited amounts of host memory.
I agree with Marcelo: you have to limit the number of VMCSs in the list, otherwise
it will be easy to attack a host with nested :) (a rough illustration of bounding it follows below)
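
A very rough illustration of what bounding that state could look like (the cap, names, and structure below are made up, purely to show the idea of refusing further allocations once a fixed limit is reached):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_TRACKED_VMCS 16                  /* hypothetical per-guest cap */

    struct tracked_vmcs {
        uint64_t gpa;        /* guest-physical address of the vmcs12 */
        bool     launched;
    };

    static struct tracked_vmcs tracked[MAX_TRACKED_VMCS];
    static int nr_tracked;

    /* Returns false once the cap is hit, so the caller can fail the request
     * (or recycle an old entry) instead of letting L1 grow it without bound. */
    static bool track_vmcs(uint64_t gpa)
    {
        if (nr_tracked >= MAX_TRACKED_VMCS)
            return false;
        tracked[nr_tracked].gpa = gpa;
        tracked[nr_tracked].launched = true;
        nr_tracked++;
        return true;
    }

    int main(void)
    {
        for (uint64_t gpa = 0x1000; gpa < 0x1000 + 32 * 0x1000; gpa += 0x1000)
            if (!track_vmcs(gpa))
                printf("refusing to track vmcs at %#llx (cap reached)\n",
                       (unsigned long long)gpa);
        return 0;
    }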
 
 4. What is the state of migration / nested vmx again? If vmcs12 is
 migrated, this means launched state is not migrated anymore.
 
 Patches 1-3 seem fine.
 


Re: [PATCH 1/4] nested vmx: clean up for vmcs12 read and write

2012-11-21 Thread Orit Wasserman
On 11/21/2012 11:04 AM, Dongxiao Xu wrote:
 abstract vmcs12_read and vmcs12_write functions to do the vmcs12
 read/write operations.
 
 Signed-off-by: Dongxiao Xu dongxiao...@intel.com
 ---
  arch/x86/kvm/vmx.c |   86 
 +++-
  1 files changed, 45 insertions(+), 41 deletions(-)
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index f858159..d8670e4 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -5407,32 +5407,67 @@ static inline int vmcs_field_readonly(unsigned long 
 field)
   * some of the bits we return here (e.g., on 32-bit guests, only 32 bits of
   * 64-bit fields are to be returned).
   */
 -static inline bool vmcs12_read_any(struct kvm_vcpu *vcpu,
 - unsigned long field, u64 *ret)
 +static inline u64 vmcs12_read(struct kvm_vcpu *vcpu, unsigned long field)
  {
   short offset = vmcs_field_to_offset(field);
   char *p;
  
 -	if (offset < 0)
 +	if (offset < 0) {
 +		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
 +		skip_emulated_instruction(vcpu);
  		return 0;
 +	}
  
   p = ((char *)(get_vmcs12(vcpu))) + offset;
  
   switch (vmcs_field_type(field)) {
   case VMCS_FIELD_TYPE_NATURAL_WIDTH:
 - *ret = *((natural_width *)p);
 + return *((natural_width *)p);
 + case VMCS_FIELD_TYPE_U16:
 + return *((u16 *)p);
 + case VMCS_FIELD_TYPE_U32:
 + return *((u32 *)p);
 + case VMCS_FIELD_TYPE_U64:
 + return *((u64 *)p);
 + default:
 + nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
 + skip_emulated_instruction(vcpu);
 + return 0; /* can never happen. */
 + }
 +}
 +
 +static inline int vmcs12_write(struct kvm_vcpu *vcpu,
 + unsigned long field,
 + u64 value)
 +{
 + short offset = vmcs_field_to_offset(field);
 + char *p;
 +
 +	if (offset < 0) {
 + nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
 + skip_emulated_instruction(vcpu);
 + return 0;
 + }
 +
 + p = ((char *)(get_vmcs12(vcpu))) + offset;
 +
 + switch (vmcs_field_type(field)) {
 + case VMCS_FIELD_TYPE_NATURAL_WIDTH:
 + *(natural_width *)p = value;
   return 1;
   case VMCS_FIELD_TYPE_U16:
 - *ret = *((u16 *)p);
 + *(u16 *)p = value;
   return 1;
   case VMCS_FIELD_TYPE_U32:
 - *ret = *((u32 *)p);
 + *(u32 *)p = value;
   return 1;
   case VMCS_FIELD_TYPE_U64:
 - *ret = *((u64 *)p);
 + *(u64 *)p = value;
   return 1;
   default:
 - return 0; /* can never happen. */
 + nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
 + skip_emulated_instruction(vcpu);
 + return 0;
   }
  }
  
 @@ -5466,11 +5501,7 @@ static int handle_vmread(struct kvm_vcpu *vcpu)
   /* Decode instruction info and find the field to read */
  	field = kvm_register_read(vcpu, (((vmx_instruction_info) >> 28) & 0xf));
  	/* Read the field, zero-extended to a u64 field_value */
 -	if (!vmcs12_read_any(vcpu, field, &field_value)) {
 - nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
 - skip_emulated_instruction(vcpu);
 - return 1;
 - }
 + field_value = vmcs12_read(vcpu, field);
If there is an error here, the function will still keep running: the failure
will be overridden (by the later call to nested_vmx_succeed) and the emulated
instruction will be skipped twice.
You need to detect that an error occurred and return, like handle_vmwrite() does;
one possible shape for that is sketched below.

Orit
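
One possible shape for that fix (a sketch only, not the actual patch; it keeps a status return on the read path like the write path has, and omits the field-width switch for brevity):

    static inline int vmcs12_read(struct kvm_vcpu *vcpu, unsigned long field, u64 *value)
    {
            short offset = vmcs_field_to_offset(field);

            if (offset < 0) {
                    nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
                    skip_emulated_instruction(vcpu);
                    return 0;
            }
            /* width handling (u16/u32/u64/natural_width) omitted for brevity */
            *value = *(u64 *)(((char *)get_vmcs12(vcpu)) + offset);
            return 1;
    }

    /* ... and in handle_vmread(): */
            if (!vmcs12_read(vcpu, field, &field_value))
                    return 1;   /* error already reported; don't fall through */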
   /*
* Now copy part of this value to register or memory, as requested.
* Note that the number of bits actually copied is 32 or 64 depending
 @@ -5500,8 +5531,6 @@ static int handle_vmwrite(struct kvm_vcpu *vcpu)
   gva_t gva;
   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
   u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
 - char *p;
 - short offset;
   /* The value to write might be 32 or 64 bits, depending on L1's long
* mode, and eventually we need to write that into a field of several
* possible lengths. The code below first zero-extends the value to 64
 @@ -5537,33 +5566,8 @@ static int handle_vmwrite(struct kvm_vcpu *vcpu)
   skip_emulated_instruction(vcpu);
   return 1;
   }
 -
 - offset = vmcs_field_to_offset(field);
 -	if (offset < 0) {
 - nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
 - skip_emulated_instruction(vcpu);
 - return 1;
 - }
 - p = ((char *) get_vmcs12(vcpu)) + offset;
 -
 - switch (vmcs_field_type(field)) {
 - case VMCS_FIELD_TYPE_U16:
 - 

Re: [PATCH v3 00/35] postcopy live migration

2012-11-06 Thread Orit Wasserman
Hi,
I haven't had time yet to review your patches in detail,
but I have one general comment about the interface used to activate postcopy.
As postcopy needs to be supported by both the source and the destination QEMU,
we have the migration capabilities interface for this kind of feature;
you can look at the XBZRLE patch series for more details.
So in order to activate postcopy, the user will need to run
migrate_set_capability postcopy on on both the source and destination QEMU
before starting the migration process.

Regards,
Orit 
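
For example (the capability name here is hypothetical and would be whatever the series ends up defining):

    # on both the source and the destination monitor:
    (qemu) migrate_set_capability postcopy on
    # then start the migration from the source as usual:
    (qemu) migrate -d tcp:<dest ip>:4444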

On 10/30/2012 10:32 AM, Isaku Yamahata wrote:
 This is the v3 patch series of postcopy migration.
 
 The trees is available at
 git://github.com/yamahata/qemu.git qemu-postcopy-oct-30-2012
 git://github.com/yamahata/linux-umem.git linux-umem-oct-29-2012
 
 Major changes v2 - v3:
 - implemented pre+post optimization
 - auto detection of postcopy by incoming side
 - using threads on destination instead of fork
 - using blocking io instead of select + non-blocking io loop
 - less memory overhead
 - various improvement and code simplification
 - kernel module name change umem - uvmem to avoid name conflict.
 
 Patches organization:
 1-2: trivial fixes
 3-5: prepartion for threading. cherry-picked from migration tree
 6-18: refactoring existing code and preparation
 19-25: implement postcopy live migration itself (essential part)
 26-35: optimization/heuristic for postcopy
 
 Usage
 =
 You need load uvmem character device on the host before starting migration.
  Postcopy can be used with the tcg and kvm accelerators. The implementation depends
  only on the Linux uvmem character device, and the driver-dependent code is split
  into its own file.
 I tested only host page size == guest page size case, but the implementation
 allows host page size != guest page size case.
 
 The following options are added with this patch series.
 - incoming part
   use -incoming as usual. Postcopy is automatically detected.
   example:
   qemu -incoming tcp:0: -monitor stdio -machine accel=kvm
 
  - outgoing part
   options for migrate command
   migrate [-p [-n] [-m]] URI 
   [precopy count [prefault forward [prefault backword]]]
 
   Newly added options/arguments
   -p: indicate postcopy migration
   -n: disable background transferring pages: This is for benchmark/debugging
   -m: move background transfer of postcopy mode
   precopy count: The number of precopy RAM scan before postcopy.
default 0 (0 means no precopy)
   prefault forward: The number of forward pages which is sent with on-demand
   prefault backward: The number of backward pages which is sent with
on-demand
 
   example:
   migrate -p -n tcp:dest ip address:
   migrate -p -n -m tcp:dest ip address: 42 42 0
 
 
 TODO
 
 - benchmark/evaluation
 - improve/optimization
   At the moment at least what I'm aware of is
   - pre+post case
  On the destination side, reading the dirty bitmap would cause long latency.
 create thread for that.
 - consider on FUSE/CUSE possibility
 
 basic postcopy work flow
 
 qemu on the destination
   |
   V
 open(/dev/uvmem)
   |
   V
 UVMEM_INIT
   |
   V
 Here we have two file descriptors to
 umem device and shmem file
   |
   |  umem threads
   |  on the destination
   |
   Vcreate pipe to communicate
 crete threads,
   |  |
   V   mmap(shmem file)
 mmap(uvmem device) for guest RAM  close(shmem file)
   |  |
   |  |
   V  |
 wait for ready from daemon pipe-send ready message
   |  |
   | Here the daemon takes over
 send okpipe--- the owner of the socket
   |   to the source
   V  |
 entering post copy stage |
 start guest execution|
   |  |
   V  V
 access guest RAM  read() to get faulted pages
   |  |
   V  V
 page fault --page offset is returned
 block|
  V
 

Re: [Qemu-devel] Block Migration and xbzrle

2012-10-02 Thread Orit Wasserman
On 10/02/2012 10:33 AM, lieven-li...@dlh.net wrote:
 Orit Wasserman wrote:
 On 09/16/2012 01:39 PM, Peter Lieven wrote:
 Hi,

 I remember that this was broken some time ago and currently with
 qemu-kvm 1.2.0 I am still not able to use
 block migration plus xbzrle. The migration fails if both are used
 together. XBZRLE without block migration works.

 Can someone please advise what is the current expected behaviour?
 XBZRLE only work on guest memory so it shouldn't be effected by block
 migration.
 What is the error you are getting?
 What command line ?
 
 Meanwhile I can confirm that it happens with and without block migration.
 I I observe 2 errors:
 a)
  qemu: warning: error while loading state section id 2
  load of migration failed
 b)
  the vm does not enter running state after migration.
 
 The command-line:
 /usr/bin/qemu-kvm-1.2.0  -net
 tap,vlan=798,script=no,downscript=no,ifname=tap1  -net
 nic,vlan=798,model=e1000,macaddr=52:54:00:ff:01:15   -drive
 format=host_device,file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-d85f4e007-3f30017ce11505df-ubuntu-tools-hd0,if=virtio,cache=none,aio=native
  -m 4096 -smp 2,sockets=1,cores=2,threads=1  -monitor
 tcp:0:4002,server,nowait -vnc :2 -qmp tcp:0:3002,server,nowait  -name
 'Ubuntu-Tools'  -boot order=dc,menu=off  -k de  -incoming
 tcp:172.21.55.34:5002  -pidfile /var/run/qemu/vm-250.pid  -mem-path
 /hugepages  -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet
 -vga cirrus  -cpu host,+x2apic,model_id='Intel(R) Xeon(R) CPU 
Migration with -cpu host is very problematic, because the source and destination
can have different CPUs, resulting in different CPU features.
Does regular migration work with this setup?
Can you try with a different CPU type?
What is the /proc/cpuinfo output on the source and the destination?

Cheers,
Orit
 
 L5640  @ 2.27GHz',-tsc
 
 Thanks,
 Peter
 

 Regards,
 Orit

 Thanks,
 Peter





 
 


Re: [Qemu-devel] Block Migration and xbzrle

2012-10-02 Thread Orit Wasserman
On 10/02/2012 11:30 AM, Peter Lieven wrote:
 
 Am 02.10.2012 um 11:28 schrieb Orit Wasserman:
 
 On 10/02/2012 10:33 AM, lieven-li...@dlh.net wrote:
 Orit Wasserman wrote:
 On 09/16/2012 01:39 PM, Peter Lieven wrote:
 Hi,

 I remember that this was broken some time ago and currently with
 qemu-kvm 1.2.0 I am still not able to use
 block migration plus xbzrle. The migration fails if both are used
 together. XBZRLE without block migration works.

 Can someone please advise what is the current expected behaviour?
 XBZRLE only work on guest memory so it shouldn't be effected by block
 migration.
 What is the error you are getting?
 What command line ?

 Meanwhile I can confirm that it happens with and without block migration.
 I I observe 2 errors:
 a)
 qemu: warning: error while loading state section id 2
 load of migration failed
Did you enable XBZRLE on the destination as well?
(migrate_set_capability xbzrle on)
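
For example, enable the capability on both monitors and then start the migration from the source (the cache-size line is optional and the value here is only an illustration):

    (qemu) migrate_set_capability xbzrle on
    (qemu) migrate_set_cache_size 256m
    (qemu) migrate -d tcp:<dest ip>:4444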

Orit
 b)
 the vm does not enter running state after migration.

 The command-line:
 /usr/bin/qemu-kvm-1.2.0  -net
 tap,vlan=798,script=no,downscript=no,ifname=tap1  -net
 nic,vlan=798,model=e1000,macaddr=52:54:00:ff:01:15   -drive
 format=host_device,file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-d85f4e007-3f30017ce11505df-ubuntu-tools-hd0,if=virtio,cache=none,aio=native
 -m 4096 -smp 2,sockets=1,cores=2,threads=1  -monitor
 tcp:0:4002,server,nowait -vnc :2 -qmp tcp:0:3002,server,nowait  -name
 'Ubuntu-Tools'  -boot order=dc,menu=off  -k de  -incoming
 tcp:172.21.55.34:5002  -pidfile /var/run/qemu/vm-250.pid  -mem-path
 /hugepages  -mem-prealloc  -rtc base=utc -usb -usbdevice tablet -no-hpet
 -vga cirrus  -cpu host,+x2apic,model_id='Intel(R) Xeon(R) CPU 
 Migration with -cpu host is very problemtic, because the source and 
 destination can 
 have different cpu resulting in different cpu features.
 Does regular migration works with this setup?
 Can you try with a different cpu type?
 What are the source and destination /proc/cpuinfo output ?

 
 The CPUs are identical, we also check if flags and cpu types match if cpu 
 type is set to host.
 Regular migration does work.



 
 BR,
 Peter
 

 Cheers,
 Orit

 L5640  @ 2.27GHz',-tsc

 Thanks,
 Peter


 Regards,
 Orit

 Thanks,
 Peter









Re: [Qemu-devel] Block Migration and xbzrle

2012-09-17 Thread Orit Wasserman
On 09/16/2012 01:39 PM, Peter Lieven wrote:
 Hi,
 
 I remember that this was broken some time ago and currently with qemu-kvm 
 1.2.0 I am still not able to use
 block migration plus xbzrle. The migration fails if both are used together. 
 XBZRLE without block migration works.
 
 Can someone please advise what is the current expected behaviour?
XBZRLE only works on guest memory, so it shouldn't be affected by block migration.
What is the error you are getting?
What command line ?

Regards,
Orit
 
 Thanks,
 Peter
 
 



Re: [Qemu-devel] [PATCH v2 00/41] postcopy live migration

2012-06-07 Thread Orit Wasserman
On 06/04/2012 03:37 PM, Anthony Liguori wrote:
 On 06/04/2012 05:57 PM, Isaku Yamahata wrote:
 After the long time, we have v2. This is qemu part.
 The linux kernel part is sent separatedly.

 Changes v1 -  v2:
 - split up patches for review
 - buffered file refactored
 - many bug fixes
Espcially PV drivers can work with postcopy
 - optimization/heuristic

 Patches
 1 - 30: refactoring exsiting code and preparation
 31 - 37: implement postcopy itself (essential part)
 38 - 41: some optimization/heuristic for postcopy

 Intro
 =
 This patch series implements postcopy live migration.[1]
 As discussed at KVM forum 2011, dedicated character device is used for
 distributed shared memory between migration source and destination.
 Now we can discuss/benchmark/compare with precopy. I believe there are
 much rooms for improvement.

 [1] http://wiki.qemu.org/Features/PostCopyLiveMigration


 Usage
 =
 You need load umem character device on the host before starting migration.
 Postcopy can be used for tcg and kvm accelarator. The implementation depend
 on only linux umem character device. But the driver dependent code is split
 into a file.
 I tested only host page size == guest page size case, but the implementation
 allows host page size != guest page size case.

 The following options are added with this patch series.
 - incoming part
command line options
-postcopy [-postcopy-flagsflags]
where flags is for changing behavior for benchmark/debugging
Currently the following flags are available
0: default
1: enable touching page request

example:
qemu -postcopy -incoming tcp:0: -monitor stdio -machine accel=kvm

 - outging part
options for migrate command
migrate [-p [-n] [-m]] URI [prefault forward  [prefault backword]]
-p: indicate postcopy migration
-n: disable background transferring pages: This is for benchmark/debugging
-m: move background transfer of postcopy mode
prefault forward: The number of forward pages which is sent with 
 on-demand
prefault backward: The number of backward pages which is sent with
 on-demand

example:
migrate -p -n tcp:dest ip address:
migrate -p -n -m tcp:dest ip address: 32 0


 TODO
 
 - benchmark/evaluation. Especially how async page fault affects the result.
 
 I don't mean to beat on a dead horse, but I really don't understand the point 
 of postcopy migration other than the fact that it's possible.  It's a lot of 
 code and a new ABI in an area where we already have too much difficulty 
 maintaining our ABI.
 
 Without a compelling real world case with supporting benchmarks for why we 
 need postcopy and cannot improve precopy, I'm against merging this.
Hi Anthony,

The example is quite simple: let's look at a 300G guest that is dirtying 10
percent of its memory every second (for example SAP ...).
Even if we have a 30G/s network we will need 1 second of downtime for this
guest, and many workloads time out under that kind of downtime.
Guests are getting bigger and bigger, so for those big guests the only way to
do live migration is post copy.
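
To make the arithmetic explicit:

    dirtied per second                         = 10% of 300G   = 30G/s
    time to resend one second's dirty pages    = 30G / (30G/s) = 1s

so even on a 30G/s link, precopy cannot shrink the remaining dirty set below roughly one second's worth of data, and any slower link makes the required downtime proportionally longer.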
I agree we are losing reliability with post copy, but we can try to limit the risk:
- do a full copy of the guest RAM (precopy) and then switch to post copy only
  for the updates
- have the user run it over a private LAN, maybe with redundancy, which is much safer
- maybe back up the memory to storage so that in case of network failure we can
  recover

In the end it is up to the user; he can decide what he is willing to risk.
The default, of course, should always be precopy live migration; maybe we should
even have a different command for post copy.
In the end I can see some users that will have no choice but to use post copy live
migration, or else stop the guests in order to move them to another host.

Regards,
Orit
 
 Regards,
 
 Anthony Liguori
 
 



[PATCH] Fix migration of real mode guests from host with unrestricted guest to a host without it.

2012-05-31 Thread Orit Wasserman
For example, migration between Westmere and Nehalem hosts.
The patch fixes up the guest segments similarly to what the enter_rmode function does.
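
For reference, the values written below follow directly from real-mode (vm86) segmentation, where selector and base are tied together; a tiny stand-alone illustration (not kernel code):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* In real mode the linear base of a segment is selector << 4, so a
         * consistent selector can be recovered from the base; the limit is
         * 64 KiB - 1; and 0xf3 is the VMCS access-rights encoding of a
         * present, DPL-3, writable segment as used for vm86. */
        uint32_t base     = 0x000f0000;   /* hypothetical GUEST_CS_BASE */
        uint16_t selector = base >> 4;    /* -> 0xf000 */
        uint32_t limit    = 0xffff;
        uint8_t  ar_bytes = 0xf3;

        printf("selector=%#x limit=%#x ar=%#x\n", selector, limit, ar_bytes);
        return 0;
    }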

Signed-off-by: Orit Wasserman owass...@rehdat.com
---
 arch/x86/kvm/vmx.c |   38 ++
 1 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 32eb588..2eca18f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3229,6 +3229,44 @@ static void vmx_set_segment(struct kvm_vcpu *vcpu,
 
 	vmcs_write32(sf->ar_bytes, ar);
 	__clear_bit(VCPU_EXREG_CPL, (ulong *)&vcpu->arch.regs_avail);
+
+	/*
+	 * Fix segments for real mode guest in hosts that don't have
+	 * unrestricted_mode or it was disabled.
+	 * This is done to allow migration of the guests from hosts with
+	 * unrestricted guest like Westmere to older host that don't have
+	 * unrestricted guest like Nehalem.
+	 */
+	if (!enable_unrestricted_guest && vmx->rmode.vm86_active) {
+		switch (seg) {
+		case VCPU_SREG_CS:
+			vmcs_write32(GUEST_CS_AR_BYTES, 0xf3);
+			vmcs_write32(GUEST_CS_LIMIT, 0xffff);
+			if (vmcs_readl(GUEST_CS_BASE) == 0xffff0000)
+				vmcs_writel(GUEST_CS_BASE, 0xf0000);
+			vmcs_write16(GUEST_CS_SELECTOR,
+				     vmcs_readl(GUEST_CS_BASE) >> 4);
+			break;
+		case VCPU_SREG_ES:
+			fix_rmode_seg(VCPU_SREG_ES, &vmx->rmode.es);
+			break;
+		case VCPU_SREG_DS:
+			fix_rmode_seg(VCPU_SREG_DS, &vmx->rmode.ds);
+			break;
+		case VCPU_SREG_GS:
+			fix_rmode_seg(VCPU_SREG_GS, &vmx->rmode.gs);
+			break;
+		case VCPU_SREG_FS:
+			fix_rmode_seg(VCPU_SREG_FS, &vmx->rmode.fs);
+			break;
+		case VCPU_SREG_SS:
+			vmcs_write16(GUEST_SS_SELECTOR,
+				     vmcs_readl(GUEST_SS_BASE) >> 4);
+			vmcs_write32(GUEST_SS_LIMIT, 0xffff);
+			vmcs_write32(GUEST_SS_AR_BYTES, 0xf3);
+			break;
+		}
+	}
 }
 
 static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
-- 
1.7.7.6



[PATCH v2] Fix migration of real mode guests from host with unrestricted guest to a host without it.

2012-05-31 Thread Orit Wasserman
For example, migration between Westmere and Nehalem hosts.

The code that fixes the segments for a real mode guest was moved from enter_rmode
to vmx_set_segment; enter_rmode now calls vmx_set_segment for each segment.

Signed-off-by: Orit Wasserman owass...@rehdat.com
---
 arch/x86/kvm/vmx.c |   70 +++-
 1 files changed, 58 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 32eb588..548c9b5 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -615,6 +615,10 @@ static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
 static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
+static void vmx_set_segment(struct kvm_vcpu *vcpu,
+   struct kvm_segment *var, int seg);
+static void vmx_get_segment(struct kvm_vcpu *vcpu,
+   struct kvm_segment *var, int seg);
 
 static DEFINE_PER_CPU(struct vmcs *, vmxarea);
 static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
@@ -2770,6 +2774,7 @@ static void enter_rmode(struct kvm_vcpu *vcpu)
 {
unsigned long flags;
struct vcpu_vmx *vmx = to_vmx(vcpu);
+   struct kvm_segment var;
 
if (enable_unrestricted_guest)
return;
@@ -2813,20 +2818,23 @@ static void enter_rmode(struct kvm_vcpu *vcpu)
if (emulate_invalid_guest_state)
goto continue_rmode;
 
-	vmcs_write16(GUEST_SS_SELECTOR, vmcs_readl(GUEST_SS_BASE) >> 4);
-	vmcs_write32(GUEST_SS_LIMIT, 0xffff);
-	vmcs_write32(GUEST_SS_AR_BYTES, 0xf3);
+	vmx_get_segment(vcpu, &var, VCPU_SREG_SS);
+	vmx_set_segment(vcpu, &var, VCPU_SREG_SS);
+
+	vmx_get_segment(vcpu, &var, VCPU_SREG_CS);
+	vmx_set_segment(vcpu, &var, VCPU_SREG_CS);
+
+	vmx_get_segment(vcpu, &var, VCPU_SREG_ES);
+	vmx_set_segment(vcpu, &var, VCPU_SREG_ES);
 
-	vmcs_write32(GUEST_CS_AR_BYTES, 0xf3);
-	vmcs_write32(GUEST_CS_LIMIT, 0xffff);
-	if (vmcs_readl(GUEST_CS_BASE) == 0xffff0000)
-		vmcs_writel(GUEST_CS_BASE, 0xf0000);
-	vmcs_write16(GUEST_CS_SELECTOR, vmcs_readl(GUEST_CS_BASE) >> 4);
+	vmx_get_segment(vcpu, &var, VCPU_SREG_DS);
+	vmx_set_segment(vcpu, &var, VCPU_SREG_DS);
 
-	fix_rmode_seg(VCPU_SREG_ES, &vmx->rmode.es);
-	fix_rmode_seg(VCPU_SREG_DS, &vmx->rmode.ds);
-	fix_rmode_seg(VCPU_SREG_GS, &vmx->rmode.gs);
-	fix_rmode_seg(VCPU_SREG_FS, &vmx->rmode.fs);
+	vmx_get_segment(vcpu, &var, VCPU_SREG_GS);
+	vmx_set_segment(vcpu, &var, VCPU_SREG_GS);
+
+	vmx_get_segment(vcpu, &var, VCPU_SREG_FS);
+	vmx_set_segment(vcpu, &var, VCPU_SREG_FS);
 
 continue_rmode:
kvm_mmu_reset_context(vcpu);
@@ -3229,6 +3237,44 @@ static void vmx_set_segment(struct kvm_vcpu *vcpu,
 
 	vmcs_write32(sf->ar_bytes, ar);
 	__clear_bit(VCPU_EXREG_CPL, (ulong *)&vcpu->arch.regs_avail);
+
+	/*
+	 * Fix segments for real mode guest in hosts that don't have
+	 * unrestricted_mode or it was disabled.
+	 * This is done to allow migration of the guests from hosts with
+	 * unrestricted guest like Westmere to older host that don't have
+	 * unrestricted guest like Nehalem.
+	 */
+	if (!enable_unrestricted_guest && vmx->rmode.vm86_active) {
+		switch (seg) {
+		case VCPU_SREG_CS:
+			vmcs_write32(GUEST_CS_AR_BYTES, 0xf3);
+			vmcs_write32(GUEST_CS_LIMIT, 0xffff);
+			if (vmcs_readl(GUEST_CS_BASE) == 0xffff0000)
+				vmcs_writel(GUEST_CS_BASE, 0xf0000);
+			vmcs_write16(GUEST_CS_SELECTOR,
+				     vmcs_readl(GUEST_CS_BASE) >> 4);
+			break;
+		case VCPU_SREG_ES:
+			fix_rmode_seg(VCPU_SREG_ES, &vmx->rmode.es);
+			break;
+		case VCPU_SREG_DS:
+			fix_rmode_seg(VCPU_SREG_DS, &vmx->rmode.ds);
+			break;
+		case VCPU_SREG_GS:
+			fix_rmode_seg(VCPU_SREG_GS, &vmx->rmode.gs);
+			break;
+		case VCPU_SREG_FS:
+			fix_rmode_seg(VCPU_SREG_FS, &vmx->rmode.fs);
+			break;
+		case VCPU_SREG_SS:
+			vmcs_write16(GUEST_SS_SELECTOR,
+				     vmcs_readl(GUEST_SS_BASE) >> 4);
+			vmcs_write32(GUEST_SS_LIMIT, 0xffff);
+			vmcs_write32(GUEST_SS_AR_BYTES, 0xf3);
+			break;
+		}
+	}
 }
 
 static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
-- 
1.7.7.6


Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Orit Wasserman
On 04/11/2012 03:44 PM, Guido Winkelmann wrote:
 Hi,
 
 Nested virtualization on Intel does not work for me with qemu-kvm. As soon as 
 the third layer OS (second virtualised) is starting the Linux kernel, the 
 entire second layer freezes up. The last thing I can see console of the third 
 layer system before it freezes is Decompressing Linux... . (no done, 
 though). When starting without nofb option, the kernel still manages to set 
 the screen resolution before freezing.
 
 Grub/Syslinux still work, but are extremely slow.
 
 Both the first layer OS (i.e. the one running on bare metal) and the second 
 layer OS are 64-bit-Fedora 16 with Kernel 3.3.1-3.fc16.x86_64. On both the 
 first and second layer OS, the kvm_intel modules are loaded with nested=Y 
 parameter. (I've also tried with nested=N in the second layer. Didn't change 
 anything.)
 Qemu-kvm was originally the Fedora-shipped 0.14, but I have since upgraded to 
 1.0. (Using rpmbuild with the specfile and patches from 
 http://pkgs.fedoraproject.org/gitweb/?p=qemu.git;a=blob;f=qemu.spec;hb=HEAD)
 
 The second layer machine has this CPU specification in libvirt on the first 
 layer OS:
 
   cpu mode='custom' match='exact'
 model fallback='allow'Nehalem/model
 feature policy='require' name='vmx'/
   /cpu
 
 which results in this qemu commandline (from libvirt's logs):
 
 LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-
 kvm -S -M pc-0.15 -cpu kvm64,+lahf_lm,+popcnt,+sse4.2,+sse4.1,+ssse3,+vmx -
 enable-kvm -m 8192 -smp 8,sockets=8,cores=1,threads=1 -name vshost1 -uuid 
 192b8c4b-0ded-07aa-2545-d7fef4cd897f -nodefconfig -nodefaults -chardev 
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/vshost1.monitor,server,nowait
  
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -
 no-acpi -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive 
 file=/data/vshost1.img,if=none,id=drive-virtio-disk0,format=qcow2 -device 
 virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-
 disk0,bootindex=1 -drive file=/data/Fedora-16-x86_64-
 netinst.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -
 device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev 
 tap,fd=21,id=hostnet0,vhost=on,vhostfd=22 -device virtio-net-
 pci,netdev=hostnet0,id=net0,mac=52:54:00:84:7d:46,bus=pci.0,addr=0x3 -netdev 
 tap,fd=23,id=hostnet1,vhost=on,vhostfd=24 -device virtio-net-
 pci,netdev=hostnet1,id=net1,mac=52:54:00:84:8d:46,bus=pci.0,addr=0x4 -vnc 
 127.0.0.1:0,password -k de -vga cirrus -device virtio-balloon-
 pci,id=balloon0,bus=pci.0,addr=0x6
 
 I have also tried some other combinations for the cpu element, like changing 
 the model to core2duo and/or including all the features reported by libvirt's 
 capabalities command.
 
 The third level machine does not have a cpu element in libvirt, and its 
 commandline looks like this:
 
 LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-
 kvm -S -M pc-0.14 -enable-kvm -m 8192 -smp 4,sockets=4,cores=1,threads=1 
 -name 
 gentoo -uuid 3cdcc902-4520-df25-92ac-31ca5c707a50 -nodefconfig -nodefaults -
 chardev 
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/gentoo.monitor,server,nowait 
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-acpi 
 -drive 
 file=/data/gentoo.img,if=none,id=drive-virtio-disk0,format=qcow2 -device 
 virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -
 drive file=/data/install-amd64-
 minimal-20120223.iso,if=none,media=cdrom,id=drive-
 ide0-1-0,readonly=on,format=raw -device ide-
 drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1 -netdev 
 tap,fd=23,id=hostnet0,vhost=on,vhostfd=24 -device virtio-net-
 pci,netdev=hostnet0,id=net0,mac=52:54:00:84:6d:46,bus=pci.0,addr=0x3 -usb 
 -vnc 
 127.0.0.1:0,password -k de -vga cirrus -device virtio-balloon-
 pci,id=balloon0,bus=pci.0,addr=0x5
 
 The third layer OS is a recent Gentoo minimal install (amd64), but somehow I 
 don't think that matters at this point...
 
 The metal is a Dell PowerEdge R710 server with two Xeon E5520 CPUs. I've 
 tried 
 updating the machine's BIOS and other firmware to the latest version. That 
 took a lot of time and a lot of searching on Dell websites, but didn't change 
 anything.
 
 Does anyone have any idea what might be going wrong here or how I could debug 
 this further?

I'm not sure if this is the problem, but I noticed that the second layer and the
third layer have the same memory size (8G); how about trying to reduce the memory
for the third layer?

Orit

 
   Guido

Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Orit Wasserman
On 04/11/2012 04:43 PM, Guido Winkelmann wrote:
 Am Mittwoch, 11. April 2012, 16:29:55 schrieben Sie:
 I'm not sure if this is the problem but I noticed that the second layer and
 the third layer have the same memory size (8G), how about trying to reduce
 the memory for the third layer ?
 
 I tried reducing the third layer to 1G. That didn't change anything.

There is a patch for fixing nVMX in 3.4. 
http://www.mail-archive.com/kvm@vger.kernel.org/msg68951.html
you can try it .

Orit
 
   Guido


Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Orit Wasserman
On 04/11/2012 09:37 PM, Guido Winkelmann wrote:
 Am Mittwoch, 11. April 2012, 17:25:12 schrieb Orit Wasserman:
 On 04/11/2012 04:43 PM, Guido Winkelmann wrote:
 Am Mittwoch, 11. April 2012, 16:29:55 schrieben Sie:
 I'm not sure if this is the problem but I noticed that the second layer
 and
 the third layer have the same memory size (8G), how about trying to
 reduce
 the memory for the third layer ?

 I tried reducing the third layer to 1G. That didn't change anything.

 There is a patch for fixing nVMX in 3.4.
 http://www.mail-archive.com/kvm@vger.kernel.org/msg68951.html
 you can try it .
 
 That worked, though the VM inside the VM is still very slow...

You can try Nadav's nEPT (nested EPT) patches; they should help with the
performance.

Orit

 
   Guido



Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Orit Wasserman
On 04/11/2012 08:00 PM, Guido Winkelmann wrote:
 Am Mittwoch, 11. April 2012, 17:25:12 schrieben Sie:
 On 04/11/2012 04:43 PM, Guido Winkelmann wrote:
 Am Mittwoch, 11. April 2012, 16:29:55 schrieben Sie:
 I'm not sure if this is the problem but I noticed that the second layer
 and
 the third layer have the same memory size (8G), how about trying to
 reduce
 the memory for the third layer ?

 I tried reducing the third layer to 1G. That didn't change anything.

 There is a patch for fixing nVMX in 3.4.
 http://www.mail-archive.com/kvm@vger.kernel.org/msg68951.html
 you can try it .
 
 Should I use that patch on the host, or on the first virtualized layer, or on 
 both? (Compiling 3.4-rc2 for both for now... )
On the host (we refer to it as L0 by the way).

Orit

 
   Guido





Re: [Qemu-devel] [PATCH v5 1/4] sockets: introduce set_socket_error()

2012-03-27 Thread Orit Wasserman
On 03/22/2012 05:52 AM, Amos Kong wrote:
 Introduce set_socket_error() to set the errno, use
 WSASetLastError() for win32.
 Sometimes, clean work would rewrite errno in error path,
 we can use this function to restore real errno.
 
 Signed-off-by: Amos Kong ak...@redhat.com
 ---
  qemu_socket.h |2 ++
  1 files changed, 2 insertions(+), 0 deletions(-)
 
 diff --git a/qemu_socket.h b/qemu_socket.h
 index fe4cf6c..a4c5170 100644
 --- a/qemu_socket.h
 +++ b/qemu_socket.h
 @@ -8,6 +8,7 @@
  #include ws2tcpip.h
  
  #define socket_error() WSAGetLastError()
 +#define set_socket_error(e) WSASetLastError(e)
  #undef EINTR
  #define EWOULDBLOCK WSAEWOULDBLOCK
  #define EINTR   WSAEINTR
 @@ -26,6 +27,7 @@ int inet_aton(const char *cp, struct in_addr *ia);
  #include sys/un.h
  
  #define socket_error() errno
 +#define set_socket_error(e) errno = e

How about creating a function, set_errno(int e), and using it in the macro (see the sketch below)?

Orit
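
A minimal sketch of that suggestion (hypothetical, just to show the shape):

    #include <errno.h>

    static inline void set_errno(int e)
    {
        errno = e;
    }

    #define set_socket_error(e) set_errno(e)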
  #define closesocket(s) close(s)
  
  #endif /* !_WIN32 */
 
 



Re: [Qemu-devel] [PATCH v5 2/4] qemu-socket: change inet_connect() to to support nonblock socket

2012-03-27 Thread Orit Wasserman
On 03/22/2012 05:52 AM, Amos Kong wrote:
 Change inet_connect(const char *str, int socktype) to
 inet_connect(const char *str, bool block),
 socktype is unused, block is used to assign if set socket
 to block/nonblock.
 
 Retry to connect when -EINTR/-EWOULDBLOCK is got.
 Connect's successful for nonblock socket when following
 errors are got:
   -EINPROGRESS
   -WSAEALREADY/-WSAEINVAL (win32)
 
 Add a bool entry(block) for dummy_opts to tag block type.
 
 Use set_socket_error() to set real errno, set errno to
 EINVAL for parse error.
 
 Change nbd, vnc to use new interface.
 
 Signed-off-by: Amos Kong ak...@redhat.com
 ---
  nbd.c  |2 +-
  qemu-sockets.c |   66 
 +++-
  qemu_socket.h  |2 +-
  ui/vnc.c   |2 +-
  4 files changed, 54 insertions(+), 18 deletions(-)
 
 diff --git a/nbd.c b/nbd.c
 index 567e94e..3618344 100644
 --- a/nbd.c
 +++ b/nbd.c
 @@ -146,7 +146,7 @@ int tcp_socket_outgoing(const char *address, uint16_t 
 port)
  
  int tcp_socket_outgoing_spec(const char *address_and_port)
  {
 -return inet_connect(address_and_port, SOCK_STREAM);
 +return inet_connect(address_and_port, true);
  }
  
  int tcp_socket_incoming(const char *address, uint16_t port)
 diff --git a/qemu-sockets.c b/qemu-sockets.c
 index 6bcb8e3..908479e 100644
 --- a/qemu-sockets.c
 +++ b/qemu-sockets.c
 @@ -51,6 +51,9 @@ static QemuOptsList dummy_opts = {
  },{
  .name = ipv6,
  .type = QEMU_OPT_BOOL,
 +},{
 +.name = block,
 +.type = QEMU_OPT_BOOL,
  },
  { /* end if list */ }
  },
 @@ -201,7 +204,8 @@ int inet_connect_opts(QemuOpts *opts)
  const char *port;
  char uaddr[INET6_ADDRSTRLEN+1];
  char uport[33];
 -int sock,rc;
 +int sock, rc, err;
 +bool block;
  
  memset(ai,0, sizeof(ai));
  ai.ai_flags = AI_CANONNAME | AI_ADDRCONFIG;
 @@ -210,9 +214,11 @@ int inet_connect_opts(QemuOpts *opts)
  
  addr = qemu_opt_get(opts, host);
  port = qemu_opt_get(opts, port);
 +block = qemu_opt_get_bool(opts, block, 0);
  if (addr == NULL || port == NULL) {
  fprintf(stderr, inet_connect: host and/or port not specified\n);
 -return -1;
 +err = -EINVAL;
 +goto err;

I would call set_socket_error with -EINVAL and then return -1;

  }
  
  if (qemu_opt_get_bool(opts, ipv4, 0))
 @@ -224,7 +230,8 @@ int inet_connect_opts(QemuOpts *opts)
  if (0 != (rc = getaddrinfo(addr, port, ai, res))) {
  fprintf(stderr,getaddrinfo(%s,%s): %s\n, addr, port,
  gai_strerror(rc));
 - return -1;
 +err = -EINVAL;
 +goto err;

I would call set_socket_error with EINVAL and then return -1;

  }
  
  for (e = res; e != NULL; e = e-ai_next) {
 @@ -241,21 +248,44 @@ int inet_connect_opts(QemuOpts *opts)
  continue;
  }
  setsockopt(sock,SOL_SOCKET,SO_REUSEADDR,(void*)on,sizeof(on));
 -
 +if (!block) {
 +socket_set_nonblock(sock);
 +}
  /* connect to peer */
 -        if (connect(sock, e->ai_addr, e->ai_addrlen) < 0) {
 -            if (NULL == e->ai_next)
 -                fprintf(stderr, "%s: connect(%s,%s,%s,%s): %s\n", __FUNCTION__,
 -                        inet_strfamily(e->ai_family),
 -                        e->ai_canonname, uaddr, uport, strerror(errno));
 -            closesocket(sock);
 -            continue;
 +        do {
 +            err = 0;
 +            if (connect(sock, e->ai_addr, e->ai_addrlen) < 0) {
 +                err = -socket_error();
 +            }
 +        } while (err == -EINTR || err == -EWOULDBLOCK);
 +

This is only relevant for a non-blocking socket;
it should look something like this:

while ((rc = connect(sock, e->ai_addr, e->ai_addrlen)) < 0 && !block &&
       (socket_error() == EINTR || socket_error() == EWOULDBLOCK))

 +        if (err >= 0) {
 +            goto success;
 +        } else if (!block && err == -EINPROGRESS) {
 +            goto success;
 +#ifdef _WIN32
 +        } else if (!block && (err == -WSAEALREADY || err == -WSAEINVAL)) {
 +            goto success;
 +#endif

I prefer to check for the error; the code should look like:

if (rc < 0 && (block ||
#ifdef _WIN32
    (socket_error() != WSAEALREADY && socket_error() != WSAEINVAL))) {
#else
    (socket_error() != EINPROGRESS))) {
#endif
    if (NULL == e->ai_next)
        fprintf(stderr, "%s: connect(%s,%s,%s,%s): %s\n", __func__,
                inet_strfamily(e->ai_family),
                e->ai_canonname, uaddr, uport, strerror(errno));
    closesocket(sock);
    continue;
}
...

Orit
  }
 -freeaddrinfo(res);
 -return sock;
 +
 +if (NULL == e-ai_next) {
 +fprintf(stderr, %s: connect(%s,%s,%s,%s): %s\n, __func__,
 +inet_strfamily(e-ai_family),
 +e-ai_canonname, uaddr, uport, 

Re: [Qemu-devel] [PATCH v4 1/2] qemu-socket: change inet_connect() to to support nonblock socket

2012-03-20 Thread Orit Wasserman
On 03/19/2012 04:11 PM, Amos Kong wrote:
 Change inet_connect(const char *str, int socktype) to
 inet_connect(const char *str, bool block, int *sock_err),
 socktype is unused, block is used to assign if set socket
 to block/nonblock, sock_err is used to restore socket error.

It is more common to call the parameter blocking/nonblocking.

You removed socktype (I noticed it was not implemented); can you keep the same
API but fix the implementation?
I don't like the use of sock_err.
How about adding a function set_socket_error() that sets the errno (for
win32 you can use SetLastError); you will then be able to read the error
back with the socket_error() function.
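Something along these lines would do (only a sketch; the helper name is a
suggestion, assuming socket_error() maps to WSAGetLastError() on win32 and to
errno elsewhere):

#ifdef _WIN32
static void set_socket_error(int err)
{
    WSASetLastError(err);    /* socket_error() reads it back via WSAGetLastError() */
}
#else
static void set_socket_error(int err)
{
    errno = err;             /* socket_error() is errno on POSIX */
}
#endif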

Cheers,
Orit
 
 The connect is successful for a nonblock socket when the following errors are returned:
   -EINPROGRESS or -EWOULDBLOCK
   -WSAEALREADY or -WSAEINVAL (win32)
 
 Also change the wrap function inet_connect_opts(QemuOpts *opts)
 to inet_connect_opts(QemuOpts *opts, int *sock_err).
 
 Add a bool entry(block) for dummy_opts to tag block type.
 Change nbd, vnc to use new interface.
 
 Signed-off-by: Amos Kong ak...@redhat.com
 ---
  nbd.c  |2 +-
  qemu-char.c|2 +-
  qemu-sockets.c |   73 
 
  qemu_socket.h  |4 ++-
  ui/vnc.c   |2 +-
  5 files changed, 62 insertions(+), 21 deletions(-)
 
 diff --git a/nbd.c b/nbd.c
 index 567e94e..ad4de06 100644
 --- a/nbd.c
 +++ b/nbd.c
 @@ -146,7 +146,7 @@ int tcp_socket_outgoing(const char *address, uint16_t 
 port)
  
  int tcp_socket_outgoing_spec(const char *address_and_port)
  {
 -return inet_connect(address_and_port, SOCK_STREAM);
 +return inet_connect(address_and_port, true, NULL);
  }
  
  int tcp_socket_incoming(const char *address, uint16_t port)
 diff --git a/qemu-char.c b/qemu-char.c
 index bb9e3f5..d3543ea 100644
 --- a/qemu-char.c
 +++ b/qemu-char.c
 @@ -2443,7 +2443,7 @@ static CharDriverState *qemu_chr_open_socket(QemuOpts 
 *opts)
  if (is_listen) {
  fd = inet_listen_opts(opts, 0);
  } else {
 -fd = inet_connect_opts(opts);
 +fd = inet_connect_opts(opts, NULL);
  }
  }
  if (fd  0) {
 diff --git a/qemu-sockets.c b/qemu-sockets.c
 index 6bcb8e3..8ed45f8 100644
 --- a/qemu-sockets.c
 +++ b/qemu-sockets.c
 @@ -51,6 +51,9 @@ static QemuOptsList dummy_opts = {
  },{
  .name = ipv6,
  .type = QEMU_OPT_BOOL,
 +},{
 +.name = block,
 +.type = QEMU_OPT_BOOL,
  },
  { /* end if list */ }
  },
 @@ -194,14 +197,15 @@ listen:
  return slisten;
  }
  
 -int inet_connect_opts(QemuOpts *opts)
 +int inet_connect_opts(QemuOpts *opts, int *sock_err)
  {
  struct addrinfo ai,*res,*e;
  const char *addr;
  const char *port;
  char uaddr[INET6_ADDRSTRLEN+1];
  char uport[33];
 -int sock,rc;
 +int sock, rc, err;
 +bool block;
  
  memset(ai,0, sizeof(ai));
  ai.ai_flags = AI_CANONNAME | AI_ADDRCONFIG;
 @@ -210,9 +214,11 @@ int inet_connect_opts(QemuOpts *opts)
  
  addr = qemu_opt_get(opts, host);
  port = qemu_opt_get(opts, port);
 +block = qemu_opt_get_bool(opts, block, 0);
  if (addr == NULL || port == NULL) {
  fprintf(stderr, inet_connect: host and/or port not specified\n);
 -return -1;
 +err = -EINVAL;
 +goto err;
  }
  
  if (qemu_opt_get_bool(opts, ipv4, 0))
 @@ -224,7 +230,8 @@ int inet_connect_opts(QemuOpts *opts)
  if (0 != (rc = getaddrinfo(addr, port, ai, res))) {
  fprintf(stderr,getaddrinfo(%s,%s): %s\n, addr, port,
  gai_strerror(rc));
 - return -1;
 +err = -EINVAL;
 +goto err;
  }
  
  for (e = res; e != NULL; e = e-ai_next) {
 @@ -241,21 +248,52 @@ int inet_connect_opts(QemuOpts *opts)
  continue;
  }
  setsockopt(sock,SOL_SOCKET,SO_REUSEADDR,(void*)on,sizeof(on));
 -
 +if (!block) {
 +socket_set_nonblock(sock);
 +}
  /* connect to peer */
 -if (connect(sock,e-ai_addr,e-ai_addrlen)  0) {
 -if (NULL == e-ai_next)
 -fprintf(stderr, %s: connect(%s,%s,%s,%s): %s\n, 
 __FUNCTION__,
 -inet_strfamily(e-ai_family),
 -e-ai_canonname, uaddr, uport, strerror(errno));
 -closesocket(sock);
 -continue;
 +do {
 +err = 0;
 +if (connect(sock, e-ai_addr, e-ai_addrlen)  0) {
 +err = -socket_error();
 +if (block) {
 +break;
 +}
 +}
 +} while (err == -EINTR);
 +
 +if (err = 0) {
 +goto success;
 +} else if (!block  (err == -EINPROGRESS || err == -EWOULDBLOCK)) {
 +goto success;
 +#ifdef _WIN32
 +} else if (!block  (sock_err == -WSAEALREADY ||
 +  sock_err == -WSAEINVAL)) {
 

Re: [Qemu-devel] [PATCH v3 1/9] net: introduce tcp_server_start()

2012-03-14 Thread Orit Wasserman
On 03/07/2012 12:47 AM, Amos Kong wrote:
 Introduce tcp_server_start() by moving original code in
 tcp_start_incoming_migration().
 
 Signed-off-by: Amos Kong ak...@redhat.com
 ---
  net.c |   28 
  qemu_socket.h |2 ++
  2 files changed, 30 insertions(+), 0 deletions(-)
 
 diff --git a/net.c b/net.c
 index c34474f..e90ff23 100644
 --- a/net.c
 +++ b/net.c
 @@ -99,6 +99,34 @@ static int get_str_sep(char *buf, int buf_size, const char 
 **pp, int sep)
  return 0;
  }
  
 +int tcp_server_start(const char *str, int *fd)
 +{
 +int val, ret;
 +struct sockaddr_in saddr;
 +
 +if (parse_host_port(saddr, str)  0) {
 +error_report(invalid host/port combination: %s, str);
 +return -EINVAL;
 +}
 +
 +*fd = qemu_socket(PF_INET, SOCK_STREAM, 0);
 +if (fd  0) {
 +perror(socket);

 +return -1;

return -socket_error();

 +}
 +socket_set_nonblock(*fd);

In incoming migration we don't use a non-blocking socket.
How about adding a flag to the function for non-blocking?
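A rough sketch of what I mean (hypothetical signature; the body is the same as
in this patch, only socket_set_nonblock() becomes conditional and errno-style
returns are kept):

int tcp_server_start(const char *str, int *fd, bool nonblock)
{
    int val, ret;
    struct sockaddr_in saddr;

    if (parse_host_port(&saddr, str) < 0) {
        error_report("invalid host/port combination: %s", str);
        return -EINVAL;
    }

    *fd = qemu_socket(PF_INET, SOCK_STREAM, 0);
    if (*fd < 0) {
        perror("socket");
        return -socket_error();
    }
    if (nonblock) {
        socket_set_nonblock(*fd);
    }

    /* allow fast reuse */
    val = 1;
    setsockopt(*fd, SOL_SOCKET, SO_REUSEADDR, (const char *)&val, sizeof(val));

    ret = bind(*fd, (struct sockaddr *)&saddr, sizeof(saddr));
    if (ret < 0) {
        ret = -socket_error();    /* save the error before closesocket() clobbers it */
        closesocket(*fd);
    }
    return ret;
}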

 +
 +/* allow fast reuse */
 +val = 1;
 +setsockopt(*fd, SOL_SOCKET, SO_REUSEADDR, (const char *)val, 
 sizeof(val));
 +
 +ret = bind(*fd, (struct sockaddr *)saddr, sizeof(saddr));
 +if (ret  0) {
 +closesocket(*fd);
 +}
 +return ret;

if (bind(*fd, (struct sockaddr *)&saddr, sizeof(saddr)) < 0)
{
    closesocket(*fd);
    return -socket_error();
}
return 0;

and then you will not need ret

 +}
 +
  int parse_host_port(struct sockaddr_in *saddr, const char *str)
  {
  char buf[512];
 diff --git a/qemu_socket.h b/qemu_socket.h
 index fe4cf6c..d612793 100644
 --- a/qemu_socket.h
 +++ b/qemu_socket.h
 @@ -54,6 +54,8 @@ int unix_listen(const char *path, char *ostr, int olen);
  int unix_connect_opts(QemuOpts *opts);
  int unix_connect(const char *path);
  
 +int tcp_server_start(const char *str, int *fd);
 +
  /* Old, ipv4 only bits.  Don't use for new code. */
  int parse_host_port(struct sockaddr_in *saddr, const char *str);
  int socket_init(void);
 
 

Orit


Re: [Qemu-devel] [PATCH v3 3/9] net: introduce tcp_client_start()

2012-03-14 Thread Orit Wasserman
On 03/07/2012 12:48 AM, Amos Kong wrote:
 Introduce tcp_client_start() by moving original code in
 tcp_start_outgoing_migration().
 
 Signed-off-by: Amos Kong ak...@redhat.com
 ---
  net.c |   41 +
  qemu_socket.h |1 +
  2 files changed, 42 insertions(+), 0 deletions(-)
 
 diff --git a/net.c b/net.c
 index e90ff23..9afb0d1 100644
 --- a/net.c
 +++ b/net.c
 @@ -127,6 +127,47 @@ int tcp_server_start(const char *str, int *fd)
  return ret;
  }
  
 +int tcp_client_start(const char *str, int *fd)
 +{
 +struct sockaddr_in saddr;
 +int ret;
 +
 +*fd = -1;
 +if (parse_host_port(saddr, str)  0) {
 +error_report(invalid host/port combination: %s, str);
 +return -EINVAL;
 +}
 +
 +*fd = qemu_socket(PF_INET, SOCK_STREAM, 0);
 +if (fd  0) {
 +perror(socket);
 +return -1;

return -socket_error();

 +}
 +socket_set_nonblock(*fd);
 +
 +for (;;) {
 +ret = connect(*fd, (struct sockaddr *)saddr, sizeof(saddr));
 +if (ret  0) {
 +ret = -socket_error();
 +if (ret == -EINPROGRESS) {
 +break;
 +#ifdef _WIN32
 +} else if (ret == -WSAEALREADY || ret == -WSAEINVAL) {
 +break;
 +#endif
 +} else if (ret != -EINTR  ret != -EWOULDBLOCK) {
 +perror(connect);
 +closesocket(*fd);
 +return ret;
 +}
 +} else {
 +break;
 +}
 +}
 +
 +return ret;
 +}
 +
  int parse_host_port(struct sockaddr_in *saddr, const char *str)
  {
  char buf[512];
 diff --git a/qemu_socket.h b/qemu_socket.h
 index d612793..9246578 100644
 --- a/qemu_socket.h
 +++ b/qemu_socket.h
 @@ -55,6 +55,7 @@ int unix_connect_opts(QemuOpts *opts);
  int unix_connect(const char *path);
  
  int tcp_server_start(const char *str, int *fd);
 +int tcp_client_start(const char *str, int *fd);
  
  /* Old, ipv4 only bits.  Don't use for new code. */
  int parse_host_port(struct sockaddr_in *saddr, const char *str);
 
 

Orit


Re: [Qemu-devel] [PATCH v3 1/9] net: introduce tcp_server_start()

2012-03-14 Thread Orit Wasserman
On 03/14/2012 09:51 AM, Amos Kong wrote:
 On 14/03/12 15:27, Paolo Bonzini wrote:

 
 Hi Paolo,
 
 On 14/03/2012 08:14, Orit Wasserman wrote:
 if (bind(*fd, (struct sockaddr *)&saddr, sizeof(saddr)) < 0)
 {
     closesocket(*fd);
     return -socket_error();
 }
 return 0;

 and then you will not need ret

 But closesocket could clobber socket_error(), no?

Completely correct.

 
 Yes, it will effect socket_error()
 
 How about this fix ?
 
 ret = bind(*fd, (struct sockaddr *)&saddr, sizeof(saddr));
 if (ret < 0) {
     ret = -socket_error();
     closesocket(*fd);
 }
 return ret;
 }
 

Looks good.

Orit


Re: [Qemu-devel] [PATCH v2 1/9] net: introduce tcp_server_start()

2012-03-05 Thread Orit Wasserman
On 03/05/2012 12:03 PM, Amos Kong wrote:
 Introduce tcp_server_start() by moving original code in
 tcp_start_incoming_migration().
 
 Signed-off-by: Amos Kong ak...@redhat.com
 ---
  net.c |   27 +++
  qemu_socket.h |2 ++
  2 files changed, 29 insertions(+), 0 deletions(-)
 
 diff --git a/net.c b/net.c
 index c34474f..0260968 100644
 --- a/net.c
 +++ b/net.c
 @@ -99,6 +99,33 @@ static int get_str_sep(char *buf, int buf_size, const char 
 **pp, int sep)
  return 0;
  }
  
 +int tcp_server_start(const char *str, int *fd)
 +{
 +int val, ret;
 +struct sockaddr_in saddr;
 +
 +if (parse_host_port(saddr, str)  0) {

error message would be nice 

 +return -1;
 +}
 +
 +*fd = qemu_socket(PF_INET, SOCK_STREAM, 0);
 +if (fd  0) {
 +perror(socket);
 +return -1;
 +}

this is actually the net_socket_listen_init version;
tcp_start_incoming_migration returns the error -socket_error().
I prefer not to lose the errno.

I know that when calling net_socket_listen_init, for some unknown reason there
is an explicit check for -1:
 if (net_socket_listen_init(vlan, "socket", name, listen) == -1) {
I think it is a good opportunity to change this check.
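Something like this (sketch only): once tcp_server_start() returns -errno
values, the caller can check for any negative value instead of the literal -1:

if (net_socket_listen_init(vlan, "socket", name, listen) < 0) {
    return -1;
}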

Orit
 +socket_set_nonblock(*fd);
 +
 +/* allow fast reuse */
 +val = 1;
 +setsockopt(*fd, SOL_SOCKET, SO_REUSEADDR, (const char *)val, 
 sizeof(val));
 +
 +ret = bind(*fd, (struct sockaddr *)saddr, sizeof(saddr));
 +if (ret  0) {
 +closesocket(*fd);
 +}
 +return ret;
 +}
 +
  int parse_host_port(struct sockaddr_in *saddr, const char *str)
  {
  char buf[512];
 diff --git a/qemu_socket.h b/qemu_socket.h
 index fe4cf6c..d612793 100644
 --- a/qemu_socket.h
 +++ b/qemu_socket.h
 @@ -54,6 +54,8 @@ int unix_listen(const char *path, char *ostr, int olen);
  int unix_connect_opts(QemuOpts *opts);
  int unix_connect(const char *path);
  
 +int tcp_server_start(const char *str, int *fd);
 +
  /* Old, ipv4 only bits.  Don't use for new code. */
  int parse_host_port(struct sockaddr_in *saddr, const char *str);
  int socket_init(void);
 
 



Re: [Qemu-devel] [PATCH v2 3/9] net: introduce tcp_client_start()

2012-03-05 Thread Orit Wasserman
On 03/05/2012 12:03 PM, Amos Kong wrote:
 Introduce tcp_client_start() by moving original code in
 tcp_start_outgoing_migration().
 
 Signed-off-by: Amos Kong ak...@redhat.com
 ---
  net.c |   39 +++
  qemu_socket.h |1 +
  2 files changed, 40 insertions(+), 0 deletions(-)
 
 diff --git a/net.c b/net.c
 index 0260968..5c20e22 100644
 --- a/net.c
 +++ b/net.c
 @@ -126,6 +126,45 @@ int tcp_server_start(const char *str, int *fd)
  return ret;
  }
  
 +int tcp_client_start(const char *str, int *fd)
 +{
 +struct sockaddr_in saddr;
 +int ret;
 +
 +if (parse_host_port(saddr, str)  0) {
 +return -EINVAL;
 +}
 +
 +*fd = qemu_socket(PF_INET, SOCK_STREAM, 0);
 +if (fd  0) {
 +perror(socket);
 +return -1;
 +}
 +socket_set_nonblock(*fd);
 +
 +for (;;) {
 +ret = connect(*fd, (struct sockaddr *)saddr, sizeof(saddr));
 +if (ret  0) {
 +ret = -socket_error();
 +if (ret == -EINPROGRESS) {
 +break;
 +#ifdef _WIN32
 +} else if (ret == -WSAEALREADY || ret == -WSAEINVAL) {
 +break;
 +#endif
 +} else if (ret != -EINTR  ret != -EWOULDBLOCK) {
 +perror(connect);
 +closesocket(*fd);
 +return -1;

I think it should be: return ret (otherwise you lose the error code).
And you need it.

Orit
 +}
 +} else {
 +break;
 +}
 +}
 +
 +return ret;
 +}
 +
  int parse_host_port(struct sockaddr_in *saddr, const char *str)
  {
  char buf[512];
 diff --git a/qemu_socket.h b/qemu_socket.h
 index d612793..9246578 100644
 --- a/qemu_socket.h
 +++ b/qemu_socket.h
 @@ -55,6 +55,7 @@ int unix_connect_opts(QemuOpts *opts);
  int unix_connect(const char *path);
  
  int tcp_server_start(const char *str, int *fd);
 +int tcp_client_start(const char *str, int *fd);
  
  /* Old, ipv4 only bits.  Don't use for new code. */
  int parse_host_port(struct sockaddr_in *saddr, const char *str);
 
 



Re: [Qemu-devel] [PATCH v2 2/9] net: use tcp_server_start() for tcp server creation

2012-03-05 Thread Orit Wasserman
On 03/05/2012 12:03 PM, Amos Kong wrote:
 Use tcp_server_start in those two functions:
  tcp_start_incoming_migration()
  net_socket_listen_init()
 
 Signed-off-by: Amos Kong ak...@redhat.com
 ---
  migration-tcp.c |   21 +
  net/socket.c|   23 +++
  2 files changed, 8 insertions(+), 36 deletions(-)
 
 diff --git a/migration-tcp.c b/migration-tcp.c
 index 35a5781..ecadd10 100644
 --- a/migration-tcp.c
 +++ b/migration-tcp.c
 @@ -157,28 +157,17 @@ out2:
  
  int tcp_start_incoming_migration(const char *host_port)
  {
 -struct sockaddr_in addr;
 -int val;
 +int ret;
  int s;
  
  DPRINTF(Attempting to start an incoming migration\n);
  
 -if (parse_host_port(addr, host_port)  0) {
 -fprintf(stderr, invalid host/port combination: %s\n, host_port);
 -return -EINVAL;
 -}
 -
 -s = qemu_socket(PF_INET, SOCK_STREAM, 0);
 -if (s == -1) {
 -return -socket_error();
 +ret = tcp_server_start(host_port, s);
 +if (ret  0) {
 +fprintf(stderr, tcp_server_start: %s\n, strerror(-ret));
 +return ret;
  }
  
 -val = 1;
 -setsockopt(s, SOL_SOCKET, SO_REUSEADDR, (const char *)val, sizeof(val));
 -
 -if (bind(s, (struct sockaddr *)addr, sizeof(addr)) == -1) {
 -goto err;
 -}
  if (listen(s, 1) == -1) {
  goto err;
  }
 diff --git a/net/socket.c b/net/socket.c
 index 0bcf229..5feb3d2 100644
 --- a/net/socket.c
 +++ b/net/socket.c
 @@ -403,31 +403,14 @@ static int net_socket_listen_init(VLANState *vlan,
const char *host_str)
  {
  NetSocketListenState *s;
 -int fd, val, ret;
 -struct sockaddr_in saddr;
 -
 -if (parse_host_port(saddr, host_str)  0)
 -return -1;
 +int fd, ret;
  
  s = g_malloc0(sizeof(NetSocketListenState));
  
 -fd = qemu_socket(PF_INET, SOCK_STREAM, 0);
 -if (fd  0) {
 -perror(socket);
 -g_free(s);
 -return -1;
 -}
 -socket_set_nonblock(fd);
 -
 -/* allow fast reuse */
 -val = 1;
 -setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, (const char *)val, 
 sizeof(val));
 -
 -ret = bind(fd, (struct sockaddr *)saddr, sizeof(saddr));
 +ret = tcp_server_start(host_str, fd);
  if (ret  0) {
 -perror(bind);
 +error_report(tcp_server_start: %s, strerror(-ret));

If the return value is always -1, this has no meaning.

Orit
  g_free(s);
 -closesocket(fd);
  return -1;
  }
  ret = listen(fd, 0);
 
 



Re: [Qemu-devel] [PATCH v2 3/9] net: introduce tcp_client_start()

2012-03-05 Thread Orit Wasserman
On 03/05/2012 12:03 PM, Amos Kong wrote:
 Introduce tcp_client_start() by moving original code in
 tcp_start_outgoing_migration().
 
 Signed-off-by: Amos Kong ak...@redhat.com
 ---
  net.c |   39 +++
  qemu_socket.h |1 +
  2 files changed, 40 insertions(+), 0 deletions(-)
 
 diff --git a/net.c b/net.c
 index 0260968..5c20e22 100644
 --- a/net.c
 +++ b/net.c
 @@ -126,6 +126,45 @@ int tcp_server_start(const char *str, int *fd)
  return ret;
  }
  
 +int tcp_client_start(const char *str, int *fd)
 +{
 +struct sockaddr_in saddr;
 +int ret;
 +
 +if (parse_host_port(saddr, str)  0) {
 +return -EINVAL;

You use this in order to know when to call migrate_fd_error; this is problematic,
as another error path can return the same error code.
I think that setting *fd = -1 at the beginning of the function would be enough.
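For example, the caller in tcp_start_outgoing_migration() could then look
roughly like this (sketch only, assuming tcp_client_start() leaves *fd at -1
when the host/port parse fails):

ret = tcp_client_start(host_port, &fd);
s->fd = fd;
if (ret == -EINPROGRESS || ret == -EWOULDBLOCK) {
    DPRINTF("connect in progress");
    qemu_set_fd_handler2(s->fd, NULL, NULL, tcp_wait_for_connect, s);
} else if (ret < 0) {
    DPRINTF("connect failed\n");
    if (fd != -1) {               /* a socket was created: report the failure */
        migrate_fd_error(s);
    }
    return ret;
} else {
    migrate_fd_connect(s);
}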

Orit
 +}
 +
 +*fd = qemu_socket(PF_INET, SOCK_STREAM, 0);
 +if (fd  0) {
 +perror(socket);
 +return -1;
 +}
 +socket_set_nonblock(*fd);
 +
 +for (;;) {
 +ret = connect(*fd, (struct sockaddr *)saddr, sizeof(saddr));
 +if (ret  0) {
 +ret = -socket_error();
 +if (ret == -EINPROGRESS) {
 +break;
 +#ifdef _WIN32
 +} else if (ret == -WSAEALREADY || ret == -WSAEINVAL) {
 +break;
 +#endif
 +} else if (ret != -EINTR  ret != -EWOULDBLOCK) {
 +perror(connect);
 +closesocket(*fd);
 +return -1;
should be return ret;
 +}
 +} else {
 +break;
 +}
 +}
 +
 +return ret;
 +}
 +
  int parse_host_port(struct sockaddr_in *saddr, const char *str)
  {
  char buf[512];
 diff --git a/qemu_socket.h b/qemu_socket.h
 index d612793..9246578 100644
 --- a/qemu_socket.h
 +++ b/qemu_socket.h
 @@ -55,6 +55,7 @@ int unix_connect_opts(QemuOpts *opts);
  int unix_connect(const char *path);
  
  int tcp_server_start(const char *str, int *fd);
 +int tcp_client_start(const char *str, int *fd);
  
  /* Old, ipv4 only bits.  Don't use for new code. */
  int parse_host_port(struct sockaddr_in *saddr, const char *str);
 
 



Re: [Qemu-devel] [PATCH v2 4/9] net: use tcp_client_start for tcp client creation

2012-03-05 Thread Orit Wasserman
On 03/05/2012 12:03 PM, Amos Kong wrote:
 Use tcp_client_start() in those two functions:
  tcp_start_outgoing_migration()
  net_socket_connect_init()
 
 Signed-off-by: Amos Kong ak...@redhat.com
 ---
  migration-tcp.c |   41 +
  net/socket.c|   41 +++--
  2 files changed, 24 insertions(+), 58 deletions(-)
 
 diff --git a/migration-tcp.c b/migration-tcp.c
 index ecadd10..4f89bff 100644
 --- a/migration-tcp.c
 +++ b/migration-tcp.c
 @@ -81,43 +81,28 @@ static void tcp_wait_for_connect(void *opaque)
  
  int tcp_start_outgoing_migration(MigrationState *s, const char *host_port)
  {
 -struct sockaddr_in addr;
  int ret;
 -
 -ret = parse_host_port(addr, host_port);
 -if (ret  0) {
 -return ret;
 -}
 +int fd;
  
  s-get_error = socket_errno;
  s-write = socket_write;
  s-close = tcp_close;
  
 -s-fd = qemu_socket(PF_INET, SOCK_STREAM, 0);
 -if (s-fd == -1) {
 -DPRINTF(Unable to open socket);
 -return -socket_error();
 -}
 -
 -socket_set_nonblock(s-fd);
 -
 -do {
 -ret = connect(s-fd, (struct sockaddr *)addr, sizeof(addr));
 -if (ret == -1) {
 -ret = -socket_error();
 -}
 -if (ret == -EINPROGRESS || ret == -EWOULDBLOCK) {
 -qemu_set_fd_handler2(s-fd, NULL, NULL, tcp_wait_for_connect, s);
 -return 0;
 -}
 -} while (ret == -EINTR);
 -
 -if (ret  0) {
 +ret = tcp_client_start(host_port, fd);
 +s-fd = fd;

You don't need the local fd; you can pass &s->fd to the function directly.

Orit

 +if (ret == -EINPROGRESS || ret == -EWOULDBLOCK) {
 +DPRINTF(connect in progress);
 +qemu_set_fd_handler2(s-fd, NULL, NULL, tcp_wait_for_connect, s);
 +} else if (ret  0) {
  DPRINTF(connect failed\n);
 -migrate_fd_error(s);
 +if (ret != -EINVAL) {
 +migrate_fd_error(s);
 +}
  return ret;
 +} else {
 +migrate_fd_connect(s);
  }
 -migrate_fd_connect(s);
 +
  return 0;
  }
  
 diff --git a/net/socket.c b/net/socket.c
 index 5feb3d2..b7cd8ec 100644
 --- a/net/socket.c
 +++ b/net/socket.c
 @@ -434,41 +434,22 @@ static int net_socket_connect_init(VLANState *vlan,
 const char *host_str)
  {
  NetSocketState *s;
 -int fd, connected, ret, err;
 +int fd, connected, ret;
  struct sockaddr_in saddr;
  
 -if (parse_host_port(saddr, host_str)  0)
 -return -1;
 -
 -fd = qemu_socket(PF_INET, SOCK_STREAM, 0);
 -if (fd  0) {
 -perror(socket);
 -return -1;
 -}
 -socket_set_nonblock(fd);
 -
 -connected = 0;
 -for(;;) {
 -ret = connect(fd, (struct sockaddr *)saddr, sizeof(saddr));
 -if (ret  0) {
 -err = socket_error();
 -if (err == EINTR || err == EWOULDBLOCK) {
 -} else if (err == EINPROGRESS) {
 -break;
 +ret = tcp_client_start(host_str, fd);
 +if (ret == -EINPROGRESS || ret == -EWOULDBLOCK) {
 +connected = 0;
  #ifdef _WIN32
 -} else if (err == WSAEALREADY || err == WSAEINVAL) {
 -break;
 +} else if (ret == -WSAEALREADY || ret == -WSAEINVAL) {
 +connected = 0;
  #endif
 -} else {
 -perror(connect);
 -closesocket(fd);
 -return -1;
 -}
 -} else {
 -connected = 1;
 -break;
 -}
 +} else if (ret  0) {
 +return -1;
 +} else {
 +connected = 1;
  }
 +
  s = net_socket_fd_init(vlan, model, name, fd, connected);
  if (!s)
  return -1;
 
 



Re: VMXON region vs VMCS region?

2012-01-31 Thread Orit Wasserman
On 01/31/2012 05:35 AM, Zhi Yong Wu wrote:
 HI,
 
 Can anyone let me know know the difference  between VMXON region and
 VMCS region? relationship?
 

There is no direct relationship between them:
a VMXON region is created per logical processor and is used by that processor for VMX operations.
A VMCS region is created for each guest vcpu and is used both by the hypervisor and
by the processor.
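Conceptually (an illustrative sketch only, not KVM's actual structures):

#define MAX_CPUS 64

struct vmx_host {
    void *vmxon_region[MAX_CPUS];   /* one VMXON region per logical processor,
                                       handed to VMXON once when VMX is enabled */
};

struct vmx_vcpu {
    void *vmcs;                     /* one VMCS per guest vcpu, loaded with
                                       VMPTRLD before each VMLAUNCH/VMRESUME */
};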

 It will be appreciated if you can make some comments.
 
 



Re: [Qemu-devel] [PATCH 00/21][RFC] postcopy live migration

2012-01-01 Thread Orit Wasserman
On 12/30/2011 12:39 AM, Anthony Liguori wrote:
 On 12/28/2011 07:25 PM, Isaku Yamahata wrote:
 Intro
 =
 This patch series implements postcopy live migration.[1]
 As discussed at KVM forum 2011, dedicated character device is used for
 distributed shared memory between migration source and destination.
 Now we can discuss/benchmark/compare with precopy. I believe there are
 much rooms for improvement.

 [1] http://wiki.qemu.org/Features/PostCopyLiveMigration


 Usage
 =
 You need load umem character device on the host before starting migration.
 Postcopy can be used for tcg and kvm accelarator. The implementation depend
 on only linux umem character device. But the driver dependent code is split
 into a file.
 I tested only host page size == guest page size case, but the implementation
 allows host page size != guest page size case.

 The following options are added with this patch series.
 - incoming part
command line options
-postcopy [-postcopy-flagsflags]
where flags is for changing behavior for benchmark/debugging
Currently the following flags are available
0: default
1: enable touching page request

example:
qemu -postcopy -incoming tcp:0: -monitor stdio -machine accel=kvm

 - outging part
options for migrate command
migrate [-p [-n]] URI
-p: indicate postcopy migration
-n: disable background transferring pages: This is for benchmark/debugging

example:
migrate -p -n tcp:dest ip address:


 TODO
 
 - benchmark/evaluation. Especially how async page fault affects the result.
 
 I'll review this series next week (Mike/Juan, please also review when you 
 can).
 
 But we really need to think hard about whether this is the right thing to 
 take into the tree.  I worry a lot about the fact that we don't test pre-copy 
 migration nearly enough and adding a second form just introduces more things 
 to test.

 It's also not clear to me why post-copy is better.  If you were going to sit 
 down and explain to someone building a management tool when they should use 
 pre-copy and when they should use post-copy, what would you tell them?

Start with pre-copy; if it doesn't converge, switch to post-copy.

Orit
 
 Regards,
 
 Anthony Liguori
 



Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-11-13 Thread Orit Wasserman
On 11/12/2011 11:37 PM, Nadav Har'El wrote:
 On Sat, Nov 12, 2011, Avi Kivity wrote about Re: [PATCH 02/10] nEPT: MMU 
 context for nested EPT:
 host may write-protect a page.  Second, the shadow and guest ptes may be
 in different formats (ept vs ia32).
 
 I'm afraid I've lost you here... The shadow table and the to-be-shadowed 
 table are both ia32 (this is the normal shadow table code), or both ept
 (the nested tdp code). When are they supposed to be in different
 formats (ept vs ia32)?
 
 I'm also puzzled in what situation will the host will write-protect an EPT02
 (shadow EPT) page?
 
 In fact that happens to accidentally work, no?  Intermediate ptes are
 always present/write/user, which translates to read/write/execute in EPT.
 
 It didn't work because it also used to set the accessed bit, bit 5,
 which on EPT is reserved and caused EPT misconfiguration. So I had to
 fix link_shadow_page, or nested EPT would not work at all.
 
 Don't optimize for least changes, optimize for best result afterwards.
 
 As I'm sure you remember, two years ago, in September 6 2009, you wrote in
 your blog about the newly contributed nested VMX patch set, and in
 particular its nested EPT (which predated the nested NPT contribution).
 
 Nested EPT was, for some workloads, a huge performance improvement, but
 you (if I understand correctly) did not want that code in KVM because
 it, basically, optimized for getting the job done, in the most correct
 and most efficient manner - but without regard of how cleanly this fit with
 other types of shadowing (normal shadow page tables, and nested NPT),
 or how much of the code was being duplicated or circumvented.
 
 So this time around, I couldn't really not optimize for least changes.
 This time, the nested EPT had to fit (like a square peg in a round hole
 ;-)), into the preexisting MMU and NPT shadowing. I couldn't really just write
 the most correct and most efficient code (which Orit Wasserman already
 did, two years earlier). This time I needed to figure out the least obtrusive
 way of changing the existing code. The hardest thing about doing this
 was trying to understand all the complexities and subtleties of the existing
 MMU code in KVM, which already does 101 different cases in one
 overloaded piece of code, which is not commented or documented.
 And of course, add to that all the complexities (some might even say cruft)
 which the underlying x86 architecture itself has acrued over the years.
 So it's not surprising I've missed some of the important subtleties which
 didn't have any effect in the typical case I've tried. Like I said, in my
 tests nested EPT *did* work. And even getting to that point was hard enough 
 :-)
 
 We need a third variant of walk_addr_generic that parses EPT format
 PTEs.  Whether that's best done by writing paging_ept.h or modifying
 paging_tmpl.h, I don't know.
 
 Thanks. I'll think about everything you've said in this thread (I'm still
 not convinced I understood all your points, so just understanding them
 will be the first step). I'll see what I can do to improve the patch.
 
 But I have to be honest - I'm not sure how quickly I can finish this.
 I really appreciate all your comments about nested VMX in the last two
 years - most of them have been spot-on, 100% correct, and really helpful
 for making me understand things which I had previously misunderstood.
 However, since you are (of course) extremely familiar with every nook and
 cranny of KVM, what normally happens is that every comment which took you
 5 minutes to figure out, takes me 5 days to fully understand, and to actually
 write, debug and test the fixed code. Every review that takes you two days
 to go through (and is very much appreciated!) takes me several months to fix
 each and every thing you asked for.
 
 Don't get me wrong, I *am* planning to continue working (part-time) on nested
 VMX, and nested EPT in particular. But if you want it to pick up the pace,
 I could use some help with actual coding from people who have much more
 intimate knowledge of the non-nested-VMX parts of KVM than I have.
 
 In the meantime, if anybody wants to experiment with a much faster
 Nested VMX than we had before, you can try my current patch. It may not
 be perfect, but in many ways it is better than the old shadow-on-ept code.
 And in simple (64 bit, 4k page) kvm-over-kvm configurations like I tried, it
 works well.
 
 Nadav.
 

Maybe this patch can help; it is roughly what Avi wants (I hope), done very
quickly.
I'm sorry, I don't have a setup to run nested VMX at the moment, so I can't test it.

Orit
From 032ba0221ea690f61ecc4f6d946090d18d6f4dfb Mon Sep 17 00:00:00 2001
From: Orit Wasserman owass...@redhard.com
Date: Sun, 13 Nov 2011 12:57:17 +0200
Subject: [PATCH] Add EPT_walk_addr_genric

---
 arch/x86/kvm/mmu.c |   13 +
 arch/x86/kvm/mmu.h |5 +
 arch/x86/kvm/paging_tmpl.h |   44 
 3 files changed, 58

Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-11-13 Thread Orit Wasserman
On 11/13/2011 04:32 PM, Avi Kivity wrote:
 On 11/13/2011 01:30 PM, Orit Wasserman wrote:
 Maybe this patch can help, this is roughly what Avi wants (I hope) done very 
 quickly.
 I'm sorry I don't have setup to run nested VMX at the moment so i can't test 
 it.

 Orit

 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 9335e1b..bbe212f 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -3180,6 +3180,10 @@ static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, 
 unsigned access,
  #include paging_tmpl.h
  #undef PTTYPE
  
 +#define PTTYPE EPT
 +#include paging_tmpl.h
 +#undef PTTYPE
 +
 
 Yes, that's the key.
 
 
  int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 
 sptes[4]);
  void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
  int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool 
 direct);
 diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
 index 507e2b8..70d4cfd 100644
 --- a/arch/x86/kvm/paging_tmpl.h
 +++ b/arch/x86/kvm/paging_tmpl.h
 @@ -39,6 +39,21 @@
  #define CMPXCHG cmpxchg64
  #define PT_MAX_FULL_LEVELS 2
  #endif
 +#elif PTTYPE == EPT
 +#define pt_element_t u64
 +#define FNAME(name) EPT_##name
 +#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
 +#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
 +#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
 +#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
 +#define PT_LEVEL_BITS PT64_LEVEL_BITS
 +#ifdef CONFIG_X86_64
 +#define PT_MAX_FULL_LEVELS 4
 +#define CMPXCHG cmpxchg
 +#else
 +#define CMPXCHG cmpxchg64
 +#define PT_MAX_FULL_LEVELS 2
 +#endif
 
 The various masks should be defined here, to avoid lots of #ifdefs later.
 

That is what I did first, but then I was afraid that the masks would be changed for
mmu.c too, so I decided on ifdefs.
The more I think about it, the more I think we need wrapper functions for mask checking (at
least for this file).
What do you think?
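For example, roughly like this (sketch only, along the lines of the
is_dirty_gpte() suggestion further down; names and masks are illustrative):

static inline int FNAME(is_dirty_gpte)(pt_element_t gpte)
{
#if PTTYPE == EPT
    return 1;                        /* EPT PTEs have no dirty bit in this format */
#else
    return !!(gpte & PT_DIRTY_MASK);
#endif
}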


  #elif PTTYPE == 32
  #define pt_element_t u32
  #define guest_walker guest_walker32
 @@ -106,14 +121,19 @@ static unsigned FNAME(gpte_access)(struct kvm_vcpu 
 *vcpu, pt_element_t gpte,
  {
  unsigned access;
  
 +#if PTTYPE == EPT   
  access = (gpte  (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
 +#else
 +access = (gpte  EPT_WRITABLE_MASK) | EPT_EXEC_MASK;
  if (last  !is_dirty_gpte(gpte))
  access = ~ACC_WRITE_MASK;
 +#endif
 
 Like here, you could make is_dirty_gpte() local to paging_tmpl()
 returning true for EPT and the dirty bit otherwise.
 
 
  
  #if PTTYPE == 64
  if (vcpu-arch.mmu.nx)
  access = ~(gpte  PT64_NX_SHIFT);
 
 The ept X bit is lost.
 
 Could do something like
 
access = (gpte  PT_X_NX_SHIFT) ^ PT_X_NX_SENSE;
 
 
 +#if PTTYPE == EPT
 +const int write_fault = access  EPT_WRITABLE_MASK;
 +const int user_fault  = 0;
 +const int fetch_fault = 0;
 +#else
 
 EPT has fetch permissions (but not user permissions); anyway
 translate_nested_gpa() already does this.
 
  const int write_fault = access  PFERR_WRITE_MASK;
  const int user_fault  = access  PFERR_USER_MASK;
  const int fetch_fault = access  PFERR_FETCH_MASK;
 +#endif
  u16 errcode = 0;
  
  trace_kvm_mmu_pagetable_walk(addr, write_fault, user_fault,
 @@ -174,6 +200,9 @@ retry_walk:
 (mmu-get_cr3(vcpu)  CR3_NONPAE_RESERVED_BITS) == 0);
  
  pt_access = ACC_ALL;
 +#if PTTYPE == EPT 
 +pt_access =  PT_PRESENT_MASK | EPT_WRITABLE_MASK | EPT_EXEC_MASK;
 +#endif
 
 pt_access is not in EPT or ia32 format - it's our own format (xwu).  So
 this doesn't need changing.  Updating gpte_access() is sufficient.
 
  
  for (;;) {
  gfn_t real_gfn;
 @@ -186,9 +215,14 @@ retry_walk:
  pte_gpa   = gfn_to_gpa(table_gfn) + offset;
  walker-table_gfn[walker-level - 1] = table_gfn;
  walker-pte_gpa[walker-level - 1] = pte_gpa;
 -
 +#if PTTYPE == EPT 
 +real_gfn = mmu-translate_gpa(vcpu, gfn_to_gpa(table_gfn),
 +  EPT_WRITABLE_MASK);
 +#else
  real_gfn = mmu-translate_gpa(vcpu, gfn_to_gpa(table_gfn),
PFERR_USER_MASK|PFERR_WRITE_MASK);
 +#endif
 +
 
 Unneeded, I think.

Is it because translate_nested_gpa always sets USER_MASK?

 
  if (unlikely(real_gfn == UNMAPPED_GVA))
  goto error;
  real_gfn = gpa_to_gfn(real_gfn);
 @@ -221,6 +255,7 @@ retry_walk:
  eperm = true;
  #endif
  
 +#if PTTYPE != EPT
  if (!eperm  unlikely(!(pte  PT_ACCESSED_MASK))) {
  int ret;
  trace_kvm_mmu_set_accessed_bit(table_gfn, index,
 @@ -235,7 +270,7 @@ retry_walk:
  mark_page_dirty(vcpu-kvm, table_gfn);
  pte |= PT_ACCESSED_MASK;
  }
 -
 +#endif
 
 If PT_ACCESSED_MASK is 0 for EPT

Re: [PATCH 04/10] nEPT: Fix page table format in nested EPT

2011-11-10 Thread Orit Wasserman
On 11/10/2011 11:59 AM, Nadav Har'El wrote:
 When the existing KVM MMU code creates a shadow page table, it assumes it
 has the normal x86 page table format. This is obviously correct for normal
 shadow page tables, and also correct for AMD's NPT.
 Unfortunately, Intel's EPT page tables differ in subtle ways from ordinary
 page tables, so when we create a shadow EPT table (i.e., in nested EPT),
 we need to slightly modify the way in which this table table is built.
 
 In particular, when mmu.c's link_shadow_page() creates non-leaf page table
 entries, it used to enable the present, accessed, writable and user
 flags on these entries. While this is correct for ordinary page tables, it
 is wrong in EPT tables - where these bits actually have completely different
 meaning (compare PT_*_MASK from mmu.h to VMX_EPT_*_MASK from vmx.h).
 In particular, leaving the code as-is causes bit 5 of the PTE to be turned on
 (supposedly for PT_ACCESSED_MASK), which is a reserved bit in EPT and causes
 an EPT Misconfiguration failure.
 
 So we must move link_shadow_page's list of extra bits to a new mmu context
 field, which is set differently for nested EPT.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
  arch/x86/include/asm/kvm_host.h |1 +
  arch/x86/kvm/mmu.c  |   16 +---
  arch/x86/kvm/paging_tmpl.h  |6 --
  arch/x86/kvm/vmx.c  |3 +++
  4 files changed, 17 insertions(+), 9 deletions(-)
 
 --- .before/arch/x86/include/asm/kvm_host.h   2011-11-10 11:33:59.0 
 +0200
 +++ .after/arch/x86/include/asm/kvm_host.h2011-11-10 11:33:59.0 
 +0200
 @@ -287,6 +287,7 @@ struct kvm_mmu {
   bool nx;
  
   u64 pdptrs[4]; /* pae */
 + u64 link_shadow_page_set_bits;
  };
  
  struct kvm_vcpu_arch {
 --- .before/arch/x86/kvm/vmx.c2011-11-10 11:33:59.0 +0200
 +++ .after/arch/x86/kvm/vmx.c 2011-11-10 11:33:59.0 +0200
 @@ -6485,6 +6485,9 @@ static int nested_ept_init_mmu_context(s
   vcpu-arch.mmu.get_pdptr = nested_ept_get_pdptr;
   vcpu-arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
   vcpu-arch.mmu.shadow_root_level = get_ept_level();
 + vcpu-arch.mmu.link_shadow_page_set_bits =
 + VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
 + VMX_EPT_EXECUTABLE_MASK;
  
   vcpu-arch.walk_mmu  = vcpu-arch.nested_mmu;
  
 --- .before/arch/x86/kvm/mmu.c2011-11-10 11:33:59.0 +0200
 +++ .after/arch/x86/kvm/mmu.c 2011-11-10 11:33:59.0 +0200
 @@ -1782,14 +1782,9 @@ static void shadow_walk_next(struct kvm_
   return __shadow_walk_next(iterator, *iterator-sptep);
  }
  
 -static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
 +static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, u64 
 set_bits)
  {
 - u64 spte;
 -
 - spte = __pa(sp-spt)
 - | PT_PRESENT_MASK | PT_ACCESSED_MASK
 - | PT_WRITABLE_MASK | PT_USER_MASK;
 - mmu_spte_set(sptep, spte);
 + mmu_spte_set(sptep, __pa(sp-spt) | set_bits);
  }
  
  static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
 @@ -3366,6 +3361,13 @@ int kvm_init_shadow_mmu(struct kvm_vcpu 
   vcpu-arch.mmu.base_role.cr0_wp  = is_write_protection(vcpu);
   vcpu-arch.mmu.base_role.smep_andnot_wp
   = smep  !is_write_protection(vcpu);
 + /*
 +  * link_shadow() should apply these bits in shadow page tables, and
 +  * in shadow NPT tables (nested NPT). For nested EPT, different bits
 +  * apply.
 +  */
 + vcpu-arch.mmu.link_shadow_page_set_bits = PT_PRESENT_MASK |
 + PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
  
   return r;
  }
 --- .before/arch/x86/kvm/paging_tmpl.h2011-11-10 11:33:59.0 
 +0200
 +++ .after/arch/x86/kvm/paging_tmpl.h 2011-11-10 11:33:59.0 +0200
 @@ -515,7 +515,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
   goto out_gpte_changed;
  
   if (sp)
 - link_shadow_page(it.sptep, sp);
 + link_shadow_page(it.sptep, sp,
 + vcpu-arch.mmu.link_shadow_page_set_bits);
   }
  
   for (;
 @@ -535,7 +536,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
  
   sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
 true, direct_access, it.sptep);
 - link_shadow_page(it.sptep, sp);
 + link_shadow_page(it.sptep, sp,
 + vcpu-arch.mmu.link_shadow_page_set_bits);
   }
  
   clear_sp_write_flooding_count(it.sptep);
We need to consider the permissions in L1's EPT table; what if an entry is
read-only?
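To make the concern concrete (a sketch of the concern only, not a proposed
fix; the helper name is hypothetical):

static u64 ept02_link_bits(u64 l1_ept_entry)
{
    u64 bits = VMX_EPT_READABLE_MASK | VMX_EPT_EXECUTABLE_MASK;

    if (l1_ept_entry & VMX_EPT_WRITABLE_MASK)   /* honour L1's write permission */
        bits |= VMX_EPT_WRITABLE_MASK;
    return bits;
}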



Re: [PATCH 5/5] Nested VMX patch 5 implements vmlaunch and vmresume

2009-10-28 Thread Orit Wasserman


Gleb Natapov g...@redhat.com wrote on 25/10/2009 11:44:31:

 From:

 Gleb Natapov g...@redhat.com

 To:

 Orit Wasserman/Haifa/i...@ibmil

 Cc:

 Abel Gordon/Haifa/i...@ibmil, aligu...@us.ibm.com, Ben-Ami Yassour1/
 Haifa/i...@ibmil, kvm@vger.kernel.org, md...@us.ibm.com, Muli Ben-
 Yehuda/Haifa/i...@ibmil

 Date:

 25/10/2009 11:44

 Subject:

 Re: [PATCH 5/5] Nested VMX patch 5 implements vmlaunch and vmresume

 On Thu, Oct 22, 2009 at 05:46:16PM +0200, Orit Wasserman wrote:
 
 
  Gleb Natapov g...@redhat.com wrote on 22/10/2009 11:04:58:
 
   From:
  
   Gleb Natapov g...@redhat.com
  
   To:
  
   Orit Wasserman/Haifa/i...@ibmil
  
   Cc:
  
   Abel Gordon/Haifa/i...@ibmil, aligu...@us.ibm.com, Ben-Ami Yassour1/
   Haifa/i...@ibmil, kvm@vger.kernel.org, md...@us.ibm.com, Muli Ben-
   Yehuda/Haifa/i...@ibmil
  
   Date:
  
   22/10/2009 11:05
  
   Subject:
  
   Re: [PATCH 5/5] Nested VMX patch 5 implements vmlaunch and vmresume
  
   On Wed, Oct 21, 2009 at 04:43:44PM +0200, Orit Wasserman wrote:
  @@ -4641,10 +4955,13 @@ static void vmx_complete_interrupts
(struct
 vcpu_vmx *vmx)
  int type;
  bool idtv_info_valid;
 
  -   exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
  -
  vmx-exit_reason = vmcs_read32(VM_EXIT_REASON);
 
  +   if (vmx-nested.nested_mode)
  +  return;
  +
 Why return here? What the function does that should not be done
in
 nested mode?
In nested mode L0 injects an interrupt to L2 only in one scenario:
if there is an IDT_VALID event and L0 decides to run L2 again and not to
switch back to L1.
In all other cases the injection is handled by L1.
   This is exactly the kind of scenario that is handled by
   vmx_complete_interrupts(). (vmx|svm)_complete_interrups() store
   pending event in arch agnostic way and re-injection is handled by
   x86.c You bypass this logic by inserting return here and introducing
   nested_handle_valid_idt() function below.
  The only location we can truly know if we are switching to L1 is in
  vmx_vcpu_run
  because enable_irq_window (that is called after handling the exit) can
  decide to
  switch to L1 because of an interrupt.
 enable_irq_window() will be called after L2 VMCS will be setup for event
 re-injection by previous call to inject_pending_event(). As far as I
 can see this should work for interrupt injection. For exception we
 should probably require l2 guest to re execute faulted instruction for
 now like svm does.
The main issue is that L0 doesn't inject events to L2; the L1 hypervisor does
(we want to keep the nested hypervisor semantics as much as possible).
Only if the event was caused by the fact that L2 is a nested guest and L1
can't handle it will L0 re-inject an event to L2, for example an IDT event
with a page fault that is caused by a missing entry in SPT02 (the shadow page
table L0 creates for L2).
In this case, when vmx_complete_interrupts is called, L0 doesn't know if the
page fault should be handled by it or by L1 (that is decided later, when
handling the exit).
In most other cases L0 will switch to L1, and L1 will decide if there will
be re-injection (depending on the L1 hypervisor logic) and update the L2 VMCS
accordingly.

  In order to simplify our code it was simpler to bypass
  vmx_complete_interrupts when it is called (after
  running L2) and to add nested_handle_valid_idt just before running L2.

  +   exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
  +
  /* Handle machine checks before interrupts are enabled */
  if ((vmx-exit_reason == EXIT_REASON_MCE_DURING_VMENTRY)
  || (vmx-exit_reason == EXIT_REASON_EXCEPTION_NMI
  @@ -4747,6 +5064,60 @@ static void fixup_rmode_irq(struct
vcpu_vmx
*vmx)
 | vmx-rmode.irq.vector;
   }
 
  +static int nested_handle_valid_idt(struct kvm_vcpu *vcpu)
  +{
 It seems by this function you are trying to bypass general event
 reinjection logic. Why?
See above.
   The logic implemented by this function is handled in x86.c in arch
   agnostic way. Is there something wrong with this?
  See my comment before
 Sometimes it is wrong to reinject events from L0 to L2 directly. If L2
 was not able to handle event because its IDT is not mapped by L1 shadow
 page table we should generate PF vmexit with valid idt vectoring info to
 L1 and let L1 handle event reinjection.

 --
  Gleb.



Re: [PATCH 1/5] Nested VMX patch 1 implements vmon and vmoff

2009-10-22 Thread Orit Wasserman


Avi Kivity a...@redhat.com wrote on 20/10/2009 06:00:50:

 From:

 Avi Kivity a...@redhat.com

 To:

 Orit Wasserman/Haifa/i...@ibmil

 Cc:

 kvm@vger.kernel.org, Ben-Ami Yassour1/Haifa/i...@ibmil, Abel Gordon/
 Haifa/i...@ibmil, Muli Ben-Yehuda/Haifa/i...@ibmil,
 aligu...@us.ibm.com, md...@us.ibm.com

 Date:

 20/10/2009 06:02

 Subject:

 Re: [PATCH 1/5] Nested VMX patch 1 implements vmon and vmoff

 On 10/15/2009 11:41 PM, or...@il.ibm.com wrote:
 
/*
  + * Handles msr read for nested virtualization
  + */
  +static int nested_vmx_get_msr(struct kvm_vcpu *vcpu, u32 msr_index,
  +   u64 *pdata)
  +{
  +   u64 vmx_msr = 0;
  +
  +   switch (msr_index) {
  +   case MSR_IA32_FEATURE_CONTROL:
  +  *pdata = 0;
  +  break;
  +   case MSR_IA32_VMX_BASIC:
  +  *pdata = 0;
  +  rdmsrl(MSR_IA32_VMX_BASIC, vmx_msr);
  +  *pdata = (vmx_msr  0x00cf);
  +  break;
  +
 

 This (and the rest of the msrs) must be controllable from userspace.
 Otherwise a live migration from a newer host to an older host would
break.
OK.

 
/*
  + * Writes msr value for nested virtualization
  + * Returns 0 on success, non-0 otherwise.
  + */
  +static int nested_vmx_set_msr(struct kvm_vcpu *vcpu, u32
 msr_index, u64 data)
  +{
  +   switch (msr_index) {
  +   case MSR_IA32_FEATURE_CONTROL:
  +  if ((data  (FEATURE_CONTROL_LOCKED |
  +  FEATURE_CONTROL_VMXON_ENABLED))
  +  != (FEATURE_CONTROL_LOCKED |
  + FEATURE_CONTROL_VMXON_ENABLED))
  + return 1;
  +  break;
  +   default:
  +  return 1;
  +   }
  +
  +   return 0;
  +}
  +
 

 Need to export this msr to userspace for live migration.  See
 msrs_to_save[].
OK.

 
  +/*
  + * Check to see if vcpu can execute vmx command
  + * Inject the corrseponding exception
  + */
  +static int nested_vmx_check_permission(struct kvm_vcpu *vcpu)
  +{
  +   struct kvm_segment cs;
  +   struct vcpu_vmx *vmx = to_vmx(vcpu);
  +   struct kvm_msr_entry *msr;
  +
  +   vmx_get_segment(vcpu,cs, VCPU_SREG_CS);
  +
  +   if (!vmx-nested.vmxon) {
  +  printk(KERN_DEBUG %s: vmx not on\n, __func__);
 

 pr_debug
I will change it.

  +  kvm_queue_exception(vcpu, UD_VECTOR);
  +  return 0;
  +   }
  +
  +   msr = find_msr_entry(vmx, MSR_EFER);
  +
  +   if ((vmx_get_rflags(vcpu)  X86_EFLAGS_VM) ||
  +   ((msr-data  EFER_LMA)  !cs.l)) {
 

 is_long_mode()
I will change it.

static int handle_vmx_insn(struct kvm_vcpu *vcpu)
{
   kvm_queue_exception(vcpu, UD_VECTOR);
   return 1;
}
 
  +static int handle_vmoff(struct kvm_vcpu *vcpu)
  +{
  +   struct vcpu_vmx *vmx = to_vmx(vcpu);
  +
  +   if (!nested_vmx_check_permission(vcpu))
  +  return 1;
  +
  +   vmx-nested.vmxon = 0;
  +
  +   skip_emulated_instruction(vcpu);
  +   return 1;
  +}
  +
  +static int handle_vmon(struct kvm_vcpu *vcpu)
  +{
  +   struct kvm_segment cs;
  +   struct vcpu_vmx *vmx = to_vmx(vcpu);
  +
  +   if (!nested) {
  +  printk(KERN_DEBUG %s: nested vmx not enabled\n, __func__);
  +  kvm_queue_exception(vcpu, UD_VECTOR);
  +  return 1;
  +   }
  +
  +   vmx_get_segment(vcpu,cs, VCPU_SREG_CS);
  +
  +   if (!(vcpu-arch.cr4  X86_CR4_VMXE) ||
  +   !(vcpu-arch.cr0  X86_CR0_PE) ||
  +   (vmx_get_rflags(vcpu)  X86_EFLAGS_VM)) {
  +  kvm_queue_exception(vcpu, UD_VECTOR);
  +  printk(KERN_INFO %s invalid register state\n, __func__);
  +  return 1;
  +   }
  +#ifdef CONFIG_X86_64
  +   if (((find_msr_entry(to_vmx(vcpu),
  +  MSR_EFER)-data  EFER_LMA)  !cs.l)) {
 

 is_long_mode(), and you can avoid the #ifdef.
I will change it.


 VMXON is supposed to block INIT, please add that (in a separate patch).
I will add it.

 --
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.




Re: [PATCH 3/5] Nested VMX patch 3 implements vmptrld and vmptrst

2009-10-22 Thread Orit Wasserman


Avi Kivity a...@redhat.com wrote on 20/10/2009 06:24:33:

 From:

 Avi Kivity a...@redhat.com

 To:

 Orit Wasserman/Haifa/i...@ibmil

 Cc:

 kvm@vger.kernel.org, Ben-Ami Yassour1/Haifa/i...@ibmil, Abel Gordon/
 Haifa/i...@ibmil, Muli Ben-Yehuda/Haifa/i...@ibmil,
 aligu...@us.ibm.com, md...@us.ibm.com

 Date:

 20/10/2009 06:24

 Subject:

 Re: [PATCH 3/5] Nested VMX patch 3 implements vmptrld and vmptrst

 On 10/15/2009 11:41 PM, or...@il.ibm.com wrote:
 
  +
  +struct __attribute__ ((__packed__)) shadow_vmcs {
 

 Since this is in guest memory, we need it packed so the binary format is
 preserved across migration.  Please add a comment so it isn't changed
 (at least without changing the revision_id).
I will add a comment.

 vmclear state should be here, that will help multiguest support.
Working on it ...

 
struct nested_vmx {
   /* Has the level1 guest done vmxon? */
   bool vmxon;
  -
  +   /* What is the location of the  vmcs l1 keeps for l2? (in level1
gpa) */
  +   u64 vmptr;
 

 Need to expose it for live migration.
I will look into it.

   /*
* Level 2 state : includes vmcs,registers and
* a copy of vmcs12 for vmread/vmwrite
*/
   struct level_state *l2_state;
  +   /* Level 1 state for switching to level 2 and back */
  +   struct level_state *l1_state;
 

 This creates a ton of duplication.

 Some of the data is completely unnecessary, for example we can
 recalculate cr0 from HOST_CR0 and GUEST_CR0.
I will look into it .

  +
  +static int vmptrld(struct kvm_vcpu *vcpu,
  + u64 phys_addr)
  +{
  +   u8 error;
  +
  +   asm volatile (__ex(ASM_VMX_VMPTRLD_RAX) ; setna %0
  +: =g(error) : a(phys_addr), m(phys_addr)
  +: cc);
  +   if (error) {
  +  printk(KERN_ERR kvm: %s vmptrld %llx failed\n,
  + __func__, phys_addr);
  +  return 1;
  +   }
  +
  +   return 0;
  +}
  +
/*
 * Switches to specified vcpu, until a matching vcpu_put(), but
assumes
 * vcpu mutex is already taken.
  @@ -736,15 +923,8 @@ static void vmx_vcpu_load(struct kvm_vcpu
 *vcpu, int cpu)
   }
 
   if (per_cpu(current_vmcs, cpu) != vmx-vmcs) {
  -  u8 error;
  -
  per_cpu(current_vmcs, cpu) = vmx-vmcs;
  -  asm volatile (__ex(ASM_VMX_VMPTRLD_RAX) ; setna %0
  -   : =g(error) : a(phys_addr), m(phys_addr)
  -   : cc);
  -  if (error)
  - printk(KERN_ERR kvm: vmptrld %p/%llx fail\n,
  -vmx-vmcs, phys_addr);
  +  vmptrld(vcpu, phys_addr);
   }
 

 This part of the patch is no longer needed.
I will remove it.
  +   if (cpu_has_vmx_msr_bitmap())
  +  vmx-nested.l2_state-msr_bitmap = vmcs_read64(MSR_BITMAP);
  +   else
  +  vmx-nested.l2_state-msr_bitmap = 0;
  +
  +   vmx-nested.l2_state-io_bitmap_a = vmcs_read64(IO_BITMAP_A);
  +   vmx-nested.l2_state-io_bitmap_b = vmcs_read64(IO_BITMAP_B);
  +
 

 This no longer works, since we don't load the guest vmcs.
I will look into it.

  +int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
  + struct kvm_vcpu *vcpu);
 

 Isn't this in a header somewhere?
I will move it into the header.

  +
  +int read_guest_vmcs_gpa(struct kvm_vcpu *vcpu, u64 *gentry)
  +{
  +
  +   int r = 0;
  +
  +   r = kvm_read_guest_virt(vcpu-arch.regs[VCPU_REGS_RAX], gentry,
  +sizeof(u64), vcpu);
  +   if (r) {
  +  printk(KERN_ERR %s cannot read guest vmcs addr %lx : %d\n,
  + __func__, vcpu-arch.regs[VCPU_REGS_RAX], r);
  +  return r;
  +   }
  +
  +   if (!IS_ALIGNED(*gentry, PAGE_SIZE)) {
  +  printk(KERN_DEBUG %s addr %llx not aligned\n,
  + __func__, *gentry);
  +  return 1;
  +   }
  +
  +   return 0;
  +}
  +
 

 Should go through the emulator to evaluate arguments.
OK.

  +static int handle_vmptrld(struct kvm_vcpu *vcpu)
  +{
  +   struct vcpu_vmx *vmx = to_vmx(vcpu);
  +   struct page *vmcs_page;
  +   u64 guest_vmcs_addr;
  +
  +   if (!nested_vmx_check_permission(vcpu))
  +  return 1;
  +
  +   if (read_guest_vmcs_gpa(vcpu,guest_vmcs_addr))
  +  return 1;
  +
  +   if (create_l1_state(vcpu)) {
  +  printk(KERN_ERR %s create_l1_state failed\n, __func__);
  +  return 1;
  +   }
  +
  +   if (create_l2_state(vcpu)) {
  +  printk(KERN_ERR %s create_l2_state failed\n, __func__);
  +  return 1;
  +   }
 

 return errors here, so we see the problem.
OK.

  +
  +static int handle_vmptrst(struct kvm_vcpu *vcpu)
  +{
  +   int r = 0;
  +
  +   if (!nested_vmx_check_permission(vcpu))
  +  return 1;
  +
  +   r = kvm_write_guest_virt(vcpu-arch.regs[VCPU_REGS_RAX],
  + (void *)to_vmx(vcpu)-nested.vmptr,
  + sizeof(u64), vcpu);
 

 Emulator again.
OK.

  +void save_vmcs(struct shadow_vmcs *dst)
  +{
 
  +   dst-io_bitmap_a = vmcs_read64(IO_BITMAP_A);
  +   dst-io_bitmap_b = vmcs_read64(IO_BITMAP_B);
 

 These (and many others) can never change due to a nested guest running,
 so no need to save

Re: [PATCH 4/5] Nested VMX patch 4 implements vmread and vmwrite

2009-10-22 Thread Orit Wasserman


Avi Kivity a...@redhat.com wrote on 20/10/2009 06:44:41:

 From:

 Avi Kivity a...@redhat.com

 To:

 Orit Wasserman/Haifa/i...@ibmil

 Cc:

 kvm@vger.kernel.org, Ben-Ami Yassour1/Haifa/i...@ibmil, Abel Gordon/
 Haifa/i...@ibmil, Muli Ben-Yehuda/Haifa/i...@ibmil,
 aligu...@us.ibm.com, md...@us.ibm.com

 Date:

 20/10/2009 06:44

 Subject:

 Re: [PATCH 4/5] Nested VMX patch 4 implements vmread and vmwrite

 On 10/15/2009 11:41 PM, or...@il.ibm.com wrote:
 
  +static int nested_map_shadow_vmcs(struct kvm_vcpu *vcpu)
  +{
  +   struct vcpu_vmx *vmx = to_vmx(vcpu);
  +   struct page *vmcs_page = nested_get_page(vcpu, vmx-nested.vmptr);
  +
  +   if (vmcs_page == NULL) {
  +  printk(KERN_INFO %s: failure in nested_get_page\n,__func__);
  +  return 0;
  +   }
  +
  +   if (vmx-nested.l2_state-shadow_vmcs) {
  +  printk(KERN_INFO %s: shadow vmcs already mapped\n,__func__);
  +  return 0;
  +   }
  +
 

 Consider dropping shadow_vmcs from l2_state and just passing it
 everywhere.  Less convenient but safer.
I will think about it, it is called from many places ...

  +   vmx-nested.l2_state-shadow_vmcs = kmap_atomic(vmcs_page,
KM_USER0);
  +
  +   if (!vmx-nested.l2_state-shadow_vmcs) {
  +  printk(KERN_INFO %s: error in kmap_atomic\n,__func__);
  +  return 0;
  +   }
 

 kmap_atomic() can't fail.
I will remove the check.
 
  +static int handle_vmread(struct kvm_vcpu *vcpu)
  +{
  +#ifndef CONFIG_X86_64
  +   u64 value;
  +#endif
  +
  +   if (!nested_vmx_check_permission(vcpu))
  +  return 1;
  +
  +   if (!nested_map_shadow_vmcs(vcpu)) {
  +  printk(KERN_INFO %s invalid shadow vmcs\n, __func__);
  +  set_rflags_to_vmx_fail_invalid(vcpu);
  +  return 1;
  +   }
 

 return an error.
OK.

  +
  +   switch (vmcs_field_length(vcpu-arch.regs[VCPU_REGS_RDX])) {
  +   case VMCS_FIELD_TYPE_U16:
  +  vcpu-arch.regs[VCPU_REGS_RAX] =
  + nested_vmcs_read16(vcpu,
  +  vcpu-arch.regs[VCPU_REGS_RDX]);
  +  break;
 

 Use the emulator to decode operands.
OK.


 --
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.




Re: [PATCH 5/5] Nested VMX patch 5 implements vmlaunch and vmresume

2009-10-22 Thread Orit Wasserman


Avi Kivity a...@redhat.com wrote on 20/10/2009 06:56:39:

 From:

 Avi Kivity a...@redhat.com

 To:

 Orit Wasserman/Haifa/i...@ibmil

 Cc:

 kvm@vger.kernel.org, Ben-Ami Yassour1/Haifa/i...@ibmil, Abel Gordon/
 Haifa/i...@ibmil, Muli Ben-Yehuda/Haifa/i...@ibmil,
 aligu...@us.ibm.com, md...@us.ibm.com

 Date:

 20/10/2009 06:56

 Subject:

 Re: [PATCH 5/5] Nested VMX patch 5 implements vmlaunch and vmresume

 On 10/15/2009 11:41 PM, or...@il.ibm.com wrote:
  From: Orit Wassermanor...@il.ibm.com
 
  ---
arch/x86/kvm/vmx.c | 1173 ++
 --
1 files changed, 1148 insertions(+), 25 deletions(-)
 
  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
  index 6a4c252..e814029 100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -209,6 +209,7 @@ struct __attribute__ ((__packed__)) level_state {
   struct vmcs *vmcs;
   int cpu;
   int launched;
  +   bool first_launch;
};
 
struct nested_vmx {
  @@ -216,6 +217,12 @@ struct nested_vmx {
   bool vmxon;
   /* What is the location of the  vmcs l1 keeps for l2? (in
 level1 gpa) */
   u64 vmptr;
  +   /* Are we running nested guest */
  +   bool nested_mode;
  +   /* L1 requested VMLAUNCH or VMRESUME but we didn't run L2 yet */
  +   bool nested_run_pending;
  +   /* flag indicating if there was a valid IDT after exiting from l2
*/
  +   bool nested_valid_idt;
 

 Did you mean valid_idt_vectoring_info?
yes.

 No need to prefix everything with nested_ inside nested_vmx.
OK.

  +void prepare_vmcs_12(struct kvm_vcpu *vcpu)
  +{
  +   struct shadow_vmcs *l2_shadow_vmcs =
  +  get_shadow_vmcs(vcpu);
  +
  +   l2_shadow_vmcs->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
  +   l2_shadow_vmcs->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
  +   l2_shadow_vmcs->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
  +   l2_shadow_vmcs->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
  +   l2_shadow_vmcs->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
  +   l2_shadow_vmcs->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
  +   l2_shadow_vmcs->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
  +   l2_shadow_vmcs->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
  +
  +   l2_shadow_vmcs->tsc_offset = vmcs_read64(TSC_OFFSET);
  +   l2_shadow_vmcs->guest_physical_address =
  +  vmcs_read64(GUEST_PHYSICAL_ADDRESS);
  +   l2_shadow_vmcs->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
 

 Physical addresses need translation,  no?
If you are referring to GUEST_PHYSICAL_ADDRESS, then there is no need for
translation for L1.
It needs to stay an L2 physical address.

  +   l2_shadow_vmcs->guest_cr0 = vmcs_readl(GUEST_CR0);
  +
  +   l2_shadow_vmcs->guest_cr4 = vmcs_readl(GUEST_CR4);
 

 We don't allow the guest to modify these, so no need to read them.  If
 you do, you need to remove the bits that we modify.
You are correct, for example CR0.TS; it will be fixed in the next set of
patches.
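For the bits L0 owns, the value L1 should see has to be reconstructed from the CR0 read shadow rather than copied from GUEST_CR0 verbatim. A small sketch of that VMX semantic (not code from the patch):

    static unsigned long nested_guest_cr0(void)
    {
        unsigned long cr0    = vmcs_readl(GUEST_CR0);
        unsigned long mask   = vmcs_readl(CR0_GUEST_HOST_MASK);
        unsigned long shadow = vmcs_readl(CR0_READ_SHADOW);

        /* host-owned bits come from the read shadow, the rest from hardware */
        return (cr0 & ~mask) | (shadow & mask);
    }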

  +
  +int load_vmcs_common(struct shadow_vmcs *src)
  +{
  +
  +   vmcs_write64(VMCS_LINK_POINTER, src->vmcs_link_pointer);
 

 Why load this?
At the moment it is not used , maybe in the future.
I can add a check to see if it was changed.

  +   vmcs_write64(GUEST_IA32_DEBUGCTL, src->guest_ia32_debugctl);
 

 I think some features there are dangerous.
I will look into it.

  +   vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, src->vm_entry_msr_load_count);
 

 Need to verify?  Also need to validate the loaded MSRs and run them
 through kvm_set_msr() instead of letting the cpu do it.
I will add the checks.
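A rough sketch of what running the VM-entry MSR-load list through kvm_set_msr() could look like, using the shadow_vmcs fields from the patch. The 16-byte entry layout (index, reserved, value) follows the VMX spec, the local struct name is made up for illustration, and error handling is simplified:

    static int nested_load_entry_msrs(struct kvm_vcpu *vcpu,
                                      struct shadow_vmcs *src)
    {
        struct nested_msr_entry { u32 index; u32 reserved; u64 value; } e;
        u32 i;

        for (i = 0; i < src->vm_entry_msr_load_count; i++) {
            gpa_t gpa = src->vm_entry_msr_load_addr + i * sizeof(e);

            if (kvm_read_guest(vcpu->kvm, gpa, &e, sizeof(e)))
                return -EFAULT;
            /* reject or sanitize MSRs we do not allow L1 to load */
            if (kvm_set_msr(vcpu, e.index, e.value))
                return -EINVAL;
        }
        return 0;
    }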

 --
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.




Re: [PATCH 5/5] Nested VMX patch 5 implements vmlaunch and vmresume

2009-10-22 Thread Orit Wasserman


Gleb Natapov g...@redhat.com wrote on 22/10/2009 11:04:58:

 On Wed, Oct 21, 2009 at 04:43:44PM +0200, Orit Wasserman wrote:
@@ -4641,10 +4955,13 @@ static void vmx_complete_interrupts(struct
   vcpu_vmx *vmx)
int type;
bool idtv_info_valid;
   
-   exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
-
vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);

+   if (vmx->nested.nested_mode)
+  return;
+
   Why return here? What the function does that should not be done in
   nested mode?
  In nested mode L0 injects an interrupt to L2 only in one scenario,
  if there is an IDT_VALID event and L0 decides to run L2 again and not
to
  switch back to L1.
  In all other cases the injection is handled by L1.
 This is exactly the kind of scenario that is handled by
 vmx_complete_interrupts(). (vmx|svm)_complete_interrups() store
 pending event in arch agnostic way and re-injection is handled by
 x86.c You bypass this logic by inserting return here and introducing
 nested_handle_valid_idt() function below.
The only location we can truly know if we are switching to L1 is in
vmx_vcpu_run
because enable_irq_window (that is called after handling the exit) can
decide to
switch to L1 because of an interrupt.
In order to simplify our code it was simpler to bypass
vmx_complete_interrupts when it is called (after
running L2) and to add nested_handle_valid_idt just before running L2.
  
+   exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+
/* Handle machine checks before interrupts are enabled */
if ((vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY)
|| (vmx->exit_reason == EXIT_REASON_EXCEPTION_NMI
@@ -4747,6 +5064,60 @@ static void fixup_rmode_irq(struct vcpu_vmx
  *vmx)
| vmx->rmode.irq.vector;
 }
   
+static int nested_handle_valid_idt(struct kvm_vcpu *vcpu)
+{
   It seems by this function you are trying to bypass general event
   reinjection logic. Why?
  See above.
 The logic implemented by this function is handled in x86.c in arch
 agnostic way. Is there something wrong with this?
See my comment before

+   vmx->launched = vmx->nested.l2_state->launched;
+
   Can you explain why -launched logic is needed?
  It is possible L1 called vmlaunch but we didn't actually run L2 (for
  example there was an interrupt and
  enable_irq_window switched back to L1 before running L2). L1 thinks the
  vmlaunch was successful and call vmresume in the next time
  but KVM needs to call vmlaunch for L2.
 handle_vmlauch() and handle_vmresume() are exactly the same. Why KVM
needs
 to run one and not the other?
Yes, they are very similar (almost the same code); the only difference is the
check of vmclear.
We need to emulate the VMX hardware behavior for those two instructions and
check the VMCS12 state.
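One way to express that is a single entry helper that only differs in the launch-state check, roughly mirroring the hardware rule (VMLAUNCH needs a clear VMCS, VMRESUME a launched one). This is only a sketch; nested_vmx_enter() stands in for the common entry path and is hypothetical, the other names come from the patch:

    static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
    {
        struct vcpu_vmx *vmx = to_vmx(vcpu);

        if (!nested_vmx_check_permission(vcpu))
            return 1;

        /* VMLAUNCH on a launched VMCS, or VMRESUME on a clear one, must fail */
        if (launch == !!vmx->nested.l2_state->launched) {
            set_rflags_to_vmx_fail_valid(vcpu);
            skip_emulated_instruction(vcpu);
            return 1;
        }

        vmx->nested.nested_run_pending = 1;
        return nested_vmx_enter(vcpu);    /* hypothetical common entry path */
    }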

+static int nested_vmx_vmexit(struct kvm_vcpu *vcpu,
+  bool is_interrupt)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   int initial_pfu_active = vcpu->fpu_active;
+
+   if (!vmx->nested.nested_mode) {
+  printk(KERN_INFO "WARNING: %s called but not in nested mode\n",
+ __func__);
+  return 0;
+   }
+
+   save_msrs(vmx->guest_msrs, vmx->save_nmsrs);
+
+   sync_cached_regs_to_vmcs(vcpu);
+
+   if (!nested_map_shadow_vmcs(vcpu)) {
+  printk(KERN_INFO "Error mapping shadow vmcs\n");
+  set_rflags_to_vmx_fail_valid(vcpu);
   Error during vmexit should set abort flag, not change flags.
  I think this is more a vmlaunch/vmresume error (in the code that switch
  back to L1).
 How is this vmlaunch/vmresume error? This function is called to exit
 from L2 guest while on L2 vcms. It is called asynchronously in respect
 to L2 guest and you modify L2 guest rflags register at unpredictable
 place here.
OK.

 --
  Gleb.



Re: [PATCH 3/5] Nested VMX patch 3 implements vmptrld and vmptrst

2009-10-21 Thread Orit Wasserman


Gleb Natapov g...@redhat.com wrote on 19/10/2009 13:17:41:

 On Thu, Oct 15, 2009 at 04:41:44PM +0200, or...@il.ibm.com wrote:
  From: Orit Wasserman or...@il.ibm.com
 
  ---
   arch/x86/kvm/vmx.c |  468 +++
 +++--
   arch/x86/kvm/x86.c |3 +-
   2 files changed, 459 insertions(+), 12 deletions(-)
 
  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
  index 411cbdb..8c186e0 100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -61,20 +61,168 @@ module_param_named(unrestricted_guest,
   static int __read_mostly emulate_invalid_guest_state = 0;
   module_param(emulate_invalid_guest_state, bool, S_IRUGO);
 
  +
  +struct __attribute__ ((__packed__)) shadow_vmcs {
  +   u32 revision_id;
  +   u32 abort;
  +   u16 virtual_processor_id;
  +   u16 guest_es_selector;
  +   u16 guest_cs_selector;
  +   u16 guest_ss_selector;
  +   u16 guest_ds_selector;
  +   u16 guest_fs_selector;
  +   u16 guest_gs_selector;
  +   u16 guest_ldtr_selector;
  +   u16 guest_tr_selector;
  +   u16 host_es_selector;
  +   u16 host_cs_selector;
  +   u16 host_ss_selector;
  +   u16 host_ds_selector;
  +   u16 host_fs_selector;
  +   u16 host_gs_selector;
  +   u16 host_tr_selector;
  +   u64 io_bitmap_a;
  +   u64 io_bitmap_b;
  +   u64 msr_bitmap;
  +   u64 vm_exit_msr_store_addr;
  +   u64 vm_exit_msr_load_addr;
  +   u64 vm_entry_msr_load_addr;
  +   u64 tsc_offset;
  +   u64 virtual_apic_page_addr;
  +   u64 apic_access_addr;
  +   u64 ept_pointer;
  +   u64 guest_physical_address;
  +   u64 vmcs_link_pointer;
  +   u64 guest_ia32_debugctl;
  +   u64 guest_ia32_pat;
  +   u64 guest_pdptr0;
  +   u64 guest_pdptr1;
  +   u64 guest_pdptr2;
  +   u64 guest_pdptr3;
  +   u64 host_ia32_pat;
  +   u32 pin_based_vm_exec_control;
  +   u32 cpu_based_vm_exec_control;
  +   u32 exception_bitmap;
  +   u32 page_fault_error_code_mask;
  +   u32 page_fault_error_code_match;
  +   u32 cr3_target_count;
  +   u32 vm_exit_controls;
  +   u32 vm_exit_msr_store_count;
  +   u32 vm_exit_msr_load_count;
  +   u32 vm_entry_controls;
  +   u32 vm_entry_msr_load_count;
  +   u32 vm_entry_intr_info_field;
  +   u32 vm_entry_exception_error_code;
  +   u32 vm_entry_instruction_len;
  +   u32 tpr_threshold;
  +   u32 secondary_vm_exec_control;
  +   u32 vm_instruction_error;
  +   u32 vm_exit_reason;
  +   u32 vm_exit_intr_info;
  +   u32 vm_exit_intr_error_code;
  +   u32 idt_vectoring_info_field;
  +   u32 idt_vectoring_error_code;
  +   u32 vm_exit_instruction_len;
  +   u32 vmx_instruction_info;
  +   u32 guest_es_limit;
  +   u32 guest_cs_limit;
  +   u32 guest_ss_limit;
  +   u32 guest_ds_limit;
  +   u32 guest_fs_limit;
  +   u32 guest_gs_limit;
  +   u32 guest_ldtr_limit;
  +   u32 guest_tr_limit;
  +   u32 guest_gdtr_limit;
  +   u32 guest_idtr_limit;
  +   u32 guest_es_ar_bytes;
  +   u32 guest_cs_ar_bytes;
  +   u32 guest_ss_ar_bytes;
  +   u32 guest_ds_ar_bytes;
  +   u32 guest_fs_ar_bytes;
  +   u32 guest_gs_ar_bytes;
  +   u32 guest_ldtr_ar_bytes;
  +   u32 guest_tr_ar_bytes;
  +   u32 guest_interruptibility_info;
  +   u32 guest_activity_state;
  +   u32 guest_sysenter_cs;
  +   u32 host_ia32_sysenter_cs;
  +   unsigned long cr0_guest_host_mask;
  +   unsigned long cr4_guest_host_mask;
  +   unsigned long cr0_read_shadow;
  +   unsigned long cr4_read_shadow;
  +   unsigned long cr3_target_value0;
  +   unsigned long cr3_target_value1;
  +   unsigned long cr3_target_value2;
  +   unsigned long cr3_target_value3;
  +   unsigned long exit_qualification;
  +   unsigned long guest_linear_address;
  +   unsigned long guest_cr0;
  +   unsigned long guest_cr3;
  +   unsigned long guest_cr4;
  +   unsigned long guest_es_base;
  +   unsigned long guest_cs_base;
  +   unsigned long guest_ss_base;
  +   unsigned long guest_ds_base;
  +   unsigned long guest_fs_base;
  +   unsigned long guest_gs_base;
  +   unsigned long guest_ldtr_base;
  +   unsigned long guest_tr_base;
  +   unsigned long guest_gdtr_base;
  +   unsigned long guest_idtr_base;
  +   unsigned long guest_dr7;
  +   unsigned long guest_rsp;
  +   unsigned long guest_rip;
  +   unsigned long guest_rflags;
  +   unsigned long guest_pending_dbg_exceptions;
  +   unsigned long guest_sysenter_esp;
  +   unsigned long guest_sysenter_eip;
  +   unsigned long host_cr0;
  +   unsigned long host_cr3;
  +   unsigned long host_cr4;
  +   unsigned long host_fs_base;
  +   unsigned long host_gs_base;
  +   unsigned long host_tr_base;
  +   unsigned long host_gdtr_base;
  +   unsigned long host_idtr_base;
  +   unsigned long host_ia32_sysenter_esp;
  +   unsigned long host_ia32_sysenter_eip

Re: [PATCH 3/5] Nested VMX patch 3 implements vmptrld and vmptrst

2009-10-21 Thread Orit Wasserman


Gleb Natapov g...@redhat.com wrote on 19/10/2009 14:59:53:

 On Thu, Oct 15, 2009 at 04:41:44PM +0200, or...@il.ibm.com wrote:
  +static struct page *nested_get_page(struct kvm_vcpu *vcpu,
  +u64 vmcs_addr)
  +{
  +   struct page *vmcs_page = NULL;
  +
  +   down_read(&current->mm->mmap_sem);
  +   vmcs_page = gfn_to_page(vcpu->kvm, vmcs_addr >> PAGE_SHIFT);
  +   up_read(&current->mm->mmap_sem);
 Why are you taking mmap_sem here? gup_fast() takes it if required.
I will remove it.

  +
  +   if (is_error_page(vmcs_page)) {
  +  printk(KERN_ERR "%s error allocating page \n", __func__);
  +  kvm_release_page_clean(vmcs_page);
  +  return NULL;
  +   }
  +
  +   return vmcs_page;
  +
  +}
  +
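With the locking dropped as agreed, the helper could shrink to something like the sketch below (gfn_to_page() already does whatever locking it needs); a sketch only, not the final patch:

    static struct page *nested_get_page(struct kvm_vcpu *vcpu, u64 vmcs_addr)
    {
        struct page *vmcs_page =
            gfn_to_page(vcpu->kvm, vmcs_addr >> PAGE_SHIFT);

        if (is_error_page(vmcs_page)) {
            printk(KERN_ERR "%s error allocating page\n", __func__);
            kvm_release_page_clean(vmcs_page);
            return NULL;
        }
        return vmcs_page;
    }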

 --
  Gleb.



Re: [PATCH 4/5] Nested VMX patch 4 implements vmread and vmwrite

2009-10-21 Thread Orit Wasserman


Gleb Natapov g...@redhat.com wrote on 19/10/2009 15:17:20:

 On Thu, Oct 15, 2009 at 04:41:45PM +0200, or...@il.ibm.com wrote:
  From: Orit Wasserman or...@il.ibm.com
 
  ---
   arch/x86/kvm/vmx.c |  591 +++
 -
   1 files changed, 589 insertions(+), 2 deletions(-)
 
  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
  index 8c186e0..6a4c252 100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -225,6 +225,21 @@ struct nested_vmx {
  struct level_state *l1_state;
   };
 
  +enum vmcs_field_type {
  +   VMCS_FIELD_TYPE_U16 = 0,
  +   VMCS_FIELD_TYPE_U64 = 1,
  +   VMCS_FIELD_TYPE_U32 = 2,
  +   VMCS_FIELD_TYPE_ULONG = 3
  +};
  +
  +#define VMCS_FIELD_LENGTH_OFFSET 13
  +#define VMCS_FIELD_LENGTH_MASK 0x6000
  +
  +static inline int vmcs_field_length(unsigned long field)
  +{
  +   return (VMCS_FIELD_LENGTH_MASK & field) >> 13;
  +}
  +
   struct vmcs {
  u32 revision_id;
  u32 abort;
  @@ -288,6 +303,404 @@ static inline struct vcpu_vmx *to_vmx(struct
 kvm_vcpu *vcpu)
  return container_of(vcpu, struct vcpu_vmx, vcpu);
   }
 
  +#define SHADOW_VMCS_OFFSET(x) offsetof(struct shadow_vmcs, x)
  +
  +static unsigned short vmcs_field_to_offset_table[HOST_RIP+1] = {
  +
  +   [VIRTUAL_PROCESSOR_ID] =
  +  SHADOW_VMCS_OFFSET(virtual_processor_id),
  +   [GUEST_ES_SELECTOR] =
  +  SHADOW_VMCS_OFFSET(guest_es_selector),
  +   [GUEST_CS_SELECTOR] =
  +  SHADOW_VMCS_OFFSET(guest_cs_selector),
  +   [GUEST_SS_SELECTOR] =
  +  SHADOW_VMCS_OFFSET(guest_ss_selector),
  +   [GUEST_DS_SELECTOR] =
  +  SHADOW_VMCS_OFFSET(guest_ds_selector),
  +   [GUEST_FS_SELECTOR] =
  +  SHADOW_VMCS_OFFSET(guest_fs_selector),
  +   [GUEST_GS_SELECTOR] =
  +  SHADOW_VMCS_OFFSET(guest_gs_selector),
  +   [GUEST_LDTR_SELECTOR] =
  +  SHADOW_VMCS_OFFSET(guest_ldtr_selector),
  +   [GUEST_TR_SELECTOR] =
  +  SHADOW_VMCS_OFFSET(guest_tr_selector),
  +   [HOST_ES_SELECTOR] =
  +  SHADOW_VMCS_OFFSET(host_es_selector),
  +   [HOST_CS_SELECTOR] =
  +  SHADOW_VMCS_OFFSET(host_cs_selector),
  +   [HOST_SS_SELECTOR] =
  +  SHADOW_VMCS_OFFSET(host_ss_selector),
  +   [HOST_DS_SELECTOR] =
  +  SHADOW_VMCS_OFFSET(host_ds_selector),
  +   [HOST_FS_SELECTOR] =
  +  SHADOW_VMCS_OFFSET(host_fs_selector),
  +   [HOST_GS_SELECTOR] =
  +  SHADOW_VMCS_OFFSET(host_gs_selector),
  +   [HOST_TR_SELECTOR] =
  +  SHADOW_VMCS_OFFSET(host_tr_selector),
  +   [IO_BITMAP_A] =
  +  SHADOW_VMCS_OFFSET(io_bitmap_a),
  +   [IO_BITMAP_A_HIGH] =
  +  SHADOW_VMCS_OFFSET(io_bitmap_a)+4,
  +   [IO_BITMAP_B] =
  +  SHADOW_VMCS_OFFSET(io_bitmap_b),
  +   [IO_BITMAP_B_HIGH] =
  +  SHADOW_VMCS_OFFSET(io_bitmap_b)+4,
  +   [MSR_BITMAP] =
  +  SHADOW_VMCS_OFFSET(msr_bitmap),
  +   [MSR_BITMAP_HIGH] =
  +  SHADOW_VMCS_OFFSET(msr_bitmap)+4,
  +   [VM_EXIT_MSR_STORE_ADDR] =
  +  SHADOW_VMCS_OFFSET(vm_exit_msr_store_addr),
  +   [VM_EXIT_MSR_STORE_ADDR_HIGH] =
  +  SHADOW_VMCS_OFFSET(vm_exit_msr_store_addr)+4,
  +   [VM_EXIT_MSR_LOAD_ADDR] =
  +  SHADOW_VMCS_OFFSET(vm_exit_msr_load_addr),
  +   [VM_EXIT_MSR_LOAD_ADDR_HIGH] =
  +  SHADOW_VMCS_OFFSET(vm_exit_msr_load_addr)+4,
  +   [VM_ENTRY_MSR_LOAD_ADDR] =
  +  SHADOW_VMCS_OFFSET(vm_entry_msr_load_addr),
  +   [VM_ENTRY_MSR_LOAD_ADDR_HIGH] =
  +  SHADOW_VMCS_OFFSET(vm_entry_msr_load_addr)+4,
  +   [TSC_OFFSET] =
  +  SHADOW_VMCS_OFFSET(tsc_offset),
  +   [TSC_OFFSET_HIGH] =
  +  SHADOW_VMCS_OFFSET(tsc_offset)+4,
  +   [VIRTUAL_APIC_PAGE_ADDR] =
  +  SHADOW_VMCS_OFFSET(virtual_apic_page_addr),
  +   [VIRTUAL_APIC_PAGE_ADDR_HIGH] =
  +  SHADOW_VMCS_OFFSET(virtual_apic_page_addr)+4,
  +   [APIC_ACCESS_ADDR] =
  +  SHADOW_VMCS_OFFSET(apic_access_addr),
  +   [APIC_ACCESS_ADDR_HIGH] =
  +  SHADOW_VMCS_OFFSET(apic_access_addr)+4,
  +   [EPT_POINTER] =
  +  SHADOW_VMCS_OFFSET(ept_pointer),
  +   [EPT_POINTER_HIGH] =
  +  SHADOW_VMCS_OFFSET(ept_pointer)+4,
  +   [GUEST_PHYSICAL_ADDRESS] =
  +  SHADOW_VMCS_OFFSET(guest_physical_address),
  +   [GUEST_PHYSICAL_ADDRESS_HIGH] =
  +  SHADOW_VMCS_OFFSET(guest_physical_address)+4,
  +   [VMCS_LINK_POINTER] =
  +  SHADOW_VMCS_OFFSET(vmcs_link_pointer),
  +   [VMCS_LINK_POINTER_HIGH] =
  +  SHADOW_VMCS_OFFSET(vmcs_link_pointer)+4,
  +   [GUEST_IA32_DEBUGCTL] =
  +  SHADOW_VMCS_OFFSET(guest_ia32_debugctl),
  +   [GUEST_IA32_DEBUGCTL_HIGH] =
  +  SHADOW_VMCS_OFFSET(guest_ia32_debugctl)+4,
  +   [GUEST_IA32_PAT] =
  +  SHADOW_VMCS_OFFSET(guest_ia32_pat),
  +   [GUEST_IA32_PAT_HIGH] =
  +  SHADOW_VMCS_OFFSET

Re: Nested VMX support v3

2009-10-21 Thread Orit Wasserman


Avi Kivity a...@redhat.com wrote on 20/10/2009 05:30:34:

 On 10/15/2009 11:41 PM, or...@il.ibm.com wrote:
  Avi,
  We have addressed all of the comments, please apply.
 
  The following patches implement nested VMX support. The patches
 enable a guest
  to use the VMX APIs in order to run its own nested guest (i.e.,
 enable running
  other hypervisors which use VMX under KVM). The current patches
 support running
  Linux under a nested KVM using shadow page table (with bypass_guest_pf
  disabled). SMP support was fixed.  Reworking EPT support to mesh
 cleanly with
  the current shadow paging design per Avi's comments is a
work-in-progress.
 

 Why is bypass_guest_pf disabled?
It was not implemented.
We need to modify the walk_addr code to handle the sptes that have invalid
content (used only for the bypass_guest_pf
optimization) and identify them as not present. Maybe remove some other
validity checks too.

  The current patches only support a single nested hypervisor, which
 can only run
  a single guest (multiple guests are work in progress). Only 64-bit
nested
  hypervisors are supported.
 

 Multiple guests and 32-bit support are merge requirements.  As far as I
 can tell there shouldn't be anything special required to support them?
Ok.


  vpid allocation will be updated with the multiguest support (work
 in progress).
  We are working on fixing the cr0.TS handling, it works for nested kvm
by not
  for vmware server.
 

 Please either drop or fix vpid before merging.  What's wrong with
 cr0.ts?  I'd like to see that fixed as well.
Ok.

 --
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.




Re: [PATCH 3/6] Nested VMX patch 3 implements vmptrld and vmptrst

2009-09-06 Thread Orit Wasserman


Avi Kivity a...@redhat.com wrote on 06/09/2009 12:25:17:

 On 09/03/2009 05:25 PM, Orit Wasserman wrote:
 
  +   /*
  +* Level 2 state : includes vmcs,registers and
  +* a copy of vmcs12 for vmread/vmwrite
  +*/
  +   struct level_state *l2_state;
  +
  +   /* Level 1 state for switching to level 2 and back */
  +   struct level_state *l1_state;
 
 
  Can you explain why we need two of them?  in the guest vmcs we have
host
  and guest values, and in l1_state and l2_state we have more copies,
and
  in struct vcpu we have yet another set of copies.  We also have a
couple
  of copies in the host vmcs.  I'm getting dizzy...
 
  L2_state stores all the L2 guest state:
 vmcs - A pointer to VMCS02, the VMCS used to run it by L0.
 shadow vmcs - a structure storing the values of VMCS12 (the vmcs
L1
  create to run L2).
 

 When we support multiple nested guests, we'll run into a problem of
 where to store shadow_vmcs.  I see these options:

 - maintain a cache of limited size of shadow_vmcs; when evicting, copy
 the shadow_vmcs into the guest's vmptr]
 - always put shadow_vmcs in the guest's vmptr, and write protect it so
 the guest can't play with it
 - always put shadow_vmcs in the guest's vmptr, and verify everything you
 read (that's what nsvm does)

The second option looks a bit complicated; I prefer one of the other two.
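For what the third option could look like in practice, a tiny sketch: the authoritative vmcs12 stays in guest memory at the vmptr and every access goes through kvm_read_guest()/kvm_write_guest(), with the caller validating the value it gets back. The helper name is made up for illustration:

    static int nested_vmcs_read_field(struct kvm_vcpu *vcpu, gpa_t vmptr,
                                      unsigned long offset, u32 *value)
    {
        if (kvm_read_guest(vcpu->kvm, vmptr + offset, value, sizeof(*value)))
            return -EFAULT;
        /* caller checks the value against the capabilities we expose */
        return 0;
    }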
 cpu - the cpu id
 

 Why is it needed?
This is a copy of the cpu id from the vcpu to store the last cpu id the L2
guest ran on.

 launched- launched flag
 

 Can be part of shadow_vmcs
I prefer to keep the shadow_vmcs as a separate structure to store only VMCS
fields.

 vpid - the vpid allocate by L0 for L2 (we need to store it
somewhere)
 

 Note the guest can DoS the host by allocating a lot of vpids.  So we to
 allocate host vpids on demand and be able to flush them out.
The guest does not allocate the vpids; the host (L0) does, using
allocate_vpid().
I agree that with nesting the danger of running out of them is bigger.

 msr_bitmap - At the moment we use L0 msr_bitmap(as we are
running kvm
  on kvm) in the future we will use a merge of both bitmaps.
 

 Note kvm uses two bitmaps (for long mode and legacy mode).
OK.

  L1 state stores the L1 state -
 vmcs - pointer to VMCS01
 

 So it's the same as vmx-vmcs in normal operation?
Yes, but with nesting vmx->vmcs is changed when running an L2 (nested)
guest.

 shadow vmcs - a structure storing the values of VMCS01. we use
it
  when updating VMCS02 in order to avoid the need to switch between
VMCS02
  and VMCS01.
 

 Sorry, don't understand.
VMCS02 - the VMCS L0 uses to run L2.
When we create/update VMCS02 we need to read fields from VMCS01 (host state
is taken fully, control fields ).
For L1 the shadow_vmcs is a copy of VMCS01 in a structure format, we used
the same structure.

 cpu - the cpu id
 launched- launched flag
 

 This is a copy of vmx-launched?
Exactly. vmx->launched is updated when switching between L1/L2 and back, so
we need to store it here.

  +
  +   if (vmx->nested.l1_cur_vmcs != guest_vmcs_addr) {
  +  vmcs_page = nested_get_page(vcpu, guest_vmcs_addr);
  +  if (vmcs_page == NULL)
  + return 1;
  +
  +  /* load nested vmcs to processor */
  +  if (vmptrld(vcpu, page_to_phys(vmcs_page))) {
 
 
  So, you're loading a guest page as the vmcs.  This is dangerous as the
  guest can play with it.  Much better to use inaccessible memory (and
you
  do alloc_vmcs() earlier?)
 
  We can copy the vmcs and than vmptrld it. As for the allocate vmcs this
is
  a memory leak and I will fix it (it should be allocated only once).
 

 But why do it?  Your approach is to store the guest vmcs in the same
 format as the processor (which we don't really know), so you have to use
 vmread/vmwrite to maintain it.  Instead, you can choose that the guest
 vmcs is a shadow_vmcs structure and then you can access it using normal
 memory operations.
I got it now.
We will need a way to distinguish between processor format VMCS and
structure based VMCS,
we can use the revision id field (create a unique revision id for nested
like 0x or 0x0).

  I see.  You're using the processor's format when reading the guest
  vmcs.  But we don't have to do that, we can use the shadow_vmcs
  structure (and a memcpy).
 
  I'm sorry I don't understand your comment can u elaborate ?
 
 

 See previous comment.  Basically you can do

struct shadow_vmcs *svmcs = kmap_atomic(gpa_to_page(vmx->vmptr));
printk("guest_cs = %x\n", svmcs->guest_cs_selector);
See above.

 instead of

vmptrld(gpa_to_hpa(vmx->vmptr

Re: [PATCH 5/6] Nested VMX patch 5 implements vmlaunch and vmresume

2009-09-06 Thread Orit Wasserman


Avi Kivity a...@redhat.com wrote on 06/09/2009 12:29:58:

 On 09/03/2009 05:53 PM, Orit Wasserman wrote:
 
  Need to auto-ack the interrupt if requested by the guest.
 
  The is_interrupt means L1 has interrupts, kvm regular code will handle
it.
 

 If the VM-Exit Controls bit 15 (Acknowledge interrupts on exit) is set,
 when the nested guest exits you need to run kvm_cpu_get_interrupt() and
 put the vector number in the VM-Exit interruption-information field.
 kvm doesn't set this bit but I think Xen does.
VMware doesn't set it either.
We have to run L2 with the bit off even if the L1 hypervisor set it, and
emulate the L2 exit to L1 hypervisor correctly.
I will look at it.
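A sketch of the behaviour being described: on a nested exit, if L1 asked for "acknowledge interrupt on exit" (bit 15 of the VM-exit controls), acknowledge the interrupt with kvm_cpu_get_interrupt() and report the vector in vmcs12's exit interruption-information field. The control-bit name VM_EXIT_ACK_INTR_ON_EXIT is an assumption here; the field names follow the patch's shadow_vmcs:

    static void nested_ack_intr_on_exit(struct kvm_vcpu *vcpu,
                                        struct shadow_vmcs *vmcs12)
    {
        int irq;

        if (!(vmcs12->vm_exit_controls & VM_EXIT_ACK_INTR_ON_EXIT))
            return;

        irq = kvm_cpu_get_interrupt(vcpu);
        if (irq < 0)
            return;

        vmcs12->vm_exit_intr_info = irq | INTR_TYPE_EXT_INTR |
                                    INTR_INFO_VALID_MASK;
    }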

 --
 error compiling committee.c: too many arguments to function




Re: [PATCH 3/6] Nested VMX patch 3 implements vmptrld and vmptrst

2009-09-06 Thread Orit Wasserman


Avi Kivity a...@redhat.com wrote on 06/09/2009 16:52:56:

 On 09/06/2009 04:36 PM, Orit Wasserman wrote:
 
  When we support multiple nested guests, we'll run into a problem of
  where to store shadow_vmcs.  I see these options:
 
  - maintain a cache of limited size of shadow_vmcs; when evicting, copy
  the shadow_vmcs into the guest's vmptr]
  - always put shadow_vmcs in the guest's vmptr, and write protect it so
  the guest can't play with it
  - always put shadow_vmcs in the guest's vmptr, and verify everything
you
  read (that's what nsvm does)
 
 
  The second option looks a bit complicated I prefer one of the other
two.
 

 I agree, the third option looks easiest but not sure how much
 verification is needed.

 Note other things like the msr bitmaps may need write protection,
 otherwise you have to re-merge the bitmap on every guest entry, which
 can be very slow.  So we may be forced to add write protection anyway.
We will also need to write-protect L1's EPT tables, to allow L1 to swap
out its guests.

  launched- launched flag
 
 
  Can be part of shadow_vmcs
 
  I prefer to keep the shadow_vmcs as a separate structure to store only
VMCS
  fields.
 

 It is a vmcs field - it is manipulated by vmx instructions which operate
 on the vmcs.  You'll need to store it in guest memory when you support
 multiple nested guests.

 You can put the vmcs fields in a sub-structure if you want to separate
 between explicit fields and implicit fields (I can only see one implicit
 field (launched), but maybe there are more).
OK.

 
  vpid - the vpid allocate by L0 for L2 (we need to store it
 
  somewhere)
 
 
  Note the guest can DoS the host by allocating a lot of vpids.  So we
to
  allocate host vpids on demand and be able to flush them out.
 
  The guest is not allocating the vpids the host(L0) does using
  allocate_vpid.
 

 I meant, the guest can force the host to allocate vpids if we don't
 protect against it.
You mean by launching a lot of guests?
We can limit the number of guests as a very quick solution.
More complicated is limiting the number of vpids per L1 hypervisor and
reusing them.
This means we will sometimes need to invalidate the vpid when switching
between L2 guests.

  I agree that with nested the danger for them to run out is bigger.
 


  Sorry, don't understand.
 
  VMCS02 - the VMCS L0 uses to run L2.
  When we create/update VMCS02 we need to read fields from VMCS01 (host
state
  is taken fully, control fields ).
  For L1 the shadow_vmcs is a copy of VMCS01 in a structure format, we
used
  the same structure.
 

 I don't understand why you need it.  Host state shouldn't change.  Only
 the control fields are interesting, and things like exception_bitmap.
I think that when KVM switches to Qemu the host state can change (L0 host
state). If this happens between different runs of L2
we will need to update the VMCS02 host state. Of course we can optimize and
update it only then.

  But why do it?  Your approach is to store the guest vmcs in the same
  format as the processor (which we don't really know), so you have to
use
  vmread/vmwrite to maintain it.  Instead, you can choose that the guest
  vmcs is a shadow_vmcs structure and then you can access it using
normal
  memory operations.
 
  I got it now.
  We will need a way to distinguish between processor format VMCS and
  structure based VMCS,
  we can use the revision id field (create a unique revision id for
nested
  like 0x or 0x0).
 

 No, you can always store guest vmcs in software format, since we'll
 never load it with vmptrld.  We'll only load a real vmcs with vmptrld.
You are right , a new VMCS will be zeroed.

 Note it also solves live migration, since now all guest vmcss are copied
 as part of normal guest memory (including their launched state).
Great.

 --
 error compiling committee.c: too many arguments to function




Re: [PATCH 1/6] Nested VMX patch 1 implements vmon and vmoff

2009-09-03 Thread Orit Wasserman


Avi Kivity a...@redhat.com wrote on 02/09/2009 22:34:58:

 On 09/02/2009 06:38 PM, or...@il.ibm.com wrote:
  From: Orit Wassermanor...@il.ibm.com
 
  ---
arch/x86/kvm/svm.c |3 -
arch/x86/kvm/vmx.c |  187 ++
 +-
arch/x86/kvm/x86.c |6 ++-
arch/x86/kvm/x86.h |2 +
4 files changed, 192 insertions(+), 6 deletions(-)
 
  diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
  index 2df9b45..3c1f22a 100644
  --- a/arch/x86/kvm/svm.c
  +++ b/arch/x86/kvm/svm.c
  @@ -124,9 +124,6 @@ static int npt = 1;
 
module_param(npt, int, S_IRUGO);
 
  -static int nested = 1;
  -module_param(nested, int, S_IRUGO);
  -
static void svm_flush_tlb(struct kvm_vcpu *vcpu);
static void svm_complete_interrupts(struct vcpu_svm *svm);
 
  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
  index 78101dd..abba325 100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -67,6 +67,11 @@ struct vmcs {
   char data[0];
};
 
  +struct nested_vmx {
  +   /* Has the level1 guest done vmon? */
  +   bool vmon;
  +};
 

 vmxon
fixed

  @@ -967,6 +975,69 @@ static void guest_write_tsc(u64 guest_tsc,
 u64 host_tsc)
}
 
/*
  + * Handles msr read for nested virtualization
  + */
  +static int nested_vmx_get_msr(struct kvm_vcpu *vcpu, u32 msr_index,
  +   u64 *pdata)
  +{
  +   u32 vmx_msr_low = 0, vmx_msr_high = 0;
  +
  +   switch (msr_index) {
  +   case MSR_IA32_FEATURE_CONTROL:
  +  *pdata = 0;
  +  break;
  +   case MSR_IA32_VMX_BASIC:
  +  rdmsr(MSR_IA32_VMX_BASIC, vmx_msr_low, vmx_msr_high);
 

 Use rdmsrl, it's easier.
fixed

 I think we need to mask it with the capabilities we support.  Otherwise
 the guest can try to use some new feature which we don't support yet,
 and crash.
I agree , but I went over the Intel spec and didn't find any problematic
feature.
We may need to consider it in the future.

  +  *pdata = vmx_msr_low | ((u64)vmx_msr_high << 32);
  +  break;
  +   case MSR_IA32_VMX_PINBASED_CTLS:
  +  *pdata = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING |
  + PIN_BASED_VIRTUAL_NMIS;
 

 Need to mask with actual cpu capabilities in case we run on an older cpu.
fixed

  +  break;
  +   case MSR_IA32_VMX_PROCBASED_CTLS:
  +  *pdata =  CPU_BASED_HLT_EXITING |
  +#ifdef CONFIG_X86_64
  + CPU_BASED_CR8_LOAD_EXITING |
  + CPU_BASED_CR8_STORE_EXITING |
  +#endif
  + CPU_BASED_CR3_LOAD_EXITING |
  + CPU_BASED_CR3_STORE_EXITING |
  + CPU_BASED_USE_IO_BITMAPS |
  + CPU_BASED_MOV_DR_EXITING |
  + CPU_BASED_USE_TSC_OFFSETING |
  + CPU_BASED_INVLPG_EXITING;
 

 Same here... or are all these guaranteed to be present?
fixed
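Following up on the masking comments above, a sketch of how a reported capability MSR could be clamped to what the host CPU and the nested code both support; NESTED_PROCBASED_CTLS_HIGH is a hypothetical mask of the controls we actually emulate:

    static u64 nested_vmx_procbased_ctls(void)
    {
        u32 low, high;

        rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, low, high);
        /* allowed-1 settings: intersect host support with what we emulate */
        high &= NESTED_PROCBASED_CTLS_HIGH;

        return low | ((u64)high << 32);
    }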

  +
  +static int handle_vmon(struct kvm_vcpu *vcpu)
  +{
  +   struct kvm_segment cs;
  +   struct vcpu_vmx *vmx = to_vmx(vcpu);
  +
  +   if (!nested) {
  +  printk(KERN_DEBUG "%s: nested vmx not enabled\n", __func__);
  +  kvm_queue_exception(vcpu, UD_VECTOR);
  +  return 1;
  +   }
  +
  +   vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
  +
  +   if (!(vcpu->arch.cr4 & X86_CR4_VMXE) ||
  +   !(vcpu->arch.cr0 & X86_CR0_PE) ||
  +   (vmx_get_rflags(vcpu) & X86_EFLAGS_VM) ||
  +   ((find_msr_entry(to_vmx(vcpu),
  +  MSR_EFER)->data & EFER_LMA) && !cs.l)) {
 

 Not everyone has EFER.  Better to wrap this in an #ifdef CONFIG_X86_64.
fixed

 --
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.




Re: [PATCH 2/6] Nested VMX patch 2 implements vmclear

2009-09-03 Thread Orit Wasserman


Avi Kivity a...@redhat.com wrote on 02/09/2009 22:38:22:

 On 09/02/2009 06:38 PM, or...@il.ibm.com wrote:
  From: Orit Wassermanor...@il.ibm.com
 
  ---
arch/x86/kvm/vmx.c |   24 +++-
1 files changed, 23 insertions(+), 1 deletions(-)
 
  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
  index abba325..2b1fc3b 100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -70,6 +70,8 @@ struct vmcs {
struct nested_vmx {
   /* Has the level1 guest done vmon? */
   bool vmon;
  +   /* Has the level1 guest done vmclear? */
  +   bool vmclear;
};
 

 Doesn't seem these two belong in the same structure - vmclear is
 per-vmcs... but you're probably aware of that with the multi-guest
 support coming.
You are right vmclear flag is part of the L2 guest state.


 --
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.




Re: [PATCH 3/6] Nested VMX patch 3 implements vmptrld and vmptrst

2009-09-03 Thread Orit Wasserman


Avi Kivity a...@redhat.com wrote on 02/09/2009 23:05:09:

 On 09/02/2009 06:38 PM, or...@il.ibm.com wrote:
  +struct __attribute__ ((__packed__)) level_state {
  +   struct shadow_vmcs *shadow_vmcs;
  +
  +   u16 vpid;
  +   u64 shadow_efer;
  +   unsigned long cr2;
  +   unsigned long cr3;
  +   unsigned long cr4;
  +   unsigned long cr8;
  +
  +   u64 io_bitmap_a;
  +   u64 io_bitmap_b;
  +   u64 msr_bitmap;
  +
  +   struct vmcs *vmcs;
  +   int cpu;
  +   int launched;
  +};
 



  +
struct vmcs {
   u32 revision_id;
   u32 abort;
  @@ -72,6 +217,17 @@ struct nested_vmx {
   bool vmon;
   /* Has the level1 guest done vmclear? */
   bool vmclear;
  +   /* What is the location of the  vmcs l1 keeps for l2? (in level1
gpa) */
  +   u64 l1_cur_vmcs;
 

 This is the vmptr (exactly as loaded by vmptrld), right?  If so, please
 call it vmptr.
OK I will change it.

  +   /*
  +* Level 2 state : includes vmcs,registers and
  +* a copy of vmcs12 for vmread/vmwrite
  +*/
  +   struct level_state *l2_state;
  +
  +   /* Level 1 state for switching to level 2 and back */
  +   struct level_state *l1_state;
 

 Can you explain why we need two of them?  in the guest vmcs we have host
 and guest values, and in l1_state and l2_state we have more copies, and
 in struct vcpu we have yet another set of copies.  We also have a couple
 of copies in the host vmcs.  I'm getting dizzy...
L2_state stores all the L2 guest state:
  vmcs - A pointer to VMCS02, the VMCS used to run it by L0.
  shadow vmcs - a structure storing the values of VMCS12 (the vmcs L1
create to run L2).
  cpu - the cpu id
  launched- launched flag
  vpid - the vpid allocate by L0 for L2 (we need to store it somewhere)
  msr_bitmap - At the moment we use L0 msr_bitmap(as we are running kvm
on kvm) in the future we will use a merge of both bitmaps.
  io_bitmaps - At the moment we use L0 io_bitmaps (as we are running
kvm on kvm) in the future we will use a merge of both io_bitmaps.

L1 state stores the L1 state -
  vmcs - pointer to VMCS01
  shadow vmcs - a structure storing the values of VMCS01. we use it
when updating VMCS02 in order to avoid the need to switch between VMCS02
and VMCS01.
  cpu - the cpu id
  launched- launched flag
  vpid - the vpid allocate by L0 for L1 (we need to store it somewhere)
  shadow_efer - Till recently wasn't part of the VMCS and may be needed
for older processors.
  cr0 - not used I will remove it
  cr2 - not used I will remove it
  cr3
  cr4

We didn't use the state stored in the vcpu for L1 because sometimes it
changes while L2 runs.
The vmcs in the vcpu points to the active vmcs: it points to VMCS01
when L1 is running and to VMCS02 when the L2 guest is running.



static int init_rmode(struct kvm *kvm);
static u64 construct_eptp(unsigned long root_hpa);
 
 
 
  +int read_guest_vmcs_gpa(struct kvm_vcpu *vcpu, u64 *gentry)
  +{
  +   gpa_t gpa;
  +   struct page *page;
  +   int r = 0;
  +
  +   gpa = vcpu->arch.mmu.gva_to_gpa(vcpu, vcpu->arch.regs[VCPU_REGS_RAX]);
  +
  +   /* checking guest gpa */
  +   page = gfn_to_page(vcpu->kvm, gpa >> PAGE_SHIFT);
  +   if (is_error_page(page)) {
  +  printk(KERN_ERR "%s Invalid guest vmcs addr %llx\n",
  + __func__, gpa);
  +  r = 1;
  +  goto out;
  +   }
  +
  +   r = kvm_read_guest(vcpu->kvm, gpa, gentry, sizeof(u64));
  +   if (r) {
  +  printk(KERN_ERR "%s cannot read guest vmcs addr %llx : %d\n",
  + __func__, gpa, r);
  +  goto out;
  +   }
 

 You can use kvm_read_guest_virt() to simplify this.
I will fix it.

  +
  +   if (!IS_ALIGNED(*gentry, PAGE_SIZE)) {
  +  printk(KERN_DEBUG "%s addr %llx not aligned\n",
  + __func__, *gentry);
  +  return 1;
  +   }
  +
  +out:
  +   kvm_release_page_clean(page);
  +   return r;
  +}
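Picking up the kvm_read_guest_virt() suggestion above, the helper could collapse to something like the sketch below; this assumes the x86.c helper is made visible to vmx.c with the signature (addr, val, bytes, vcpu), which should be verified:

    static int read_guest_vmcs_gpa(struct kvm_vcpu *vcpu, u64 *gentry)
    {
        gva_t gva = vcpu->arch.regs[VCPU_REGS_RAX];

        if (kvm_read_guest_virt(gva, gentry, sizeof(u64), vcpu)) {
            printk(KERN_DEBUG "%s cannot read guest vmcs addr\n", __func__);
            return 1;
        }

        if (!IS_ALIGNED(*gentry, PAGE_SIZE)) {
            printk(KERN_DEBUG "%s addr %llx not aligned\n", __func__, *gentry);
            return 1;
        }
        return 0;
    }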
  +
  +static int handle_vmptrld(struct kvm_vcpu *vcpu)
  +{
  +   struct vcpu_vmx *vmx = to_vmx(vcpu);
  +   struct page *vmcs_page;
  +   u64 guest_vmcs_addr;
  +
  +   if (!nested_vmx_check_permission(vcpu))
  +  return 1;
  +
  +   if (read_guest_vmcs_gpa(vcpu, &guest_vmcs_addr))
  +  return 1;
  +
  +   if (create_l1_state(vcpu)) {
  +  printk(KERN_ERR "%s create_l1_state failed\n", __func__);
  +  return 1;
  +   }
  +
  +   if (create_l2_state(vcpu)) {
  +  printk(KERN_ERR "%s create_l2_state failed\n", __func__);
  +  return 1;
  +   }
  +
  +   vmx->nested.l2_state->vmcs = alloc_vmcs();
  +   if (!vmx->nested.l2_state->vmcs) {
  +  printk(KERN_ERR "%s error in creating level 2 vmcs", __func__);
  +  return 1

Re: [PATCH 2/2] Nested VMX patch 4 implements vmread and vmwrite

2009-09-03 Thread Orit Wasserman


Avi Kivity a...@redhat.com wrote on 02/09/2009 23:15:40:

 On 09/02/2009 06:38 PM, or...@il.ibm.com wrote:
  +
  +static void init_vmcs_field_to_offset_table(void)
  +{
  +   memset(vmcs_field_to_offset_table,0xff,
  +  sizeof(vmcs_field_to_offset_table));
  +
  +   vmcs_field_to_offset_table[VIRTUAL_PROCESSOR_ID] =
  +  offsetof(struct shadow_vmcs, virtual_processor_id);
  +   vmcs_field_to_offset_table[GUEST_ES_SELECTOR] =
  +  offsetof(struct shadow_vmcs, guest_es_selector);
  +   vmcs_field_to_offset_table[GUEST_CS_SELECTOR] =
  +  offsetof(struct shadow_vmcs, guest_cs_selector);
  +   vmcs_field_to_offset_table[GUEST_SS_SELECTOR] =
  +  offsetof(struct shadow_vmcs, guest_ss_selector);
  +   vmcs_field_to_offset_table[GUEST_DS_SELECTOR] =
  +  offsetof(struct shadow_vmcs, guest_ds_selector);
  +   vmcs_field_to_offset_table[GUEST_FS_SELECTOR] =
  +  offsetof(struct shadow_vmcs, guest_fs_selector);
  +   vmcs_field_to_offset_table[GUEST_GS_SELECTOR] =
  +  offsetof(struct shadow_vmcs, guest_gs_selector);
  +   vmcs_field_to_offset_table[GUEST_LDTR_SELECTOR] =
  +  offsetof(struct shadow_vmcs, guest_ldtr_selector);
  +   vmcs_field_to_offset_table[GUEST_TR_SELECTOR] =
  +  offsetof(struct shadow_vmcs, guest_tr_selector);
  +   vmcs_field_to_offset_table[HOST_ES_SELECTOR] =
  +  offsetof(struct shadow_vmcs, host_es_selector);
  +   vmcs_field_to_offset_table[HOST_CS_SELECTOR] =
  +  offsetof(struct shadow_vmcs, host_cs_selector);
  +   vmcs_field_to_offset_table[HOST_SS_SELECTOR] =
  +  offsetof(struct shadow_vmcs, host_ss_selector);
  +   vmcs_field_to_offset_table[HOST_DS_SELECTOR] =
  +  offsetof(struct shadow_vmcs, host_ds_selector);
  +   vmcs_field_to_offset_table[HOST_FS_SELECTOR] =
  +  offsetof(struct shadow_vmcs, host_fs_selector);
  +   vmcs_field_to_offset_table[HOST_GS_SELECTOR] =
  +  offsetof(struct shadow_vmcs, host_gs_selector);
  +   vmcs_field_to_offset_table[HOST_TR_SELECTOR] =
  +  offsetof(struct shadow_vmcs, host_tr_selector);
  +   vmcs_field_to_offset_table[IO_BITMAP_A] =
  +  offsetof(struct shadow_vmcs, io_bitmap_a);
  +   vmcs_field_to_offset_table[IO_BITMAP_A_HIGH] =
  +  offsetof(struct shadow_vmcs, io_bitmap_a)+4;
  +   vmcs_field_to_offset_table[IO_BITMAP_B] =
  +  offsetof(struct shadow_vmcs, io_bitmap_b);
  +   vmcs_field_to_offset_table[IO_BITMAP_B_HIGH] =
  +  offsetof(struct shadow_vmcs, io_bitmap_b)+4;
  +   vmcs_field_to_offset_table[MSR_BITMAP] =
  +  offsetof(struct shadow_vmcs, msr_bitmap);
  +   vmcs_field_to_offset_table[MSR_BITMAP_HIGH] =
  +  offsetof(struct shadow_vmcs, msr_bitmap)+4;
  +   vmcs_field_to_offset_table[VM_EXIT_MSR_STORE_ADDR] =
  +  offsetof(struct shadow_vmcs, vm_exit_msr_store_addr);
  +   vmcs_field_to_offset_table[VM_EXIT_MSR_STORE_ADDR_HIGH] =
  +  offsetof(struct shadow_vmcs, vm_exit_msr_store_addr)+4;
  +   vmcs_field_to_offset_table[VM_EXIT_MSR_LOAD_ADDR] =
  +  offsetof(struct shadow_vmcs, vm_exit_msr_load_addr);
  +   vmcs_field_to_offset_table[VM_EXIT_MSR_LOAD_ADDR_HIGH] =
  +  offsetof(struct shadow_vmcs, vm_exit_msr_load_addr)+4;
  +   vmcs_field_to_offset_table[VM_ENTRY_MSR_LOAD_ADDR] =
  +  offsetof(struct shadow_vmcs, vm_entry_msr_load_addr);
  +   vmcs_field_to_offset_table[VM_ENTRY_MSR_LOAD_ADDR_HIGH] =
  +  offsetof(struct shadow_vmcs, vm_entry_msr_load_addr)+4;
  +   vmcs_field_to_offset_table[TSC_OFFSET] =
  +  offsetof(struct shadow_vmcs, tsc_offset);
  +   vmcs_field_to_offset_table[TSC_OFFSET_HIGH] =
  +  offsetof(struct shadow_vmcs, tsc_offset)+4;
  +   vmcs_field_to_offset_table[VIRTUAL_APIC_PAGE_ADDR] =
  +  offsetof(struct shadow_vmcs, virtual_apic_page_addr);
  +   vmcs_field_to_offset_table[VIRTUAL_APIC_PAGE_ADDR_HIGH] =
  +  offsetof(struct shadow_vmcs, virtual_apic_page_addr)+4;
  +   vmcs_field_to_offset_table[APIC_ACCESS_ADDR] =
  +  offsetof(struct shadow_vmcs, apic_access_addr);
  +   vmcs_field_to_offset_table[APIC_ACCESS_ADDR_HIGH] =
  +  offsetof(struct shadow_vmcs, apic_access_addr)+4;
  +   vmcs_field_to_offset_table[EPT_POINTER] =
  +  offsetof(struct shadow_vmcs, ept_pointer);
  +   vmcs_field_to_offset_table[EPT_POINTER_HIGH] =
  +  offsetof(struct shadow_vmcs, ept_pointer)+4;
  +   vmcs_field_to_offset_table[GUEST_PHYSICAL_ADDRESS] =
  +  offsetof(struct shadow_vmcs, guest_physical_address);
  +   vmcs_field_to_offset_table[GUEST_PHYSICAL_ADDRESS_HIGH] =
  +  offsetof(struct shadow_vmcs, guest_physical_address)+4;
  +   vmcs_field_to_offset_table[VMCS_LINK_POINTER] =
  +  offsetof(struct

Re: [PATCH 5/6] Nested VMX patch 5 implements vmlaunch and vmresume

2009-09-03 Thread Orit Wasserman


Avi Kivity a...@redhat.com wrote on 03/09/2009 00:38:16:

 On 09/02/2009 06:38 PM, or...@il.ibm.com wrote:
  -struct nested_vmx {
  -   /* Has the level1 guest done vmon? */
  +struct nested_vmx {   /* Has the level1 guest done vmon? */
 

 A \n died here.
I will fix it.

   bool vmon;
   /* Has the level1 guest done vmclear? */
   bool vmclear;
  +
  +   /* Are we running nested guest */
  +   bool nested_mode;
  +
  +   /* L1 requested VMLAUNCH or VMRESUME but we didn't run L2 yet */
  +   bool nested_run_pending;
  +
  +   /* flag indicating if there was a valid IDT after exiting from l2
*/
  +   bool nested_pending_valid_idt;
 

 What does this mean?  pending event?
I will rename it.
 
  +
  +static inline int nested_cpu_has_vmx_tpr_shadow(struct kvm_vcpu *vcpu)
  +{
  +   return to_vmx(vcpu)->nested.l2_state->shadow_vmcs->
  +  cpu_based_vm_exec_control & CPU_BASED_TPR_SHADOW;
  +}
 

 Don't we need to check if the host supports it too?
We check it separately but I can add it here

  +static inline bool nested_vm_need_virtualize_apic_accesses(struct kvm_vcpu
  +*vcpu)
  +{
  +   struct shadow_vmcs *shadow = to_vmx(vcpu)->nested.l2_state->shadow_vmcs;
  +
  +   return (shadow->secondary_vm_exec_control &
  +  SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) &&
  +  to_vmx(vcpu)->nested.l2_state->shadow_vmcs->apic_access_addr != 0;
  +}
 

 Why check apic_access_addr?
I will remove it.

  +
  +static inline int nested_cpu_has_vmx_ept(struct kvm_vcpu *vcpu)
  +{
  +   return to_vmx(vcpu)->nested.l2_state->shadow_vmcs->
  +  secondary_vm_exec_control & SECONDARY_EXEC_ENABLE_EPT;
  +}
 

 Need to check if secondary controls enabled?
If the secondary controls are not enabled this field is zero.

  +static void vmx_set_irq(struct kvm_vcpu *vcpu)
  +{
  +   if (to_vmx(vcpu)->nested.nested_mode)
  +  return;
 

 Why?

 Note if the guest didn't enable external interrupt exiting, we need to
 inject as usual.
I will look into it.

 
  +static int nested_handle_pending_idt(struct kvm_vcpu *vcpu)
  +{
 

 Again the name is confusing.  pending_event_injection?
I will rename it.

  +   struct vcpu_vmx *vmx = to_vmx(vcpu);
  +   int irq;
  +   int type;
  +   int errCodeValid;
  +   u32 idt_vectoring_info;
  +   u32 guest_intr;
  +   bool nmi_window_open;
  +   bool interrupt_window_open;
  +
  +   if (vmx->nested.nested_mode && vmx->nested.nested_pending_valid_idt) {
  +  idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
  +  irq  = idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
  +  type = idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
  +  errCodeValid = idt_vectoring_info &
  + VECTORING_INFO_DELIVER_CODE_MASK;
  +
  +  guest_intr = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
  +  nmi_window_open =
  + !(guest_intr & (GUEST_INTR_STATE_STI |
  +   GUEST_INTR_STATE_MOV_SS |
  +   GUEST_INTR_STATE_NMI));
  +
  +  interrupt_window_open =
  + ((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
  +  !(guest_intr & (GUEST_INTR_STATE_STI |
  +GUEST_INTR_STATE_MOV_SS)));
  +
  +  if (type == INTR_TYPE_EXT_INTR && !interrupt_window_open) {
  + printk(KERN_INFO "IDT ignored, l2 interrupt window closed!\n");
  + return 0;
  +  }
 

 How can this happen?  Unless it's on nested entry, in which case we need
 to abort the entry.
OK, I will fix it. To tell the truth, I never saw it happen.

  +
#ifdef CONFIG_X86_64
#define R r
#define Q q
  @@ -4646,6 +4842,15 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
   struct vcpu_vmx *vmx = to_vmx(vcpu);
 
  +   nested_handle_pending_idt(vcpu);
 

 You're not checking the return code (need to do that on entry).
I will fix it.

  +
  +   if (vmx->nested.nested_mode) {
  +  vmcs_writel(GUEST_CR0,
  + vmx->nested.l2_state->shadow_vmcs->guest_cr0);
 

 Might not be legal.  We may also want to force-enable caching.  Lastly,
 don't we need to handle cr0.ts and ct0.mp specially to manage the fpu
state?
We are working on implementing this correctly. KVM seems to handle it fine,
but VMware doesn't like it.

 
  +   if (vmx->nested.nested_mode)
  +  vmx->nested.vmclear = 0;
  +
 

 Why?
I will check it.

free_vmcs:
  @@ -5122,6 +5339,228 @@ static int shadow_vmcs_load(struct kvm_vcpu
*vcpu)
   return 0;
}
 
  +void prepare_vmcs_12(struct kvm_vcpu *vcpu)
  +{
  +   struct shadow_vmcs *l2_shadow_vmcs =
  +  to_vmx(vcpu)->nested.l2_state->shadow_vmcs;
  +   struct shadow_vmcs *l1_shadow_vmcs =
  +  to_vmx(vcpu)->nested.l1_state->shadow_vmcs;
  +
  +   l2_shadow_vmcs->guest_es_selector

Re: [PATCH 1/6] Nested VMX patch 1 implements vmon and vmoff

2009-09-03 Thread Orit Wasserman


Avi Kivity a...@redhat.com wrote on 03/09/2009 16:39:09:

 On 09/03/2009 03:34 PM, Orit Wasserman wrote:
 
  I think we need to mask it with the capabilities we support.
Otherwise
  the guest can try to use some new feature which we don't support yet,
  and crash.
 
  I agree , but I went over the Intel spec and didn't find any
problematic
  feature.
  We may need to consider it in the future.
 

 We need to do it, since we don't know anything about future processors.
OK

 --
 error compiling committee.c: too many arguments to function

