Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-24 Thread Avi Kivity

On 05/23/2011 09:06 PM, Alexander Graf wrote:

On 23.05.2011, at 17:23, Avi Kivity wrote:

  On 05/23/2011 05:44 PM, Nadav Har'El wrote:
  On Mon, May 23, 2011, Avi Kivity wrote about Re: [PATCH 0/30] nVMX: Nested VMX, 
v9:
 vmcs01 and vmcs02 will both be generated from vmcs12.

  If you don't do a clean nested exit (from L2 to L1), vmcs02 can't be 
generated
  from vmcs12... while L2 runs, it is possible that it modifies vmcs02 (e.g.,
  non-trapped bits of guest_cr0), and these modifications are not copied back
  to vmcs12 until the nested exit (when prepare_vmcs12() is called to perform
  this task).

  If you do a nested exit (a fake one), vmcs12 is made up to date, and then
  indeed vmcs02 can be thrown away and regenerated.

  You would flush this state back to the vmcs.  But that just confirms Joerg's 
statement that a fake vmexit/vmrun is more or less equivalent.

  The question is whether %rip points to the VMRUN/VMLAUNCH instruction, 
HOST_RIP (or the next instruction for svm), or to guest code.  But the actual 
things we need to do are all very similar subsets of a vmexit.

%rip should certainly point to VMRUN. That way there is no need to save any 
information whatsoever, as the VMCB is already in sane state and nothing needs 
to be special cased, as the next VCPU_RUN would simply go back into guest mode 
- which is exactly what we want.

The only tricky part is how we distinguish between "I need to live migrate" and
"info registers". In the former case, %rip should be on VMRUN. In the latter, on
the guest rip.


We can split vmrun emulation into "save host state", "load guest state"
and "prepare nested vmcb".  Then, when we load registers, if we see that
we're in guest mode, we do just the "prepare nested vmcb" bit.


This way register state is always nested guest state.
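A minimal sketch of what that split could look like; the struct and helper
names below are invented for illustration and are not the real
arch/x86/kvm/svm.c code:

/*
 * Hedged sketch only: full VMRUN emulation does all three steps, while
 * restoring a migrated vcpu that is already in guest mode redoes only
 * the "prepare nested vmcb" step.
 */
#include <stdbool.h>
#include <stdint.h>

struct nested_svm_sketch {
	bool     in_guest_mode;   /* are we inside L2? (migrated flag)     */
	uint64_t vmcb12_gpa;      /* guest-physical address of L1's VMCB   */
};

static void save_host_state(struct nested_svm_sketch *n)    { /* stash L1 state for the later #VMEXIT   */ }
static void load_guest_state(struct nested_svm_sketch *n)   { /* copy L2 state out of the guest's VMCB  */ }
static void prepare_nested_vmcb(struct nested_svm_sketch *n){ /* build the VMCB the hardware will run   */ }

/* Full VMRUN emulation: all three steps, then mark guest mode. */
static void emulate_vmrun(struct nested_svm_sketch *n)
{
	save_host_state(n);
	load_guest_state(n);
	prepare_nested_vmcb(n);
	n->in_guest_mode = true;
}

/* Register-load path after migration: host and guest state are already
 * part of what was loaded, so only the hardware VMCB must be rebuilt. */
static void load_registers(struct nested_svm_sketch *n)
{
	if (n->in_guest_mode)
		prepare_nested_vmcb(n);
}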



Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-24 Thread Joerg Roedel
On Tue, May 24, 2011 at 02:09:00PM +0300, Avi Kivity wrote:
 On 05/23/2011 09:06 PM, Alexander Graf wrote:
 On 23.05.2011, at 17:23, Avi Kivity wrote:

   On 05/23/2011 05:44 PM, Nadav Har'El wrote:
   On Mon, May 23, 2011, Avi Kivity wrote about Re: [PATCH 0/30] nVMX: 
  Nested VMX, v9:
  vmcs01 and vmcs02 will both be generated from vmcs12.
 
   If you don't do a clean nested exit (from L2 to L1), vmcs02 can't be 
  generated
   from vmcs12... while L2 runs, it is possible that it modifies vmcs02 
  (e.g.,
   non-trapped bits of guest_cr0), and these modifications are not copied 
  back
   to vmcs12 until the nested exit (when prepare_vmcs12() is called to 
  perform
   this task).
 
   If you do a nested exit (a fake one), vmcs12 is made up to date, and 
  then
   indeed vmcs02 can be thrown away and regenerated.
 
   You would flush this state back to the vmcs.  But that just confirms 
  Joerg's statement that a fake vmexit/vmrun is more or less equivalent.
 
   The question is whether %rip points to the VMRUN/VMLAUNCH instruction, 
  HOST_RIP (or the next instruction for svm), or to guest code.  But the 
  actual things we need to do are all very similar subsets of a vmexit.

 %rip should certainly point to VMRUN. That way there is no need to save any 
 information whatsoever, as the VMCB is already in sane state and nothing 
 needs to be special cased, as the next VCPU_RUN would simply go back into 
 guest mode - which is exactly what we want.

 The only tricky part is how we distinguish between "I need to live migrate"
 and "info registers". In the former case, %rip should be on VMRUN. In the
 latter, on the guest rip.

 We can split vmrun emulation into "save host state", "load guest state"
 and "prepare nested vmcb".  Then, when we load registers, if we see that
 we're in guest mode, we do just the "prepare nested vmcb" bit.

Or we just emulate a VMEXIT in the VCPU_FREEZE ioctl and set
%rip back to the VMRUN that entered the L2 guest. For 'info registers'
the VCPU_FREEZE ioctl will not be issued and the guest registers will be
displayed.
That way we don't need to migrate any additional state for SVM.

Joerg
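
A rough sketch of that flow, assuming a hypothetical VCPU_FREEZE handler
and invented field names (this is not an existing KVM interface):

/*
 * Hedged sketch of the VCPU_FREEZE idea: before migration, a nested
 * #VMEXIT that L1 never observes is emulated and %rip is rewound onto
 * the VMRUN instruction, so the next KVM_RUN on the destination simply
 * re-executes VMRUN and re-enters L2.
 */
#include <stdbool.h>
#include <stdint.h>

struct freeze_vcpu_sketch {
	bool     in_guest_mode;   /* currently running L2                  */
	uint64_t rip;             /* architectural %rip                    */
	uint64_t vmrun_rip;       /* address of the VMRUN that entered L2  */
};

static void emulate_nested_vmexit(struct freeze_vcpu_sketch *v)
{
	/* Write L2 state back to the VMCB (or vmcs12) in guest memory and
	 * restore L1 state - the part prepare_vmcs12()/nested_svm_vmexit()
	 * would normally do on a real nested exit. */
	v->in_guest_mode = false;
}

static int vcpu_freeze(struct freeze_vcpu_sketch *v)
{
	if (!v->in_guest_mode)
		return 0;                 /* plain L1 state, nothing to do */

	emulate_nested_vmexit(v);
	v->rip = v->vmrun_rip;            /* back onto VMRUN */
	return 0;
}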



Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-23 Thread Joerg Roedel
On Sun, May 22, 2011 at 10:32:39PM +0300, Nadav Har'El wrote:

 At the risk of sounding blasphemous, I'd like to make the case that perhaps
 the current nested-VMX design - regarding the IDT-vectoring-info-field
 handling - is actually closer than nested-SVM to the goal of moving clean
 nested-supporting logic into x86.c, instead of having ad-hoc, unnatural,
 workarounds.

Well, the nested SVM implementation is certainly not perfect in this
regard :)

 Therefore, I believe that the clean solution isn't to leave the original
 non-nested logic that always queues the idt-vectoring-info assuming it will
 be injected, and then if it shouldn't (because we want to exit during entry)
 we need to skip the entry once as a trick to avoid this wrong injection.
 
 Rather, a clean solution is, I think, to recognize that in nested
 virtualization, idt-vectoring-info is a different kind of beast than regular
 injected events, and it needs to be saved at exit time in a different field
 (which will of course be common to SVM and VMX). Only at entry time, after
 the regular injection code (which may cause a nested exit), we can call a
 x86_op to handle this special injection.

Things are complicated either way. If you keep the vectoring-info
separate from the kvm exception queue you need special logic to combine
the vectoring-info and the queue. For example, imagine something is
pending in the idt-vectoring info and the intercept causes another
exception for the guest. KVM then needs to turn this into a #DF. When
we just queue the vectoring-info into the exception queue we get this
implicitly without extra code. This is a cleaner way imho.

On the other hand, when using the exception queue we need to keep
extra information for nesting in the queue, because an event which is
just re-injected into L2 must not cause a nested vmexit, even if the
exception vector is intercepted by L1. But this is the same for SVM and
VMX, so we can do this in generic x86 code. This is not the case when
keeping track of idt-vectoring info separately in architecture code.

Regards,

Joerg
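
A simplified sketch of the queue-based handling Joerg describes; the
names are invented, and the contributory-exception details of real #DF
merging are omitted:

/*
 * Hedged sketch (not the real x86.c code): exit-time vectoring info is
 * decoded into the generic event queue and flagged "reinjected".  A
 * reinjected event bypasses the nested-intercept check, and a fresh
 * exception raised while one is already queued is merged into #DF -
 * the "implicit" #DF handling referred to above.
 */
#include <stdbool.h>
#include <stdint.h>

#define DF_VECTOR 8

struct event_queue_sketch {
	bool    pending;
	bool    reinjected;   /* came from IDT-vectoring info, not a new event */
	uint8_t vector;
};

/* Exit path: queue the event decoded from the vectoring info. */
static void queue_reinjected(struct event_queue_sketch *q, uint8_t vector)
{
	q->pending    = true;
	q->reinjected = true;
	q->vector     = vector;
}

/* Intercept handler raises a new exception for the guest. */
static void queue_exception(struct event_queue_sketch *q, uint8_t vector)
{
	if (q->pending)               /* delivery already in progress -> #DF */
		vector = DF_VECTOR;
	q->pending    = true;
	q->reinjected = false;
	q->vector     = vector;
}

/* Entry path: only events that are NOT re-injections may be reflected
 * to L1 as a nested vmexit, even if L1 intercepts that vector. */
static bool should_reflect_to_l1(const struct event_queue_sketch *q,
				 bool l1_intercepts_vector)
{
	return q->pending && !q->reinjected && l1_intercepts_vector;
}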



Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-23 Thread Avi Kivity

On 05/22/2011 10:32 PM, Nadav Har'El wrote:

On Thu, May 12, 2011, Gleb Natapov wrote about Re: [PATCH 0/30] nVMX: Nested VMX, 
v9:
But if my interpretation of the code is correct, SVM isn't much closer
than VMX to the goal of moving this logic to x86.c. When some logic is
moved there, both SVM and VMX code will need to change - perhaps even
considerably. So how will it be helpful to make VMX behave exactly like
SVM does now, when the latter will also need to change considerably?
  
  SVM design is much closer to the goal of moving the logic into x86.c
  because IIRC it does not bypass parsing of IDT vectoring info into an arch-
  independent structure. VMX code uses vmx->idt_vectoring_info directly.

At the risk of sounding blasphemous, I'd like to make the case that perhaps
the current nested-VMX design - regarding the IDT-vectoring-info-field
handling - is actually closer than nested-SVM to the goal of moving clean
nested-supporting logic into x86.c, instead of having ad-hoc, unnatural,
workarounds.

Let me explain, and see if you agree with my logic:

We discover at exit time whether the virtualization hardware (VMX or SVM)
exited while *delivering* an interrupt or exception to the current guest.
This is known as idt-vectoring-information in VMX.

What do we need to do with this idt-vectoring-information? In regular (non-
nested) guests, the answer is simple: On the next entry, we need to inject
this event again into the guest, so it can resume the delivery of the
same event it was trying to deliver. This is why the nested-unaware code
has a vmx_complete_interrupts which basically adds this idt-vectoring-info
into KVM's event queue, which on the next entry will be injected similarly
to the way virtual interrupts from userspace are injected, and so on.


The other thing we may need to do is to expose it to userspace in case
we're live migrating at exactly this point in time.



But with nested virtualization, this is *not* what is supposed to happen -
we do not *always* need to inject the event to the guest. We will only need
to inject the event if the next entry will be again to the same guest, i.e.,
L1 after L1, or L2 after L2. If the idt-vectoring-info came from L2, but
our next entry will be into L1 (i.e., a nested exit), we *shouldn't* inject
the event as usual, but should rather pass this idt-vectoring-info field
as the exit information that L1 gets (in nested vmx terminology, in vmcs12).

However, at the time of exit, we cannot know for sure whether L2 will actually
run next, because it is still possible that an injection from user space,
before the next entry, will cause us to decide to exit to L1.

Therefore, I believe that the clean solution isn't to leave the original
non-nested logic that always queues the idt-vectoring-info assuming it will
be injected, and then if it shouldn't (because we want to exit during entry)
we need to skip the entry once as a trick to avoid this wrong injection.

Rather, a clean solution is, I think, to recognize that in nested
virtualization, idt-vectoring-info is a different kind of beast than regular
injected events, and it needs to be saved at exit time in a different field
(which will of course be common to SVM and VMX). Only at entry time, after
the regular injection code (which may cause a nested exit), we can call a
x86_op to handle this special injection.

The benefit of this approach, which is closer to the current vmx code,
is, I think, that x86.c will contain clear, self-explanatory nested logic,
instead of relying on vmx.c or svm.c circumventing various x86.c functions
and mechanisms to do something different from what they were meant to do.



IMO this will cause confusion, especially with the user interfaces used 
to read/write pending events.


I think what we need to do is:

1. change ->interrupt_allowed() to return true if the interrupt flag is
unmasked OR if we're in a nested guest and intercepting interrupts
2. change ->set_irq() to cause a nested vmexit if in a nested guest and
we're intercepting interrupts

3. change ->nmi_allowed() and ->set_nmi() in a similar way
4. add a .injected flag to the interrupt queue which overrides the
nested vmexit for VM_ENTRY_INTR_INFO_FIELD and the svm equivalent; if
present, normal injection takes place (or an error vmexit if the
interrupt flag is clear and we cannot inject) - see the sketch below
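
A minimal sketch of the callbacks outlined above; the struct and helpers
are invented, not the actual kvm_x86_ops implementations:

/*
 * Hedged sketch: "allowed" also covers the case where injection would
 * simply be reflected to L1 as a nested vmexit, and a re-injected event
 * (item 4) takes the normal injection path instead.
 */
#include <stdbool.h>

struct irq_vcpu_sketch {
	bool guest_mode;            /* running L2                        */
	bool l1_intercepts_intr;    /* L1 traps external interrupts      */
	bool rflags_if;             /* guest interrupt flag              */
	bool injected;              /* event is a re-injection (item 4)  */
};

static void nested_vmexit_intr(struct irq_vcpu_sketch *v)        { /* reflect to L1 */ }
static void hw_inject_irq(struct irq_vcpu_sketch *v, int vector) { /* normal path   */ }

/* Item 1: interrupt is deliverable either directly or via a nested vmexit. */
static bool interrupt_allowed(const struct irq_vcpu_sketch *v)
{
	return v->rflags_if || (v->guest_mode && v->l1_intercepts_intr);
}

/* Items 2 and 4: cause the nested vmexit at injection time, unless the
 * event is flagged as a re-injection. */
static void set_irq(struct irq_vcpu_sketch *v, int vector)
{
	if (v->guest_mode && v->l1_intercepts_intr && !v->injected) {
		nested_vmexit_intr(v);
		return;
	}
	hw_inject_irq(v, vector);
}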





Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-23 Thread Joerg Roedel
On Mon, May 23, 2011 at 12:52:50PM +0300, Avi Kivity wrote:
 On 05/22/2011 10:32 PM, Nadav Har'El wrote:
 What do we need to do with this idt-vectoring-information? In regular (non-
 nested) guests, the answer is simple: On the next entry, we need to inject
 this event again into the guest, so it can resume the delivery of the
 same event it was trying to deliver. This is why the nested-unaware code
 has a vmx_complete_interrupts which basically adds this idt-vectoring-info
 into KVM's event queue, which on the next entry will be injected similarly
 to the way virtual interrupts from userspace are injected, and so on.

 The other thing we may need to do, is to expose it to userspace in case  
 we're live migrating at exactly this point in time.

About live-migration with nesting, we had discussed the idea of just
doing a VMEXIT(INTR) if the vcpu runs nested and we want to migrate.
The problem was that the hypervisor may not expect an INTR intercept.

How about doing an implicit VMEXIT in this case and an implicit VMRUN
after the vcpu is migrated? The nested hypervisor will not see the
vmexit and the vcpu will be in a state where it is safe to migrate. This
should work for nested-vmx too if the guest-state is written back to
guest memory on VMEXIT. Is this the case?

Joerg


Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-23 Thread Avi Kivity

On 05/23/2011 04:02 PM, Joerg Roedel wrote:

On Mon, May 23, 2011 at 12:52:50PM +0300, Avi Kivity wrote:
  On 05/22/2011 10:32 PM, Nadav Har'El wrote:
  What do we need to do with this idt-vectoring-information? In regular (non-
  nested) guests, the answer is simple: On the next entry, we need to inject
  this event again into the guest, so it can resume the delivery of the
  same event it was trying to deliver. This is why the nested-unaware code
  has a vmx_complete_interrupts which basically adds this idt-vectoring-info
  into KVM's event queue, which on the next entry will be injected similarly
  to the way virtual interrupts from userspace are injected, and so on.

  The other thing we may need to do, is to expose it to userspace in case
  we're live migrating at exactly this point in time.

About live-migration with nesting, we had discussed the idea of just
doing an VMEXIT(INTR) if the vcpu runs nested and we want to migrate.
The problem was that the hypervisor may not expect an INTR intercept.

How about doing an implicit VMEXIT in this case and an implicit VMRUN
after the vcpu is migrated?


What if there's something in EXIT_INT_INFO?


  The nested hypervisor will not see the
vmexit and the vcpu will be in a state where it is safe to migrate. This
should work for nested-vmx too if the guest-state is written back to
guest memory on VMEXIT. Is this the case?


It is the case with the current implementation, and we can/should make 
it so in future implementations, just before exit to userspace.  Or at 
least provide an ABI to sync memory.


But I don't see why we shouldn't just migrate all the hidden state (in 
guest mode flag, svm host paging mode, svm host interrupt state, vmcb 
address/vmptr, etc.).  It's more state, but no thinking is involved, so 
it's clearly superior.




Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-23 Thread Nadav Har'El
On Mon, May 23, 2011, Joerg Roedel wrote about Re: [PATCH 0/30] nVMX: Nested 
VMX, v9:
 About live-migration with nesting, we had discussed the idea of just
 doing an VMEXIT(INTR) if the vcpu runs nested and we want to migrate.
 The problem was that the hypervisor may not expect an INTR intercept.
 
 How about doing an implicit VMEXIT in this case and an implicit VMRUN
 after the vcpu is migrated? The nested hypervisor will not see the
 vmexit and the vcpu will be in a state where it is safe to migrate. This
 should work for nested-vmx too if the guest-state is written back to
 guest memory on VMEXIT. Is this the case?

Indeed, on nested exit (L2 to L1), the L2 guest state is written back to
vmcs12 (in guest memory). In theory, at that point, the vmcs02 (the vmcs
used by L0 to actually run L2) can be discarded, without risking losing
anything.

The receiving hypervisor will need to remember to do that implicit VMRUN
when it starts the guest; it also needs to know what the current L2
guest is - in VMX this would be vmx->nested.current_vmptr, which needs to be
migrated as well (on the other hand, other variables like
vmx->nested.current_vmcs12 will need to be recalculated by the receiver, and
not migrated as-is). I haven't started considering how to wrap up all these
pieces into a complete working solution - it is one of the things on my TODO
list after the basic nested VMX is merged.



Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-23 Thread Joerg Roedel
On Mon, May 23, 2011 at 04:08:00PM +0300, Avi Kivity wrote:
 On 05/23/2011 04:02 PM, Joerg Roedel wrote:

 About live-migration with nesting, we had discussed the idea of just
 doing an VMEXIT(INTR) if the vcpu runs nested and we want to migrate.
 The problem was that the hypervisor may not expect an INTR intercept.

 How about doing an implicit VMEXIT in this case and an implicit VMRUN
 after the vcpu is migrated?

 What if there's something in EXIT_INT_INFO?

On real SVM hardware EXIT_INT_INFO should only contain something for
exception and npt intercepts. These are all handled in the kernel and do
not cause an exit to user-space so that no valid EXIT_INT_INFO should be
around when we actually go back to user-space (so that migration can
happen).

The exception might be the #PF/NPT intercept when the guest is doing
very obscure things like putting an exception/interrupt handler on mmio
memory, but that isn't really supported by KVM anyway so I doubt we
should care.

Unless I miss something here we should be safe by just not looking at
EXIT_INT_INFO while migrating.

   The nested hypervisor will not see the
 vmexit and the vcpu will be in a state where it is safe to migrate. This
 should work for nested-vmx too if the guest-state is written back to
 guest memory on VMEXIT. Is this the case?

 It is the case with the current implementation, and we can/should make  
 it so in future implementations, just before exit to userspace.  Or at  
 least provide an ABI to sync memory.

 But I don't see why we shouldn't just migrate all the hidden state (in  
 guest mode flag, svm host paging mode, svm host interrupt state, vmcb  
 address/vmptr, etc.).  It's more state, but no thinking is involved, so  
 it's clearly superior.

An issue is that there is different state to migrate for Intel and AMD
hosts. If we keep all that information in guest memory the kvm kernel
module can handle those details and all KVM needs to migrate is the
in-guest-mode flag and the gpa of the vmcb/vmcs which is currently
executed. This state should be enough for Intel and AMD nesting.

The next benefit is that it works seamlessly even if the state that
needs to be transferred is extended (e.g. by emulating a new
virtualization hardware feature). This support can be implemented in the
kernel module and no changes to qemu are required.


Joerg



Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-23 Thread Avi Kivity

On 05/23/2011 04:40 PM, Joerg Roedel wrote:

On Mon, May 23, 2011 at 04:08:00PM +0300, Avi Kivity wrote:
  On 05/23/2011 04:02 PM, Joerg Roedel wrote:

  About live-migration with nesting, we had discussed the idea of just
  doing an VMEXIT(INTR) if the vcpu runs nested and we want to migrate.
  The problem was that the hypervisor may not expect an INTR intercept.

  How about doing an implicit VMEXIT in this case and an implicit VMRUN
  after the vcpu is migrated?

  What if there's something in EXIT_INT_INFO?

On real SVM hardware EXIT_INT_INFO should only contain something for
exception and npt intercepts. These are all handled in the kernel and do
not cause an exit to user-space so that no valid EXIT_INT_INFO should be
around when we actually go back to user-space (so that migration can
happen).

The exception might be the #PF/NPT intercept when the guest is doing
very obscure things like putting an exception/interrupt handler on mmio
memory, but that isn't really supported by KVM anyway so I doubt we
should care.

Unless I miss something here we should be safe by just not looking at
EXIT_INT_INFO while migrating.


Agree.


The nested hypervisor will not see the
  vmexit and the vcpu will be in a state where it is safe to migrate. This
  should work for nested-vmx too if the guest-state is written back to
  guest memory on VMEXIT. Is this the case?

  It is the case with the current implementation, and we can/should make
  it so in future implementations, just before exit to userspace.  Or at
  least provide an ABI to sync memory.

  But I don't see why we shouldn't just migrate all the hidden state (in
  guest mode flag, svm host paging mode, svm host interrupt state, vmcb
  address/vmptr, etc.).  It's more state, but no thinking is involved, so
  it's clearly superior.

An issue is that there is different state to migrate for Intel and AMD
hosts. If we keep all that information in guest memory the kvm kernel
module can handle those details and all KVM needs to migrate is the
in-guest-mode flag and the gpa of the vmcb/vmcs which is currently
executed. This state should be enough for Intel and AMD nesting.


I think for Intel there is no hidden state apart from in-guest-mode 
(there is the VMPTR, but it is an actual register accessible via 
instructions).  For svm we can keep the hidden state in the host 
state-save area (including the vmcb pointer).  The only risk is that svm 
will gain hardware support for nesting, and will choose a different 
format than ours.


An alternative is a fake MSR for storing this data, or just another 
get/set ioctl pair.  We'll have a flags field that says which fields are 
filled in.
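
A sketch of what such a get/set payload might carry; every name and flag
here is invented for illustration, not an existing KVM ABI:

/*
 * Hedged sketch: the flags word says which fields are filled in, so the
 * layout can grow when new virtualization features add hidden state.
 */
#include <stdint.h>

#define NESTED_STATE_GUEST_MODE  (1u << 0)  /* in_guest_mode is valid       */
#define NESTED_STATE_VMCB_GPA    (1u << 1)  /* SVM: current VMCB address    */
#define NESTED_STATE_VMPTR       (1u << 2)  /* VMX: current VMCS pointer    */
#define NESTED_STATE_HOST_STATE  (1u << 3)  /* SVM host paging/intr state   */

struct nested_state_sketch {
	uint32_t flags;             /* which fields below are filled in     */
	uint8_t  in_guest_mode;     /* vcpu was running L2 when saved       */
	uint8_t  pad[3];
	uint64_t vmcb_gpa;          /* SVM: gpa of the VMCB being executed  */
	uint64_t vmptr;             /* VMX: nested current_vmptr            */
	uint64_t host_state[8];     /* opaque, vendor-specific hidden state */
};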



The next benefit is that it works seemlessly even if the state that
needs to be transfered is extended (e.g. by emulating a new
virtualization hardware feature). This support can be implemented in the
kernel module and no changes to qemu are required.


I agree it's a benefit.  But I don't like making the fake vmexit part of 
live migration, if it turns out the wrong choice it's hard to undo it.




Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-23 Thread Nadav Har'El
On Mon, May 23, 2011, Avi Kivity wrote about Re: [PATCH 0/30] nVMX: Nested 
VMX, v9:
 I think for Intel there is no hidden state apart from in-guest-mode 
 (there is the VMPTR, but it is an actual register accessible via 
 instructions).

is_guest_mode(vcpu), vmx->nested.vmxon, vmx->nested.current_vmptr are the
only three things I can think of. Vmxon is actually more than a boolean
(there's also a vmxon pointer).

What do you mean by the current_vmptr being available through an instruction?
It is (VMPTRST), but this would be an instruction run on L1 (emulated by L0).
How would L0's user space use that instruction?

 I agree it's a benefit.  But I don't like making the fake vmexit part of 
 live migration, if it turns out the wrong choice it's hard to undo it.

If you don't do this fake vmexit, you'll need to migrate both vmcs01 and
the current vmcs02 - the fact that vmcs12 is in guest memory will not be
enough, because vmcs02 isn't copied back to vmcs12 until the nested exit.




Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-23 Thread Joerg Roedel
On Mon, May 23, 2011 at 04:52:47PM +0300, Avi Kivity wrote:
 On 05/23/2011 04:40 PM, Joerg Roedel wrote:

 The next benefit is that it works seemlessly even if the state that
 needs to be transfered is extended (e.g. by emulating a new
 virtualization hardware feature). This support can be implemented in the
 kernel module and no changes to qemu are required.

 I agree it's a benefit.  But I don't like making the fake vmexit part of  
 live migration, if it turns out the wrong choice it's hard to undo it.

Well, saving the state to the host-save-area and doing a fake-vmexit is
logically the same, only the memory where the information is stored
differs.

To user-space we can provide a VCPU_FREEZE/VCPU_UNFREEZE ioctl which
does all the necessary things.

Joerg



Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-23 Thread Avi Kivity

On 05/23/2011 05:10 PM, Nadav Har'El wrote:

On Mon, May 23, 2011, Avi Kivity wrote about Re: [PATCH 0/30] nVMX: Nested VMX, 
v9:
  I think for Intel there is no hidden state apart from in-guest-mode
  (there is the VMPTR, but it is an actual register accessible via
  instructions).

is_guest_mode(vcpu), vmx->nested.vmxon, vmx->nested.current_vmptr are the
only three things I can think of. Vmxon is actually more than a boolean
(there's also a vmxon pointer).

What do you mean by the current_vmptr being available through an instruction?
It is (VMPTRST), but this would be an instruction run on L1 (emulated by L0).
How would L0's user space use that instruction?


I mean that it is an architectural register rather than hidden state.  
It doesn't mean that L0 user space can use it.




  I agree it's a benefit.  But I don't like making the fake vmexit part of
  live migration, if it turns out the wrong choice it's hard to undo it.

If you don't do this fake vmexit, you'll need to migrate both vmcs01 and
the current vmcs02 - the fact that vmcs12 is in guest memory will not be
enough, because vmcs02 isn't copied back to vmcs12 until the nested exit.



vmcs01 and vmcs02 will both be generated from vmcs12.



Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-23 Thread Avi Kivity

On 05/23/2011 05:28 PM, Joerg Roedel wrote:

On Mon, May 23, 2011 at 04:52:47PM +0300, Avi Kivity wrote:
  On 05/23/2011 04:40 PM, Joerg Roedel wrote:

  The next benefit is that it works seemlessly even if the state that
  needs to be transfered is extended (e.g. by emulating a new
  virtualization hardware feature). This support can be implemented in the
  kernel module and no changes to qemu are required.

  I agree it's a benefit.  But I don't like making the fake vmexit part of
  live migration, if it turns out the wrong choice it's hard to undo it.

Well, saving the state to the host-save-area and doing a fake-vmexit is
logically the same, only the memory where the information is stored
differs.


Right.  I guess the main difference is 'info registers' after a stop.


To user-space we can provide a VCPU_FREEZE/VCPU_UNFREEZE ioctl which
does all the necessary things.



Or we can automatically flush things on any exit to userspace.  They 
should be very rare in guest mode.
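
A sketch of that idea under the assumption of placeholder helpers (this
is not the real vcpu_run path):

/*
 * Hedged sketch of "flush volatile nested state on every exit to
 * userspace": vmcs12 (or the VMCB) in guest memory is made current
 * before control returns to userspace, so a migration started at that
 * point sees consistent state.
 */
#include <stdbool.h>

struct run_vcpu_sketch {
	bool in_guest_mode;
};

static int enter_guest(struct run_vcpu_sketch *v)        { return 0; /* run until a userspace exit is needed */ }
static void sync_nested_state(struct run_vcpu_sketch *v) { /* prepare_vmcs12()-style writeback */ }

static int vcpu_run_sketch(struct run_vcpu_sketch *v)
{
	int ret;

	do {
		ret = enter_guest(v);
	} while (ret > 0);                /* stay in the kernel while possible */

	if (v->in_guest_mode)
		sync_nested_state(v);     /* flush L2 state to guest memory */

	return ret;
}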




Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-23 Thread Nadav Har'El
On Mon, May 23, 2011, Avi Kivity wrote about Re: [PATCH 0/30] nVMX: Nested 
VMX, v9:
 vmcs01 and vmcs02 will both be generated from vmcs12.

If you don't do a clean nested exit (from L2 to L1), vmcs02 can't be generated
from vmcs12... while L2 runs, it is possible that it modifies vmcs02 (e.g.,
non-trapped bits of guest_cr0), and these modifications are not copied back
to vmcs12 until the nested exit (when prepare_vmcs12() is called to perform
this task).

If you do a nested exit (a fake one), vmcs12 is made up to date, and then
indeed vmcs02 can be thrown away and regenerated.

Nadav.



Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-23 Thread Joerg Roedel
On Mon, May 23, 2011 at 05:34:20PM +0300, Avi Kivity wrote:
 On 05/23/2011 05:28 PM, Joerg Roedel wrote:

 To user-space we can provide a VCPU_FREEZE/VCPU_UNFREEZE ioctl which
 does all the necessary things.

 Or we can automatically flush things on any exit to userspace.  They  
 should be very rare in guest mode.

This would make nesting mostly transparent to migration, so it sounds
good in this regard.

I do not completely agree that user-space exits in guest-mode are rare;
this depends on the hypervisor in L1. In Hyper-V, for example, the
root-domain uses hardware virtualization too and has direct access to
devices (at least to some degree). IOIO is not intercepted in the
root-domain, for example. Not sure about the MMIO regions.


Joerg



Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-23 Thread Avi Kivity

On 05/23/2011 05:58 PM, Joerg Roedel wrote:

On Mon, May 23, 2011 at 05:34:20PM +0300, Avi Kivity wrote:
  On 05/23/2011 05:28 PM, Joerg Roedel wrote:

  To user-space we can provide a VCPU_FREEZE/VCPU_UNFREEZE ioctl which
  does all the necessary things.

  Or we can automatically flush things on any exit to userspace.  They
  should be very rare in guest mode.

This would make nesting mostly transparent to migration, so it sounds
good in this regard.

I do not completly agree that user-space exits in guest-mode are rare,
this depends on the hypervisor in the L1. In Hyper-V for example the
root-domain uses hardware virtualization too and has direct access to
devices (at least to some degree). IOIO is not intercepted in the
root-domain, for example. Not sure about the MMIO regions.


Good point.  We were also talking about passing through virtio (or even 
host) devices to the guest.


So an ioctl to flush volatile state to memory would be a good idea.



Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-23 Thread Avi Kivity

On 05/23/2011 05:44 PM, Nadav Har'El wrote:

On Mon, May 23, 2011, Avi Kivity wrote about Re: [PATCH 0/30] nVMX: Nested VMX, 
v9:
  vmcs01 and vmcs02 will both be generated from vmcs12.

If you don't do a clean nested exit (from L2 to L1), vmcs02 can't be generated
from vmcs12... while L2 runs, it is possible that it modifies vmcs02 (e.g.,
non-trapped bits of guest_cr0), and these modifications are not copied back
to vmcs12 until the nested exit (when prepare_vmcs12() is called to perform
this task).

If you do a nested exit (a fake one), vmcs12 is made up to date, and then
indeed vmcs02 can be thrown away and regenerated.


You would flush this state back to the vmcs.  But that just confirms 
Joerg's statement that a fake vmexit/vmrun is more or less equivalent.


The question is whether %rip points to the VMRUN/VMLAUNCH instruction, 
HOST_RIP (or the next instruction for svm), or to guest code.  But the 
actual things we need to do are all very similar subsets of a vmexit.




Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-23 Thread Alexander Graf

On 23.05.2011, at 17:23, Avi Kivity wrote:

 On 05/23/2011 05:44 PM, Nadav Har'El wrote:
 On Mon, May 23, 2011, Avi Kivity wrote about Re: [PATCH 0/30] nVMX: Nested 
 VMX, v9:
   vmcs01 and vmcs02 will both be generated from vmcs12.
 
 If you don't do a clean nested exit (from L2 to L1), vmcs02 can't be 
 generated
 from vmcs12... while L2 runs, it is possible that it modifies vmcs02 (e.g.,
 non-trapped bits of guest_cr0), and these modifications are not copied back
 to vmcs12 until the nested exit (when prepare_vmcs12() is called to perform
 this task).
 
 If you do a nested exit (a fake one), vmcs12 is made up to date, and then
 indeed vmcs02 can be thrown away and regenerated.
 
 You would flush this state back to the vmcs.  But that just confirms Joerg's 
 statement that a fake vmexit/vmrun is more or less equivalent.
 
 The question is whether %rip points to the VMRUN/VMLAUNCH instruction, 
 HOST_RIP (or the next instruction for svm), or to guest code.  But the actual 
 things we need to do are all very similar subsets of a vmexit.

%rip should certainly point to VMRUN. That way there is no need to save any 
information whatsoever, as the VMCB is already in sane state and nothing needs 
to be special cased, as the next VCPU_RUN would simply go back into guest mode 
- which is exactly what we want.

The only tricky part is how we distinguish between "I need to live migrate" and
"info registers". In the former case, %rip should be on VMRUN. In the latter,
on the guest rip.


Alex



Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-22 Thread Nadav Har'El
On Thu, May 12, 2011, Gleb Natapov wrote about Re: [PATCH 0/30] nVMX: Nested 
VMX, v9:
  But if my interpretation of the code is correct, SVM isn't much closer
  than VMX to the goal of moving this logic to x86.c. When some logic is
  moved there, both SVM and VMX code will need to change - perhaps even
  considerably. So how will it be helpful to make VMX behave exactly like
  SVM does now, when the latter will also need to change considerably?
  
 SVM design is much closer to the goal of moving the logic into x86.c
 because IIRC it does not bypass parsing of IDT vectoring info into an arch-
 independent structure. VMX code uses vmx->idt_vectoring_info directly.

At the risk of sounding blasphemous, I'd like to make the case that perhaps
the current nested-VMX design - regarding the IDT-vectoring-info-field
handling - is actually closer than nested-SVM to the goal of moving clean
nested-supporting logic into x86.c, instead of having ad-hoc, unnatural,
workarounds.

Let me explain, and see if you agree with my logic:

We discover at exit time whether the virtualization hardware (VMX or SVM)
exited while *delivering* an interrupt or exception to the current guest.
This is known as idt-vectoring-information in VMX.

What do we need to do with this idt-vectoring-information? In regular (non-
nested) guests, the answer is simple: On the next entry, we need to inject
this event again into the guest, so it can resume the delivery of the
same event it was trying to deliver. This is why the nested-unaware code
has a vmx_complete_interrupts which basically adds this idt-vectoring-info
into KVM's event queue, which on the next entry will be injected similarly
to the way virtual interrupts from userspace are injected, and so on.

But with nested virtualization, this is *not* what is supposed to happen -
we do not *always* need to inject the event to the guest. We will only need
to inject the event if the next entry will be again to the same guest, i.e.,
L1 after L1, or L2 after L2. If the idt-vectoring-info came from L2, but
our next entry will be into L1 (i.e., a nested exit), we *shouldn't* inject
the event as usual, but should rather pass this idt-vectoring-info field
as the exit information that L1 gets (in nested vmx terminology, in vmcs12).

However, at the time of exit, we cannot know for sure whether L2 will actually
run next, because it is still possible that an injection from user space,
before the next entry, will cause us to decide to exit to L1.

Therefore, I believe that the clean solution isn't to leave the original
non-nested logic that always queues the idt-vectoring-info assuming it will
be injected, and then if it shouldn't (because we want to exit during entry)
we need to skip the entry once as a trick to avoid this wrong injection.

Rather, a clean solution is, I think, to recognize that in nested
virtualization, idt-vectoring-info is a different kind of beast than regular
injected events, and it needs to be saved at exit time in a different field
(which will of course be common to SVM and VMX). Only at entry time, after
the regular injection code (which may cause a nested exit), we can call an
x86_op to handle this special injection.
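
A rough sketch of this proposal, with invented names standing in for
whatever the common field and x86 op would actually be called:

/*
 * Hedged sketch of the proposed flow, not existing KVM code: exit-time
 * vectoring info is parked in a vendor-neutral field instead of being
 * queued immediately; only at entry time, after the ordinary injection
 * code may have forced a nested exit, is it either re-injected into L2
 * or handed to L1 as its exit information.
 */
#include <stdbool.h>
#include <stdint.h>

struct vectoring_vcpu_sketch {
	bool     guest_mode;        /* still going to run L2 on the next entry? */
	bool     vectoring_valid;
	uint32_t vectoring_raw;     /* IDT-vectoring info / EXITINTINFO as-is   */
};

static void reinject_into_l2(struct vectoring_vcpu_sketch *v) { /* resume delivery in L2    */ }
static void report_in_vmcs12(struct vectoring_vcpu_sketch *v) { /* becomes L1's exit info   */ }

/* Exit path: remember the vectoring info, do not queue it yet. */
static void save_vectoring_info(struct vectoring_vcpu_sketch *v, uint32_t raw)
{
	v->vectoring_valid = raw != 0;
	v->vectoring_raw   = raw;
}

/* Entry path, called as an x86 op after the normal injection code
 * (which may have triggered a nested exit and cleared guest_mode). */
static void handle_vectoring_info(struct vectoring_vcpu_sketch *v)
{
	if (!v->vectoring_valid)
		return;
	if (v->guest_mode)
		reinject_into_l2(v);
	else
		report_in_vmcs12(v);
	v->vectoring_valid = false;
}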

The benefit of this approach, which is closer to the current vmx code,
is, I think, that x86.c will contain clear, self-explanatory nested logic,
instead of relying on vmx.c or svm.c circumventing various x86.c functions
and mechanisms to do something different from what they were meant to do.

What do you think?




Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-16 Thread Gleb Natapov
On Mon, May 16, 2011 at 02:11:40AM +0300, Nadav Har'El wrote:
 On Thu, May 12, 2011, Avi Kivity wrote about Re: [PATCH 0/30] nVMX: Nested 
 VMX, v9:
  Ah, yes.  For live migration to work, all vmcb state must be accessible 
  via vendor-independent accessors once an exit is completely handled.  
  For example, GPRs are accessible via kvm_register_read(), and without 
  nesting, interrupt state is stowed in the interrupt queue, but if you 
  keep IDT_VECTORING_INFO live between exit and entry, you can lose it if 
  you migrate at this point.
 
 Hi, I can quite easily save this state in a different place which is saved -
 The easiest will just be to use vmcs12, which has place for exactly the fields
 we want to save (and they are rewritten anyway when we exit to L1).
 
This will not address the problem that the state will not be visible to
generic logic in x86.c.

 Avi, would you you like me use this sort of solution to avoid the extra
 state? Of course, considering that anyway, live migration with nested VMX
 probably still doesn't work for a dozen other reasons :(
 
 Or do you consider this not enough, and rather that it is necessary that
 nested VMX should use exactly the same logic as nested SVM does - namely,
 use tricks like SVM's exit_required instead of our different tricks?
 
Given the two solutions I prefer the SVM one. Yes, I know that you asked Avi :)

--
Gleb.


Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-16 Thread Nadav Har'El
On Mon, May 16, 2011, Gleb Natapov wrote about Re: [PATCH 0/30] nVMX: Nested 
VMX, v9:
  Hi, I can quite easily save this state in a different place which is saved -
  The easiest will just be to use vmcs12, which has place for exactly the 
  fields
  we want to save (and they are rewritten anyway when we exit to L1).
  
 This will not address the problem that the state will not be visible to
 generic logic in x86.c.

Maybe I misunderstood your intention, but given that vmcs12 is in guest
memory, which is migrated as well, isn't that enough (for the live migration
issue)?




Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-16 Thread Gleb Natapov
On Mon, May 16, 2011 at 10:44:28AM +0300, Nadav Har'El wrote:
 On Mon, May 16, 2011, Gleb Natapov wrote about Re: [PATCH 0/30] nVMX: Nested 
 VMX, v9:
   Hi, I can quite easily save this state in a different place which is 
   saved -
   The easiest will just be to use vmcs12, which has place for exactly the 
   fields
   we want to save (and they are rewritten anyway when we exit to L1).
   
  This will not address the problem that the state will not be visible to
  generic logic in x86.c.
 
 Maybe I misunderstood your intention, but given that vmcs12 is in guest
 memory, which is migrated as well, isn't that enough (for the live migration
 issue)?
 
I pointed out two issues. Migration was the second and minor one, since there
is a long way to go before migration will work with nested guests anyway. The
first one was much more important, so let me repeat it again. To move
nested event handling into generic code, IDT vectoring info has to be
parsed into the data structure that the event injection code in x86.c actually
works with. And that code does not manipulate vmx->idt_vectoring_info
or the SVM analog directly, but works with the event queue instead. SVM does
this right and there is nothing I can see that prevents moving the SVM logic
into x86.c. I don't see how your VMX logic can be moved into x86.c as-is,
since it works on internal VMX fields directly.

--
Gleb.
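
A simplified sketch of the decoding Gleb describes, loosely modeled on
what svm_complete_interrupts()/vmx_complete_interrupts() do; the enum
and bit layout here are illustrative only:

/*
 * Hedged sketch: decode the raw exit-time vectoring word into the
 * generic event queue that x86.c works with.
 */
#include <stdbool.h>
#include <stdint.h>

enum event_type { EVT_NONE, EVT_INTR, EVT_NMI, EVT_EXCEPTION, EVT_SOFT_INT };

struct generic_event {
	enum event_type type;
	uint8_t         vector;
	bool            reinjected;   /* delivery was interrupted, redo it */
};

/* Assumed layout: bit 31 = valid, bits 10:8 = type, bits 7:0 = vector. */
static void decode_vectoring_info(uint32_t info, struct generic_event *ev)
{
	ev->type = EVT_NONE;
	if (!(info & (1u << 31)))
		return;

	switch ((info >> 8) & 0x7) {
	case 0: ev->type = EVT_INTR;      break;
	case 2: ev->type = EVT_NMI;       break;
	case 3: ev->type = EVT_EXCEPTION; break;
	case 4: ev->type = EVT_SOFT_INT;  break;
	default:                          return;
	}
	ev->vector     = info & 0xff;
	ev->reinjected = true;   /* x86.c re-delivers it on the next entry */
}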


Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-16 Thread Avi Kivity

On 05/16/2011 02:11 AM, Nadav Har'El wrote:

On Thu, May 12, 2011, Avi Kivity wrote about Re: [PATCH 0/30] nVMX: Nested VMX, 
v9:
  Ah, yes.  For live migration to work, all vmcb state must be accessible
  via vendor-independent accessors once an exit is completely handled.
  For example, GPRs are accessible via kvm_register_read(), and without
  nesting, interrupt state is stowed in the interrupt queue, but if you
  keep IDT_VECTORING_INFO live between exit and entry, you can lose it if
  you migrate at this point.

Hi, I can quite easily save this state in a different place which is saved -
The easiest will just be to use vmcs12, which has place for exactly the fields
we want to save (and they are rewritten anyway when we exit to L1).


You would still need ->valid_idt_vectoring_info to know you need special
handling, no?



Avi, would you you like me use this sort of solution to avoid the extra
state? Of course, considering that anyway, live migration with nested VMX
probably still doesn't work for a dozen other reasons :(

Or do you consider this not enough, and rather that it is necessary that
nested VMX should use exactly the same logic as nested SVM does - namely,
use tricks like SVM's exit_required instead of our different tricks?


I think svm is rather simple here using svm_complete_interrupts() to 
decode exit_int_info into the arch independent structures.  I don't 
think ->exit_required is a hack - it could probably be improved but I
think it does the right thing essentially.  For example 
svm_nmi_allowed() will return true if in guest mode and NMI interception 
is enabled.




Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-16 Thread Avi Kivity

On 05/16/2011 12:50 PM, Avi Kivity wrote:

Or do you consider this not enough, and rather that it is necessary that
nested VMX should use exactly the same logic as nested SVM does - namely,
use tricks like SVM's exit_required instead of our different tricks?



I think svm is rather simple here using svm_complete_interrupts() to 
decode exit_int_info into the arch independent structures.  I don't 
think ->exit_required is a hack - it could probably be improved but I
think it does the right thing essentially.  For example 
svm_nmi_allowed() will return true if in guest mode and NMI 
interception is enabled.




It would be better if it just returned true and let svm_inject_nmi() do 
the vmexit.
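
A sketch of that alternative, with invented helper names rather than the
actual svm.c functions:

/*
 * Hedged sketch: ->nmi_allowed() reports the NMI as allowed whenever L1
 * intercepts it, and the injection callback itself performs the nested
 * vmexit instead of injecting into L2.
 */
#include <stdbool.h>

struct nmi_vcpu_sketch {
	bool guest_mode;            /* currently running L2          */
	bool l1_intercepts_nmi;     /* L1 asked for NMI intercepts   */
	bool nmi_blocked;           /* ordinary NMI blocking         */
};

static void nested_vmexit_nmi(struct nmi_vcpu_sketch *v) { /* reflect the NMI to L1 */ }
static void hw_inject_nmi(struct nmi_vcpu_sketch *v)     { /* normal injection path */ }

static bool nmi_allowed(const struct nmi_vcpu_sketch *v)
{
	if (v->guest_mode && v->l1_intercepts_nmi)
		return true;              /* "allowed": it becomes a vmexit */
	return !v->nmi_blocked;
}

static void inject_nmi(struct nmi_vcpu_sketch *v)
{
	if (v->guest_mode && v->l1_intercepts_nmi) {
		nested_vmexit_nmi(v);
		return;
	}
	hw_inject_nmi(v);
}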




Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-15 Thread Nadav Har'El
On Thu, May 12, 2011, Avi Kivity wrote about Re: [PATCH 0/30] nVMX: Nested 
VMX, v9:
 Ah, yes.  For live migration to work, all vmcb state must be accessible 
 via vendor-independent accessors once an exit is completely handled.  
 For example, GPRs are accessible via kvm_register_read(), and without 
 nesting, interrupt state is stowed in the interrupt queue, but if you 
 keep IDT_VECTORING_INFO live between exit and entry, you can lose it if 
 you migrate at this point.

Hi, I can quite easily save this state in a different place which is saved -
the easiest would just be to use vmcs12, which has room for exactly the fields
we want to save (and they are rewritten anyway when we exit to L1).

Avi, would you like me to use this sort of solution to avoid the extra
state? Of course, considering that anyway, live migration with nested VMX
probably still doesn't work for a dozen other reasons :(

Or do you consider this not enough, and rather that it is necessary that
nested VMX should use exactly the same logic as nested SVM does - namely,
use tricks like SVM's exit_required instead of our different tricks?

Nadav.



Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-12 Thread Nadav Har'El
On Wed, May 11, 2011, Gleb Natapov wrote about Re: [PATCH 0/30] nVMX: Nested 
VMX, v9:
 On Mon, May 09, 2011 at 02:18:25PM +0300, Avi Kivity wrote:
..
  We can probably merge the next iteration if there aren't significant
  comments from others.
..
  The only worrying thing is the issue you raise in patch 8.  Is there
  a simple fix you can push that addresses correctness?

Hi Avi, I have fixed the few issues you raised in this round, including
a rewritten patch 8 that doesn't leave anything as a TODO. I am ready
to send another round of patches once we decide what to do about the issue
raised by Gleb yesterday:

 I still feel that interrupt injection path should be reworked to be
 SVM like before merging the code.

My colleague Abel Gordon and I both reviewed the relevant code-path in nested
VMX and nested SVM, and carefully (re)read the relevant parts of the VMX spec,
and we have come to several conclusions. Avi and Gleb, I'd love to hear your
comments on these issues:

Our first conclusion is that my description of our interrupt injection patch
was unduly negative. It wrongly suggested that our solution of using the
interrupt window was inefficient, and was just an interim plug until a
cleaner solution could be written. This was wrong. In fact, the somewhat
strange exit and interrupt window that happen when injecting an interrupt
into L2 are *necessary* for accurate emulation. I explain why this is so in the
new version of this patch, which is included below. Please take a look.

Our second conclusion (and I hope that I'm not offending anyone here)
is that the changes for L2 interrupt injection in both SVM and VMX are both
ugly - they are just ugly in different ways. Both modified the non-nested
code in strange places in strange and unexpected ways, and tried to circumvent
the usual code path in x86.c without touching x86.c. They just did this in
two slightly different ways, neither (I think) is inherently uglier than the
other:

For accurate emulation (as I explain in the patch below), both codes need to
cause x86.c to change its normal behavior: it checks interrupt_allowed()
and then (discovering that it isn't allowed) calls enable_irq_window(). We want
it to instead exit to L1, and then enable the irq window there. In the SVM code,
interrupt_allowed() is modified to always return false if nested, and
enable_irq_window() is modified to flag for an exit to L1 (which is performed
later) and turn on the interrupt window. In VMX, we modify the same places
but differently: in interrupt_allowed() we exit to L1 immediately (it's a
short operation, we didn't mind doing it in atomic context), and
enable_irq_window() doesn't need to be changed (it already runs in L1).

Continuing to survey the difference between nested VMX and SVM, there
were other different choices made besides the ones mentioned above. nested SVM
uses an additional trick, of skipping one round of running the guest, when
it discovered the need for an exit in the wrong place, so it can get to
the right place again. Nested VMX solved the same problems with other
mechanisms, like a separate piece of code for handling IDT_VECTORING_INFO,
and nested_run_pending. Some differences can also be explained by the different
design of (non-nested) vmx.c vs svm.c - e.g., svm_complete_interrupts() is
called during the handle_exit(), while vmx_complete_interrupts() is called
after handle_exit() has completed (in atomic context) - this is one of the
reasons the nested IDT_VECTORING_INFO path is different.

I think that both solutions are far from being beautiful or easy to understand.
Nested SVM is perhaps slightly less ugly but also has a small performance cost
(with the extra vcpu_run iteration doing nothing) - and I think neither is
inherently better than the other.

So I guess my question is, and Avi and Gleb I'd love your comments about this
question: Is it really beneficial that I rewrite the ugly nested-VMX
injection code to be somewhat-ugly in exactly the same way as the nested-SVM
injection code? Won't it be more beneficial to rewrite *both* codes to
be cleaner? This would probably mean changes to the common x86.c, that both
will use. For example, x86.c's injection code could check the nested case
itself, perhaps calling a special x86_op to handle the nested injection (exit,
set interrupt window, etc.) instead of calling the regular
interrupt_allowed/enable_irq_window and forcing those to be modified in
mysterious ways.

Now that there's a is_guest_mode(vcpu) function, more nested-related code
can be moved to x86.c, to make both nested VMX and nested SVM code cleaner.

Waiting to hear your opinions and suggestions,

Thanks,
Nadav.


=
Subject: [PATCH 23/31] nVMX: Correct handling of interrupt injection

The code in this patch correctly emulates external-interrupt injection
while a nested guest L2 is running. Because of this code's relative
oddity and un-obviousness, I include here a longer-than

Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-12 Thread Gleb Natapov
On Thu, May 12, 2011 at 06:42:28PM +0300, Nadav Har'El wrote:
 So I guess my question is, and Avi and Gleb I'd love your comments about this
 question: Is it really beneficial that I rewrite the ugly nested-VMX
 injection code to be somewhat-ugly in exactly the same way that nested-SVM
 injection code? Won't it be more beneficial to rewrite *both* codes to
 be cleaner? This would probably mean changes to the common x86.c, that both
 will use. For example, x86.c's injection code could check the nested case
 itself, perhaps calling a special x86_op to handle the nested injection (exit,
 set interrupt window, etc.) instead of calling the regular
 interrupt_allowed/enable_irq_window and forcing those to be modified in
 mysterious ways.
 
That is exactly what should be done and what I have in mind when I am
asking to change the VMX code to be SVM-like. To achieve what you outlined
above gradually, we need to move common VMX and SVM logic into x86.c
and then change the logic to be more nested-friendly.  If VMX has
different interrupt handling logic we will need an additional step:
making the SVM and VMX code similar (so it will be possible to move it
into x86.c).  All I am asking is to make this step now, before the merge,
while the code is still being actively developed.
 
 Now that there's a is_guest_mode(vcpu) function, more nested-related code
 can be moved to x86.c, to make both nested VMX and nested SVM code cleaner.
 
 Waiting to hear your opinions and suggestions,
 

--
Gleb.


Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-12 Thread Avi Kivity

On 05/12/2011 06:57 PM, Gleb Natapov wrote:

On Thu, May 12, 2011 at 06:42:28PM +0300, Nadav Har'El wrote:
  So I guess my question is, and Avi and Gleb I'd love your comments about this
  question: Is it really beneficial that I rewrite the ugly nested-VMX
  injection code to be somewhat-ugly in exactly the same way that nested-SVM
  injection code? Won't it be more beneficial to rewrite *both* codes to
  be cleaner? This would probably mean changes to the common x86.c, that both
  will use. For example, x86.c's injection code could check the nested case
  itself, perhaps calling a special x86_op to handle the nested injection 
(exit,
  set interrupt window, etc.) instead of calling the regular
  interrupt_allowed/enable_irq_window and forcing those to be modified in
  mysterious ways.

That is exactly what should be done and what I have in mind when I am
asking to change VMX code to be SVM like. To achieve what you outlined
above gradually we need to move common VMX and SVM logic into x86.c
and then change the logic to be more nested friendly.  If VMX will have
different interrupt handling logic we will have to have additional step:
making SVM and VMX code similar (so it will be possible to move it
into x86.c).  All I am asking is to make this step now, before merge,
while the code is still actively developed.



I don't think it's fair to ask Nadav to do a unification right now.  Or 
productive - there's a limit to the size of a patchset that can be 
carried outside.  Also it needs to be done in consideration with future 
changes to interrupt injection, like using the svm interrupt queue to 
avoid an interrupt window exit.


Are there vmx-only changes that you think can help?



Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-12 Thread Avi Kivity

On 05/12/2011 06:42 PM, Nadav Har'El wrote:

Our second conclusion (and I hope that I'm not offending anyone here)
is that the changes for L2 interrupt injection in both SVM and VMX are
ugly - they are just ugly in different ways. Both modified the non-nested
code in strange places in strange and unexpected ways, and tried to circumvent
the usual code path in x86.c without touching x86.c. They just did this in
two slightly different ways; neither (I think) is inherently uglier than the
other:

For accurate emulation (as I explain in the patch below), both codes need to
cause x86.c to change its normal behavior: it checks interrupt_allowed()
and then (discovering that interrupts are not allowed) calls
enable_irq_window(). We want it to instead exit to L1, and then enable the
irq window there. In the SVM code, interrupt_allowed() is modified to always
return false if nested, and enable_irq_window() is modified to flag for an
exit to L1 (which is performed later) and turn on the interrupt window. In
VMX, we modify the same places but differently: in interrupt_allowed() we
exit to L1 immediately (it's a short operation, so we didn't mind doing it in
atomic context), and enable_irq_window() doesn't need to be changed (it
already runs in L1).
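
Schematically, the SVM-side behaviour described here amounts to something like
the following. This is a heavily simplified sketch of the description above,
not the actual svm.c code (the real code also considers GIF, the interrupt
shadow, and so on):

    static int svm_interrupt_allowed_sketch(struct kvm_vcpu *vcpu)
    {
            if (is_guest_mode(vcpu))
                    return 0;       /* report "blocked" while L2 runs */

            return !!(kvm_get_rflags(vcpu) & X86_EFLAGS_IF);
    }

    static void svm_enable_irq_window_sketch(struct kvm_vcpu *vcpu)
    {
            struct vcpu_svm *svm = to_svm(vcpu);

            if (is_guest_mode(vcpu))
                    svm->nested.exit_required = true; /* exit to L1 later */

            /* ...and open the virtual interrupt window as usual... */
    }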


I think that interrupt_allowed() should return true in L2 (if L1 has 
configured external interrupts to be trapped), and interrupt injection 
modified to cause an exit instead of queueing an interrupt.  Note that 
on vmx, intercepted interrupt injection can take two different paths 
depending on whether the L1 wants interrupts acked or not.
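
A sketch of the alternative suggested here, with illustrative names only:
nested_exit_on_intr() stands for "L1 intercepts external interrupts",
nested_vmx_vmexit() for "emulate a VM exit to L1", and
vmx_interrupt_allowed_normally() for the existing RFLAGS/interruptibility
check. None of these is claimed to match the patch set as posted.

    static int vmx_interrupt_allowed_sketch(struct kvm_vcpu *vcpu)
    {
            /* If L1 traps external interrupts, they are "allowed" in L2:
             * delivering one simply means exiting to L1. */
            if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
                    return 1;

            return vmx_interrupt_allowed_normally(vcpu);
    }

    static void vmx_inject_irq_sketch(struct kvm_vcpu *vcpu)
    {
            if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
                    /* Reflect the interrupt as an
                     * EXIT_REASON_EXTERNAL_INTERRUPT exit to L1.  If L1
                     * asked for "acknowledge interrupt on exit", the vector
                     * also goes into the exit interruption-info field;
                     * otherwise L1 reads it from its interrupt controller -
                     * the two paths mentioned above. */
                    nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT);
                    return;
            }
            /* ...normal injection via VM_ENTRY_INTR_INFO_FIELD... */
    }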



Continuing to survey the differences between nested VMX and SVM, there
were other different choices made besides the ones mentioned above. nested SVM
uses an additional trick, of skipping one round of running the guest, when
it discovers the need for an exit in the wrong place, so it can get to
the right place again. Nested VMX solved the same problems with other
mechanisms, like a separate piece of code for handling IDT_VECTORING_INFO,
and nested_run_pending. Some differences can also be explained by the different
design of (non-nested) vmx.c vs svm.c - e.g., svm_complete_interrupts() is
called during handle_exit(), while vmx_complete_interrupts() is called
after handle_exit() has completed (in atomic context) - this is one of the
reasons the nested IDT_VECTORING_INFO path is different.

I think that both solutions are far from being beautiful or easy to understand.
Nested SVM is perhaps slightly less ugly but also has a small performance cost
(with the extra vcpu_run iteration doing nothing) - and I think neither is
inherently better than the other.

So I guess my question is, and Avi and Gleb I'd love your comments about this
question: Is it really beneficial that I rewrite the ugly nested-VMX
injection code to be somewhat-ugly in exactly the same way as the nested-SVM
injection code? Won't it be more beneficial to rewrite *both* codes to
be cleaner? This would probably mean changes to the common x86.c, that both
will use. For example, x86.c's injection code could check the nested case
itself, perhaps calling a special x86_op to handle the nested injection (exit,
set interrupt window, etc.) instead of calling the regular
interrupt_allowed/enable_irq_window and forcing those to be modified in
mysterious ways.

Now that there's an is_guest_mode(vcpu) function, more nested-related code
can be moved to x86.c, to make both nested VMX and nested SVM code cleaner.


I am fine with committing as is.  Later we can modify both vmx and svm 
to do the right thing (whatever that is), and later merge them into x86.c.


--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-12 Thread Nadav Har'El
Hi,

On Thu, May 12, 2011, Gleb Natapov wrote about Re: [PATCH 0/30] nVMX: Nested 
VMX, v9:
 That is exactly what should be done and what I have in mind when I am
 asking to change the VMX code to be SVM-like. To achieve what you outlined
 above gradually, we need to move common VMX and SVM logic into x86.c
 and then change the logic to be more nested-friendly.  If VMX ends up with
 different interrupt handling logic, we will need an additional step:
 making the SVM and VMX code similar (so it will be possible to move it
 into x86.c).

But if my interpretation of the code is correct, SVM isn't much closer
than VMX to the goal of moving this logic to x86.c. When some logic is
moved there, both SVM and VMX code will need to change - perhaps even
considerably. So how will it be helpful to make VMX behave exactly like
SVM does now, when the latter will also need to change considerably?

It sounds to me that working to move some nested-interrupt-injection related
logic to x86.c is a worthy effort (and I'd be happy to start some discussion
on how to best design it), but working to duplicate the exact idiosyncrasies
of the current SVM implementation in the VMX code is not as productive.
But as usual, I'm open to arguments (or dictums ;-)) that I'm wrong here.

By the way, I hope that I'm being fair to the nested SVM implementation when
I call some of the code there, after only a short review, "idiosyncrasies".
Basically I am working under the assumption that some of the modifications
there (I gave examples in my previous post) were done in the way they were
just to fit the mold of x86.c, and that it would have been possible to alter
x86.c in a way that could make the nested SVM code simpler - and quite
different (in the area of interrupt injection).

 All I am asking is to take this step now, before the merge,
 while the code is still actively developed.

The code will continue to be actively developed even after the merge :-)

Nadav.

-- 
Nadav Har'El|  Thursday, May 12 2011, 9 Iyyar 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |A city is a large community where people
http://nadav.harel.org.il   |are lonesome together.


Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-12 Thread Gleb Natapov
On Thu, May 12, 2011 at 07:31:15PM +0300, Nadav Har'El wrote:
 Hi,
 
 On Thu, May 12, 2011, Gleb Natapov wrote about Re: [PATCH 0/30] nVMX: Nested 
 VMX, v9:
  That is exactly what should be done and what I have in mind when I am
  asking to change the VMX code to be SVM-like. To achieve what you outlined
  above gradually, we need to move common VMX and SVM logic into x86.c
  and then change the logic to be more nested-friendly.  If VMX ends up with
  different interrupt handling logic, we will need an additional step:
  making the SVM and VMX code similar (so it will be possible to move it
  into x86.c).
 
 But if my interpretation of the code is correct, SVM isn't much closer
 than VMX to the goal of moving this logic to x86.c. When some logic is
 moved there, both SVM and VMX code will need to change - perhaps even
 considerably. So how will it be helpful to make VMX behave exactly like
 SVM does now, when the latter will also need to change considerably?
 
The SVM design is much closer to the goal of moving the logic into x86.c
because, IIRC, it does not bypass the parsing of IDT vectoring info into an
arch-independent structure. The VMX code uses vmx->idt_vectoring_info
directly. SVM is much closer to working migration with nested guests for the
same reason.
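
For illustration, doing this on the VMX side would look roughly like the
sketch below (simplified; NMIs, soft interrupts and error codes are omitted,
and the function name is made up):

    static void complete_intr_sketch(struct kvm_vcpu *vcpu, u32 vect_info)
    {
            u8 vector = vect_info & INTR_INFO_VECTOR_MASK;
            u32 type = vect_info & INTR_INFO_INTR_TYPE_MASK;

            if (!(vect_info & INTR_INFO_VALID_MASK))
                    return;

            switch (type) {
            case INTR_TYPE_EXT_INTR:
                    /* Re-queue through the arch-independent structure that
                     * x86.c (and save/restore) already knows about. */
                    kvm_queue_interrupt(vcpu, vector, false);
                    break;
            case INTR_TYPE_HARD_EXCEPTION:
                    kvm_queue_exception(vcpu, vector);
                    break;
            }
    }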

 It sounds to me that working to move some nested-interrupt-injection related
 logic to x86.c is a worthy effort (and I'd be happy to start some discussion
 on how to best design it), but working to duplicate the exact idiosyncrasies
 of the current SVM implementation in the VMX code is not as productive.
 But as usual, I'm open to arguments (or dictums ;-)) that I'm wrong here.
 
 By the way, I hope that I'm being fair to the nested SVM implementation when
 I call some of the code there, after only a short review, "idiosyncrasies".
 Basically I am working under the assumption that some of the modifications
 there (I gave examples in my previous post) were done in the way they were
 just to fit the mold of x86.c, and that it would have been possible to alter
 x86.c in a way that could make the nested SVM code simpler - and quite
 different (in the area of interrupt injection).
I think (I haven't looked at the code for a long time) it can benefit
from additional x86 ops callbacks, but it would be silly to add one set
of callbacks to support the SVM way of doing things and another set for VMX,
not because the archs are so different that unification is impossible (they
are not; they are very close in this area, in fact), but because the two
implementations are different.

 
  All I am asking is to take this step now, before the merge,
  while the code is still actively developed.
 
 The code will continue to be actively developed even after the merge :-)
 
Amen! :)

--
Gleb.


Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-12 Thread Avi Kivity

On 05/12/2011 07:51 PM, Gleb Natapov wrote:


  But if my interpretation of the code is correct, SVM isn't much closer
  than VMX to the goal of moving this logic to x86.c. When some logic is
  moved there, both SVM and VMX code will need to change - perhaps even
  considerably. So how will it be helpful to make VMX behave exactly like
  SVM does now, when the latter will also need to change considerably?

The SVM design is much closer to the goal of moving the logic into x86.c
because, IIRC, it does not bypass the parsing of IDT vectoring info into an
arch-independent structure. The VMX code uses vmx->idt_vectoring_info
directly. SVM is much closer to working migration with nested guests for the
same reason.


Ah, yes.  For live migration to work, all vmcb state must be accessible 
via vendor-independent accessors once an exit is completely handled.  
For example, GPRs are accessible via kvm_register_read(), and without 
nesting, interrupt state is stowed in the interrupt queue, but if you 
keep IDT_VECTORING_INFO live between exit and entry, you can lose it if 
you migrate at this point.
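
As a concrete, purely illustrative sketch of what the generic save path can
and cannot see: the field names follow the KVM of this era, but this is not
the actual KVM_GET_VCPU_EVENTS code.

    static void get_vcpu_events_sketch(struct kvm_vcpu *vcpu,
                                       struct kvm_vcpu_events *events)
    {
            /* Vendor-independent state: visible to userspace through
             * KVM_GET_VCPU_EVENTS, so it survives live migration. */
            events->interrupt.injected = vcpu->arch.interrupt.pending;
            events->interrupt.nr = vcpu->arch.interrupt.nr;

            /* Anything parked in a vendor-private field instead (e.g. a raw
             * vmx->idt_vectoring_info kept live between exit and entry)
             * never reaches this structure and is silently lost if we
             * migrate at that point. */
    }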


--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-12 Thread Gleb Natapov
On Thu, May 12, 2011 at 07:08:59PM +0300, Avi Kivity wrote:
 On 05/12/2011 06:57 PM, Gleb Natapov wrote:
 On Thu, May 12, 2011 at 06:42:28PM +0300, Nadav Har'El wrote:
   So I guess my question is, and Avi and Gleb I'd love your comments about 
  this
   question: Is it really beneficial that I rewrite the ugly nested-VMX
   injection code to be somewhat-ugly in exactly the same way as the nested-SVM
   injection code? Won't it be more beneficial to rewrite *both* codes to
   be cleaner? This would probably mean changes to the common x86.c, that 
  both
   will use. For example, x86.c's injection code could check the nested case
   itself, perhaps calling a special x86_op to handle the nested injection 
  (exit,
   set interrupt window, etc.) instead of calling the regular
   interrupt_allowed/enable_irq_window and forcing those to be modified in
   mysterious ways.
 
 That is exactly what should be done and what I have in mind when I am
 asking to change the VMX code to be SVM-like. To achieve what you outlined
 above gradually, we need to move common VMX and SVM logic into x86.c
 and then change the logic to be more nested-friendly.  If VMX ends up with
 different interrupt handling logic, we will need an additional step:
 making the SVM and VMX code similar (so it will be possible to move it
 into x86.c).  All I am asking is to take this step now, before the merge,
 while the code is still actively developed.
 
 
 I don't think it's fair to ask Nadav to do a unification right now.
Definitely. And I am not asking for it!

 Or productive - there's a limit to the size of a patchset that can
 be carried outside.  Also it needs to be done in consideration of
 future changes to interrupt injection, like using the svm interrupt
 queue to avoid an interrupt window exit.
 
 Are there vmx-only changes that you think can help?
 
I am asking for a vmx-only change actually: to make the interrupt handling
logic the same as SVM's. This will allow me, or you, or someone else to handle
the unification part later without rewriting VMX.
 
--
Gleb.


Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-11 Thread Gleb Natapov
On Mon, May 09, 2011 at 02:18:25PM +0300, Avi Kivity wrote:
 Okay, truly excellent.  The code is now a lot more readable, and I'm
 almost beginning to understand it.  The code comments are also very
 good, I wish we had the same quality comments in the rest of kvm.
 We can probably merge the next iteration if there aren't significant
 comments from others.
 
I still feel that the interrupt injection path should be reworked to be
SVM-like before merging the code.

 The only worrying thing is the issue you raise in patch 8.  Is there
 a simple fix you can push that addresses correctness?
 
 -- 
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.

--
Gleb.


Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-09 Thread Avi Kivity

On 05/08/2011 11:15 AM, Nadav Har'El wrote:

Hi,

This is the ninth iteration of the nested VMX patch set. This iteration
addresses all of the comments and requests that were raised by reviewers in
the previous rounds, with only a few exceptions listed below.

Some of the issues which were solved in this version include:

  * Overhauled the hardware VMCS (vmcs02) allocation. Previously we had up to
256 vmcs02s, one for each L2. Now we only have one, which is reused.
We also have a compile-time option VMCS02_POOL_SIZE to keep a bigger pool
of vmcs02s. This option will be useful in the future if vmcs02 won't be
filled from scratch on each entry from L1 to L2 (currently, it is).

  * The vmcs01 structure, containing a copy of all fields from L1's VMCS, was
unnecessary, as all the necessary values are either known to KVM or appear
in vmcs12. This structure is now gone for good.

  * There is no longer a vmcs_fields sub-structure that everyone disliked.
All the VMCS fields appear directly in the vmcs12 structure, which makes
the code simpler and more readable.

  * Make sure that the vmcs12 fields have fixed sizes and location, and add
some extra padding, to support live migration and improve future-proofing.

  * For some fields, nested exit used to fail to return the host-state as set
by L1. Fixed that.

  * nested_vmx_exit_handled (deciding whether to let L1 handle an exit, or handle it
in L0 and return to L2) is now more correct, and handles more exit reasons.

  * Complete overhaul of the cr0, exception bitmap, cr3 and cr4 handling code.
The code is now shorter (uses existing functions like kvm_set_cr3, etc.),
more readable, and more uniform (no pieces of code for enable_ept and not,
less special code for cr0.TS, and none of that ugly cr0.PG monkey-business).

  * Use kvm_register_write(), kvm_rip_read(), etc. Got rid of new and now
unneeded function sync_cached_regs_to_vmcs().

  * Fix return value of the VMX msrs to be more correct, and more constant
(not to needlessly vary on different hosts).

  * Added some more missing verifications to vmcs12's fields (cleanly failing
the nested entry if these verifications fail).

  * Expose the MSR-bitmap feature to L1. Every MSR access still exits to L0,
but slow exits to L1 are avoided when L1's MSR bitmap doesn't want it.

  * Removed or rate limited printouts which could be exploited by guests.

  * Fix VM_ENTRY_LOAD_IA32_PAT feature handling.

  * Fixed potential bug and verified that nested vmx now works with both
CONFIG_PREEMPT and CONFIG_SMP enabled.

  * Dozens of other code cleanups and bug fixes.

Only a few issues from previous reviews remain unaddressed. These are:

  * The interrupt injection and IDT_VECTORING_INFO_FIELD handling code was
still not rewritten. It works, though ;-)

  * No KVM autotests for nested VMX yet.

  * Merging of L0's and L1's MSR bitmaps (and IO bitmaps) is still not
supported. As explained above, the current code uses L1's MSR bitmap
to avoid costly exits to L1, but still suffers exits to L0 on each
MSR access in L2.

  * Still no option for disabling some capabilities advertised to L1.

  * No support for TPR_SHADOW feature for L1.

This new set of patches applies to the current KVM trunk (I checked with
082f9eced53d50c136e42d072598da4be4b9ba23).
If you wish, you can also check out an already-patched version of KVM from
branch nvmx9 of the repository:
 git://github.com/nyh/kvm-nested-vmx.git


About nested VMX:
-----------------

The following 30 patches implement nested VMX support. This feature enables
a guest to use the VMX APIs in order to run its own nested guests.
In other words, it allows running hypervisors (that use VMX) under KVM.
Multiple guest hypervisors can be run concurrently, and each of those can
in turn host multiple guests.

The theory behind this work, our implementation, and its performance
characteristics were presented in OSDI 2010 (the USENIX Symposium on
Operating Systems Design and Implementation). Our paper was titled
The Turtles Project: Design and Implementation of Nested Virtualization,
and was awarded Jay Lepreau Best Paper. The paper is available online, at:

http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf

This patch set does not include all the features described in the paper.
In particular, this patch set is missing nested EPT (L1 can't use EPT and
must use shadow page tables). It is also missing some features required to
run VMWare hypervisors as a guest. These missing features will be sent as
follow-on patches.

Running nested VMX:
-------------------

The nested VMX feature is currently disabled by default. It must be
explicitly enabled with the nested=1 option to the kvm-intel module.

No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the VMX CPU feature, so it must be
explicitly enabled, by giving 

Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-09 Thread Nadav Har'El
On Mon, May 09, 2011, Avi Kivity wrote about Re: [PATCH 0/30] nVMX: Nested 
VMX, v9:
 Okay, truly excellent.  The code is now a lot more readable, and I'm 
 almost beginning to understand it.  The code comments are also very 
 good, I wish we had the same quality comments in the rest of kvm.  We 
 can probably merge the next iteration if there aren't significant 
 comments from others.

Thanks!

 The only worrying thing is the issue you raise in patch 8.  Is there a 
 simple fix you can push that addresses correctness?

I'll fix this for the next iteration.
I wanted to avoid changing the existing vcpus_on_cpu machinery, but you're
probably right - it's better to just do this correctly once and for all than
to try to explain the problem away, or to pray that future processors also
continue to work properly if you forget to vmclear a vmcs...

-- 
Nadav Har'El|Monday, May  9 2011, 5 Iyyar 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |A diplomat thinks twice before saying
http://nadav.harel.org.il   |nothing.


[PATCH 0/30] nVMX: Nested VMX, v9

2011-05-08 Thread Nadav Har'El
Hi,

This is the ninth iteration of the nested VMX patch set. This iteration
addresses all of the comments and requests that were raised by reviewers in
the previous rounds, with only a few exceptions listed below.

Some of the issues which were solved in this version include:

 * Overhauled the hardware VMCS (vmcs02) allocation. Previously we had up to
   256 vmcs02s, one for each L2. Now we only have one, which is reused.
   We also have a compile-time option VMCS02_POOL_SIZE to keep a bigger pool
   of vmcs02s. This option will be useful in the future if vmcs02 won't be
   filled from scratch on each entry from L1 to L2 (currently, it is); a rough
   sketch of the pool idea follows this list.

 * The vmcs01 structure, containing a copy of all fields from L1's VMCS, was
   unnecessary, as all the necessary values are either known to KVM or appear
   in vmcs12. This structure is now gone for good.

 * There is no longer a vmcs_fields sub-structure that everyone disliked.
   All the VMCS fields appear directly in the vmcs12 structure, which makes
   the code simpler and more readable.

 * Make sure that the vmcs12 fields have fixed sizes and location, and add
   some extra padding, to support live migration and improve future-proofing.

 * For some fields, nested exit used to fail to return the host-state as set
   by L1. Fixed that.

 * nested_vmx_exit_handled (deciding whether to let L1 handle an exit, or handle it
   in L0 and return to L2) is now more correct, and handles more exit reasons.

 * Complete overhaul of the cr0, exception bitmap, cr3 and cr4 handling code.
   The code is now shorter (uses existing functions like kvm_set_cr3, etc.),
   more readable, and more uniform (no pieces of code for enable_ept and not,
   less special code for cr0.TS, and none of that ugly cr0.PG monkey-business).

 * Use kvm_register_write(), kvm_rip_read(), etc. Got rid of new and now
   unneeded function sync_cached_regs_to_vmcs().

 * Fix return value of the VMX msrs to be more correct, and more constant
   (not to needlessly vary on different hosts).

 * Added some more missing verifications to vmcs12's fields (cleanly failing
   the nested entry if these verifications fail).

 * Expose the MSR-bitmap feature to L1. Every MSR access still exits to L0,
   but slow exits to L1 are avoided when L1's MSR bitmap doesn't want it.

 * Removed or rate limited printouts which could be exploited by guests.

 * Fix VM_ENTRY_LOAD_IA32_PAT feature handling.

 * Fixed potential bug and verified that nested vmx now works with both
   CONFIG_PREEMPT and CONFIG_SMP enabled.

 * Dozens of other code cleanups and bug fixes.
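
For the vmcs02 pool mentioned in the first item above, the data involved is
roughly shaped like the sketch below; the struct and field names are
illustrative and need not match the actual patches.

    /* One pool entry: a hardware VMCS (vmcs02) remembered together with the
     * guest-physical address of the L1 VMCS (vmcs12) it last served, so a
     * later entry into the same L2 can reuse it. */
    struct vmcs02_entry {
            struct list_head list;
            gpa_t vmcs12_addr;
            struct vmcs *vmcs02;
    };

    /* With VMCS02_POOL_SIZE == 1 the pool degenerates into a single vmcs02
     * that is simply refilled from vmcs12 on every entry from L1 to L2. */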

Only a few issues from previous reviews remain unaddressed. These are:

 * The interrupt injection and IDT_VECTORING_INFO_FIELD handling code was
   still not rewritten. It works, though ;-)

 * No KVM autotests for nested VMX yet.

 * Merging of L0's and L1's MSR bitmaps (and IO bitmaps) is still not
   supported. As explained above, the current code uses L1's MSR bitmap
   to avoid costly exits to L1, but still suffers exits to L0 on each
   MSR access in L2 (a sketch of the bitmap check follows this list).

 * Still no option for disabling some capabilities advertised to L1.

 * No support for TPR_SHADOW feature for L1.
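
To illustrate the MSR-bitmap item above: deciding whether an MSR access that
exited to L0 while L2 was running must be reflected to L1 is essentially a
lookup in L1's bitmap, roughly as sketched below. The function name is made
up, and the sketch assumes L1's 4KB bitmap page is already mapped.

    static bool nested_msr_exit_to_l1(unsigned long *l1_bitmap, u32 msr,
                                      bool write)
    {
            int bit;

            /* VMX MSR bitmap layout: 1KB read-low, 1KB read-high,
             * 1KB write-low, 1KB write-high; one bit per MSR. */
            if (msr <= 0x1fff)
                    bit = msr;
            else if (msr >= 0xc0000000 && msr <= 0xc0001fff)
                    bit = 0x400 * 8 + (msr - 0xc0000000);
            else
                    return true;    /* not covered: always reflect to L1 */

            if (write)
                    bit += 0x800 * 8;

            return test_bit(bit, l1_bitmap);
    }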

This new set of patches applies to the current KVM trunk (I checked with
082f9eced53d50c136e42d072598da4be4b9ba23).
If you wish, you can also check out an already-patched version of KVM from
branch nvmx9 of the repository:
 git://github.com/nyh/kvm-nested-vmx.git


About nested VMX:
-----------------

The following 30 patches implement nested VMX support. This feature enables
a guest to use the VMX APIs in order to run its own nested guests.
In other words, it allows running hypervisors (that use VMX) under KVM.
Multiple guest hypervisors can be run concurrently, and each of those can
in turn host multiple guests.

The theory behind this work, our implementation, and its performance
characteristics were presented in OSDI 2010 (the USENIX Symposium on
Operating Systems Design and Implementation). Our paper was titled
The Turtles Project: Design and Implementation of Nested Virtualization,
and was awarded Jay Lepreau Best Paper. The paper is available online, at:

http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf

This patch set does not include all the features described in the paper.
In particular, this patch set is missing nested EPT (L1 can't use EPT and
must use shadow page tables). It is also missing some features required to
run VMWare hypervisors as a guest. These missing features will be sent as
follow-on patches.

Running nested VMX:
-------------------

The nested VMX feature is currently disabled by default. It must be
explicitly enabled with the nested=1 option to the kvm-intel module.

No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the VMX CPU feature, so it must be
explicitly enabled, by giving qemu one of the following options:

 -cpu host  (emulated CPU has all