Re: [PATCH 18/24] Exiting from L2 to L1

2010-09-14 Thread Nadav Har'El
On Mon, Jun 14, 2010, Avi Kivity wrote about Re: [PATCH 18/24] Exiting from L2 
to L1:
 +int switch_back_vmcs(struct kvm_vcpu *vcpu)
 +{
 IIUC vpids are not exposed to the guest yet?  So the VPID should not 
 change between guest and nested guest.

Right. Removed.

 
 +
 +vmcs_write64(IO_BITMAP_A, src-io_bitmap_a);
 +vmcs_write64(IO_BITMAP_B, src-io_bitmap_b);

 
 Why change the I/O bitmap?
...
 Or the msr bitmap?  After all, we're switching the entire vmcs?
...
 Why write all these?  What could have changed them?

You were right - most of these copies were utterly useless, and apparently
remained in our code since prehistory (when the same hardware vmcs was reused
for both L1 and L2). The great thing is that this removes several dozens of
vmwrites from the L2-L1 exit path in one fell swoop. In fact, the whole
function switch_back_vmcs() is now gone. Thanks for spotting this!


 +vmx_set_cr0(vcpu,
 +(vmx-nested.l1_shadow_vmcs-cr0_guest_host_mask
 +vmx-nested.l1_shadow_vmcs-cr0_read_shadow) |
 +(~vmx-nested.l1_shadow_vmcs-cr0_guest_host_mask
 +vmx-nested.l1_shadow_vmcs-guest_cr0));

 
 Helper wanted.

Done. The new helper looks like this:

static inline unsigned long guest_readable_cr0(struct vmcs_fields *fields)
{
return (fields-guest_cr0  ~fields-cr0_guest_host_mask) |
(fields-cr0_read_shadow  fields-cr0_guest_host_mask);
}

And is used in two places in the code (the above place, and another one).



 +vmcs_write64(GUEST_PDPTR3, src-guest_pdptr3);
 +}

 
 A kvm_set_cr3(src-host_cr3) should do all that and more, no?
 +
 +vmx_set_cr4(vcpu, vmx-nested.l1_state.cr4);
 +

 
 Again, the kvm_set_crx() versions have more meat.

I have to admit, I still don't understand this part of the code completely.
The fact that kvm_set_cr4 does more than vmx_set_cr4 doesn't always mean
that we want (or need) to do those things. In particular:
 
 +if (enable_ept) {
 +vcpu-arch.cr3 = vmx-nested.l1_shadow_vmcs-guest_cr3;
 +vmcs_write32(GUEST_CR3, 
 vmx-nested.l1_shadow_vmcs-guest_cr3);
 +} else {
 +kvm_set_cr3(vcpu, vmx-nested.l1_state.cr3);
 +}

 
 kvm_set_cr3() will load the PDPTRs in the EPT case (correctly in case 
 the nested guest was able to corrupted the guest's PDPT).

kvm_set_cr3 calls vmx_set_cr3 which calls ept_load_pdptrs which assumes
that vcpu-arch.pdptrs[] is correct. I am guessing (but am not yet completely
sure) that this code tried to avoid assuming that this cache is up-to-date.
Again, I still need to better understand this part of the code before I
can correct it (because, as the saying goes, if it ain't broken, don't fix
it - or at least fix it carefully).

 +kvm_mmu_reset_context(vcpu);
 +kvm_mmu_load(vcpu);

 
 kvm_mmu_load() unneeded, usually.

Again, I'll need to look into this deeper and report back.
In the meantime, attached below is the current version of this patch.

Thanks,
Nadav.

Subject: [PATCH 19/26] nVMX: Exiting from L2 to L1

This patch implements nested_vmx_vmexit(), called when the nested L2 guest
exits and we want to run its L1 parent and let it handle this exit.

Note that this will not necessarily be called on every L2 exit. L0 may decide
to handle a particular exit on its own, without L1's involvement; In that
case, L0 will handle the exit, and resume running L2, without running L1 and
without calling nested_vmx_vmexit(). The logic for deciding whether to handle
a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
will appear in the next patch.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |  242 ++-
 1 file changed, 241 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c  2010-09-14 15:02:37.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-09-14 15:02:37.0 +0200
@@ -4970,9 +4970,13 @@ static void vmx_complete_interrupts(stru
int type;
bool idtv_info_valid;
 
+   vmx-exit_reason = vmcs_read32(VM_EXIT_REASON);
+
+   if (vmx-nested.nested_mode)
+   return;
+
exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
 
-   vmx-exit_reason = vmcs_read32(VM_EXIT_REASON);
 
/* Handle machine checks before interrupts are enabled */
if ((vmx-exit_reason == EXIT_REASON_MCE_DURING_VMENTRY)
@@ -5961,6 +5965,242 @@ static int nested_vmx_run(struct kvm_vcp
return 1;
 }
 
+/*
+ * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
+ * because L2 may have changed some cr0 bits directly (see CRO_GUEST_HOST_MASK)
+ * without L0 trapping the change and updating vmcs12.
+ * This function returns the value we should put in vmcs12.guest_cr0. It's not
+ * enough to just return the current (vmcs02) GUEST_CR0. This may not be the
+ * guest cr0 that L1 thought it was giving its L2 guest - it is possible that
+ * L1 wished to allow its guest

Re: [PATCH 18/24] Exiting from L2 to L1

2010-09-13 Thread Avi Kivity

 On 09/12/2010 09:51 PM, Nadav Har'El wrote:



There are two cases where VM_ENTRY_INTR_INFO_FIELD can potentially not
be cleared by hardware:
...
If neither of these are valid, the code can be removed.  If only the
second, we might make it conditional.

Again, unless I'm misunderstanding what you mean, the hardware only
modified vmcs02 (the hardware vmcs), not vmcs12. We need to modify vmcs12
as well, to remove the valid bit. If we don't, when L1 enters into the same
L2 again, the same old value will be copied again from vmcs12 to vmcs02,
and cause an injection of the same interrupt again.


Yes, vmcs12 still needs to be updated.  So the code cannot be removed, 
just the vm



--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 18/24] Exiting from L2 to L1

2010-09-13 Thread Avi Kivity

 On 09/13/2010 07:53 AM, Sheng Yang wrote:



What can happen is that the contents of the field is transferred to the
IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field.

(question: on a failed vmentry, is this field cleared?)

I don't know the answer :-)

Sheng?

According to SDM 23.7 VM-ENTRY FAILURES DURING OR AFTER LOADING
GUEST STATE:

Although this process resembles that of a VM exit, many steps taken during a VM
exit do not occur for these VM-entry failures:
• Most VM-exit information fields are not updated (see step 1 above).
• The valid bit in the VM-entry interruption-information field is *not* cleared.
• The guest-state area is not modified.
• No MSRs are saved into the VM-exit MSR-store area.

So VM entry failure would result in _keep_ valid bit of 
VM_ENTRY_INTR_INFO_FIELD.




Ok.  So if the exit was actually due to a failed vmentry, then we do 
need the vmread... (or alternatively, we can avoid clearing the field in 
the first place).


So the following options should work:

1.  vmcs12-vm_entry_intr_info_field = 
vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
2.  if (!(exit_reason  FAILED_ENTRY)) vmcs12-vm_exit_intry_info_field 
= ~VALID;
3.  if (exit_reason  FAILED_ENTRY) vmcs12-vm_entry_intr_info_field = 
vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 18/24] Exiting from L2 to L1

2010-09-13 Thread Nadav Har'El
On Mon, Sep 13, 2010, Avi Kivity wrote about Re: [PATCH 18/24] Exiting from L2 
to L1:
 So the following options should work:
 
 1.  vmcs12-vm_entry_intr_info_field = 
 vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);

Right, this was the original code in the patch.

 2.  if (!(exit_reason  FAILED_ENTRY)) vmcs12-vm_exit_intry_info_field 
 = ~VALID;

I now prefer this code. It doesn't do vmread (but replaces it with a bunch of
extra instructions - which might be even slower overall...).

But the more interesting thing is that it doesn't copy irrelevant bits from
vmcs02 to vmcs12, bits that might not have been set by L1 but rather by L0
which previously injected an interrupt into the same L2. These bits shouldn't
matter (when !valid), but a nosy L1 might notice them...

 3.  if (exit_reason  FAILED_ENTRY) vmcs12-vm_entry_intr_info_field = 
 vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);

I think you meant the opposite condition?

if (!(exit_reason  FAILED_ENTRY)) vmcs12-vm_entry_intr_info_field = 
vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);

-- 
Nadav Har'El|   Monday, Sep 13 2010, 5 Tishri 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Always borrow money from pessimists. They
http://nadav.harel.org.il   |don't expect to be paid back.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 18/24] Exiting from L2 to L1

2010-09-13 Thread Avi Kivity

 On 09/13/2010 11:01 AM, Nadav Har'El wrote:



3.  if (exit_reason  FAILED_ENTRY) vmcs12-vm_entry_intr_info_field =
vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);

I think you meant the opposite condition?

if (!(exit_reason  FAILED_ENTRY)) vmcs12-vm_entry_intr_info_field =
vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);



Dunno, I think both are subtly broken.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 18/24] Exiting from L2 to L1

2010-09-12 Thread Nadav Har'El
Hi,

Continuing to work on the nested VMX patches,

On Mon, Jun 14, 2010, Avi Kivity wrote about Re: [PATCH 18/24] Exiting from L2 
to L1:
 On 06/13/2010 03:31 PM, Nadav Har'El wrote:
...
 +/* prepare_vmcs_12 is called when the nested L2 guest exits and we want to
 + * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this
 + * function updates it to reflect the state of the registers during the 
 exit,
...
 +vmcs12-tsc_offset = vmcs_read64(TSC_OFFSET);

 
 TSC_OFFSET cannot have changed.

Right. I cleaned up this function now, to only copy the fields that could
have changed, namely fields listed as guest-state or exit-information fields
in the spec. Control fields like this TSC_OFFSET and more examples you found
below, indeed could not have changed while L2 was running or during the exit.

 +vmcs12-guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);

 Without msr bitmaps, cannot change.

I added a TODO before this (and a couple of others) for future optimization.
I'm not even convinced how much quicker it is to check the MSR bitmap before
doing vmcs_read64, vs just to going ahead and vmreading it in any case.

 +vmcs12-vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);

 Can this change?

Well, according to the spec, (SDM vol 3B), VMCS link pointer is a guest-state
field, but it is listed as being for future expansion. I guess that with
current hardware, it cannot change, but for future hardware it might. I'm
not sure if it's wiser to ignore this field for now (and shave a bit off
the l2-l1 switch time), or just copy it anyway, as I do now.
What would you prefer?

 +if (vmcs_config.vmentry_ctrl  VM_ENTRY_LOAD_IA32_PAT)
 +vmcs12-guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);

 
 Should check for VM_EXIT_SAVE_IA32_PAT, no?

You're absolutely right. Fixed.

 +vmcs12-vm_entry_intr_info_field =
 +vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);

 
 Autocleared, no need to read.

Well, we need to clear the valid bit on exit, so we don't mistakenly inject
the same interrupt twice. There were two ways to do it: 1. clear it ourselves,
or 2. copy the value from vmcs02 where the processor already cleared it.
There are pros and cons for each approach, but I'll change like you suggest,
to clear it ourselves:

vmcs12-vm_entry_intr_info_field = ~INTR_INFO_VALID_MASK;

 +vmcs12-vm_instruction_error = vmcs_read32(VM_INSTRUCTION_ERROR);

 We don't want to pass this to the guest?

I didn't quite understand your question, but now that I look at it, I see
that VM_INSTRUCTION_ERROR has nothing to do with exits, but is only modified
when running VMX instructions, and our emulation of VMX instructions already
sets it appropriately, so no sense of copying it here.

 
 +vmcs12-vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
 +vmcs12-vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
 +vmcs12-vm_exit_intr_error_code = 
 vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
 +vmcs12-idt_vectoring_info_field =
 +vmcs_read32(IDT_VECTORING_INFO_FIELD);
 +vmcs12-idt_vectoring_error_code =
 +vmcs_read32(IDT_VECTORING_ERROR_CODE);
 +vmcs12-vm_exit_instruction_len = 
 vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
 +vmcs12-vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);

 
 For the above, if the host handles the exit, we must not clobber guest 
 fields.  A subsequent guest vmread will see the changed values even 
 though from its point of view a vmexit has not occurred.
 
 But no, that can't happen, since a vmread needs to have a vmexit first 
 to happen.  Still, best to delay this.+

All this code is in prepare_vmcs_12, which only gets called if we exit from
L2 to L1 - it doesn't get called when we exit from L2 to L0 (when the host
handles the exit).

As long as a certain field gets written to on *every* exit, and not just
on some of them, I believe we can safely copy the current values in vmcs02
to vmcs12, knowing these are current values from the current exit, and not
some old values we shouldn't be copying.

You may have a point (if that was your point?) that some fields are not
always written to - e.g., for most exits vm_exit_intr_info doesn't get
written to and just one valid bit is cleared. As the code is now, we copy
vmcs02's field, which might have been written earlier (e.g., during
an exit to L0) and not now, and an observant L1 might notice this value
it should not have seen. However, I don't see any problem with that, because
the valid bit would be correctly turned off, and the spec says that all
other bits are undefined (for irrelevant exits) and copying-old-values is one
legal setting for undefined bits...

 +/* If any of the CRO_GUEST_HOST_MASK bits are off, the L2 guest may
 + * have changed some cr0 bits without us ever saving them in the 
 shadow
 + * vmcs. So we need to save these changes now.
...
 +
 +vmcs12-guest_cr4 = vmcs_readl(GUEST_CR4);
 
 Can't we have the same issue

Re: [PATCH 18/24] Exiting from L2 to L1

2010-09-12 Thread Avi Kivity

 On 09/12/2010 04:05 PM, Nadav Har'El wrote:

Hi,

Continuing to work on the nested VMX patches,


Great.  Hopefully you can commit some time to it or you'll spend a lot 
of cycles just catching up.


Joerg just merged nested npt; as far as I can tell it is 100% in line 
with nested ept, but please take a look as well.





+   vmcs12-guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);


Without msr bitmaps, cannot change.

I added a TODO before this (and a couple of others) for future optimization.
I'm not even convinced how much quicker it is to check the MSR bitmap before
doing vmcs_read64, vs just to going ahead and vmreading it in any case.


IIRC we don't support msr bitmaps now, so no need to check anything.

In general, avoid vmcs reads as much as possible.  Just think of your 
code running in a guest!



+vmcs12-vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);

Can this change?

Well, according to the spec, (SDM vol 3B), VMCS link pointer is a guest-state
field, but it is listed as being for future expansion. I guess that with
current hardware, it cannot change, but for future hardware it might. I'm
not sure if it's wiser to ignore this field for now (and shave a bit off
the l2-l1 switch time), or just copy it anyway, as I do now.
What would you prefer?


If it changes in the future, it can only be under the influence of some 
control or at least guest-discoverable capability.  Since we don't 
expose such control or capability, there's no need to copy it.



+   vmcs12-vm_entry_intr_info_field =
+   vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);


Autocleared, no need to read.

Well, we need to clear the valid bit on exit, so we don't mistakenly inject
the same interrupt twice.


I don't think so.  We write this field as part of guest entry (that is, 
after the decision re which vmcs to use, yes?), so guest entry will 
always follow writing this field.  Since guest entry clears the field, 
reading it after an exit will necessarily return 0.


What can happen is that the contents of the field is transferred to the 
IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field.


(question: on a failed vmentry, is this field cleared?)


There were two ways to do it: 1. clear it ourselves,
or 2. copy the value from vmcs02 where the processor already cleared it.
There are pros and cons for each approach, but I'll change like you suggest,
to clear it ourselves:

vmcs12-vm_entry_intr_info_field= ~INTR_INFO_VALID_MASK;


That's really a temporary variable, I don't think you need to touch it.


+   vmcs12-vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
+   vmcs12-vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+   vmcs12-vm_exit_intr_error_code =
vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+   vmcs12-idt_vectoring_info_field =
+   vmcs_read32(IDT_VECTORING_INFO_FIELD);
+   vmcs12-idt_vectoring_error_code =
+   vmcs_read32(IDT_VECTORING_ERROR_CODE);
+   vmcs12-vm_exit_instruction_len =
vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+   vmcs12-vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);


For the above, if the host handles the exit, we must not clobber guest
fields.  A subsequent guest vmread will see the changed values even
though from its point of view a vmexit has not occurred.

But no, that can't happen, since a vmread needs to have a vmexit first
to happen.  Still, best to delay this.+

All this code is in prepare_vmcs_12, which only gets called if we exit from
L2 to L1 - it doesn't get called when we exit from L2 to L0 (when the host
handles the exit).


Ok.  That answers my concern.


As long as a certain field gets written to on *every* exit, and not just
on some of them, I believe we can safely copy the current values in vmcs02
to vmcs12, knowing these are current values from the current exit, and not
some old values we shouldn't be copying.

You may have a point (if that was your point?) that some fields are not
always written to - e.g., for most exits vm_exit_intr_info doesn't get
written to and just one valid bit is cleared. As the code is now, we copy
vmcs02's field, which might have been written earlier (e.g., during
an exit to L0) and not now, and an observant L1 might notice this value
it should not have seen. However, I don't see any problem with that, because
the valid bit would be correctly turned off, and the spec says that all
other bits are undefined (for irrelevant exits) and copying-old-values is one
legal setting for undefined bits...


No, I wasn't worried about that, simply misunderstood the code.


+   /* If any of the CRO_GUEST_HOST_MASK bits are off, the L2 guest may
+* have changed some cr0 bits without us ever saving them in the
shadow
+* vmcs. So we need to save these changes now.

...

+
+   vmcs12-guest_cr4 = vmcs_readl(GUEST_CR4);

Can't we have the same issue with cr4?

I guess we can. I didn't think this (giving guest full control over cr4 bits)
was happening in KVM, but 

Re: [PATCH 18/24] Exiting from L2 to L1

2010-09-12 Thread Nadav Har'El
On Sun, Sep 12, 2010, Avi Kivity wrote about Re: [PATCH 18/24] Exiting from L2 
to L1:
 Great.  Hopefully you can commit some time to it or you'll spend a lot 
 of cycles just catching up.

Right. I will.

 Joerg just merged nested npt; as far as I can tell it is 100% in line 
 with nested ept, but please take a look as well.

Indeed. Making nested EPT work based on the nested NPT work is one of the
first things I plan to do after the basic nested VMX patches are accepted.
As you know, we already have a version of nested EPT working in our testing
code, but I'll need to tweak it a bit to use the common nested NPT code.

 +  vmcs12-guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
 
 Without msr bitmaps, cannot change.
 I added a TODO before this (and a couple of others) for future 
 optimization.
 I'm not even convinced how much quicker it is to check the MSR bitmap 
 before
 doing vmcs_read64, vs just to going ahead and vmreading it in any case.
 
 IIRC we don't support msr bitmaps now, so no need to check anything.

I do think we support msr bitmaps... E.g., we have
nested_vmx_exit_handled_msr() to check whether L1 requires an exit for a
certain MSR access.  Where don't we support them? (but I'm not denying the
possiblity that this support still has holes or bugs).

 In general, avoid vmcs reads as much as possible.  Just think of your 
 code running in a guest!

Yes. On the other hand, I don't want to be sorry in the future when I want
to support some feature, but because I wanted to shave off 1% of the L2-L1
switching time, and 0.01% of the total run time (and I'm just making
numbers up...), I now need to find a dozen places where things need to change
to support this feature. On the other hand, this will likely happen anyway ;-)

 
 +vmcs12-vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
 Can this change?
..
 If it changes in the future, it can only be under the influence of some 
 control or at least guest-discoverable capability.  Since we don't 
 expose such control or capability, there's no need to copy it.

You convinced me. Removed it.

 +  vmcs12-vm_entry_intr_info_field =
 +  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
 
 Autocleared, no need to read.
 Well, we need to clear the valid bit on exit, so we don't mistakenly 
 inject
 the same interrupt twice.
 
 I don't think so.  We write this field as part of guest entry (that is, 
 after the decision re which vmcs to use, yes?), so guest entry will 
 always follow writing this field.  Since guest entry clears the field, 
 reading it after an exit will necessarily return 0.

Well, you obviously know the KVM code much better than I do, but from what
I saw, I thought (maybe I misunderstood) that in normal (non-nested) KVM,
this field only gets written on injection, not on every entry, so the code
relies on the fact that the processor turns off the valid bit during exit,
to avoid the same event being injected when the same field value is used for
another entry. I can only find code which resets this field in vmx_vcpu_reset(),
but that doesn't get called on every entry, right? Or am I missing something?

 What can happen is that the contents of the field is transferred to the 
 IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field.
 
 (question: on a failed vmentry, is this field cleared?)

I don't know the answer :-)

 There were two ways to do it: 1. clear it ourselves,
 or 2. copy the value from vmcs02 where the processor already cleared it.
 There are pros and cons for each approach, but I'll change like you 
 suggest,
 to clear it ourselves:
 
  vmcs12-vm_entry_intr_info_field= ~INTR_INFO_VALID_MASK;
 
 That's really a temporary variable, I don't think you need to touch it.

But we need to emulate the correct VMX behavior. According to the spec, the
valid bit on this field needs to be cleared on vmexit, so we need to do it
also on emulated exits from L2 to L1. If we're sure that we already cleared it
earlier, then fine, but if not (and like I said, I failed to find this code),
we need to do it now, on exit - either by explictly clearing the bit or by
copying a value where the processor cleared this bit (arguably, the former is
more correct emulation).

 I didn't mean register independent helper; one function for cr0 and one 
 function for cr4 so the reader can just see the name and pretend to 
 understand what it does, instead of seeing a bunch of incomprehensible 
 bitwise operations.

Ok, done:

/*
 * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
 * because L2 may have changed some cr0 bits directly (see CRO_GUEST_HOST_MASK)
 * without L0 trapping the change and updating vmcs12.
 * This function returns the value we should put in vmcs12.guest_cr0. It's not
 * enough to just return the current (vmcs02) GUEST_CR0. This may not be the
 * guest cr0 that L1 thought it was giving its L2 guest - it is possible that
 * L1 wished to allow its guest to set a cr0 bit directly, but we (L0) asked
 * to trap

Re: [PATCH 18/24] Exiting from L2 to L1

2010-09-12 Thread Avi Kivity

 On 09/12/2010 07:05 PM, Nadav Har'El wrote:



+   vmcs12-guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);


Without msr bitmaps, cannot change.

I added a TODO before this (and a couple of others) for future
optimization.
I'm not even convinced how much quicker it is to check the MSR bitmap
before
doing vmcs_read64, vs just to going ahead and vmreading it in any case.

IIRC we don't support msr bitmaps now, so no need to check anything.

I do think we support msr bitmaps... E.g., we have
nested_vmx_exit_handled_msr() to check whether L1 requires an exit for a
certain MSR access.  Where don't we support them? (but I'm not denying the
possiblity that this support still has holes or bugs).


I was just talking from memory.  If you do support them, that's great.

Note that kvm itself doesn't support give the guest control of 
DEBUGCTLMSR, so you should just be able to read it from the shadow value 
(which strangely doesn't exist - I'll post a fix).



In general, avoid vmcs reads as much as possible.  Just think of your
code running in a guest!

Yes. On the other hand, I don't want to be sorry in the future when I want
to support some feature, but because I wanted to shave off 1% of the L2-L1
switching time, and 0.01% of the total run time (and I'm just making
numbers up...), I now need to find a dozen places where things need to change
to support this feature. On the other hand, this will likely happen anyway ;-)


Well, with msrs you have two cases: those which the guest controls and 
those which are shadowed.  So all we need is a systematic way for 
dealing with the two types.



+   vmcs12-vm_entry_intr_info_field =
+   vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);


Autocleared, no need to read.

Well, we need to clear the valid bit on exit, so we don't mistakenly
inject
the same interrupt twice.

I don't think so.  We write this field as part of guest entry (that is,
after the decision re which vmcs to use, yes?), so guest entry will
always follow writing this field.  Since guest entry clears the field,
reading it after an exit will necessarily return 0.

Well, you obviously know the KVM code much better than I do, but from what
I saw, I thought (maybe I misunderstood) that in normal (non-nested) KVM,
this field only gets written on injection, not on every entry, so the code
relies on the fact that the processor turns off the valid bit during exit,
to avoid the same event being injected when the same field value is used for
another entry.


Correct.


I can only find code which resets this field in vmx_vcpu_reset(),
but that doesn't get called on every entry, right? Or am I missing something?


prepare_vmcs12() is called in response for a 2-1 vmexit which is first 
trapped by 0, yes?  Because it's called immediately after a vmexit, 
VM_ENTRY_INTR_INFO_FIELD is guaranteed to have been cleared by the 
processor.


There are two cases where VM_ENTRY_INTR_INFO_FIELD can potentially not 
be cleared by hardware:


1. if we call prepare_vmcs12() between injection and entry.  This cannot 
happen AFAICT.
2. if the vmexit was really a failed 1-2 vmentry, and if the processor 
doesn't clear VM_ENTRY_INTR_INFO_FIELD in response to vm entry failures 
(need to check scripture)


If neither of these are valid, the code can be removed.  If only the 
second, we might make it conditional.



What can happen is that the contents of the field is transferred to the
IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field.

(question: on a failed vmentry, is this field cleared?)

I don't know the answer :-)


Sheng?


There were two ways to do it: 1. clear it ourselves,
or 2. copy the value from vmcs02 where the processor already cleared it.
There are pros and cons for each approach, but I'll change like you
suggest,
to clear it ourselves:

vmcs12-vm_entry_intr_info_field= ~INTR_INFO_VALID_MASK;

That's really a temporary variable, I don't think you need to touch it.

But we need to emulate the correct VMX behavior. According to the spec, the
valid bit on this field needs to be cleared on vmexit, so we need to do it
also on emulated exits from L2 to L1. If we're sure that we already cleared it
earlier, then fine, but if not (and like I said, I failed to find this code),
we need to do it now, on exit - either by explictly clearing the bit or by
copying a value where the processor cleared this bit (arguably, the former is
more correct emulation).


Sorry, I misread it as vmx-idt_vectoring_info which is a temporary 
variable used to cache IDT_VECTORING_INFO.  Ignore my remark.



I didn't mean register independent helper; one function for cr0 and one
function for cr4 so the reader can just see the name and pretend to
understand what it does, instead of seeing a bunch of incomprehensible
bitwise operations.

Ok, done:

/*
  * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
  * because L2 may have changed some cr0 bits directly (see CRO_GUEST_HOST_MASK)
  * without L0 trapping 

Re: [PATCH 18/24] Exiting from L2 to L1

2010-09-12 Thread Nadav Har'El
On Sun, Sep 12, 2010, Avi Kivity wrote about Re: [PATCH 18/24] Exiting from L2 
to L1:
 +vmcs12-vm_entry_intr_info_field =
 +vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
...
 prepare_vmcs12() is called in response for a 2-1 vmexit which is first 
 trapped by 0, yes?  Because it's called immediately after a vmexit, 
 VM_ENTRY_INTR_INFO_FIELD is guaranteed to have been cleared by the 
 processor.

Indeed - it is cleared in the real vmcs, i.e., vmcs02, but we also need
to clear it in the emulated vmcs that L1 sees for L2, i.e., vmcs12.

The original code (quoted above) just copied the vmcs02 value to vmcs12,
which was (mostly) fine because the vmcs02 value has a correctly-cleared
bit - but you asked to avoid the vmread.

So the second option is to just explicitly remove the valid bit from
vmcs12-vm_entry_intro_info_field, which I do now.
vmcs12-vm_entry_intro_info_field was set by L1 before it entered L2, and now
that L2 is exiting back to L1, we need to clear the valid bit.

The more I think about it, the more I become convinced that the second
option is indeed better than the first option (the original code in the
patch).

 There are two cases where VM_ENTRY_INTR_INFO_FIELD can potentially not 
 be cleared by hardware:
...
 If neither of these are valid, the code can be removed.  If only the 
 second, we might make it conditional.

Again, unless I'm misunderstanding what you mean, the hardware only
modified vmcs02 (the hardware vmcs), not vmcs12. We need to modify vmcs12
as well, to remove the valid bit. If we don't, when L1 enters into the same
L2 again, the same old value will be copied again from vmcs12 to vmcs02,
and cause an injection of the same interrupt again.

And by the way, I haven't said this enough, but thanks for your continued
reviews and all your very useful corrections for these patches!

Nadav.

-- 
Nadav Har'El|   Sunday, Sep 12 2010, 5 Tishri 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |I before E except after C. We live in a
http://nadav.harel.org.il   |weird society!
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 18/24] Exiting from L2 to L1

2010-09-12 Thread Sheng Yang
On Monday 13 September 2010 01:21:29 Avi Kivity wrote:
   On 09/12/2010 07:05 PM, Nadav Har'El wrote:
  I don't think so.  We write this field as part of guest entry (that is,
  after the decision re which vmcs to use, yes?), so guest entry will
  always follow writing this field.  Since guest entry clears the field,
  reading it after an exit will necessarily return 0.
  
  Well, you obviously know the KVM code much better than I do, but from
  what I saw, I thought (maybe I misunderstood) that in normal
  (non-nested) KVM, this field only gets written on injection, not on
  every entry, so the code relies on the fact that the processor turns off
  the valid bit during exit, to avoid the same event being injected when
  the same field value is used for another entry.
 
 Correct.
 
  I can only find code which resets this field in vmx_vcpu_reset(),
  but that doesn't get called on every entry, right? Or am I missing
  something?
 
 prepare_vmcs12() is called in response for a 2-1 vmexit which is first
 trapped by 0, yes?  Because it's called immediately after a vmexit,
 VM_ENTRY_INTR_INFO_FIELD is guaranteed to have been cleared by the
 processor.
 
 There are two cases where VM_ENTRY_INTR_INFO_FIELD can potentially not
 be cleared by hardware:
 
 1. if we call prepare_vmcs12() between injection and entry.  This cannot
 happen AFAICT.
 2. if the vmexit was really a failed 1-2 vmentry, and if the processor
 doesn't clear VM_ENTRY_INTR_INFO_FIELD in response to vm entry failures
 (need to check scripture)
 
 If neither of these are valid, the code can be removed.  If only the
 second, we might make it conditional.
 
  What can happen is that the contents of the field is transferred to the
  IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field.
  
  (question: on a failed vmentry, is this field cleared?)
  
  I don't know the answer :-)
 
 Sheng?

According to SDM 23.7 VM-ENTRY FAILURES DURING OR AFTER LOADING
GUEST STATE:

Although this process resembles that of a VM exit, many steps taken during a VM 
exit do not occur for these VM-entry failures:
• Most VM-exit information fields are not updated (see step 1 above).
• The valid bit in the VM-entry interruption-information field is *not* cleared.
• The guest-state area is not modified.
• No MSRs are saved into the VM-exit MSR-store area.

So VM entry failure would result in _keep_ valid bit of 
VM_ENTRY_INTR_INFO_FIELD.

--
regards
Yang, Sheng
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 18/24] Exiting from L2 to L1

2010-06-13 Thread Nadav Har'El
This patch implements nested_vmx_vmexit(), called when the nested L2 guest
exits and we want to run its L1 parent and let it handle this exit.

Note that this will not necessarily be called on every L2 exit. L0 may decide
to handle a particular exit on its own, without L1's involvement; In that
case, L0 will handle the exit, and resume running L2, without running L1 and
without calling nested_vmx_vmexit(). The logic for deciding whether to handle
a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
will appear in the next patch.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
--- .before/arch/x86/kvm/vmx.c  2010-06-13 15:01:30.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2010-06-13 15:01:30.0 +0300
@@ -5080,9 +5080,13 @@ static void vmx_complete_interrupts(stru
int type;
bool idtv_info_valid;
 
+   vmx-exit_reason = vmcs_read32(VM_EXIT_REASON);
+
+   if (vmx-nested.nested_mode)
+   return;
+
exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
 
-   vmx-exit_reason = vmcs_read32(VM_EXIT_REASON);
 
/* Handle machine checks before interrupts are enabled */
if ((vmx-exit_reason == EXIT_REASON_MCE_DURING_VMENTRY)
@@ -5978,6 +5982,278 @@ static int nested_vmx_run(struct kvm_vcp
return 1;
 }
 
+/* prepare_vmcs_12 is called when the nested L2 guest exits and we want to
+ * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this
+ * function updates it to reflect the state of the registers during the exit,
+ * and to reflect some changes that happened while L2 was running (and perhaps
+ * made some exits which were handled directly by L0 without going back to L1).
+ */
+void prepare_vmcs_12(struct kvm_vcpu *vcpu)
+{
+   struct shadow_vmcs *vmcs12 = get_shadow_vmcs(vcpu);
+
+   vmcs12-guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+   vmcs12-guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+   vmcs12-guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+   vmcs12-guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+   vmcs12-guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+   vmcs12-guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+   vmcs12-guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+   vmcs12-guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+
+   vmcs12-tsc_offset = vmcs_read64(TSC_OFFSET);
+   vmcs12-guest_physical_address = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
+   vmcs12-vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
+   vmcs12-guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+   if (vmcs_config.vmentry_ctrl  VM_ENTRY_LOAD_IA32_PAT)
+   vmcs12-guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
+   vmcs12-cr3_target_count = vmcs_read32(CR3_TARGET_COUNT);
+   vmcs12-vm_entry_intr_info_field =
+   vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
+   vmcs12-vm_entry_exception_error_code =
+   vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE);
+   vmcs12-vm_entry_instruction_len =
+   vmcs_read32(VM_ENTRY_INSTRUCTION_LEN);
+   vmcs12-vm_instruction_error = vmcs_read32(VM_INSTRUCTION_ERROR);
+   vmcs12-vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
+   vmcs12-vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+   vmcs12-vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+   vmcs12-idt_vectoring_info_field =
+   vmcs_read32(IDT_VECTORING_INFO_FIELD);
+   vmcs12-idt_vectoring_error_code =
+   vmcs_read32(IDT_VECTORING_ERROR_CODE);
+   vmcs12-vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+   vmcs12-vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+   vmcs12-guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
+   vmcs12-guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
+   vmcs12-guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
+   vmcs12-guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
+   vmcs12-guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
+   vmcs12-guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
+   vmcs12-guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
+   vmcs12-guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
+   vmcs12-guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
+   vmcs12-guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT);
+   vmcs12-guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES);
+   vmcs12-guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
+   vmcs12-guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
+   vmcs12-guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES);
+   vmcs12-guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES);
+   vmcs12-guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES);
+   vmcs12-guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES);
+   vmcs12-guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES);
+   vmcs12-guest_interruptibility_info =
+