Re: [PATCH 18/24] Exiting from L2 to L1
On Mon, Jun 14, 2010, Avi Kivity wrote about Re: [PATCH 18/24] Exiting from L2 to L1: +int switch_back_vmcs(struct kvm_vcpu *vcpu) +{ IIUC vpids are not exposed to the guest yet? So the VPID should not change between guest and nested guest. Right. Removed. + +vmcs_write64(IO_BITMAP_A, src-io_bitmap_a); +vmcs_write64(IO_BITMAP_B, src-io_bitmap_b); Why change the I/O bitmap? ... Or the msr bitmap? After all, we're switching the entire vmcs? ... Why write all these? What could have changed them? You were right - most of these copies were utterly useless, and apparently remained in our code since prehistory (when the same hardware vmcs was reused for both L1 and L2). The great thing is that this removes several dozens of vmwrites from the L2-L1 exit path in one fell swoop. In fact, the whole function switch_back_vmcs() is now gone. Thanks for spotting this! +vmx_set_cr0(vcpu, +(vmx-nested.l1_shadow_vmcs-cr0_guest_host_mask +vmx-nested.l1_shadow_vmcs-cr0_read_shadow) | +(~vmx-nested.l1_shadow_vmcs-cr0_guest_host_mask +vmx-nested.l1_shadow_vmcs-guest_cr0)); Helper wanted. Done. The new helper looks like this: static inline unsigned long guest_readable_cr0(struct vmcs_fields *fields) { return (fields-guest_cr0 ~fields-cr0_guest_host_mask) | (fields-cr0_read_shadow fields-cr0_guest_host_mask); } And is used in two places in the code (the above place, and another one). +vmcs_write64(GUEST_PDPTR3, src-guest_pdptr3); +} A kvm_set_cr3(src-host_cr3) should do all that and more, no? + +vmx_set_cr4(vcpu, vmx-nested.l1_state.cr4); + Again, the kvm_set_crx() versions have more meat. I have to admit, I still don't understand this part of the code completely. The fact that kvm_set_cr4 does more than vmx_set_cr4 doesn't always mean that we want (or need) to do those things. In particular: +if (enable_ept) { +vcpu-arch.cr3 = vmx-nested.l1_shadow_vmcs-guest_cr3; +vmcs_write32(GUEST_CR3, vmx-nested.l1_shadow_vmcs-guest_cr3); +} else { +kvm_set_cr3(vcpu, vmx-nested.l1_state.cr3); +} kvm_set_cr3() will load the PDPTRs in the EPT case (correctly in case the nested guest was able to corrupted the guest's PDPT). kvm_set_cr3 calls vmx_set_cr3 which calls ept_load_pdptrs which assumes that vcpu-arch.pdptrs[] is correct. I am guessing (but am not yet completely sure) that this code tried to avoid assuming that this cache is up-to-date. Again, I still need to better understand this part of the code before I can correct it (because, as the saying goes, if it ain't broken, don't fix it - or at least fix it carefully). +kvm_mmu_reset_context(vcpu); +kvm_mmu_load(vcpu); kvm_mmu_load() unneeded, usually. Again, I'll need to look into this deeper and report back. In the meantime, attached below is the current version of this patch. Thanks, Nadav. Subject: [PATCH 19/26] nVMX: Exiting from L2 to L1 This patch implements nested_vmx_vmexit(), called when the nested L2 guest exits and we want to run its L1 parent and let it handle this exit. Note that this will not necessarily be called on every L2 exit. L0 may decide to handle a particular exit on its own, without L1's involvement; In that case, L0 will handle the exit, and resume running L2, without running L1 and without calling nested_vmx_vmexit(). The logic for deciding whether to handle a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(), will appear in the next patch. Signed-off-by: Nadav Har'El n...@il.ibm.com --- arch/x86/kvm/vmx.c | 242 ++- 1 file changed, 241 insertions(+), 1 deletion(-) --- .before/arch/x86/kvm/vmx.c 2010-09-14 15:02:37.0 +0200 +++ .after/arch/x86/kvm/vmx.c 2010-09-14 15:02:37.0 +0200 @@ -4970,9 +4970,13 @@ static void vmx_complete_interrupts(stru int type; bool idtv_info_valid; + vmx-exit_reason = vmcs_read32(VM_EXIT_REASON); + + if (vmx-nested.nested_mode) + return; + exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); - vmx-exit_reason = vmcs_read32(VM_EXIT_REASON); /* Handle machine checks before interrupts are enabled */ if ((vmx-exit_reason == EXIT_REASON_MCE_DURING_VMENTRY) @@ -5961,6 +5965,242 @@ static int nested_vmx_run(struct kvm_vcp return 1; } +/* + * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date + * because L2 may have changed some cr0 bits directly (see CRO_GUEST_HOST_MASK) + * without L0 trapping the change and updating vmcs12. + * This function returns the value we should put in vmcs12.guest_cr0. It's not + * enough to just return the current (vmcs02) GUEST_CR0. This may not be the + * guest cr0 that L1 thought it was giving its L2 guest - it is possible that + * L1 wished to allow its guest
Re: [PATCH 18/24] Exiting from L2 to L1
On 09/12/2010 09:51 PM, Nadav Har'El wrote: There are two cases where VM_ENTRY_INTR_INFO_FIELD can potentially not be cleared by hardware: ... If neither of these are valid, the code can be removed. If only the second, we might make it conditional. Again, unless I'm misunderstanding what you mean, the hardware only modified vmcs02 (the hardware vmcs), not vmcs12. We need to modify vmcs12 as well, to remove the valid bit. If we don't, when L1 enters into the same L2 again, the same old value will be copied again from vmcs12 to vmcs02, and cause an injection of the same interrupt again. Yes, vmcs12 still needs to be updated. So the code cannot be removed, just the vm -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 18/24] Exiting from L2 to L1
On 09/13/2010 07:53 AM, Sheng Yang wrote: What can happen is that the contents of the field is transferred to the IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field. (question: on a failed vmentry, is this field cleared?) I don't know the answer :-) Sheng? According to SDM 23.7 VM-ENTRY FAILURES DURING OR AFTER LOADING GUEST STATE: Although this process resembles that of a VM exit, many steps taken during a VM exit do not occur for these VM-entry failures: • Most VM-exit information fields are not updated (see step 1 above). • The valid bit in the VM-entry interruption-information field is *not* cleared. • The guest-state area is not modified. • No MSRs are saved into the VM-exit MSR-store area. So VM entry failure would result in _keep_ valid bit of VM_ENTRY_INTR_INFO_FIELD. Ok. So if the exit was actually due to a failed vmentry, then we do need the vmread... (or alternatively, we can avoid clearing the field in the first place). So the following options should work: 1. vmcs12-vm_entry_intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); 2. if (!(exit_reason FAILED_ENTRY)) vmcs12-vm_exit_intry_info_field = ~VALID; 3. if (exit_reason FAILED_ENTRY) vmcs12-vm_entry_intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 18/24] Exiting from L2 to L1
On Mon, Sep 13, 2010, Avi Kivity wrote about Re: [PATCH 18/24] Exiting from L2 to L1: So the following options should work: 1. vmcs12-vm_entry_intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); Right, this was the original code in the patch. 2. if (!(exit_reason FAILED_ENTRY)) vmcs12-vm_exit_intry_info_field = ~VALID; I now prefer this code. It doesn't do vmread (but replaces it with a bunch of extra instructions - which might be even slower overall...). But the more interesting thing is that it doesn't copy irrelevant bits from vmcs02 to vmcs12, bits that might not have been set by L1 but rather by L0 which previously injected an interrupt into the same L2. These bits shouldn't matter (when !valid), but a nosy L1 might notice them... 3. if (exit_reason FAILED_ENTRY) vmcs12-vm_entry_intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); I think you meant the opposite condition? if (!(exit_reason FAILED_ENTRY)) vmcs12-vm_entry_intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); -- Nadav Har'El| Monday, Sep 13 2010, 5 Tishri 5771 n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |Always borrow money from pessimists. They http://nadav.harel.org.il |don't expect to be paid back. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 18/24] Exiting from L2 to L1
On 09/13/2010 11:01 AM, Nadav Har'El wrote: 3. if (exit_reason FAILED_ENTRY) vmcs12-vm_entry_intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); I think you meant the opposite condition? if (!(exit_reason FAILED_ENTRY)) vmcs12-vm_entry_intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); Dunno, I think both are subtly broken. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 18/24] Exiting from L2 to L1
Hi, Continuing to work on the nested VMX patches, On Mon, Jun 14, 2010, Avi Kivity wrote about Re: [PATCH 18/24] Exiting from L2 to L1: On 06/13/2010 03:31 PM, Nadav Har'El wrote: ... +/* prepare_vmcs_12 is called when the nested L2 guest exits and we want to + * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this + * function updates it to reflect the state of the registers during the exit, ... +vmcs12-tsc_offset = vmcs_read64(TSC_OFFSET); TSC_OFFSET cannot have changed. Right. I cleaned up this function now, to only copy the fields that could have changed, namely fields listed as guest-state or exit-information fields in the spec. Control fields like this TSC_OFFSET and more examples you found below, indeed could not have changed while L2 was running or during the exit. +vmcs12-guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL); Without msr bitmaps, cannot change. I added a TODO before this (and a couple of others) for future optimization. I'm not even convinced how much quicker it is to check the MSR bitmap before doing vmcs_read64, vs just to going ahead and vmreading it in any case. +vmcs12-vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER); Can this change? Well, according to the spec, (SDM vol 3B), VMCS link pointer is a guest-state field, but it is listed as being for future expansion. I guess that with current hardware, it cannot change, but for future hardware it might. I'm not sure if it's wiser to ignore this field for now (and shave a bit off the l2-l1 switch time), or just copy it anyway, as I do now. What would you prefer? +if (vmcs_config.vmentry_ctrl VM_ENTRY_LOAD_IA32_PAT) +vmcs12-guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT); Should check for VM_EXIT_SAVE_IA32_PAT, no? You're absolutely right. Fixed. +vmcs12-vm_entry_intr_info_field = +vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); Autocleared, no need to read. Well, we need to clear the valid bit on exit, so we don't mistakenly inject the same interrupt twice. There were two ways to do it: 1. clear it ourselves, or 2. copy the value from vmcs02 where the processor already cleared it. There are pros and cons for each approach, but I'll change like you suggest, to clear it ourselves: vmcs12-vm_entry_intr_info_field = ~INTR_INFO_VALID_MASK; +vmcs12-vm_instruction_error = vmcs_read32(VM_INSTRUCTION_ERROR); We don't want to pass this to the guest? I didn't quite understand your question, but now that I look at it, I see that VM_INSTRUCTION_ERROR has nothing to do with exits, but is only modified when running VMX instructions, and our emulation of VMX instructions already sets it appropriately, so no sense of copying it here. +vmcs12-vm_exit_reason = vmcs_read32(VM_EXIT_REASON); +vmcs12-vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); +vmcs12-vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE); +vmcs12-idt_vectoring_info_field = +vmcs_read32(IDT_VECTORING_INFO_FIELD); +vmcs12-idt_vectoring_error_code = +vmcs_read32(IDT_VECTORING_ERROR_CODE); +vmcs12-vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN); +vmcs12-vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); For the above, if the host handles the exit, we must not clobber guest fields. A subsequent guest vmread will see the changed values even though from its point of view a vmexit has not occurred. But no, that can't happen, since a vmread needs to have a vmexit first to happen. Still, best to delay this.+ All this code is in prepare_vmcs_12, which only gets called if we exit from L2 to L1 - it doesn't get called when we exit from L2 to L0 (when the host handles the exit). As long as a certain field gets written to on *every* exit, and not just on some of them, I believe we can safely copy the current values in vmcs02 to vmcs12, knowing these are current values from the current exit, and not some old values we shouldn't be copying. You may have a point (if that was your point?) that some fields are not always written to - e.g., for most exits vm_exit_intr_info doesn't get written to and just one valid bit is cleared. As the code is now, we copy vmcs02's field, which might have been written earlier (e.g., during an exit to L0) and not now, and an observant L1 might notice this value it should not have seen. However, I don't see any problem with that, because the valid bit would be correctly turned off, and the spec says that all other bits are undefined (for irrelevant exits) and copying-old-values is one legal setting for undefined bits... +/* If any of the CRO_GUEST_HOST_MASK bits are off, the L2 guest may + * have changed some cr0 bits without us ever saving them in the shadow + * vmcs. So we need to save these changes now. ... + +vmcs12-guest_cr4 = vmcs_readl(GUEST_CR4); Can't we have the same issue
Re: [PATCH 18/24] Exiting from L2 to L1
On 09/12/2010 04:05 PM, Nadav Har'El wrote: Hi, Continuing to work on the nested VMX patches, Great. Hopefully you can commit some time to it or you'll spend a lot of cycles just catching up. Joerg just merged nested npt; as far as I can tell it is 100% in line with nested ept, but please take a look as well. + vmcs12-guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL); Without msr bitmaps, cannot change. I added a TODO before this (and a couple of others) for future optimization. I'm not even convinced how much quicker it is to check the MSR bitmap before doing vmcs_read64, vs just to going ahead and vmreading it in any case. IIRC we don't support msr bitmaps now, so no need to check anything. In general, avoid vmcs reads as much as possible. Just think of your code running in a guest! +vmcs12-vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER); Can this change? Well, according to the spec, (SDM vol 3B), VMCS link pointer is a guest-state field, but it is listed as being for future expansion. I guess that with current hardware, it cannot change, but for future hardware it might. I'm not sure if it's wiser to ignore this field for now (and shave a bit off the l2-l1 switch time), or just copy it anyway, as I do now. What would you prefer? If it changes in the future, it can only be under the influence of some control or at least guest-discoverable capability. Since we don't expose such control or capability, there's no need to copy it. + vmcs12-vm_entry_intr_info_field = + vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); Autocleared, no need to read. Well, we need to clear the valid bit on exit, so we don't mistakenly inject the same interrupt twice. I don't think so. We write this field as part of guest entry (that is, after the decision re which vmcs to use, yes?), so guest entry will always follow writing this field. Since guest entry clears the field, reading it after an exit will necessarily return 0. What can happen is that the contents of the field is transferred to the IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field. (question: on a failed vmentry, is this field cleared?) There were two ways to do it: 1. clear it ourselves, or 2. copy the value from vmcs02 where the processor already cleared it. There are pros and cons for each approach, but I'll change like you suggest, to clear it ourselves: vmcs12-vm_entry_intr_info_field= ~INTR_INFO_VALID_MASK; That's really a temporary variable, I don't think you need to touch it. + vmcs12-vm_exit_reason = vmcs_read32(VM_EXIT_REASON); + vmcs12-vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); + vmcs12-vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE); + vmcs12-idt_vectoring_info_field = + vmcs_read32(IDT_VECTORING_INFO_FIELD); + vmcs12-idt_vectoring_error_code = + vmcs_read32(IDT_VECTORING_ERROR_CODE); + vmcs12-vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN); + vmcs12-vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); For the above, if the host handles the exit, we must not clobber guest fields. A subsequent guest vmread will see the changed values even though from its point of view a vmexit has not occurred. But no, that can't happen, since a vmread needs to have a vmexit first to happen. Still, best to delay this.+ All this code is in prepare_vmcs_12, which only gets called if we exit from L2 to L1 - it doesn't get called when we exit from L2 to L0 (when the host handles the exit). Ok. That answers my concern. As long as a certain field gets written to on *every* exit, and not just on some of them, I believe we can safely copy the current values in vmcs02 to vmcs12, knowing these are current values from the current exit, and not some old values we shouldn't be copying. You may have a point (if that was your point?) that some fields are not always written to - e.g., for most exits vm_exit_intr_info doesn't get written to and just one valid bit is cleared. As the code is now, we copy vmcs02's field, which might have been written earlier (e.g., during an exit to L0) and not now, and an observant L1 might notice this value it should not have seen. However, I don't see any problem with that, because the valid bit would be correctly turned off, and the spec says that all other bits are undefined (for irrelevant exits) and copying-old-values is one legal setting for undefined bits... No, I wasn't worried about that, simply misunderstood the code. + /* If any of the CRO_GUEST_HOST_MASK bits are off, the L2 guest may +* have changed some cr0 bits without us ever saving them in the shadow +* vmcs. So we need to save these changes now. ... + + vmcs12-guest_cr4 = vmcs_readl(GUEST_CR4); Can't we have the same issue with cr4? I guess we can. I didn't think this (giving guest full control over cr4 bits) was happening in KVM, but
Re: [PATCH 18/24] Exiting from L2 to L1
On Sun, Sep 12, 2010, Avi Kivity wrote about Re: [PATCH 18/24] Exiting from L2 to L1: Great. Hopefully you can commit some time to it or you'll spend a lot of cycles just catching up. Right. I will. Joerg just merged nested npt; as far as I can tell it is 100% in line with nested ept, but please take a look as well. Indeed. Making nested EPT work based on the nested NPT work is one of the first things I plan to do after the basic nested VMX patches are accepted. As you know, we already have a version of nested EPT working in our testing code, but I'll need to tweak it a bit to use the common nested NPT code. + vmcs12-guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL); Without msr bitmaps, cannot change. I added a TODO before this (and a couple of others) for future optimization. I'm not even convinced how much quicker it is to check the MSR bitmap before doing vmcs_read64, vs just to going ahead and vmreading it in any case. IIRC we don't support msr bitmaps now, so no need to check anything. I do think we support msr bitmaps... E.g., we have nested_vmx_exit_handled_msr() to check whether L1 requires an exit for a certain MSR access. Where don't we support them? (but I'm not denying the possiblity that this support still has holes or bugs). In general, avoid vmcs reads as much as possible. Just think of your code running in a guest! Yes. On the other hand, I don't want to be sorry in the future when I want to support some feature, but because I wanted to shave off 1% of the L2-L1 switching time, and 0.01% of the total run time (and I'm just making numbers up...), I now need to find a dozen places where things need to change to support this feature. On the other hand, this will likely happen anyway ;-) +vmcs12-vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER); Can this change? .. If it changes in the future, it can only be under the influence of some control or at least guest-discoverable capability. Since we don't expose such control or capability, there's no need to copy it. You convinced me. Removed it. + vmcs12-vm_entry_intr_info_field = + vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); Autocleared, no need to read. Well, we need to clear the valid bit on exit, so we don't mistakenly inject the same interrupt twice. I don't think so. We write this field as part of guest entry (that is, after the decision re which vmcs to use, yes?), so guest entry will always follow writing this field. Since guest entry clears the field, reading it after an exit will necessarily return 0. Well, you obviously know the KVM code much better than I do, but from what I saw, I thought (maybe I misunderstood) that in normal (non-nested) KVM, this field only gets written on injection, not on every entry, so the code relies on the fact that the processor turns off the valid bit during exit, to avoid the same event being injected when the same field value is used for another entry. I can only find code which resets this field in vmx_vcpu_reset(), but that doesn't get called on every entry, right? Or am I missing something? What can happen is that the contents of the field is transferred to the IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field. (question: on a failed vmentry, is this field cleared?) I don't know the answer :-) There were two ways to do it: 1. clear it ourselves, or 2. copy the value from vmcs02 where the processor already cleared it. There are pros and cons for each approach, but I'll change like you suggest, to clear it ourselves: vmcs12-vm_entry_intr_info_field= ~INTR_INFO_VALID_MASK; That's really a temporary variable, I don't think you need to touch it. But we need to emulate the correct VMX behavior. According to the spec, the valid bit on this field needs to be cleared on vmexit, so we need to do it also on emulated exits from L2 to L1. If we're sure that we already cleared it earlier, then fine, but if not (and like I said, I failed to find this code), we need to do it now, on exit - either by explictly clearing the bit or by copying a value where the processor cleared this bit (arguably, the former is more correct emulation). I didn't mean register independent helper; one function for cr0 and one function for cr4 so the reader can just see the name and pretend to understand what it does, instead of seeing a bunch of incomprehensible bitwise operations. Ok, done: /* * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date * because L2 may have changed some cr0 bits directly (see CRO_GUEST_HOST_MASK) * without L0 trapping the change and updating vmcs12. * This function returns the value we should put in vmcs12.guest_cr0. It's not * enough to just return the current (vmcs02) GUEST_CR0. This may not be the * guest cr0 that L1 thought it was giving its L2 guest - it is possible that * L1 wished to allow its guest to set a cr0 bit directly, but we (L0) asked * to trap
Re: [PATCH 18/24] Exiting from L2 to L1
On 09/12/2010 07:05 PM, Nadav Har'El wrote: + vmcs12-guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL); Without msr bitmaps, cannot change. I added a TODO before this (and a couple of others) for future optimization. I'm not even convinced how much quicker it is to check the MSR bitmap before doing vmcs_read64, vs just to going ahead and vmreading it in any case. IIRC we don't support msr bitmaps now, so no need to check anything. I do think we support msr bitmaps... E.g., we have nested_vmx_exit_handled_msr() to check whether L1 requires an exit for a certain MSR access. Where don't we support them? (but I'm not denying the possiblity that this support still has holes or bugs). I was just talking from memory. If you do support them, that's great. Note that kvm itself doesn't support give the guest control of DEBUGCTLMSR, so you should just be able to read it from the shadow value (which strangely doesn't exist - I'll post a fix). In general, avoid vmcs reads as much as possible. Just think of your code running in a guest! Yes. On the other hand, I don't want to be sorry in the future when I want to support some feature, but because I wanted to shave off 1% of the L2-L1 switching time, and 0.01% of the total run time (and I'm just making numbers up...), I now need to find a dozen places where things need to change to support this feature. On the other hand, this will likely happen anyway ;-) Well, with msrs you have two cases: those which the guest controls and those which are shadowed. So all we need is a systematic way for dealing with the two types. + vmcs12-vm_entry_intr_info_field = + vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); Autocleared, no need to read. Well, we need to clear the valid bit on exit, so we don't mistakenly inject the same interrupt twice. I don't think so. We write this field as part of guest entry (that is, after the decision re which vmcs to use, yes?), so guest entry will always follow writing this field. Since guest entry clears the field, reading it after an exit will necessarily return 0. Well, you obviously know the KVM code much better than I do, but from what I saw, I thought (maybe I misunderstood) that in normal (non-nested) KVM, this field only gets written on injection, not on every entry, so the code relies on the fact that the processor turns off the valid bit during exit, to avoid the same event being injected when the same field value is used for another entry. Correct. I can only find code which resets this field in vmx_vcpu_reset(), but that doesn't get called on every entry, right? Or am I missing something? prepare_vmcs12() is called in response for a 2-1 vmexit which is first trapped by 0, yes? Because it's called immediately after a vmexit, VM_ENTRY_INTR_INFO_FIELD is guaranteed to have been cleared by the processor. There are two cases where VM_ENTRY_INTR_INFO_FIELD can potentially not be cleared by hardware: 1. if we call prepare_vmcs12() between injection and entry. This cannot happen AFAICT. 2. if the vmexit was really a failed 1-2 vmentry, and if the processor doesn't clear VM_ENTRY_INTR_INFO_FIELD in response to vm entry failures (need to check scripture) If neither of these are valid, the code can be removed. If only the second, we might make it conditional. What can happen is that the contents of the field is transferred to the IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field. (question: on a failed vmentry, is this field cleared?) I don't know the answer :-) Sheng? There were two ways to do it: 1. clear it ourselves, or 2. copy the value from vmcs02 where the processor already cleared it. There are pros and cons for each approach, but I'll change like you suggest, to clear it ourselves: vmcs12-vm_entry_intr_info_field= ~INTR_INFO_VALID_MASK; That's really a temporary variable, I don't think you need to touch it. But we need to emulate the correct VMX behavior. According to the spec, the valid bit on this field needs to be cleared on vmexit, so we need to do it also on emulated exits from L2 to L1. If we're sure that we already cleared it earlier, then fine, but if not (and like I said, I failed to find this code), we need to do it now, on exit - either by explictly clearing the bit or by copying a value where the processor cleared this bit (arguably, the former is more correct emulation). Sorry, I misread it as vmx-idt_vectoring_info which is a temporary variable used to cache IDT_VECTORING_INFO. Ignore my remark. I didn't mean register independent helper; one function for cr0 and one function for cr4 so the reader can just see the name and pretend to understand what it does, instead of seeing a bunch of incomprehensible bitwise operations. Ok, done: /* * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date * because L2 may have changed some cr0 bits directly (see CRO_GUEST_HOST_MASK) * without L0 trapping
Re: [PATCH 18/24] Exiting from L2 to L1
On Sun, Sep 12, 2010, Avi Kivity wrote about Re: [PATCH 18/24] Exiting from L2 to L1: +vmcs12-vm_entry_intr_info_field = +vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); ... prepare_vmcs12() is called in response for a 2-1 vmexit which is first trapped by 0, yes? Because it's called immediately after a vmexit, VM_ENTRY_INTR_INFO_FIELD is guaranteed to have been cleared by the processor. Indeed - it is cleared in the real vmcs, i.e., vmcs02, but we also need to clear it in the emulated vmcs that L1 sees for L2, i.e., vmcs12. The original code (quoted above) just copied the vmcs02 value to vmcs12, which was (mostly) fine because the vmcs02 value has a correctly-cleared bit - but you asked to avoid the vmread. So the second option is to just explicitly remove the valid bit from vmcs12-vm_entry_intro_info_field, which I do now. vmcs12-vm_entry_intro_info_field was set by L1 before it entered L2, and now that L2 is exiting back to L1, we need to clear the valid bit. The more I think about it, the more I become convinced that the second option is indeed better than the first option (the original code in the patch). There are two cases where VM_ENTRY_INTR_INFO_FIELD can potentially not be cleared by hardware: ... If neither of these are valid, the code can be removed. If only the second, we might make it conditional. Again, unless I'm misunderstanding what you mean, the hardware only modified vmcs02 (the hardware vmcs), not vmcs12. We need to modify vmcs12 as well, to remove the valid bit. If we don't, when L1 enters into the same L2 again, the same old value will be copied again from vmcs12 to vmcs02, and cause an injection of the same interrupt again. And by the way, I haven't said this enough, but thanks for your continued reviews and all your very useful corrections for these patches! Nadav. -- Nadav Har'El| Sunday, Sep 12 2010, 5 Tishri 5771 n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |I before E except after C. We live in a http://nadav.harel.org.il |weird society! -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 18/24] Exiting from L2 to L1
On Monday 13 September 2010 01:21:29 Avi Kivity wrote: On 09/12/2010 07:05 PM, Nadav Har'El wrote: I don't think so. We write this field as part of guest entry (that is, after the decision re which vmcs to use, yes?), so guest entry will always follow writing this field. Since guest entry clears the field, reading it after an exit will necessarily return 0. Well, you obviously know the KVM code much better than I do, but from what I saw, I thought (maybe I misunderstood) that in normal (non-nested) KVM, this field only gets written on injection, not on every entry, so the code relies on the fact that the processor turns off the valid bit during exit, to avoid the same event being injected when the same field value is used for another entry. Correct. I can only find code which resets this field in vmx_vcpu_reset(), but that doesn't get called on every entry, right? Or am I missing something? prepare_vmcs12() is called in response for a 2-1 vmexit which is first trapped by 0, yes? Because it's called immediately after a vmexit, VM_ENTRY_INTR_INFO_FIELD is guaranteed to have been cleared by the processor. There are two cases where VM_ENTRY_INTR_INFO_FIELD can potentially not be cleared by hardware: 1. if we call prepare_vmcs12() between injection and entry. This cannot happen AFAICT. 2. if the vmexit was really a failed 1-2 vmentry, and if the processor doesn't clear VM_ENTRY_INTR_INFO_FIELD in response to vm entry failures (need to check scripture) If neither of these are valid, the code can be removed. If only the second, we might make it conditional. What can happen is that the contents of the field is transferred to the IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field. (question: on a failed vmentry, is this field cleared?) I don't know the answer :-) Sheng? According to SDM 23.7 VM-ENTRY FAILURES DURING OR AFTER LOADING GUEST STATE: Although this process resembles that of a VM exit, many steps taken during a VM exit do not occur for these VM-entry failures: • Most VM-exit information fields are not updated (see step 1 above). • The valid bit in the VM-entry interruption-information field is *not* cleared. • The guest-state area is not modified. • No MSRs are saved into the VM-exit MSR-store area. So VM entry failure would result in _keep_ valid bit of VM_ENTRY_INTR_INFO_FIELD. -- regards Yang, Sheng -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 18/24] Exiting from L2 to L1
This patch implements nested_vmx_vmexit(), called when the nested L2 guest exits and we want to run its L1 parent and let it handle this exit. Note that this will not necessarily be called on every L2 exit. L0 may decide to handle a particular exit on its own, without L1's involvement; In that case, L0 will handle the exit, and resume running L2, without running L1 and without calling nested_vmx_vmexit(). The logic for deciding whether to handle a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(), will appear in the next patch. Signed-off-by: Nadav Har'El n...@il.ibm.com --- --- .before/arch/x86/kvm/vmx.c 2010-06-13 15:01:30.0 +0300 +++ .after/arch/x86/kvm/vmx.c 2010-06-13 15:01:30.0 +0300 @@ -5080,9 +5080,13 @@ static void vmx_complete_interrupts(stru int type; bool idtv_info_valid; + vmx-exit_reason = vmcs_read32(VM_EXIT_REASON); + + if (vmx-nested.nested_mode) + return; + exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); - vmx-exit_reason = vmcs_read32(VM_EXIT_REASON); /* Handle machine checks before interrupts are enabled */ if ((vmx-exit_reason == EXIT_REASON_MCE_DURING_VMENTRY) @@ -5978,6 +5982,278 @@ static int nested_vmx_run(struct kvm_vcp return 1; } +/* prepare_vmcs_12 is called when the nested L2 guest exits and we want to + * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this + * function updates it to reflect the state of the registers during the exit, + * and to reflect some changes that happened while L2 was running (and perhaps + * made some exits which were handled directly by L0 without going back to L1). + */ +void prepare_vmcs_12(struct kvm_vcpu *vcpu) +{ + struct shadow_vmcs *vmcs12 = get_shadow_vmcs(vcpu); + + vmcs12-guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR); + vmcs12-guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR); + vmcs12-guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR); + vmcs12-guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR); + vmcs12-guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR); + vmcs12-guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR); + vmcs12-guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR); + vmcs12-guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR); + + vmcs12-tsc_offset = vmcs_read64(TSC_OFFSET); + vmcs12-guest_physical_address = vmcs_read64(GUEST_PHYSICAL_ADDRESS); + vmcs12-vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER); + vmcs12-guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL); + if (vmcs_config.vmentry_ctrl VM_ENTRY_LOAD_IA32_PAT) + vmcs12-guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT); + vmcs12-cr3_target_count = vmcs_read32(CR3_TARGET_COUNT); + vmcs12-vm_entry_intr_info_field = + vmcs_read32(VM_ENTRY_INTR_INFO_FIELD); + vmcs12-vm_entry_exception_error_code = + vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE); + vmcs12-vm_entry_instruction_len = + vmcs_read32(VM_ENTRY_INSTRUCTION_LEN); + vmcs12-vm_instruction_error = vmcs_read32(VM_INSTRUCTION_ERROR); + vmcs12-vm_exit_reason = vmcs_read32(VM_EXIT_REASON); + vmcs12-vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); + vmcs12-vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE); + vmcs12-idt_vectoring_info_field = + vmcs_read32(IDT_VECTORING_INFO_FIELD); + vmcs12-idt_vectoring_error_code = + vmcs_read32(IDT_VECTORING_ERROR_CODE); + vmcs12-vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN); + vmcs12-vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); + vmcs12-guest_es_limit = vmcs_read32(GUEST_ES_LIMIT); + vmcs12-guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT); + vmcs12-guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT); + vmcs12-guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT); + vmcs12-guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT); + vmcs12-guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT); + vmcs12-guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT); + vmcs12-guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT); + vmcs12-guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT); + vmcs12-guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT); + vmcs12-guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES); + vmcs12-guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES); + vmcs12-guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES); + vmcs12-guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES); + vmcs12-guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES); + vmcs12-guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES); + vmcs12-guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES); + vmcs12-guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES); + vmcs12-guest_interruptibility_info = +