Re: Monotonic clock with KVM pv-clock
On Mon, Jan 20, 2014, Marcelo Tosatti wrote about "Re: Monotonic clock with KVM pv-clock":
> On Mon, Jan 20, 2014 at 11:56:56AM +0200, Nadav Har'El wrote:
> > Hi,
> >
> > I'm trying to figure out how a guest OS can get a monotonic clock using
> > KVM's paravirtual clock.
> >
> > At first, I thought that the clock I get using KVM_SYSTEM_TIME is a
> > monotonic clock, based on the host's monotonic clock.
>
> It is. However, it is used in conjunction with the TSC delta, part
> of the structure which is written at the KVM_SYSTEM_TIME GPA. See
> pvclock_clocksource_read at arch/x86/kernel/pvclock.c.

Sure, I'm using the whole protocol for KVM_SYSTEM_TIME. I just wondered
whether it has a meaningful definition (namely, of being a monotonic
clock) when used on its own, not in conjunction with the KVM_WALL_CLOCK
protocol.

> > 2. What happens when the wall-clock time is set on the host? I was
> > hoping that only KVM_WALL_CLOCK changes and KVM_SYSTEM_TIME doesn't,
> > but am no longer sure this is actually the case.
>
> Yes, it is the case. The host clock which backs the system_timestamp
> field of the pvclock structure is
>
>     CLOCK_MONOTONIC
>     Clock that cannot be set and represents monotonic time since some
>     unspecified starting point. This clock is not affected by
>     discontinuous jumps in the system time (e.g., if the system
>     administrator manually changes the clock), but is affected by the
>     incremental adjustments performed by adjtime(3) and NTP.

Excellent.

> > If KVM_SYSTEM_TIME is not a correct way to get a monotonic paravirtual
> > clock from KVM, is there a correct way?
>
> Inside a Linux guest? Can use sched_clock().
>
> If not a Linux guest, either implement a kvmclock-like driver
> (kvm-unit-test contains one).

This is actually an OSv guest (http://osv.io/). Until now we were using
kvmclock by adding up the wall-clock and system-time protocols, and this
resulted in a good wall-time clock (CLOCK_REALTIME), but I really wanted
to also have a monotonic clock (CLOCK_MONOTONIC) for the guest.
I took a look at how a Linux guest uses the KVM pv-clock, and thought
that the "system time" part of the kvmclock protocol would make a good
monotonic clock, but wasn't sure I was understanding this right.

Thanks for the clarifications!

Nadav.

--
Nadav Har'El                 | Tuesday, Jan 21 2014, 20 Shevat 5774
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |"The average person thinks he isn't." -
                             |Father Larry Lorenzoni
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
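For readers following along, the "system time" read the thread discusses works roughly as follows: the guest combines the host-written system_time snapshot with a scaled TSC delta. Here is a simplified, hypothetical sketch in the spirit of pvclock_clocksource_read; the struct is a cut-down stand-in for the real pvclock_vcpu_time_info (the actual protocol also rereads the structure under its version field to detect torn reads), so treat every name here as illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct pvclock_vcpu_time_info; the real
 * structure has more fields (padding, flags) and a version-based
 * seqlock protocol around reads. */
struct pvclock_info {
	uint32_t version;
	uint64_t tsc_timestamp;     /* host TSC when system_time was sampled */
	uint64_t system_time;       /* host CLOCK_MONOTONIC at that moment, ns */
	uint32_t tsc_to_system_mul; /* fixed-point factor: mul / 2^32 */
	int8_t   tsc_shift;         /* binary pre-shift applied to the delta */
};

/* Scale a TSC delta to nanoseconds: (delta << shift) * mul / 2^32. */
static uint64_t scale_tsc_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	if (shift >= 0)
		delta <<= shift;
	else
		delta >>= -shift;
	return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* The kvmclock "system time" reading: snapshot plus scaled TSC delta. */
static uint64_t pvclock_read(const struct pvclock_info *ti, uint64_t tsc_now)
{
	return ti->system_time +
	       scale_tsc_delta(tsc_now - ti->tsc_timestamp,
			       ti->tsc_to_system_mul, ti->tsc_shift);
}
```

Because system_time is backed by the host's CLOCK_MONOTONIC and the delta term only ever grows between host updates, the value read this way never goes backwards, which is the property being asked about.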
Monotonic clock with KVM pv-clock
Hi,

I'm trying to figure out how a guest OS can get a monotonic clock using
KVM's paravirtual clock.

At first, I thought that the clock I get using KVM_SYSTEM_TIME is a
monotonic clock, based on the host's monotonic clock. But I'm no longer
sure this is actually done correctly:

1. What happens on live migration? On the new host, will the
   KVM_SYSTEM_TIME continue from the same point? Or will it jump because
   KVM_WALL_CLOCK (the wall time during boot) is different on the new
   host?

2. What happens when the wall-clock time is set on the host? I was
   hoping that only KVM_WALL_CLOCK changes and KVM_SYSTEM_TIME doesn't,
   but am no longer sure this is actually the case.

If KVM_SYSTEM_TIME is not a correct way to get a monotonic paravirtual
clock from KVM, is there a correct way?

Thanks,
Nadav.

--
Nadav Har'El                 | Monday, Jan 20 2014, 19 Shevat 5774
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |Diplomat: A man who always remembers a
                             |woman's birthday but never her age.
Re: nested migration issues
On Tue, Aug 27, 2013, Jan Kiszka wrote about "Re: nested migration issues":
> > Can anybody tell me whether kvm supports this kind of nested migration
> > or anybody is working on the relating field?
>
> No, KVM does not support migration while nested VMX (or SVM) is enabled.
> We lack userspace interfaces to export/import the required state, and
> that mostly because we do not really know yet what state needs to be
> made available. Specifically nVMX is still under development. Migration
> support will officially follow once we feel we reached a mature state
> with nesting support.

Also check out https://bugzilla.kernel.org/show_bug.cgi?id=53851 for
some earlier ideas about the issue of migrating L1 complete with its
L2s.

> That said, if you want to play with adding such an interface yourself,
> no one will stop you. It may serve as a reference for us what is
> required, but you should not expect it to be merged soon.
>
> BTW, not only migration is broken so far. Even reset does not work while
> VMX is enabled. For the same reason: userspace cannot load a VCPU state
> that resets all VMX states properly.
>
> Jan
>
> --
> Siemens AG, Corporate Technology, CT RTC ITP SES-DE
> Corporate Competence Center Embedded Linux

--
Nadav Har'El                 | Tuesday, Aug 27 2013, 21 Elul 5773
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |Committee: A group of people that keeps
                             |minutes and wastes hours.
Re: [PATCH 10/11] KVM: nVMX: Synchronize VMCS12 content with the shadow vmcs
Hi Abel, very nice patches.

On Sun, Mar 10, 2013, Abel Gordon wrote about "[PATCH 10/11] KVM: nVMX: Synchronize VMCS12 content with the shadow vmcs":
> 	nested_vmx_vmexit(vcpu);
> +	if (enable_shadow_vmcs)
> +		copy_vmcs12_to_shadow(to_vmx(vcpu));

I was curious why your patch adds this call to copy_vmcs12_to_shadow
after every nested_vmx_vmexit (3 times), instead of making this call
inside nested_vmx_vmexit(), say right after prepare_vmcs12(). Until I
saw:

> 	nested_vmx_vmexit(vcpu);
> 	vmcs12->vm_exit_reason = EXIT_REASON_EXTERNAL_INTERRUPT;
> 	vmcs12->vm_exit_intr_info = 0;
> +	if (enable_shadow_vmcs)
> +		copy_vmcs12_to_shadow(to_vmx(vcpu));

where you need to copy this exit-reason override as well... I wonder if
there isn't a less ugly and repetitive way to do this :(

--
Nadav Har'El                 | Monday, Mar 11 2013, 29 Adar 5773
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |Fame: when your name is in everything but
                             |the phone book.
Re: [PATCH] KVM: nVMX: Fix content of MSR_IA32_VMX_ENTRY/EXIT_CTLS
On Sun, Mar 03, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Fix content of MSR_IA32_VMX_ENTRY/EXIT_CTLS":
> 	/* Note that guest use of VM_EXIT_ACK_INTR_ON_EXIT is not supported. */
> #ifdef CONFIG_X86_64
> 	nested_vmx_exit_ctls_high = VM_EXIT_HOST_ADDR_SPACE_SIZE;
> #else
> 	nested_vmx_exit_ctls_high = 0;
> #endif
> +	nested_vmx_exit_ctls_high |= 0x36dff;

Can you please compose this 0x36dff out of constants? Is
VM_EXIT_HOST_ADDR_SPACE_SIZE one of them? It's important to verify that
we actually support all these bits - even if we *should* support them,
it doesn't mean we actually do (but if we do, we should say we do).

> -	nested_vmx_entry_ctls_low = 0;
> +	/* If bit 55 of VMX_BASIC is off, bits 0-8 and 12 must be 1. */
> +	nested_vmx_entry_ctls_low = 0x11ff;

Setting nested_vmx_entry_ctls_low = 0 just means that although the spec
says only the 1-setting is supported, we *also* support the 0-setting.
I'm not sure why this is a bad thing. Our VMX will be even better than
the real processors' ;-)

--
Nadav Har'El                 | Monday, Mar 4 2013, 22 Adar 5773
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |My opinions may have changed, but not the
                             |fact that I am right.
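As an aside, the 0x11ff value quoted above can be decomposed from the bit positions the comment names, which makes the "compose it out of constants" request easy to check. A small illustrative sketch (the macro name is made up, not one from KVM):

```c
#include <assert.h>
#include <stdint.h>

/* Per the quoted comment: when bit 55 of IA32_VMX_BASIC is 0, bits 0-8
 * and bit 12 of the VM-entry controls default to 1. Spelled out from
 * the bit positions, that mask is exactly 0x11ff. Macro name is
 * illustrative only. */
#define VM_ENTRY_CTLS_DEFAULT1_MASK ((uint32_t)(0x1ffu /* bits 0-8 */ | \
						(1u << 12)))
```

Writing the mask this way documents which bits are meant, so a reader can audit each one against the SDM instead of trusting a magic number.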
Re: [PATCH] KVM: nVMX: Fix setting of CR0 and CR4 in guest mode
On Mon, Mar 04, 2013, Jan Kiszka wrote about "Re: [PATCH] KVM: nVMX: Fix setting of CR0 and CR4 in guest mode":
> >>>> 	if (is_guest_mode(vcpu)) {
> >>>> -		/*
> >>>> -		 * We get here when L2 changed cr0 in a way that did not change
> >>>> -		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
> >>>> -		 * but did change L0 shadowed bits. This can currently happen
> >>>> -		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
> >>>> -		 * loading) while pretending to allow the guest to change it.
> >>>> -		 */
> >>> Can't say I understand this patch yet, but it looks like the comment is
> >>> still valid. Why have you removed it?
> >>
> >> L0 allows L1 or L2 at most to own TS, the rest is host-owned. I think
> >> the comment was always misleading.
> >>
> > I do not see how it is misleading. For everything but TS we will not get
> > here (if L1 is kvm). For TS we will get here if L1 allows L2 to change
> > it, but L0 does not.
>
> For everything *but guest-owned* we will get here, thus for most CR0
> accesses (bit-wise, not regarding frequency).

For most CR0 bits, L1 (at least, a KVM one) will shadow (trap) them, so
we won't get to this point you modified at all... Instead,
nested_vmx_exit_handled_cr() would notice that a shadowed-by-L1 bit was
modified, so an exit to L1 is required. We only get to the code you
changed if a bit was modified that L1 did *not* want to trap, but L0
did. This is definitely not the bit-wise majority of the cases - unless
you have an L1 that does not trap most of the CR0 bits.

But I'm more worried about the actual code change :-) I didn't
understand whether there's a situation where the existing code did
something wrong, or why it was wrong. Did you check the
lazy-FPU-loading (TS bit) aspect of your new code? To effectively check
this, what I had to do is run long runs of parallel "make" (make -j3)
on all of L0, L1, and L2, concurrently. Even code which doesn't do
floating-point calculations sometimes uses the FPU for its wide
registers, so all these processes, guests and guests' guests, compete
for the FPU, exercising this code path very well. If the TS bit is
handled wrongly, some of these make processes will die when one of the
compilations dies of SIGSEGV (forgetting to set the FPU registers leads
to some uninitialized pointers being used), so it's quite easy to
exercise this.

--
Nadav Har'El                 | Monday, Mar 4 2013, 22 Adar 5773
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |"A witty saying proves nothing." --
                             |Voltaire
Re: [PATCH] KVM: nVMX: Reset RFLAGS on VM-exit
On Sun, Mar 03, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Reset RFLAGS on VM-exit":
> From: Jan Kiszka
>
> Ouch, how could this work so well that far? We need to clear RFLAGS to
> the reset value as specified by the SDM. Particularly, IF must be off
> after VM-exit!

nested_vmx_succeed() and nested_vmx_fail*() were already clearing some
of the fields that I understood were necessary to clear. But they did
not clear IF - I never realised (and didn't verify now) that this is
part of the spec. And since an L1 KVM anyway enters L2 with interrupts
disabled, a nested KVM would not see a difference.

> +	vmx_set_rflags(vcpu, 0x02);

There's a macro X86_EFLAGS_BIT1 which you can use for this 0x02.

--
Nadav Har'El                 | Sunday, Mar 3 2013, 22 Adar 5773
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |"I don't use drugs, my dreams are
                             |frightening enough." -- M. C. Escher
Re: [Bug 53611] New: nVMX: Add nested EPT
On Thu, Feb 14, 2013, Nakajima, Jun wrote about "Re: [Bug 53611] New: nVMX: Add nested EPT":
> We have started looking at the patches first. But I couldn't
> reproduce the results by simply applying the original patches to v3.6:
> - L2 Ubuntu 12.04 (64-bit) (smp 2)
> - L1 Ubuntu 12.04 (64-bit) KVM (smp 2)
> - L0 Ubuntu 12.04 (64-bit)-based. kernel/KVM is v3.6 + patches (the
>   ones in nept-v2.tgz).
>   https://bugzilla.kernel.org/attachment.cgi?id=93101
>
> Without the patches, the L2 guest works. With them, it hangs at boot
> time (just black screen):
> - EPT was detected by L1 KVM.
> - UP L2 didn't help.
> - Looks like it's looping at EPT_walk_add_generic at the same address in L0.
>
> Will take a closer look. It would be helpful if the test configuration
> (e.g. kernel/commit id used, L1/L2 guests) was documented as well.

I sent the patches on August 1st, and they applied to commit
ade38c311a0ad8c32e902fe1d0ae74d0d44bc71e from a week earlier.

In most of my tests, L1 and L2 were old images - L1 had Linux 2.6.33,
while L2 had Linux 2.6.28. In most of my tests both L1 and L2 were UP.

I've heard another report of my patch not working with newer L1/L2 -
the report said that L2 failed to boot (like you reported), and also
that L1 became unstable (running anything in it gave a memory fault).
So it is very likely that this code still has bugs - but since I
already know of errors and holes that need to be plugged (see the
announcement file together with the patches), it's not very surprising
:( These patches definitely need some lovin', but it's easier than
starting from scratch.

Nadav.

--
Nadav Har'El                 | Tuesday, Feb 26 2013, 16 Adar 5773
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |I think therefore I am. My computer
                             |thinks for me, therefore I am not.
Re: [PATCH] KVM: nSVM/nVMX: Implement vmexit on INIT assertion
On Sun, Feb 24, 2013, Jan Kiszka wrote about "[PATCH] KVM: nSVM/nVMX: Implement vmexit on INIT assertion":
> From: Jan Kiszka
>
> On Intel, raising INIT causes an unconditional vmexit. On AMD, this is
> controlled by the interception mask.

Hi, I never tried to closely follow the KVM code paths related to this
code, but I do have one question. The VMX spec says:

	"The INIT signal is blocked whenever a logical processor is in
	VMX root operation. It is not blocked in VMX non-root operation.
	Instead, INITs cause VM exits (see Section 22.3, Other Causes of
	VM Exits)."

So when running a non-nested L1 guest, or an L2 guest, the new behavior
appears correct. However, it looks like if L1 is running in root mode
(i.e., did VMXON once but is not running L2 now), the INIT signal needs
to be blocked, not do what it does now. It appears (but I'm not
sure...) that right now it causes the L1 guest to lock up, and is not
ignored?

Nadav.

--
Nadav Har'El                 | Monday, Feb 25 2013, 15 Adar 5773
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |If con is the opposite of pro, is
                             |congress the opposite of progress?
Re: [PATCH] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state
On Sat, Feb 23, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state":
> -	kvm_set_cr0(vcpu, vmcs12->host_cr0);
> +	vmx_set_cr0(vcpu, vmcs12->host_cr0);

I don't remember now why I did this (and I'm not looking at the code),
but this is something you'll need to test really carefully, including
in shadow-on-shadow mode (ept=0 in L0), to verify you're not missing
any important side effect of kvm_set_cr0.

Also, if I remember correctly, during nVMX's review Avi Kivity asked in
several places that where I called vmx_set_cr0, I should instead call
kvm_set_cr0(), because it does some extra stuff and some extra checks.
Hmm, see this: http://markmail.org/message/hhidqyhbo2mrgxxc
where Avi asked for the reverse of the patch you're attempting now.

--
Nadav Har'El                 | Saturday, Feb 23 2013, 14 Adar 5773
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |It is better to be thought a fool, than
                             |to open your mouth and remove all doubt.
Re: [PATCH] KVM: nVMX: Rework event injection and recovery
On Thu, Feb 21, 2013, Jan Kiszka wrote about "Re: [PATCH] KVM: nVMX: Rework event injection and recovery":
> That generally does not help to inject/report an external IRQ to L1 as
> L1 runs with IRQs disabled around VMLAUNCH/RESUME.

Good point, I forgot that :( So it looks like nested_run_pending was
necessary after all.

--
Nadav Har'El                 | Thursday, Feb 21 2013, 11 Adar 5773
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |An egotist is a person of low taste, more
                             |interested in himself than in me.
Re: [PATCH] KVM: nVMX: Rework event injection and recovery
On Thu, Feb 21, 2013, Gleb Natapov wrote about "Re: [PATCH] KVM: nVMX: Rework event injection and recovery":
> will not be called while there is an event to reinject. 51cfe38ea5 still
> does not explain why nested_run_pending is needed. We cannot #vmexit
> without entering L2, but we can undo VMLAUNCH/VMRESUME emulation leaving
> rip pointing to the instruction. We can start by moving
> skip_emulated_instruction() from nested_vmx_run() to nested_vmx_vmexit().

This is a very interesting idea! Don't forget to also
skip_emulated_instruction() in nested_vmx_entry_failure(). And please
expand the comment at the end of nested_vmx_run(), saying that skipping
the instruction is also done on exit, unless the instruction needs to
be retried because we needed to inject an interrupt into L1 before
running it.

Whether this is actually clearer than the "nested_run_pending" approach,
I don't know.

--
Nadav Har'El                 | Thursday, Feb 21 2013, 11 Adar 5773
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |Cat rule #2: Bite the hand that won't
                             |feed you fast enough.
Re: [PATCH] KVM: nVMX: Rework event injection and recovery
Hi,

By the way, if you haven't seen my description of why the current code
did what it did, take a look at
http://www.mail-archive.com/kvm@vger.kernel.org/msg54478.html
Another description might also come in handy:
http://www.mail-archive.com/kvm@vger.kernel.org/msg54476.html

On Wed, Feb 20, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Rework event injection and recovery":
> This aligns VMX more with SVM regarding event injection and recovery for
> nested guests. The changes allow to inject interrupts directly from L0
> to L2.
>
> One difference to SVM is that we always transfer the pending event
> injection into the architectural state of the VCPU and then drop it from
> there if it turns out that we left L2 to enter L1.

Last time I checked, if I'm remembering correctly, the nested SVM code
did something a bit different: after the exit from L2 to L1 and
unnecessarily queuing the pending interrupt for injection, it skipped
one entry into L1, and as usual after the entry the interrupt queue is
cleared, so next time around, when L1 is really entered, the wrong
injection is not attempted.

> VMX and SVM are now identical in how they recover event injections from
> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> still contains a valid event and, if yes, transfer the content into L1's
> idt_vectoring_info_field.
> To avoid that we incorrectly leak an event into the architectural VCPU
> state that L1 wants to inject, we skip cancellation on nested run.

I didn't understand this last point.

> @@ -7403,9 +7375,32 @@ void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> 	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> 	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>
> -	/* clear vm-entry fields which are to be cleared on exit */
> -	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
> +	/* drop what we picked up for L0 via vmx_complete_interrupts */
> +	vcpu->arch.nmi_injected = false;
> +	kvm_clear_exception_queue(vcpu);
> +	kvm_clear_interrupt_queue(vcpu);

It would be nice to move these lines out of prepare_vmcs12(), since
they don't really do anything with vmcs12, and into nested_vmx_vmexit()
(which is what called prepare_vmcs12()).

Did you test this both with PIN_BASED_EXT_INTR_MASK (the usual case)
and !PIN_BASED_EXT_INTR_MASK (the case which interests you)? We need to
make sure that in the former case, this doesn't clear the interrupt
queue after we put an interrupt to be injected in it (at first glance
it seems fine, but these code paths are so convoluted, it's hard to be
sure).

> +	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) &&
> +	    vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK) {
> +		/*
> +		 * Preserve the event that was supposed to be injected
> +		 * by emulating it would have been returned in
> +		 * IDT_VECTORING_INFO_FIELD.
> +		 */
> +		if (vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) &
> +		    INTR_INFO_VALID_MASK) {
> +			vmcs12->idt_vectoring_info_field =
> +				vmcs12->vm_entry_intr_info_field;
> +			vmcs12->idt_vectoring_error_code =
> +				vmcs12->vm_entry_exception_error_code;
> +			vmcs12->vm_exit_instruction_len =
> +				vmcs12->vm_entry_instruction_len;
> +			vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);

I'm afraid I'm missing what you are trying to do here. Why would
vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) & INTR_INFO_VALID_MASK ever be
true? After all, the processor clears it after each successful exit, so
the last if() will only succeed on failed entries - but that is NOT the
case if we're in the enclosing if (note that vmcs12->vm_exit_reason =
vmcs_read32(VM_EXIT_REASON)). Maybe I'm missing something?

Nadav.

--
Nadav Har'El                 | Wednesday, Feb 20 2013, 10 Adar 5773
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |Change is inevitable, except from a
                             |vending machine.
Re: [PATCH] KVM: nVMX: Fix injection of PENDING_INTERRUPT and NMI_WINDOW exits to L1
On Sat, Feb 16, 2013, Jan Kiszka wrote about "Re: [PATCH] KVM: nVMX: Fix injection of PENDING_INTERRUPT and NMI_WINDOW exits to L1":
> No, this is wrong. I first wrote a patch that ignored enable_irq_window
> when the guest is not interested in external IRQs. But then I thought
> that wasn't correct. I tend to believe now my first idea was better.

If you want to make the !PIN_BASED_EXT_INTR_MASK case work correctly,
please also see: https://bugzilla.kernel.org/show_bug.cgi?id=53711

Nadav.

--
Nadav Har'El                 | Saturday, Feb 16 2013, 6 Adar 5773
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |Always keep your words soft and sweet,
                             |just in case you have to eat them.
Re: [PATCH v2] KVM: nVMX: Improve I/O exit handling
On Thu, Feb 14, 2013, Gleb Natapov wrote about "Re: [PATCH v2] KVM: nVMX: Improve I/O exit handling":
> > >> Not sure how to map a failure on real HW behaviour. I guess it's best to
> > > Exit to L1 with nested_vmx_failValid() may be?
> >
> > To my understanding, nested_vmx_failValid/Invalid are related to errors
> > directly related to vm instruction execution. This one is triggered by
> > the guest later on.
> >
> You are right. We need some kind of error vmexit here, but nothing
> appropriate is defined by the spec, so lets just assume that exit to L1
> is needed if we cannot read permission bitmaps.

On real hardware, note that the MSR-bitmap address specified in the
VMCS is a physical address, so there can never be an error - if I
understand correctly, there is no such thing as a non-existent physical
address: reading from non-existent physical memory returns random
garbage (please correct me if I'm wrong here!). So if I'm correct,
using a non-existent address for an MSR bitmap gives you undefined
behavior - not an entry error or a special type of exit.

The current code, using a random value on the stack, fits the
"undefined behavior" definition, but you're right that the more logical
behavior is, indeed, to always exit on the MSR access when the bitmap
cannot be read. This makes an unreadable bitmap equivalent to no bitmap
at all - which I think makes the most sense.

You're also right that this case is identical in both the MSR and I/O
bitmap cases, and should be fixed in both.

--
Nadav Har'El                 | Thursday, Feb 14 2013, 4 Adar 5773
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |A fanatic is one who can't change his
                             |mind and won't change the subject.
Re: [Bug 53611] New: nVMX: Add nested EPT
Hi,

On Mon, Feb 11, 2013, Jan Kiszka wrote about "Re: [Bug 53611] New: nVMX: Add nested EPT":
> On 2013-02-11 13:49, bugzilla-dae...@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=53611
> > Summary: nVMX: Add nested EPT

Yikes, I didn't realize that these bugzilla edits all get spammed to
the entire mailing list :( Sorry about those...

> I suppose they do not apply anymore as well. Do you have a recent tree
> around somewhere or plan to resume work on it?

Unfortunately, no - I have not had time to work on these patches since
August. The reason I'm now putting these things into the bug tracker is
that at the end of this month I am leaving IBM for a new job, so I'm
pretty sure I won't have time myself to continue any work on nested
VMX, and I would like the missing nested-VMX features to be documented
in case someone else comes along and wants to improve it. So
unfortunately, you should expect more of this bugzilla spam on the
mailing list...

Nadav.

--
Nadav Har'El                 | Monday, Feb 11 2013, 1 Adar 5773
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |The message above is just this
                             |signature's way of propagating itself.
Re: [PATCH] KVM: nVMX: Improve I/O exit handling
On Sun, Feb 10, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Improve I/O exit handling":
> +static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu,
> +				       struct vmcs12 *vmcs12)
> +{
> +	unsigned long exit_qualification;
> +	gpa_t bitmap, last_bitmap;
> +	bool string, rep;
> +	u16 port;
> +	int size;
> +	u8 b;
> +
> +	if (nested_cpu_has(get_vmcs12(vcpu), CPU_BASED_UNCOND_IO_EXITING))
> +		return 1;

Instead of calling get_vmcs12(vcpu), you can just use the "vmcs12"
variable which you already have. I see I left the same redundant call
in nested_vmx_exit_handled_msr too :(

> +	if (port < 0x8000)
> +		bitmap = vmcs12->io_bitmap_a;
> +	else
> +		bitmap = vmcs12->io_bitmap_b;
> +	bitmap += port / 8;

In the port >= 0x8000 case, I believe you need to subtract 0x8000 from
the port number before using it as an offset into io_bitmap_b?

Nadav.

--
Nadav Har'El                 | Monday, Feb 11 2013, 1 Adar 5773
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |Attention: There will be a rain dance
                             |Friday night, weather permitting.
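To make the suggested fix concrete: I/O bitmap A covers ports 0x0000-0x7fff and bitmap B covers 0x8000-0xffff, so the byte offset must be computed from the port modulo 0x8000. A minimal, self-contained sketch (the bitmap base addresses and the function name are invented for illustration; in KVM they come from vmcs12->io_bitmap_a/b):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical guest-physical base addresses of the two I/O bitmaps;
 * purely illustrative stand-ins for vmcs12->io_bitmap_a/io_bitmap_b. */
#define IO_BITMAP_A_BASE 0x10000ULL
#define IO_BITMAP_B_BASE 0x20000ULL

/* Return the guest-physical address of the bitmap byte covering `port`.
 * Bitmap A covers ports 0x0000-0x7fff, bitmap B covers 0x8000-0xffff,
 * and each bitmap restarts at offset 0, hence the (port & 0x7fff). */
static uint64_t io_bitmap_byte_addr(uint16_t port)
{
	uint64_t bitmap = (port < 0x8000) ? IO_BITMAP_A_BASE
					  : IO_BITMAP_B_BASE;

	return bitmap + (port & 0x7fff) / 8;
}
```

Without the `& 0x7fff`, a port like 0x8000 would index byte 0x1000 of bitmap B instead of byte 0, i.e. read past the 4 KiB page, which is exactly the off-by-0x8000 bug pointed out above.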
Re: what does SVM_EXIT_VMRUN mean?
On Wed, Sep 19, 2012, Steven wrote about "what does SVM_EXIT_VMRUN mean?":
> hi,
> I am wondering what the exit code SVM_EXIT_VMRUN means? Its name is
> not as clear as SVM_EXIT_READ_CR0.
> Thanks.

This exit happens when the guest ran the VMRUN instruction. In other
words, the guest is a hypervisor running its own guest. This is known
as nested virtualization. In an "ordinary" (non-hypervisor) guest,
you'll never see this exit reason.

--
Nadav Har'El                 | Thursday, Sep 20 2012, 4 Tishri 5773
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |Sign in pool: "Welcome to our OOL. Notice
                             |there is no P, please keep it that way."
Re: Guest mode question
On Wed, Sep 19, 2012, Steven wrote about "Re: Guest mode question":
> thanks for your reply. My question is regarding the normal
> single-level virtualization.
> Suppose that the physical machine has only one core and 2 VMs.
> Each VM has one vcpu, so there are two vcpu threads running with the
> hypervisor.
> Does your answer mean these two VMs can not be in the guest mode at
> the same time?

Like I said, there is no such thing (in a non-nested context) as a VM
being in "guest mode". Rather, it is the CPU which is in guest mode.
When the CPU is at the moment running guest instructions (of any of the
guests), it is in guest mode. When it isn't, it's in so-called "root
mode". So your question becomes moot.

> For example, when the hypervisor schedules one VM, the
> running VM must have a vmexit call first and then the hypervisor can
> issue vmrun call to the other VM.

Indeed. Virtualization hardware doesn't magically create additional
hardware threads (cores/hyperthreads). The CPU can only run one thread
of execution, and in particular cannot run two VMs concurrently. To
switch between two VMs, the CPU needs to first exit from running the
first VM (go from guest mode back to root mode), then decide which
other VM to run, and VMLAUNCH/VMRUN it. The first VM might exit because
of "natural causes" (i.e., a sensitive instruction), or it might be
forced to exit because of, e.g., a physical timer interrupt (normally,
any physical interrupt causes a guest to exit, so the host can handle
it). After each exit, the hypervisor resumes running and might decide
(using Linux's normal scheduling policies) to schedule a different
process or VM instead of the one that just exited.

--
Nadav Har'El                 | Wednesday, Sep 19 2012, 3 Tishri 5773
n...@math.technion.ac.il     |- Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    |Evening news begins with 'Good evening',
                             |and then proceeds to say why it isn't.
Re: Guest mode question
On Sat, Sep 15, 2012, Steven wrote about "Guest mode question": > Hi, > I have a question about vm execution in the guest mode. Suppose I have > two VMs on a machine. > Is it possible that both VMs are in the guest mode? Like this > vm1 enters guest mode > vm2 enters guest mode > // at this point, two VMs are in guest mode > > Or the hypervisor will first force one VM to vmexit before the second > VM enters the guest mode? Your description, of a *VM* being in guest mode, makes it sound like you're describing *nested* virtualization, i.e., running another hypervisor (like KVM) as the guest of the parent KVM. If this is what you actually meant, then, yes, indeed, several VMs may be in guest mode because they are each running a nested guest (aka an L2, a guest of a guest). But if you didn't ask about nested virtualization, but rather normal, single-level, virtualization, there's a different answer: It is the CPU thread (one core, or even one hyperthread), *not* the VM, which can be in root or guest mode. It is in root mode when the hypervisor runs, and in guest mode when a specific guest (VM) runs. One CPU thread can only run one piece of code at a time, so of course it cannot be running two VMs at the same time - if it is running VM B now, it's because VM A is not running now. Of course, if you have a multicore machine, both A and B might be running, on two separate cores. In that case, both cores (not VMs) may be in guest mode concurrently. -- Nadav Har'El|Sunday, Sep 16 2012, 29 Elul 5772 n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |Does replacing myself with a shell-script http://nadav.harel.org.il |make me impressive or insignificant?
Re: Multi-dimensional Paging in Nested virtualization
On Tue, Sep 11, 2012, siddhesh phadke wrote about "Multi-dimensional Paging in Nested virtualization": > I read turtles project paper where they have explained how > multi-dimensional page tables are built on L0. L2 is launched with > empty EPT 0->2 and EPT 0->2 is built on-the-fly. > I tried to find out how this is done in kvm code but i could not find > where EPT 0->2 is built. Nested EPT is not yet included in the mainline KVM. The original nested EPT code that we had written as part of the Turtles paper became obsolete when much of KVM's MMU code was rewritten. I have since rewritten the nested EPT code for the modern KVM. I sent the second (latest) version of these patches to the KVM mailing list in August, and you can find them in, for example, http://comments.gmane.org/gmane.comp.emulators.kvm.devel/95395 These patches have not yet been accepted into KVM. They have bugs in various setups (which I have not yet found the time to fix, unfortunately), and some known issues found by Avi Kivity on this mailing list. > Does L1 handle ept violation first and then L0 updates its EPT0->2? > How this is done? This is explained in the Turtles paper, but here's the short story: L1 defines an EPT table for L2 which we call EPT12. L0 builds from this an EPT02, with L1 addresses changed to L0 addresses. Now, when L2 runs and we get an EPT violation, we exit to L0 (in nested VMX, any exit first gets to L0). L0 checks whether the translation is missing in EPT12; if it is, it emulates an exit into L1 - injecting the EPT violation into L1. But if the translation wasn't missing in EPT12, then it's L0's problem, and we just need to update EPT02. > Can anybody give me some pointers about where to look into the code? Please look at the patches above. Each patch is also documented. Nadav. 
-- Nadav Har'El| Thursday, Sep 13 2012, 26 Elul 5772 n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |error compiling committee.c: too many http://nadav.harel.org.il |arguments to function
Re: [PATCH 02/10] nEPT: Add EPT tables support to paging_tmpl.h
On Thu, Aug 02, 2012, Xiao Guangrong wrote about "Re: [PATCH 02/10] nEPT: Add EPT tables support to paging_tmpl.h": > > + #ifdef CONFIG_X86_64 > > + #define PT_MAX_FULL_LEVELS 4 > > + #define CMPXCHG cmpxchg > > + #else > > + #define CMPXCHG cmpxchg64 > > + #define PT_MAX_FULL_LEVELS 2 > > + #endif > > Missing the case of FULL_LEVELS == 3? Oh, you mentioned it > as PAE case in the PATCH 0. I understood this differently (and it would not be surprising if wrongly...): With nested EPT, we only deal with two *EPT* tables - the shadowed page table and shadow page table are both EPT. And EPT tables cannot have three levels - even if PAE is used. Or at least, that's what I thought... > Note A/D bits are supported on new intel cpus, this function should be > reworked > for nept. I know you did not export this feather to guest, but we can reduce > the difference between nept and other mmu models if A/D are supported. I'm not sure what you meant: If the access/dirty bits are supported in newer cpus, do you think we *should* support them also in the L1 processor, or are you saying that it would be easier to support them because this is what the shadow page table code normally does anyway, so *not* supporting them will take effort? > > +#if PTTYPE != PTTYPE_EPT > > static int FNAME(walk_addr_nested)(struct guest_walker *walker, > >struct kvm_vcpu *vcpu, gva_t addr, > >u32 access) > > @@ -335,6 +395,7 @@ static int FNAME(walk_addr_nested)(struc > > return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu, > > addr, access); > > } > > +#endif > > > > Hmm, you do not need the special walking functions? Since these functions are static, the compiler warns me on every function that is never used, so I had to #if them out... -- Nadav Har'El| Thursday, Aug 2 2012, 15 Av 5772 n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |It's fortunate I have bad luck - without http://nadav.harel.org.il |it I would have no luck at all! 
Re: Nested kvm_intel broken on pre 3.3 hosts
On Wed, Aug 01, 2012, Avi Kivity wrote about "Re: Nested kvm_intel broken on pre 3.3 hosts": > Right - it's not just kvm-as-a-guest that will trip on this. But > there's no point in everyone backporting it on their own. If you're > doing the backport, please post it here and we'll forward it to the > stable branch. If I understand correctly, the failure occurs because new versions of KVM refuse to work if the processor doesn't support CPU_BASED_RDPMC_EXITING - which older versions of nested VMX didn't say that they did. But must the KVM guest refuse to work if this feature isn't supported? I.e., why not move in setup_vmcs_config() the CPU_BASED_RDPMC_EXITING from "min" to "opt"? Isn't losing the PMU feature a lesser evil than not working at all? In any case, perhaps the original reporter can use this as a workaround, at least, because it requires modifying the (L1) guest, not the host. -- Nadav Har'El|Wednesday, Aug 1 2012, 13 Av 5772 n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |May you live as long as you want - and http://nadav.harel.org.il |never want as long as you live.
[PATCH 10/10] nEPT: Miscellaneous cleanups
Some trivial code cleanups not really related to nested EPT. Signed-off-by: Nadav Har'El --- arch/x86/kvm/vmx.c |6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) --- .before/arch/x86/kvm/vmx.c 2012-08-01 17:22:47.0 +0300 +++ .after/arch/x86/kvm/vmx.c 2012-08-01 17:22:47.0 +0300 @@ -616,7 +616,6 @@ static void nested_release_page_clean(st static u64 construct_eptp(unsigned long root_hpa); static void kvm_cpu_vmxon(u64 addr); static void kvm_cpu_vmxoff(void); -static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3); static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr); static void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg); @@ -895,8 +894,7 @@ static inline bool nested_cpu_has2(struc (vmcs12->secondary_vm_exec_control & bit); } -static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12, - struct kvm_vcpu *vcpu) +static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12) { return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS; } @@ -6135,7 +6133,7 @@ static int vmx_handle_exit(struct kvm_vc if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked && !(is_guest_mode(vcpu) && nested_cpu_has_virtual_nmis( - get_vmcs12(vcpu), vcpu { + get_vmcs12(vcpu) { if (vmx_interrupt_allowed(vcpu)) { vmx->soft_vnmi_blocked = 0; } else if (vmx->vnmi_blocked_time > 10LL && -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 09/10] nEPT: Documentation
Update the documentation to no longer say that nested EPT is not supported. Signed-off-by: Nadav Har'El --- Documentation/virtual/kvm/nested-vmx.txt |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) --- .before/Documentation/virtual/kvm/nested-vmx.txt2012-08-01 17:22:47.0 +0300 +++ .after/Documentation/virtual/kvm/nested-vmx.txt 2012-08-01 17:22:47.0 +0300 @@ -38,8 +38,8 @@ The current code supports running Linux Only 64-bit guest hypervisors are supported. Additional patches for running Windows under guest KVM, and Linux under -guest VMware server, and support for nested EPT, are currently running in -the lab, and will be sent as follow-on patchsets. +guest VMware server, are currently running in the lab, and will be sent as +follow-on patchsets. Running nested VMX -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 08/10] nEPT: Nested INVEPT
If we let L1 use EPT, we should probably also support the INVEPT instruction. In our current nested EPT implementation, when L1 changes its EPT table for L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in the course of this modification already calls INVEPT. Therefore, when L1 calls INVEPT, we don't really need to do anything. In particular we *don't* need to call the real INVEPT again. All we do in our INVEPT is verify the validity of the call, and its parameters, and then do nothing. In KVM Forum 2010, Dong et al. presented "Nested Virtualization Friendly KVM" and classified our current nested EPT implementation as "shadow-like virtual EPT". He recommended instead a different approach, which he called "VTLB-like virtual EPT". If we had taken that alternative approach, INVEPT would have had a bigger role: L0 would only rebuild the shadow EPT table when L1 calls INVEPT. Signed-off-by: Nadav Har'El --- arch/x86/include/asm/vmx.h |2 arch/x86/kvm/vmx.c | 87 +++ 2 files changed, 89 insertions(+) --- .before/arch/x86/include/asm/vmx.h 2012-08-01 17:22:47.0 +0300 +++ .after/arch/x86/include/asm/vmx.h 2012-08-01 17:22:47.0 +0300 @@ -280,6 +280,7 @@ enum vmcs_field { #define EXIT_REASON_APIC_ACCESS 44 #define EXIT_REASON_EPT_VIOLATION 48 #define EXIT_REASON_EPT_MISCONFIG 49 +#define EXIT_REASON_INVEPT 50 #define EXIT_REASON_WBINVD 54 #define EXIT_REASON_XSETBV 55 #define EXIT_REASON_INVPCID58 @@ -406,6 +407,7 @@ enum vmcs_field { #define VMX_EPTP_WB_BIT(1ull << 14) #define VMX_EPT_2MB_PAGE_BIT (1ull << 16) #define VMX_EPT_1GB_PAGE_BIT (1ull << 17) +#define VMX_EPT_INVEPT_BIT (1ull << 20) #define VMX_EPT_AD_BIT (1ull << 21) #define VMX_EPT_EXTENT_INDIVIDUAL_BIT (1ull << 24) #define VMX_EPT_EXTENT_CONTEXT_BIT (1ull << 25) --- .before/arch/x86/kvm/vmx.c 2012-08-01 17:22:47.0 +0300 +++ .after/arch/x86/kvm/vmx.c 2012-08-01 17:22:47.0 +0300 @@ -2026,6 +2026,10 @@ static __init void nested_vmx_setup_ctls /* nested EPT: emulate EPT also to L1 */ 
nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT; nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT; + nested_vmx_ept_caps |= + VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT | + VMX_EPT_EXTENT_CONTEXT_BIT | + VMX_EPT_EXTENT_INDIVIDUAL_BIT; nested_vmx_ept_caps &= vmx_capability.ept; } else nested_vmx_ept_caps = 0; @@ -5702,6 +5706,87 @@ static int handle_vmptrst(struct kvm_vcp return 1; } +/* Emulate the INVEPT instruction */ +static int handle_invept(struct kvm_vcpu *vcpu) +{ + u32 vmx_instruction_info; + unsigned long type; + gva_t gva; + struct x86_exception e; + struct { + u64 eptp, gpa; + } operand; + + if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) || + !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) { + kvm_queue_exception(vcpu, UD_VECTOR); + return 1; + } + + if (!nested_vmx_check_permission(vcpu)) + return 1; + + if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) { + kvm_queue_exception(vcpu, UD_VECTOR); + return 1; + } + + /* According to the Intel VMX instruction reference, the memory +* operand is read even if it isn't needed (e.g., for type==global) +*/ + vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); + if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION), + vmx_instruction_info, &gva)) + return 1; + if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand, + sizeof(operand), &e)) { + kvm_inject_page_fault(vcpu, &e); + return 1; + } + + type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf); + + switch (type) { + case VMX_EPT_EXTENT_GLOBAL: + if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_GLOBAL_BIT)) + nested_vmx_failValid(vcpu, + VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID); + else { + /* +* Do nothing: when L1 changes EPT12, we already +* update EPT02 (the shadow EPT table) and call INVEPT. +* So when L1 calls INVEPT, there's nothing left to do. +*/
[PATCH 07/10] nEPT: Advertise EPT to L1
Advertise the support of EPT to the L1 guest, through the appropriate MSR. This is the last patch of the basic Nested EPT feature, so as to allow bisection through this patch series: The guest will not see EPT support until this last patch, and will not attempt to use the half-applied feature. Signed-off-by: Nadav Har'El --- arch/x86/kvm/vmx.c | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) --- .before/arch/x86/kvm/vmx.c 2012-08-01 17:22:47.0 +0300 +++ .after/arch/x86/kvm/vmx.c 2012-08-01 17:22:47.0 +0300 @@ -1946,6 +1946,7 @@ static u32 nested_vmx_secondary_ctls_low static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high; static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high; static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high; +static u32 nested_vmx_ept_caps; static __init void nested_vmx_setup_ctls_msrs(void) { /* @@ -2021,6 +2022,14 @@ static __init void nested_vmx_setup_ctls nested_vmx_secondary_ctls_low = 0; nested_vmx_secondary_ctls_high &= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; + if (enable_ept) { + /* nested EPT: emulate EPT also to L1 */ + nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT; + nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT; + nested_vmx_ept_caps &= vmx_capability.ept; + } else + nested_vmx_ept_caps = 0; + } static inline bool vmx_control_verify(u32 control, u32 low, u32 high) @@ -2120,8 +2129,8 @@ static int vmx_get_vmx_msr(struct kvm_vc nested_vmx_secondary_ctls_high); break; case MSR_IA32_VMX_EPT_VPID_CAP: - /* Currently, no nested ept or nested vpid */ - *pdata = 0; + /* Currently, no nested vpid support */ + *pdata = nested_vmx_ept_caps; break; default: return 0; -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 06/10] nEPT: Some additional comments
Some additional comments to preexisting code: Explain who (L0 or L1) handles EPT violation and misconfiguration exits. Don't mention "shadow on either EPT or shadow" as the only two options. Signed-off-by: Nadav Har'El --- arch/x86/kvm/vmx.c | 21 +++-- 1 file changed, 19 insertions(+), 2 deletions(-) --- .before/arch/x86/kvm/vmx.c 2012-08-01 17:22:47.0 +0300 +++ .after/arch/x86/kvm/vmx.c 2012-08-01 17:22:47.0 +0300 @@ -5952,7 +5952,20 @@ static bool nested_vmx_exit_handled(stru return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES); case EXIT_REASON_EPT_VIOLATION: + /* +* L0 always deals with the EPT violation. If nested EPT is +* used, and the nested mmu code discovers that the address is +* missing in the guest EPT table (EPT12), the EPT violation +* will be injected with nested_ept_inject_page_fault() +*/ + return 0; case EXIT_REASON_EPT_MISCONFIG: + /* +* L2 never uses directly L1's EPT, but rather L0's own EPT +* table (shadow on EPT) or a merged EPT table that L0 built +* (EPT on EPT). So any problems with the structure of the +* table is L0's fault. +*/ return 0; case EXIT_REASON_WBINVD: return nested_cpu_has2(vmcs12, SECONDARY_EXEC_WBINVD_EXITING); @@ -6881,7 +6894,12 @@ static void prepare_vmcs02(struct kvm_vc vmx_set_cr4(vcpu, vmcs12->guest_cr4); vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12)); - /* shadow page tables on either EPT or shadow page tables */ + /* +* Note that kvm_set_cr3() and kvm_mmu_reset_context() will do the +* right thing, and set GUEST_CR3 and/or EPT_POINTER in all supported +* settings: 1. shadow page tables on shadow page tables, 2. shadow +* page tables on EPT, 3. EPT on EPT. 
+*/ kvm_set_cr3(vcpu, vmcs12->guest_cr3); kvm_mmu_reset_context(vcpu); @@ -7220,7 +7238,6 @@ void load_vmcs12_host_state(struct kvm_v if (nested_cpu_has_ept(vmcs12)) nested_ept_uninit_mmu_context(vcpu); - /* shadow page tables on either EPT or shadow page tables */ kvm_set_cr3(vcpu, vmcs12->host_cr3); kvm_mmu_reset_context(vcpu); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 05/10] nEPT: Fix wrong test in kvm_set_cr3
kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical address. The problem is that with nested EPT, cr3 is an *L2* physical address, not an L1 physical address as this test expects. As the comment above this test explains, it isn't necessary, and doesn't correspond to anything a real processor would do. So this patch removes it. Note that this wrong test could have also theoretically caused problems in nested NPT, not just in nested EPT. However, in practice, the problem was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus circumventing the problem. Additional potential calls to the buggy function are avoided in that we don't trap cr3 modifications when nested NPT is enabled. However, because in nested VMX we did want to use kvm_set_cr3() (as requested in Avi Kivity's review of the original nested VMX patches), we can't avoid this problem and need to fix it. Signed-off-by: Nadav Har'El --- arch/x86/kvm/x86.c | 11 --- 1 file changed, 11 deletions(-) --- .before/arch/x86/kvm/x86.c 2012-08-01 17:22:47.0 +0300 +++ .after/arch/x86/kvm/x86.c 2012-08-01 17:22:47.0 +0300 @@ -659,17 +659,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, u */ } - /* -* Does the new cr3 value map to physical memory? (Note, we -* catch an invalid cr3 even in real-mode, because it would -* cause trouble later on when we turn on paging anyway.) -* -* A real CPU would silently accept an invalid cr3 and would -* attempt to use it - with largely undefined (and often hard -* to debug) behavior on the guest side. -*/ - if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT))) - return 1; vcpu->arch.cr3 = cr3; __set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail); vcpu->arch.mmu.new_cr3(vcpu); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 04/10] nEPT: Fix cr3 handling in nested exit and entry
The existing code for handling cr3 and related VMCS fields during nested exit and entry wasn't correct in all cases: If L2 is allowed to control cr3 (and this is indeed the case in nested EPT), during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and we forgot to do so. This patch adds this copy. If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and whoever does control cr3 (L1 or L2) is using PAE, the processor might have saved PDPTEs and we should also save them in vmcs12 (and restore later). Signed-off-by: Nadav Har'El --- arch/x86/kvm/vmx.c | 30 ++ 1 file changed, 30 insertions(+) --- .before/arch/x86/kvm/vmx.c 2012-08-01 17:22:46.0 +0300 +++ .after/arch/x86/kvm/vmx.c 2012-08-01 17:22:46.0 +0300 @@ -6885,6 +6885,17 @@ static void prepare_vmcs02(struct kvm_vc kvm_set_cr3(vcpu, vmcs12->guest_cr3); kvm_mmu_reset_context(vcpu); + /* +* Additionally, except when L0 is using shadow page tables, L1 or +* L2 control guest_cr3 for L2, so they may also have saved PDPTEs +*/ + if (enable_ept) { + vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0); + vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1); + vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2); + vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3); + } + kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp); kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip); } @@ -7116,6 +7127,25 @@ void prepare_vmcs12(struct kvm_vcpu *vcp vmcs12->guest_pending_dbg_exceptions = vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS); + /* +* In some cases (usually, nested EPT), L2 is allowed to change its +* own CR3 without exiting. If it has changed it, we must keep it. +* Of course, if L0 is using shadow page tables, GUEST_CR3 was defined +* by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12. 
+*/ + if (enable_ept) + vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3); + /* +* Additionally, except when L0 is using shadow page tables, L1 or +* L2 control guest_cr3 for L2, so save their PDPTEs +*/ + if (enable_ept) { + vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0); + vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1); + vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2); + vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3); + } + /* TODO: These cannot have changed unless we have MSR bitmaps and * the relevant bit asks not to trap the change */ vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 03/10] nEPT: MMU context for nested EPT
KVM's existing shadow MMU code already supports nested TDP. To use it, we need to set up a new "MMU context" for nested EPT, and create a few callbacks for it (nested_ept_*()). This context should also use the EPT versions of the page table access functions (defined in the previous patch). Then, we need to switch back and forth between this nested context and the regular MMU context when switching between L1 and L2 (when L1 runs this L2 with EPT). Signed-off-by: Nadav Har'El --- arch/x86/kvm/mmu.c | 38 +++ arch/x86/kvm/mmu.h |1 arch/x86/kvm/vmx.c | 52 +++ 3 files changed, 91 insertions(+) --- .before/arch/x86/kvm/mmu.h 2012-08-01 17:22:46.0 +0300 +++ .after/arch/x86/kvm/mmu.h 2012-08-01 17:22:46.0 +0300 @@ -52,6 +52,7 @@ int kvm_mmu_get_spte_hierarchy(struct kv void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask); int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct); int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context); +int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context); static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm) { --- .before/arch/x86/kvm/mmu.c 2012-08-01 17:22:46.0 +0300 +++ .after/arch/x86/kvm/mmu.c 2012-08-01 17:22:46.0 +0300 @@ -3616,6 +3616,44 @@ int kvm_init_shadow_mmu(struct kvm_vcpu } EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu); +int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context) +{ + ASSERT(vcpu); + ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa)); + + context->shadow_root_level = kvm_x86_ops->get_tdp_level(); + + context->nx = is_nx(vcpu); /* TODO: ? 
*/ + context->new_cr3 = paging_new_cr3; + context->page_fault = EPT_page_fault; + context->gva_to_gpa = EPT_gva_to_gpa; + context->sync_page = EPT_sync_page; + context->invlpg = EPT_invlpg; + context->update_pte = EPT_update_pte; + context->free = paging_free; + context->root_level = context->shadow_root_level; + context->root_hpa = INVALID_PAGE; + context->direct_map = false; + + /* TODO: reset_rsvds_bits_mask() is not built for EPT, we need + something different. +*/ + reset_rsvds_bits_mask(vcpu, context); + + + /* TODO: I copied these from kvm_init_shadow_mmu, I don't know why + they are done, or why they write to vcpu->arch.mmu and not context +*/ + vcpu->arch.mmu.base_role.cr4_pae = !!is_pae(vcpu); + vcpu->arch.mmu.base_role.cr0_wp = is_write_protection(vcpu); + vcpu->arch.mmu.base_role.smep_andnot_wp = + kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) && + !is_write_protection(vcpu); + + return 0; +} +EXPORT_SYMBOL_GPL(kvm_init_shadow_EPT_mmu); + static int init_kvm_softmmu(struct kvm_vcpu *vcpu) { int r = kvm_init_shadow_mmu(vcpu, vcpu->arch.walk_mmu); --- .before/arch/x86/kvm/vmx.c 2012-08-01 17:22:46.0 +0300 +++ .after/arch/x86/kvm/vmx.c 2012-08-01 17:22:46.0 +0300 @@ -901,6 +901,11 @@ static inline bool nested_cpu_has_virtua return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS; } +static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12) +{ + return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT); +} + static inline bool is_exception(u32 intr_info) { return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) @@ -6591,6 +6596,46 @@ static void vmx_set_supported_cpuid(u32 entry->ecx |= bit(X86_FEATURE_VMX); } +/* Callbacks for nested_ept_init_mmu_context: */ + +static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu) +{ + /* return the page table to be shadowed - in our case, EPT12 */ + return get_vmcs12(vcpu)->ept_pointer; +} + +static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu, + struct x86_exception *fault) +{ + 
struct vmcs12 *vmcs12; + nested_vmx_vmexit(vcpu); + vmcs12 = get_vmcs12(vcpu); + /* +* Note no need to set vmcs12->vm_exit_reason as it is already copied +* from vmcs02 in nested_vmx_vmexit() above, i.e., EPT_VIOLATION. +*/ + vmcs12->exit_qualification = fault->error_code; + vmcs12->guest_physical_address = fault->address; +} + +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu) +{ + int r = kvm_init_shadow_EPT_mmu(vcpu, &vcpu->arch.mmu); + + vcpu->arch.mmu.set_cr3 = vmx_set_cr3; + vcpu->arch.mmu.get_cr3 = nested_ept_get_cr3; + vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault; + + vcpu->arch.walk_mmu = &vcpu->arch.nested_mmu; + + return r; +} + +static void nested_ept_uninit_
[PATCH 02/10] nEPT: Add EPT tables support to paging_tmpl.h
This is the first patch in a series which adds nested EPT support to KVM's nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest to set its own cr3 and take its own page faults without either of L0 or L1 getting involved. This often significantly improves L2's performance over the previous two alternatives (shadow page tables over EPT, and shadow page tables over shadow page tables). This patch adds EPT support to paging_tmpl.h. paging_tmpl.h contains the code for reading and writing page tables. The code for 32-bit and 64-bit tables is very similar, but not identical, so paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once with PTTYPE=64, and this generates the two sets of similar functions. There are subtle but important differences between the format of EPT tables and that of ordinary x86 64-bit page tables, so for nested EPT we need a third set of functions to read the guest EPT table and to write the shadow EPT table. So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions (prefixed with "EPT") which correctly read and write EPT tables. 
Signed-off-by: Nadav Har'El --- arch/x86/kvm/mmu.c | 14 + arch/x86/kvm/paging_tmpl.h | 98 --- 2 files changed, 96 insertions(+), 16 deletions(-) --- .before/arch/x86/kvm/mmu.c 2012-08-01 17:22:46.0 +0300 +++ .after/arch/x86/kvm/mmu.c 2012-08-01 17:22:46.0 +0300 @@ -1971,15 +1971,6 @@ static void shadow_walk_next(struct kvm_ return __shadow_walk_next(iterator, *iterator->sptep); } -static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp) -{ - u64 spte; - - spte = __pa(sp->spt) - | PT_PRESENT_MASK | PT_ACCESSED_MASK - | PT_WRITABLE_MASK | PT_USER_MASK; - mmu_spte_set(sptep, spte); -} static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, unsigned direct_access) @@ -3427,6 +3418,11 @@ static bool sync_mmio_spte(u64 *sptep, g return false; } +#define PTTYPE_EPT 18 /* arbitrary */ +#define PTTYPE PTTYPE_EPT +#include "paging_tmpl.h" +#undef PTTYPE + #define PTTYPE 64 #include "paging_tmpl.h" #undef PTTYPE --- .before/arch/x86/kvm/paging_tmpl.h 2012-08-01 17:22:46.0 +0300 +++ .after/arch/x86/kvm/paging_tmpl.h 2012-08-01 17:22:46.0 +0300 @@ -50,6 +50,22 @@ #define PT_LEVEL_BITS PT32_LEVEL_BITS #define PT_MAX_FULL_LEVELS 2 #define CMPXCHG cmpxchg +#elif PTTYPE == PTTYPE_EPT + #define pt_element_t u64 + #define guest_walker guest_walkerEPT + #define FNAME(name) EPT_##name + #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK + #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl) + #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl) + #define PT_INDEX(addr, level) PT64_INDEX(addr, level) + #define PT_LEVEL_BITS PT64_LEVEL_BITS + #ifdef CONFIG_X86_64 + #define PT_MAX_FULL_LEVELS 4 + #define CMPXCHG cmpxchg + #else + #define CMPXCHG cmpxchg64 + #define PT_MAX_FULL_LEVELS 2 + #endif #else #error Invalid PTTYPE value #endif @@ -78,6 +94,7 @@ static gfn_t gpte_to_gfn_lvl(pt_element_ return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT; } +#if PTTYPE != PTTYPE_EPT static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, pt_element_t __user 
*ptep_user, unsigned index, pt_element_t orig_pte, pt_element_t new_pte) @@ -100,15 +117,22 @@ static int FNAME(cmpxchg_gpte)(struct kv return (ret != orig_pte); } +#endif static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte, bool last) { unsigned access; +#if PTTYPE == PTTYPE_EPT + /* We rely here that ACC_WRITE_MASK==VMX_EPT_WRITABLE_MASK */ + access = (gpte & VMX_EPT_WRITABLE_MASK) | ACC_USER_MASK | + ((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0); +#else access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK; if (last && !is_dirty_gpte(gpte)) access &= ~ACC_WRITE_MASK; +#endif #if PTTYPE == 64 if (vcpu->arch.mmu.nx) @@ -135,6 +159,30 @@ static bool FNAME(is_last_gpte)(struct g return false; } +static inline int FNAME(is_present_gpte)(unsigned long pte) +{ +#if PTTYPE == PTTYPE_EPT + return pte & (VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK | + VMX_EPT_EXECUTABLE_MASK); +#else + return is_present_gpte(pte); +#endif +} + +static inline int FNAME(check_write_user_access)(struct kvm_vcpu *vcpu, + bool write_fau
[PATCH 01/10] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577, switches the EFER MSR when EPT is used and the host and guest have different NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2) and want to be able to run recent KVM as L1, we need to allow L1 to use this EFER switching feature.

To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available, and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former (the latter is still unsupported).

Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state, respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all that's left to do in this patch is to properly advertise this feature to L1.

Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using vmx_set_efer (which itself sets one of several vmcs02 fields), so we always support this feature, regardless of whether the host supports it.

Signed-off-by: Nadav Har'El
---
 arch/x86/kvm/vmx.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2012-08-01 17:22:46.0 +0300
+++ .after/arch/x86/kvm/vmx.c	2012-08-01 17:22:46.0 +0300
@@ -1976,6 +1976,7 @@ static __init void nested_vmx_setup_ctls
 #else
 	nested_vmx_exit_ctls_high = 0;
 #endif
+	nested_vmx_exit_ctls_high |= VM_EXIT_LOAD_IA32_EFER;
 
 	/* entry controls */
 	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
@@ -1983,6 +1984,7 @@ static __init void nested_vmx_setup_ctls
 	nested_vmx_entry_ctls_low = 0;
 	nested_vmx_entry_ctls_high &=
 		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
+	nested_vmx_entry_ctls_high |= VM_ENTRY_LOAD_IA32_EFER;
 
 	/* cpu-based controls */
 	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
@@ -6768,10 +6770,18 @@ static void prepare_vmcs02(struct kvm_vc
 	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 
-	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
-	vmcs_write32(VM_EXIT_CONTROLS,
-		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
-	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
+	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
+	 * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
+	 * bits are further modified by vmx_set_efer() below.
+	 */
+	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
+
+	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
+	 * emulated by vmx_set_efer(), below.
+	 */
+	vmcs_write32(VM_ENTRY_CONTROLS,
+		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
+			~VM_ENTRY_IA32E_MODE) |
 		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
 
 	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/10] nEPT v2: Nested EPT support for Nested VMX
The following patches add nested EPT support to Nested VMX. This is the second version of this patch set. Most of the issues from the previous reviews were handled, and in particular there is now a new variant of paging_tmpl for EPT page tables. However, while this version does work in my tests, there are still some known problems/bugs with this version and unhandled issues from the previous review:

1. 32-bit *PAE* L2s currently don't work. Non-PAE 32-bit L2s do work (and so do, of course, 64-bit L2s).

2. nested_ept_inject_page_fault() assumes vm_exit_reason is already set to EPT_VIOLATION. However, it is conceivable that L0 emulates some L2 instruction, and during this emulation we read some L2 memory, causing a need to exit (from L2 to L1) with an EPT violation.

3. Moreover, nested_ept_inject_page_fault() now always causes an EPT_VIOLATION, with vmcs12->exit_qualification = fault->error_code. This is wrong: first, fault->error_code is not in exit-qualification format but in PFERR_* format. Moreover, PFERR_RSVD_MASK would mean we need to cause an EPT_MISCONFIG, NOT an EPT_VIOLATION. Instead of trying to fix this by translating PFERR to exit_qualification, we should calculate and remember in walk_addr() the exit qualification (and an additional bit: whether it's an EPT VIOLATION or MISCONFIGURATION). This will be remembered in new fields in x86_exception. Avi suggested: "[add to x86_exception] another bool, to distinguish between EPT VIOLATION and EPT_QUALIFICATION. The error_code field should be extended to 64 bits for EXIT_QUALIFICATION (though only bits 0-12 are defined). You need another field for the guest linear address. EXIT_QUALIFICATION has to be calculated, it cannot be derived from the original exit. Look at kvm_propagate_fault()." He also added: "If we're injecting an EPT VIOLATION to L1 (because we weren't able to resolve it; say L1 write-protected the page), then we need to compute EXIT_QUALIFICATION. Bits 3-5 of EXIT_QUALIFICATION are computed from EPT12 paging structure entries (easy to derive them from pt_access/pte_access)."

4. Also, nested_ept_inject_page_fault() doesn't set the guest linear address.

5. There are several "TODO"s left in the code.

If there's any volunteer willing to help me with some of these issues, it would be great :-)

About nested EPT: Nested EPT means emulating EPT for an L1 guest, allowing it to use EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest to set its own cr3 and take its own page faults without either of L0 or L1 getting involved. In many workloads this significantly improves L2's performance over the previous two alternatives (shadow page tables over EPT, and shadow page tables over shadow page tables). Our paper [1] described these three options, and the advantages of nested EPT ("multidimensional paging" in the paper).

Nested EPT is enabled by default (if the hardware supports EPT), so users do not have to do anything special to enjoy the performance improvement that this patch gives to L2 guests. L1 may of course choose not to use nested EPT, by simply not using EPT (e.g., a KVM in L1 may use the "ept=0" option).

Just as a non-scientific, non-representative indication of the kind of dramatic performance improvement you may see in workloads that have a lot of context switches and page faults, here is a measurement of the time an example single-threaded "make" took in L2 (kvm over kvm):

  shadow over shadow: 105 seconds ("ept=0" in L0 forces this)
  shadow over EPT:     87 seconds (the previous default; can be forced with "ept=0" in L1)
  EPT over EPT:        29 seconds (the default after this patch)

Note that the same test on L1 (with EPT) took 25 seconds, so for this example workload, performance of nested virtualization is now very close to that of single-level virtualization.
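For issue 3 above, the permission part of the computed EXIT_QUALIFICATION could be derived roughly as follows. This is only a sketch with invented helper names and an invented access-bit encoding; the bit layout it produces (bit 3 = readable, bit 4 = writable, bit 5 = executable) follows the SDM's EPT-violation definition:

```c
#include <assert.h>
#include <stdint.h>

/* Invented access-bit encoding for this illustration (roughly what
 * pt_access/pte_access carry). */
#define EPT_ACC_R 1u
#define EPT_ACC_W 2u
#define EPT_ACC_X 4u

/* Sketch: bits 3-5 of an EPT-violation EXIT_QUALIFICATION report whether
 * the guest-physical address was readable/writable/executable under the
 * EPT12 paging structures - exactly what pt_access/pte_access encode. */
static uint64_t eptv_qual_perm_bits(unsigned pte_access)
{
    uint64_t qual = 0;

    if (pte_access & EPT_ACC_R)
        qual |= 1ull << 3;  /* GPA was readable */
    if (pte_access & EPT_ACC_W)
        qual |= 1ull << 4;  /* GPA was writable */
    if (pte_access & EPT_ACC_X)
        qual |= 1ull << 5;  /* GPA was executable */
    return qual;
}
```

A write-protected page (R+X but not W), the case Avi mentions, would thus yield bits 3 and 5 set but not bit 4.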
[1] "The Turtles Project: Design and Implementation of Nested Virtualization", http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf

Patch statistics:

 Documentation/virtual/kvm/nested-vmx.txt |   4
 arch/x86/include/asm/vmx.h               |   2
 arch/x86/kvm/mmu.c                       |  52 +++-
 arch/x86/kvm/mmu.h                       |   1
 arch/x86/kvm/paging_tmpl.h               |  98 -
 arch/x86/kvm/vmx.c                       | 227 +++++++--
 arch/x86/kvm/x86.c                       |  11 -
 7 files changed, 354 insertions(+), 41 deletions(-)

--
Nadav Har'El
IBM Haifa Research Lab
Re: Tracking nested guest ioctl in L0 hypervisor
On Mon, Jul 30, 2012, siddhesh phadke wrote about "Tracking nested guest ioctl in L0 hypervisor":
> I am trying to understand KVM code for nested virtualization and my
> goal is to find whether an ioctl performed by L2 guest can be
> intercepted in L0.
>
> Hence just for experimental purpose I wrote an blank ioctl in L2
> guest. When that ioctl is received by L1 KVM hypervisor, it uses
> kvm_hypercall0() mentioned in kvm_para.h to notify L0. Am I doing this
> correct or is there any other method to do the same or I am completely
> off the track?
>
> Can anyone please help me with this?

Do you really mean an *ioctl* in L2 - which is just a system call in L2 (and never intercepted by L0 or L1), or a *hypercall*? From the mention of kvm_hypercall0() it sounds like you mean a hypercall.

As you can see in vmx.c, nested_vmx_exit_handled(), when L0 receives a VMCALL exit (i.e., a hypercall) from L2, we return 1 - meaning that we exit to L1 so that it can handle this hypercall. I believe that this is the more sensible behavior, but if you want L0 to handle hypercalls, you can, in the EXIT_REASON_VMCALL case in that function, return 0, which would cause L0 to handle this exit.

--
Nadav Har'El              | Wednesday, Aug 1 2012, 13 Av 5772
n...@math.technion.ac.il  | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il | Despite the cost of living, have you noticed how it remains so popular?
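The return-value convention described above can be condensed into a toy decision function. The exit-reason value is the real VMX number for VMCALL, but the function itself is only an illustration of the convention, not KVM's actual nested_vmx_exit_handled():

```c
#include <assert.h>

#define EXIT_REASON_VMCALL 18  /* real VMX basic exit reason for VMCALL */

/* Illustrative sketch: return 1 to reflect an L2 exit to L1, 0 to have
 * L0 handle it - the convention nested_vmx_exit_handled() uses. */
static int nested_exit_reflected_to_l1(int exit_reason, int l0_handles_hypercalls)
{
    switch (exit_reason) {
    case EXIT_REASON_VMCALL:
        /* KVM returns 1 here so L1 sees its guest's hypercalls; as the
         * mail notes, returning 0 instead would make L0 handle them. */
        return l0_handles_hypercalls ? 0 : 1;
    default:
        /* (the real function inspects many more exit reasons here) */
        return 1;
    }
}
```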
Re: what is this function mean "is_no_device"?
On Wed, Jun 20, 2012, sheng qiu wrote about "what is this function mean "is_no_device"?":
> Hi all,
>
> does anyone can explain what "is_no_device()" and vmx_fpu_activate(vcpu) do?

This is a relatively complicated and delicate issue, so I suggest you also refer to Intel's spec for more details, and dig deeper into the relevant Linux code (not just in KVM). But I'll try to give you the gist of it:

When an OS does a context switch (we're talking about an OS now, not yet a hypervisor), it needs to save the old process's registers, and load the saved registers of the new process. The original 80386 had few enough registers to make this switch short enough. But with the math coprocessor (the 80387 - aka the FPU, Floating Point Unit), there was a problem: it had a bunch of long (80 bits and more) registers, and switching them all the time was a serious performance problem. This was even more of a waste since most processes didn't actually use the FPU - so all this switching was usually done for nothing.

So a trick was invented - "lazy FPU loading". The idea is that when the OS switches to a different process, it does NOT load the new process's saved FPU registers. If this process never uses the FPU, we saved the cost of this load. But what if it *does* use the FPU? A new bit was added to the CR0 register, the "TS" (task switched) bit. When TS is 1, the processor "pretends" that there is no FPU (just like the original 8086 had no FPU), so every floating-point operation throws an #NM ("no math") exception. The OS catches this exception, now finally loads the current task's FPU registers, and zeros the TS bit. The floating-point operation now restarts, and since the TS bit is off, it succeeds (and uses the right content in the registers).

The is_no_device() function you noticed checks if an exception is the #NM exception. "Device not available" or "no device" are alternative names of the #NM exception.
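The lazy-FPU scheme described above can be modeled in a few lines of user-space C. All names here (switch_to, fpu_add, cr0_ts) are invented for illustration - on real hardware the TS check and the #NM trap happen in the processor, not in software:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy user-space model of lazy FPU loading - not kernel code. */
struct task {
    double fpu_regs[8];  /* saved FPU state (simplified) */
    bool fpu_loaded;     /* is this task's FPU state in the "hardware"? */
};

static bool cr0_ts;             /* models the CR0.TS ("task switched") bit */
static struct task *current_task;
static int nm_exceptions;       /* how many #NM traps were taken */

/* Context switch: do NOT load the next task's FPU state; just set TS so
 * the first FPU use will trap. */
static void switch_to(struct task *next)
{
    current_task = next;
    cr0_ts = true;
}

/* Models any FPU instruction: if TS is set, take #NM, lazily load the
 * current task's FPU registers, clear TS, and "restart" the operation. */
static double fpu_add(double a, double b)
{
    if (cr0_ts) {
        nm_exceptions++;                  /* #NM: "device not available" */
        current_task->fpu_loaded = true;  /* now load the saved FPU regs */
        cr0_ts = false;                   /* further FPU use is free */
    }
    return a + b;
}
```

A task that never touches the FPU costs nothing extra at switch time; a task that does pays exactly one trap per time slice, after which FPU use is free until the next switch.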
To understand vmx_fpu_activate() you need to understand now what KVM does when both the host processes (after all, KVM is Linux) and processes in the guest, use the FPU. To make a really long story short, once any process in the guest uses the FPU, KVM uses "vmx_fpu_activate()" to say that this guest now has full control of the TS bit, and the NM exception. This will allow the guest OS to play its usual "lazy FPU loading" tricks without the host needing to get involved - the host will only care about the FPU when we later switch tasks from this guest to another guest or Linux process. And vice versa - the guest might think it wants TS to be 0 (because it had already set the FPU registers correctly for the current task) but the host needs it to be 1 (because unknown to the guest, Linux switched to a different host process and loaded its FPU registers). All of these games are done using VMX's CR0 shadowing features (which you can read about in the VMX spec). If you think all of this was complicated, just think what it takes to do all of this correctly in *nested* virtualization (where the host, guest, and guest's guest all want to delay loading the FPU registers) - it took me about a month to get that working without bugs ;-) -- Nadav Har'El| Thursday, Jun 21 2012, n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |From the Linux getopt(3) manpage: "BUGS: http://nadav.harel.org.il |This manpage is confusing." -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM handling external interrupts
> - You discuss interrupt delivery without stating that you have MSIs in
>   mind. Some aspects may be helpful for legacy interrupts as well, but
>   you obviously can't achieve exit-less operation there. Not an issue,
>   should just be made clear.

Can you elaborate on why exit-less operation cannot be achieved without MSI? Doesn't the VMCS flag to avoid exiting on external interrupts apply to any interrupt? Or is it something else that won't work? In any case, you're right that our implementation and tests all used MSI.

> > need paravirtualize the guest: no if you have x2APIC.
>
> ...and the guest makes use of it. This excludes older OSes. When did
> Windows start to use it?

If you can't use x2APIC, and don't want to paravirtualize the guest, you still get exit-less interrupt *delivery*, which as we showed in the benchmarks, gets you more than half of the performance improvement (although with newer KVM's improvement in EOI emulation performance, the over-half improvement should be somewhat less pronounced).

Nadav.
Re: [PATCH v2] KVM: x86: Implement PCID/INVPCID for guests with EPT
On Mon, May 14, 2012, Marcelo Tosatti wrote about "Re: [PATCH v2] KVM: x86: Implement PCID/INVPCID for guests with EPT":
> > +	 * Enable INVPCID for non-ept guests may cause performance regression,
> > +	 * and without INVPCID, PCID has little benefits. So disable them all
> > +	 * for non-ept guests.
> > +	 *
> > +	 * PCID is not supported for nested guests yet.
> > +	 */
> > +	return enable_ept && (boot_cpu_data.x86_capability[4] &
> > 		bit(X86_FEATURE_PCID)) && !cpu_has_hypervisor;
> > +}
>
> The comment Avi made was regarding running a nested guest, not running
> _as_ a nested guest (which is what cpu_has_hypervisor is about).
>
> You can disable INVPCID exec control (which #UDs), if its in Level-2
> guest mode (see if_guest_mode()), and restore the Level-1 value when
> leaving nested mode.

I also don't understand the !cpu_has_hypervisor thing.

Regarding running a nested guest - the code in prepare_vmcs02() sets (among other things) the secondary exec controls used to run the L2 guest. It mostly "or"s the bits requested by KVM (L0) and the guest (L1). So:

1. If you think L0 mustn't run L2 with a certain bit, despite running L1 with it (I didn't follow the reasoning why this is the case), you can specifically turn it off in that function (like we already do for SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES).

2. The other thing that could theoretically happen is that L1 will ask to turn on this bit for L2 (I think this is the case you wanted to avoid). This *won't* happen, because we tell L1 via MSR which bits it is allowed to set on secondary controls (see nested_vmx_secondary_ctls_high), and currently this only includes SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES and SECONDARY_EXEC_ENABLE_EPT (the latter only for nested EPT), and enforce that L1 doesn't try to turn on more bits.

--
Nadav Har'El              | Tuesday, May 15 2012,
n...@math.technion.ac.il  | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il | A bird in the hand is safer than one overhead.
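The two points above amount to a simple composition rule, sketched here as a toy model. The bit positions match the SDM's secondary processor-based controls, but the function is a simplification of prepare_vmcs02(), not KVM's code:

```c
#include <assert.h>
#include <stdint.h>

/* Real SDM bit positions for a few secondary exec controls. */
#define SEC_EXEC_VIRT_APIC_ACCESSES (1u << 0)
#define SEC_EXEC_ENABLE_EPT         (1u << 1)
#define SEC_EXEC_ENABLE_INVPCID     (1u << 12)

/* Toy model: vmcs02 controls = (what L0 wants) | (what L1 asked for),
 * where L1's request is capped by the capability MSR we advertised to
 * it, and any bits L0 refuses to run L2 with are stripped at the end. */
static uint32_t vmcs02_secondary_ctls(uint32_t l0_ctls, uint32_t l1_ctls,
                                      uint32_t advertised_to_l1,
                                      uint32_t stripped_for_l2)
{
    return (l0_ctls | (l1_ctls & advertised_to_l1)) & ~stripped_for_l2;
}
```

Because INVPCID is not in the advertised set, an L1 request for it is silently dropped; point 1's explicit strip mask handles bits L0 itself must not run L2 with.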
Re: Networking performance on a KVM Host (with no guests)
On Sat, Apr 21, 2012, Chegu Vinod wrote about "Re: Networking performance on a KVM Host (with no guests)":
> some traffic through between the hosts... then we can't expect to get line
> rate on that private NIC.

If you're not using device assignment, then just disable intel_iommu. If you *are* using device assignment, try setting intel_iommu to something other than on. Maybe intel_iommu=pt will work (see http://lwn.net/Articles/329174/). But I actually never tried this myself - so let me know if it works ;-)

Is anyone else aware of a different "best practice" on how to enable VT-d for device assignment, without any changes (in performance or security) in the host?

--
Nadav Har'El              | Sunday, Apr 22 2012,
n...@math.technion.ac.il  | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il | Sign in zoo: Do not feed the animals. If you have food give it to the guard on duty
Re: Networking performance on a KVM Host (with no guests)
On Fri, Apr 20, 2012, Chegu Vinod wrote about "Re: Networking performance on a KVM Host (with no guests)":
> Removing the "intel_iommu=on" boot time parameter in the Config 1
> case seemed to help

"intel_iommu=on" is essential if you're mostly running guests *and* using device assignment. However, unfortunately, it also has a serious performance penalty for I/O in *host* processes: When intel_iommu=on, Linux (completely unrelated to KVM) adds a new level of protection which didn't exist without an IOMMU - the network card, which without an IOMMU could write (via DMA) to any memory location, is now not allowed to - the card can only write to memory locations to which the OS wants it to write. Theoretically, this can protect the OS against various kinds of attacks. But what happens now is that every time Linux passes a new buffer to the card, it needs to change the IOMMU mappings. This noticeably slows down I/O, unfortunately.

--
Nadav Har'El              | Friday, Apr 20 2012,
n...@math.technion.ac.il  | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il | A bird in the hand is safer than one overhead.
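A back-of-the-envelope model of that per-buffer cost - names invented, and this only counts operations rather than modeling a real driver - makes the asymmetry visible:

```c
#include <assert.h>

/* With intel_iommu=on, every buffer handed to the NIC costs an IOMMU map
 * before the DMA and an unmap (with IOTLB flush) after it; with the IOMMU
 * off, or in passthrough (intel_iommu=pt) mode, that per-packet mapping
 * work disappears. Purely illustrative. */
static long iommu_ops;

static void xmit_one_buffer(int iommu_strict)
{
    if (iommu_strict) {
        iommu_ops++;  /* map: make just this buffer visible to the card */
        /* ... card DMAs the packet from the buffer ... */
        iommu_ops++;  /* unmap + IOTLB flush once the card is done */
    }
    /* else: the card can DMA directly; no per-buffer mapping work */
}
```

The mapping work scales with packet rate, which is why the penalty shows up precisely in high-throughput host networking even with no guests running.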
Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump
On Wed, Apr 18, 2012, Avi Kivity wrote about "Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump": > Right; they're also not required to be in memory at all - the processor > can cache them, even for VMCSs that are not active at this time. > Running VMXOFF at panic time can fix that, but you have to broadcast it > to all processors, probably using NMI... I believe that a VMCLEAR ensures that the VMCS is written back to memory. KVM uses this fact when migrating a VMCS between two separate physical CPUs - it runs VMCLEAR on the old CPU, to write the VMCS to memory, and then VMPTRLD on the new CPU. So you don't need to VMXOFF, but do need to VMCLEAR. But there's still the complication that you mention - you need to do the VMCLEAR on the right processor... -- Nadav Har'El| Wednesday, Apr 18 2012, n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |The only "intuitive" interface is the http://nadav.harel.org.il |nipple. After that, it's all learned. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Has any work 3.3 kvm-kmod for rhel 6.2 kernel successfully?
On Mon, Apr 16, 2012, ya su wrote about "Has any work 3.3 kvm-kmod for rhel 6.2 kernel successfully?": > Hi,all: > > I try 3.3 kvm-kmod to compile against redhat 2.6.32-220.7.1, after > change some macros in external-module-compat-comm.h , > external-module-compat.h, and in some C files, finally I can compile > and run qemu-kvm(0.12 with rhel release) with 3.3 module, everything > looks fine except that screen-display can not flush correctly, it > looks like the display-card memory can not get updated in time when it > changes. > >Is there anyone that can give me some clues, thanks. I don't have an answer to your specific question (about kmod on 2.6.32), but perhaps you can provide more information - what guest were you using (also Redhat Linux with 2.6.32? something different?)? Does the problem also happen for different guests? And what do you mean by "screen display cannot flush correctly"? Are you using an emulated serial console, emulated VGA (on X? on VNC?), or what? -- Nadav Har'El|Monday, Apr 16 2012, n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |If Barbie is so popular, why do you have http://nadav.harel.org.il |to buy her friends? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Nested virtualization on Intel does not work - second level freezes when third level is starting
On Wed, Apr 11, 2012, Guido Winkelmann wrote about "Re: Nested virtualization on Intel does not work - second level freezes when third level is starting":
> No, even 2-level nesting is broken. I can run Host->Guest, but not
> Host->Guest->2nd Level Guest. I haven't even tried with a third virtualized
> level.

I see. I guess I completely misunderstood what you reported. Sorry.

I think Orit was right. 3.3rc5 had a regression in the nested support, which I discovered and Avi Kivity fixed; I didn't notice this before now, but unfortunately the fix only got to 3.4rc1 and never made it into 3.3 (I just verified, it's not in 3.3.1 but it is in 3.4). This bug displayed itself similarly to what you saw (L1 would hang when running L2).

If you can run a later kernel, I hope the problem will be solved. Otherwise, perhaps you can patch your kernel with the following patch and try again?

--- .before/arch/x86/kvm/vmx.c	2012-03-19 18:34:24.0 +0200
+++ .after/arch/x86/kvm/vmx.c	2012-03-19 18:34:24.0 +0200
@@ -2210,6 +2210,10 @@ static int vmx_set_msr(struct kvm_vcpu *
 		msr = find_msr_entry(vmx, msr_index);
 		if (msr) {
 			msr->data = data;
+			if (msr - vmx->guest_msrs < vmx->save_nmsrs)
+				kvm_set_shared_msr(msr->index, msr->data,
+						   msr->mask);
 			break;
 		}

--
Nadav Har'El              | Thursday, Apr 12 2012,
n...@math.technion.ac.il  | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il | If you tell the truth, you don't have to remember anything.
Re: Nested virtualization on Intel does not work - second level freezes when third level is starting
On Wed, Apr 11, 2012, Guido Winkelmann wrote about "Nested virtualization on Intel does not work - second level freezes when third level is starting":
> Nested virtualization on Intel does not work for me with qemu-kvm. As soon as
> the third layer OS (second virtualised) is starting the Linux kernel, the
> entire second layer freezes up. The last thing I can see console of the third

Hi,

From your description, I understand that "ordinary" (2-level) nested virtualization is working for you (host, guest and 2nd-level guest), and it's the third nesting level (guest's guest's guest) which is broken?

This is the second report of this nature in a week (see the previous report in https://bugzilla.kernel.org/show_bug.cgi?id=43068 - the details there are different), so I guess I'll need to find the time to give this issue some attention. L3 did work for me when the nested VMX patches were included in KVM, so either something broke since, or (perhaps more likely) your slightly different setups have features that my setup didn't.

But in any case, like I explain in the aforementioned URL, even if L3 would work, in the current implementation it would be extremely slow - perhaps to the point of being unusable (I think you saw this with grub performance in L3). So I wonder if you'd really want to use it, even if it worked... Just curious, what were you thinking of doing with L3?

Nadav.

--
Nadav Har'El              | Wednesday, Apr 11 2012,
n...@math.technion.ac.il  | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il | If I were two-faced, would I be wearing this one? Abraham Lincoln
Re: source for virt io backend driver
On Sun, Apr 08, 2012, Steven wrote about "Re: source for virt io backend driver": > > vhost-net. This you can find in drivers/vhost/net.c (and > > drivers/vhost/vhost.c for the generic vhost infrastructure for virtio > > drivers in the host kernel). > > Yes, I am looking for the backend code in the kernel. I found the file > for net backend drivers/vhost/net.c. However, the backend code for blk > is not there. > Could you point out this? Thanks. Indeed, vhost-block is not (yet) part of the mainline kernel. Maybe someone else more experienced in vhost-block can recommend where to get the latest version. Nadav. -- Nadav Har'El| Monday, Apr 9 2012, n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |Diplomat: A man who always remembers a http://nadav.harel.org.il |woman's birthday but never her age. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: source for virt io backend driver
On Sat, Apr 07, 2012, Steven wrote about "source for virt io backend driver": > Hi, > I would like to know where I can find the source for backend driver of > virtio device. For example, I know that the front-end net driver is >... > Can anyone help to point out where the source of net virtio backend in > the host kernel? If you use "normal" virtio, the backend is in qemu, i.e., host user space, *not* in the host kernel. So you need to look for it in qemu, not kvm. If you want the backend to be in the kernel, then you probably mean vhost-net. This you can find in drivers/vhost/net.c (and drivers/vhost/vhost.c for the generic vhost infrastructure for virtio drivers in the host kernel). Of course, also with vhost-net, qemu is involved in setting up this device. But qemu doesn't need to get involved in the data path (data, interrupts, etc.) which is done completely in the kernel, and therefore much more efficiently. -- Nadav Har'El| Sunday, Apr 8 2012, n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |Lottery: You need a dollar and a dream. http://nadav.harel.org.il |We'll take the dollar, you keep the dream. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: VT-X Locking Bit Flip in Real Mode?
On Fri, Mar 30, 2012, Jake Thomas wrote about "VT-X Locking Bit Flip in Real Mode?":
> The assumption here is that once the locking bit is in the
> "locked" position, you can't unlock it until a power cycle unlocks it.

This is indeed what the Intel spec says - unless I'm misunderstanding something.

> But I can't help but wonder if you can flip the locking bit from
> the "locked" position to the "unlocked" position if you're still in
> real mode.

What makes you think that being in real mode makes a difference? As far as I know, it doesn't. Any reason why you think it does?

--
Nadav Har'El              | Saturday, Mar 31 2012,
n...@math.technion.ac.il  | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il | Take my advice, I don't use it anyway.
Re: PATCH: nVMX: Better MSR_IA32_FEATURE_CONTROL handling
The existing code emulated the guest's use of the IA32_FEATURE_CONTROL MSR in a way that was enough to run nested VMX guests, but did not fully conform to the VMX specification, and in particular did not allow a guest BIOS to prevent the guest OS from using VMX by setting the lock bit on this MSR. This patch emulates this MSR better, allowing the guest to lock it, and verifying its setting on VMXON. Also make sure that this MSR (and of course, VMXON state) is reset on guest vcpu reset (via SIPI).

Signed-off-by: Nadav Har'El
Reported-by: Julian Stecklina
---
 arch/x86/kvm/vmx.c | 43 +++
 arch/x86/kvm/x86.c |  3 ++-
 2 files changed, 25 insertions(+), 21 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2012-03-19 18:34:24.0 +0200
+++ .after/arch/x86/kvm/vmx.c	2012-03-19 18:34:24.0 +0200
@@ -352,6 +352,7 @@ struct nested_vmx {
 	 * we must keep them pinned while L2 runs.
 	 */
 	struct page *apic_access_page;
+	u64 msr_ia32_feature_control;
 };
 
 struct vcpu_vmx {
@@ -1998,9 +1999,6 @@ static int vmx_get_vmx_msr(struct kvm_vc
 	}
 
 	switch (msr_index) {
-	case MSR_IA32_FEATURE_CONTROL:
-		*pdata = 0;
-		break;
 	case MSR_IA32_VMX_BASIC:
 		/*
 		 * This MSR reports some information about VMX support. We
@@ -2072,21 +2070,6 @@ static int vmx_get_vmx_msr(struct kvm_vc
 	return 1;
 }
 
-static int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
-{
-	if (!nested_vmx_allowed(vcpu))
-		return 0;
-
-	if (msr_index == MSR_IA32_FEATURE_CONTROL)
-		/* TODO: the right thing. */
-		return 1;
-	/*
-	 * No need to treat VMX capability MSRs specially: If we don't handle
-	 * them, handle_wrmsr will #GP(0), which is correct (they are readonly)
-	 */
-	return 0;
-}
-
 /*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.
@@ -2129,6 +2112,9 @@ static int vmx_get_msr(struct kvm_vcpu *
 	case MSR_IA32_SYSENTER_ESP:
 		data = vmcs_readl(GUEST_SYSENTER_ESP);
 		break;
+	case MSR_IA32_FEATURE_CONTROL:
+		data = to_vmx(vcpu)->nested.msr_ia32_feature_control;
+		break;
 	case MSR_TSC_AUX:
 		if (!to_vmx(vcpu)->rdtscp_enabled)
 			return 1;
@@ -2197,6 +2183,12 @@ static int vmx_set_msr(struct kvm_vcpu *
 		}
 		ret = kvm_set_msr_common(vcpu, msr_index, data);
 		break;
+	case MSR_IA32_FEATURE_CONTROL:
+		if (to_vmx(vcpu)->nested.msr_ia32_feature_control
+		    & FEATURE_CONTROL_LOCKED)
+			return 1;
+		to_vmx(vcpu)->nested.msr_ia32_feature_control = data;
+		break;
 	case MSR_TSC_AUX:
 		if (!vmx->rdtscp_enabled)
 			return 1;
@@ -2205,8 +2197,6 @@ static int vmx_set_msr(struct kvm_vcpu *
 			return 1;
 		/* Otherwise falls through */
 	default:
-		if (vmx_set_vmx_msr(vcpu, msr_index, data))
-			break;
 		msr = find_msr_entry(vmx, msr_index);
 		if (msr) {
 			msr->data = data;
@@ -3807,6 +3797,8 @@ static int vmx_vcpu_setup(struct vcpu_vm
 	return 0;
 }
 
+static void free_nested(struct vcpu_vmx *vmx);
+
 static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -3920,6 +3912,9 @@ static int vmx_vcpu_reset(struct kvm_vcp
 	/* HACK: Don't enable emulation on guest boot/reset */
 	vmx->emulation_required = 0;
 
+	/* Reset nested-VMX settings: */
+	vmx->nested.msr_ia32_feature_control = 0;
+	free_nested(vmx);
 out:
 	return ret;
 }
@@ -5031,6 +5026,14 @@ static int handle_vmon(struct kvm_vcpu *
 		return 1;
 	}
 
+#define VMXON_NEEDED_FEATURES \
+	(FEATURE_CONTROL_LOCKED | FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX)
+	if ((vmx->nested.msr_ia32_feature_control & VMXON_NEEDED_FEATURES)
+	    != VMXON_NEEDED_FEATURES) {
+		kvm_inject_gp(vcpu, 0);
+		return 1;
+	}
+
 	INIT_LIST_HEAD(&(vmx->nested.vmcs02_pool));
 	vmx->nested.vmcs02_num = 0;
 
--- .before/arch/x86/kvm/x86.c	2012-03-19 18:34:24.0 +0200
+++ .after/arch/x86/kvm/x86.c	2012-03-19 18:34:24.0 +0200
@@ -799,7 +799,8 @@ static u32 msrs_to_save[] = {
 #ifdef CONFIG_X86_64
 	MSR_CSTAR, MSR_KERNEL_GS_BASE,
 	MSR_SYSCALL_MASK, MSR_LSTAR,
 #endif
-	MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA
+	MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
+	MSR_IA32_FEATURE_CONTROL
 };

static unsigned num_m
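As a way to see what the patch enforces, here is a user-space model of the MSR's semantics. The bit values are the architectural ones (lock = bit 0, VMXON-outside-SMX = bit 2), but the functions are a sketch of the behavior, not the KVM code above:

```c
#include <assert.h>
#include <stdint.h>

#define FEATURE_CONTROL_LOCKED                    (1ull << 0)
#define FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX (1ull << 2)

static uint64_t feature_control;  /* models the guest's MSR value */

/* WRMSR: refused with #GP (modeled here as -1) once the lock bit is set,
 * which is how a guest BIOS can prevent the guest OS from using VMX. */
static int wrmsr_feature_control(uint64_t data)
{
    if (feature_control & FEATURE_CONTROL_LOCKED)
        return -1;
    feature_control = data;
    return 0;
}

/* VMXON: allowed only if the MSR was locked with VMX enabled outside
 * SMX - the check the patch adds to handle_vmon(). */
static int vmxon_allowed(void)
{
    const uint64_t needed = FEATURE_CONTROL_LOCKED |
                            FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX;
    return (feature_control & needed) == needed;
}
```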
Re: PATCH: nVMX: Better MSR_IA32_FEATURE_CONTROL handling
Hi, in a minute I'll send a new version of the MSR_IA32_FEATURE_CONTROL patch for nested VMX; I just wanted to reply first to your comments so you'll know what to expect:

On Wed, Mar 07, 2012, Avi Kivity wrote about "Re: PATCH: nVMX: Better MSR_IA32_FEATURE_CONTROL handling":
> On 03/07/2012 05:58 PM, Nadav Har'El wrote:
> > +	u64 msr_ia32_feature_control;
> > };
>
> Need to add to the list of exported MSRs so it can be live migrated
> (msrs_to_save).

Did this.

> The variable itself should live in vcpu->arch, even if
> some bits are vendor specific.

But not this. I understand what you explained about vmx.c being for Intel *hosts*, not about emulating Intel *guests*, but I do think that since none of the bits in this MSR are relevant on AMD hosts (which don't do nested VMX), it isn't useful to support this MSR outside vmx.c. So I left this variable in vmx->nested. As I noted earlier, svm.c did exactly the same thing (nested.vm_cr_msr), so at least there's symmetry here.

> > @@ -1999,7 +2000,7 @@ static int vmx_get_vmx_msr(struct kvm_vc
> >
> > 	switch (msr_index) {
> > 	case MSR_IA32_FEATURE_CONTROL:
> > -		*pdata = 0;
> > +		*pdata = to_vmx(vcpu)->nested.msr_ia32_feature_control;
> > 		break;
>
> In a separate patch, please move this outside vmx_get_vmx_msr(). It's
> not a vmx msr.

Done, but not split into two patches: the patch removes the old case in vmx_get_vmx_msr() (and also removes vmx_set_vmx_msr() entirely) and instead adds the case in vmx_get_msr() and vmx_set_msr().

> > +#define VMXON_NEEDED_FEATURES \
> > +	(FEATURE_CONTROL_LOCKED | FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX)
>
> Use const u64 instead of #define please, it jars my eyes.

I would, if Linux coding style allowed declaring variables in the middle of blocks. Unfortunately it doesn't, so I left this #define. I don't think it's that bad.

--
Nadav Har'El              | Monday, Mar 19 2012,
n...@math.technion.ac.il  | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il | A conscience does not prevent sin. It only prevents you from enjoying it.
Re: [PATCH] vhost: don't forget to schedule()
On Sun, Mar 04, 2012, Michael S. Tsirkin wrote about "Re: [PATCH] vhost: don't forget to schedule()":
> On Sun, Mar 04, 2012 at 11:10:01AM +0200, Nadav Har'El wrote:
> > On Tue, Feb 28, 2012, Avi Kivity wrote about "Re: [PATCH] vhost: don't
> > forget to schedule()":
> > > > > > +		if (need_resched())
> > > > > > +			schedule();
> > > > > This is cond_resched(), no?
>...
> > Hi. This discussion is already getting several orders of magnitude longer
> > than the patch :-)
>...
> The argument is mostly about what's faster, right?
> You can push the argument along if you run some benchmarks to show the
> performance impact of the proposed variants.

Hi, I did some tests using Netperf TCP-STREAM, message size 1K, over a 10Gbps network. Each of the numbers below is an average of several runs.

Test 1: "First, Do no harm" - verify that with just one guest, when these extra reschedule attempts can't help, they at least don't hurt.

* Unmodified vhost-net: 3135 Mbps.
* My patch, with if (need_resched()) schedule(): 3088 Mbps. The 1.5% average performance decrease is very small, and close to the measurement variation (1.4%), so I'd say it's negligible.
* Avi's proposed alternative, with cond_resched(), had, surprisingly, a very high variation in the results. Two measurements yielded an average of 3038 Mbps (close to, but slightly lower than, the above), but 6 other measurements yielded significantly lower numbers, between 2300 and 2875.

So my patch does not hurt performance, and we don't need to find a "better" version for performance's sake. The cond_resched() version was worse - it had a slightly worse maximum performance, and a high variance - so I don't recommend it. That being said, I don't have a good explanation of why cond_resched() is bad.

Test 2: Demonstrate that the patch is actually important. Here we run two or three guests doing the same Netperf to a different server, with all 2 or 3 vhost threads' affinity set to the same core.
* Behaviour of the unmodified vhost-net was terrible, and demonstrates what this bug fixes:
  - For two guests, one yielded a throughput of 3172 Mbps, but the second yielded only 0.5 Mbps (!!), because its vhost thread hardly ran at all. For three guests, again, one guest yielded 3202 Mbps, but the second and third each yielded less than 1 Mbps.
  - "top" and similar utilities did not show the vhost threads at all, because CPU time is not accounted if you never schedule().
* Our patch (if (need_resched()) schedule()) fixes both problems:
  - For two guests, one yields 1557 Mbps, the second 1531 Mbps. Fair, and efficient (1557+1531=3088), as expected. For three guests, the numbers are 1032+1028+1045=3105.
  - "top" shows each vhost thread taking close to the expected half or third of the core they share.
* With the alternative cond_resched(), again the problems are fixed, but performance is less consistent, varying by almost 20% from run to run. Again, I can't explain why.

All in all, I think this proves that the patch I sent, as is, is important, does no harm, and is definitely not worse (and perhaps better) than the proposed alternative implementation with cond_resched(). So please apply.

Thanks,
Nadav.

--
Nadav Har'El                        | Sunday, Mar 18 2012, n...@math.technion.ac.il
Phone +972-523-790466, ICQ 13349191 | If glory comes after death, I'm not in a
http://nadav.harel.org.il           | hurry. (Latin proverb)
Re: PATCH: nVMX: Better MSR_IA32_FEATURE_CONTROL handling
On Wed, Mar 07, 2012, Avi Kivity wrote about "Re: PATCH: nVMX: Better MSR_IA32_FEATURE_CONTROL handling":
> >  	struct page *apic_access_page;
> > +	u64 msr_ia32_feature_control;
> > };
>...
> (msrs_to_save). The variable itself should live in vcpu->arch, even if
> some bits are vendor specific.

Does this MSR exist on AMD? I was under the impression that it is an Intel-only MSR, and that AMD has something different, the VM_CR MSR, so it didn't make sense to put this in vcpu->arch. Is my impression wrong?

It seems, by the way, that svm.c has vm_cr_msr in svm->nested - basically the same as what I did - not in vcpu->arch. Why is this bad? Also, it seems that VM_CR is also not on the list of msrs_to_save. A bug?

> > @@ -1999,7 +2000,7 @@ static int vmx_get_vmx_msr(struct kvm_vc
> >
> >  	switch (msr_index) {
> >  	case MSR_IA32_FEATURE_CONTROL:
> > -		*pdata = 0;
> > +		*pdata = to_vmx(vcpu)->nested.msr_ia32_feature_control;
> >  		break;
>
> In a separate patch, please move this outside vmx_get_vmx_msr(). It's
> not a vmx msr.

I agree, I'll move it. But if it's a VMX-only MSR, I want to leave it in vmx.c, and not move it to x86.c.

--
Nadav Har'El                        | Thursday, Mar 15 2012, n...@math.technion.ac.il
Phone +972-523-790466, ICQ 13349191 | If I am not for myself, who will be for
http://nadav.harel.org.il           | me? If I am only for myself, who am I?
PATCH: nVMX: Better MSR_IA32_FEATURE_CONTROL handling
The existing code emulates the guest's use of the IA32_FEATURE_CONTROL MSR in a way that was enough to run nested VMX guests, but did not fully conform to the VMX specification, and in particular did not allow a guest BIOS to prevent the guest OS from using VMX by setting the lock bit on this MSR. This patch emulates this MSR better, allowing the guest to lock it, and verifying its setting on VMXON. Also make sure that this MSR (and of course, VMXON state) is reset on guest vcpu reset (via SIPI).

Signed-off-by: Nadav Har'El
Reported-by: Julian Stecklina
---
 arch/x86/kvm/vmx.c | 24 +---
 1 file changed, 21 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2012-03-07 17:52:02.0 +0200
+++ .after/arch/x86/kvm/vmx.c	2012-03-07 17:52:02.0 +0200
@@ -352,6 +352,7 @@ struct nested_vmx {
 	 * we must keep them pinned while L2 runs.
 	 */
 	struct page *apic_access_page;
+	u64 msr_ia32_feature_control;
 };
 
 struct vcpu_vmx {
@@ -1999,7 +2000,7 @@ static int vmx_get_vmx_msr(struct kvm_vc
 
 	switch (msr_index) {
 	case MSR_IA32_FEATURE_CONTROL:
-		*pdata = 0;
+		*pdata = to_vmx(vcpu)->nested.msr_ia32_feature_control;
 		break;
 	case MSR_IA32_VMX_BASIC:
 		/*
@@ -2077,9 +2078,13 @@ static int vmx_set_vmx_msr(struct kvm_vc
 	if (!nested_vmx_allowed(vcpu))
 		return 0;
 
-	if (msr_index == MSR_IA32_FEATURE_CONTROL)
-		/* TODO: the right thing. */
+	if (msr_index == MSR_IA32_FEATURE_CONTROL) {
+		if (to_vmx(vcpu)->nested.msr_ia32_feature_control
+			& FEATURE_CONTROL_LOCKED)
+			return 0;
+		to_vmx(vcpu)->nested.msr_ia32_feature_control = data;
 		return 1;
+	}
 	/*
 	 * No need to treat VMX capability MSRs specially: If we don't handle
 	 * them, handle_wrmsr will #GP(0), which is correct (they are readonly)
@@ -3807,6 +3812,8 @@ static int vmx_vcpu_setup(struct vcpu_vm
 	return 0;
 }
 
+static void free_nested(struct vcpu_vmx *vmx);
+
 static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -3920,6 +3927,9 @@ static int vmx_vcpu_reset(struct kvm_vcp
 	/* HACK: Don't enable emulation on guest boot/reset */
 	vmx->emulation_required = 0;
 
+	/* Reset nested-VMX settings: */
+	vmx->nested.msr_ia32_feature_control = 0;
+	free_nested(vmx);
 out:
 	return ret;
 }
@@ -5031,6 +5041,14 @@ static int handle_vmon(struct kvm_vcpu *
 		return 1;
 	}
 
+#define VMXON_NEEDED_FEATURES \
+	(FEATURE_CONTROL_LOCKED | FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX)
+	if ((vmx->nested.msr_ia32_feature_control & VMXON_NEEDED_FEATURES)
+		!= VMXON_NEEDED_FEATURES) {
+		kvm_inject_gp(vcpu, 0);
+		return 1;
+	}
+
 	INIT_LIST_HEAD(&(vmx->nested.vmcs02_pool));
 	vmx->nested.vmcs02_num = 0;

--
Nadav Har'El                        | Wednesday, Mar 7 2012, n...@math.technion.ac.il
Phone +972-523-790466, ICQ 13349191 | In Fortran, God is real unless declared
http://nadav.harel.org.il           | an integer.
Re: [PATCH] KVM: Enable VMX-related bits in MSR_IA32_FEATURE_CONTROL.
On Wed, Mar 07, 2012, Avi Kivity wrote about "Re: [PATCH] KVM: Enable VMX-related bits in MSR_IA32_FEATURE_CONTROL.":
> On 03/06/2012 07:33 PM, Nadav Har'El wrote:
> > By the way, am I right in my understanding that KVM doesn't support
> > SMX in the guest?
>
> Isn't nested vmx crazy enough? :-)
> btw, any updates on nested EPT? Nested vmx is pointless without it.

Isn't "pointless" a bit strong? :-)

I have done most of the changes that you asked for, but still need to finish the EPT violation overhaul. I was meaning to work on this now, when I got sidetracked by all these other small issues. Hopefully, I'll have a version ready for submission by next week.

--
Nadav Har'El                        | Wednesday, Mar 7 2012, n...@math.technion.ac.il
Phone +972-523-790466, ICQ 13349191 | "[I'm] so full of action, my name should
http://nadav.harel.org.il           | be a verb" -- Big Daddy Kane ("Raw", 1987)
Re: [PATCH] KVM: Enable VMX-related bits in MSR_IA32_FEATURE_CONTROL.
On Tue, Mar 06, 2012, Nadav Har'El wrote about "Re: [PATCH] KVM: Enable VMX-related bits in MSR_IA32_FEATURE_CONTROL.": > 2. handle_vmon() does not check the previous setting of this MSR. > If the guest (or its BIOS) doesn't set both FEATURE_CONTROL_LOCKED > and FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX, it should get a By the way, am I right in my understanding that KVM doesn't support SMX in the guest? I didn't see mention of CR4.SMXE or a GETSEC exit handler, which is why I'm assuming that it doesn't... If this assumption isn't true, I'll also need to worry about the ..._INSIDE_SMX case. -- Nadav Har'El|Tuesday, Mar 6 2012, n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |When you handle yourself, use your head; http://nadav.harel.org.il |when you handle others, use your heart.
Re: [PATCH] KVM: Enable VMX-related bits in MSR_IA32_FEATURE_CONTROL.
On Tue, Mar 06, 2012, Avi Kivity wrote about "Re: [PATCH] KVM: Enable VMX-related bits in MSR_IA32_FEATURE_CONTROL.":
> >  	case MSR_IA32_FEATURE_CONTROL:
> > -		*pdata = 0;
> > +		/*
> > +		 * If nested VMX is enabled, set the lock bit (bit 0)
> > +		 * and the "Enable VMX outside SMX" bit (bit 2) in the
> > +		 * FEATURE_CONTROL MSR.
> > +		 */
> > +		*pdata = nested_vmx_allowed(vcpu) ? 0x5 : 0;

0x5 can be written as FEATURE_CONTROL_LOCKED | FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX

> >  		break;
> >  	case MSR_IA32_VMX_BASIC:
> >  		/*
>
> The way I read it, it should be done by the guest, not the host.

This is also how I understand it. Check out KVM's own hardware_enable() to see how a guest does turn on these bits before using VMXON - it doesn't need to rely on the BIOS to have done it earlier (the BIOS can, however, prevent the guest from doing this by setting only the lock bit).

What is true, however, is that the existing code is probably incomplete in three ways (see section 20.7 of the SDM):

1. It always returns 0 for this MSR, even if the guest previously set it to something else.

2. handle_vmon() does not check the previous setting of this MSR. If the guest (or its BIOS) doesn't set both FEATURE_CONTROL_LOCKED and FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX, it should get a #GP on an attempt to VMXON. This will allow L1's BIOS to disable nested VMX if it wishes (not that I think this is a very useful usecase...).

3. A vmx_set_vmx_msr() write to MSR_IA32_FEATURE_CONTROL should throw a #GP if FEATURE_CONTROL_LOCKED is on.

I'll create a patch to do this shortly.

--
Nadav Har'El                        | Tuesday, Mar 6 2012, n...@math.technion.ac.il
Phone +972-523-790466, ICQ 13349191 | Thousands of years ago, cats were
http://nadav.harel.org.il           | worshipped as gods. They never forgot.
[PATCH] nVMX: Fix erroneous exception bitmap check
The code which checks whether to inject a pagefault to L1 or L2 (in nested VMX) was wrong in how it checked the PF_VECTOR bit: it masked the exception bitmap with the vector number instead of with the corresponding bit. Thanks to Dan Carpenter for spotting this.

Signed-off-by: Nadav Har'El
Reported-by: Dan Carpenter
---
 arch/x86/kvm/vmx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2012-03-06 16:31:25.0 +0200
+++ .after/arch/x86/kvm/vmx.c	2012-03-06 16:31:25.0 +0200
@@ -1664,7 +1664,7 @@ static int nested_pf_handled(struct kvm_
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 
 	/* TODO: also check PFEC_MATCH/MASK, not just EB.PF. */
-	if (!(vmcs12->exception_bitmap & PF_VECTOR))
+	if (!(vmcs12->exception_bitmap & (1u << PF_VECTOR)))
 		return 0;
 
 	nested_vmx_vmexit(vcpu);

--
Nadav Har'El                        | Tuesday, Mar 6 2012, n...@math.technion.ac.il
Phone +972-523-790466, ICQ 13349191 | This box was intentionally left blank.
http://nadav.harel.org.il           |
Re: [PATCH] KVM: VMX: Fix delayed load of shared MSRs
On Tue, Mar 06, 2012, Avi Kivity wrote about "[PATCH] KVM: VMX: Fix delayed load of shared MSRs": > Shared MSRs (MSR_*STAR and related) are stored in both vmx->guest_msrs > and in the CPU registers, but vmx_set_msr() only updated memory. Prior > to 46199f33c2953, this didn't matter, since we called vmx_load_host_state(), > which scheduled a vmx_save_host_state(), which re-synchronized the CPU > state, but now we don't, so the CPU state will not be synchronized until > the next exit to host userspace. This mostly affects nested vmx workloads, > which play with these MSRs a lot. Thanks. I checked this patch, and indeed it solves the bug with nested VMX. -- Nadav Har'El|Tuesday, Mar 6 2012, n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |I think therefore I am. My computer http://nadav.harel.org.il |thinks for me, therefore I am not.
Heads-up: Nested VMX got broken by commit
Hi, I just noticed that Nested VMX got broken (at least in my tests) by commit 46199f33c29533e7ad2a7d2128dc30175d1d4157. The specific change causing the problem was:

@@ -2220,7 +2216,6 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
 			break;
 		msr = find_msr_entry(vmx, msr_index);
 		if (msr) {
-			vmx_load_host_state(vmx);
 			msr->data = data;
 			break;
 		}

And if anyone wants a quick workaround to make nested VMX work again, restoring this line fixes the problem.

I'm still trying to figure out why this line, which indeed seems unrelated and unnecessary, is necessary for the correct functioning of nested VMX. My (unsubstantiated) guess is that it isn't actually necessary at this point - it's just that it does something that should have been more properly done in another place, but I've yet to figure out exactly what. I'll send a patch when I have this figured out. If anybody else has any guess, I'd love to hear it.

Nadav.

--
Nadav Har'El                        | Tuesday, Mar 6 2012, n...@math.technion.ac.il
Phone +972-523-790466, ICQ 13349191 | Sign in pool: "Welcome to our OOL. Notice
http://nadav.harel.org.il           | there is no P, please keep it that way."
Re: [PATCH] vhost: don't forget to schedule()
On Tue, Feb 28, 2012, Avi Kivity wrote about "Re: [PATCH] vhost: don't forget to schedule()": > > > > + if (need_resched()) > > > > + schedule(); > > > This is cond_resched(), no? > > > > It indeed looks similar, but it appears there are some slightly > > different things happening in both cases, especially for a preemptive >... > I'd have expected that cond_resched() is a no-op with preemptible > kernels, but I see this is not the case. Hi. This discussion is already getting several orders of magnitude longer than the patch :-) Would you like me to send a new one-line patch calling "cond_resched()" instead of need_resched/schedule? Or anything else that I can do? > > But I now see that in kvm_main.c, there's also this: > > > > if (!need_resched()) > > return; > > cond_resched(); >... > > It's bogus. Look at commit 3fca03653010: >... > + if (!need_resched()) > + return; > vcpu_put(vcpu); > cond_resched(); > vcpu_load(vcpu); > > at that time, it made sense to do the extra check to avoid the expensive > vcpu_put/vcpu_load. Later preempt notifiers made them redundant > (15ad71460d75), and they were removed, but the extra check remained. Do you want a patch to remove this extra check? Or you can just remove it yourself? Thanks, Nadav. -- Nadav Har'El| Sunday, Mar 4 2012, n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |As every cat owner knows, nobody owns a http://nadav.harel.org.il |cat. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost: don't forget to schedule()
On Tue, Feb 28, 2012, Avi Kivity wrote about "Re: [PATCH] vhost: don't forget to schedule()":
> > +		if (need_resched())
> > +			schedule();
>
> This is cond_resched(), no?

It indeed looks similar, but it appears there are some slightly different things happening in the two cases, especially for a preemptive kernel... Unfortunately, I am not astute (or experienced) enough to tell which of the two idioms is better or more appropriate for this case. The idiom that I used seemed right, and seemed to work in my tests. Moreover, I also noticed it was used in vmx.c. Also, vhost.c was already calling schedule(), not cond_resched(), so I thought it made sense to call the same thing...

But I now see that in kvm_main.c, there's also this:

	if (!need_resched())
		return;
	cond_resched();

Which seems to combine both idioms ;-)

Can anybody shed light on what is the right way to do it?

--
Nadav Har'El                        | Tuesday, Feb 28 2012, n...@math.technion.ac.il
Phone +972-523-790466, ICQ 13349191 | I am logged in, therefore I am.
http://nadav.harel.org.il           |
[PATCH] vhost: don't forget to schedule()
This is a tiny, but important, patch to vhost.

Vhost's worker thread only called schedule() when it had no work to do, and it wanted to go to sleep. But if there's always work to do, e.g., the guest is running a network-intensive program like netperf with small message sizes, schedule() was *never* called. This had several negative implications (on non-preemptive kernels):

1. Passing time was not properly accounted to the "vhost" process (ps and top would wrongly show it using zero CPU time).

2. Sometimes error messages about RCU timeouts would be printed, if the core running the vhost thread didn't schedule() for a very long time.

3. Worst of all, a vhost thread would "hog" the core. If several vhost threads need to share the same core, typically one would get most of the CPU time (and its associated guest most of the performance), while the others hardly get any work done.

The trivial solution is to add

	if (need_resched())
		schedule();

after doing every piece of work. This will not do the heavy schedule() all the time - just when the timer interrupt decided a reschedule is warranted (so need_resched() returns true).

Thanks to Abel Gordon for this patch.

Signed-off-by: Nadav Har'El
---
 vhost.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c14c42b..ae66278 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -222,6 +222,8 @@ static int vhost_worker(void *data)
 		if (work) {
 			__set_current_state(TASK_RUNNING);
 			work->fn(work);
+			if (need_resched())
+				schedule();
 		} else
 			schedule();

--
Nadav Har'El                        | Monday, Feb 27 2012, n...@math.technion.ac.il
Phone +972-523-790466, ICQ 13349191 | Ways to Relieve Stress #10: Make up a
http://nadav.harel.org.il           | language and ask people for directions.
Re: [PATCH 0/10] nEPT: Nested EPT support for Nested VMX
On Sun, Nov 13, 2011, Avi Kivity wrote about "Re: [PATCH 0/10] nEPT: Nested EPT support for Nested VMX": > > I also believed that the fault injection part was also correct: I > > thought that the code already knows when to handle the fault in L2 (when > > the address is missing in cr3), in L1 (when the translation is missing > > in EPT12) or else, in L0. > > It does, but it needs to propagate the fault code correctly. The exit > reason (ept violation vs ept misconfiguration) is meaningless, since we > don't encode anything about it from ept12 into ept02. In particular an > ept violation could lead to > > - no fault, ept02 updated, instruction retried > - no fault, instruction emulated > - L2 fault > - ept violation, need to compute ept12 permissions for exit qualification > - ept misconfiguration > > (the second and third cases occur when it is impossible to create an > ept02 mapping - when L0 emulates a gpa that L1 assigns to L2 via ept12). I'm now trying to figure out this part, and I think I am beginning to understand the mess you are referring to: In nested_ept_inject_page_fault I now assume the exit reason is always EPT VIOLATION and have vmcs12->exit_qualification = fault->error_code; But fault->error_code is not in the exit qualification format but in the PFERR_* format, which has different meanings for the bits... Moreover, PFERR_RSVD_MASK should cause an EPT MISCONFIG, not EPT VIOLATION. Is this what you meant above? I didn't quite understand what you meant in the 4th case about needing to compute ept12 permissions. I'm assuming that if the EPT violation was caused because L0 decreased permissions from what L1 thought, then L0 will solve the problem itself and not inject it to L1. So if we are injecting the fault to L1, don't we already know the correct fault reason and don't need to compute it? 
There's another complication: when the fault comes from an EPT violation in L2, handle_ept_violation() calls mmu.page_fault() with an error_code of exit_qualification & 0x3. This means that the error_code in this case is *not* in the expected PFERR_* format, and we need to know that in nested_ept_inject_page_fault. Moreover, in the original EPT violation's exit qualification, there were various other bits which we lose (and don't have a direct parallel in PFERR_* anyway), so when we reinject the fault, L1 doesn't get them. What a mess :(

--
Nadav Har'El                        | Monday, Dec 12 2011, n...@math.technion.ac.il
Phone +972-523-790466, ICQ 13349191 | Hardware, n.: The parts of a computer
http://nadav.harel.org.il           | system that can be kicked.
Re: [PATCH 08/10] nEPT: Nested INVEPT
On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 08/10] nEPT: Nested INVEPT": > On 11/10/2011 12:01 PM, Nadav Har'El wrote: > > If we let L1 use EPT, we should probably also support the INVEPT > > instruction. >.. > > + if (vmcs12 && nested_cpu_has_ept(vmcs12) && > > + (vmcs12->ept_pointer == operand.eptp) && > > + vmx->nested.last_eptp02) > > + ept_sync_context(vmx->nested.last_eptp02); > > + else > > + ept_sync_global(); > > Are either of these needed? Won't a write to a shadowed EPT table cause > them anyway? This is a very good point... You're right that as it stands, any changes to the guest EPT table (EPT12) will cause changes to the shadow EPT table (EPT02), and these already cause KVM to do an INVEPT, so there is no point in doing this again when the guest asks. So basically, I can have INVEPT emulated by doing absolutely nothing (after performing all the checks), right? I wonder if I am missing any reason why a hypervisor might want to do INVEPT without changing the EPT12 table first. -- Nadav Har'El|Sunday, Dec 11 2011, n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |Why do programmers mix up Christmas and http://nadav.harel.org.il |Halloween? Because DEC 25 = OCT 31
Re: [PATCH 02/10] nEPT: MMU context for nested EPT
On Mon, Nov 14, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> > >> +#if PTTYPE == EPT
> > >> +	real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
> > >> +				      EPT_WRITABLE_MASK);
> > >> +#else
> > >> 	real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
> > >> 				      PFERR_USER_MASK|PFERR_WRITE_MASK);
> > >> +#endif
> > >> +
> > >
> > > Unneeded, I think.
> >
> > Is it because translate_nested_gpa always set USER_MASK ?
>
> Yes... maybe that function needs to do something like
>
>	access |= mmu->default_access;

Unless I'm misunderstanding something, translate_nested_gpa, and gva_to_gpa, take as their "access" parameter a bitmask of PFERR_*, so it's fine for PFERR_USER_MASK to be enabled in translate_nested_gpa; it just shouldn't cause PT_USER_MASK to be used. The only additional problem I can find is in walk_addr_generic, which does

	if (!check_write_user_access(vcpu, write_fault, user_fault, pte))
		eperm = true;

and that checks pte & PT_USER_MASK, which it shouldn't if PTTYPE==PTTYPE_EPT.

It's really confusing that we now have in mmu.c no less than 4 (!) access bit schemes, similar in many ways but different in many others:

1. page fault error codes (PFERR_*_MASK)
2. x86 page table access bits (PT_*_MASK)
3. KVM private access bits (ACC_*_MASK)
4. EPT access bits (VMX_EPT_*_MASK)

I just have to try hard not to confuse them.

--
Nadav Har'El                        | Thursday, Dec 8 2011, n...@math.technion.ac.il
Phone +972-523-790466, ICQ 13349191 | Sorry, but my karma just ran over your
http://nadav.harel.org.il           | dogma.
Re: [PATCH 02/10] nEPT: MMU context for nested EPT
On Sun, Nov 13, 2011, Orit Wasserman wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> +++ b/arch/x86/kvm/mmu.h
> @@ -48,6 +48,11 @@
>  #define PFERR_RSVD_MASK (1U << 3)
>  #define PFERR_FETCH_MASK (1U << 4)
>
> +#define EPT_WRITABLE_MASK 2
> +#define EPT_EXEC_MASK 4

This is another example of the "unclean" movement of VMX-specific things into x86 :( We already have VMX_EPT_WRITABLE_MASK and friends in vmx.h. I'll need to think about what is less ugly: to move them to mmu.h, or to include vmx.h in mmu.c, or perhaps even create a new include file, ept.h. Avi, do you have a preference? The last thing I want to do is to repeat the same definitions in two places.

--
Nadav Har'El                        | Wednesday, Dec 7 2011, n...@math.technion.ac.il
Phone +972-523-790466, ICQ 13349191 | "A witty saying proves nothing." --
http://nadav.harel.org.il           | Voltaire
Re: [PATCH 02/10] nEPT: MMU context for nested EPT
On Sun, Nov 13, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT": > On 11/13/2011 01:30 PM, Orit Wasserman wrote: > > Maybe this patch can help, this is roughly what Avi wants (I hope) done > > very quickly. > > I'm sorry I don't have setup to run nested VMX at the moment so i can't > > test it. >... > > +#define PTTYPE EPT > > +#include "paging_tmpl.h" > > +#undef PTTYPE > > Yes, that's the key. I'm now preparing a patch based on such ideas. One downside of this approach is that mmu.c (and therefore the x86 module) will now include EPT-specific functions that are of no use or relevance to the SVM code. It's not a terrible disaster, but it's "unclean". I'll try to think if there's a cleaner way. Nadav. -- Nadav Har'El|Tuesday, Dec 6 2011, n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |Writing software is like sex: One mistake http://nadav.harel.org.il |and you have to support it forever.
Re: [PATCH 02/10] nEPT: MMU context for nested EPT
On Wed, Nov 23, 2011, Nadav Har'El wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT": > > +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu) > > +{ > > + int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu); > > + > > + vcpu->arch.nested_mmu.gva_to_gpa = EPT_gva_to_gpa_nested; > > + > > + return r; > > +} >.. > I didn't see you actually call this function anywhere - how is it > supposed to work? >.. > It seems we need a fifth case in that function. >.. On second thought, why is this modifying nested_mmu.gva_to_gpa, and not mmu.gva_to_gpa? Isn't the nested_mmu the L2 CR3, which is *not* in EPT format, and what we really want to change is the outer mmu, which is EPT12 and is indeed in EPT format? Or am I missing something? Thanks, Nadav. -- Nadav Har'El| Wednesday, Nov 23 2011, n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |My password is my dog's name. His name is http://nadav.harel.org.il |a#j!4@h, but I change it every month.
Re: [PATCH 02/10] nEPT: MMU context for nested EPT
On Sun, Nov 13, 2011, Orit Wasserman wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> Maybe this patch can help, this is roughly what Avi wants (I hope) done very
> quickly.
> I'm sorry I don't have setup to run nested VMX at the moment so i can't test
> it.

Hi Orit, thanks for the code - I'm now working on incorporating something based on this into my patch. However, I do have a question:

> +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
> +{
> +	int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
> +
> +	vcpu->arch.nested_mmu.gva_to_gpa = EPT_gva_to_gpa_nested;
> +
> +	return r;
> +}

I didn't see you actually call this function anywhere - how is it supposed to work? The way I understand the current code, kvm_mmu_reset_context() calls init_kvm_mmu() which (in our case) calls init_kvm_nested_mmu(). I think the above gva_to_gpa setting should be there - right? It seems we need a fifth case in that function. But at that point in mmu.c, how will I be able to check if this is the nested EPT case? Do you have any suggestion?

Thanks,
Nadav.

--
Nadav Har'El                        | Wednesday, Nov 23 2011, n...@math.technion.ac.il
Phone +972-523-790466, ICQ 13349191 | This message contains 100% recycled
http://nadav.harel.org.il           | characters.
Re: [PATCH] Allow aligned byte and word writes to IOAPIC registers.
On Wed, Nov 23, 2011, Avi Kivity wrote about "Re: [PATCH] Allow aligned byte and word writes to IOAPIC registers.":
> On 11/22/2011 06:09 PM, Julian Stecklina wrote:
> > This fixes byte accesses to IOAPIC_REG_SELECT as mandated by at least the
> > ICH10 and Intel Series 5 chipset specs. It also makes ioapic_mmio_write
> > consistent with ioapic_mmio_read, which also allows byte and word accesses.

Your patch indents with spaces, while Linux uses tabs for indents. You might want to run scripts/checkpatch.pl on your patches before sending them - this will catch this and many other stylistic faux pas.

--
Nadav Har'El                        | Wednesday, Nov 23 2011
n...@math.technion.ac.il            |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 | Why do doctors call what they do
http://nadav.harel.org.il           | practice? Think about it.
Re: nested virtualization on Intel and needed cpu flags to pass
On Tue, Nov 22, 2011, Gianluca Cecchi wrote about "nested virtualization on Intel and needed cpu flags to pass":
> I'm going to test nested virtualization on Intel with Fedora 16 host.
>...
> [root at f16 ~]# uname -r
> 3.1.1-2.fc16.x86_64

This Linux version indeed has nested VMX, while

> # uname -r
> 2.6.40.6-0.fc15.x86_64

doesn't.

> Based on docs, the "nested" option is disabled by default on Intel

Indeed. You need to load the kvm_intel module with the "nested=1" option. You also need to tell qemu that the emulated CPU will advertise VMX - the simplest way to do this is with the "-cpu host" option to qemu. It seems you did all of this correctly.

> Preliminary results are not so good.
> I created an F16 guest (f16vm), with default options
>...
> Inside the guest I create a windows xp with default values proposed by
>...
> Until now not able to complete installation

Unfortunately, this is a known bug - which I promised to work on, but haven't yet got around to :( nested-vmx.txt explicitly lists under "known limitations" that: "The current code supports running Linux guests under KVM guests."

> more tests to come with newer kernel 3.1.1-2.fc16.x86_64
> But before proceeding, probably I need to adjust particular features
> to enable/disable about
> cpu to pass to f16vm guest...

I don't think this is the problem. The underlying problem is that the VMX spec is very complex, and ideally nested VMX should correctly emulate each and every bit and every corner of it. Because all our testing was done with KVM L1s and Linux L2s, all the paths that get exercised in that case were tested and their bugs ironed out - but it is possible that Windows L2s end up exercising slightly different code paths that still have bugs, and those need to be fixed. Unfortunately, I doubt a newer kernel will solve your problem. I think there are genuine bugs that will have to be fixed.
> What are current advises about cpu flags to pass to optimally use
> nested virtualization with intel at this time?

I don't think there are any such guidelines. The only thing you really need is "-cpu qemu64,+vmx" (replace qemu64 by whatever you want) to advertise the existence of VMX.

--
Nadav Har'El                        | Wednesday, Nov 23 2011
n...@math.technion.ac.il            |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 | When you lose, don't lose the lesson.
http://nadav.harel.org.il           |
Re: KVM VMExit
On Mon, Nov 14, 2011, Xin Tong wrote about "Re: KVM VMExit":
> x86 has a Undefined Instruction, its opcode is 0F 0B and it generates
> an invalid opcode exception. This instruction is provided for software
> testing to explicitly generate an invalid opcode exception. The opcode
> for this instruction is reserved for this purpose.

Seeing that your recent questions on this list aren't really about KVM development, but rather about VMX (apparently) basics, I suggest you also grab yourself a copy of the Intel Software Manual - see volume 3 of:

http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html

which answers all of your recent questions.

About this question - as far as I know there is no way to specifically cause an exit only on the UD2 instruction. What you can do, however, is cause an exit on any #UD exception, including the one generated by UD2. The VMCS has an "exception bitmap" which defines which exceptions cause an exit, and you can turn on the #UD bit to ask for this exit.

--
Nadav Har'El                        | Monday, Nov 14 2011
n...@math.technion.ac.il            |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 | Willpower: The ability to eat only one
http://nadav.harel.org.il           | salted peanut.
Re: Invalidate TLB
Hi,

On Sun, Nov 13, 2011, Xin Tong wrote about "Invalidate TLB":
> I have 2 questions regard TLB and PageTable in KVM.
>
> 1. I do not really get how the TLB gets saved and restored on when the
> guest os vmexits and vmresumes. Or maybe the TLB is not saved at all (
> TLB flushes when vmexit and vmenter happen).

The TLB isn't saved and restored... Because the TLB translations of the host aren't relevant in the guest (and vice versa), in the first generation of VMX (Intel's hardware virtualization), it was necessary to flush the TLB on each VMX transition (vmexit and vmenter). In the second generation, VMX added TLB tags known as "VPID" (virtual processor id). The hypervisor allocates a unique id for each guest, and itself uses id 0, and translations cached in the TLB are tagged with the VPID value. Therefore, when VPID is being used, you no longer need to flush the TLB on every entry and exit.

> 2. Say i have multiple vCPUs and each of them running some guest OS
> processes. suddenly one of the vCPU vmexits due to a trapping
> instruction. This trapping instruction modifies the page table of a
> currently running process ( running on one of the other vCPUs - vCPU X
> ). how does the vCPU X gets notified ( i.e. TLB invalidated). some
> kind of apic needs to be sent, right ?

I didn't quite follow your example, but there is indeed a remote TLB flush IPI.

--
Nadav Har'El                        | Sunday, Nov 13 2011
n...@math.technion.ac.il            |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 | I would give my right arm to be
http://nadav.harel.org.il           | ambidextrous.
Re: [PATCH 0/10] nEPT: Nested EPT support for Nested VMX
Hi,

On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 0/10] nEPT: Nested EPT support for Nested VMX":
> This patchset is missing a fairly hairy patch that makes reading L2
> virtual addresses work.

This was supposed to be part of the nested TDP code that is already in the code. To read an L2 virtual address, the code is supposed, if I understand correctly, to walk the "walk" mmu (EPT01 and guest_cr3) and then use the EPT table - just like the normal EPT case, which uses the EPT table and the guest_cr3. I even believed that this inner "walk mmu" would work fine without any rewrite needed for ia32/ept differences, because it works (or so I believed) just like normal EPT, with the first table being an EPT table, and the second table being a normal page table.

I also believed that the fault injection part was correct: I thought that the code already knows when to handle the fault in L2 (when the address is missing in cr3), in L1 (when the translation is missing in EPT12) or else, in L0. So what is the "hairy" missing part?

> The standard example is L1 passing a bit of
> hardware (emulated in L0) to a L2; when L2 accesses it, the instruction
> will fault and need to be handled in L0, transparently to L1. The
> emulation can cause a fault to be injected to L2, or and EPT violation
> or misconfiguration injected to L1.

I don't understand the example. You are referring to nested device assignment from L1 to L2 (so L1 stops caring about the device)? Since we don't emulate an IOMMU for L1, how can that be done?

Thanks,
Nadav.

--
Nadav Har'El                        | Sunday, Nov 13 2011
n...@math.technion.ac.il            |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 | An error? Impossible! My modem is error
http://nadav.harel.org.il           | correcting.
Re: [PATCH 02/10] nEPT: MMU context for nested EPT
On Sat, Nov 12, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> host may write-protect a page. Second, the shadow and guest ptes may be
> in different formats (ept vs ia32).

I'm afraid I've lost you here... The shadow table and the to-be-shadowed table are both ia32 (this is the normal shadow table code), or both EPT (the nested TDP code). When are they supposed to be in different formats (ept vs ia32)? I'm also puzzled about in what situation the host would write-protect an EPT02 (shadow EPT) page.

> In fact that happens to accidentally work, no? Intermediate ptes are
> always present/write/user, which translates to read/write/execute in EPT.

It didn't work, because it also used to set the "accessed" bit, bit 5, which on EPT is reserved and caused an EPT misconfiguration. So I had to fix link_shadow_page, or nested EPT would not work at all.

> Don't optimize for least changes, optimize for best result afterwards.

As I'm sure you remember, two years ago, on September 6, 2009, you wrote in your blog about the newly contributed nested VMX patch set, and in particular its nested EPT (which predated the nested NPT contribution). Nested EPT was, for some workloads, a huge performance improvement, but you (if I understand correctly) did not want that code in KVM because it, basically, optimized for getting the job done, in the most correct and most efficient manner - but without regard to how cleanly it fit with the other types of shadowing (normal shadow page tables, and nested NPT), or how much of the code was being duplicated or circumvented.

So this time around, I couldn't really "not optimize for least changes". This time, nested EPT had to fit (like a square peg in a round hole ;-)) into the preexisting MMU and NPT shadowing. I couldn't really just write the most correct and most efficient code (which Orit Wasserman already did, two years earlier).
This time I needed to figure out the least obtrusive way of changing the existing code. The hardest thing about doing this was trying to understand all the complexities and subtleties of the existing MMU code in KVM, which already handles 101 different cases in one overloaded piece of code, which is not commented or documented. And of course, add to that all the complexities (some might even say "cruft") which the underlying x86 architecture itself has accrued over the years. So it's not surprising that I've missed some of the important subtleties which didn't have any effect in the typical cases I tried. Like I said, in my tests nested EPT *did* work. And even getting to that point was hard enough :-)

> We need a third variant of walk_addr_generic that parses EPT format
> PTEs. Whether that's best done by writing paging_ept.h or modifying
> paging_tmpl.h, I don't know.

Thanks. I'll think about everything you've said in this thread (I'm still not convinced I understood all your points, so just understanding them will be the first step). I'll see what I can do to improve the patch. But I have to be honest - I'm not sure how quickly I can finish this.

I really appreciate all your comments about nested VMX in the last two years - most of them have been spot-on, 100% correct, and really helpful for making me understand things which I had previously misunderstood. However, since you are (of course) extremely familiar with every nook and cranny of KVM, what normally happens is that every comment which took you 5 minutes to figure out takes me 5 days to fully understand, and to actually write, debug and test the fixed code. Every review that takes you two days to go through (and is very much appreciated!) takes me several months to fix each and every thing you asked for. Don't get me wrong, I *am* planning to continue working (part-time) on nested VMX, and nested EPT in particular.
But if you want the pace to pick up, I could use some help with the actual coding from people who have much more intimate knowledge of the non-nested-VMX parts of KVM than I have.

In the meantime, if anybody wants to experiment with a much faster nested VMX than we had before, you can try my current patch. It may not be perfect, but in many ways it is better than the old shadow-on-ept code. And in simple (64-bit, 4K-page) kvm-over-kvm configurations like the ones I tried, it works well.

Nadav.

--
Nadav Har'El                        | Saturday, Nov 12 2011
n...@math.technion.ac.il            |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 | What's tiny, yellow and very dangerous? A
http://nadav.harel.org.il           | canary with the super-user password.
Re: [PATCH 02/10] nEPT: MMU context for nested EPT
On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> This is all correct, but the code in question parses the EPT12 table
> using the ia32 page table format. They're sufficiently similar so that
> it works, but it isn't correct.
>
> Bit 0: EPT readable, ia32 present
> Bit 1: Writable; ia32 meaning dependent on cr0.wp
> Bit 2: EPT executable, ia32 user (so, this implementation will interpret
> a non-executable EPT mapping, if someone could find a use for it, as a
> L2 kernel only mapping)

This is a very good point. I was under the mistaken (?) impression that the page-table shadowing code would just copy these bits as-is from the shadowed table (EPT12) to the shadow table (EPT02), without caring what they actually mean. I knew we had a problem when building, not copying, PTEs (and hence the patch to link_shadow_page). Also, I realized we sometimes need to actually walk the TDP EPT12+cr3 (e.g., to see if an EPT violation is L1's fault), but I thought this was just the normal TDP walk, which already knows how to correctly read the EPT table.

> walk_addr() will also write to bits 6/7, which the L1 won't expect.

I didn't notice this :( Back to the drawing board, I guess. I need to figure out exactly what needs to be fixed, and how to do this with the least obtrusive changes to the existing use cases (normal shadow page tables, and nested EPT).

--
Nadav Har'El                        | Thursday, Nov 10 2011
n...@math.technion.ac.il            |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 | Learn from mistakes of others; you won't
http://nadav.harel.org.il           | live long enough to make them all yourself
Re: [PATCH 01/10] nEPT: Module option
On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 01/10] nEPT: Module option":
> > By "this", do you mean without the "nested_ept" option, or without the
> > hypothetical "EPT on shadow page tables" feature?
>
> Er, both. The feature should be controlled on a per-guest basis, not
> per host.
>...
> It's just redundant, since we do need a per-guest control.

I agreed that per-guest control would have been nicer, but since we don't have an API for specifying that per guest - EPT is not, unfortunately, a CPUID feature - I thought that at least a host-level flag would be useful.

Why would it be useful? I agree it isn't the greatest option since sliced bread, but if, for example, one day we discover a bug in nested EPT, L0 can disable it for all L1 guests and basically force them to use shadow page tables on EPT. It was also useful for me to have this option for benchmarking, because I can force back the old shadow-on-EPT method with just a single option in L0 (instead of needing to give the "ept=0" option in the L1s). If you really don't like the existence of this option, I can easily remove it, of course.

--
Nadav Har'El                        | Thursday, Nov 10 2011
n...@math.technion.ac.il            |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 | Guarantee: this email is 100% free of
http://nadav.harel.org.il           | magnetic monopoles, or your money back!
Re: [PATCH 02/10] nEPT: MMU context for nested EPT
On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU context for nested EPT":
> > +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
> > +{
> > +	int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
>...
> > +	vcpu->arch.walk_mmu = &vcpu->arch.nested_mmu;
>...
>
> kvm_init_shadow_mmu() will cause ->page_fault to be set to something
> like paging64_page_fault(), which is geared to reading EPT ptes. How
> does this work?

Hi, I'm afraid I didn't understand the problem. Nested EPT's merging of two EPT tables (EPT01 and EPT12) works just like normal shadow page tables' merging of two CR3s (host cr3 and guest cr3):

When L0 receives a "page fault" from L2 (actually an EPT violation - real guest #PFs don't cause exits), L0 first looks it up in the shadowed table, which is basically EPT12. If the address is there, L0 handles the fault itself (updating the shadow EPT table, EPT02, using the normal shadow-pte-building code). But if the address wasn't in the shadowed page table (EPT12), mmu->inject_page_fault() is called, which in our case actually causes L1 to get an EPT violation (not a #PF - see kvm_propagate_fault()).

Please note that all this logic is shared with the existing nested NPT code (which itself shared most of the code with the preexisting shadow page tables code). All this code sharing makes it really difficult to understand at first glance why the code is really working, but once you understand why one of these cases works, the others work similarly. And it does in fact work - in the typical cases which I tried, at least. If you still think I'm missing something, I won't be entirely surprised ( :-) ), so let me know.

Nadav.

--
Nadav Har'El                        | Thursday, Nov 10 2011
n...@math.technion.ac.il            |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 | I put a dollar in one of those change
http://nadav.harel.org.il           | machines. Nothing changed.
Re: [PATCH 01/10] nEPT: Module option
On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 01/10] nEPT: Module option":
> On 11/10/2011 11:58 AM, Nadav Har'El wrote:
> > Add a module option "nested_ept" determining whether to enable Nested EPT.
>...
> > In the future, we can support emulation of EPT for L1 *always*, even when L0
> > itself doesn't have EPT. This so-called "EPT on shadow page tables" mode
> > has some theoretical advantages over the baseline "shadow page tables on
> > shadow page tables" mode typically used when EPT is not available to L0 -
> > namely that L2's cr3 changes and page faults can be handled in L0 and do not
> > need to be propagated to L1. However, currently we do not support this mode,
> > and it is becoming less interesting as newer processors all support EPT.
>
> I think we can live without this.

By "this", do you mean without the "nested_ept" option, or without the hypothetical "EPT on shadow page tables" feature?

If the former, then I agree we can "live" without it, but since it was trivial to add, I don't see what harm it can do, and it's nice that we can return, with a single L0 option, to the old shadow-on-EPT paging. Is there anything specific you don't like about having this option?

About the latter, I agree - as I said, there isn't much point in going and writing this (quite complicated) 3-level shadowing when all new processors have EPT anyway. So I didn't.

> But we do need a way to control what
> features are exposed to the guest, for compatibility and live migration
> purposes, as we do with cpuid. So we need some way for host userspace
> to write to the vmx read-only feature reporting MSRs.

I think this is a general issue (which we already discussed earlier) of nested VMX, and not specific to nested EPT. I already put all the capabilities which the MSRs report in variables initialized in a single function, nested_vmx_setup_ctls_msrs(), so once we devise an appropriate userspace interface to set these, we can do so easily.
Does nested SVM also have a similar problem, of whether or not it advertises new or optional SVM features to L1? If it does have this problem, how was it solved there?

--
Nadav Har'El                        | Thursday, Nov 10 2011
n...@math.technion.ac.il            |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 | I considered atheism but there weren't
http://nadav.harel.org.il           | enough holidays.
Re: [PATCH 04/10] nEPT: Fix page table format in nested EPT
On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 04/10] nEPT: Fix page table format in nested EPT":
> > @@ -287,6 +287,7 @@ struct kvm_mmu {
> > 	bool nx;
> >
> > 	u64 pdptrs[4]; /* pae */
> > +	u64 link_shadow_page_set_bits;
>...
> > +static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, u64 set_bits)
> >  {
> > -	u64 spte;
> > -
> > -	spte = __pa(sp->spt)
> > -		| PT_PRESENT_MASK | PT_ACCESSED_MASK
> > -		| PT_WRITABLE_MASK | PT_USER_MASK;
> > -	mmu_spte_set(sptep, spte);
> > +	mmu_spte_set(sptep, __pa(sp->spt) | set_bits);
> >  }
>
> Minor nit: you can just use link_shadow_page_set_bits here instead of
> passing it around (unless later you have a different value for the
> parameter?)

The problem was that link_shadow_page did not take a kvm_mmu parameter, so I don't know where to find this link_shadow_page_set_bits. So either I pass a pointer to the entire kvm_mmu to link_shadow_page, or I just pass the only field which I need... I thought that passing the single field I need was cleaner - but I can easily change it if you prefer to pass the kvm_mmu.

Thanks,
Nadav.

--
Nadav Har'El                        | Thursday, Nov 10 2011
n...@math.technion.ac.il            |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 | I had a lovely evening. Unfortunately,
http://nadav.harel.org.il           | this wasn't it. - Groucho Marx
[PATCH 09/10] nEPT: Documentation
Update the documentation to no longer say that nested EPT is not supported.

Signed-off-by: Nadav Har'El
---
 Documentation/virtual/kvm/nested-vmx.txt |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- .before/Documentation/virtual/kvm/nested-vmx.txt	2011-11-10 11:33:59.0 +0200
+++ .after/Documentation/virtual/kvm/nested-vmx.txt	2011-11-10 11:33:59.0 +0200
@@ -38,8 +38,8 @@ The current code supports running Linux
 Only 64-bit guest hypervisors are supported.
 
 Additional patches for running Windows under guest KVM, and Linux under
-guest VMware server, and support for nested EPT, are currently running in
-the lab, and will be sent as follow-on patchsets.
+guest VMware server, are currently running in the lab, and will be sent as
+follow-on patchsets.
 
 Running nested VMX
[PATCH 10/10] nEPT: Miscellaneous cleanups
Some trivial code cleanups not really related to nested EPT.

Signed-off-by: Nadav Har'El
---
 arch/x86/kvm/vmx.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:34:00.0 +0200
@@ -611,7 +611,6 @@ static void nested_release_page_clean(st
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
-static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
 
 static DEFINE_PER_CPU(struct vmcs *, vmxarea);
@@ -875,8 +874,7 @@ static inline bool nested_cpu_has2(struc
 		(vmcs12->secondary_vm_exec_control & bit);
 }
 
-static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
-	struct kvm_vcpu *vcpu)
+static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12)
 {
 	return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
@@ -6020,7 +6018,7 @@ static int vmx_handle_exit(struct kvm_vc
 	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
 	    !(is_guest_mode(vcpu) && nested_cpu_has_virtual_nmis(
-					get_vmcs12(vcpu), vcpu)))) {
+					get_vmcs12(vcpu)))) {
 		if (vmx_interrupt_allowed(vcpu)) {
 			vmx->soft_vnmi_blocked = 0;
 		} else if (vmx->vnmi_blocked_time > 1000000000LL &&
[PATCH 07/10] nEPT: Advertise EPT to L1
Advertise the support of EPT to the L1 guest, through the appropriate MSR.

This is the last patch of the basic Nested EPT feature, so as to allow bisection through this patch series: The guest will not see EPT support until this last patch, and will not attempt to use the half-applied feature.

Signed-off-by: Nadav Har'El
---
 arch/x86/kvm/vmx.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.0 +0200
@@ -1908,6 +1908,7 @@ static u32 nested_vmx_secondary_ctls_low
 static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high;
 static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high;
 static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high;
+static u32 nested_vmx_ept_caps;
 static __init void nested_vmx_setup_ctls_msrs(void)
 {
 	/*
@@ -1980,6 +1981,16 @@ static __init void nested_vmx_setup_ctls
 	nested_vmx_secondary_ctls_low = 0;
 	nested_vmx_secondary_ctls_high &=
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+	if (nested_ept)
+		nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
+
+	/* ept capabilities */
+	if (nested_ept) {
+		nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+		nested_vmx_ept_caps &= vmx_capability.ept;
+	} else
+		nested_vmx_ept_caps = 0;
+
 }
 
 static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
@@ -2079,8 +2090,8 @@ static int vmx_get_vmx_msr(struct kvm_vc
 			nested_vmx_secondary_ctls_high);
 		break;
 	case MSR_IA32_VMX_EPT_VPID_CAP:
-		/* Currently, no nested ept or nested vpid */
-		*pdata = 0;
+		/* Currently, no nested vpid support */
+		*pdata = nested_vmx_ept_caps;
 		break;
 	default:
 		return 0;
[PATCH 08/10] nEPT: Nested INVEPT
If we let L1 use EPT, we should probably also support the INVEPT instruction.

Signed-off-by: Nadav Har'El
---
 arch/x86/include/asm/vmx.h |    2 +
 arch/x86/kvm/vmx.c         |  112 +++
 2 files changed, 114 insertions(+)

--- .before/arch/x86/include/asm/vmx.h	2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/include/asm/vmx.h	2011-11-10 11:33:59.0 +0200
@@ -279,6 +279,7 @@ enum vmcs_field {
 #define EXIT_REASON_APIC_ACCESS         44
 #define EXIT_REASON_EPT_VIOLATION       48
 #define EXIT_REASON_EPT_MISCONFIG       49
+#define EXIT_REASON_INVEPT              50
 #define EXIT_REASON_WBINVD              54
 #define EXIT_REASON_XSETBV              55
@@ -404,6 +405,7 @@ enum vmcs_field {
 #define VMX_EPTP_WB_BIT				(1ull << 14)
 #define VMX_EPT_2MB_PAGE_BIT			(1ull << 16)
 #define VMX_EPT_1GB_PAGE_BIT			(1ull << 17)
+#define VMX_EPT_INVEPT_BIT			(1ull << 20)
 #define VMX_EPT_EXTENT_INDIVIDUAL_BIT		(1ull << 24)
 #define VMX_EPT_EXTENT_CONTEXT_BIT		(1ull << 25)
 #define VMX_EPT_EXTENT_GLOBAL_BIT		(1ull << 26)
--- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.0 +0200
@@ -351,6 +351,8 @@ struct nested_vmx {
 	struct list_head vmcs02_pool;
 	int vmcs02_num;
 	u64 vmcs01_tsc_offset;
+	/* Remember last EPT02, for single-context INVEPT optimization */
+	u64 last_eptp02;
 	/* L2 must run next, and mustn't decide to exit to L1. */
 	bool nested_run_pending;
 	/*
@@ -1987,6 +1989,10 @@ static __init void nested_vmx_setup_ctls
 	/* ept capabilities */
 	if (nested_ept) {
 		nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+		nested_vmx_ept_caps |=
+			VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT |
+			VMX_EPT_EXTENT_CONTEXT_BIT |
+			VMX_EPT_EXTENT_INDIVIDUAL_BIT;
 		nested_vmx_ept_caps &= vmx_capability.ept;
 	} else
 		nested_vmx_ept_caps = 0;
@@ -5568,6 +5574,105 @@ static int handle_vmptrst(struct kvm_vcp
 	return 1;
 }
 
+/* Emulate the INVEPT instruction */
+static int handle_invept(struct kvm_vcpu *vcpu)
+{
+	u32 vmx_instruction_info;
+	unsigned long type;
+	gva_t gva;
+	struct x86_exception e;
+	struct {
+		u64 eptp, gpa;
+	} operand;
+
+	if (!nested_ept || !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	/* According to the Intel VMX instruction reference, the memory
+	 * operand is read even if it isn't needed (e.g., for type==global)
+	 */
+	vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+			vmx_instruction_info, &gva))
+		return 1;
+	if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
+				sizeof(operand), &e)) {
+		kvm_inject_page_fault(vcpu, &e);
+		return 1;
+	}
+
+	type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
+
+	switch (type) {
+	case VMX_EPT_EXTENT_GLOBAL:
+		if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_GLOBAL_BIT))
+			nested_vmx_failValid(vcpu,
+				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+		else {
+			ept_sync_global();
+			nested_vmx_succeed(vcpu);
+		}
+		break;
+	case VMX_EPT_EXTENT_CONTEXT:
+		if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_CONTEXT_BIT))
+			nested_vmx_failValid(vcpu,
+				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+		else {
+			/*
+			 * We efficiently handle the common case, of L1
+			 * invalidating the last eptp it used to run L2.
+			 * TODO: Instead of saving one last_eptp02, look up
+			 * operand.eptp in the shadow EPT table cache, to
+			 * find its shadow. Then last_eptp02 won't be needed.
+			 */
+			struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+			struct vcpu_vmx *vmx = to_vmx(vcpu);
+			if (vmcs12 && nested_cpu_
[PATCH 06/10] nEPT: Some additional comments
Some additional comments to preexisting code: Explain who (L0 or L1) handles EPT violation and misconfiguration exits. Don't mention "shadow on either EPT or shadow" as the only two options.

Signed-off-by: Nadav Har'El
---
 arch/x86/kvm/vmx.c |   21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.0 +0200
@@ -5815,7 +5815,20 @@ static bool nested_vmx_exit_handled(stru
 		return nested_cpu_has2(vmcs12,
 			SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
 	case EXIT_REASON_EPT_VIOLATION:
+		/*
+		 * L0 always deals with the EPT violation. If nested EPT is
+		 * used, and the nested mmu code discovers that the address is
+		 * missing in the guest EPT table (EPT12), the EPT violation
+		 * will be injected with nested_ept_inject_page_fault()
+		 */
+		return 0;
 	case EXIT_REASON_EPT_MISCONFIG:
+		/*
+		 * L2 never uses directly L1's EPT, but rather L0's own EPT
+		 * table (shadow on EPT) or a merged EPT table that L0 built
+		 * (EPT on EPT). So any problems with the structure of the
+		 * table is L0's fault.
+		 */
 		return 0;
 	case EXIT_REASON_WBINVD:
 		return nested_cpu_has2(vmcs12, SECONDARY_EXEC_WBINVD_EXITING);
@@ -6736,7 +6749,12 @@ static void prepare_vmcs02(struct kvm_vc
 	vmx_set_cr4(vcpu, vmcs12->guest_cr4);
 	vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12));
 
-	/* shadow page tables on either EPT or shadow page tables */
+	/*
+	 * Note that kvm_set_cr3() and kvm_mmu_reset_context() will do the
+	 * right thing, and set GUEST_CR3 and/or EPT_POINTER in all supported
+	 * settings: 1. shadow page tables on shadow page tables, 2. shadow
+	 * page tables on EPT, 3. EPT on EPT.
+	 */
 	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
 	kvm_mmu_reset_context(vcpu);
 
@@ -7075,7 +7093,6 @@ void load_vmcs12_host_state(struct kvm_v
 	if (nested_cpu_has_ept(vmcs12))
 		nested_ept_uninit_mmu_context(vcpu);
 
-	/* shadow page tables on either EPT or shadow page tables */
 	kvm_set_cr3(vcpu, vmcs12->host_cr3);
 	kvm_mmu_reset_context(vcpu);
[PATCH 05/10] nEPT: Fix wrong test in kvm_set_cr3
kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
address. The problem is that with nested EPT, cr3 is an *L2* physical
address, not an L1 physical address as this test expects.

As the comment above this test explains, it isn't necessary, and doesn't
correspond to anything a real processor would do. So this patch just
removes it.

Note that this wrong test could have also theoretically caused problems in
nested NPT, not just in nested EPT. However, in practice, the problem was
avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the nested
NPT case, and instead set the vmcb (and arch.cr3) directly, thus
circumventing the problem. Additional potential calls to the buggy function
are avoided in that we don't trap cr3 modifications when nested NPT is
enabled. However, because in nested VMX we did want to use kvm_set_cr3()
(as requested in Avi Kivity's review of the original nested VMX patches),
we can't avoid this problem and need to fix it.

Signed-off-by: Nadav Har'El

---
 arch/x86/kvm/x86.c | 11 ---
 1 file changed, 11 deletions(-)

--- .before/arch/x86/kvm/x86.c	2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/x86.c	2011-11-10 11:33:59.0 +0200
@@ -690,17 +690,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, u
 		 */
 	}

-	/*
-	 * Does the new cr3 value map to physical memory? (Note, we
-	 * catch an invalid cr3 even in real-mode, because it would
-	 * cause trouble later on when we turn on paging anyway.)
-	 *
-	 * A real CPU would silently accept an invalid cr3 and would
-	 * attempt to use it - with largely undefined (and often hard
-	 * to debug) behavior on the guest side.
-	 */
-	if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
-		return 1;
 	vcpu->arch.cr3 = cr3;
 	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
 	vcpu->arch.mmu.new_cr3(vcpu);
[PATCH 04/10] nEPT: Fix page table format in nested EPT
When the existing KVM MMU code creates a shadow page table, it assumes it
has the normal x86 page table format. This is obviously correct for normal
shadow page tables, and also correct for AMD's NPT.

Unfortunately, Intel's EPT page tables differ in subtle ways from ordinary
page tables, so when we create a shadow EPT table (i.e., in nested EPT), we
need to slightly modify the way in which this table is built.

In particular, when mmu.c's link_shadow_page() creates non-leaf page table
entries, it used to enable the "present", "accessed", "writable" and "user"
flags on these entries. While this is correct for ordinary page tables, it
is wrong in EPT tables - where these bits actually have completely
different meanings (compare PT_*_MASK from mmu.h to VMX_EPT_*_MASK from
vmx.h). In particular, leaving the code as-is causes bit 5 of the PTE to be
turned on (supposedly for PT_ACCESSED_MASK), which is a reserved bit in EPT
and causes an "EPT Misconfiguration" failure.

So we must move link_shadow_page()'s list of extra bits to a new mmu
context field, which is set differently for nested EPT.
Signed-off-by: Nadav Har'El

---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu.c              | 16 +---
 arch/x86/kvm/paging_tmpl.h      |  6 --
 arch/x86/kvm/vmx.c              |  3 +++
 4 files changed, 17 insertions(+), 9 deletions(-)

--- .before/arch/x86/include/asm/kvm_host.h	2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/include/asm/kvm_host.h	2011-11-10 11:33:59.0 +0200
@@ -287,6 +287,7 @@ struct kvm_mmu {
 	bool nx;

 	u64 pdptrs[4]; /* pae */
+	u64 link_shadow_page_set_bits;
 };

 struct kvm_vcpu_arch {

--- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:33:59.0 +0200
@@ -6485,6 +6485,9 @@ static int nested_ept_init_mmu_context(s
 	vcpu->arch.mmu.get_pdptr = nested_ept_get_pdptr;
 	vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
 	vcpu->arch.mmu.shadow_root_level = get_ept_level();
+	vcpu->arch.mmu.link_shadow_page_set_bits =
+		VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
+		VMX_EPT_EXECUTABLE_MASK;

 	vcpu->arch.walk_mmu = &vcpu->arch.nested_mmu;

--- .before/arch/x86/kvm/mmu.c	2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/mmu.c	2011-11-10 11:33:59.0 +0200
@@ -1782,14 +1782,9 @@ static void shadow_walk_next(struct kvm_
 	return __shadow_walk_next(iterator, *iterator->sptep);
 }

-static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
+static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, u64 set_bits)
 {
-	u64 spte;
-
-	spte = __pa(sp->spt)
-		| PT_PRESENT_MASK | PT_ACCESSED_MASK
-		| PT_WRITABLE_MASK | PT_USER_MASK;
-	mmu_spte_set(sptep, spte);
+	mmu_spte_set(sptep, __pa(sp->spt) | set_bits);
 }

 static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
@@ -3366,6 +3361,13 @@ int kvm_init_shadow_mmu(struct kvm_vcpu
 	vcpu->arch.mmu.base_role.cr0_wp = is_write_protection(vcpu);
 	vcpu->arch.mmu.base_role.smep_andnot_wp = smep && !is_write_protection(vcpu);
+	/*
+	 * link_shadow() should apply these bits in shadow page tables, and
+	 * in shadow NPT tables (nested NPT). For nested EPT, different bits
+	 * apply.
+	 */
+	vcpu->arch.mmu.link_shadow_page_set_bits = PT_PRESENT_MASK |
+		PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;

 	return r;
 }

--- .before/arch/x86/kvm/paging_tmpl.h	2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/paging_tmpl.h	2011-11-10 11:33:59.0 +0200
@@ -515,7 +515,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
 			goto out_gpte_changed;

 		if (sp)
-			link_shadow_page(it.sptep, sp);
+			link_shadow_page(it.sptep, sp,
+				vcpu->arch.mmu.link_shadow_page_set_bits);
 	}

 	for (;
@@ -535,7 +536,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
 		sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
 				      true, direct_access, it.sptep);
-		link_shadow_page(it.sptep, sp);
+		link_shadow_page(it.sptep, sp,
+			vcpu->arch.mmu.link_shadow_page_set_bits);

 		clear_sp_write_flooding_count(it.sptep);
[PATCH 03/10] nEPT: Fix cr3 handling in nested exit and entry
The existing code for handling cr3 and related VMCS fields during nested
exit and entry wasn't correct in all cases:

If L2 is allowed to control cr3 (and this is indeed the case in nested
EPT), during nested exit we must copy the modified cr3 from vmcs02 to
vmcs12, and we forgot to do so. This patch adds this copy.

If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
whoever does control cr3 (L1 or L2) is using PAE, the processor might have
saved PDPTEs and we should also save them in vmcs12 (and restore later).

Signed-off-by: Nadav Har'El

---
 arch/x86/kvm/vmx.c | 30 ++
 1 file changed, 30 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:58.0 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:33:58.0 +0200
@@ -6737,6 +6737,17 @@ static void prepare_vmcs02(struct kvm_vc
 	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
 	kvm_mmu_reset_context(vcpu);

+	/*
+	 * Additionally, except when L0 is using shadow page tables, L1 or
+	 * L2 control guest_cr3 for L2, so they may also have saved PDPTEs
+	 */
+	if (enable_ept) {
+		vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
+		vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
+		vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
+		vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
+	}
+
 	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
 	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
 }
@@ -6968,6 +6979,25 @@ void prepare_vmcs12(struct kvm_vcpu *vcp
 	vmcs12->guest_pending_dbg_exceptions =
 		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);

+	/*
+	 * In some cases (usually, nested EPT), L2 is allowed to change its
+	 * own CR3 without exiting. If it has changed it, we must keep it.
+	 * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
+	 * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
+	 */
+	if (enable_ept)
+		vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
+	/*
+	 * Additionally, except when L0 is using shadow page tables, L1 or
+	 * L2 control guest_cr3 for L2, so save their PDPTEs
+	 */
+	if (enable_ept) {
+		vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+		vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+		vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+		vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+	}
+
 	/* TODO: These cannot have changed unless we have MSR bitmaps and
 	 * the relevant bit asks not to trap the change */
 	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
[PATCH 02/10] nEPT: MMU context for nested EPT
KVM's existing shadow MMU code already supports nested TDP. To use it, we
need to set up a new "MMU context" for nested EPT, and create a few
callbacks for it (nested_ept_*()). We then need to switch back and forth
between this nested context and the regular MMU context when switching
between L1 and L2.

Signed-off-by: Nadav Har'El

---
 arch/x86/kvm/vmx.c | 60 +++
 1 file changed, 60 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:58.0 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:33:58.0 +0200
@@ -6443,6 +6443,59 @@ static void vmx_set_supported_cpuid(u32
 		entry->ecx |= bit(X86_FEATURE_VMX);
 }

+/* Callbacks for nested_ept_init_mmu_context: */
+static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
+{
+	/* return the page table to be shadowed - in our case, EPT12 */
+	return get_vmcs12(vcpu)->ept_pointer;
+}
+
+static u64 nested_ept_get_pdptr(struct kvm_vcpu *vcpu, int index)
+{
+	/*
+	 * This function is called (as mmu.get_pdptr()) in mmu.c to help read
+	 * a to-be-shadowed page table in PAE (3-level) format. However, the
+	 * EPT table we're now shadowing (this is the nested EPT mmu), must
+	 * always have 4 levels, and is not in PAE format, so this function
+	 * should never be called.
+	 */
+	kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+}
+
+static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
+		struct x86_exception *fault)
+{
+	struct vmcs12 *vmcs12;
+	nested_vmx_vmexit(vcpu);
+	vmcs12 = get_vmcs12(vcpu);
+	/*
+	 * Note no need to set vmcs12->vm_exit_reason as it is already copied
+	 * from vmcs02 in nested_vmx_vmexit() above, i.e., EPT_VIOLATION.
+	 */
+	vmcs12->exit_qualification = fault->error_code;
+	vmcs12->guest_physical_address = fault->address;
+}
+
+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+	int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
+
+	vcpu->arch.mmu.set_cr3 = vmx_set_cr3;
+	vcpu->arch.mmu.get_cr3 = nested_ept_get_cr3;
+	vcpu->arch.mmu.get_pdptr = nested_ept_get_pdptr;
+	vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
+	vcpu->arch.mmu.shadow_root_level = get_ept_level();
+
+	vcpu->arch.walk_mmu = &vcpu->arch.nested_mmu;
+
+	return r;
+}
+
+static void nested_ept_uninit_mmu_context(struct kvm_vcpu *vcpu)
+{
+	vcpu->arch.walk_mmu = &vcpu->arch.mmu;
+}
+
 /*
  * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
  * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
@@ -6652,6 +6705,11 @@ static void prepare_vmcs02(struct kvm_vc
 		vmx_flush_tlb(vcpu);
 	}

+	if (nested_cpu_has_ept(vmcs12)) {
+		kvm_mmu_unload(vcpu);
+		nested_ept_init_mmu_context(vcpu);
+	}
+
 	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
 		vcpu->arch.efer = vmcs12->guest_ia32_efer;
 	if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
@@ -6982,7 +7040,8 @@ void load_vmcs12_host_state(struct kvm_v
 	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
 	kvm_set_cr4(vcpu, vmcs12->host_cr4);

+	if (nested_cpu_has_ept(vmcs12))
+		nested_ept_uninit_mmu_context(vcpu);
 	/* shadow page tables on either EPT or shadow page tables */
 	kvm_set_cr3(vcpu, vmcs12->host_cr3);
 	kvm_mmu_reset_context(vcpu);
[PATCH 01/10] nEPT: Module option
Add a module option "nested_ept" determining whether to enable Nested EPT.
Nested EPT means emulating EPT for an L1 guest so that L1 can use EPT when
running a nested guest L2. When L1 uses EPT, it allows the L2 guest to set
its own cr3 and take its own page faults without either of L0 or L1
getting involved. This often significantly improves L2's performance over
the previous two alternatives (shadow page tables over EPT, and shadow
page tables over shadow page tables).

nested_ept is currently enabled by default (when nested VMX is enabled),
unless L0 doesn't have EPT or disabled it with ept=0. Users would not
normally want to explicitly disable this option. One reason why one might
want to disable it is to force L1 to make do without the EPT capability,
when anticipating a future need to migrate this L1 to another host which
doesn't have EPT. Note that currently there is no API to turn off nested
EPT for just a single L1 guest. However, obviously, an individual L1 guest
may choose not to use EPT - the nested_cpu_has_ept() checks if L1 actually
used EPT when running L2.

In the future, we can support emulation of EPT for L1 *always*, even when
L0 itself doesn't have EPT. This so-called "EPT on shadow page tables"
mode has some theoretical advantages over the baseline "shadow page tables
on shadow page tables" mode typically used when EPT is not available to
L0 - namely that L2's cr3 changes and page faults can be handled in L0 and
do not need to be propagated to L1. However, currently we do not support
this mode, and it is becoming less interesting as newer processors all
support EPT.

Signed-off-by: Nadav Har'El

---
 arch/x86/kvm/vmx.c | 12
 1 file changed, 12 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-11-10 11:33:58.0 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-11-10 11:33:58.0 +0200
@@ -83,6 +83,10 @@ module_param(fasteoi, bool, S_IRUGO);
 static int __read_mostly nested = 0;
 module_param(nested, bool, S_IRUGO);

+/* Whether L0 emulates EPT for its L1 guests. It doesn't mean L1 must use it */
+static int __read_mostly nested_ept = 1;
+module_param(nested_ept, bool, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST \
 	(X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD)
 #define KVM_GUEST_CR0_MASK \
@@ -875,6 +879,11 @@ static inline bool nested_cpu_has_virtua
 	return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }

+static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12)
+{
+	return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT);
+}
+
 static inline bool is_exception(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2642,6 +2651,9 @@ static __init int hardware_setup(void)
 	if (!cpu_has_vmx_ple())
 		ple_gap = 0;

+	if (!nested || !enable_ept)
+		nested_ept = 0;
+
 	if (nested)
 		nested_vmx_setup_ctls_msrs();
[PATCH 0/10] nEPT: Nested EPT support for Nested VMX
The following patches add nested EPT support to Nested VMX. Nested EPT
means emulating EPT for an L1 guest, allowing it to use EPT when running a
nested guest L2. When L1 uses EPT, it allows the L2 guest to set its own
cr3 and take its own page faults without either of L0 or L1 getting
involved. In many workloads this significantly improves L2's performance
over the previous two alternatives (shadow page tables over EPT, and
shadow page tables over shadow page tables). Our paper [1] described these
three options, and the advantages of nested EPT ("multidimensional
paging").

Nested EPT is enabled by default (if the hardware supports EPT), so users
do not have to do anything special to enjoy the performance improvement
that this patch gives to L2 guests.

Just as a non-scientific, non-representative indication of the kind of
dramatic performance improvement you may see in workloads that have a lot
of context switches and page faults, here is a measurement of the time an
example single-threaded "make" took in L2 (kvm over kvm):

  shadow over shadow: 105 seconds ("ept=0" forces this)
  shadow over EPT:     87 seconds (the previous default; can be forced now
                                   with "nested_ept=0")
  EPT over EPT:        29 seconds (the default after this patch)

Note that the same test on L1 (with EPT) took 25 seconds, so for this
example workload, performance of nested virtualization is now very close
to that of single-level virtualization.
[1] "The Turtles Project: Design and Implementation of Nested
Virtualization",
http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf

Patch statistics:
-
 Documentation/virtual/kvm/nested-vmx.txt |   4
 arch/x86/include/asm/kvm_host.h          |   1
 arch/x86/include/asm/vmx.h               |   2
 arch/x86/kvm/mmu.c                       |  16
 arch/x86/kvm/paging_tmpl.h               |   6
 arch/x86/kvm/vmx.c                       | 259
 arch/x86/kvm/x86.c                       |  11
 7 files changed, 269 insertions(+), 30 deletions(-)

--
Nadav Har'El
IBM Haifa Research Lab
Re: Intel VT-x VMCS doubt
On Sat, Oct 15, 2011, Shashank Rachamalla wrote about "Re: Intel VT-x VMCS doubt":
> However, I am not sure when the CPU switches between host and guest
> state ? I came across instructions VMRUN and VMRESUME while reading
> which do a VM entry. I think guest state gets loaded when these
> instructions are executed. Someone please clarify ?

Indeed - on VMLAUNCH or VMRESUME ("VMRUN" is an AMD instruction and not
relevant to your Intel question), the processor, among other things, sets
its state to the "guest state" specified in the given VMCS. When the
processor decides to "exit" the guest (the host determines, in the VMCS,
on which events such exits happen), the processor loads back the "host
state" as written in the VMCS.

But after replying to your specific question, I'd recommend that before
you continue to ask these questions, you take some time to read an
organized introduction to VT-x. The official description of VMX (VT-x) by
Intel can be found in their Software Developer's Manual, volume 3,
starting with Chapter 20 ("Introduction to Virtual-Machine Extensions").
You can find it online (for browsing or downloading everything as one PDF
file) at
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-3b-system-programming-manual.html

Volume 2 of the same manual gives additional details on each instruction,
like the aforementioned VMLAUNCH and friends.

--
Nadav Har'El                        | Sunday, Oct 16 2011,
n...@math.technion.ac.il           |-----------------------------------
Phone +972-523-790466, ICQ 13349191 | I started out with nothing... I still
http://nadav.harel.org.il          | have most of it.
Re: Intel VT-x VMCS doubt
On Sat, Oct 15, 2011, Shashank Rachamalla wrote about "Intel VT-x VMCS doubt":
> Hi
>
> After catching up with some reading on Intel VT-x, I could understand
> the use of VMCS ( virtual machine control structure ) which allows a VMM
> to specify instructions and events that cause VM entries and exits. My
> current understanding is that a VMCS structure is maintained by VMM for
> each guest and this structure is used by CPU during a VM entry and VM
> exit.

This is correct.

> However, I am not sure if each CPU core has a separate VMCS
> structure for each guest ? Also, does the number of guest VCPUs
> determine the number of VMCS structures required for a guest ? Can
> someone please help me figure out answers to these questions..

I'm not sure the KVM development mailing list is the best place to ask
these questions, but I'll try to answer anyway.

The VMCS is, at its essence, an ordinary data structure in ordinary
memory, and as you said there is just one per guest, and their number is
NOT multiplied by the number of physical CPUs.

When the hypervisor decides to actually run a particular guest with a
particular VMCS on a particular physical CPU, there are certain things it
needs to do to make this VMCS "ready" for this CPU. One thing that needs
to be done is to modify a few host fields in the VMCS, because in Linux a
few registers take different default values on different cores, and the
"host fields" in the VMCS remember the values we need to go back to after
the exit. The second thing that needs to be done is, according to the
spec, to VMPTRLD the VMCS onto the physical CPU (doing VMCLEAR first on a
previous CPU, if this VMCS was previously used on a different CPU). This
is necessary because, according to the spec, each physical CPU (core)
might work with a cached version of the VMCS - and this cache might be
per core.
So to make a long story short: while KVM is running a particular VMCS on a
particular CPU, it slightly modifies this VMCS to fit the needs of this
particular CPU; it *doesn't* save additional copies of the same VMCS for
running on different CPUs. The event of switching a guest from running on
one physical CPU to a different physical CPU is known as "VCPU migration"
(the VCPU "migrates" between two physical CPUs), and KVM handles this case
by setting a few VMCS fields and doing VMCLEAR/VMPTRLD as explained above.

I hope this helps make things clearer...

--
Nadav Har'El                        | Sunday, Oct 16 2011,
n...@math.technion.ac.il           |-----------------------------------
Phone +972-523-790466, ICQ 13349191 | Computers are like air conditioners. Both
http://nadav.harel.org.il          | stop working if you open windows.
Re: [PATCH 1/2] nVMX: Add KVM_REQ_IMMEDIATE_EXIT
On Fri, Sep 23, 2011, Marcelo Tosatti wrote about "Re: [PATCH 1/2] nVMX: Add KVM_REQ_IMMEDIATE_EXIT":
> On Thu, Sep 22, 2011 at 01:52:56PM +0300, Nadav Har'El wrote:
> > This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT.
> > This bit requests that when next entering the guest, we should run it only
> > for as little as possible, and exit again.
> >
> > We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1
> ...
> > @@ -5647,6 +5648,8 @@ static int vcpu_enter_guest(struct kvm_v
> > 	}
> > 	if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
> > 		record_steal_time(vcpu);
> > +	req_immediate_exit =
> > +		kvm_check_request(KVM_REQ_IMMEDIATE_EXIT, vcpu);
> ...
> The immediate exit information can be lost if entry decides to bail out.
> You can do
>
> req_immediate_exit = kvm_check_request(KVM_REQ_IMMEDIATE_EXIT)
> after preempt_disable()
> and then transfer back the bit in the bail out case in
> if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests

Thanks. But thinking about this a bit, it seems to me that in my case
*losing* this bit on a canceled entry is the correct thing to do, as
turning on this bit was decided in the injection phase (in
enable_irq_window()), and next time, if the reason to turn on this bit
still exists (i.e., L0 has something to inject to L1, but L2 needs to
run), we will turn it on again.

--
Nadav Har'El                        | Sunday, Sep 25 2011,
n...@math.technion.ac.il           |-----------------------------------
Phone +972-523-790466, ICQ 13349191 | Guarantee: this email is 100% free of
http://nadav.harel.org.il          | magnetic monopoles, or your money back!
[PATCH 2/2] nVMX: Fix warning-causing idt-vectoring-info behavior
When L0 wishes to inject an interrupt while L2 is running, it emulates an
exit to L1 with EXIT_REASON_EXTERNAL_INTERRUPT. This was explained in the
original nVMX patch 23, titled "Correct handling of interrupt injection".

Unfortunately, it is possible (though rare) that at this point there is
valid idt_vectoring_info in vmcs02. For example, L1 injected some
interrupt to L2, and when L2 tried to run this interrupt's handler, it got
a page fault - so it returns the original interrupt vector in
idt_vectoring_info. The problem is that if this is the case, we cannot
exit to L1 with EXTERNAL_INTERRUPT like we wished to, because the VMX spec
guarantees that idt_vectoring_info and exit reason EXTERNAL_INTERRUPT can
never happen together. This is not just specified in the spec - a KVM L1
actually prints a kernel warning "unexpected, valid vectoring info" if we
violate this guarantee, and some users noticed these warnings in L1's
logs.

In order to better emulate a processor, which would never return the
external interrupt and the idt-vectoring-info together, we need to
separate the two injection steps: First, complete L1's injection into L2
(i.e., enter L2, injecting to it the idt-vectoring-info); Second, after
entry into L2 succeeds and it exits back to L0, exit to L1 with
EXIT_REASON_EXTERNAL_INTERRUPT. Most of this is already in the code - the
only change we need is to remain in L2 (and not exit to L1) in this case.

Note that the previous patch ensures (by using KVM_REQ_IMMEDIATE_EXIT)
that although we do enter L2 first, it will exit immediately after
processing its injection, allowing us to promptly inject to L1.

Note how we test vmcs12->idt_vectoring_info_field; this isn't really the
vmcs12 value (we haven't exited to L1 yet, so vmcs12 hasn't been updated),
but rather the place we save, at the end of vmx_vcpu_run, the vmcs02 value
of this field. This was explained in patch 25 ("Correct handling of idt
vectoring info") of the original nVMX patch series.
Thanks to Dave Allan and to Federico Simoncelli for reporting this bug, to
Abel Gordon for helping me figure out the solution, and to Avi Kivity for
helping to improve it.

Signed-off-by: Nadav Har'El

---
 arch/x86/kvm/vmx.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-09-22 13:51:31.0 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-09-22 13:51:31.0 +0300
@@ -3993,11 +3993,12 @@ static void vmx_set_nmi_mask(struct kvm_
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
 	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
-		struct vmcs12 *vmcs12;
-		if (to_vmx(vcpu)->nested.nested_run_pending)
+		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+		if (to_vmx(vcpu)->nested.nested_run_pending ||
+		    (vmcs12->idt_vectoring_info_field &
+		     VECTORING_INFO_VALID_MASK))
 			return 0;
 		nested_vmx_vmexit(vcpu);
-		vmcs12 = get_vmcs12(vcpu);
 		vmcs12->vm_exit_reason = EXIT_REASON_EXTERNAL_INTERRUPT;
 		vmcs12->vm_exit_intr_info = 0;
 		/* fall through to normal code, but now in L1, not L2 */
[PATCH 1/2] nVMX: Add KVM_REQ_IMMEDIATE_EXIT
This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT. This bit
requests that when next entering the guest, we should run it only for as
little as possible, and exit again.

We use this new option in nested VMX: When L1 launches L2, but L0 wishes
L1 to continue running so it can inject an event to it, we unfortunately
cannot just pretend to have run L2 for a little while - we must really
launch L2, otherwise certain one-off vmcs12 parameters (namely, L1
injection into L2) will be lost. So the existing code runs L2 in this
case. But L2 could potentially run for a long time until it exits, and the
injection into L1 will be delayed. The new KVM_REQ_IMMEDIATE_EXIT allows
us to request that L2 will be entered, as necessary, but will exit as soon
as possible after entry.

Our implementation of this request uses smp_send_reschedule() to send a
self-IPI, with interrupts disabled. The interrupts remain disabled until
the guest is entered, and then, after the entry is complete (often
including processing an injection and jumping to the relevant handler),
the physical interrupt is noticed and causes an exit.

On recent Intel processors, we could have achieved the same goal by using
MTF instead of a self-IPI. Another technique worth considering in the
future is to use VM_EXIT_ACK_INTR_ON_EXIT and a highest-priority vector
IPI - to slightly improve performance by avoiding the useless interrupt
handler which ends up being called when smp_send_reschedule() is used.
Signed-off-by: Nadav Har'El

---
 arch/x86/kvm/vmx.c       | 11 +++
 arch/x86/kvm/x86.c       |  6 ++
 include/linux/kvm_host.h |  1 +
 3 files changed, 14 insertions(+), 4 deletions(-)

--- .before/include/linux/kvm_host.h	2011-09-22 13:51:31.0 +0300
+++ .after/include/linux/kvm_host.h	2011-09-22 13:51:31.0 +0300
@@ -48,6 +48,7 @@
 #define KVM_REQ_EVENT             11
 #define KVM_REQ_APF_HALT          12
 #define KVM_REQ_STEAL_UPDATE      13
+#define KVM_REQ_IMMEDIATE_EXIT    14

 #define KVM_USERSPACE_IRQ_SOURCE_ID	0

--- .before/arch/x86/kvm/x86.c	2011-09-22 13:51:31.0 +0300
+++ .after/arch/x86/kvm/x86.c	2011-09-22 13:51:31.0 +0300
@@ -5610,6 +5610,7 @@ static int vcpu_enter_guest(struct kvm_v
 	bool nmi_pending;
 	bool req_int_win = !irqchip_in_kernel(vcpu->kvm) &&
 		vcpu->run->request_interrupt_window;
+	bool req_immediate_exit = 0;

 	if (vcpu->requests) {
 		if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
@@ -5647,6 +5648,8 @@ static int vcpu_enter_guest(struct kvm_v
 		}
 		if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
 			record_steal_time(vcpu);
+		req_immediate_exit =
+			kvm_check_request(KVM_REQ_IMMEDIATE_EXIT, vcpu);
 	}
@@ -5706,6 +5709,9 @@ static int vcpu_enter_guest(struct kvm_v
 	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);

+	if (req_immediate_exit)
+		smp_send_reschedule(vcpu->cpu);
+
 	kvm_guest_enter();

 	if (unlikely(vcpu->arch.switch_db_regs)) {

--- .before/arch/x86/kvm/vmx.c	2011-09-22 13:51:31.0 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-09-22 13:51:31.0 +0300
@@ -3858,12 +3858,15 @@ static bool nested_exit_on_intr(struct k
 static void enable_irq_window(struct kvm_vcpu *vcpu)
 {
 	u32 cpu_based_vm_exec_control;
-	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
-		/* We can get here when nested_run_pending caused
-		 * vmx_interrupt_allowed() to return false. In this case, do
-		 * nothing - the interrupt will be injected later.
+	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
+		/*
+		 * We get here if vmx_interrupt_allowed() said we can't
+		 * inject to L1 now because L2 must run. Ask L2 to exit
+		 * right after entry, so we can inject to L1 more promptly.
 		 */
+		kvm_make_request(KVM_REQ_IMMEDIATE_EXIT, vcpu);
 		return;
+	}

 	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
[PATCH 0/2] nVMX injection corrections
The following two patches solve two injection-related nested VMX issues:

1. When we must run L2 next (namely on L1's VMLAUNCH/VMRESUME), injection
   into L1 was delayed for an unknown amount of time - until L2 exits. We
   now force (using a self IPI) an exit immediately after entry to L2, so
   that the injection into L1 happens promptly.

2. "unexpected, valid vectoring info" warnings appeared in L1. These are
   fixed by correcting the emulation of concurrent L0->L1 and L1->L2
   injections: We cannot inject into L1 until the injection into L2 has
   been processed.

Patch statistics:
-
 arch/x86/kvm/vmx.c       | 18 +++---
 arch/x86/kvm/x86.c       |  6 ++
 include/linux/kvm_host.h |  1 +
 3 files changed, 18 insertions(+), 7 deletions(-)

--
Nadav Har'El
IBM Haifa Research Lab
[PATCH] nVMX: Fix warning-causing idt-vectoring-info behavior
This patch solves two outstanding nested-VMX issues:

1. "unexpected, valid vectoring info" warnings appeared in L1. These are fixed
   by correcting the emulation of concurrent L0->L1 and L1->L2 injections.

2. When we must run L2 next (e.g., on L1's VMLAUNCH/VMRESUME), injection into
   L1 was delayed for an unknown amount of time - until L2 exits. We now force
   (using a self-IPI) an exit immediately after entry to L2, so that the
   injection into L1 happens promptly.

Some more details about these issues:

When L0 wishes to inject an interrupt while L2 is running, it emulates an exit
to L1 with EXIT_REASON_EXTERNAL_INTERRUPT. This was explained in the original
nVMX patch 23, titled "Correct handling of interrupt injection".

Unfortunately, it is possible (though rare) that at this point there is valid
idt_vectoring_info in vmcs02. For example, L1 injected some interrupt to L2,
and when L2 tried to run this interrupt's handler, it got a page fault - so it
returns the original interrupt vector in idt_vectoring_info. The problem is
that in this case we cannot exit to L1 with EXTERNAL_INTERRUPT as we wished
to, because the VMX spec guarantees that idt_vectoring_info and an
EXTERNAL_INTERRUPT exit reason can never occur together. This is not just
specified in the spec - a KVM L1 actually prints the kernel warning
"unexpected, valid vectoring info" if we violate this guarantee, and some
users noticed these warnings in L1's logs.

In order to better emulate a processor, which would never return the external
interrupt and the idt-vectoring-info together, we need to separate the two
injection steps: first, complete L1's injection into L2 (i.e., enter L2,
injecting to it the idt-vectoring-info); second, after entry into L2 succeeds
and it exits back to L0, exit to L1 with EXIT_REASON_EXTERNAL_INTERRUPT. Most
of this is already in the code - the only change we need is to remain in L2
(and not exit to L1) in this case.
However, to ensure prompt injection to L1, instead of letting L2 run for a
while after entering, we can send a self-IPI, which will ensure that L2 exits
immediately after the (necessary) entry, so we can inject into L1 as soon as
possible.

Note that we added this self-IPI not only in the idt-vectoring-info case
above, but in every case where we are forced to enter L2 despite wishing to
inject to L1. This includes the case when L1 has just VMLAUNCHed/VMRESUMEed
L2.

Note how we test vmcs12->idt_vectoring_info_field; this isn't really the
vmcs12 value (we haven't exited to L1 yet, so vmcs12 hasn't been updated), but
rather the place where we save, at the end of vmx_vcpu_run, the vmcs02 value
of this field. This was explained in patch 25 ("Correct handling of idt
vectoring info") of the original nVMX patch series.

Thanks to Dave Allan and to Federico Simoncelli for reporting this bug, to
Abel Gordon for helping me figure out the solution, and to Avi Kivity for
helping to improve it.

Signed-off-by: Nadav Har'El
---
 arch/x86/kvm/vmx.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-09-21 13:45:59.0 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-09-21 13:45:59.0 +0300
@@ -3858,12 +3858,17 @@ static bool nested_exit_on_intr(struct k
 static void enable_irq_window(struct kvm_vcpu *vcpu)
 {
 	u32 cpu_based_vm_exec_control;

-	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
-		/* We can get here when nested_run_pending caused
-		 * vmx_interrupt_allowed() to return false. In this case, do
-		 * nothing - the interrupt will be injected later.
+	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
+		/*
+		 * We get here if vmx_interrupt_allowed() returned 0 because
+		 * we must enter L2 now, so we can't inject to L1 now. If we
+		 * just do nothing, L2 will later exit and we can inject the
+		 * IRQ to L1 then. But to make L2 exit more promptly, we send
+		 * a self-IPI, causing L2 to exit right after entry.
 		 */
+		smp_send_reschedule(vcpu->cpu);
 		return;
+	}

 	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
@@ -3990,11 +3995,12 @@ static void vmx_set_nmi_mask(struct kvm_
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
 	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
-		struct vmcs12 *vmcs12;
-		if (to_vmx(vcpu)->nested.nested_run_pending)
+		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+		if (to_vmx(vcpu)->nested.nested_run_pending ||
+		    (vmcs12->idt_vectoring_info_field &
+		     VECTORING_INFO_VALID_MASK))
Re: [PATCH 3/3] Fix TSC MSR read in nested SVM
On Sun, Aug 07, 2011, Joerg Roedel wrote about "Re: [PATCH 3/3] Fix TSC MSR read in nested SVM":
> > Joerg, can you review and ack Nadav's SVM patch? TIA
>
> Tested-by: Joerg Roedel
> Acked-by: Joerg Roedel
>
> Reviewed and tested (didn't apply cleanly, but was easy to fix that up).
> Looks all good so far.

Hi guys, I'm a bit confused about how we want to proceed from here.

The patches which I sent appear to fix the original bug (as confirmed by the
two people who reported it), but Zachary warned that it would break migration
of nested SVM while L2 is running. I asked whether migration works at all
while L2 is running (without exiting to L1 first) - and Marcelo suggested
that it doesn't.

If the problem Zachary pointed to is considered serious, I proposed a second
option - to leave the code to *wrongly* return the L1 TSC MSR always, and
check (and warn) in situations where this value is wrongly returned to the
guest. But this would leave qemu always reading the TSC MSR of L1, even when
L2 is running. While I proposed this as a second option, I don't think I can
recommend it.

So what's the verdict? :-)

Thanks,
Nadav.

--
Nadav Har'El                        | Wednesday, Aug 10 2011, 10 Av 5771
n...@math.technion.ac.il            |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |It's fortunate I have bad luck - without
http://nadav.harel.org.il           |it I would have no luck at all!
Re: [ANNOUNCE] qemu-kvm-0.15.0
On Wed, Aug 10, 2011, Thomas Mittelstaedt wrote about "Re: [ANNOUNCE] qemu-kvm-0.15.0":
> Grabbed the version, created a disk image of 5G and tried to load an
> iso via
> /usr/local/bin/qemu -hda ~/virtualdisk.img -cdrom
> ~/Downloads/oneiric-desktop-i386-alpha3-lubuntu.iso -boot d -m 1024
>
> and it is very, very slow. Tried the same with the installed 0.11

Sounds like maybe you are using Qemu without KVM? Is running qemu with the
"--enable-kvm" option better?

"--enable-kvm" used to be the default in qemu-kvm, but maybe this changed
(I haven't been following the discussion in this area).

--
Nadav Har'El                        | Wednesday, Aug 10 2011, 10 Av 5771
n...@math.technion.ac.il            |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Always keep your words soft and sweet,
http://nadav.harel.org.il           |just in case you have to eat them.
Re: [RFC] postcopy livemigration proposal
> >* What is postcopy livemigration
> >It is yet another live migration mechanism for Qemu/KVM, which
> >implements the migration technique known as "postcopy" or "lazy"
> >migration. Just after the "migrate" command is invoked, the execution
> >host of a VM is instantaneously switched to a destination host.

Sounds like a cool idea.

> >The benefit is, total migration time is shorter because it transfers
> >a page only once. On the other hand precopy may repeat sending the same
> >pages again and again because they can be dirtied.
> >The switching time from the source to the destination is several
> >hundred milliseconds so that it enables quick load balancing.
> >For details, please refer to the papers.

While these are the obvious benefits, the possible downside (which, as
always, depends on the workload) is the amount of time that the guest
workload runs more slowly than usual, waiting for pages it needs to continue.

There is a whole spectrum between the guest pausing completely (which would
solve all the problems of migration, but is often considered unacceptable)
and running at full speed. Is it acceptable that the guest runs at 90% speed
during the migration? 50%? 10%? I guess we have nothing to lose from having
both options, and choosing the most appropriate technique for each guest!

> That's terrific (nice video also)!
> Orit and myself had the exact same idea too (now we can't patent it..).

I think the new implementation is not the only reason why you cannot patent
this idea :-) Demand-paged migration has actually been discussed (and done)
for nearly a quarter of a century (!) in the area of *process* migration.
The first use I'm aware of was in CMU's Accent in 1987 - see [1]. Another
paper, [2], written in 1991, discusses how process migration is done in
UCB's Sprite operating system, and evaluates the various alternatives common
at the time (20 years ago), including what it calls "lazy copying", which is
more or less the same thing as "postcopy".
Mosix (a project which, in some sense, is still alive today) also used some
sort of cross between pre-copying (of dirty pages) and copying on-demand of
clean pages (from their backing store on the source machine).

References

[1] "Attacking the Process Migration Bottleneck"
    http://www.nd.edu/~dthain/courses/cse598z/fall2004/papers/accent.pdf
[2] "Transparent Process Migration: Design Alternatives and the Sprite
    Implementation"
    http://nd.edu/~dthain/courses/cse598z/fall2004/papers/sprite-migration.pdf

> Advantages:
> - Virtual machines are using more and more memory resources,
> for a virtual machine with very large working set doing live
> migration with reasonable down time is impossible today.

If a guest actually constantly uses (as its working set) most of its
allocated memory, it will basically be unable to do any significant amount
of work on the destination until this large working set is transferred to
the destination. So in this scenario, "postcopy" doesn't give any
significant advantage over plain old "pause guest and send it to the
destination". Or am I missing something?

> Disadvantages:
> - During the live migration the guest will run slower than in
> today's live migration. We need to remember that even today
> guests suffer from performance penalty on the source during the
> COW stage (memory copy).

I wonder if something like asynchronous page faults can help somewhat with
multi-process guest workloads (and a modified (PV) guest OS).

> - Failure of the source or destination or the network will cause
> us to lose the running virtual machine. Those failures are very
> rare.

How is this different from a VM running on a single machine that fails? Just
that the small probability of failure (roughly) doubles for the
relatively-short duration of the transfer?

--
Nadav Har'El                        | Monday, Aug 8 2011, 8 Av 5771
n...@math.technion.ac.il            |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |If glory comes after death, I'm not in a
http://nadav.harel.org.il           |hurry. (Latin proverb)