Re: Monotonic clock with KVM pv-clock

2014-01-21 Thread Nadav Har'El
On Mon, Jan 20, 2014, Marcelo Tosatti wrote about "Re: Monotonic clock with KVM 
pv-clock":
> On Mon, Jan 20, 2014 at 11:56:56AM +0200, Nadav Har'El wrote:
> > Hi,
> > 
> > I'm trying to figure out how a guest OS can get a monotonic clock using
> > KVM's paravirtual clock.
> > 
> > At first, I thought that the clock I get using KVM_SYSTEM_TIME is a
> > monotonic clock, based on the host's monotonic clock.
> 
> It is. However, it is used in conjunction with TSC delta, part
> of the structure which is written at KVM_SYSTEM_TIME GPA. See
> pvclock_clocksource_read at arch/x86/kernel/pvclock.c.

Sure, I'm using the whole protocol for KVM_SYSTEM_TIME. I just wondered
whether it has a meaningful definition (namely, being a monotonic clock) when
used on its own, not in conjunction with the KVM_WALL_CLOCK protocol.

> > 2. What happens when the wall-clock time is set on the host? I was
> >hoping that only KVM_WALL_CLOCK changes and KVM_SYSTEM_TIME doesn't,
> >but am no longer sure this is actually the case.
> 
> Yes, it is the case. The host clock which backs system_timestamp field
> of pvclock structure is
> 
> CLOCK_MONOTONIC
> Clock  that  cannot be set and represents monotonic time
> since some unspecified starting point.  This clock is not affected by
> discontinuous jumps in the system time (e.g., if the system
> administrator manually changes the clock), but is affected by the
incremental adjustments performed by adjtime(3) and NTP.

Excellent.

> 
> > If KVM_SYSTEM_TIME is not a correct way to get a monotonic paravirtual clock
> > from KVM, is there a correct way?
> 
> Inside a Linux guest? Can use sched_clock().
> 
> If not a Linux guest, either implement kvmclock-like driver
> (kvm-unit-test contains one).

This is actually an OSv guest (http://osv.io/).

Until now we were using kvmclock by adding up the wall-clock and
system-time values, which gave us a good wall-time clock
(CLOCK_REALTIME), but I really wanted the guest to also have a
monotonic clock (CLOCK_MONOTONIC).

I took a look at how a Linux guest uses the KVM pv-clock, and thought
that the "system time" part of the kvmclock protocol would make a good
monotonic clock, but wasn't sure I was understanding this right.
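
For reference, this is roughly what the read side of the "system time"
protocol looks like in C - a simplified sketch of what Linux's
pvclock_clocksource_read() does, with the struct layout following the
pvclock ABI, but with the memory barriers and the overflow-safe scaling of
the real implementation reduced to a bare minimum:

#include <stdint.h>

struct pvclock_vcpu_time_info {        /* page registered via KVM_SYSTEM_TIME */
        uint32_t version;              /* odd while the host updates the struct */
        uint32_t pad0;
        uint64_t tsc_timestamp;        /* guest TSC at the last host update */
        uint64_t system_time;          /* ns of host monotonic time at that point */
        uint32_t tsc_to_system_mul;    /* fixed-point TSC-to-ns multiplier */
        int8_t   tsc_shift;
        uint8_t  flags;
        uint8_t  pad[2];
};

/* Monotonic nanoseconds: system_time plus the scaled TSC delta since the
 * host last updated the structure. The real code also issues rmb() around
 * the version checks. */
static uint64_t pvclock_read_ns(volatile struct pvclock_vcpu_time_info *ti)
{
        uint32_t version;
        uint64_t delta, ns;

        do {
                version = ti->version; /* retry if an update races with us */
                delta = __builtin_ia32_rdtsc() - ti->tsc_timestamp;
                if (ti->tsc_shift >= 0)
                        delta <<= ti->tsc_shift;
                else
                        delta >>= -ti->tsc_shift;
                ns = ti->system_time +
                     (uint64_t)(((__uint128_t)delta * ti->tsc_to_system_mul) >> 32);
        } while ((version & 1) || version != ti->version);

        return ns;
}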

Thanks for the clarifications!

Nadav.

-- 
Nadav Har'El| Tuesday, Jan 21 2014, 20 Shevat 5774
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |"The average person thinks he isn't." -
http://nadav.harel.org.il   |Father Larry Lorenzoni


Monotonic clock with KVM pv-clock

2014-01-20 Thread Nadav Har'El
Hi,

I'm trying to figure out how a guest OS can get a monotonic clock using
KVM's paravirtual clock.

At first, I thought that the clock I get using KVM_SYSTEM_TIME is a
monotonic clock, based on the host's monotonic clock.
But I'm no longer sure this is actually done correctly:

1. What happens on live migration? On the new host, will the
   KVM_SYSTEM_TIME continue from the same point? Or will it jump because
   KVM_WALL_CLOCK (the wall time during boot) is different on the new host?

2. What happens when the wall-clock time is set on the host? I was
   hoping that only KVM_WALL_CLOCK changes and KVM_SYSTEM_TIME doesn't,
   but am no longer sure this is actually the case.

If KVM_SYSTEM_TIME is not a correct way to get a monotonic paravirtual clock
from KVM, is there a correct way?

Thanks,
Nadav.

-- 
Nadav Har'El|  Monday, Jan 20 2014, 19 Shevat 5774
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Diplomat: A man who always remembers a
http://nadav.harel.org.il   |woman's birthday but never her age.


Re: nested migration issues

2013-08-27 Thread Nadav Har'El
On Tue, Aug 27, 2013, Jan Kiszka wrote about "Re: nested migration issues":
> > Can anybody tell me whether kvm supports this kind of nested migration
> > or anybody is working on the relating field?
> 
> No, KVM does not support migration while nested VMX (or SVM) is enabled.
> We lack userspace interfaces to export/import the required state, and
> that mostly because we do not really know yet what state needs to be
> made available. Specifically nVMX is still under development. Migration
> support will officially follow once we feel we reached a mature state
> with nesting support.

Also check out https://bugzilla.kernel.org/show_bug.cgi?id=53851 for
some earlier ideas about the issue of migrating L1 complete with its
L2s.

> That said, if you want to play with adding such an interface yourself,
> no one will stop you. It may serve as a reference for us what is
> required, but you should not expect it to be merged soon.
> 
> BTW, not only migration is broken so far. Even reset does not work while
> VMX is enabled. For the same reason: userspace cannot load a VCPU state
> that resets all VMX states properly.
> 
> Jan
> 
> -- 
> Siemens AG, Corporate Technology, CT RTC ITP SES-DE
> Corporate Competence Center Embedded Linux

-- 
Nadav Har'El|   Tuesday, Aug 27 2013, 21 Elul 5773
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Committee: A group of people that keeps
http://nadav.harel.org.il   |minutes and wastes hours.


Re: [PATCH 10/11] KVM: nVMX: Synchronize VMCS12 content with the shadow vmcs

2013-03-10 Thread Nadav Har'El
Hi Abel, very nice patches.

On Sun, Mar 10, 2013, Abel Gordon wrote about "[PATCH 10/11] KVM: nVMX: 
Synchronize VMCS12 content with the shadow vmcs":
>   nested_vmx_vmexit(vcpu);
> + if (enable_shadow_vmcs)
> + copy_vmcs12_to_shadow(to_vmx(vcpu));

I was curious why your patch adds this call to copy_vmcs12_to_shadow after
every nested_vmx_vmexit (3 times), instead of making this call inside
nested_vmx_vmexit(), say right after prepare_vmcs12(). Until I saw:

>   nested_vmx_vmexit(vcpu);
>   vmcs12->vm_exit_reason = EXIT_REASON_EXTERNAL_INTERRUPT;
>   vmcs12->vm_exit_intr_info = 0;
> + if (enable_shadow_vmcs)
> + copy_vmcs12_to_shadow(to_vmx(vcpu));

where you need to copy this exit-reason override as well...
I wonder if there isn't a less ugly and repetitive way to do this :(
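
One less repetitive possibility - just a sketch with made-up names, I
haven't checked how well it fits the surrounding code - is to mark the
shadow copy stale at every site that modifies vmcs12, and do the actual
copy exactly once, on the common guest-entry path:

struct vcpu_sketch {
        int shadow_vmcs_stale;         /* vmcs12 modified since the last copy */
};

/* call at each site that currently calls copy_vmcs12_to_shadow() */
static void mark_vmcs12_dirty(struct vcpu_sketch *v)
{
        v->shadow_vmcs_stale = 1;
}

/* call once, on the common guest-entry path */
static void sync_shadow_vmcs_if_stale(struct vcpu_sketch *v,
                                      void (*copy_to_shadow)(struct vcpu_sketch *))
{
        if (v->shadow_vmcs_stale) {
                copy_to_shadow(v);
                v->shadow_vmcs_stale = 0;
        }
}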

-- 
Nadav Har'El|Monday, Mar 11 2013, 29 Adar 5773
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Fame: when your name is in everything but
http://nadav.harel.org.il   |the phone book.


Re: [PATCH] KVM: nVMX: Fix content of MSR_IA32_VMX_ENTRY/EXIT_CTLS

2013-03-04 Thread Nadav Har'El
On Sun, Mar 03, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Fix content of 
MSR_IA32_VMX_ENTRY/EXIT_CTLS":
>   /* Note that guest use of VM_EXIT_ACK_INTR_ON_EXIT is not supported. */
>  #ifdef CONFIG_X86_64
>   nested_vmx_exit_ctls_high = VM_EXIT_HOST_ADDR_SPACE_SIZE;
>  #else
>   nested_vmx_exit_ctls_high = 0;
>  #endif
> + nested_vmx_exit_ctls_high |= 0x36dff;

Can you please compose this 0x36dff out of constants? Is
VM_EXIT_HOST_ADDR_SPACE_SIZE one of them?

It's important to verify that we actually support all these bits - even
if we *should* support them, it doesn't mean we actually do (but if we
do, we should say we do).

> - nested_vmx_entry_ctls_low = 0;
> + /* If bit 55 of VMX_BASIC is off, bits 0-8 and 12 must be 1. */
> + nested_vmx_entry_ctls_low = 0x11ff;

Setting nested_vmx_entry_ctls_low = 0 just means that although the spec
says only the 1-setting is supported, we *also* support the 0-setting. I'm not
sure why this is a bad thing. Our VMX will be even better than the
real processors' ;-)


-- 
Nadav Har'El| Monday, Mar 4 2013, 22 Adar 5773
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |My opinions may have changed, but not the
http://nadav.harel.org.il   |fact that I am right.


Re: [PATCH] KVM: nVMX: Fix setting of CR0 and CR4 in guest mode

2013-03-04 Thread Nadav Har'El
On Mon, Mar 04, 2013, Jan Kiszka wrote about "Re: [PATCH] KVM: nVMX: Fix 
setting of CR0 and CR4 in guest mode":
> >>>>  if (is_guest_mode(vcpu)) {
> >>>> -/*
> >>>> - * We get here when L2 changed cr0 in a way that did 
> >>>> not change
> >>>> - * any of L1's shadowed bits (see 
> >>>> nested_vmx_exit_handled_cr),
> >>>> - * but did change L0 shadowed bits. This can currently 
> >>>> happen
> >>>> - * with the TS bit: L0 may want to leave TS on (for 
> >>>> lazy fpu
> >>>> - * loading) while pretending to allow the guest to 
> >>>> change it.
> >>>> - */
> >>> Can't say I understand this patch yet, but it looks like the comment is
> >>> still valid. Why have you removed it?
> >>
> >> L0 allows L1 or L2 at most to own TS, the rest is host-owned. I think
> >> the comment was always misleading.
> >>
> > I do not see how it is misleading. For everything but TS we will not get
> > here (if L1 is kvm). For TS we will get here if L1 allows L2 to change
> > it, but L0 does not.
> 
> For everything *but guest-owned* we will get here, thus for most CR0
> accesses (bit-wise, not regarding frequency).

For most CR0 bits, L1 (at least, a KVM one) will shadow (trap) them, so
we won't reach the code you modified at all... Instead,
nested_vmx_exit_handled_cr() would notice that a shadowed-by-L1 bit
was modified so an exit to L1 is required. We only get to that code
you changed if a bit was modified that L1 did *not* want to trap, but L0 did.
This is definitely not the bit-wise majority of the cases - unless you
have an L1 that does not trap most of the CR0 bits.

But I'm more worried about the actual code change :-) I didn't
understand if there's a situation where the existing code did something
wrong, or why it was wrong. Did you check the lazy-FPU-loading (TS bit)
aspect of your new code? To effectively check this, what I had to do
was run long runs of parallel "make" (make -j3) on all of L0, L1, and L2
concurrently. Even code which doesn't do floating-point calculations uses
the FPU sometimes for its wide registers, so all these processes, guests
and guest's guests, compete for the FPU, exercising very well this code
path. If the TS bit is handled wrongly, some of these make processes
will die, when one of the compilations dies of SIGSEGV (forgetting to
set the FPU registers leads to some uninitialized pointers being used),
so it's quite easy to exercise this.

-- 
Nadav Har'El| Monday, Mar 4 2013, 22 Adar 5773
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |"A witty saying proves nothing." --
http://nadav.harel.org.il   |Voltaire


Re: [PATCH] KVM: nVMX: Reset RFLAGS on VM-exit

2013-03-03 Thread Nadav Har'El
On Sun, Mar 03, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Reset RFLAGS 
on VM-exit":
> From: Jan Kiszka 
> 
> Ouch, how could this work so well that far? We need to clear RFLAGS to
> the reset value as specified by the SDM. Particularly, IF must be off
> after VM-exit!

nested_vmx_succeed() or nested_vmx_fail*() were already clearing some of
the fields that I understood were necessary to clear. But they did not clear
IF - I never realised (and haven't verified now) that this is part of the
spec. And since L1 KVM enters L2 with interrupts disabled anyway, nested KVM
would not see a difference.

> + vmx_set_rflags(vcpu, 0x02);

There's a macro X86_EFLAGS_BIT1 which you can use for this 0x02.


-- 
Nadav Har'El| Sunday, Mar 3 2013, 22 Adar 5773
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |"I don't use drugs, my dreams are
http://nadav.harel.org.il   |frightening enough." -- M. C. Escher


Re: [Bug 53611] New: nVMX: Add nested EPT

2013-02-26 Thread Nadav Har'El
On Thu, Feb 14, 2013, Nakajima, Jun wrote about "Re: [Bug 53611] New: nVMX: Add 
nested EPT":
> We have started looking at the pataches first. But I couldn't
> reproduce the results by simply applying the original patches to v3.6:
> - L2 Ubuntu 12.04 (64-bit)  (smp 2)
> - L1 Ubuntu 12.04 (64-bit) KVM (smp 2)
> - L0 Ubuntu 12.04 (64-bit)-based. kernel/KVM is v3.6 + patches (the
> ones in nept-v2.tgz).
> https://bugzilla.kernel.org/attachment.cgi?id=93101
> 
> Without the patches, the L2 guest works. With it, it hangs at boot
> time (just black screen):
> - EPT was detected by L1 KVM.
> - UP L2 didn't help.
> - Looks like it's looping at EPT_walk_add_generic at the same address in L0.
> 
> Will take a closer look. It would be helpful if the test configuration
> (e.g kernel/commit id used, L1/L2 guests) was documented as well.

I sent the patches on August 1st, and they applied to commit
ade38c311a0ad8c32e902fe1d0ae74d0d44bc71e from a week earlier.

In most of my tests, L1 and L2 were old images - L1 had Linux 2.6.33,
while L2 had Linux 2.6.28. In most of my tests both L1 and L2 were UP.

I've heard another report of my patch not working with newer L1/L2 -
the report said that L2 failed to boot (like you reported), and also
that L1 became unstable (running anything in it gave a memory fault).
So it is very likely that this code still has bugs - but since I already
know of errors and holes that need to be plugged (see the announcement file
together with the patches), it's not very surprising :( These patches
definitely need some lovin', but it's easier than starting from scratch.

Nadav.

-- 
Nadav Har'El|   Tuesday, Feb 26 2013, 16 Adar 5773
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |I think therefore I am. My computer
http://nadav.harel.org.il   |thinks for me, therefore I am not.


Re: [PATCH] KVM: nSVM/nVMX: Implement vmexit on INIT assertion

2013-02-25 Thread Nadav Har'El
On Sun, Feb 24, 2013, Jan Kiszka wrote about "[PATCH] KVM: nSVM/nVMX: Implement 
vmexit on INIT assertion":
> From: Jan Kiszka 
> 
> On Intel, raising INIT causing an unconditional vmexit. On AMD, this is
> controlled by the interception mask.

Hi,

I never tried to closely follow the KVM code paths related to this code,
but I do have one question: The VMX spec says:

   "The INIT signal is blocked whenever a logical processor is in VMX root
operation. It is not blocked in VMX non-root operation. Instead, INITs
cause VM exits (see Section 22.3, Other Causes of VM Exits)."

So when running a non-nested L1 guest, or an L2 guest, the new behavior
appears correct. However, it looks like if L1 is running in root
mode (i.e., did VMXON once but is not running L2 now), the INIT signal
needs to be blocked, rather than doing what it does now. It appears (but I'm
not sure...) that right now it causes the L1 guest to lock up, instead of
being ignored?
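
To spell out the rule I read in the spec quoted above (a toy sketch, not
actual KVM code - the helper and its inputs are made up):

#include <stdbool.h>

enum init_action { INIT_BLOCKED, INIT_VMEXIT, INIT_DELIVERED };

/* What the SDM text prescribes for an arriving INIT signal, as a
 * function of the vcpu's VMX state. */
static enum init_action init_signal_action(bool vmxon, bool in_guest_mode)
{
        if (in_guest_mode)
                return INIT_VMEXIT;    /* VMX non-root operation: VM exit */
        if (vmxon)
                return INIT_BLOCKED;   /* VMX root operation: INIT is blocked */
        return INIT_DELIVERED;         /* VMX off: normal INIT handling */
}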

Nadav.

-- 
Nadav Har'El|Monday, Feb 25 2013, 15 Adar 5773
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |If con is the opposite of pro, is
http://nadav.harel.org.il   |congress the opposite of progress?


Re: [PATCH] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state

2013-02-23 Thread Nadav Har'El
On Sat, Feb 23, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Replace 
kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state":
> - kvm_set_cr0(vcpu, vmcs12->host_cr0);
> + vmx_set_cr0(vcpu, vmcs12->host_cr0);

I don't remember now why I did this (and I'm not looking at the code),
but you'll need to test this really carefully, including
shadow-on-shadow mode (ept=0 in L0), to verify you're not missing any
important side effect of kvm_set_cr0.

Also, if I remember correctly, during nVMX's review, Avi Kivity asked
in several places that where I called vmx_set_cr0, I should instead call
kvm_set_cr0(), because it does some extra work and some extra
checks. Hmm, see this:
http://markmail.org/message/hhidqyhbo2mrgxxc

where Avi asked for the reverse of the patch you're attempting now.

-- 
Nadav Har'El|  Saturday, Feb 23 2013, 14 Adar 5773
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |It is better to be thought a fool, then
http://nadav.harel.org.il   |to open your mouth and remove all doubt.


Re: [PATCH] KVM: nVMX: Rework event injection and recovery

2013-02-21 Thread Nadav Har'El
On Thu, Feb 21, 2013, Jan Kiszka wrote about "Re: [PATCH] KVM: nVMX: Rework 
event injection and recovery":
> That generally does not help to inject/report an external IRQ to L1 as
> L1 runs with IRQs disabled around VMLAUNCH/RESUME.

Good point, I forgot that :(

So it looks like nested_run_pending was necessary, after all.

-- 
Nadav Har'El|  Thursday, Feb 21 2013, 11 Adar 5773
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |An egotist is a person of low taste, more
http://nadav.harel.org.il   |interested in himself than in me.


Re: [PATCH] KVM: nVMX: Rework event injection and recovery

2013-02-21 Thread Nadav Har'El
On Thu, Feb 21, 2013, Gleb Natapov wrote about "Re: [PATCH] KVM: nVMX: Rework 
event injection and recovery":
> will not be called while there is an event to reinject. 51cfe38ea5 still
> does not explain why nested_run_pending is needed. We cannot #vmexit
> without entering L2, but we can undo VMLAUNCH/VMRESUME emulation leaving
> rip pointing to the instruction. We can start by moving
> skip_emulated_instruction() from nested_vmx_run() to nested_vmx_vmexit().

This is a very interesting idea!
Don't forget to also call skip_emulated_instruction() in nested_vmx_entry_failure().

And please expand the comment at the end of nested_vmx_run(), saying that
skipping the instruction is also done on exit, unless the instruction needs
to be retried because we had to inject an interrupt into L1 before running it.

Whether this is actually clearer than the "nested_run_pending" approach
I don't know.



-- 
Nadav Har'El|  Thursday, Feb 21 2013, 11 Adar 5773
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Cat rule #2: Bite the hand that won't
http://nadav.harel.org.il   |feed you fast enough.


Re: [PATCH] KVM: nVMX: Rework event injection and recovery

2013-02-20 Thread Nadav Har'El
Hi,

By the way, if you haven't seen my description of why the current code
did what it did, take a look at
http://www.mail-archive.com/kvm@vger.kernel.org/msg54478.html
Another description might also come in handy:
http://www.mail-archive.com/kvm@vger.kernel.org/msg54476.html

On Wed, Feb 20, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Rework event 
injection and recovery":
> This aligns VMX more with SVM regarding event injection and recovery for
> nested guests. The changes allow to inject interrupts directly from L0
> to L2.
> 
> One difference to SVM is that we always transfer the pending event
> injection into the architectural state of the VCPU and then drop it from
> there if it turns out that we left L2 to enter L1.

Last time I checked, if I'm remembering correctly, the nested SVM code did
something a bit different: after exiting from L2 to L1 and unnecessarily
queuing the pending interrupt for injection, it skipped one entry into L1;
as usual, the interrupt queue is cleared after that entry, so the next time
around, when L1 is really entered, the wrong injection is not attempted.

> VMX and SVM are now identical in how they recover event injections from
> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> still contains a valid event and, if yes, transfer the content into L1's
> idt_vectoring_info_field.

> To avoid that we incorrectly leak an event into the architectural VCPU
> state that L1 wants to inject, we skip cancellation on nested run.

I didn't understand this last point.

> @@ -7403,9 +7375,32 @@ void prepare_vmcs12(struct kvm_vcpu *vcpu, struct 
> vmcs12 *vmcs12)
>   vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
>   vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>  
> - /* clear vm-entry fields which are to be cleared on exit */
> - if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
> + /* drop what we picked up for L0 via vmx_complete_interrupts */
> + vcpu->arch.nmi_injected = false;
> + kvm_clear_exception_queue(vcpu);
> + kvm_clear_interrupt_queue(vcpu);

It would be nice to move these lines out of prepare_vmcs12(), since they
don't really do anything with vmcs12, and move them into
nested_vmx_vmexit() (which is the function that called prepare_vmcs12()).

Did you test this both with PIN_BASED_EXT_INTR_MASK (the usual case) and
!PIN_BASED_EXT_INTR_MASK (the case which interests you)? We need to make
sure that in the former case, this doesn't clear the interrupt queue after
we put an interrupt to be injected in it (at first glance it seems fine,
but these code paths are so convoluted, it's hard to be sure).

> + if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) &&
> + vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK) {
> + /*
> +  * Preserve the event that was supposed to be injected
> +  * by emulating it would have been returned in
> +  * IDT_VECTORING_INFO_FIELD.
> +  */
> + if (vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) &
> + INTR_INFO_VALID_MASK) {
> + vmcs12->idt_vectoring_info_field =
> + vmcs12->vm_entry_intr_info_field;
> + vmcs12->idt_vectoring_error_code =
> + vmcs12->vm_entry_exception_error_code;
> + vmcs12->vm_exit_instruction_len =
> + vmcs12->vm_entry_instruction_len;
> + vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);

I'm afraid I'm missing what you are trying to do here. Why would
vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) & INTR_INFO_VALID_MASK ever be
true? After all, the processor clears it after each successful exit, so
the last if() will only succeed on failed entries - but this is NOT the
case if we're in the enclosing if (note that vmcs12->vm_exit_reason =
vmcs_read32(VM_EXIT_REASON)). Maybe I'm missing something?

Nadav.

-- 
Nadav Har'El| Wednesday, Feb 20 2013, 10 Adar 5773
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Change is inevitable, except from a
http://nadav.harel.org.il   |vending machine.


Re: [PATCH] KVM: nVMX: Fix injection of PENDING_INTERRUPT and NMI_WINDOW exits to L1

2013-02-16 Thread Nadav Har'El
On Sat, Feb 16, 2013, Jan Kiszka wrote about "Re: [PATCH] KVM: nVMX: Fix 
injection of PENDING_INTERRUPT and NMI_WINDOW exits to L1":
> No, this is wrong. I first wrote a patch that ignored enable_irq_window
> when the guest is not interested in external IRQs. But then I thought
> that wasn't correct. I tend to believe now my first idea was better.

If you want to make the !PIN_BASED_EXT_INTR_MASK case work correctly,
please also see:

https://bugzilla.kernel.org/show_bug.cgi?id=53711

Nadav.


-- 
Nadav Har'El|   Saturday, Feb 16 2013, 6 Adar 5773
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Always keep your words soft and sweet,
http://nadav.harel.org.il   |just in case you have to eat them.


Re: [PATCH v2] KVM: nVMX: Improve I/O exit handling

2013-02-14 Thread Nadav Har'El
On Thu, Feb 14, 2013, Gleb Natapov wrote about "Re: [PATCH v2] KVM: nVMX: 
Improve I/O exit handling":
> > >> Not sure how to map a failure on real HW behaviour. I guess it's best to
> > > Exit to L1 with nested_vmx_failValid() may be?
> > 
> > To my understanding, nested_vmx_failValid/Invalid are related to errors
> > directly related to vm instruction execution. This one is triggered by
> > the guest later on.
> > 
> You are right. We need some kind of error vmexit here, but nothing
> appropriate is defined by the spec, so lets just assume that exit to L1
> is needed if we cannot read permission bitmaps.

On real hardware, note that the MSR-bitmap address specified in the VMCS
is a physical address, so there can never be an error - if I understand
correctly, there is no such thing as a non-existent physical address -
reading from non-existent physical memory returns random garbage (please
correct me if I'm wrong here!). So if I'm correct, using a non-existent
address for an MSR-bitmap will give you undefined behavior - not an entry
error or a special type of exit.

The current code, which uses a random value from the stack, fits the
"undefined behavior" definition, but you're right that the more logical
behavior is, indeed, to always exit on the MSR access when the bitmap
cannot be read. This will make an unreadable bitmap equivalent to no
bitmap at all - which I think makes most sense.

You're also right that this case is identical in both MSR and I/O bitmap
cases, and should be fixed in both.
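
In other words (a toy sketch with a made-up helper - 'byte' stands for the
bitmap byte read from L1's memory, or NULL when that read failed):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when the L2 access must exit to L1. An unreadable bitmap
 * is treated exactly like having no bitmap: always exit. */
static bool nested_bitmap_says_exit(const uint8_t *byte, unsigned int bit)
{
        if (byte == NULL)
                return true;           /* bitmap unreadable: always exit */
        return (*byte >> bit) & 1;     /* otherwise honor L1's permission bit */
}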


-- 
Nadav Har'El|   Thursday, Feb 14 2013, 4 Adar 5773
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |A fanatic is one who can't change his
http://nadav.harel.org.il   |mind and won't change the subject.


Re: [Bug 53611] New: nVMX: Add nested EPT

2013-02-11 Thread Nadav Har'El
Hi,

On Mon, Feb 11, 2013, Jan Kiszka wrote about "Re: [Bug 53611] New: nVMX: Add 
nested EPT":
> On 2013-02-11 13:49, bugzilla-dae...@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=53611
> >Summary: nVMX: Add nested EPT

Yikes, I didn't realize that these bugzilla edits all get spammed to the
entire mailing list :( Sorry about those...

> I suppose they do not apply anymore as well. Do you have a recent tree
> around somewhere or plan to resume work on it?

Unfortunately, no - I have not had time to work on these patches since
August.

The reason I'm now stuffing these things into the bug tracker is that
at the end of this month I am leaving IBM for a new job, so I'm pretty
sure I won't have time myself to continue any work on nested VMX, and
would like the missing nested-VMX features to be documented in case
someone else comes along and wants to improve them. So unfortunately, you
should expect more of this bugzilla spam on the mailing list...

Nadav.

-- 
Nadav Har'El| Monday, Feb 11 2013, 1 Adar 5773
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |The message above is just this
http://nadav.harel.org.il   |signature's way of propagating itself.


Re: [PATCH] KVM: nVMX: Improve I/O exit handling

2013-02-11 Thread Nadav Har'El
On Sun, Feb 10, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Improve I/O 
exit handling":
> +static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu,
> +struct vmcs12 *vmcs12)
> +{
> + unsigned long exit_qualification;
> + gpa_t bitmap, last_bitmap;
> + bool string, rep;
> + u16 port;
> + int size;
> + u8 b;
> +
> + if (nested_cpu_has(get_vmcs12(vcpu), CPU_BASED_UNCOND_IO_EXITING))
> + return 1;

Instead of calling get_vmcs12(vcpu), you can just use "vmcs12" variable
which you already have. I see I left the same redundant call also in
nested_vmx_exit_handled_msr :(

> + if (port < 0x8000)
> + bitmap = vmcs12->io_bitmap_a;
> + else
> + bitmap = vmcs12->io_bitmap_b;
> + bitmap += port / 8;

In the port >= 0x8000 case, I believe you need to subtract 0x8000 from the
port number before using it as an offset into io_bitmap_b?
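
Something like this, I believe (a self-contained sketch with a made-up
helper name - the real logic would live in nested_vmx_exit_handled_io()):

#include <stdint.h>

/* io_bitmap_a covers ports 0x0000-0x7fff and io_bitmap_b covers
 * 0x8000-0xffff, so a high port must be rebased by 0x8000 before it is
 * turned into a byte offset and a bit index. */
static uint64_t io_bitmap_byte_gpa(uint64_t io_bitmap_a, uint64_t io_bitmap_b,
                                   uint16_t port, unsigned int *bit)
{
        uint64_t base   = (port < 0x8000) ? io_bitmap_a : io_bitmap_b;
        uint16_t offset = (port < 0x8000) ? port : (uint16_t)(port - 0x8000);

        *bit = offset % 8;             /* bit position within the byte */
        return base + offset / 8;      /* gpa of the byte holding the bit */
}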

Nadav.

-- 
Nadav Har'El| Monday, Feb 11 2013, 1 Adar 5773
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Attention: There will be a rain dance
http://nadav.harel.org.il   |Friday night, weather permitting.


Re: what does SVM_EXIT_VMRUN mean?

2012-09-19 Thread Nadav Har'El
On Wed, Sep 19, 2012, Steven wrote about "what does SVM_EXIT_VMRUN mean?":
> hi,
> I am wondering what the exit code SVM_EXIT_VMRUN means? Its name is
> not as clear as SVM_EXIT_READ_CR0.
> Thanks.

This exit happens when the guest ran the VMRUN instruction. In other
words, the guest is a hypervisor and running its own guest. This is
known as nested virtualization.

In an "ordinary" (non-hypervisor) guest, you'll never see this exit
reason.

-- 
Nadav Har'El| Thursday, Sep 20 2012, 4 Tishri 5773
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Sign in pool: "Welcome to our OOL. Notice
http://nadav.harel.org.il   |there is no P, please keep it that way."


Re: Guest mode question

2012-09-19 Thread Nadav Har'El
On Wed, Sep 19, 2012, Steven wrote about "Re: Guest mode question":
> thanks for your reply. My question is regarding the the normal
> single-level virtualization.
> Suppose that the physical machine has only one core and 2 VMs.
> Each VM has one vcpu, so there are two vcpu threads running with the
> hypervisor.
> Does your answer mean these two VMs can not be in the guest mode at
> the same time?

Like I said, there is no such thing (in non-nested context) as a VM
being in "guest mode". Rather, it is the CPU which is in guest mode.
When the CPU is at the moment running guest instructions (of any of the
guests) it is in guest mode. When it isn't, it's in so-called "root
mode". So your question becomes moot.

> For example, when the hypervisor schedules one VM, the
> running VM must have a vmexit call first and then the hypervisor can
> issue vmrun call to the other VM.

Indeed. Virtualization hardware doesn't magically create additional
hardware threads (cores/hyperthreads). The CPU can only run one thread
of execution, and in particular cannot run two VMs concurrently. To
switch between two VMs, the CPU needs to first exit from running the
first VM (go from guest mode back to root mode), then decide which
other VM to run, and VMLAUNCH/VMRUN it.

The first VM might exit because of "natural causes" (i.e., a sensitive
instruction), or it might be forced to exit because of, e.g., a physical
timer interrupt (normally, any physical interrupt causes a guest to exit,
so the host can handle it). After each exit, the hypervisor resumes
running and might decide (using Linux's normal scheduling policies) to
schedule a different process or VM instead of the one that just exited.

-- 
Nadav Har'El|Wednesday, Sep 19 2012, 3 Tishri 5773
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Evening news begins with 'Good evening',
http://nadav.harel.org.il   |and then proceeds to say why it isn't.


Re: Guest mode question

2012-09-15 Thread Nadav Har'El
On Sat, Sep 15, 2012, Steven wrote about "Guest mode question":
> Hi,
> I have a question about vm execution in the guest mode. Suppose I have
> two VMs on a machine.
> Is it possible that both VMs are in the guest mode? Like this
>  vm1 enters guest mode
>  vm2 enters guest mode
>  // at this point, two VMs are in guest mode
> 
> Or the hypervisor will first force one VM to vmexit before the second
> VM enters the guest mode?

Your description, of a *VM* being in guest mode, makes it sound like
you're describing *nested* virtualization, i.e., running another hypervisor
(like KVM) as the guest of the parent KVM. If this is what you actually
meant, then, yes, indeed, several VMs may be in guest mode because they
are each running a nested guest (aka an L2, a guest of a guest).

But if you didn't ask about nested virtualization, but rather normal,
single-level, virtualization, there's a different answer:

It is the CPU thread (one core, or even one hyperthread), *not* the VM,
which can be in root or guest mode. It is in root mode when the hypervisor
runs, and in guest mode when a specific guest (VM) runs. One CPU thread can
only run one stream of code at a time, so of course it cannot be running two
VMs at the same time - if it is running VM B now, it's because VM A is not
running now. Of course, if you have a multicore machine, both A and B might be
running, on two separate cores. In that case, both cores (not VMs) may
be in guest mode concurrently.

-- 
Nadav Har'El|Sunday, Sep 16 2012, 29 Elul 5772
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Does replacing myself with a shell-script
http://nadav.harel.org.il   |make me impressive or insignificant?


Re: Multi-dimensional Paging in Nested virtualization

2012-09-13 Thread Nadav Har'El
On Tue, Sep 11, 2012, siddhesh phadke wrote about "Multi-dimensional Paging in 
Nested virtualization":
> I read turtles project paper where they have explained  how
> multi-dimensional page tables are built on L0. L2 is launched with
> empty EPT 0->2 and EPT 0->2 is built on-the-fly.
> I tried to find out how this is done in kvm code but i could not find
> where EPT 0->2 is built.

Nested EPT is not yet included in mainline KVM. The original nested EPT
code that we wrote as part of the Turtles paper became obsolete when
much of KVM's MMU code was rewritten.

I have since rewritten the nested EPT code for the modern KVM. I sent
the second (latest) version of these patches to the KVM mailing list in
August, and you can find them in, for example,
http://comments.gmane.org/gmane.comp.emulators.kvm.devel/95395

These patches were not yet accepted into KVM. They have bugs in various
setups (which I have not yet found the time to fix, unfortunately),
and some known issues found by Avi Kivity on this mailing list.

> Does L1 handle ept violation first and then L0 updates its EPT0->2?
> How this is done?

This is explained in the turtles paper, but here's the short story:

L1 defines an EPT table for L2, which we call EPT12. From this, L0 builds
an EPT02, with L1 addresses changed to L0 addresses. Now, when L2 runs and
we get an EPT violation, we exit to L0 (in nested VMX, any exit first gets
to L0). L0 checks whether the translation is missing in EPT12 as well; if
it is, L0 emulates an exit into L1 - injecting the EPT violation into L1.
But if the translation wasn't missing in EPT12, then it's L0's problem,
and we just need to update EPT02.
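
In pseudocode, the decision looks like this (the helper names are made up;
the real logic is spread across KVM's nested MMU code):

#include <stdint.h>

typedef uint64_t gpa_t;

/* Stubs standing in for the real nested-MMU machinery: */
static int  ept12_has_translation(gpa_t l2_gpa)        { (void)l2_gpa; return 0; }
static void inject_ept_violation_into_l1(gpa_t l2_gpa) { (void)l2_gpa; }
static void update_ept02_from_ept12(gpa_t l2_gpa)      { (void)l2_gpa; }

/* L0's decision on an EPT violation taken while running L2: */
static void l0_handle_ept_violation(gpa_t l2_gpa)
{
        if (!ept12_has_translation(l2_gpa))
                inject_ept_violation_into_l1(l2_gpa); /* L1's fault: emulate exit to L1 */
        else
                update_ept02_from_ept12(l2_gpa);      /* L0's problem: extend EPT02 */
}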

> Can anybody give me some pointers about where to look into the code?

Please look at the patches above. Each patch is also documented.

Nadav.

-- 
Nadav Har'El|  Thursday, Sep 13 2012, 26 Elul 5772
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |error compiling committee.c: too many
http://nadav.harel.org.il   |arguments to function


Re: [PATCH 02/10] nEPT: Add EPT tables support to paging_tmpl.h

2012-08-02 Thread Nadav Har'El
On Thu, Aug 02, 2012, Xiao Guangrong wrote about "Re: [PATCH 02/10] nEPT: Add 
EPT tables support to paging_tmpl.h":
> > +   #ifdef CONFIG_X86_64
> > +   #define PT_MAX_FULL_LEVELS 4
> > +   #define CMPXCHG cmpxchg
> > +   #else
> > +   #define CMPXCHG cmpxchg64
> > +   #define PT_MAX_FULL_LEVELS 2
> > +   #endif
> 
> Missing the case of FULL_LEVELS == 3? Oh, you mentioned it
> as PAE case in the PATCH 0.

I understood this differently (and it would not be surprising if
wrongly...): With nested EPT, we only deal with two *EPT* tables -
the shadowed page table and the shadow page table are both EPT.
And EPT tables cannot have three levels - even if PAE is used. Or at least,
that's what I thought...

> Note A/D bits are supported on new intel cpus, this function should be 
> reworked
> for nept. I know you did not export this feather to guest, but we can reduce
> the difference between nept and other mmu models if A/D are supported.

I'm not sure what you meant: If the access/dirty bits are supported in
newer cpus, do you think we *should* also support them in the virtual
processor exposed to L1, or are you saying that it would be easier to support them
because this is what the shadow page table code normally does anyway,
so *not* supporting them will take effort?

> > +#if PTTYPE != PTTYPE_EPT
> >  static int FNAME(walk_addr_nested)(struct guest_walker *walker,
> >struct kvm_vcpu *vcpu, gva_t addr,
> >u32 access)
> > @@ -335,6 +395,7 @@ static int FNAME(walk_addr_nested)(struc
> > return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
> > addr, access);
> >  }
> > +#endif
> > 
> 
> Hmm, you do not need the special walking functions?

Since these functions are static, the compiler warns me about every
function that is never used, so I had to #if them out...


-- 
Nadav Har'El| Thursday, Aug 2 2012, 15 Av 5772
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |It's fortunate I have bad luck - without
http://nadav.harel.org.il   |it I would have no luck at all!


Re: Nested kvm_intel broken on pre 3.3 hosts

2012-08-01 Thread Nadav Har'El
On Wed, Aug 01, 2012, Avi Kivity wrote about "Re: Nested kvm_intel broken on 
pre 3.3 hosts":
> Right - it's not just kvm-as-a-guest that will trip on this.  But
> there's no point in everyone backporting it on their own.  If you're
> doing the backport, please post it here and we'll forward it to the
> stable branch.

If I understand correctly, the failure occurs because new versions of
KVM refuse to work if the processor doesn't support CPU_BASED_RDPMC_EXITING -
which older versions of nested VMX didn't claim to support.

But must the KVM guest refuse to work if this feature isn't supported?
I.e., why not move CPU_BASED_RDPMC_EXITING from "min" to "opt" in
setup_vmcs_config()? Isn't losing the PMU feature a lesser evil than
not working at all?  In any case, perhaps the original reporter can use
this as a workaround, at least, because it requires modifying the (L1)
guest, not the host.
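
Concretely, something like this (a sketch of the min/opt idea only - the
constants are the SDM bit values, and the real sets in setup_vmcs_config()
contain many more bits):

#include <stdint.h>

#define CPU_BASED_HLT_EXITING   0x00000080u    /* bit 7, per the SDM */
#define CPU_BASED_RDPMC_EXITING 0x00000800u    /* bit 11, per the SDM */

/* Required ('min') vs. optional ('opt') primary processor-based controls.
 * With RDPMC exiting merely optional, the guest kernel still loads on
 * hosts that do not advertise the bit - it just loses the PMU. */
static void pick_cpu_based_ctls(uint32_t *min, uint32_t *opt)
{
        *min = CPU_BASED_HLT_EXITING;          /* ...plus the other required bits */
        *opt = CPU_BASED_RDPMC_EXITING;        /* dropped if the host lacks it */
}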


-- 
Nadav Har'El|Wednesday, Aug 1 2012, 13 Av 5772
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |May you live as long as you want - and
http://nadav.harel.org.il   |never want as long as you live.


[PATCH 10/10] nEPT: Miscellaneous cleanups

2012-08-01 Thread Nadav Har'El
Some trivial code cleanups not really related to nested EPT.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2012-08-01 17:22:47.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2012-08-01 17:22:47.0 +0300
@@ -616,7 +616,6 @@ static void nested_release_page_clean(st
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
-static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
 static void vmx_set_segment(struct kvm_vcpu *vcpu,
struct kvm_segment *var, int seg);
@@ -895,8 +894,7 @@ static inline bool nested_cpu_has2(struc
(vmcs12->secondary_vm_exec_control & bit);
 }
 
-static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
-   struct kvm_vcpu *vcpu)
+static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12)
 {
return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
@@ -6135,7 +6133,7 @@ static int vmx_handle_exit(struct kvm_vc
 
if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
!(is_guest_mode(vcpu) && nested_cpu_has_virtual_nmis(
-   get_vmcs12(vcpu), vcpu)))) {
+   get_vmcs12(vcpu)))) {
if (vmx_interrupt_allowed(vcpu)) {
vmx->soft_vnmi_blocked = 0;
} else if (vmx->vnmi_blocked_time > 10LL &&



[PATCH 09/10] nEPT: Documentation

2012-08-01 Thread Nadav Har'El
Update the documentation to no longer say that nested EPT is not supported.

Signed-off-by: Nadav Har'El 
---
 Documentation/virtual/kvm/nested-vmx.txt |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- .before/Documentation/virtual/kvm/nested-vmx.txt  2012-08-01 17:22:47.0 +0300
+++ .after/Documentation/virtual/kvm/nested-vmx.txt   2012-08-01 17:22:47.0 +0300
@@ -38,8 +38,8 @@ The current code supports running Linux 
 Only 64-bit guest hypervisors are supported.
 
 Additional patches for running Windows under guest KVM, and Linux under
-guest VMware server, and support for nested EPT, are currently running in
-the lab, and will be sent as follow-on patchsets.
+guest VMware server, are currently running in the lab, and will be sent as
+follow-on patchsets.
 
 
 Running nested VMX



[PATCH 08/10] nEPT: Nested INVEPT

2012-08-01 Thread Nadav Har'El
If we let L1 use EPT, we should probably also support the INVEPT instruction.

In our current nested EPT implementation, when L1 changes its EPT table for
L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in the course
of this modification already calls INVEPT. Therefore, when L1 calls INVEPT,
we don't really need to do anything. In particular we *don't* need to call
the real INVEPT again. All we do in our INVEPT is verify the validity of the
call, and its parameters, and then do nothing.

In KVM Forum 2010, Dong et al. presented "Nested Virtualization Friendly KVM"
and classified our current nested EPT implementation as "shadow-like virtual
EPT". He recommended instead a different approach, which he called "VTLB-like
virtual EPT". If we had taken that alternative approach, INVEPT would have had
a bigger role: L0 would only rebuild the shadow EPT table when L1 calls INVEPT.

Signed-off-by: Nadav Har'El 
---
 arch/x86/include/asm/vmx.h |2 
 arch/x86/kvm/vmx.c |   87 +++
 2 files changed, 89 insertions(+)

--- .before/arch/x86/include/asm/vmx.h  2012-08-01 17:22:47.0 +0300
+++ .after/arch/x86/include/asm/vmx.h   2012-08-01 17:22:47.0 +0300
@@ -280,6 +280,7 @@ enum vmcs_field {
 #define EXIT_REASON_APIC_ACCESS 44
 #define EXIT_REASON_EPT_VIOLATION   48
 #define EXIT_REASON_EPT_MISCONFIG   49
+#define EXIT_REASON_INVEPT 50
 #define EXIT_REASON_WBINVD 54
 #define EXIT_REASON_XSETBV 55
 #define EXIT_REASON_INVPCID58
@@ -406,6 +407,7 @@ enum vmcs_field {
#define VMX_EPTP_WB_BIT        (1ull << 14)
 #define VMX_EPT_2MB_PAGE_BIT   (1ull << 16)
 #define VMX_EPT_1GB_PAGE_BIT   (1ull << 17)
+#define VMX_EPT_INVEPT_BIT (1ull << 20)
 #define VMX_EPT_AD_BIT (1ull << 21)
 #define VMX_EPT_EXTENT_INDIVIDUAL_BIT  (1ull << 24)
 #define VMX_EPT_EXTENT_CONTEXT_BIT (1ull << 25)
--- .before/arch/x86/kvm/vmx.c  2012-08-01 17:22:47.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2012-08-01 17:22:47.0 +0300
@@ -2026,6 +2026,10 @@ static __init void nested_vmx_setup_ctls
/* nested EPT: emulate EPT also to L1 */
nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+   nested_vmx_ept_caps |=
+   VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT |
+   VMX_EPT_EXTENT_CONTEXT_BIT |
+   VMX_EPT_EXTENT_INDIVIDUAL_BIT;
nested_vmx_ept_caps &= vmx_capability.ept;
} else
nested_vmx_ept_caps = 0;
@@ -5702,6 +5706,87 @@ static int handle_vmptrst(struct kvm_vcp
return 1;
 }
 
+/* Emulate the INVEPT instruction */
+static int handle_invept(struct kvm_vcpu *vcpu)
+{
+   u32 vmx_instruction_info;
+   unsigned long type;
+   gva_t gva;
+   struct x86_exception e;
+   struct {
+   u64 eptp, gpa;
+   } operand;
+
+   if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) ||
+   !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   /* According to the Intel VMX instruction reference, the memory
+* operand is read even if it isn't needed (e.g., for type==global)
+*/
+   vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+   if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+   vmx_instruction_info, &gva))
+   return 1;
+   if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
+   sizeof(operand), &e)) {
+   kvm_inject_page_fault(vcpu, &e);
+   return 1;
+   }
+
+   type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
+
+   switch (type) {
+   case VMX_EPT_EXTENT_GLOBAL:
+   if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_GLOBAL_BIT))
+   nested_vmx_failValid(vcpu,
+   VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+   else {
+   /*
+* Do nothing: when L1 changes EPT12, we already
+* update EPT02 (the shadow EPT table) and call INVEPT.
+* So when L1 calls INVEPT, there's nothing left to do.
+*/

[PATCH 07/10] nEPT: Advertise EPT to L1

2012-08-01 Thread Nadav Har'El
Advertise the support of EPT to the L1 guest, through the appropriate MSR.

This is the last patch of the basic Nested EPT feature, so as to allow
bisection through this patch series: The guest will not see EPT support until
this last patch, and will not attempt to use the half-applied feature.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |   13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2012-08-01 17:22:47.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2012-08-01 17:22:47.0 +0300
@@ -1946,6 +1946,7 @@ static u32 nested_vmx_secondary_ctls_low
 static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high;
 static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high;
 static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high;
+static u32 nested_vmx_ept_caps;
 static __init void nested_vmx_setup_ctls_msrs(void)
 {
/*
@@ -2021,6 +2022,14 @@ static __init void nested_vmx_setup_ctls
nested_vmx_secondary_ctls_low = 0;
nested_vmx_secondary_ctls_high &=
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+   if (enable_ept) {
+   /* nested EPT: emulate EPT also to L1 */
+   nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
+   nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+   nested_vmx_ept_caps &= vmx_capability.ept;
+   } else
+   nested_vmx_ept_caps = 0;
+
 }
 
 static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
@@ -2120,8 +2129,8 @@ static int vmx_get_vmx_msr(struct kvm_vc
nested_vmx_secondary_ctls_high);
break;
case MSR_IA32_VMX_EPT_VPID_CAP:
-   /* Currently, no nested ept or nested vpid */
-   *pdata = 0;
+   /* Currently, no nested vpid support */
+   *pdata = nested_vmx_ept_caps;
break;
default:
return 0;



[PATCH 06/10] nEPT: Some additional comments

2012-08-01 Thread Nadav Har'El
Some additional comments to preexisting code:
Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
Don't mention "shadow on either EPT or shadow" as the only two options.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |   21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2012-08-01 17:22:47.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2012-08-01 17:22:47.0 +0300
@@ -5952,7 +5952,20 @@ static bool nested_vmx_exit_handled(stru
return nested_cpu_has2(vmcs12,
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
case EXIT_REASON_EPT_VIOLATION:
+   /*
+* L0 always deals with the EPT violation. If nested EPT is
+* used, and the nested mmu code discovers that the address is
+* missing in the guest EPT table (EPT12), the EPT violation
+* will be injected with nested_ept_inject_page_fault()
+*/
+   return 0;
case EXIT_REASON_EPT_MISCONFIG:
+   /*
+* L2 never uses directly L1's EPT, but rather L0's own EPT
+* table (shadow on EPT) or a merged EPT table that L0 built
+* (EPT on EPT). So any problems with the structure of the
+* table is L0's fault.
+*/
return 0;
case EXIT_REASON_WBINVD:
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_WBINVD_EXITING);
@@ -6881,7 +6894,12 @@ static void prepare_vmcs02(struct kvm_vc
vmx_set_cr4(vcpu, vmcs12->guest_cr4);
vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12));
 
-   /* shadow page tables on either EPT or shadow page tables */
+   /*
+* Note that kvm_set_cr3() and kvm_mmu_reset_context() will do the
+* right thing, and set GUEST_CR3 and/or EPT_POINTER in all supported
+* settings: 1. shadow page tables on shadow page tables, 2. shadow
+* page tables on EPT, 3. EPT on EPT.
+*/
kvm_set_cr3(vcpu, vmcs12->guest_cr3);
kvm_mmu_reset_context(vcpu);
 
@@ -7220,7 +7238,6 @@ void load_vmcs12_host_state(struct kvm_v
 
if (nested_cpu_has_ept(vmcs12))
nested_ept_uninit_mmu_context(vcpu);
-   /* shadow page tables on either EPT or shadow page tables */
kvm_set_cr3(vcpu, vmcs12->host_cr3);
kvm_mmu_reset_context(vcpu);
 



[PATCH 05/10] nEPT: Fix wrong test in kvm_set_cr3

2012-08-01 Thread Nadav Har'El
kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
address. The problem is that with nested EPT, cr3 is an *L2* physical
address, not an L1 physical address as this test expects.

As the comment above this test explains, it isn't necessary, and doesn't
correspond to anything a real processor would do. So this patch removes it.

Note that this wrong test could have also theoretically caused problems
in nested NPT, not just in nested EPT. However, in practice, the problem
was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
circumventing the problem. Additional potential calls to the buggy function
are avoided in that we don't trap cr3 modifications when nested NPT is
enabled. However, because in nested VMX we did want to use kvm_set_cr3()
(as requested in Avi Kivity's review of the original nested VMX patches),
we can't avoid this problem and need to fix it.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/x86.c |   11 ---
 1 file changed, 11 deletions(-)

--- .before/arch/x86/kvm/x86.c  2012-08-01 17:22:47.0 +0300
+++ .after/arch/x86/kvm/x86.c   2012-08-01 17:22:47.0 +0300
@@ -659,17 +659,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, u
 */
}
 
-   /*
-* Does the new cr3 value map to physical memory? (Note, we
-* catch an invalid cr3 even in real-mode, because it would
-* cause trouble later on when we turn on paging anyway.)
-*
-* A real CPU would silently accept an invalid cr3 and would
-* attempt to use it - with largely undefined (and often hard
-* to debug) behavior on the guest side.
-*/
-   if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
-   return 1;
vcpu->arch.cr3 = cr3;
__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
vcpu->arch.mmu.new_cr3(vcpu);



[PATCH 04/10] nEPT: Fix cr3 handling in nested exit and entry

2012-08-01 Thread Nadav Har'El
The existing code for handling cr3 and related VMCS fields during nested
exit and entry wasn't correct in all cases:

If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
we forgot to do so. This patch adds this copy.

If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
whoever does control cr3 (L1 or L2) is using PAE, the processor might have
saved PDPTEs and we should also save them in vmcs12 (and restore later).

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |   30 ++
 1 file changed, 30 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2012-08-01 17:22:46.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2012-08-01 17:22:46.0 +0300
@@ -6885,6 +6885,17 @@ static void prepare_vmcs02(struct kvm_vc
kvm_set_cr3(vcpu, vmcs12->guest_cr3);
kvm_mmu_reset_context(vcpu);
 
+   /*
+* Additionally, except when L0 is using shadow page tables, L1 or
+* L2 control guest_cr3 for L2, so they may also have saved PDPTEs
+*/
+   if (enable_ept) {
+   vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
+   vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
+   vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
+   vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
+   }
+
kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
 }
@@ -7116,6 +7127,25 @@ void prepare_vmcs12(struct kvm_vcpu *vcp
vmcs12->guest_pending_dbg_exceptions =
vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
 
+   /*
+* In some cases (usually, nested EPT), L2 is allowed to change its
+* own CR3 without exiting. If it has changed it, we must keep it.
+* Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
+* by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
+*/
+   if (enable_ept)
+   vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
+   /*
+* Additionally, except when L0 is using shadow page tables, L1 or
+* L2 control guest_cr3 for L2, so save their PDPTEs
+*/
+   if (enable_ept) {
+   vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+   vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+   vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+   vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+   }
+
/* TODO: These cannot have changed unless we have MSR bitmaps and
 * the relevant bit asks not to trap the change */
vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);



[PATCH 03/10] nEPT: MMU context for nested EPT

2012-08-01 Thread Nadav Har'El
KVM's existing shadow MMU code already supports nested TDP. To use it, we
need to set up a new "MMU context" for nested EPT, and create a few callbacks
for it (nested_ept_*()). This context should also use the EPT versions of
the page table access functions (defined in the previous patch).
Then, we need to switch back and forth between this nested context and the
regular MMU context when switching between L1 and L2 (when L1 runs this L2
with EPT).

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/mmu.c |   38 +++
 arch/x86/kvm/mmu.h |1 
 arch/x86/kvm/vmx.c |   52 +++
 3 files changed, 91 insertions(+)

--- .before/arch/x86/kvm/mmu.h  2012-08-01 17:22:46.0 +0300
+++ .after/arch/x86/kvm/mmu.h   2012-08-01 17:22:46.0 +0300
@@ -52,6 +52,7 @@ int kvm_mmu_get_spte_hierarchy(struct kv
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
 int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
 int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
+int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
 
 static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
 {
--- .before/arch/x86/kvm/mmu.c  2012-08-01 17:22:46.0 +0300
+++ .after/arch/x86/kvm/mmu.c   2012-08-01 17:22:46.0 +0300
@@ -3616,6 +3616,44 @@ int kvm_init_shadow_mmu(struct kvm_vcpu 
 }
 EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu);
 
+int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
+{
+   ASSERT(vcpu);
+   ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
+
+   context->shadow_root_level = kvm_x86_ops->get_tdp_level();
+
+   context->nx = is_nx(vcpu); /* TODO: ? */
+   context->new_cr3 = paging_new_cr3;
+   context->page_fault = EPT_page_fault;
+   context->gva_to_gpa = EPT_gva_to_gpa;
+   context->sync_page = EPT_sync_page;
+   context->invlpg = EPT_invlpg;
+   context->update_pte = EPT_update_pte;
+   context->free = paging_free;
+   context->root_level = context->shadow_root_level;
+   context->root_hpa = INVALID_PAGE;
+   context->direct_map = false;
+
+   /* TODO: reset_rsvds_bits_mask() is not built for EPT, we need
+  something different.
+*/
+   reset_rsvds_bits_mask(vcpu, context);
+
+
+   /* TODO: I copied these from kvm_init_shadow_mmu, I don't know why
+  they are done, or why they write to vcpu->arch.mmu and not context
+*/
+   vcpu->arch.mmu.base_role.cr4_pae = !!is_pae(vcpu);
+   vcpu->arch.mmu.base_role.cr0_wp  = is_write_protection(vcpu);
+   vcpu->arch.mmu.base_role.smep_andnot_wp =
+   kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) &&
+   !is_write_protection(vcpu);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_init_shadow_EPT_mmu);
+
 static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
 {
int r = kvm_init_shadow_mmu(vcpu, vcpu->arch.walk_mmu);
--- .before/arch/x86/kvm/vmx.c  2012-08-01 17:22:46.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2012-08-01 17:22:46.0 +0300
@@ -901,6 +901,11 @@ static inline bool nested_cpu_has_virtua
return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
 
+static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT);
+}
+
 static inline bool is_exception(u32 intr_info)
 {
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -6591,6 +6596,46 @@ static void vmx_set_supported_cpuid(u32 
entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
+/* Callbacks for nested_ept_init_mmu_context: */
+
+static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
+{
+   /* return the page table to be shadowed - in our case, EPT12 */
+   return get_vmcs12(vcpu)->ept_pointer;
+}
+
+static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
+   struct x86_exception *fault)
+{
+   struct vmcs12 *vmcs12;
+   nested_vmx_vmexit(vcpu);
+   vmcs12 = get_vmcs12(vcpu);
+   /*
+* Note no need to set vmcs12->vm_exit_reason as it is already copied
+* from vmcs02 in nested_vmx_vmexit() above, i.e., EPT_VIOLATION.
+*/
+   vmcs12->exit_qualification = fault->error_code;
+   vmcs12->guest_physical_address = fault->address;
+}
+
+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+   int r = kvm_init_shadow_EPT_mmu(vcpu, &vcpu->arch.mmu);
+
+   vcpu->arch.mmu.set_cr3   = vmx_set_cr3;
+   vcpu->arch.mmu.get_cr3   = nested_ept_get_cr3;
+   vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
+
+   vcpu->arch.walk_mmu  = &vcpu->arch.nested_mmu;
+
+   return r;
+}
+
+static void nested_ept_uninit_

[PATCH 02/10] nEPT: Add EPT tables support to paging_tmpl.h

2012-08-01 Thread Nadav Har'El
This is the first patch in a series which adds nested EPT support to KVM's
nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
to set its own cr3 and take its own page faults without either of L0 or L1
getting involved. This often significantly improves L2's performance over the
previous two alternatives (shadow page tables over EPT, and shadow page
tables over shadow page tables).

This patch adds EPT support to paging_tmpl.h.

paging_tmpl.h contains the code for reading and writing page tables. The code
for 32-bit and 64-bit tables is very similar, but not identical, so
paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
with PTTYPE=64, and this generates the two sets of similar functions.

There are subtle but important differences between the format of EPT tables
and that of ordinary x86 64-bit page tables, so for nested EPT we need a
third set of functions to read the guest EPT table and to write the shadow
EPT table.

So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
with "EPT") which correctly read and write EPT tables.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/mmu.c |   14 +
 arch/x86/kvm/paging_tmpl.h |   98 ---
 2 files changed, 96 insertions(+), 16 deletions(-)

--- .before/arch/x86/kvm/mmu.c  2012-08-01 17:22:46.0 +0300
+++ .after/arch/x86/kvm/mmu.c   2012-08-01 17:22:46.0 +0300
@@ -1971,15 +1971,6 @@ static void shadow_walk_next(struct kvm_
return __shadow_walk_next(iterator, *iterator->sptep);
 }
 
-static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
-{
-   u64 spte;
-
-   spte = __pa(sp->spt)
-   | PT_PRESENT_MASK | PT_ACCESSED_MASK
-   | PT_WRITABLE_MASK | PT_USER_MASK;
-   mmu_spte_set(sptep, spte);
-}
 
 static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
   unsigned direct_access)
@@ -3427,6 +3418,11 @@ static bool sync_mmio_spte(u64 *sptep, g
return false;
 }
 
+#define PTTYPE_EPT 18 /* arbitrary */
+#define PTTYPE PTTYPE_EPT
+#include "paging_tmpl.h"
+#undef PTTYPE
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
--- .before/arch/x86/kvm/paging_tmpl.h  2012-08-01 17:22:46.0 +0300
+++ .after/arch/x86/kvm/paging_tmpl.h   2012-08-01 17:22:46.0 +0300
@@ -50,6 +50,22 @@
#define PT_LEVEL_BITS PT32_LEVEL_BITS
#define PT_MAX_FULL_LEVELS 2
#define CMPXCHG cmpxchg
+#elif PTTYPE == PTTYPE_EPT
+   #define pt_element_t u64
+   #define guest_walker guest_walkerEPT
+   #define FNAME(name) EPT_##name
+   #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+   #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+   #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
+   #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
+   #define PT_LEVEL_BITS PT64_LEVEL_BITS
+   #ifdef CONFIG_X86_64
+   #define PT_MAX_FULL_LEVELS 4
+   #define CMPXCHG cmpxchg
+   #else
+   #define CMPXCHG cmpxchg64
+   #define PT_MAX_FULL_LEVELS 2
+   #endif
 #else
#error Invalid PTTYPE value
 #endif
@@ -78,6 +94,7 @@ static gfn_t gpte_to_gfn_lvl(pt_element_
return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
 }
 
+#if PTTYPE != PTTYPE_EPT
 static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
   pt_element_t __user *ptep_user, unsigned index,
   pt_element_t orig_pte, pt_element_t new_pte)
@@ -100,15 +117,22 @@ static int FNAME(cmpxchg_gpte)(struct kv
 
return (ret != orig_pte);
 }
+#endif
 
 static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte,
   bool last)
 {
unsigned access;
 
+#if PTTYPE == PTTYPE_EPT
+   /* We rely here that ACC_WRITE_MASK==VMX_EPT_WRITABLE_MASK */
+   access = (gpte & VMX_EPT_WRITABLE_MASK) | ACC_USER_MASK |
+   ((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0);
+#else
access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
if (last && !is_dirty_gpte(gpte))
access &= ~ACC_WRITE_MASK;
+#endif
 
 #if PTTYPE == 64
if (vcpu->arch.mmu.nx)
@@ -135,6 +159,30 @@ static bool FNAME(is_last_gpte)(struct g
return false;
 }
 
+static inline int FNAME(is_present_gpte)(unsigned long pte)
+{
+#if PTTYPE == PTTYPE_EPT
+   return pte & (VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
+   VMX_EPT_EXECUTABLE_MASK);
+#else
+   return is_present_gpte(pte);
+#endif
+}
+
+static inline int FNAME(check_write_user_access)(struct kvm_vcpu *vcpu,
+  bool write_fau

[PATCH 01/10] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1

2012-08-01 Thread Nadav Har'El
Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
switches the EFER MSR when EPT is used and the host and guest have different
NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
and want to be able to run recent KVM as L1, we need to allow L1 to use this
EFER switching feature.

To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available,
and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
support for the former (the latter is still unsupported).

Nested entry and exit emulation (prepare_vmcs02 and load_vmcs12_host_state,
respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all
that's left to do in this patch is to properly advertise this feature to L1.

Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using
vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
support this feature, regardless of whether the host supports it.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |   18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2012-08-01 17:22:46.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2012-08-01 17:22:46.0 +0300
@@ -1976,6 +1976,7 @@ static __init void nested_vmx_setup_ctls
 #else
nested_vmx_exit_ctls_high = 0;
 #endif
+   nested_vmx_exit_ctls_high |= VM_EXIT_LOAD_IA32_EFER;
 
/* entry controls */
rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
@@ -1983,6 +1984,7 @@ static __init void nested_vmx_setup_ctls
nested_vmx_entry_ctls_low = 0;
nested_vmx_entry_ctls_high &=
VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
+   nested_vmx_entry_ctls_high |= VM_ENTRY_LOAD_IA32_EFER;
 
/* cpu-based controls */
rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
@@ -6768,10 +6770,18 @@ static void prepare_vmcs02(struct kvm_vc
vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 
-   /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
-   vmcs_write32(VM_EXIT_CONTROLS,
-   vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
-   vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
+   /* L2->L1 exit controls are emulated - the hardware exit is to L0 so
+* we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
+* bits are further modified by vmx_set_efer() below.
+*/
+   vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
+
+   /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
+* emulated by vmx_set_efer(), below.
+*/
+   vmcs_write32(VM_ENTRY_CONTROLS,
+   (vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
+   ~VM_ENTRY_IA32E_MODE) |
(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
 
if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)



[PATCH 0/10] nEPT v2: Nested EPT support for Nested VMX

2012-08-01 Thread Nadav Har'El
The following patches add nested EPT support to Nested VMX.

This is the second version of this patch set. Most of the issues from the
previous reviews were handled, and in particular there is now a new variant
of paging_tmpl for EPT page tables.

However, while this version does work in my tests, there are still some known
problems/bugs with this version and unhandled issues from the previous review:

 1. 32-bit *PAE* L2s currently don't work. Non-PAE 32-bit L2s do work
(and so do, of course, 64-bit L2s).

 2. nested_ept_inject_page_fault() assumes vm_exit_reason is already set
to EPT_VIOLATION. However, it is conceivable that L0 emulates some
L2 instruction, and during this emulation we read some L2 memory
causing a need to exit (from L2 to L1) with an EPT violation.

 3. Moreover, now nested_ept_inject_page_fault() always causes an
EPT_VIOLATION, with vmcs12->exit_qualification = fault->error_code.
This is wrong: first fault->error code is not in exit qualification
format but in PFERR_* format. Moreover, PFERR_RSVD_MASK would mean
we need to cause an EPT_MISCONFIG, NOT EPT_VIOLATION.
Instead of trying to fix this by translating PFERR to exit_qualification,
we should calculate and remember in walk_addr() the exit qualification
(and an additional bit: whether it's an EPT VIOLATION or
MISCONFIGURATION). This will be remembered in new fields in x86_exception.

Avi suggested: "[add to x86_exception] another bool, to distinguish
between EPT VIOLATION and EPT_QUALIFICATION. The error_code field should
be extended to 64 bits for EXIT_QUALIFICATION (though only bits 0-12 are
defined). You need another field for the guest linear address. 
EXIT_QUALIFICATION has to be calculated, it cannot be derived from the
original exit. Look at kvm_propagate_fault()."
He also added: "If we're injecting an EPT VIOLATION to L1 (because we
weren't able to resolve it; say L1 write-protected the page), then we
need to compute EXIT_QUALIFICATION.  Bits 3-5 of EXIT_QUALIFICATION are
computed from EPT12 paging structure entries (easy to derive them from
pt_access/pte_access)."

 4. Also, nested_ept_inject_page_fault() doesn't set guest linear address.
 
 5. There are several "TODO"s left in the code.

If there's any volunteer willing to help me with some of these issues,
it would be great :-)


About nested EPT:
-

Nested EPT means emulating EPT for an L1 guest, allowing it to use EPT when
running a nested guest L2. When L1 uses EPT, it allows the L2 guest to set
its own cr3 and take its own page faults without either of L0 or L1 getting
involved. In many workloads this significantly improves L2's performance over
the previous two alternatives (shadow page tables over EPT, and shadow page
tables over shadow page tables). Our paper [1] described these three options,
and the advantages of nested EPT ("multidimensional paging" in the paper).

Nested EPT is enabled by default (if the hardware supports EPT), so users do
not have to do anything special to enjoy the performance improvement that
this patch gives to L2 guests. L1 may of course choose not to use nested
EPT, by simply not using EPT (e.g., a KVM in L1 may use the "ept=0" option).

Just as a non-scientific, non-representative indication of the kind of
dramatic performance improvement you may see in workloads that have a lot of
context switches and page faults, here is a measurement of the time
an example single-threaded "make" took in L2 (kvm over kvm):

 shadow over shadow: 105 seconds
 ("ept=0" in L0 forces this)

 shadow over EPT: 87 seconds
 (the previous default; Can be forced with "ept=0" in L1)

 EPT over EPT: 29 seconds
 (the default after this patch)

Note that the same test on L1 (with EPT) took 25 seconds, so for this example
workload, performance of nested virtualization is now very close to that of
single-level virtualization.


[1] "The Turtles Project: Design and Implementation of Nested Virtualization",
http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf


Patch statistics:
-

 Documentation/virtual/kvm/nested-vmx.txt |4 
 arch/x86/include/asm/vmx.h   |2 
 arch/x86/kvm/mmu.c   |   52 +++-
 arch/x86/kvm/mmu.h   |1 
 arch/x86/kvm/paging_tmpl.h   |   98 -
 arch/x86/kvm/vmx.c   |  227 +++++++--
 arch/x86/kvm/x86.c   |   11 -
 7 files changed, 354 insertions(+), 41 deletions(-)

--
Nadav Har'El
IBM Haifa Research Lab



Re: Tracking nested guest ioctl in L0 hypervisor

2012-08-01 Thread Nadav Har'El
On Mon, Jul 30, 2012, siddhesh phadke wrote about "Tracking nested guest ioctl 
in L0 hypervisor":
> I am trying to understand KVM code for nested virtualization and my
> goal is to find whether an ioctl performed by L2 guest can be
> intercepted in L0.
> 
> Hence just for experimental purpose I wrote an blank ioctl in L2
> guest. When that ioctl is received by L1 KVM hypervisor ,it uses
> kvm_hypercall0() mentioned in kvm_para.h to notify L0. Am I doing this
> correct or is there any other method to do the same or I am completely
> off the track?
> 
> Can anyone please help me with this?

Do you really mean an *ioctl* in L2 - which is just a system call in
L2 (and never intercepted by L0 or L1), or a *hypercall*? From the
mention of kvm_hypercall0() it sounds like you mean a hypercall.

As you can see in vmx.c, nested_vmx_exit_handled(), when L0 receives
a VMCALL exit (i.e., a hypercall) from L2, we return 1 - meaning that
we exit to L1 so that it can handle this hypercall.

I believe that this is the more sensible behavior, but if you
want L0 to handle hypercalls, you can, in the EXIT_REASON_VMCALL case
in that function, return 0, which would cause L0 to handle this exit.
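
A minimal sketch of that change (illustrative - the relevant case in
nested_vmx_exit_handled() would become):

	case EXIT_REASON_VMCALL:
		/* was "return 1", i.e., exit to L1 so it handles its
		 * guest's hypercall; returning 0 instead makes L0
		 * handle the exit itself */
		return 0;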

-- 
Nadav Har'El|Wednesday, Aug 1 2012, 13 Av 5772
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Despite the cost of living, have you
http://nadav.harel.org.il   |noticed how it remains so popular?


Re: what is this function mean "is_no_device"?

2012-06-21 Thread Nadav Har'El
On Wed, Jun 20, 2012, sheng qiu wrote about "what is this function mean 
"is_no_device"?":
> Hi all,
> 
> does anyone can explain what "is_no_device()" and vmx_fpu_activate(vcpu) do?

This is a relatively complicated and delicate issue, so I suggest you
also refer to Intel's spec for more details, and dig deeper into the
relevant Linux code (not just in KVM). But I'll try to give you the gist
of it:

When an OS does a context switch (we're talking about an OS now, not yet
a hypervisor), it needs to save the old process's registers, and load saved
registers of the new process. The original 80386 had few enough registers
to make this switch short enough. But with the math coprocessor (the
80387 - aka the FPU, Floating Point Unit), there was a problem: It had a
bunch of registers, long (80 bits and more) registers, and switching them all
the time was a serious performance problem. This was even more of a
waste since most processes didn't actually use the FPU - so all this
switching was usually done for nothing.

So a trick was invented - "lazy FPU loading". The idea is that when
the OS switches to a different process, it does NOT load the new
process's saved FPU registers. If this process never uses the FPU, we
saved the cost of this load. But what if it *does* use the FPU? A new bit
was added to the CR0 register, the "TS" (task switch) bit. When TS is 1,
the processor "pretends" that there is no FPU (just like the original
8086 had no FPU), so every floating-point operation throws an #NM ("no
math") exception. The OS catches this exception, now finally loads the
current task's FPU registers, and zeros the TS bit. The floating-point
operation now restarts, and since the TS bit is off, it succeeds (and
uses the right content in the registers).
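
In rough pseudocode, the scheme looks like this (hypothetical names, only
to illustrate the description above - this is not the kernel's actual code):

	/* #NM ("no math") exception handler, for lazy FPU loading: */
	void handle_nm_exception(void)
	{
		clts();                  /* clear CR0.TS */
		load_fpu_regs(current);  /* now load this task's FPU state */
		/* on return, the faulting FP instruction restarts and,
		 * with TS now 0, succeeds with the right registers */
	}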

The is_no_device() function you noticed checks whether an exception is the
NM exception. "device not available" or "no device" are alternative
names of the #NM exception.

To understand vmx_fpu_activate() you need to understand now what KVM
does when both the host processes (after all, KVM is Linux) and
processes in the guest, use the FPU. To make a really long story short,
once any process in the guest uses the FPU, KVM uses "vmx_fpu_activate()"
to say that this guest now has full control of the TS bit, and the NM
exception. This will allow the guest OS to play its usual "lazy FPU loading"
tricks without the host needing to get involved - the host will only care
about the FPU when we later switch tasks from this guest to another guest
or Linux process. And vice versa - the guest might think it wants TS to
be 0 (because it had already set the FPU registers correctly for the current
task) but the host needs it to be 1 (because unknown to the guest, Linux
switched to a different host process and loaded its FPU registers).
All of these games are done using VMX's CR0 shadowing features (which you
can read about in the VMX spec).  
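
The CR0 shadowing mentioned above boils down to two VMCS fields; a
simplified sketch (real KVM computes the mask dynamically, and "guest_cr0"
here stands for the CR0 value the guest believes it set):

	/* bits set in the mask are owned by the host: guest writes to
	 * them trap to the host, and guest reads see the read shadow */
	vmcs_writel(CR0_GUEST_HOST_MASK, X86_CR0_TS);  /* host owns TS */
	vmcs_writel(CR0_READ_SHADOW, guest_cr0);       /* what guest sees */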

If you think all of this was complicated, just think what it takes to
do all of this correctly in *nested* virtualization (where the host,
guest, and guest's guest all want to delay loading the FPU registers) -
it took me about a month to get that working without bugs ;-)


-- 
Nadav Har'El|  Thursday, Jun 21 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |From the Linux getopt(3) manpage: "BUGS:
http://nadav.harel.org.il   |This manpage is confusing."


Re: KVM handling external interrupts

2012-06-07 Thread Nadav Har'El
>  - You discuss interrupt delivery without stating that you have MSIs in
>mind. Some aspects may be helpful for legacy interrupts as well, but
>you obviously can't achieve exit-less operation there. Not an issue,
>should just be made clear.

Can you elaborate on why exit-less operation cannot be achieved without
MSI? Doesn't the VMCS flag to avoid exiting on external interrupts
apply to any interrupts? Or something else won't work?

In any case, you're right that our implementation and tests all used
MSI.

> > need paravirtualize the guest: no if you have x2APIC.
>
> ...and the guest makes use of it. This excludes older OSes. When did
> Windows start to use it?

If you can't use x2APIC, and don't want to paravirtualize
the guest, you still get exit-less interrupt *delivery*, which as we
showed in the benchmarks, gets you more than half of the performance
improvement (although with newer KVM's improvement in EOI emulation
performance, the over-half improvement should be somewhat less pronounced).

Nadav.



Re: [PATCH v2] KVM: x86: Implement PCID/INVPCID for guests with EPT

2012-05-14 Thread Nadav Har'El
On Mon, May 14, 2012, Marcelo Tosatti wrote about "Re: [PATCH v2] KVM: x86: 
Implement PCID/INVPCID for guests with EPT":
> > +* Enable INVPCID for non-ept guests may cause performance regression,
> > +* and without INVPCID, PCID has little benefits. So disable them all
> > +* for non-ept guests.
> > +*
> > +* PCID is not supported for nested guests yet.
> > +*/
> > +   return enable_ept && (boot_cpu_data.x86_capability[4] & 
> > bit(X86_FEATURE_PCID)) && !cpu_has_hypervisor;
> > +}
> 
> The comment Avi made was regarding running a nested guest, not running
> _as_ a nested guest (which is what cpu_has_hypervisor is about).
> 
> You can disable INVPCID exec control (which #UDs), if its in Level-2
> guest mode (see if_guest_mode()), and restore the Level-1 value when
> leaving nested mode.


I also don't understand the !cpu_has_hypervisor thing. Regarding
running a nested guest - the code in prepare_vmcs02() sets (among other
things) secondary exec controls used to run the L2 guest. It mostly
"or"s the bits requested by KVM (L0) and the guest (L1). So:

 1. If you think L0 mustn't run L2 with a certain bit, despite running L1
with it (I didn't follow the reasoning why this is the case), you can
specifically turn it off in that function (like we already do for
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES).

 2. The other thing that could theoretically happen is that L1 will ask
to turn on this bit for L2 (I think this is the case you wanted to
avoid). This *won't* happen, because we tell L1 via MSR which bits
it is allowed to set on secondary controls (see
nested_vmx_secondary_ctls_high), and currently this only includes
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES and
SECONDARY_EXEC_ENABLE_EPT (the latter only for nested EPT), and enforce
that L1 doesn't try to turn on more bits. 
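
A rough sketch of both points (illustrative - "l0_secondary_exec_control"
is a made-up name, and this is not the exact prepare_vmcs02() code):

	/* point 2: L1 could only have set bits we advertised to it via
	 * the nested_vmx_secondary_ctls_high MSR, so or-ing is safe */
	exec_control = l0_secondary_exec_control |
		       vmcs12->secondary_vm_exec_control;
	/* point 1: explicitly clear a bit L2 must not run with */
	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);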

-- 
Nadav Har'El|   Tuesday, May 15 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |A bird in the hand is safer than one
http://nadav.harel.org.il   |overhead.


Re: Networking performance on a KVM Host (with no guests)

2012-04-22 Thread Nadav Har'El
On Sat, Apr 21, 2012, Chegu Vinod wrote about "Re: Networking performance on a 
KVM Host (with no guests)":
> some traffic through between the hosts... then we can't expect to get line 
> rate 
> on that private NIC. 

If you're not using device assignment, then just disable intel_iommu.

If you *are* using device assignment, try setting intel_iommu to something
other than on. Maybe intel_iommu=pt will work (see
http://lwn.net/Articles/329174/). But I actually never tried this myself -
so let me know if it works ;-)
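
For concreteness, the (untested, as said) suggestion amounts to booting
the host with

	intel_iommu=pt

on the kernel command line, in place of intel_iommu=on.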

Is anyone else aware of a different "best practice", on how to enable
VT-d for device assignment, without any changes (in performance or
security) in the host?

-- 
Nadav Har'El|Sunday, Apr 22 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Sign in zoo: Do not feed the animals. If
http://nadav.harel.org.il   |you have food give it to the guard on duty


Re: Networking performance on a KVM Host (with no guests)

2012-04-20 Thread Nadav Har'El
On Fri, Apr 20, 2012, Chegu Vinod wrote about "Re: Networking performance on a 
KVM Host (with no guests)":
> Removing the "intel_iommu=on" boot time parameter  in the Config 1
> case seemed to help

"intel_iommu=on" is essential with you're mostly running guests *and*
using device assignment.

However, unfortunately, it also has a serious performance penalty for
I/O in *host* processes:

When intel_iommu=on, Linux (completely unrelated to KVM) adds a new level
of protection which didn't exist without an IOMMU - the network card, which
without an IOMMU could write (via DMA) to any memory location, now is
not allowed - the card can only write to memory locations to which the OS
wants it to write. Theoretically, this can protect the OS against
various kinds of attacks. But what happens now is that every time
Linux passes a new buffer to the card, it needs to change the IOMMU
mappings. This noticeably slows down I/O, unfortunately.
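
To make the cost concrete, here is a sketch of a driver's transmit path
(the calls are the standard Linux DMA API; the snippet is illustrative,
not taken from any particular driver):

	/* every buffer handed to the card is mapped, and later unmapped;
	 * with intel_iommu=on each map/unmap also rewrites IOMMU page
	 * tables (and may flush the IOTLB) - the per-I/O overhead above */
	dma_addr_t dma = dma_map_single(dev, skb->data, skb->len,
					DMA_TO_DEVICE);
	/* ... the card DMAs from 'dma', and on TX completion: */
	dma_unmap_single(dev, dma, skb->len, DMA_TO_DEVICE);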

-- 
Nadav Har'El|Friday, Apr 20 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |A bird in the hand is safer than one
http://nadav.harel.org.il   |overhead.


Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump

2012-04-18 Thread Nadav Har'El
On Wed, Apr 18, 2012, Avi Kivity wrote about "Re: [PATCH 0/4] Export offsets of 
VMCS fields as note information for kdump":
> Right; they're also not required to be in memory at all - the processor
> can cache them, even for VMCSs that are not active at this time. 
> Running VMXOFF at panic time can fix that, but you have to broadcast it
> to all processors, probably using NMI...

I believe that a VMCLEAR ensures that the VMCS is written back to
memory. KVM uses this fact when migrating a VMCS between two separate
physical CPUs - it runs VMCLEAR on the old CPU, to write the VMCS to
memory, and then VMPTRLD on the new CPU.

So you don't need to VMXOFF, but do need to VMCLEAR. But there's still
the complication that you mention - you need to do the VMCLEAR on the right
processor...
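
Something like this sketch (illustrative, not the actual KVM code -
"do_vmclear" stands for a small wrapper around the VMCLEAR instruction):

	/* flush the VMCS to memory on the CPU it was last loaded on... */
	smp_call_function_single(old_cpu, do_vmclear, vmcs, 1);
	/* ...after which its in-memory image is valid; KVM then runs
	 * VMPTRLD on the new CPU to continue using it there */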


-- 
Nadav Har'El| Wednesday, Apr 18 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |The only "intuitive" interface is the
http://nadav.harel.org.il   |nipple. After that, it's all learned.


Re: Has any work 3.3 kvm-kmod for rhel 6.2 kernel successfully?

2012-04-16 Thread Nadav Har'El
On Mon, Apr 16, 2012, ya su wrote about "Has any work 3.3 kvm-kmod for rhel 6.2 
kernel successfully?":
> Hi,all:
> 
> I try 3.3 kvm-kmod to compile against redhat 2.6.32-220.7.1, after
> change some macros in external-module-compat-comm.h ,
> external-module-compat.h, and in some C files, finally I can compile
> and run qemu-kvm(0.12 with rhel release) with 3.3 module, everything
> looks fine except that screen-display can not flush correctly,  it
> looks like the display-card memory can not get updated in time when it
> changes.
> 
>Is there anyone that can give me some clues, thanks.

I don't have an answer to your specific question (about kmod on 2.6.32),
but perhaps you can provide more information - what guest were you using
(also Redhat Linux with 2.6.32? something different?)? Does the problem
also happen for different guests? And what do you mean by "screen display
cannot flush correctly"? Are you using an emulated serial console, emulated
VGA (on X? on VNC?), or what?

-- 
Nadav Har'El|Monday, Apr 16 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |If Barbie is so popular, why do you have
http://nadav.harel.org.il   |to buy her friends?


Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Nadav Har'El
On Wed, Apr 11, 2012, Guido Winkelmann wrote about "Re: Nested virtualization 
on Intel does not work - second level freezes when third level is starting":
> No, even 2-level nesting is broken. I can run Host->Guest, but not 
> Host->Guest->2nd Level Guest. I haven't even tried with a third virtualized 
> level.

I see. I guess I completely misunderstood what you reported. Sorry.

I think Orit was right. 3.3rc5 had a regression in the nested support,
which I discovered and Avi Kivity fixed; I didn't notice this before
now, but unfortunately the fix only got to 3.4rc1 and never made it into
3.3 (I just verified, it's not in 3.3.1 but it is in 3.4).
This bug displayed itself similarly to what you saw (L1 would hang when
running L2).

If you can run a later kernel, I hope the problem will be solved.

Otherwise, perhaps you can patch your kernel with the following patch
and try again?

--- .before/arch/x86/kvm/vmx.c  2012-03-19 18:34:24.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2012-03-19 18:34:24.0 +0200
@@ -2210,6 +2210,10 @@ static int vmx_set_msr(struct kvm_vcpu *
msr = find_msr_entry(vmx, msr_index);
if (msr) {
msr->data = data;
+   if (msr - vmx->guest_msrs < vmx->save_nmsrs)
+   kvm_set_shared_msr(msr->index, msr->data,
+   msr->mask);
break;
}



-- 
Nadav Har'El|  Thursday, Apr 12 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |If you tell the truth, you don't have to
http://nadav.harel.org.il   |remember anything.


Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Nadav Har'El
On Wed, Apr 11, 2012, Guido Winkelmann wrote about "Nested virtualization on 
Intel does not work - second level freezes when third level is starting":
> Nested virtualization on Intel does not work for me with qemu-kvm. As soon as 
> the third layer OS (second virtualised) is starting the Linux kernel, the 
> entire second layer freezes up. The last thing I can see console of the third 

Hi,

From your description, I understand that "ordinary" (2-level) nested
virtualization is working for you (host, guest and 2nd-level guest), and it's
the third nesting level (guest's guest's guest) which is broken?

This is the second report of this nature in a week (see the previous
report in https://bugzilla.kernel.org/show_bug.cgi?id=43068 - the
details there are different), so I guess I'll need to find the time
to give this issue some attention. L3 did work for me when the nested
VMX patches were included in KVM, so either something broke since, or
(perhaps more likely) your slightly different setups have features that
my setup didn't.

But in any case, like I explain in the aforementioned URL, even if L3 would
work, in the current implementation it would be extremely slow - perhaps to
the point of being unusable (I think you saw this with grub performance in L3).
So I wonder if you'd really want to use it, even if it worked... Just
curious, what were you thinking of doing with L3?

Nadav.


-- 
Nadav Har'El| Wednesday, Apr 11 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |If I were two-faced, would I be wearing
http://nadav.harel.org.il   |this one? Abraham Lincoln


Re: source for virt io backend driver

2012-04-08 Thread Nadav Har'El
On Sun, Apr 08, 2012, Steven wrote about "Re: source for virt io backend 
driver":
> > vhost-net. This you can find in drivers/vhost/net.c (and
> > drivers/vhost/vhost.c for the generic vhost infrastructure for virtio
> > drivers in the host kernel).
> 
> Yes, I am looking for the backend code in the kernel. I found the file
> for net backend drivers/vhost/net.c. However, the backend code for blk
> is not there.
> Could you point out this? Thanks.

Indeed, vhost-block is not (yet) part of the mainline kernel.

Maybe someone else more experienced in vhost-block can recommend where
to get the latest version.

Nadav.


-- 
Nadav Har'El| Monday, Apr 9 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Diplomat: A man who always remembers a
http://nadav.harel.org.il   |woman's birthday but never her age.


Re: source for virt io backend driver

2012-04-08 Thread Nadav Har'El
On Sat, Apr 07, 2012, Steven wrote about "source for virt io backend driver":
> Hi,
> I would like to know where I can find the source for backend driver of
> virtio device. For example, I know that the front-end net driver is
>...
> Can anyone help to point out where the source of net virtio backend in
> the host kernel?

If you use "normal" virtio, the backend is in qemu, i.e., host user
space, *not* in the host kernel. So you need to look for it in qemu,
not kvm.

If you want the backend to be in the kernel, then you probably mean
vhost-net. This you can find in drivers/vhost/net.c (and
drivers/vhost/vhost.c for the generic vhost infrastructure for virtio
drivers in the host kernel).

Of course, also with vhost-net, qemu is involved in setting up this
device. But qemu doesn't need to get involved in the data path (data,
interrupts, etc.) which is done completely in the kernel, and therefore
much more efficiently.

-- 
Nadav Har'El| Sunday, Apr 8 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Lottery: You need a dollar and a dream.
http://nadav.harel.org.il   |We'll take the dollar, you keep the dream.


Re: VT-X Locking Bit Flip in Real Mode?

2012-03-30 Thread Nadav Har'El
On Fri, Mar 30, 2012, Jake Thomas wrote about "VT-X Locking Bit Flip in Real 
Mode?":
>  The assumption here is that once the locking bit is in the
> "locked" position, you can't unlock it until a power cycle unlocks it.

This is indeed what the Intel spec says - unless I'm misunderstanding
something.

> But I can't help but wonder if you can flip the locking bit from
> the "locked" position to the "unlocked" position if you're still in
> real mode.

What makes you think that being in real mode makes a difference?
As far as I know, it doesn't. Any reason why you think it does?

-- 
Nadav Har'El|  Saturday, Mar 31 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Take my advice, I don't use it anyway.
http://nadav.harel.org.il   |


Re: PATCH: nVMX: Better MSR_IA32_FEATURE_CONTROL handling

2012-03-19 Thread Nadav Har'El
The existing code emulated the guest's use of the IA32_FEATURE_CONTROL MSR
in a way that was enough to run nested VMX guests, but did not fully
conform to the VMX specification, and in particular did not allow a guest
BIOS to prevent the guest OS from using VMX by setting the lock bit on this
MSR.

This patch emulates this MSR better, allowing the guest to lock it, and
verifying its setting on VMXON. Also make sure that this MSR (and of course,
VMXON state) is reset on guest vcpu reset (via SIPI).

Signed-off-by: Nadav Har'El 
Reported-by: Julian Stecklina 
---
 arch/x86/kvm/vmx.c |   43 +++
 arch/x86/kvm/x86.c |3 ++-
 2 files changed, 25 insertions(+), 21 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2012-03-19 18:34:24.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2012-03-19 18:34:24.0 +0200
@@ -352,6 +352,7 @@ struct nested_vmx {
 * we must keep them pinned while L2 runs.
 */
struct page *apic_access_page;
+   u64 msr_ia32_feature_control;
 };
 
 struct vcpu_vmx {
@@ -1998,9 +1999,6 @@ static int vmx_get_vmx_msr(struct kvm_vc
}
 
switch (msr_index) {
-   case MSR_IA32_FEATURE_CONTROL:
-   *pdata = 0;
-   break;
case MSR_IA32_VMX_BASIC:
/*
 * This MSR reports some information about VMX support. We
@@ -2072,21 +2070,6 @@ static int vmx_get_vmx_msr(struct kvm_vc
return 1;
 }
 
-static int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
-{
-   if (!nested_vmx_allowed(vcpu))
-   return 0;
-
-   if (msr_index == MSR_IA32_FEATURE_CONTROL)
-   /* TODO: the right thing. */
-   return 1;
-   /*
-* No need to treat VMX capability MSRs specially: If we don't handle
-* them, handle_wrmsr will #GP(0), which is correct (they are readonly)
-*/
-   return 0;
-}
-
 /*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.
@@ -2129,6 +2112,9 @@ static int vmx_get_msr(struct kvm_vcpu *
case MSR_IA32_SYSENTER_ESP:
data = vmcs_readl(GUEST_SYSENTER_ESP);
break;
+   case MSR_IA32_FEATURE_CONTROL:
+   data = to_vmx(vcpu)->nested.msr_ia32_feature_control;
+   break;
case MSR_TSC_AUX:
if (!to_vmx(vcpu)->rdtscp_enabled)
return 1;
@@ -2197,6 +2183,12 @@ static int vmx_set_msr(struct kvm_vcpu *
}
ret = kvm_set_msr_common(vcpu, msr_index, data);
break;
+   case MSR_IA32_FEATURE_CONTROL:
+   if (to_vmx(vcpu)->nested.msr_ia32_feature_control
+   & FEATURE_CONTROL_LOCKED)
+   return 1;
+   to_vmx(vcpu)->nested.msr_ia32_feature_control = data;
+   break;
case MSR_TSC_AUX:
if (!vmx->rdtscp_enabled)
return 1;
@@ -2205,8 +2197,6 @@ static int vmx_set_msr(struct kvm_vcpu *
return 1;
/* Otherwise falls through */
default:
-   if (vmx_set_vmx_msr(vcpu, msr_index, data))
-   break;
msr = find_msr_entry(vmx, msr_index);
if (msr) {
msr->data = data;
@@ -3807,6 +3797,8 @@ static int vmx_vcpu_setup(struct vcpu_vm
return 0;
 }
 
+static void free_nested(struct vcpu_vmx *vmx);
+
 static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -3920,6 +3912,9 @@ static int vmx_vcpu_reset(struct kvm_vcp
/* HACK: Don't enable emulation on guest boot/reset */
vmx->emulation_required = 0;
 
+   /* Reset nested-VMX settings: */
+   vmx->nested.msr_ia32_feature_control = 0;
+   free_nested(vmx);
 out:
return ret;
 }
@@ -5031,6 +5026,14 @@ static int handle_vmon(struct kvm_vcpu *
return 1;
}
 
+#define VMXON_NEEDED_FEATURES \
+ (FEATURE_CONTROL_LOCKED | FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX)
+   if ((vmx->nested.msr_ia32_feature_control & VMXON_NEEDED_FEATURES)
+   != VMXON_NEEDED_FEATURES) {
+   kvm_inject_gp(vcpu, 0);
+   return 1;
+   }
+
INIT_LIST_HEAD(&(vmx->nested.vmcs02_pool));
vmx->nested.vmcs02_num = 0;
 
--- .before/arch/x86/kvm/x86.c  2012-03-19 18:34:24.0 +0200
+++ .after/arch/x86/kvm/x86.c   2012-03-19 18:34:24.0 +0200
@@ -799,7 +799,8 @@ static u32 msrs_to_save[] = {
 #ifdef CONFIG_X86_64
MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
 #endif
-   MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA
+   MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
+   MSR_IA32_FEATURE_CONTROL
 };
 
 static unsigned num_m

Re: PATCH: nVMX: Better MSR_IA32_FEATURE_CONTROL handling

2012-03-19 Thread Nadav Har'El
Hi, in a minute I'll send a new version of the MSR_IA32_FEATURE_CONTROL
patch for nested VMX; I just wanted to reply first to your comments so
you'll know what to expect:

On Wed, Mar 07, 2012, Avi Kivity wrote about "Re: PATCH: nVMX: Better 
MSR_IA32_FEATURE_CONTROL handling":
> On 03/07/2012 05:58 PM, Nadav Har'El wrote:
> > +   u64 msr_ia32_feature_control;
> >  };
> 
> Need to add to the list of exported MSRs so it can be live migrated
> (msrs_to_save).

Did this.

> The variable itself should live in vcpu->arch, even if
> some bits are vendor specific.

But not this. I understand what you explained about vmx.c being for
Intel *hosts*, not about emulating Intel *guests*, but I do think that
since none of the bits in this MSR are relevant on AMD hosts (which
don't do nested VMX), it isn't useful to support this MSR outside vmx.c.

So I left this variable in vmx->nested. As I noted earlier, svm.c
did exactly the same thing (nested.vm_cr_msr), so at least there's
symmetry here.

> > @@ -1999,7 +2000,7 @@ static int vmx_get_vmx_msr(struct kvm_vc
> >  
> > switch (msr_index) {
> > case MSR_IA32_FEATURE_CONTROL:
> > -   *pdata = 0;
> > +   *pdata = to_vmx(vcpu)->nested.msr_ia32_feature_control;
> > break;
> 
> In a separate patch, please move this outside vmx_get_vmx_msr().  It's
> not a vmx msr.

Done, but not split into two patches: The patch removes the old case in
vmx_get_vmx_msr() (and also removes vmx_set_vmx_msr() entirely) and
instead adds the case in vmx_get_msr() and vmx_set_msr().

> > +#define VMXON_NEEDED_FEATURES \
> > + (FEATURE_CONTROL_LOCKED | FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX)
> 
> Use const u64 instead of #define please, it jars my eyes.

I would, if Linux coding style allowed declaring variables in the
middle of blocks. Unfortunately it doesn't, so I left this #define.
I don't think it's that bad.

-- 
Nadav Har'El|Monday, Mar 19 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |A conscience does not prevent sin. It
http://nadav.harel.org.il   |only prevents you from enjoying it.


Re: [PATCH] vhost: don't forget to schedule()

2012-03-18 Thread Nadav Har'El
On Sun, Mar 04, 2012, Michael S. Tsirkin wrote about "Re: [PATCH] vhost: don't 
forget to schedule()":
> On Sun, Mar 04, 2012 at 11:10:01AM +0200, Nadav Har'El wrote:
> > On Tue, Feb 28, 2012, Avi Kivity wrote about "Re: [PATCH] vhost: don't 
> > forget to schedule()":
> > > > > > +   if (need_resched())
> > > > > > +   schedule();
> > > > > This is cond_resched(), no?
>...
> > Hi. This discussion is already getting several orders of magnitude longer
> > than the patch :-)
>...
> 
> The argument is mostly about what's faster, right?
> You can push the argument along if you run some benchmarks to show the
> performance impact of the proposed variants.

Hi,

I did some tests using Netperf TCP-STREAM, message size 1K, over a
10Gbps network. Each of the numbers below are an average of several runs.

Test 1: "First, Do no harm" - verify that with just one guest, when these
extra reschedule attempts can't help, they at least don't hurt.

   * Unmodified vhost-net: 3135 Mbps.

   * My patch, with if (need_resched()) schedule(): 3088 Mbps.
     The 1.5% average performance decrease is very small, and close to the
     measurement variation (1.4%), so I'd say it's negligible.
   
   * Avi's proposed alternative, with cond_resched(), had, surprisingly, a
     very high variation in the results. Two measurements yielded an
     average of 3038 Mbps (close, but slightly lower than above), but
     6 other measurements yielded significantly lower numbers, between
     2300 and 2875.

   So it seems my patch does not hurt performance, and we don't need to
   find a "better" version for performance's sake.

   The cond_resched() version was worse - it had slightly worse maximum
   performance and higher variance, so I don't recommend it. That being
   said, I don't have a good explanation for why cond_resched() is bad.


Test 2: Demonstrate that the patch is actually important.
Here we run two or three guests doing the same Netperf to a different
server, with all 2 or 3 vhost threads' affinity set to the same core.

 * Behaviour of the unmodified vhost-net was terrible, and demonstrates
   what this bug fixes:
 -  For two guests, one yielded a throughput of 3172 Mbps, but the second
yielded only 0.5 Mbps (!!), because its vhost thread hardly ran at all.
For three guests, again, one guest yielded 3202 Mbps, but the second
and third each yielded less than 1Mbps.
 - "top" and similar utilities did not show the vhost threads at all,
   because CPU time is not accounted if you never schedule().

 * Performance with our patch (if (need_resched()) schedule()) fixes both
   problems:
 - For two guests, one yields 1557 Mbps, the second 1531 Mbps.
   Fair, and efficient (1557+1531=3088), as expected.
   For three guests, the numbers are 1032+1028+1045=3105.
  - "top" shows each vhost thread taking close to the expected half or
    third of the core.

 * With the alternative cond_resched(), again the problems are fixed,
   but performance is less consistent, varying by almost 20% from run to
    run. Again, I can't explain why.

All in all, I think this proves that the patch I sent, as is, is
important, and does no harm, and definitely not worse (and perhaps
better) than the proposed alternative implementation with cond_resched.
So please apply.

Thanks,
Nadav.

-- 
Nadav Har'El|Sunday, Mar 18 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |If glory comes after death, I'm not in a
http://nadav.harel.org.il   |hurry. (Latin proverb)


Re: PATCH: nVMX: Better MSR_IA32_FEATURE_CONTROL handling

2012-03-15 Thread Nadav Har'El
On Wed, Mar 07, 2012, Avi Kivity wrote about "Re: PATCH: nVMX: Better 
MSR_IA32_FEATURE_CONTROL handling":
> > struct page *apic_access_page;
> > +   u64 msr_ia32_feature_control;
> >  };
>...
> (msrs_to_save).  The variable itself should live in vcpu->arch, even if
> some bits are vendor specific.

Does this MSR exist in AMD? I was under the impression that it is an
Intel-only MSR, and that AMD has something different, the VM_CR MSR,
so it didn't make sense to put this in vcpu->arch. Is my impression
wrong?

It seems, by the way, that svm.c has vm_cr_msr in svm->nested, basically the
same as what I did, not in vcpu->arch. Why is this bad?

Also, it seems that VM_CR is not on the list of msrs_to_save either.
A bug?

> > @@ -1999,7 +2000,7 @@ static int vmx_get_vmx_msr(struct kvm_vc
> >  
> > switch (msr_index) {
> > case MSR_IA32_FEATURE_CONTROL:
> > -   *pdata = 0;
> > +   *pdata = to_vmx(vcpu)->nested.msr_ia32_feature_control;
> > break;
> 
> In a separate patch, please move this outside vmx_get_vmx_msr().  It's
> not a vmx msr.

I agree, I'll move it. But if it's a VMX-only MSR, I want to leave it
in vmx.c, and not move it to x86.c.

-- 
Nadav Har'El|  Thursday, Mar 15 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |If I am not for myself, who will be for
http://nadav.harel.org.il   |me? If I am only for myself, who am I?


PATCH: nVMX: Better MSR_IA32_FEATURE_CONTROL handling

2012-03-07 Thread Nadav Har'El
The existing code emulates the guest's use of the IA32_FEATURE_CONTROL MSR
in a way that was enough to run nested VMX guests, but did not fully
conform to the VMX specification, and in particular did not allow a guest
BIOS to prevent the guest OS from using VMX by setting the lock bit on this
MSR.

This patch emulates this MSR better, allowing the guest to lock it, and
verifying its setting on VMXON. Also make sure that this MSR (and of course,
VMXON state) is reset on guest vcpu reset (via SIPI).

Signed-off-by: Nadav Har'El 
Reported-by: Julian Stecklina 
---
 arch/x86/kvm/vmx.c |   24 +---
 1 file changed, 21 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2012-03-07 17:52:02.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2012-03-07 17:52:02.0 +0200
@@ -352,6 +352,7 @@ struct nested_vmx {
 * we must keep them pinned while L2 runs.
 */
struct page *apic_access_page;
+   u64 msr_ia32_feature_control;
 };
 
 struct vcpu_vmx {
@@ -1999,7 +2000,7 @@ static int vmx_get_vmx_msr(struct kvm_vc
 
switch (msr_index) {
case MSR_IA32_FEATURE_CONTROL:
-   *pdata = 0;
+   *pdata = to_vmx(vcpu)->nested.msr_ia32_feature_control;
break;
case MSR_IA32_VMX_BASIC:
/*
@@ -2077,9 +2078,13 @@ static int vmx_set_vmx_msr(struct kvm_vc
if (!nested_vmx_allowed(vcpu))
return 0;
 
-   if (msr_index == MSR_IA32_FEATURE_CONTROL)
-   /* TODO: the right thing. */
+   if (msr_index == MSR_IA32_FEATURE_CONTROL) {
+   if (to_vmx(vcpu)->nested.msr_ia32_feature_control
+   & FEATURE_CONTROL_LOCKED)
+   return 0;
+   to_vmx(vcpu)->nested.msr_ia32_feature_control = data;
return 1;
+   }
/*
 * No need to treat VMX capability MSRs specially: If we don't handle
 * them, handle_wrmsr will #GP(0), which is correct (they are readonly)
@@ -3807,6 +3812,8 @@ static int vmx_vcpu_setup(struct vcpu_vm
return 0;
 }
 
+static void free_nested(struct vcpu_vmx *vmx);
+
 static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -3920,6 +3927,9 @@ static int vmx_vcpu_reset(struct kvm_vcp
/* HACK: Don't enable emulation on guest boot/reset */
vmx->emulation_required = 0;
 
+   /* Reset nested-VMX settings: */
+   vmx->nested.msr_ia32_feature_control = 0;
+   free_nested(vmx);
 out:
return ret;
 }
@@ -5031,6 +5041,14 @@ static int handle_vmon(struct kvm_vcpu *
return 1;
}
 
+#define VMXON_NEEDED_FEATURES \
+ (FEATURE_CONTROL_LOCKED | FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX)
+   if ((vmx->nested.msr_ia32_feature_control & VMXON_NEEDED_FEATURES)
+   != VMXON_NEEDED_FEATURES) {
+   kvm_inject_gp(vcpu, 0);
+   return 1;
+   }
+
INIT_LIST_HEAD(&(vmx->nested.vmcs02_pool));
    vmx->nested.vmcs02_num = 0;
 
-- 
Nadav Har'El|  Wednesday, Mar 7 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |In Fortran, God is real unless declared
http://nadav.harel.org.il   |an integer.


Re: [PATCH] KVM: Enable VMX-related bits in MSR_IA32_FEATURE_CONTROL.

2012-03-07 Thread Nadav Har'El
On Wed, Mar 07, 2012, Avi Kivity wrote about "Re: [PATCH] KVM: Enable 
VMX-related bits in MSR_IA32_FEATURE_CONTROL.":
> On 03/06/2012 07:33 PM, Nadav Har'El wrote:
> > By the way, am I right in my understanding that KVM doesn't support
> > SMX in the guest? 
> 
> Isn't nested vmx crazy enough?

:-)

> btw, any updates on nested EPT? Nested vmx is pointless without it.

Isn't "pointless" a bit strong? :-)

I have done most of the changes that you asked for, but still need to finish
the EPT violation overhaul. I was meaning to work on this now, when I got
sidetracked by all these other small issues.

Hopefully, I'll have a version ready for submission by next week.

-- 
Nadav Har'El|  Wednesday, Mar 7 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |"[I'm] so full of action, my name should
http://nadav.harel.org.il   |be a verb" -- Big Daddy Kane ("Raw", 1987)


Re: [PATCH] KVM: Enable VMX-related bits in MSR_IA32_FEATURE_CONTROL.

2012-03-06 Thread Nadav Har'El
On Tue, Mar 06, 2012, Nadav Har'El wrote about "Re: [PATCH] KVM: Enable 
VMX-related bits in MSR_IA32_FEATURE_CONTROL.":
>  2. handle_vmon() does not check the previous setting of this MSR.
> If the guest (or its BIOS) doesn't set both FEATURE_CONTROL_LOCKED
> and FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX, it should get a

By the way, am I right in my understanding that KVM doesn't support
SMX in the guest? I didn't see mention of CR4.SMXE or a GETSEC exit
handler, which is why I'm assuming that it doesn't...
If this assumption isn't true, I'll also need to worry about the
..._INSIDE_SMX case.

-- 
Nadav Har'El|Tuesday, Mar 6 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |When you handle yourself, use your head;
http://nadav.harel.org.il   |when you handle others, use your heart.


Re: [PATCH] KVM: Enable VMX-related bits in MSR_IA32_FEATURE_CONTROL.

2012-03-06 Thread Nadav Har'El
On Tue, Mar 06, 2012, Avi Kivity wrote about "Re: [PATCH] KVM: Enable 
VMX-related bits in MSR_IA32_FEATURE_CONTROL.":
> > case MSR_IA32_FEATURE_CONTROL:
> > -   *pdata = 0;
> > +/*
> > + * If nested VMX is enabled, set the lock bit (bit 0)
> > + * and the "Enable VMX outside SMX" bit (bit 2) in the
> > + * FEATURE_CONTROL MSR.
> > + */
> > +   *pdata = nested_vmx_allowed(vcpu) ? 0x5 : 0;

0x5 can be written as FEATURE_CONTROL_LOCKED |
  FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX

> > break;
> > case MSR_IA32_VMX_BASIC:
> > /*
> 
> The way I read it, it should be done by the guest, not the host.

This is also how I understand it. Check out KVM's own hardware_enable()
to see how a guest does turn on these bits before using VMXON - it
doesn't need to rely on the BIOS to have done it earlier (the BIOS can,
however, prevent the guest from doing this by setting only the lock bit).

What is true, however, is that the existing code is probably incomplete
in three ways (see section 20.7 of the SDM):

 1. It always returns 0 for this MSR, even if the guest previously set it
to something else. 

 2. handle_vmon() does not check the previous setting of this MSR.
If the guest (or its BIOS) doesn't set both FEATURE_CONTROL_LOCKED
and FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX, it should get a
#GP on an attempt to VMXON. This will allow L1's BIOS to disable
nested VMX if it wishes (not that I think this is a very useful
usecase...).

 3. vmx_set_vmx_msr to MSR_IA32_FEATURE_CONTROL should throw a #GP if
FEATURE_CONTROL_LOCKED is on.
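
For point 2, the check could look roughly like the following sketch (the
vmx->nested.msr_ia32_feature_control field is hypothetical here - it would
cache the guest's writes to this MSR, which we don't store yet):

	/* In handle_vmon(), before emulating VMXON: */
	u64 needed = FEATURE_CONTROL_LOCKED |
		     FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX;

	if ((vmx->nested.msr_ia32_feature_control & needed) != needed) {
		kvm_inject_gp(vcpu, 0);
		return 1;
	}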

I'll create a patch to do this shortly.

-- 
Nadav Har'El|Tuesday, Mar 6 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Thousands of years ago, cats were
http://nadav.harel.org.il   |worshipped as gods. They never forgot.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] nVMX: Fix erroneous exception bitmap check

2012-03-06 Thread Nadav Har'El
The code which checks whether to inject a page fault to L1 or L2 (in
nested VMX) was wrong in how it checked the PF_VECTOR bit.
Thanks to Dan Carpenter for spotting this.

Signed-off-by: Nadav Har'El 
Reported-by: Dan Carpenter 
---
 arch/x86/kvm/vmx.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c  2012-03-06 16:31:25.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2012-03-06 16:31:25.0 +0200
@@ -1664,7 +1664,7 @@ static int nested_pf_handled(struct kvm_
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 
/* TODO: also check PFEC_MATCH/MASK, not just EB.PF. */
-   if (!(vmcs12->exception_bitmap & PF_VECTOR))
+   if (!(vmcs12->exception_bitmap & (1u << PF_VECTOR)))
return 0;
 
nested_vmx_vmexit(vcpu);
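
To see why the original test was wrong, here is a standalone illustration
(PF_VECTOR is 14 on x86):

	#include <stdio.h>

	#define PF_VECTOR 14	/* the x86 #PF exception vector */

	int main(void)
	{
		unsigned int bitmap = 1u << PF_VECTOR; /* only the #PF bit set */

		/* The buggy test masks with the value 14 (bits 1, 2, 3): */
		printf("buggy:   %d\n", !!(bitmap & PF_VECTOR));	 /* 0 */
		printf("correct: %d\n", !!(bitmap & (1u << PF_VECTOR))); /* 1 */
		return 0;
	}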

-- 
Nadav Har'El|Tuesday, Mar 6 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |This box was intentionally left blank.
http://nadav.harel.org.il   |
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: VMX: Fix delayed load of shared MSRs

2012-03-06 Thread Nadav Har'El
On Tue, Mar 06, 2012, Avi Kivity wrote about "[PATCH] KVM: VMX: Fix delayed 
load of shared MSRs":
> Shared MSRs (MSR_*STAR and related) are stored in both vmx->guest_msrs
> and in the CPU registers, but vmx_set_msr() only updated memory. Prior
> to 46199f33c2953, this didn't matter, since we called vmx_load_host_state(),
> which scheduled a vmx_save_host_state(), which re-synchronized the CPU
> state, but now we don't, so the CPU state will not be synchronized until
> the next exit to host userspace.  This mostly affects nested vmx workloads,
> which play with these MSRs a lot.

Thanks. I checked this patch, and indeed it solves the bug with nested
VMX.

-- 
Nadav Har'El|Tuesday, Mar 6 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |I think therefore I am. My computer
http://nadav.harel.org.il   |thinks for me, therefore I am not.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Heads-up: Nested VMX got broken by commit

2012-03-06 Thread Nadav Har'El
Hi,

I just noticed that Nested VMX got broken (at least in my tests) by commit
46199f33c29533e7ad2a7d2128dc30175d1d4157.

The specific change causing the problem was:

@@ -2220,7 +2216,6 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32
msr_index, u64 data)
break;
msr = find_msr_entry(vmx, msr_index);
if (msr) {
-   vmx_load_host_state(vmx);
msr->data = data;
break;
}

And if anyone wants a quick workaround to make nested VMX work again,
restoring this line fixes the problem.

I'm still trying to figure out why this line, which indeed seems unrelated
and unnecessary, is necessary for the correct functioning of nested VMX.
My (unsubstantiated) guess is that it isn't actually necessary
at this point - it's just that it does something that should have been done
more properly in another place, but I've yet to figure out exactly what.
I'll send a patch when I have this figured out. If anybody else has any
guess, I'd love to hear it.

Nadav.


-- 
Nadav Har'El|Tuesday, Mar 6 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Sign in pool: "Welcome to our OOL. Notice
http://nadav.harel.org.il   |there is no P, please keep it that way."
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vhost: don't forget to schedule()

2012-03-04 Thread Nadav Har'El
On Tue, Feb 28, 2012, Avi Kivity wrote about "Re: [PATCH] vhost: don't forget 
to schedule()":
> > > > +   if (need_resched())
> > > > +   schedule();
> > > This is cond_resched(), no?
> >
> > It indeed looks similar, but it appears there are some slightly
> > different things happening in both cases, especially for a preemptive
>...
> I'd have expected that cond_resched() is a no-op with preemptible
> kernels, but I see this is not the case.

Hi. This discussion is already getting several orders of magnitude longer
than the patch :-)

Would you like me to send a new one-line patch calling "cond_resched()"
instead of need_resched/schedule? Or anything else that I can do?

> > But I now see that in kvm_main.c, there's also this:
> >
> > if (!need_resched())
> > return;
> > cond_resched();
>...
> 
> It's bogus.  Look at commit 3fca03653010:
>...
> +   if (!need_resched())
> +   return;
> vcpu_put(vcpu);
> cond_resched();
> vcpu_load(vcpu);
> 
> at that time, it made sense to do the extra check to avoid the expensive
> vcpu_put/vcpu_load.  Later preempt notifiers made them redundant
> (15ad71460d75), and they were removed, but the extra check remained.

Do you want a patch to remove this extra check? Or will you just remove
it yourself?

Thanks,
Nadav.

-- 
Nadav Har'El| Sunday, Mar 4 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |As every cat owner knows, nobody owns a
http://nadav.harel.org.il   |cat.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vhost: don't forget to schedule()

2012-02-28 Thread Nadav Har'El
On Tue, Feb 28, 2012, Avi Kivity wrote about "Re: [PATCH] vhost: don't forget 
to schedule()":
> > +   if (need_resched())
> > +   schedule();
> 
> This is cond_resched(), no?

It indeed looks similar, but it appears there are some slightly
different things happening in both cases, especially for a preemptive
kernel... Unfortunately, I am not astute (or experienced) enough to tell 
which of the two idioms is better or more appropriate for this case.

The idiom that I used seemed right, and seemed to work in my tests.
Moreover, I also noticed it was used in vmx.c. Also, vhost.c was already
calling schedule(), not cond_resched(), so I thought it made sense to
call the same thing...

But I now see that in kvm_main.c, there's also this:

if (!need_resched())
return;
cond_resched();

Which seems to combine both idioms ;-) Can anybody shed some light on
what is the right way to do it?
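
For reference, here are the two idioms side by side (a sketch -
cond_resched() is shown by its observable behavior, not its exact
implementation):

	/* The idiom used in the vhost patch (and in vmx.c): */
	if (need_resched())
		schedule();

	/* The kernel helper, which likewise reschedules only when needed: */
	cond_resched();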

-- 
Nadav Har'El|   Tuesday, Feb 28 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |I am logged in, therefore I am.
http://nadav.harel.org.il   |
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] vhost: don't forget to schedule()

2012-02-27 Thread Nadav Har'El
This is a tiny, but important, patch to vhost.

Vhost's worker thread only called schedule() when it had no work to do, and
it wanted to go to sleep. But if there's always work to do, e.g., the guest
is running a network-intensive program like netperf with small message sizes,
schedule() was *never* called. This had several negative implications (on
non-preemptive kernels):

 1. CPU time was not properly accounted to the "vhost" process (ps and
top would wrongly show it using zero CPU time).

 2. Sometimes error messages about RCU timeouts would be printed, if the
core running the vhost thread didn't schedule() for a very long time.

 3. Worst of all, a vhost thread would "hog" the core. If several vhost
threads need to share the same core, typically one would get most of the
CPU time (and its associated guest most of the performance), while the
others hardly get any work done.

The trivial solution is to add

if (need_resched())
schedule();

after every piece of work. This will not do the heavy schedule() all
the time, just when the timer interrupt has decided a reschedule is warranted
(so need_resched() returns true).

Thanks to Abel Gordon for this patch.

Signed-off-by: Nadav Har'El 
---
 vhost.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c14c42b..ae66278 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -222,6 +222,8 @@ static int vhost_worker(void *data)
if (work) {
__set_current_state(TASK_RUNNING);
work->fn(work);
+   if (need_resched())
+   schedule();
} else
    schedule();
 
-- 
Nadav Har'El|Monday, Feb 27 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Ways to Relieve Stress #10: Make up a
http://nadav.harel.org.il   |language and ask people for directions.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/10] nEPT: Nested EPT support for Nested VMX

2011-12-12 Thread Nadav Har'El
On Sun, Nov 13, 2011, Avi Kivity wrote about "Re: [PATCH 0/10] nEPT: Nested EPT 
support for Nested VMX":
> > I also believed that the fault injection part was also correct: I
> > thought that the code already knows when to handle the fault in L2 (when
> > the address is missing in cr3), in L1 (when the translation is missing
> > in EPT12) or else, in L0.
> 
> It does, but it needs to propagate the fault code correctly.  The exit
> reason (ept violation vs ept misconfiguration) is meaningless, since we
> don't encode anything about it from ept12 into ept02.  In particular an
> ept violation could lead to
> 
> - no fault, ept02 updated, instruction retried
> - no fault, instruction emulated
> - L2 fault
> - ept violation, need to compute ept12 permissions for exit qualification
> - ept misconfiguration
> 
> (the second and third cases occur when it is impossible to create an
> ept02 mapping - when L0 emulates a gpa that L1 assigns to L2 via ept12).

I'm now trying to figure out this part, and I think I am beginning to
understand the mess you are referring to:

In nested_ept_inject_page_fault I now assume the exit reason is always EPT
VIOLATION and have

vmcs12->exit_qualification = fault->error_code;

But fault->error_code is not in the exit qualification format but in
the PFERR_* format, which has different meanings for the bits...
Moreover, PFERR_RSVD_MASK should cause an EPT MISCONFIG, not an EPT
VIOLATION. Is this what you meant above?
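
To spell out the two formats being confused (the PFERR_* values as in
mmu.h; the EPT exit-qualification bits per the SDM, with illustrative
names since we have no defines for them):

	/* Page-fault error code (PFERR_*): */
	#define PFERR_PRESENT_MASK	(1U << 0)
	#define PFERR_WRITE_MASK	(1U << 1)
	#define PFERR_USER_MASK		(1U << 2)
	#define PFERR_RSVD_MASK		(1U << 3)
	#define PFERR_FETCH_MASK	(1U << 4)

	/* EPT-violation exit qualification - same bit positions,
	 * entirely different meanings: */
	#define EPT_VIOL_READ		(1U << 0) /* data read caused it */
	#define EPT_VIOL_WRITE		(1U << 1) /* data write caused it */
	#define EPT_VIOL_FETCH		(1U << 2) /* instruction fetch */
	#define EPT_VIOL_GPA_READABLE	(1U << 3)
	#define EPT_VIOL_GPA_WRITABLE	(1U << 4)
	#define EPT_VIOL_GPA_EXECUTABLE	(1U << 5)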

I didn't quite understand what you meant in the 4th case about needing
to compute ept12 permissions. I'm assuming that if the EPT violation
was caused because L0 decreased permissions from what L1 thought, then L0
will solve the problem itself and not inject it to L1. So if we are injecting
the fault to L1, don't we already know the correct fault reason and don't
need to compute it?

There's another complication: when the fault comes from an EPT violation
in L2, handle_ept_violation() calls mmu.page_fault() with an error_code of
exit_qualification & 0x3. This means that the error_code in this case is
*not* in the expected PFERR_* format, and we need to know that in
nested_ept_inject_page_fault. Moreover, in the original EPT violation's
exit qualification, there were various other bits which we lose (and don't
have a direct parallel in PFERR_* anyway), so when we reinject the fault,
L1 doesn't get them.

What a mess :(

-- 
Nadav Har'El|Monday, Dec 12 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Hardware, n.: The parts of a computer
http://nadav.harel.org.il   |system that can be kicked.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/10] nEPT: Nested INVEPT

2011-12-11 Thread Nadav Har'El
On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 08/10] nEPT: Nested 
INVEPT":
> On 11/10/2011 12:01 PM, Nadav Har'El wrote:
> > If we let L1 use EPT, we should probably also support the INVEPT 
> > instruction.
>..
> > +   if (vmcs12 && nested_cpu_has_ept(vmcs12) &&
> > +   (vmcs12->ept_pointer == operand.eptp) &&
> > +   vmx->nested.last_eptp02)
> > +   ept_sync_context(vmx->nested.last_eptp02);
> > +   else
> > +   ept_sync_global();
> 
> Are either of these needed?  Won't a write to a shadowed EPT table cause
> them anyway?

This is a very good point... You're right that, as it stands, any changes
to the guest EPT table (EPT12) will cause changes to the shadow EPT
table (EPT02), and these already cause KVM to do an INVEPT, so there is no
point in doing this again when the guest asks.  So basically, I can have INVEPT
emulated by doing absolutely nothing (after performing all the checks), right?

I wonder if I am missing any reason why a hypervisor might want to do
INVEPT without changing the EPT12 table first.

-- 
Nadav Har'El|Sunday, Dec 11 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Why do programmers mix up Christmas and
http://nadav.harel.org.il   |Halloween? Because DEC 25 = OCT 31
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-12-08 Thread Nadav Har'El
On Mon, Nov 14, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU 
context for nested EPT":
> > >> +#if PTTYPE == EPT 
> > >> +real_gfn = mmu->translate_gpa(vcpu, 
> > >> gfn_to_gpa(table_gfn),
> > >> +  EPT_WRITABLE_MASK);
> > >> +#else
> > >>  real_gfn = mmu->translate_gpa(vcpu, 
> > >> gfn_to_gpa(table_gfn),
> > >>
> > >> PFERR_USER_MASK|PFERR_WRITE_MASK);
> > >> +#endif
> > >> +
> > > 
> > > Unneeded, I think.
> >
> > Is it because translate_nested_gpa always set USER_MASK ? 
> 
> Yes... maybe that function needs to do something like
> 
>access |= mmu->default_access;

Unless I'm misunderstanding something, translate_nested_gpa and
gva_to_gpa take as their "access" parameter a bitmask of PFERR_*,
so it's fine for PFERR_USER_MASK to be enabled in translate_nested_gpa;
it just shouldn't cause PT_USER_MASK to be used. The only additional
problem I can find is in walk_addr_generic, which does

	if (!check_write_user_access(vcpu, write_fault, user_fault,
				     pte))
		eperm = true;

and that checks pte & PT_USER_MASK, which it shouldn't if
PTTYPE==PTTYPE_EPT.

It's really confusing that we now have in mmu.c no less than 4 (!)
access bit schemes, similar in many ways but different in many others:

1. page fault error codes (PFERR_*_MASK)
2. x86 page table access bits (PT_*_MASK)
3. KVM private access bits (ACC_*_MASK)
4. EPT access bits (VMX_EPT_*_MASK).

I just have to try hard not to confuse them.
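
Taking "write" as an example, the same concept shows up in all four schemes
(values as in the respective headers, reproduced here only for illustration):

	#define PFERR_WRITE_MASK	(1U << 1)	  /* 1. fault error code */
	#define PT_WRITABLE_MASK	(1ULL << 1)	  /* 2. x86 page tables  */
	#define ACC_WRITE_MASK		PT_WRITABLE_MASK  /* 3. KVM private      */
	#define VMX_EPT_WRITABLE_MASK	(1ULL << 1)	  /* 4. EPT              */

All four happen to put "write" in bit 1, which is exactly what makes them so
easy to mix up; other bits (e.g., "user" vs. "executable") do not line up.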

-- 
Nadav Har'El|   Thursday, Dec 8 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Sorry, but my karma just ran over your
http://nadav.harel.org.il   |dogma.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-12-07 Thread Nadav Har'El
On Sun, Nov 13, 2011, Orit Wasserman wrote about "Re: [PATCH 02/10] nEPT: MMU 
context for nested EPT":
> +++ b/arch/x86/kvm/mmu.h
> @@ -48,6 +48,11 @@
>  #define PFERR_RSVD_MASK (1U << 3)
>  #define PFERR_FETCH_MASK (1U << 4)
>  
> +#define EPT_WRITABLE_MASK 2
> +#define EPT_EXEC_MASK 4

This is another example of the "unclean" movement of VMX-specific things into
x86 :( We already have VMX_EPT_WRITABLE_MASK and friends in vmx.h. I'll
need to think about which is less ugly: moving them to mmu.h, including vmx.h
in mmu.c, or perhaps even creating a new include file, ept.h. Avi, do you have
a preference?
The last thing I want to do is to repeat the same definitions in two places.

-- 
Nadav Har'El|  Wednesday, Dec 7 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |"A witty saying proves nothing." --
http://nadav.harel.org.il   |Voltaire
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-12-06 Thread Nadav Har'El
On Sun, Nov 13, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU 
context for nested EPT":
> On 11/13/2011 01:30 PM, Orit Wasserman wrote:
> > Maybe this patch can help, this is roughly what Avi wants (I hope) done 
> > very quickly.
> > I'm sorry I don't have setup to run nested VMX at the moment so i can't 
> > test it.
>...
> > +#define PTTYPE EPT
> > +#include "paging_tmpl.h"
> > +#undef PTTYPE
> 
> Yes, that's the key.

I'm now preparing a patch based on such ideas.

One downside of this approach is that mmu.c (and therefore the x86
module) will now include EPT-specific functions that are of no use or
relevance to the SVM code. It's not a terrible disaster, but it's
"unclean". I'll try to think if there's a cleaner way.

Nadav.

-- 
Nadav Har'El|Tuesday, Dec 6 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Writing software is like sex: One mistake
http://nadav.harel.org.il   |and you have to support it forever.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-11-23 Thread Nadav Har'El
On Wed, Nov 23, 2011, Nadav Har'El wrote about "Re: [PATCH 02/10] nEPT: MMU 
context for nested EPT":
> > +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
> > +{
> > +   int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
> > +
> > +   vcpu->arch.nested_mmu.gva_to_gpa = EPT_gva_to_gpa_nested;
> > +
> > +   return r;
> > +}
>..
> I didn't see you actually call this function anywhere - how is it
> supposed to work?
>..
> It seems we need a fifth case in that function.
>..

On second thought, why is this modifying nested_mmu.gva_to_gpa, and not
mmu.gva_to_gpa? Isn't the nested_mmu the L2 CR3, which is *not* in EPT
format, and what we really want to change is the outer mmu, which is
EPT12 and is indeed in EPT format?
Or am I missing something?

Thanks,
Nadav.

-- 
Nadav Har'El| Wednesday, Nov 23 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |My password is my dog's name. His name is
http://nadav.harel.org.il   |a#j!4@h, but I change it every month.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-11-23 Thread Nadav Har'El
On Sun, Nov 13, 2011, Orit Wasserman wrote about "Re: [PATCH 02/10] nEPT: MMU 
context for nested EPT":
> Maybe this patch can help, this is roughly what Avi wants (I hope) done very 
> quickly.
> I'm sorry I don't have setup to run nested VMX at the moment so i can't test 
> it.

Hi Orit, thanks for the code - I'm now working on incorporating
something based on this into my patch. However, I do have a question:

> +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
> +{
> + int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
> +
> + vcpu->arch.nested_mmu.gva_to_gpa = EPT_gva_to_gpa_nested;
> +
> + return r;
> +}

I didn't see you actually call this function anywhere - how is it
supposed to work?

The way I understand the current code, kvm_mmu_reset_context() calls
init_kvm_mmu() which (in our case) calls init_kvm_nested_mmu().
I think the above gva_to_gpa setting should be there - right?
It seems we need a fifth case in that function.
But at that point in mmu.c, how will I be able to check if this is the
nested EPT case? Do you have any suggestion?

Thanks,
Nadav.

-- 
Nadav Har'El| Wednesday, Nov 23 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |This message contains 100% recycled
http://nadav.harel.org.il   |characters.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Allow aligned byte and word writes to IOAPIC registers.

2011-11-23 Thread Nadav Har'El
On Wed, Nov 23, 2011, Avi Kivity wrote about "Re: [PATCH] Allow aligned byte 
and word writes to IOAPIC registers.":
> On 11/22/2011 06:09 PM, Julian Stecklina wrote:
> > This fixes byte accesses to IOAPIC_REG_SELECT as mandated by at least the
> > ICH10 and Intel Series 5 chipset specs. It also makes ioapic_mmio_write
> > consistent with ioapic_mmio_read, which also allows byte and word accesses.
> >
> 
> Your patch indents with spaces, while Linux uses tabs for indents.

You might want to run scripts/checkpatch.pl on your patches before
sending them - this will catch this and many other stylistic faux pas.

-- 
Nadav Har'El| Wednesday, Nov 23 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Why do doctors call what they do
http://nadav.harel.org.il   |practice? Think about it.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: nested virtualization on Intel and needed cpu flags to pass

2011-11-23 Thread Nadav Har'El
On Tue, Nov 22, 2011, Gianluca Cecchi wrote about "nested virtualization on 
Intel and needed cpu flags to pass":
> I'm going to test nested virtualization on Intel with Fedora 16 host.
>...
> [root at f16 ~]# uname -r
> 3.1.1-2.fc16.x86_64

This Linux version indeed has nested VMX, while

> # uname -r
> 2.6.40.6-0.fc15.x86_64

doesn't.

> Based on docs, the "nested" option is disabled by default on Intel

Indeed. You need to load the kvm_intel module with the "nested=1"
option. You also need to tell qemu that the emulated CPU will advertise
VMX - the simplest way to do this is with the "-cpu host" option to qemu.
It seems you did all of this correctly.

> Preliminary results are not so good.
> I created an F16 guest (f16vm), with default options
>..
> Inside the guest I create a windows xp with default values proposed by
>..
> Until now not able to complete installation

Unfortunately, this is a known bug - which I promised to work on, but
haven't yet got around to :(
nested-vmx.txt explicitly lists under "known limitations" that: "The
current code supports running Linux guests under KVM guests."

> more tests to come with newer kernel 3.1.1-2.fc16.x86_64
> But before proceeding, probably I need to adjust particular features
> to enable/disable about
> cpu to pass to f16vm guest...

I don't think this is the problem. The underlying problem is that the
VMX spec is very complex, and ideally nested VMX should correctly emulate
each and every bit and each and every corner of it. Because all our
testing was done with KVM L1s and Linux L2s, all the paths that get
exercised in that case were tested and their bugs ironed out - but it is
possible that Windows L2s end up exercising slightly different code
paths that still have bugs, and those need to be fixed.

Unfortunately, I doubt a newer kernel will solve your problem. I think
there are genuine bugs that will have to be fixed.

> What are current advises about cpu flags to pass to optimally use
> nested virtualization with intel at this time?

I don't think there are any such guidelines. The only thing you really
need is "-cpu qemu64,+vmx" (replace qemu64 with whatever you want) to
advertise the existence of VMX.

-- 
Nadav Har'El| Wednesday, Nov 23 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |When you lose, don't lose the lesson.
http://nadav.harel.org.il   |
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM VMExit

2011-11-14 Thread Nadav Har'El
On Mon, Nov 14, 2011, Xin Tong wrote about "Re: KVM VMExit":
> x86 has a Undefined Instruction, its opcode is 0F 0B and it generates
> an invalid opcode exception. This instruction is provided for software
> testing to explicitly generate an invalid opcode exception. The opcode
> for this instruction is reserved for this purpose.

Seeing that your recent questions on this list aren't really about KVM
development, but rather about VMX (apparently) basics, I suggest you
also grab yourself a copy of the Intel Software Manual - see volume 3
of:
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
which answers all of your recent questions.

About this question - as far as I know there is no way to specifically
cause an exit only on the UD2 instruction. What you can do, however, is
cause an exit on any #UD exception, including the one generated by UD2.
The VMCS has an "exception bitmap" which defines which exceptions cause
an exit, and you can turn on the #UD bit to ask for this exit.
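
In KVM's vmx.c terms this amounts to something like (a sketch; UD_VECTOR
is 6):

	u32 eb = vmcs_read32(EXCEPTION_BITMAP);

	eb |= 1u << UD_VECTOR;	/* vmexit on every #UD, including UD2 */
	vmcs_write32(EXCEPTION_BITMAP, eb);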


-- 
Nadav Har'El|Monday, Nov 14 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Willpower: The ability to eat only one
http://nadav.harel.org.il   |salted peanut.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Invalidate TLB

2011-11-13 Thread Nadav Har'El
Hi,

On Sun, Nov 13, 2011, Xin Tong wrote about "Invalidate TLB":
> I have 2 questions regard TLB and PageTable in KVM.
> 
> 1. I do not really get how the TLB gets saved and restored on when the
> guest os vmexits and vmresumes. Or maybe the TLB is not saved at all (
> TLB flushes when vmexit and vmenter happen).

The TLB isn't saved and restored... Because the TLB translations of the
host aren't relevant in the guest (and vice versa), in the first
generation of VMX (Intel's hardware virtualization), it was necessary to
flush the TLB on each VMX transition (vmexit and vmenter).

In the second generation, VMX added TLB tags known as "VPID" (virtual
processor id). The hypervisor allocates a unique id for each guest, and
itself uses id 0, and translations cached in the TLB are tagged with the
VPID value. Therefore, when VPID is being used, you no longer need to
flush the TLB on every entry and exit.
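
Conceptually (an illustration only, not how the hardware actually lays
it out):

	/* Conceptual model of a VPID-tagged TLB entry: */
	struct tlb_entry {
		u16 vpid;	/* 0 for the hypervisor, a unique id per guest */
		u64 virt_page;
		u64 phys_frame;
	};

A lookup only hits when the entry's vpid matches the current VPID, so host
and guest translations can coexist in the TLB and neither needs to be
flushed on a transition.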

> 2. Say i have multiple vCPUs and each of them running some guest OS
> processes. suddenly one of the vCPU vmexits due to a trapping
> instruction. This trapping instruction modifies the page table of a
> currently running process ( running on one of the other vCPUs - vCPU X
> ). how does the vCPU X gets notified ( i.e. TLB invalidated).  some
> kind of apic needs to be sent, right ?

I didn't quite follow your example, but there is indeed a remote tlb
flush IPI.

-- 
Nadav Har'El|Sunday, Nov 13 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |I would give my right arm to be
http://nadav.harel.org.il   |ambidextrous.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/10] nEPT: Nested EPT support for Nested VMX

2011-11-13 Thread Nadav Har'El
Hi,

On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 0/10] nEPT: Nested EPT 
support for Nested VMX":
> This patchset is missing a fairly hairy patch that makes reading L2
> virtual addresses work.

This was supposed to be part of the nested TDP code that is already in
place. To read an L2 virtual address, the code is supposed, if I
understand correctly, to walk the "walk" mmu (EPT01 and guest_cr3)
and then use the EPT table - just like the normal EPT case which uses
the EPT table and the guest_cr3.

I even believed that this inner "walk mmu" would work fine without any
rewrite needed for ia32/ept differences, because it works (or so I believed)
just like normal EPT, with the first table being an EPT table, and the second
table being a normal page table.

I also believed that the fault injection part was also correct: I
thought that the code already knows when to handle the fault in L2 (when
the address is missing in cr3), in L1 (when the translation is missing
in EPT12) or else, in L0.

So what is the "hairy" missing part?

> The standard example is L1 passing a bit of
> hardware (emulated in L0) to a L2; when L2 accesses it, the instruction
> will fault and need to be handled in L0, transparently to L1.  The
> emulation can cause a fault to be injected to L2, or and EPT violation
> or misconfiguration injected to L1.

I don't understand the example. You are referring to nested device
assignment from L1 to L2 (so L1 stops caring about the device)? Since we
don't emulate an IOMMU for L1, how can that be done?

Thanks,
Nadav.

-- 
Nadav Har'El|Sunday, Nov 13 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |An error? Impossible! My modem is error
http://nadav.harel.org.il   |correcting.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-11-12 Thread Nadav Har'El
On Sat, Nov 12, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU 
context for nested EPT":
> host may write-protect a page.  Second, the shadow and guest ptes may be
> in different formats (ept vs ia32).

I'm afraid I've lost you here... The shadow table and the to-be-shadowed 
table are both ia32 (this is the normal shadow table code), or both ept
(the nested tdp code). When are they supposed to be in different
formats (ept vs ia32)?

I'm also puzzled: in what situation would the host write-protect an EPT02
(shadow EPT) page?

> In fact that happens to accidentally work, no?  Intermediate ptes are
> always present/write/user, which translates to read/write/execute in EPT.

It didn't work because it also used to set the "accessed" bit, bit 5,
which on EPT is reserved and caused EPT misconfiguration. So I had to
fix link_shadow_page, or nested EPT would not work at all.

> Don't optimize for least changes, optimize for best result afterwards.

As I'm sure you remember, two years ago, on September 6 2009, you wrote in
your blog about the newly contributed nested VMX patch set, and in
particular its nested EPT (which predated the nested NPT contribution).

Nested EPT was, for some workloads, a huge performance improvement, but
you (if I understand correctly) did not want that code in KVM because
it, basically, optimized for getting the job done, in the most correct
and most efficient manner - but without regard for how cleanly this fit with
other types of shadowing (normal shadow page tables, and nested NPT),
or how much of the code was being duplicated or circumvented.

So this time around, I couldn't really "not optimize for least changes".
This time, the nested EPT had to fit (like a square peg in a round hole
;-)) into the preexisting MMU and NPT shadowing. I couldn't really just write
the most correct and most efficient code (which Orit Wasserman already
did, two years earlier). This time I needed to figure out the least obtrusive
way of changing the existing code. The hardest thing about doing this
was trying to understand all the complexities and subtleties of the existing
MMU code in KVM, which already handles 101 different cases in one
overloaded piece of code that is neither commented nor documented.
And of course, add to that all the complexities (some might even say "cruft")
which the underlying x86 architecture itself has accrued over the years.
So it's not surprising I've missed some of the important subtleties which
didn't have any effect in the typical case I've tried. Like I said, in my
tests nested EPT *did* work. And even getting to that point was hard enough :-)

> We need a third variant of walk_addr_generic that parses EPT format
> PTEs.  Whether that's best done by writing paging_ept.h or modifying
> paging_tmpl.h, I don't know.

Thanks. I'll think about everything you've said in this thread (I'm still
not convinced I understood all your points, so just understanding them
will be the first step). I'll see what I can do to improve the patch.

But I have to be honest - I'm not sure how quickly I can finish this.
I really appreciate all your comments about nested VMX in the last two
years - most of them have been spot-on, 100% correct, and really helpful
for making me understand things which I had previously misunderstood.
However, since you are (of course) extremely familiar with every nook and
cranny of KVM, what normally happens is that every comment which took you
5 minutes to figure out, takes me 5 days to fully understand, and to actually
write, debug and test the fixed code. Every review that takes you two days
to go through (and is very much appreciated!) takes me several months to fix
each and every thing you asked for.

Don't get me wrong, I *am* planning to continue working (part-time) on nested
VMX, and nested EPT in particular. But if you want the pace to pick up,
I could use some help with actual coding from people who have much more
intimate knowledge of the non-nested-VMX parts of KVM than I have.

In the meantime, if anybody wants to experiment with a much faster
Nested VMX than we had before, you can try my current patch. It may not
be perfect, but in many ways it is better than the old shadow-on-ept code.
And in simple (64-bit, 4k-page) kvm-over-kvm configurations like the ones
I tried, it works well.

Nadav.

-- 
Nadav Har'El|  Saturday, Nov 12 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |What's tiny, yellow and very dangerous? A
http://nadav.harel.org.il   |canary with the super-user password.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-11-10 Thread Nadav Har'El
On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU 
context for nested EPT":
> This is all correct, but the code in question parses the EPT12 table
> using the ia32 page table format.  They're sufficiently similar so that
> it works, but it isn't correct.
> 
> Bit 0: EPT readable, ia32 present
> Bit 1: Writable; ia32 meaning dependent on cr0.wp
> Bit 2: EPT executable, ia32 user (so, this implementation will interpret
> a non-executable EPT mapping, if someone could find a use for it, as a
> L2 kernel only mapping)
>

This is a very good point.

I was under the mistaken (?) impression that the page-table shadowing
code would just copy these bits as-is from the shadowed table (EPT12) to the
shadow table (EPT02), without caring what they actually mean. I knew we had
a problem when building, not copying, PTEs - hence the patch to
link_shadow_page.

Also I realized we sometimes need to actually walk the TDP EPT12+cr3 (e.g.,
to see if an EPT violation is L1's fault), but I thought this was just the
normal TDP walk, which already knows how to correctly read the EPT
table.

> walk_addr() will also write to bits 6/7, which the L1 won't expect.

I didn't notice this :(

Back to the drawing board, I guess. I need to figure out exactly what
needs to be fixed, and how to do this with the least obtrusive changes to
the existing use cases (normal shadow page tables, and nested NPT).
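
For the record, the clash can be demonstrated with a self-contained snippet
(mask values per the SDM, redefined locally so this compiles on its own):

	#include <stdio.h>
	#include <stdint.h>

	#define IA32_PRESENT	(1ull << 0)	/* EPT: readable */
	#define IA32_WRITABLE	(1ull << 1)	/* EPT: writable */
	#define IA32_USER	(1ull << 2)	/* EPT: executable! */
	#define IA32_ACCESSED	(1ull << 5)	/* EPT: reserved */

	int main(void)
	{
		/* The bits link_shadow_page() used to set on non-leaf PTEs: */
		uint64_t pte = IA32_PRESENT | IA32_WRITABLE |
			       IA32_USER | IA32_ACCESSED;

		/* Reused as an EPT PTE, the reserved bit 5 is set: */
		printf("EPT misconfiguration: %s\n",
		       (pte & IA32_ACCESSED) ? "yes" : "no");
		return 0;
	}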

-- 
Nadav Har'El|  Thursday, Nov 10 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Learn from mistakes of others; you won't
http://nadav.harel.org.il   |live long enough to make them all yourself
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/10] nEPT: Module option

2011-11-10 Thread Nadav Har'El
On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 01/10] nEPT: Module 
option":
> > By "this", do you mean without the "nested_ept" option, or without the
> > hypothetical "EPT on shadow page tables" feature?
> 
> Er, both.  The feature should be controlled on a per-guest basis, not
> per host.
>..
> It's just redundant, since we do need a per-guest control.

I agreed that per-guest control would have been nicer, but since we
don't have an API for specifying that per guest (EPT is not,
unfortunately, a CPUID feature), I thought that at least a host-level
flag would be useful.

Why would it be useful? I agree it isn't the best thing since
sliced bread, but if, for example, one day we discover a bug with nested
EPT, L0 can disable it for all L1 guests and basically force them to use
shadow page tables on EPT.
It was also useful for me to have this option for benchmarking, because
I can force back the old shadow-on-EPT method with just a single option
in L0 (instead of needing to give the "ept=0" option in L1s).

If you really don't like the existence of this option, I can easily
remove it, of course.


-- 
Nadav Har'El|  Thursday, Nov 10 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Guarantee: this email is 100% free of
http://nadav.harel.org.il   |magnetic monopoles, or your money back!
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-11-10 Thread Nadav Har'El
On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU 
context for nested EPT":
> > +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
> > +{
> > +   int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
>...
> > +   vcpu->arch.walk_mmu  = &vcpu->arch.nested_mmu;
>...
> 
> kvm_init_shadow_mmu() will cause ->page_fault to be set to something
> like paging64_page_fault(), which is geared to reading EPT ptes.  How
> does this work?

Hi,

I'm afraid I didn't understand the problem.

Nested EPT's merging of two EPT tables (EPT01 and EPT12) works just like
normal shadow page tables' merging of two CR3s (host cr3 and guest cr3):

When L0 receives a "page fault" from L2 (actually an EPT violation - real
guest #PF don't cause exits), L0 first looks it up in the shadowed table,
which is basically EPT12. If the address is there, L0 handles the fault itself
(updating the shadow EPT table, EPT02 using the normal shadow pte building
code). But if the address wasn't in the shadowed page table (EPT12),
mmu->inject_page_fault() is called, which in our case actually causes L1 to
get an EPT-violation (not #PF - see kvm_propagate_fault()).
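
In pseudo-C, the flow described above is roughly the following (walk_ept12()
and build_ept02_entry() are illustrative names, not the actual KVM functions,
and the real injection path goes through mmu->inject_page_fault()):

	/* Sketch: L0's handling of an EPT violation taken while running L2 */
	static void l0_handle_l2_ept_violation(struct kvm_vcpu *vcpu, u64 l2_gpa)
	{
		u64 l1_gpa;

		if (walk_ept12(vcpu, l2_gpa, &l1_gpa))
			/* Present in EPT12: L0's job - build the EPT02 entry
			 * with the normal shadow-pte code and resume L2. */
			build_ept02_entry(vcpu, l2_gpa, l1_gpa);
		else
			/* Missing in EPT12: L1's job - reflect an EPT
			 * violation into L1. */
			nested_ept_inject_page_fault(vcpu, l2_gpa);
	}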

Please note that all this logic is shared with the existing nested NPT
code (which itself shared most of the code with the preexisting shadow
page tables code). All this code sharing makes it really difficult to
understand at first glance why the code is really working, but once you
understand why one of these cases works, the others work similarly.
And it does in fact work - in typical cases which I tried, at least.

If you still think I'm missing something, I won't be entirely surprised
( :-) ), so let me know.

Nadav.


-- 
Nadav Har'El|  Thursday, Nov 10 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |I put a dollar in one of those change
http://nadav.harel.org.il   |machines. Nothing changed.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/10] nEPT: Module option

2011-11-10 Thread Nadav Har'El
On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 01/10] nEPT: Module 
option":
> On 11/10/2011 11:58 AM, Nadav Har'El wrote:
> > Add a module option "nested_ept" determining whether to enable Nested EPT.
>...
> > In the future, we can support emulation of EPT for L1 *always*, even when L0
> > itself doesn't have EPT. This so-called "EPT on shadow page tables" mode
> > has some theoretical advantages over the baseline "shadow page tables on
> > shadow page tables" mode typically used when EPT is not available to L0 -
> > namely that L2's cr3 changes and page faults can be handled in L0 and do not
> > need to be propagated to L1. However, currently we do not support this mode,
> > and it is becoming less interesting as newer processors all support EPT.
> >
> >
> 
> I think we can live without this.

By "this", do you mean without the "nested_ept" option, or without the
hypothetical "EPT on shadow page tables" feature?

If the former, then I agree we can "live" without it, but since it was
trivial to add, I don't see what harm it can do, and it's nice that we
can revert, with a single L0 option, to the old shadow-on-ept paging.
Is there anything specific you don't like about having this option?

About the latter, I agree - as I said, there isn't much point to go and
write this (quite complicated) 3-level shadowing when all new processors
have EPT anyway. So I didn't.

> But we do need a way to control what
> features are exposed to the guest, for compatibility and live migration
> purposes, as we do with cpuid.  So we need some way for host userspace
> to write to the vmx read-only feature reporting MSRs.

I think this is a general issue of nested VMX (which we already discussed
earlier), and not specific to nested EPT. I already put all the
capabilities which the MSR report in variables initialized in a single
function, nested_vmx_setup_ctls_msrs(), so once we devise an appropriate
userspace interface to set these, we can do so easily.

Does nested SVM also have a similar problem, of whether or not it
advertises new or optional SVM features to L1? If it does have this
problem, how was it solved there?

-- 
Nadav Har'El|  Thursday, Nov 10 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |I considered atheism but there weren't
http://nadav.harel.org.il   |enough holidays.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/10] nEPT: Fix page table format in nested EPT

2011-11-10 Thread Nadav Har'El
On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 04/10] nEPT: Fix page 
table format in nested EPT":
> > @@ -287,6 +287,7 @@ struct kvm_mmu {
> > bool nx;
> >  
> > u64 pdptrs[4]; /* pae */
> > +   u64 link_shadow_page_set_bits;
>...
> > +static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, u64 
> > set_bits)
> >  {
> > -   u64 spte;
> > -
> > -   spte = __pa(sp->spt)
> > -   | PT_PRESENT_MASK | PT_ACCESSED_MASK
> > -   | PT_WRITABLE_MASK | PT_USER_MASK;
> > -   mmu_spte_set(sptep, spte);
> > +   mmu_spte_set(sptep, __pa(sp->spt) | set_bits);
> >  }
> >
> 
> Minor nit: you can just use link_shadow_page_set_bits here instead of
> passing it around (unless later you have a different value for the
> parameter?)

The problem was that link_shadow_page did not take a kvm_mmu parameter,
so it had no way to find link_shadow_page_set_bits. So either
I pass the pointer to the entire kvm_mmu to link_shadow_page, or I just
pass the only field which I need... I thought that passing the single
field I need was cleaner - but I can easily change it if you prefer to
pass the kvm_mmu.

Thanks,

Nadav.

-- 
Nadav Har'El|  Thursday, Nov 10 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |I had a lovely evening. Unfortunately,
http://nadav.harel.org.il   |this wasn't it. - Groucho Marx
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 09/10] nEPT: Documentation

2011-11-10 Thread Nadav Har'El
Update the documentation to no longer say that nested EPT is not supported.

Signed-off-by: Nadav Har'El 
---
 Documentation/virtual/kvm/nested-vmx.txt |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- .before/Documentation/virtual/kvm/nested-vmx.txt2011-11-10 
11:33:59.0 +0200
+++ .after/Documentation/virtual/kvm/nested-vmx.txt 2011-11-10 
11:33:59.0 +0200
@@ -38,8 +38,8 @@ The current code supports running Linux 
 Only 64-bit guest hypervisors are supported.
 
 Additional patches for running Windows under guest KVM, and Linux under
-guest VMware server, and support for nested EPT, are currently running in
-the lab, and will be sent as follow-on patchsets.
+guest VMware server, are currently running in the lab, and will be sent as
+follow-on patchsets.
 
 
 Running nested VMX
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 10/10] nEPT: Miscellaneous cleanups

2011-11-10 Thread Nadav Har'El
Some trivial code cleanups not really related to nested EPT.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2011-11-10 11:34:00.0 +0200
@@ -611,7 +611,6 @@ static void nested_release_page_clean(st
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
-static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
 
 static DEFINE_PER_CPU(struct vmcs *, vmxarea);
@@ -875,8 +874,7 @@ static inline bool nested_cpu_has2(struc
(vmcs12->secondary_vm_exec_control & bit);
 }
 
-static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
-   struct kvm_vcpu *vcpu)
+static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12)
 {
return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
@@ -6020,7 +6018,7 @@ static int vmx_handle_exit(struct kvm_vc
 
if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
!(is_guest_mode(vcpu) && nested_cpu_has_virtual_nmis(
-   get_vmcs12(vcpu), vcpu {
+   get_vmcs12(vcpu) {
if (vmx_interrupt_allowed(vcpu)) {
vmx->soft_vnmi_blocked = 0;
} else if (vmx->vnmi_blocked_time > 10LL &&
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 07/10] nEPT: Advertise EPT to L1

2011-11-10 Thread Nadav Har'El
Advertise the support of EPT to the L1 guest, through the appropriate MSR.

This is the last patch of the basic Nested EPT feature, so as to allow
bisection through this patch series: The guest will not see EPT support until
this last patch, and will not attempt to use the half-applied feature.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |   15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2011-11-10 11:33:59.0 +0200
@@ -1908,6 +1908,7 @@ static u32 nested_vmx_secondary_ctls_low
 static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high;
 static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high;
 static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high;
+static u32 nested_vmx_ept_caps;
 static __init void nested_vmx_setup_ctls_msrs(void)
 {
/*
@@ -1980,6 +1981,16 @@ static __init void nested_vmx_setup_ctls
nested_vmx_secondary_ctls_low = 0;
nested_vmx_secondary_ctls_high &=
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+   if (nested_ept)
+   nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
+
+   /* ept capabilities */
+   if (nested_ept) {
+   nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+   nested_vmx_ept_caps &= vmx_capability.ept;
+   } else
+   nested_vmx_ept_caps = 0;
+
 }
 
 static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
@@ -2079,8 +2090,8 @@ static int vmx_get_vmx_msr(struct kvm_vc
nested_vmx_secondary_ctls_high);
break;
case MSR_IA32_VMX_EPT_VPID_CAP:
-   /* Currently, no nested ept or nested vpid */
-   *pdata = 0;
+   /* Currently, no nested vpid support */
+   *pdata = nested_vmx_ept_caps;
break;
default:
return 0;
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 08/10] nEPT: Nested INVEPT

2011-11-10 Thread Nadav Har'El
If we let L1 use EPT, we should probably also support the INVEPT instruction.

Signed-off-by: Nadav Har'El 
---
 arch/x86/include/asm/vmx.h |2 
 arch/x86/kvm/vmx.c |  112 +++
 2 files changed, 114 insertions(+)

--- .before/arch/x86/include/asm/vmx.h  2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/include/asm/vmx.h   2011-11-10 11:33:59.0 +0200
@@ -279,6 +279,7 @@ enum vmcs_field {
 #define EXIT_REASON_APIC_ACCESS 44
 #define EXIT_REASON_EPT_VIOLATION   48
 #define EXIT_REASON_EPT_MISCONFIG   49
+#define EXIT_REASON_INVEPT 50
 #define EXIT_REASON_WBINVD 54
 #define EXIT_REASON_XSETBV 55
 
@@ -404,6 +405,7 @@ enum vmcs_field {
 #define VMX_EPTP_WB_BIT(1ull << 14)
 #define VMX_EPT_2MB_PAGE_BIT   (1ull << 16)
 #define VMX_EPT_1GB_PAGE_BIT   (1ull << 17)
+#define VMX_EPT_INVEPT_BIT (1ull << 20)
 #define VMX_EPT_EXTENT_INDIVIDUAL_BIT  (1ull << 24)
 #define VMX_EPT_EXTENT_CONTEXT_BIT (1ull << 25)
 #define VMX_EPT_EXTENT_GLOBAL_BIT  (1ull << 26)
--- .before/arch/x86/kvm/vmx.c  2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2011-11-10 11:33:59.0 +0200
@@ -351,6 +351,8 @@ struct nested_vmx {
struct list_head vmcs02_pool;
int vmcs02_num;
u64 vmcs01_tsc_offset;
+   /* Remember last EPT02, for single-context INVEPT optimization */
+   u64 last_eptp02;
/* L2 must run next, and mustn't decide to exit to L1. */
bool nested_run_pending;
/*
@@ -1987,6 +1989,10 @@ static __init void nested_vmx_setup_ctls
/* ept capabilities */
if (nested_ept) {
nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+   nested_vmx_ept_caps |=
+   VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT |
+   VMX_EPT_EXTENT_CONTEXT_BIT |
+   VMX_EPT_EXTENT_INDIVIDUAL_BIT;
nested_vmx_ept_caps &= vmx_capability.ept;
} else
nested_vmx_ept_caps = 0;
@@ -5568,6 +5574,105 @@ static int handle_vmptrst(struct kvm_vcp
return 1;
 }
 
+/* Emulate the INVEPT instruction */
+static int handle_invept(struct kvm_vcpu *vcpu)
+{
+   u32 vmx_instruction_info;
+   unsigned long type;
+   gva_t gva;
+   struct x86_exception e;
+   struct {
+   u64 eptp, gpa;
+   } operand;
+
+
+   if (!nested_ept || !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   /* According to the Intel VMX instruction reference, the memory
+* operand is read even if it isn't needed (e.g., for type==global)
+*/
+   vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+   if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+   vmx_instruction_info, &gva))
+   return 1;
+   if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
+   sizeof(operand), &e)) {
+   kvm_inject_page_fault(vcpu, &e);
+   return 1;
+   }
+
+   type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
+
+   switch (type) {
+   case VMX_EPT_EXTENT_GLOBAL:
+   if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_GLOBAL_BIT))
+   nested_vmx_failValid(vcpu,
+   VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+   else {
+   ept_sync_global();
+   nested_vmx_succeed(vcpu);
+   }
+   break;
+   case VMX_EPT_EXTENT_CONTEXT:
+   if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_CONTEXT_BIT))
+   nested_vmx_failValid(vcpu,
+   VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+   else {
+   /*
+* We efficiently handle the common case, of L1
+* invalidating the last eptp it used to run L2.
+* TODO: Instead of saving one last_eptp02, look up
+* operand.eptp in the shadow EPT table cache, to
+* find its shadow. Then last_eptp02 won't be needed.
+*/
+   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   if (vmcs12 && nested_cpu_has_ept(vmcs12) &&
+   (vmcs12->ept_pointer == operand.eptp) &&
+   vmx->nested.last_eptp02)
+   ept_sync_context(vmx->nested.last_eptp02);
+   else
+   ept_sync_global();

[PATCH 06/10] nEPT: Some additional comments

2011-11-10 Thread Nadav Har'El
Some additional comments to preexisting code:
Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
Don't mention "shadow on either EPT or shadow" as the only two options.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |   21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2011-11-10 11:33:59.0 +0200
@@ -5815,7 +5815,20 @@ static bool nested_vmx_exit_handled(stru
return nested_cpu_has2(vmcs12,
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
case EXIT_REASON_EPT_VIOLATION:
+   /*
+* L0 always deals with the EPT violation. If nested EPT is
+* used, and the nested mmu code discovers that the address is
+* missing in the guest EPT table (EPT12), the EPT violation
+* will be injected with nested_ept_inject_page_fault()
+*/
+   return 0;
case EXIT_REASON_EPT_MISCONFIG:
+   /*
+* L2 never directly uses L1's EPT, but rather L0's own EPT
+* table (shadow on EPT) or a merged EPT table that L0 built
+* (EPT on EPT). So any problem with the structure of the
+* table is L0's fault.
+*/
return 0;
case EXIT_REASON_WBINVD:
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_WBINVD_EXITING);
@@ -6736,7 +6749,12 @@ static void prepare_vmcs02(struct kvm_vc
vmx_set_cr4(vcpu, vmcs12->guest_cr4);
vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12));
 
-   /* shadow page tables on either EPT or shadow page tables */
+   /*
+* Note that kvm_set_cr3() and kvm_mmu_reset_context() will do the
+* right thing, and set GUEST_CR3 and/or EPT_POINTER in all supported
+* settings: 1. shadow page tables on shadow page tables, 2. shadow
+* page tables on EPT, 3. EPT on EPT.
+*/
kvm_set_cr3(vcpu, vmcs12->guest_cr3);
kvm_mmu_reset_context(vcpu);
 
@@ -7075,7 +7093,6 @@ void load_vmcs12_host_state(struct kvm_v
 
if (nested_cpu_has_ept(vmcs12))
nested_ept_uninit_mmu_context(vcpu);
-   /* shadow page tables on either EPT or shadow page tables */
kvm_set_cr3(vcpu, vmcs12->host_cr3);
kvm_mmu_reset_context(vcpu);
 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 05/10] nEPT: Fix wrong test in kvm_set_cr3

2011-11-10 Thread Nadav Har'El
kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
address. The problem is that with nested EPT, cr3 is an *L2* physical
address, not an L1 physical address as this test expects.

As the comment above this test explains, it isn't necessary, and doesn't
correspond to anything a real processor would do. So this patch just
removes it.

Note that this wrong test could have also theoretically caused problems
in nested NPT, not just in nested EPT. However, in practice, the problem
was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
circumventing the problem. Additional potential calls to the buggy function
are avoided in that we don't trap cr3 modifications when nested NPT is
enabled. However, because in nested VMX we did want to use kvm_set_cr3()
(as requested in Avi Kivity's review of the original nested VMX patches),
we can't avoid this problem and need to fix it.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/x86.c |   11 ---
 1 file changed, 11 deletions(-)

--- .before/arch/x86/kvm/x86.c  2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/x86.c   2011-11-10 11:33:59.0 +0200
@@ -690,17 +690,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, u
 */
}
 
-   /*
-* Does the new cr3 value map to physical memory? (Note, we
-* catch an invalid cr3 even in real-mode, because it would
-* cause trouble later on when we turn on paging anyway.)
-*
-* A real CPU would silently accept an invalid cr3 and would
-* attempt to use it - with largely undefined (and often hard
-* to debug) behavior on the guest side.
-*/
-   if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
-   return 1;
vcpu->arch.cr3 = cr3;
__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
vcpu->arch.mmu.new_cr3(vcpu);


[PATCH 04/10] nEPT: Fix page table format in nested EPT

2011-11-10 Thread Nadav Har'El
When the existing KVM MMU code creates a shadow page table, it assumes it
has the normal x86 page table format. This is obviously correct for normal
shadow page tables, and also correct for AMD's NPT.
Unfortunately, Intel's EPT page tables differ in subtle ways from ordinary
page tables, so when we create a shadow EPT table (i.e., in nested EPT),
we need to slightly modify the way in which this table is built.

In particular, when mmu.c's link_shadow_page() creates non-leaf page table
entries, it used to enable the "present", "accessed", "writable" and "user"
flags on these entries. While this is correct for ordinary page tables, it
is wrong in EPT tables - where these bits actually have completely different
meaning (compare PT_*_MASK from mmu.h to VMX_EPT_*_MASK from vmx.h).
In particular, leaving the code as-is causes bit 5 of the PTE to be turned on
(supposedly for PT_ACCESSED_MASK), which is a reserved bit in EPT and causes
an "EPT Misconfiguration" failure.

So we must move link_shadow_page's list of extra bits to a new mmu context
field, which is set differently for nested EPT.

Signed-off-by: Nadav Har'El 
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/mmu.c  |   16 +---
 arch/x86/kvm/paging_tmpl.h  |6 --
 arch/x86/kvm/vmx.c  |3 +++
 4 files changed, 17 insertions(+), 9 deletions(-)

--- .before/arch/x86/include/asm/kvm_host.h 2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/include/asm/kvm_host.h  2011-11-10 11:33:59.0 +0200
@@ -287,6 +287,7 @@ struct kvm_mmu {
bool nx;
 
u64 pdptrs[4]; /* pae */
+   u64 link_shadow_page_set_bits;
 };
 
 struct kvm_vcpu_arch {
--- .before/arch/x86/kvm/vmx.c  2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2011-11-10 11:33:59.0 +0200
@@ -6485,6 +6485,9 @@ static int nested_ept_init_mmu_context(s
vcpu->arch.mmu.get_pdptr = nested_ept_get_pdptr;
vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
vcpu->arch.mmu.shadow_root_level = get_ept_level();
+   vcpu->arch.mmu.link_shadow_page_set_bits =
+   VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
+   VMX_EPT_EXECUTABLE_MASK;
 
vcpu->arch.walk_mmu  = &vcpu->arch.nested_mmu;
 
--- .before/arch/x86/kvm/mmu.c  2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/mmu.c   2011-11-10 11:33:59.0 +0200
@@ -1782,14 +1782,9 @@ static void shadow_walk_next(struct kvm_
return __shadow_walk_next(iterator, *iterator->sptep);
 }
 
-static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
+static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, u64 set_bits)
 {
-   u64 spte;
-
-   spte = __pa(sp->spt)
-   | PT_PRESENT_MASK | PT_ACCESSED_MASK
-   | PT_WRITABLE_MASK | PT_USER_MASK;
-   mmu_spte_set(sptep, spte);
+   mmu_spte_set(sptep, __pa(sp->spt) | set_bits);
 }
 
 static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
@@ -3366,6 +3361,13 @@ int kvm_init_shadow_mmu(struct kvm_vcpu 
vcpu->arch.mmu.base_role.cr0_wp  = is_write_protection(vcpu);
vcpu->arch.mmu.base_role.smep_andnot_wp
= smep && !is_write_protection(vcpu);
+   /*
+* link_shadow_page() should apply these bits in shadow page tables, and
+* in shadow NPT tables (nested NPT). For nested EPT, different bits
+* apply.
+*/
+   vcpu->arch.mmu.link_shadow_page_set_bits = PT_PRESENT_MASK |
+   PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
 
return r;
 }
--- .before/arch/x86/kvm/paging_tmpl.h  2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/paging_tmpl.h   2011-11-10 11:33:59.0 +0200
@@ -515,7 +515,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
goto out_gpte_changed;
 
if (sp)
-   link_shadow_page(it.sptep, sp);
+   link_shadow_page(it.sptep, sp,
+   vcpu->arch.mmu.link_shadow_page_set_bits);
}
 
for (;
@@ -535,7 +536,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
 
sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
  true, direct_access, it.sptep);
-   link_shadow_page(it.sptep, sp);
+   link_shadow_page(it.sptep, sp,
+   vcpu->arch.mmu.link_shadow_page_set_bits);
}
 
clear_sp_write_flooding_count(it.sptep);


[PATCH 03/10] nEPT: Fix cr3 handling in nested exit and entry

2011-11-10 Thread Nadav Har'El
The existing code for handling cr3 and related VMCS fields during nested
exit and entry wasn't correct in all cases:

If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
we forgot to do so. This patch adds this copy.

If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
whoever does control cr3 (L1 or L2) is using PAE, the processor might have
saved PDPTEs and we should also save them in vmcs12 (and restore later).

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |   30 ++
 1 file changed, 30 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-11-10 11:33:58.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2011-11-10 11:33:58.0 +0200
@@ -6737,6 +6737,17 @@ static void prepare_vmcs02(struct kvm_vc
kvm_set_cr3(vcpu, vmcs12->guest_cr3);
kvm_mmu_reset_context(vcpu);
 
+   /*
+* Additionally, except when L0 is using shadow page tables, L1 or
+* L2 control guest_cr3 for L2, so they may also have saved PDPTEs
+*/
+   if (enable_ept) {
+   vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
+   vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
+   vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
+   vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
+   }
+
kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
 }
@@ -6968,6 +6979,25 @@ void prepare_vmcs12(struct kvm_vcpu *vcp
vmcs12->guest_pending_dbg_exceptions =
vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
 
+   /*
+* In some cases (usually, nested EPT), L2 is allowed to change its
+* own CR3 without exiting. If it has changed it, we must keep it.
+* Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
+* by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
+*/
+   if (enable_ept)
+   vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
+   /*
+* Additionally, except when L0 is using shadow page tables, L1 or
+* L2 control guest_cr3 for L2, so save their PDPTEs
+*/
+   if (enable_ept) {
+   vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+   vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+   vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+   vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+   }
+
/* TODO: These cannot have changed unless we have MSR bitmaps and
 * the relevant bit asks not to trap the change */
vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);


[PATCH 02/10] nEPT: MMU context for nested EPT

2011-11-10 Thread Nadav Har'El
KVM's existing shadow MMU code already supports nested TDP. To use it, we
need to set up a new "MMU context" for nested EPT, and create a few callbacks
for it (nested_ept_*()). We then need to switch back and forth between this
nested context and the regular MMU context when switching between L1 and L2.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |   60 +++
 1 file changed, 60 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-11-10 11:33:58.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2011-11-10 11:33:58.0 +0200
@@ -6443,6 +6443,59 @@ static void vmx_set_supported_cpuid(u32 
entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
+/* Callbacks for nested_ept_init_mmu_context: */
+static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
+{
+   /* return the page table to be shadowed - in our case, EPT12 */
+   return get_vmcs12(vcpu)->ept_pointer;
+}
+
+static u64 nested_ept_get_pdptr(struct kvm_vcpu *vcpu, int index)
+{
+   /*
+* This function is called (as mmu.get_pdptr()) in mmu.c to help read
+* a to-be-shadowed page table in PAE (3-level) format. However, the
+* EPT table we're now shadowing (this is the nested EPT mmu), must
+* always have 4 levels, and is not in PAE format, so this function
+* should never be called.
+*/
+   kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+   return 0;
+}
+
+static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
+   struct x86_exception *fault)
+{
+   struct vmcs12 *vmcs12;
+   nested_vmx_vmexit(vcpu);
+   vmcs12 = get_vmcs12(vcpu);
+   /*
+* Note no need to set vmcs12->vm_exit_reason as it is already copied
+* from vmcs02 in nested_vmx_vmexit() above, i.e., EPT_VIOLATION.
+*/
+   vmcs12->exit_qualification = fault->error_code;
+   vmcs12->guest_physical_address = fault->address;
+}
+
+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+   int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
+
+   vcpu->arch.mmu.set_cr3   = vmx_set_cr3;
+   vcpu->arch.mmu.get_cr3   = nested_ept_get_cr3;
+   vcpu->arch.mmu.get_pdptr = nested_ept_get_pdptr;
+   vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
+   vcpu->arch.mmu.shadow_root_level = get_ept_level();
+
+   vcpu->arch.walk_mmu  = &vcpu->arch.nested_mmu;
+
+   return r;
+}
+
+static void nested_ept_uninit_mmu_context(struct kvm_vcpu *vcpu)
+{
+   vcpu->arch.walk_mmu = &vcpu->arch.mmu;
+}
+
 /*
  * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
  * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
@@ -6652,6 +6705,11 @@ static void prepare_vmcs02(struct kvm_vc
vmx_flush_tlb(vcpu);
}
 
+   if (nested_cpu_has_ept(vmcs12)) {
+   kvm_mmu_unload(vcpu);
+   nested_ept_init_mmu_context(vcpu);
+   }
+
if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
vcpu->arch.efer = vmcs12->guest_ia32_efer;
if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
@@ -6982,6 +7040,8 @@ void load_vmcs12_host_state(struct kvm_v
vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
kvm_set_cr4(vcpu, vmcs12->host_cr4);
 
+   if (nested_cpu_has_ept(vmcs12))
+   nested_ept_uninit_mmu_context(vcpu);
/* shadow page tables on either EPT or shadow page tables */
kvm_set_cr3(vcpu, vmcs12->host_cr3);
kvm_mmu_reset_context(vcpu);


[PATCH 01/10] nEPT: Module option

2011-11-10 Thread Nadav Har'El
Add a module option "nested_ept" determining whether to enable Nested EPT.

Nested EPT means emulating EPT for an L1 guest so that L1 can use EPT when
running a nested guest L2. When L1 uses EPT, it allows the L2 guest to set
its own cr3 and take its own page faults without either of L0 or L1 getting
involved. This often significantly improves L2's performance over the
previous two alternatives (shadow page tables over ept, and shadow page
tables over shadow page tables).

nested_ept is currently enabled by default (when nested VMX is enabled),
unless L0 doesn't have EPT or disabled it with ept=0.

Users would not normally want to explicitly disable this option. One reason
why one might want to disable it is to force L1 to make do without the EPT
capability, when anticipating a future need to migrate this L1 to another
host which doesn't have EPT. Note that currently there is no API to turn off
nested EPT for just a single L1 guest. However, obviously, an individual L1
guest may choose not to use EPT - the nested_cpu_has_ept() checks if L1
actually used EPT when running L2.
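
For example (a hypothetical invocation, assuming a kvm_intel module built with
this patch), an administrator could keep nested VMX but turn nested EPT off
with something like:

    modprobe kvm_intel nested=1 nested_ept=0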

In the future, we can support emulation of EPT for L1 *always*, even when L0
itself doesn't have EPT. This so-called "EPT on shadow page tables" mode
has some theoretical advantages over the baseline "shadow page tables on
shadow page tables" mode typically used when EPT is not available to L0 -
namely that L2's cr3 changes and page faults can be handled in L0 and do not
need to be propagated to L1. However, currently we do not support this mode,
and it is becoming less interesting as newer processors all support EPT.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |   12 
 1 file changed, 12 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-11-10 11:33:58.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2011-11-10 11:33:58.0 +0200
@@ -83,6 +83,10 @@ module_param(fasteoi, bool, S_IRUGO);
 static int __read_mostly nested = 0;
 module_param(nested, bool, S_IRUGO);
 
+/* Whether L0 emulates EPT for its L1 guests. It doesn't mean L1 must use it */
+static int __read_mostly nested_ept = 1;
+module_param(nested_ept, bool, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST  \
(X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD)
 #define KVM_GUEST_CR0_MASK \
@@ -875,6 +879,11 @@ static inline bool nested_cpu_has_virtua
return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
 
+static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT);
+}
+
 static inline bool is_exception(u32 intr_info)
 {
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2642,6 +2651,9 @@ static __init int hardware_setup(void)
if (!cpu_has_vmx_ple())
ple_gap = 0;
 
+   if (!nested || !enable_ept)
+   nested_ept = 0;
+
if (nested)
nested_vmx_setup_ctls_msrs();
 


[PATCH 0/10] nEPT: Nested EPT support for Nested VMX

2011-11-10 Thread Nadav Har'El
The following patches add nested EPT support to Nested VMX.

Nested EPT means emulating EPT for an L1 guest, allowing it to use EPT when
running a nested guest L2. When L1 uses EPT, it allows the L2 guest to set
its own cr3 and take its own page faults without either of L0 or L1 getting
involved. In many workloads this significantly improves L2's performance over
the previous two alternatives (shadow page tables over ept, and shadow page
tables over shadow page tables). Our paper [1] described these three options,
and the advantages of nested EPT ("multidimensional paging").

Nested EPT is enabled by default (if the hardware supports EPT), so users do
not have to do anything special to enjoy the performance improvement that
this patch gives to L2 guests.

Just as a non-scientific, non-representative indication of the kind of
dramatic performance improvement you may see in workloads that have a lot of
context switches and page faults, here is a measurement of the time
an example single-threaded "make" took in L2 (kvm over kvm):

 shadow over shadow: 105 seconds
 ("ept=0" forces this)

 shadow over EPT: 87 seconds
 (the previous default; Can be forced now with "nested_ept=0")

 EPT over EPT: 29 seconds
 (the default after this patch)

Note that the same test on L1 (with EPT) took 25 seconds, so for this example
workload, performance of nested virtualization is now very close to that of
single-level virtualization.


[1] "The Turtles Project: Design and Implementation of Nested Virtualization",
http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf


Patch statistics:
-

 Documentation/virtual/kvm/nested-vmx.txt |4 
 arch/x86/include/asm/kvm_host.h  |1 
 arch/x86/include/asm/vmx.h   |2 
 arch/x86/kvm/mmu.c   |   16 -
 arch/x86/kvm/paging_tmpl.h   |6 
 arch/x86/kvm/vmx.c   |  259 -
 arch/x86/kvm/x86.c   |   11 
 7 files changed, 269 insertions(+), 30 deletions(-)

--
Nadav Har'El
IBM Haifa Research Lab


Re: Intel VT-x VMCS doubt

2011-10-16 Thread Nadav Har'El
On Sat, Oct 15, 2011, Shashank Rachamalla wrote about "Re: Intel VT-x VMCS doubt":
> However, I am not sure when the CPU switches between host and guest
> state ? I came across instructions VMRUN and VMRESUME while reading
> which do a VM entry. I think guest state gets loaded when these
> instructions are executed. Someone please clarify ? 

Indeed - on VMLAUNCH or VMRESUME ("VMRUN" is an AMD instruction and not
relevant to your Intel question), the processor, among other things, sets
its state to the "guest state" specified in the given VMCS. When the processor
decides to "exit" the guest (the host determines, in the VMCS, on which events
such exits happen), the processor loads back the "host state" as written in
the VMCS.
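
Schematically, a hypervisor's inner loop looks roughly like this (an
illustrative C-like sketch with made-up helper names, not actual KVM code):

    for (;;) {
        vmlaunch_or_vmresume();  /* CPU loads guest state from the VMCS   */
        /* ...guest runs until an exit condition chosen in the VMCS...    */
        /* on exit, the CPU has already reloaded the VMCS's host state    */
        handle_exit_reason();    /* hypervisor decides what to do next    */
    }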

But after replying to your specific question, I'd recommend that before you
continue to ask these questions, you take some time to read an organized
introduction to VT-x. The official description of VMX (VT-x) by Intel can be
found in their Software Developer's Manual, volume 3, starting with Chapter 20
("Introduction to Virtual-Machine Extensions"). You can find it online (for
browsing or downloading everything as one PDF file) in

http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-3b-system-programming-manual.html

Volume 2 of the same manual gives additional details on each instruction,
like the aforementioned VMLAUNCH and friends.

-- 
Nadav Har'El|Sunday, Oct 16 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |I started out with nothing... I still
http://nadav.harel.org.il   |have most of it.


Re: Intel VT-x VMCS doubt

2011-10-16 Thread Nadav Har'El
On Sat, Oct 15, 2011, Shashank Rachamalla wrote about "Intel VT-x VMCS doubt":
> Hi 
> 
> After catching up with some reading on Intel VT-x, I could understand
> the use of VMCS ( virtual machine control structure ) which allows a VMM
> to specify instructions and events that cause VM entries and exits. My
> current understanding is that a VMCS structure is maintained by VMM for
> each guest and this structure is used by CPU during a VM entry and VM
> exit.

This is correct.

> However, I am not sure if each CPU core has a separate VMCS
> structure for each guest ? Also, does the number of guest VCPUs
> determine the number of VMCS structures required for a guest ? Can
> someone please help me figure out answers to these questions..

I'm not sure the KVM development mailing list is the best place to ask these
questions, but I'll try to answer anyway.

The VMCS is, at its essence, an ordinary data structure in ordinary memory;
there is just one per guest VCPU, and their number is NOT multiplied
by the number of physical CPUs.

When the hypervisor decides to actually run a particular guest with a
particular VMCS on a particular physical CPU, there are certain things it
needs to do to make this VMCS "ready" for this CPU. One thing that needs to
be done is to modify a few host fields in the VMCS, because in Linux a few
registers take different default values in different cores, and the "host
fields" in the VMCS remember the values we need to go back to after the exit.
The second thing that needs to be done is, according to the spec, to VMPTRLD
the VMCS onto the physical CPU (doing VMCLEAR first on a previous CPU, if this
VMCS was previously used on a different CPU). This is necessary because,
according to the spec, each physical CPU (core) might want to work with a cached
version of the VMCS - and this cache might be per core.

So to make a long story short: while KVM is running a particular VMCS on a
particular CPU, it slightly modifies this VMCS to fit the needs of this
particular CPU; it *doesn't* save additional copies of the same VMCS for
running on different CPUs. The event of switching a guest from running on
one physical CPU to a different physical CPU is known as "VCPU migration"
(the VCPU "migrates" between two physical CPUs) and KVM handles this case
by setting a few VMCS fields and doing VMCLEAR/VMPTRLD as explained above.
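
Schematically (with made-up helper names, not the actual KVM code paths),
VCPU migration amounts to:

    /* Move a VMCS from old_cpu to new_cpu before resuming the VCPU: */
    vmclear_on(old_cpu, vmcs);   /* flush that core's cached VMCS state   */
    vmptrld_on(new_cpu, vmcs);   /* make the VMCS current on the new core */
    update_host_fields(vmcs);    /* refresh per-CPU host state (GDT, ...) */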

I hope this helps make things clearer...

-- 
Nadav Har'El|Sunday, Oct 16 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Computers are like air conditioners. Both
http://nadav.harel.org.il   |stop working if you open windows.


Re: [PATCH 1/2] nVMX: Add KVM_REQ_IMMEDIATE_EXIT

2011-09-25 Thread Nadav Har'El
On Fri, Sep 23, 2011, Marcelo Tosatti wrote about "Re: [PATCH 1/2] nVMX: Add KVM_REQ_IMMEDIATE_EXIT":
> On Thu, Sep 22, 2011 at 01:52:56PM +0300, Nadav Har'El wrote:
> > This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT.
> > This bit requests that when next entering the guest, we should run it only
> > for as little as possible, and exit again.
> > 
> > We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1
>...
> > @@ -5647,6 +5648,8 @@ static int vcpu_enter_guest(struct kvm_v
> > }
> > if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
> > record_steal_time(vcpu);
> > +   req_immediate_exit =
> > +   kvm_check_request(KVM_REQ_IMMEDIATE_EXIT, vcpu);
>...
> The immediate exit information can be lost if entry decides to bail out.
> You can do 
> 
> req_immediate_exit = kvm_check_request(KVM_REQ_IMMEDIATE_EXIT)
> after preempt_disable()
> and then transfer back the bit in the bail out case in
> if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests

Thanks.

But thinking about this a bit, it seems to me that in my case *losing* this
bit on a canceled entry is the correct thing to do, as turning on this bit was
decided in the injection phase (in enable_irq_window()), and next time, if
the reason to turn on this bit still exists (i.e., L0 has something to inject
to L1, but L2 needs to run), we will turn it on again.

-- 
Nadav Har'El|Sunday, Sep 25 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Guarantee: this email is 100% free of
http://nadav.harel.org.il   |magnetic monopoles, or your money back!


[PATCH 2/2] nVMX: Fix warning-causing idt-vectoring-info behavior

2011-09-22 Thread Nadav Har'El
When L0 wishes to inject an interrupt while L2 is running, it emulates an exit
to L1 with EXIT_REASON_EXTERNAL_INTERRUPT. This was explained in the original
nVMX patch 23, titled "Correct handling of interrupt injection".

Unfortunately, it is possible (though rare) that at this point there is valid
idt_vectoring_info in vmcs02. For example, L1 injected some interrupt to L2,
and when L2 tried to run this interrupt's handler, it got a page fault - so
it returns the original interrupt vector in idt_vectoring_info. The problem
is that if this is the case, we cannot exit to L1 with EXTERNAL_INTERRUPT
like we wished to, because the VMX spec guarantees that valid idt_vectoring_info
and EXIT_REASON_EXTERNAL_INTERRUPT can never happen together. This is not
just specified in the spec - a KVM L1 actually prints a kernel warning
"unexpected, valid vectoring info" if we violate this guarantee, and some
users noticed these warnings in L1's logs.

In order to better emulate a processor, which would never return the external
interrupt and the idt-vectoring-info together, we need to separate the two
injection steps: First, complete L1's injection into L2 (i.e., enter L2,
injecting to it the idt-vectoring-info); Second, after entry into L2 succeeds
and it exits back to L0, exit to L1 with the EXIT_REASON_EXTERNAL_INTERRUPT.
Most of this is already in the code - the only change we need is to remain
in L2 (and not exit to L1) in this case.
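
In outline, the emulated sequence becomes (a schematic, not literal code):

    /*
     * Step 1: enter L2 once, delivering the pending idt-vectoring-info,
     *         so L1's injection into L2 completes.
     * Step 2: on the prompt exit back to L0 - now with no valid
     *         vectoring info - emulate an exit to L1 with
     *         EXIT_REASON_EXTERNAL_INTERRUPT and inject L0's interrupt.
     */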

Note that the previous patch ensures (by using KVM_REQ_IMMEDIATE_EXIT) that
although we do enter L2 first, it will exit immediately after processing its
injection, allowing us to promptly inject to L1.

Note how we test vmcs12->idt_vectoring_info_field; this isn't really the
vmcs12 value (we haven't exited to L1 yet, so vmcs12 hasn't been updated),
but rather the place we save, at the end of vmx_vcpu_run, the vmcs02 value
of this field. This was explained in patch 25 ("Correct handling of idt
vectoring info") of the original nVMX patch series.

Thanks to Dave Allan and to Federico Simoncelli for reporting this bug,
to Abel Gordon for helping me figure out the solution, and to Avi Kivity
for helping to improve it.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2011-09-22 13:51:31.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-09-22 13:51:31.0 +0300
@@ -3993,11 +3993,12 @@ static void vmx_set_nmi_mask(struct kvm_
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
-   struct vmcs12 *vmcs12;
-   if (to_vmx(vcpu)->nested.nested_run_pending)
+   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+   if (to_vmx(vcpu)->nested.nested_run_pending ||
+   (vmcs12->idt_vectoring_info_field &
+VECTORING_INFO_VALID_MASK))
return 0;
nested_vmx_vmexit(vcpu);
-   vmcs12 = get_vmcs12(vcpu);
vmcs12->vm_exit_reason = EXIT_REASON_EXTERNAL_INTERRUPT;
vmcs12->vm_exit_intr_info = 0;
/* fall through to normal code, but now in L1, not L2 */


[PATCH 1/2] nVMX: Add KVM_REQ_IMMEDIATE_EXIT

2011-09-22 Thread Nadav Har'El
This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT.
This bit requests that when next entering the guest, we should run it only
for as little as possible, and exit again.

We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1
to continue running so it can inject an event to it, we unfortunately cannot
just pretend to have run L2 for a little while - We must really launch L2,
otherwise certain one-off vmcs12 parameters (namely, L1 injection into L2)
will be lost. So the existing code runs L2 in this case.
But L2 could potentially run for a long time until it exits, and the
injection into L1 will be delayed. The new KVM_REQ_IMMEDIATE_EXIT allows us
to request that L2 will be entered, as necessary, but will exit as soon as
possible after entry.

Our implementation of this request uses smp_send_reschedule() to send a
self-IPI, with interrupts disabled. The interrupts remain disabled until the
guest is entered, and then, after the entry is complete (often including
processing an injection and jumping to the relevant handler), the physical
interrupt is noticed and causes an exit.

On recent Intel processors, we could have achieved the same goal by using
MTF instead of a self-IPI. Another technique worth considering in the future
is to use VM_EXIT_ACK_INTR_ON_EXIT and a highest-priority vector IPI - to
slightly improve performance by avoiding the useless interrupt handler
which ends up being called when smp_send_reschedule() is used.
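
The ordering is what makes this work; schematically (a simplified sketch of
the flow around VM entry, not the literal vcpu_enter_guest() code):

    local_irq_disable();                   /* interrupts off before entry  */
    if (req_immediate_exit)
        smp_send_reschedule(vcpu->cpu);    /* self-IPI pends, cannot fire  */
    /* VM entry completes normally (pending injections are processed),    */
    /* then the still-pending physical IRQ forces an immediate VM exit.   */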

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c   |   11 +++
 arch/x86/kvm/x86.c   |6 ++
 include/linux/kvm_host.h |1 +
 3 files changed, 14 insertions(+), 4 deletions(-)

--- .before/include/linux/kvm_host.h2011-09-22 13:51:31.0 +0300
+++ .after/include/linux/kvm_host.h 2011-09-22 13:51:31.0 +0300
@@ -48,6 +48,7 @@
 #define KVM_REQ_EVENT 11
 #define KVM_REQ_APF_HALT  12
 #define KVM_REQ_STEAL_UPDATE  13
+#define KVM_REQ_IMMEDIATE_EXIT 14
 
 #define KVM_USERSPACE_IRQ_SOURCE_ID0
 
--- .before/arch/x86/kvm/x86.c  2011-09-22 13:51:31.0 +0300
+++ .after/arch/x86/kvm/x86.c   2011-09-22 13:51:31.0 +0300
@@ -5610,6 +5610,7 @@ static int vcpu_enter_guest(struct kvm_v
bool nmi_pending;
bool req_int_win = !irqchip_in_kernel(vcpu->kvm) &&
vcpu->run->request_interrupt_window;
+   bool req_immediate_exit = 0;
 
if (vcpu->requests) {
if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
@@ -5647,6 +5648,8 @@ static int vcpu_enter_guest(struct kvm_v
}
if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
record_steal_time(vcpu);
+   req_immediate_exit =
+   kvm_check_request(KVM_REQ_IMMEDIATE_EXIT, vcpu);
 
}
 
@@ -5706,6 +5709,9 @@ static int vcpu_enter_guest(struct kvm_v
 
srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
 
+   if (req_immediate_exit)
+   smp_send_reschedule(vcpu->cpu);
+
kvm_guest_enter();
 
if (unlikely(vcpu->arch.switch_db_regs)) {
--- .before/arch/x86/kvm/vmx.c  2011-09-22 13:51:31.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-09-22 13:51:31.0 +0300
@@ -3858,12 +3858,15 @@ static bool nested_exit_on_intr(struct k
 static void enable_irq_window(struct kvm_vcpu *vcpu)
 {
u32 cpu_based_vm_exec_control;
-   if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
-   /* We can get here when nested_run_pending caused
-* vmx_interrupt_allowed() to return false. In this case, do
-* nothing - the interrupt will be injected later.
+   if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
+   /*
+* We get here if vmx_interrupt_allowed() said we can't
+* inject to L1 now because L2 must run. Ask L2 to exit
+* right after entry, so we can inject to L1 more promptly.
 */
+   kvm_make_request(KVM_REQ_IMMEDIATE_EXIT, vcpu);
return;
+   }
 
cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;


[PATCH 0/2] nVMX injection corrections

2011-09-22 Thread Nadav Har'El
The following two patches solve two injection-related nested VMX issues:

 1. When we must run L2 next (namely on L1's VMLAUNCH/VMRESUME), injection
into L1 was delayed for an unknown amount of time - until L2 exits.
We now force (using a self IPI) an exit immediately after entry to L2,
so that the injection into L1 happens promptly.

 2. "unexpected, valid vectoring info" warnings appeared in L1.
These are fixed by correcting the emulation of concurrent L0->L1 and
L1->L2 injections: We cannot inject into L1 until the injection into L2
has been processed.

Patch statistics:
-

 arch/x86/kvm/vmx.c   |   18 +++---
 arch/x86/kvm/x86.c   |6 ++
 include/linux/kvm_host.h |1 +
 3 files changed, 18 insertions(+), 7 deletions(-)

--
Nadav Har'El
IBM Haifa Research Lab


[PATCH] nVMX: Fix warning-causing idt-vectoring-info behavior

2011-09-21 Thread Nadav Har'El
This patch solves two outstanding nested-VMX issues:

1. "unexpected, valid vectoring info" warnings appeared in L1.
These are fixed by correcting the emulation of concurrent L0->L1 and
L1->L2 injections.

2. When we must run L2 next (e.g., on L1's VMLAUNCH/VMRESUME), injection into
L1 was delayed for an unknown amount of time - until L2 exits.
We now force (using a self IPI) an exit immediately after entry to L2,
so that the injection into L1 happens promptly.

Some more details about these issues:

When L0 wishes to inject an interrupt while L2 is running, it emulates an exit
to L1 with EXIT_REASON_EXTERNAL_INTERRUPT. This was explained in the original
nVMX patch 23, titled "Correct handling of interrupt injection".

Unfortunately, it is possible (though rare) that at this point there is valid
idt_vectoring_info in vmcs02. For example, L1 injected some interrupt to L2,
and when L2 tried to run this interrupt's handler, it got a page fault - so
it returns the original interrupt vector in idt_vectoring_info. The problem
is that if this is the case, we cannot exit to L1 with EXTERNAL_INTERRUPT
like we wished to, because the VMX spec guarantees that valid idt_vectoring_info
and EXIT_REASON_EXTERNAL_INTERRUPT can never happen together. This is not
just specified in the spec - a KVM L1 actually prints a kernel warning
"unexpected, valid vectoring info" if we violate this guarantee, and some
users noticed these warnings in L1's logs.

In order to better emulate a processor, which would never return the external
interrupt and the idt-vectoring-info together, we need to separate the two
injection steps: First, complete L1's injection into L2 (i.e., enter L2,
injecting to it the idt-vectoring-info); Second, after entry into L2 succeeds
and it exits back to L0, exit to L1 with the EXIT_REASON_EXTERNAL_INTERRUPT.
Most of this is already in the code - the only change we need is to remain
in L2 (and not exit to L1) in this case.

However, to ensure prompt injection to L1, instead of letting L2 run for
a while after entering, we can send a self-IPI, which will ensure that
L2 exits immediately after the (necessary) entry, so we can inject into L1
as soon as possible.

Note that we added this self-IPI not only in the idt-vectoring-info case
above, but in every case where we are forced to enter L2 despite wishing to
inject to L1. This includes the case when L1 just VMLAUNCH/VMRESUMEed L2.

Note how we test vmcs12->idt_vectoring_info_field; this isn't really the
vmcs12 value (we haven't exited to L1 yet, so vmcs12 hasn't been updated),
but rather the place we save, at the end of vmx_vcpu_run, the vmcs02 value
of this field. This was explained in patch 25 ("Correct handling of idt
vectoring info") of the original nVMX patch series.

Thanks to Dave Allan and to Federico Simoncelli for reporting this bug,
to Abel Gordon for helping me figure out the solution, and to Avi Kivity
for helping to improve it.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |   20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2011-09-21 13:45:59.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-09-21 13:45:59.0 +0300
@@ -3858,12 +3858,17 @@ static bool nested_exit_on_intr(struct k
 static void enable_irq_window(struct kvm_vcpu *vcpu)
 {
u32 cpu_based_vm_exec_control;
-   if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
-   /* We can get here when nested_run_pending caused
-* vmx_interrupt_allowed() to return false. In this case, do
-* nothing - the interrupt will be injected later.
+   if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
+   /*
+* We get here if vmx_interrupt_allowed() returned 0 because
+* we must enter L2 now, so we can't inject to L1 now. If we
+* just do nothing, L2 will later exit and we can inject the
+* IRQ to L1 then. But to make L2 exit more promptly, we send
+* a self-IPI, causing L2 to exit right after entry.
 */
+   smp_send_reschedule(vcpu->cpu);
return;
+   }
 
cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
@@ -3990,11 +3995,12 @@ static void vmx_set_nmi_mask(struct kvm_
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
-   struct vmcs12 *vmcs12;
-   if (to_vmx(vcpu)->nested.nested_run_pending)
+   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+   if (to_vmx(vcpu)->nested.nested_run_pending ||
+   (vmcs12->idt_vectoring_info_field &
+VECTORING_INFO_VALID_MASK))
  

Re: [PATCH 3/3] Fix TSC MSR read in nested SVM

2011-08-10 Thread Nadav Har'El
On Sun, Aug 07, 2011, Joerg Roedel wrote about "Re: [PATCH 3/3] Fix TSC MSR read in nested SVM":
> > Joerg, can you review and ack Nadav's SVM patch? TIA
> 
> Tested-by: Joerg Roedel 
> Acked-by: Joerg Roedel 
> 
> Reviewed and tested (didn't apply cleanly, but was easy to fix that up).
> Looks all good so far.

Hi guys, I'm a bit confused about how we want to proceed from here.

The patches which I sent appear to fix the original bug (as confirmed by
the two people who reported it), but Zachary warned that it would break
migration of nested SVM while L2 is running. I asked whether migration
works at all while L2 is running (without exiting to L1 first) - and
Marcelo suggested that it doesn't.

If the problem Zachary pointed to is considered serious, I proposed a second
option - to leave the code to *wrongly* return the L1 TSC MSR always, and
check (and warn) in situations where this value is wrongly returned to the
guest, but this will leave qemu to always read the TSC MSR from L1, even when
L2 is running. While I proposed this as a second option, I don't think I
can recommend it.

So what's the verdict? :-)

Thanks,
Nadav.

-- 
Nadav Har'El|   Wednesday, Aug 10 2011, 10 Av 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |It's fortunate I have bad luck - without
http://nadav.harel.org.il   |it I would have no luck at all!


Re: [ANNOUNCE] qemu-kvm-0.15.0

2011-08-10 Thread Nadav Har'El
On Wed, Aug 10, 2011, Thomas Mittelstaedt wrote about "Re: [ANNOUNCE] qemu-kvm-0.15.0":
> Grabbed the version, created a disk image of 5G and tried to load and
> iso via 
> /usr/local/bin/qemu  -hda ~/virtualdisk.img -cdrom
> ~/Downloads/oneiric-desktop-i386-alpha3-lubuntu.iso -boot d -m 1024
> 
> and it is very, very slow. Tried the same with the installed 0.11

Sounds like maybe you are using Qemu without KVM?
Is running qemu with the "--enable-kvm" option better?

"--enable-kvm" used to be the default in qemu-kvm, but maybe this changed
(I haven't been following the discussion in this area).

-- 
Nadav Har'El|   Wednesday, Aug 10 2011, 10 Av 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Always keep your words soft and sweet,
http://nadav.harel.org.il   |just in case you have to eat them.


Re: [RFC] postcopy livemigration proposal

2011-08-08 Thread Nadav Har'El
> >* What's is postcopy livemigration
> >It is is yet another live migration mechanism for Qemu/KVM, which
> >implements the migration technique known as "postcopy" or "lazy"
> >migration. Just after the "migrate" command is invoked, the execution
> >host of a VM is instantaneously switched to a destination host.

Sounds like a cool idea.

> >The benefit is, total migration time is shorter because it transfer
> >a page only once. On the other hand precopy may repeat sending same pages
> >again and again because they can be dirtied.
> >The switching time from the source to the destination is several
> >hunderds mili seconds so that it enables quick load balancing.
> >For details, please refer to the papers.

While these are the obvious benefits, the possible downside (that, as
always, depends on the workload) is the amount of time that the guest
workload runs more slowly than usual, waiting for pages it needs to
continue. There is a whole spectrum between the guest pausing completely
(which would solve all the problems of migration, but is often considered
unacceptible) and running at full-speed. Is it acceptable that the guest
runs at 90% speed during the migration? 50%? 10%?
I guess we have nothing to lose from having both options, and choosing
the most appropriate technique for each guest!

> That's terrific  (nice video also)!
> Orit and myself had the exact same idea too (now we can't patent it..).

I think a new implementation is not the only reason why you cannot patent
this idea :-) Demand-paged migration has actually been discussed (and done)
for nearly a quarter of a century (!) in the area of *process* migration.

The first use I'm aware of was in CMU's Accent in 1987 - see [1].
Another paper, [2], written in 1991, discusses how process migration is done
in UCB's Sprite operating system, and evaluates the various alternatives
common at the time (20 years ago), including what it calls "lazy copying",
which is more-or-less the same thing as "post copy". Mosix (a project which, in
some sense, is still alive today) also used some sort of cross between pre-copying
(of dirty pages) and copying on-demand of clean pages (from their backing
store on the source machine).


References
[1] "Attacking the Process Migration Bottleneck"
 http://www.nd.edu/~dthain/courses/cse598z/fall2004/papers/accent.pdf
[2]  "Transparent Process Migration: Design Alternatives and the Sprite
 Implementation"
 http://nd.edu/~dthain/courses/cse598z/fall2004/papers/sprite-migration.pdf

> Advantages:
> - Virtual machines are using more and more memory resources ,
> for a virtual machine with very large working set doing live
> migration with reasonable down time is impossible today.

If a guest's working set actually covers most of its allocated
memory, it will basically be unable to do any significant amount of work
on the destination VM until this large working set is transfered to the
destination. So in this scenario, "post copying" doesn't give any
significant advantages over plain-old "pause guest and send it to the
destination". Or am I missing something?

> Disadvantageous:
> - During the live migration the guest will run slower than in
> today's live migration. We need to remember that even today
> guests suffer from performance penalty on the source during the
> COW stage (memory copy).

I wonder if something like asynchronous page faults can help somewhat with
multi-process guest workloads (and modified (PV) guest OS). 

> - Failure of the source or destination or the network will cause
> us to lose the running virtual machine. Those failures are very
> rare.

How is this different from a VM running on a single machine that fails?
Just that the small probability of failure (roughly) doubles for the
relatively-short duration of the transfer?


-- 
Nadav Har'El|   Monday, Aug  8 2011, 8 Av 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |If glory comes after death, I'm not in a
http://nadav.harel.org.il   |hurry. (Latin proverb)

